A Unified CV, OCR & NLP Model Pipeline for Document Understanding at DocuSign
DocuSign is the world’s largest e-signature provider, serving more than 500,000 customers and hundreds of millions of users worldwide.
This talk summarizes lessons learned from building a deep-learning model pipeline that automatically understands where fillable fields are in each document – such as signature boxes, checkboxes, initial fields, and text fill-in fields.
A major challenge is handling the immense variety of documents that DocuSign processes, covering just about every kind of business transaction: loans, insurance, healthcare, education, government, real estate, manufacturing, telecom, retail, employment, and enterprise-driven legal affairs.
This task requires a modeling pipeline that unifies three usually separate problems:
- Computer Vision – since many documents are scanned or uploaded as photographs, the system must handle image correction / pre-processing, page segmentation, and object detection.
- OCR – text must be extracted from each document and correlated with the detected fields; the text surrounding a field is what informs the “type” of tag it receives
- NLP – the OCR text, together with the field coordinates from Computer Vision, is used to infer the “type” of a given field, e.g. signature or text (a minimal sketch of how the three stages fit together follows this list)
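As a concrete illustration, here is a minimal, hypothetical sketch of how the three stages could be wired together in Python. The function names, the FieldCandidate structure, and the stage boundaries are illustrative placeholders, not DocuSign’s actual implementation:

```python
# Hypothetical three-stage pipeline sketch: detection (CV) -> OCR -> field-type
# classification (NLP). All names below are illustrative placeholders.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in page coords

@dataclass
class FieldCandidate:
    box: Box            # where the field sits on the page (from the CV stage)
    context_text: str   # OCR'd text surrounding the candidate field
    field_type: str     # inferred by the NLP stage, e.g. "signature" or "text"

def detect_fields(page_image) -> List[Box]:
    """CV stage: run an object detector over the page and return candidate boxes."""
    raise NotImplementedError  # placeholder for the detection model

def extract_context(page_image, box: Box) -> str:
    """OCR stage: read the text in and around a detected box."""
    raise NotImplementedError  # placeholder for the OCR / line-detection models

def classify_field(context_text: str, box: Box) -> str:
    """NLP stage: infer the field type from the surrounding text plus geometry."""
    raise NotImplementedError  # placeholder for the text classifier

def run_pipeline(page_image) -> List[FieldCandidate]:
    """End-to-end pass over a single page: detect, read context, classify."""
    candidates = []
    for box in detect_fields(page_image):
        text = extract_context(page_image, box)
        candidates.append(FieldCandidate(box, text, classify_field(text, box)))
    return candidates
```

The point of the sketch is that each detected box carries its surrounding text forward, so the NLP stage can condition on both geometry and language when assigning a field type.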
We’ll cover experiments and learnings from:
- Computer Vision – applying SSD, ResNet, Inception, and Faster R-CNN to field detection.
- Optical Character Recognition – classifying fields requires “reading” the text around them, so OCR, line-detection, and page-segmentation models must be trained.
- Natural Language Processing – we’ll discuss results and lessons learned from annotating, feature engineering, training, and hyper-parameter tuning of the NLP models that classify field types within documents (a sketch of such a classifier follows this list).
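For the NLP stage, a minimal Keras sketch of a field-type classifier over OCR context text might look like the following; the vocabulary size, layer widths, and label set are illustrative assumptions, not the tuned models discussed in the talk:

```python
# Minimal, hypothetical Keras text classifier for field types. Hyper-parameters
# and the label set are illustrative only.
import tensorflow as tf

FIELD_TYPES = ["signature", "initials", "checkbox", "text"]  # illustrative labels

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000, output_sequence_length=64)
# vectorizer.adapt(train_context_texts)  # fit the vocabulary on OCR snippets

model = tf.keras.Sequential([
    vectorizer,                                   # raw OCR text -> token ids
    tf.keras.layers.Embedding(20_000, 64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),     # pool tokens into one vector
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(len(FIELD_TYPES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In practice the models described in the talk also use field coordinates from Computer Vision as features; the sketch keeps only the text path to stay short.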
Critically, the unified pipeline must satisfy these four requirements:
- State-of-the-art accuracy of the entire pipeline – in contrast to local optimization of each model
- Interpretability – visually showing what the model inferred on each document, to enable debugging and manual corrections (see the visualization sketch after this list)
- Scalability – the ability to scale training and inference to many millions of documents per week
- Compliance – given the high sensitivity of many business documents that DocuSign handles, the entire training & inference infrastructure must be locked down and run on bare metal, within custom data centers. The deep learning libraries and models utilized will all be open source and based on the TensorFlow ecosystem.
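To make the interpretability requirement concrete, here is a minimal, hypothetical visualization sketch using Pillow; it assumes the run_pipeline and FieldCandidate placeholders from the earlier sketch and is not DocuSign’s actual review tooling:

```python
# Hypothetical interpretability helper: overlay predicted field boxes and inferred
# types on the page image so reviewers can debug and manually correct the output.
from PIL import Image, ImageDraw

def render_predictions(page_image: Image.Image, candidates) -> Image.Image:
    """Return a copy of the page with each predicted field drawn and labeled."""
    annotated = page_image.convert("RGB")
    draw = ImageDraw.Draw(annotated)
    for cand in candidates:  # FieldCandidate objects from the pipeline sketch
        draw.rectangle(cand.box, outline="red", width=3)
        x_min, y_min = cand.box[0], cand.box[1]
        draw.text((x_min, max(y_min - 12, 0)), cand.field_type, fill="red")
    return annotated

# Illustrative usage (paths are placeholders):
# page = Image.open("page_001.png")
# render_predictions(page, run_pipeline(page)).save("page_001_annotated.png")
```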
This talk is intended for data scientists looking for best practices and lessons learned on building real-world, large-scale document understanding systems based on these technologies.
Patrick is a scientist and engineer using machine learning to tackle environmental and healthcare challenges. He is currently a member of the Agreement Intelligence team at DocuSign. He holds a Ph.D. in neuroscience from the Center for Neuroscience at the University of Pittsburgh and the Center for the Neural Basis of Cognition at Carnegie Mellon University.
His dissertation research focused on applying machine learning to decode neural data. Prior to that, he earned an MS in Logic from Carnegie Mellon University focusing on recurrent neural networks and causal inference.
Michael Chertushkin has experience building Computer Vision and Natural Language Processing systems. He graduated from Ural Federal State University in 2011 with a Bachelor of Science in CS and has worked as a machine learning engineer since then.
In 2020 he graduated from Boston University with a Master of Science in CS. His fields of interest are computational complexity, robust statistics, and machine learning.