Accepted Papers/Posters

Papers

(in alphabetical order; authors marked by * are the corresponding authors)

 1. CHARTER: heatmap-based multi-type chart data extraction (Session 1)
 2. Data-Efficient Information Extraction from Form-Like Documents (Session 2)
 3. Detection Masking for Improved OCR on Noisy Documents (Session 1)
 4. Efficient Document Image Classification Using Region-Based Graph Neural Network (Session 1)
 5. Generating and evaluating simulated medical notes: Getting a Natural Language Generation model to give you what you want (Session 3)
 6. HYCEDIS: HYbrid Confidence Engine for Deep Document Intelligence System (Best Paper)
 7. Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents (Session 1)
 8. Position Masking for Improved Layout-Aware Document Understanding (Session 2)
 9. SpecToSVA: Circuit Specification Document to SystemVerilog Assertion Translation (Session 3)
10. Text Analysis via Binomial Tails (Session 2)

Details

  • CHARTER: heatmap-based multi-type chart data extraction
    • Authors: Joseph Shtok (IBM Research)*; Sivan Harary (IBM Research); Ophir Azulai (IBM Research); Adi Raz Goldfarb (IBM Research); Assaf Arbelle (IBM Research AI); Leonid Karlinsky (IBM Research)
    • Abstract: The digital conversion of information stored in documents is a great source of knowledge. In contrast to document text, the conversion of embedded document graphics, such as charts and plots, has been much less explored. We present a method and a system for end-to-end conversion of document charts into a machine-readable tabular data format, which can be easily stored and analyzed in the digital domain. Our approach extracts and analyzes charts along with their graphical elements and supporting structures such as legends, axes, titles, and captions. Our detection system is based on neural networks trained solely on synthetic data, eliminating the limiting factor of data collection. As opposed to previous methods, which detect graphical elements using bounding boxes, our networks feature auxiliary domain-specific heatmap predictions, enabling the precise detection of pie charts, line plots, and scatter plots, which do not fit the rectangular bounding-box presumption. Qualitative and quantitative results show high robustness and precision, improving upon previous works on popular benchmarks.
  • Data-Efficient Information Extraction from Form-Like Documents
    • Authors: Beliz Gunel (Stanford University)*; Navneet Potti (Google); Sandeep Tata (Google, USA); James B Wendt (Google); Marc Najork (Google); Jing Xie (Google)
    • Abstract: Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency, and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (50), a straightforward transfer learning approach from a considerably different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and learning good representations is critical to accomplishing this.
  • Detection Masking for Improved OCR on Noisy Documents
    • Authors: Daniel Rotman (IBM Research)*; Ophir Azulai (IBM Research); Inbar Shapira (IBM); Yevgeny Burshtein (IBM Research); Udi Barzelay (IBM)
    • Abstract: Optical Character Recognition (OCR), the task of extracting textual information from scanned documents, is a vital and broadly used technology for digitizing and indexing physical documents. Existing technologies perform well on clean documents, but when a document is visually degraded, or when there are non-textual elements, OCR quality can be greatly impacted, specifically due to erroneous detections. In this paper we present an improved detection network with a masking system to improve the quality of OCR performed on documents. By filtering non-textual elements from the image, we can utilize document-level OCR to incorporate contextual information and improve OCR results. We perform a unified evaluation on a publicly available dataset, demonstrating the usefulness and broad applicability of our method. Additionally, we present and make publicly available our synthetic dataset with a unique hard-negative component specifically tuned to improve detection results, and evaluate the benefits that can be gained from its usage.
  • Efficient Document Image Classification Using Region-Based Graph Neural Network
    • Authors: Jaya Krishna Mandivarapu (Georgia State University)*; Eric Bunch (American Family Insurance); Glenn M Fung (American Family Insurance); Qian You (American Family Insurance)
    • Abstract: Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and in graph neural networks have lent document image classification many tools. However, using large pre-trained models usually requires substantial computing resources, which can defeat the cost-saving advantages of automatic document image classification. In this paper we propose an efficient document image classification framework that uses graph convolutional neural networks and incorporates the textual, visual, and layout information of the document.
  • Generating and evaluating simulated medical notes: Getting a Natural Language Generation model to give you what you want
    • Authors: Robert Horton (Microsoft Corporation)*; Maryam T Tavakoli Hosseinabadi (Microsoft Corporation); Alexandre Vilcek (Microsoft); Wolfgang M Pauli (Microsoft); Mario Inchiosa (Microsoft)
    • Abstract: Strong restrictions on sharing healthcare data pose a significant barrier to developing and applying machine learning (ML) technologies in this field. Significant effort has been invested in generating “realistic but not real” Electronic Medical Record (EMR) data that can be used to facilitate many aspects of the digital transformation effort in healthcare [17]. Here we demonstrate a transformer-based Natural Language Generation approach to supplement the structured EMR data produced by the open-source Synthea™ simulation system with narrative text fields (‘History of present illness’) that are semantically consistent with the structured attributes for a given simulated patient encounter. One central hyperparameter for text generation is top_p, which controls the trade-off between the diversity of the generated text and its fluency and coherency. We evaluate the generated text via BERT-based text classification, regular expression matching, domain-specific entity recognition, and semantic embedding for repetition detection, and study the impact of top_p on these metrics. Our observations show that increasing top_p improves some quality measures while worsening others. Input from domain experts will be required to find an optimal top_p for a specific task. This is preliminary work toward a larger goal of producing simulated text data suitable for developing and demonstrating various NLP-based ML approaches in EMR applications.
  • (BEST PAPER) HYCEDIS: HYbrid Confidence Engine for Deep Document Intelligence System
    • Authors: Sinh Nguyen (Cinnamon AI Inc)*; Bach Tran (Cinnamon AI Inc); Tuan Anh Nguyen Dang (Cinnamon); Duc Nguyen (Cinnamon AI Inc); Hung Le (Deakin University)
    • Abstract: Measuring the confidence of AI models is critical for safely deploying AI in real-world industrial systems. One important application of confidence measurement is information extraction from scanned documents. However, there exists no solution that provides a reliable confidence score for current state-of-the-art deep-learning-based information extractors. In this paper, we propose a complete and novel architecture to measure the confidence of current deep learning models on the document information extraction task. Our architecture consists of a Multi-modal Conformal Predictor and a Variational Cluster-oriented Anomaly Detector, trained to faithfully estimate confidence on the host model's outputs without requiring any modification of the host model. We evaluate our architecture on real-world datasets, not only outperforming competing confidence estimators by a huge margin but also demonstrating generalization ability to out-of-distribution data.
  • Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents
    • Authors: Amit D Gupte (Microsoft)*; Alexey Romanov (Microsoft); Sahitya Mantravadi (Microsoft); Dalitso Banda (Microsoft); Jianjie Liu (Microsoft); Raza Khan (Microsoft); Lakshmanan Ramu Meenal (Microsoft); Benjamin Han (Microsoft); Soundararajan Srinivasan (Microsoft)
    • Abstract: Document digitization is essential for the digital transformation of our societies, yet a crucial step in the process, Optical Character Recognition (OCR), is still not perfect. Even commercial OCR systems can produce questionable output depending on the fidelity of the scanned documents. In this paper we demonstrate an effective framework for mitigating OCR errors for any downstream NLP task, using Named Entity Recognition (NER) as an example. We first address the data scarcity problem for model training by constructing a document synthesis pipeline, generating realistic but degraded data with NER labels. We measure the NER accuracy drop at various degradation levels and show that a text restoration model, trained on the degraded data, significantly closes the NER accuracy gaps caused by OCR errors, including on an out-of-domain dataset. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
  • Position Masking for Improved Layout-Aware Document Understanding
    • Authors: Anik Saha (Rensselaer Polytechnic Institute)*; Catherine Finegan-Dollak (IBM TJ Watson Research Center); Ashish Verma (IBM Research)
    • Abstract: Natural language processing for document scans and PDFs has the potential to enormously improve the efficiency of business processes. Layout-aware word embeddings such as LayoutLM have shown promise for classification of and information extraction from such documents. This paper proposes a new pre-training task called position masking that can improve performance of layout-aware word embeddings that incorporate 2-D position embeddings. We compare models pre-trained with only language masking against models pre-trained with both language masking and position masking, and we find that position masking improves performance by over 5% on a form understanding task.
  • SpecToSVA: Circuit Specification Document to SystemVerilog Assertion Translation
    • Authors: Saurav Nanda (Synopsys Inc)*; Ganapathy Parthasarathy (Synopsys Inc); Parivesh Choudhary (Synopsys Inc); Pawan Patil (Synopsys Inc)
    • Abstract: Assertion-based verification is a technique to ensure that a microelectronic circuit design conforms to its specification, and it helps detect errors early in the design process. These assertions are complex and difficult to write since verification engineers must manually translate a natural-language specification into a formal assertion language, such as SystemVerilog Assertions (SVA). SVA is a regular language that can be compiled and automatically checked. This incurs significant costs in the hardware design and verification cycle in terms of productivity and time, sometimes as much as 50% of the hardware design costs. We propose a machine-learning approach that alleviates this problem by automatically converting content in English-language specifications to SystemVerilog Assertions. Our experimental results demonstrate an average precision of 64% on a data set created from proprietary Integrated Circuit (IC) specification documents.
  • Text Analysis via Binomial Tails
    • Authors: Omid Madani (Cisco Tetration Analytics)*
    • Abstract: We show that several tasks in text processing, in particular co-occurrence analysis, term weighting in documents, and document similarity, can be modeled by the binomial tail. The tail yields easy-to-interpret significance scores, and it can make finding a good cut-off threshold simpler, or improve ranking tasks and similarity spaces. Because the tail can be efficiently approximated, it is a basic tool that should find applications in text analysis.
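
As a loose illustration of the binomial-tail idea described in the last abstract above, here is a minimal sketch of the general technique for co-occurrence analysis. This is our own illustration, not the paper's implementation: the independence model, the scoring function, and its parameter names are all assumptions.

```python
from math import comb, log

def binomial_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly as a sum of pmf terms."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def cooccurrence_score(n_docs: int, n_a: int, n_b: int, n_ab: int) -> float:
    """Significance of terms a and b co-occurring in n_ab of n_docs documents.

    Under an independence assumption, a co-occurrence happens in a document with
    probability (n_a / n_docs) * (n_b / n_docs); the score is the negative log of
    the binomial tail probability, so higher means more surprising.
    """
    p_indep = (n_a / n_docs) * (n_b / n_docs)
    return -log(binomial_tail(n_docs, p_indep, n_ab))
```

For example, if two terms each appear in 100 and 80 of 1,000 documents, about 8 co-occurrences are expected by chance; observing 40 yields a far larger score than observing 10, and the tail probability itself is directly interpretable as a significance level for thresholding.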

Posters

(in alphabetical order; authors marked by * are the corresponding authors)

1. Few-Shot Learning for Structured Information Extraction From Form-Like Documents Using a Diff Algorithm (Session 2)
2. Medical Report Generation with Multi-Attention for Abnormal Keyword Description and History Report (Session 3)
3. Multi-Stage Framework to Boost Optical Character Recognition Performance on Low Quality Document Images (Session 1)
4. The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues (Session 3)
5. Towards Semantic Search for Community Question Answering for Mortgage Officers (Session 3)

Details

  • Few-Shot Learning for Structured Information Extraction From Form-Like Documents Using a Diff Algorithm
    • Authors: Nerya Or (Google)*; Shlomo Urbach (Google)
    • Abstract: We present a novel approach for extracting structured data from a collection of similarly-structured scanned documents (e.g., multiple instances of the same form, or printouts from a database). Documents are not required to have a fixed layout; the position of some elements may shift vertically, and groups of fields can appear repeatedly. Our approach is robust against OCR errors and other noise. Our training stage requires only a handful of sample documents, one of which is annotated for fields of interest. Using this training data, we are able to extract data from other similar documents. Extraction is performed using a diff-like algorithm over the boilerplate text tokens of the documents, which is leveraged to find areas in the input documents that correspond to areas in the annotated document.
  • Medical Report Generation with Multi-Attention for Abnormal Keyword Description and History Report
    • Authors: HaiHan Yao (Donghua University)*; Mei Wang (Donghua University); YanXia Qin (Donghua University)
    • Abstract: This paper proposes an automatic medical report generation framework based on both the current medical image and a previous history report. A keyword list describing the abnormal or special observations from the medical image is used to represent the image. In the proposed method, sentence-level structure information of the history report is extracted with sentence-level embeddings. Then we construct two attention components: one is used to learn important semantic and sequential information from the keyword list, and the other is used to learn the correlation between the current keyword list and the history report. Finally, all of the above information is combined to help generate the current report. We conduct experiments on a practical ultrasound text dataset collected from a reputable hospital in Shanghai, China. The experimental results show that the reports generated by the proposed method are more accurate and fluent compared with a strong baseline method.
  • Multi-Stage Framework to Boost Optical Character Recognition Performance on Low Quality Document Images
    • Authors: Nitin Gupta (IBM Research); Shashank Mujumdar (IBM Research, India)*; Abhinav Jain; Doug Burdick (IBM Research); Hima Patel (IBM Research)
    • Abstract: In order to extract text from good-quality document images, the state-of-the-art (SOA) Tesseract Engine (TE) performs: (i) image processing, (ii) page segmentation to extract text lines, and (iii) Optical Character Recognition (OCR) on the text lines to extract text tokens. However, TE fails miserably on complex document images with low resolution, colored text regions, tables, charts, etc., which presents the need to optimize TE performance. In this paper, we propose a novel multi-stage pipeline that addresses the shortcomings of the TE and boosts OCR performance for challenging document images. Specifically, we propose approaches (i) for page segmentation to extract text lines, (ii) to detect and binarize colored text regions, and (iii) to detect and correct the image quality. We rigorously test the pipeline on 5 datasets and show the improvement in OCR performance against the standard TE and SOA baselines.
  • The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues
    • Authors: Allison Hegel (Lexion)*; Marina Shah (Lexion); Genevieve Peaslee (Lexion); Brendan Roof (Lexion); Emad Elwany (Lexion)
    • Abstract: Large, pre-trained transformer models like BERT have achieved state-of-the-art results on document understanding tasks, but most implementations can only consider 512 tokens at a time. For many real-world applications, documents can be much longer, and the segmentation strategies typically used on longer documents miss out on document structure and contextual information, hurting their results on downstream tasks. In our work on legal agreements, we find that visual cues such as layout, style, and placement of text in a document are strong features that are crucial to achieving an acceptable level of accuracy on long documents. We measure the impact of incorporating such visual cues, obtained via computer vision methods, on the accuracy of document understanding tasks including document segmentation, entity extraction, and attribute classification. Our method of segmenting documents based on structural metadata outperforms existing methods on four long-document understanding tasks as measured on the Contract Understanding Atticus Dataset.
  • Towards Semantic Search for Community Question Answering for Mortgage Officers
    • Authors: Amir Reza Rahmani (Zillow Group)*; Linwei Li (Zillow Group); Shourabh Rawat (Zillow Group); Brian Vanover (Zillow Group); Colin Bertrand (Zillow Group)
    • Abstract: Community Question Answering (CQA) has gained increasing popularity in many domains. Mortgage is a complex and dynamic industry, and a flexible and efficient CQA platform can potentially enhance the quality of service for mortgage officers significantly. We have built a dynamic CQA platform with a state-of-the-art semantic search engine based on recent Natural Language Processing (NLP) techniques to dynamically and collectively capture and transfer the maturity and tribal knowledge of the more experienced workforce to less experienced ones. The search engine allows for both keyword and natural language queries and is based on a fine-tuned, domain-adapted Sentence-BERT encoder linearly composed with a TF-IDF vectorizer, and reciprocal-rank fused with a BM25 vectorizer. Domain adaptation and fine-tuning are based on publicly available mortgage corpora. Evaluation is performed on an internally annotated dataset using standard information retrieval metrics such as normalized discounted cumulative gain (nDCG), precision/recall at n, mean reciprocal rank, and mean average precision (MAP). The results indicate that our hybrid, fine-tuned, domain-adapted search engine is a more effective approach in responding to the information needs of our mortgage officers compared to traditional search techniques. We aim to publish the internally-annotated evaluation and training datasets in the near future.
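
To make the diff-based extraction idea of the first poster above concrete, here is a toy sketch in the same spirit. This is our own illustration using Python's difflib, not the authors' system; the template, the field markers, and the whitespace tokenization are all simplifying assumptions.

```python
import difflib

# Annotated template: boilerplate tokens plus field markers (assumed names).
template = "Invoice Number : <INVOICE_NO> Date : <DATE> Total : <TOTAL>".split()
fields = {"<INVOICE_NO>", "<DATE>", "<TOTAL>"}

def extract(document: str) -> dict:
    """Align a new document against the template with a diff; the tokens the
    document substitutes in place of a field marker become that field's value."""
    doc_tokens = document.split()
    sm = difflib.SequenceMatcher(a=template, b=doc_tokens, autojunk=False)
    values = {}
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "replace":
            for marker in template[i1:i2]:
                if marker in fields:
                    # all substituted document tokens in this block form the value
                    values[marker] = " ".join(doc_tokens[j1:j2])
    return values
```

Calling `extract("Invoice Number : 12345 Date : 2021-08-14 Total : $99.00")` recovers each field from the substituted tokens; the shared boilerplate ("Invoice Number :", "Date :", "Total :") anchors the alignment even when a value spans several tokens.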