Accepted Papers
Papers
(authors marked by * are the corresponding authors)
- Title: DeeperDive: The Unreasonable Effectiveness of Weak Supervision in Document Understanding
- Authors: Emad Elwany (Lexion)*; Allison Hegel (Lexion)
- Abstract: Weak supervision has been applied to various Natural Language Understanding tasks in recent years. Due to technical challenges with scaling weak supervision to work on long-form documents, spanning up to hundreds of pages, applications in the document understanding space have been limited. At Lexion, we built a weak supervision-based system tailored for long-form (10-200 pages long) PDF documents. We use this platform for building dozens of language understanding models and have applied it successfully to various domains, from commercial agreements to corporate formation documents.
We demonstrate the effectiveness of weak supervision in a situation with limited time, workforce, and training data. We built 8 high quality machine learning models in the span of one week, with the help of a small team of just 3 annotators working with a dataset of under 300 documents. We share some details about our overall architecture, how we utilize weak supervision, and what results we are able to achieve.
- Title: BusiNet - a Light and Fast Text Detection Network for Business Documents
- Authors: Oshri P Naparstek (IBM)*; Ophir Azulai (IBM Research); Daniel Rotman (IBM Research); Yavgeny Burshtein (IBM Research); Peter W J Staar (IBM Research); Udi Barzelay (IBM)
- Abstract: For digitizing or indexing physical documents, optical character recognition (OCR), the process of extracting textual information from scanned documents, is a vital technology.
When a document is visually damaged or contains non-textual elements, existing technologies can yield poor results, as erroneous detection results can greatly affect the quality of OCR.
In this paper we present a detection network dubbed BusiNet aimed at OCR of business documents.
Business documents often include sensitive information and as such they cannot be uploaded to a cloud service for OCR.
BusiNet was designed to be fast and light so that it can run locally, avoiding privacy issues.
Furthermore, BusiNet is built to handle scanned document corruption and noise using a specialized synthetic dataset.
The model is made robust to unseen noise by employing adversarial training strategies.
We perform an evaluation on publicly available datasets demonstrating the usefulness and broad applicability of our model.
- Title: Graph Attention Networks for Efficient Text Line Detection on Receipt-Layout Documents
- Authors: David Montero Martín (NielsenIQ)*; Mukul Kumar (NielsenIQ); Javier Yebes (NielsenIQ)
- Abstract: Text line detection from OCR detections is an essential step in many information-extraction processes, particularly when working with unstructured documents such as purchase receipts, where utilizing this information is crucial for matching key-value pairs that are on the same line. Existing models, however, are limited to structured documents and do not generalize well to unstructured ones. To address this issue, we have created a GNN-based line detection model that is optimized for receipt-layout documents. Experiments show that the proposed method outperforms other approaches in accuracy, processing time and resource consumption.
- Title: Document Summarization with Text Segmentation
- Authors: Lesly Miculicich (Microsoft)*; Benjamin Han (Microsoft)
- Abstract: In this paper, we exploit the innate document segment structure to improve the extractive summarization task. We build two text segmentation models and find the optimal strategy for introducing their output predictions into an extractive summarization model. Experimental results on a corpus of scientific articles show that extractive summarization benefits from using a highly accurate segmentation method. In particular, most of the improvement is in documents where the most relevant information is not at the beginning; thus, we conclude that segmentation helps reduce the lead bias problem.
- Title: FlowchartQA: The First Large Scale Benchmark for Reasoning Over Flowcharts
- Authors: Simon Tannert (Institute for Natural Language Processing, University of Stuttgart)*; Marcelo G Feighelstein (Data Science Research Center, University of Haifa); Jasmina Bogojeska (IBM Research); Joseph Shtok (IBM Research); Assaf Arbelle (IBM Research AI); Peter W J Staar (IBM Research); Anika Schumann (IBM Research); Jonas Kuhn (University of Stuttgart); Leonid Karlinsky (IBM Research)
- Abstract: Flowcharts are a very popular type of diagram in many kinds of documents, conveying large amounts of useful information and knowledge (e.g. on processes, workflows, or causality).
In this paper, we propose FlowchartQA – a novel, and first of its kind, large-scale benchmark with close to 1M flowchart images and 6M question-answer pairs.
The questions in FlowchartQA cover different aspects of geometric, topological, and semantic information contained in the charts, and are carefully balanced to reduce biases.
We accompany our proposed benchmark with a comprehensive set of baselines based on text-only, image, and graph approaches, along with a qualitative analysis, in order to establish a good basis for future work.
- Title: Revisiting How to Focus: Triplet Attention for Joint Entity and Relation Extraction
- Authors: Debraj D Basu (Adobe)*; Meghanath MY (Adobe); Deepak Pai (Adobe)
- Abstract: We propose a method for extracting entities and relations from natural language. When put together, this results in fact-triplets of the form subject, predicate and object as knowledge units. Our method benefits from memory-efficient triplet attention in addition to conventional self-attention as a feature refinement mechanism. We do this by explicitly facilitating contextual cues for every candidate entity span and subject and object pairs, which are allowed to attend to each token of the sentence besides attention between any two tokens. In conjunction with sharing information between the two tasks and the benefits of transfer learning, our method exhibits competitive performance in strict evaluation, compared to the previous state-of-the-art for different public datasets, with improvements up to 2.6% and 3.4% in micro and macro-F1 for entity recognition, as well as 6.9% and 5.9% in micro and macro-F1 respectively for relation extraction.
- Title: Domain Agnostic Few-Shot Learning For Document Intelligence
- Authors: Jaya Krishna Mandivarapu (Georgia State University)*; Eric Bunch (American Family Insurance); Glenn M Fung (American Family Insurance)
- Abstract: Few-shot learning aims to generalize to novel classes with only a few samples with class labels. Research in few-shot learning has borrowed techniques from transfer learning, metric learning, meta-learning, and Bayesian methods. These methods also aim to train models from limited training samples, and while encouraging performance has been achieved, they often fail to generalize to novel domains. Many of the existing meta-learning methods rely on training data for which the base classes are sampled from the same domain as the novel classes used for meta-testing. However, in many industry applications, such as document classification, collecting large samples of data for meta-learning is infeasible or impossible. While research in the field of cross-domain few-shot learning exists, it is mostly limited to computer vision.
- Title: Scientific Comparative Argument Generation
- Authors: Mengxia Yu (University of Notre Dame)*; Wenhao Yu (University of Notre Dame); Meng Jiang (University of Notre Dame)
- Abstract: In this work, we introduce a new yet important NLP task in the scientific domain: generating comparative arguments that present an invention’s technical novelty by comparing it to one or multiple prior works. Any success on this task is a fundamental step towards the goal of enabling machines to think and write like scientists. We therefore create and release a dataset of good quality and size for benchmarking. We report and analyze the results of advanced text generation models, which uncover the unique challenge of this task compared to traditional argument generation tasks: there is a significant topic gap between inputs and output when the output compares the inputs instead of summarizing them. We study the impact of the topics on the generation performance and investigate the possibility of learning, predicting, and utilizing the topics. Finally, this work discusses promising directions to achieve the goal.
- Title: Autonomous Character Score Fusion for Word Detection in Low-contrast Camera-captured Handwriting Text
- Authors: Sidra Hanif (Temple University)*; Longin Jan Latecki (Temple University)
- Abstract: Word detection is considered an object detection problem. The handwritten text varies in spacing between characters, making word detection harder than object detection. Moreover, characters are more easily identifiable than words in the handwritten text for low-contrast camera-captured images.
Nevertheless, considering only characters and ignoring a word’s entirety does not cope with the overlapping words common in handwritten text. Therefore, we propose the fusion of character estimation with word detection in this work. Since character-level annotations are not available for handwritten text, we estimate the character region scores in a weakly supervised manner. We then fuse character region scores and handwriting images to detect words in camera-captured handwriting images. Fusion of the character region score with the image achieves a higher recall of 88.4 (+1.2) and outperforms the state-of-the-art object detector with 92.2 (+0.4) mAP@0.5 and 64.0 (+0.4) mAP@0.5:0.95.