End-to-end pipeline for document recognition

Document AI

Tasks

  1. Document Layout Analysis: object detection task for document images
    • Definition: automatic analysis, recognition, and understanding of images, text, table/figure/chart content, and their positional relationships within the document layout
    • Typical detectors: Faster R-CNN, Mask R-CNN, SSD, YOLO (see the detection sketch after this list)
    • Basic units: e.g. headings, paragraphs, tables, figures, charts
    • Large-scale weakly supervised document layout analysis datasets: PubLayNet, PubTabNet, TableBank, DocBank
  2. Visual Information Extraction

    • Definition: extract entities and their relationships from a large amount of unstructured content in a document
    • Visually rich documents: text data whose semantic structure is determined not only by the textual content but also by visual elements such as layout, typesetting formats, and table/figure structures
  3. Document VQA

    • Similar goal to retrieval-augmented generation (RAG): answer natural-language questions grounded in the document's content
  4. Document Image Classification
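
To make the layout-analysis task concrete, below is a minimal detection sketch using torchvision's Faster R-CNN. The default weights are COCO-pretrained, so a checkpoint fine-tuned on a layout dataset such as PubLayNet is assumed; the file name page.png and the score threshold are placeholders.

import torch
import torchvision
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Faster R-CNN detector; swap in weights fine-tuned on PubLayNet (assumed)
# to get document-layout classes instead of COCO classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("page.png").convert("RGB"))
with torch.no_grad():
    pred = model([image])[0]  # dict with "boxes", "labels", "scores"

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:  # keep confident detections only
        print(label.item(), [round(v) for v in box.tolist()])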

Visual Information Extraction (VIE)

Deep-learning-based VIE methods can be categorized as:

  1. Grid-based methods
  2. GNN-based methods
  3. Large scale pre-trained models
  4. End-to-end methods
  5. Few-shot methods
  6. LLM-based methods

Grid-based methods

  • Treat the document as a 2D matrix, with text embeddings filling the pixels inside each text bounding box (sketched after the model list below)
  • Computationally efficient, but may not fully exploit the features of small text regions
  • Require an external OCR model

Open-source models:

  • Chargrid: Katti et al. Chargrid: Towards Understanding 2D Documents. EMNLP, 2018.
  • ViBERTgrid: Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021.
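
A minimal sketch of the grid idea, assuming words, boxes (integer pixel coordinates from an external OCR model), and an embed text encoder are available; all three names are placeholders:

import numpy as np

# Chargrid/ViBERTgrid-style input: an H x W x D tensor in which every pixel
# inside a token's bounding box carries that token's embedding.
def build_text_grid(words, boxes, embed, height, width, dim):
    grid = np.zeros((height, width, dim), dtype=np.float32)
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        grid[y0:y1, x0:x1, :] = embed(word)  # broadcast embedding over the box
    return grid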

GNN-based methods

  • Consider text segments as graph nodes and encode the relations between segment coordinates as edge representations (a construction sketch follows the model list below)
  • Good balance between cost and performance
  • Inherent characteristics of GNNs, such as over-smoothing and vanishing gradients, make the training process challenging
  • Require an external OCR model

Open-source models:

  • PICK: Yu et al. PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks. ICPR, 2020
  • GraphDoc: Zhang et al. Multimodal Pre-training Based on Graph Attention Network for Document Understanding. arXiv preprint, 2022.
  • GeoContrastNet: Biescas et al. GeoContrastNet: Contrastive Key-Value Edge Learning for Language-Agnostic Document Understanding. ICDAR, 2024.
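
A generic sketch of the graph construction (not the exact formulation of PICK or GraphDoc): nodes are OCR text segments, and edge features encode the relative geometry between their boxes. The box format and feature choice here are assumptions.

import torch

def box_center(box):  # box = (x0, y0, x1, y1)
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

# Fully connected edge features: center offsets plus a height ratio.
def edge_features(boxes):
    n = len(boxes)
    feats = torch.zeros(n, n, 3)
    for i, bi in enumerate(boxes):
        for j, bj in enumerate(boxes):
            (cxi, cyi), (cxj, cyj) = box_center(bi), box_center(bj)
            hi, hj = bi[3] - bi[1], bj[3] - bj[1]
            feats[i, j] = torch.tensor([cxj - cxi, cyj - cyi, hj / max(hi, 1)])
    return feats  # fed to a GNN together with per-node text/visual embeddings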

Large scale pre-trained models

  • Leverage pre-training with a large amount of data to obtain effective generic models (a usage sketch follows the model list below)
  • Computationally expensive

Open-source models:

  • LayoutLMv3: Huang et al. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. ACMMM, 2022.
  • LiLT: Jiapeng Wang, Lianwen Jin and Kai Ding. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding. ACL, 2022.
  • StrucTexT: Li et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers. ACMMM, 2021.
  • DocFormer: Appalaraju et al. DocFormer: End-to-End Transformer for Document Understanding. ICCV, 2021.
  • StructuralLM: Li et al. StructuralLM: Structural Pre-training for Form Understanding. ACL, 2021.
  • BROS: Hong et al. BROS: A Pre-Trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents. AAAI, 2022.
  • ERNIE-Layout: Peng et al. ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding. EMNLP Findings, 2022.
  • GeoLayoutLM: Luo et al. GeoLayoutLM: Geometric Pre-training for Visual Information Extraction. CVPR, 2023.
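
A usage sketch with the Hugging Face LayoutLMv3 checkpoint for entity labeling as token classification. The base model is not fine-tuned for any particular label set, so num_labels, the example words/boxes (0-1000 normalized), and the file name are placeholders:

from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# apply_ocr=False: words and boxes are supplied by an external OCR step.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7
)

image = Image.open("form.png").convert("RGB")
words = ["Invoice", "No.", "12345"]
boxes = [[80, 40, 190, 60], [195, 40, 230, 60], [240, 40, 320, 60]]

inputs = processor(image, words, boxes=boxes, return_tensors="pt")
logits = model(**inputs).logits          # (1, seq_len, num_labels)
predictions = logits.argmax(-1)          # per-token entity labels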

End-to-end methods

  • Aim to reduce the error accumulation caused by a separate OCR stage
  • Relatively new research direction

Open-source models:

  • TRIE: Zhang et al. TRIE: End-to-End Text Reading and Information Extraction for Document Understanding. ACMMM, 2020.
  • Donut: Kim et al. OCR-free Document Understanding Transformer. ECCV, 2022.
  • Dessurt: Davis et al. End-to-end Document Recognition and Understanding with Dessurt. ECCV, 2022.
  • StrucTexTv2: Yu et al. StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training. ICLR, 2023.
  • Kuang CFAM: Kuang et al. Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution. ICDAR, 2023.
  • Pix2Struct: Lee et al. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. ICML, 2023.
  • UDOP: Tang et al. Unifying Vision, Text, and Layout for Universal Document Processing. CVPR, 2023.

Few-shot methods

  • Learn from only a small number of labeled samples
  • Relatively new research direction

Open-source models:

  • N/A

LLM-based methods

  • Adapt (multimodal) LLMs via prompt engineering or fine-tuning (an illustrative prompt sketch follows the model list below)
  • Support multi-image document understanding

Open-source models:

  • mPLUG-DocOwl: Ye et al. mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding. arXiv preprint 2307.02499, 2023.
  • UReader: Ye et al. UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model. EMNLP Findings, 2023.
  • Monkey: Li et al. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models. CVPR, 2024.
  • OmniParser: Wan et al. OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition. CVPR, 2024.
  • LayTextLLM: Lu et al. LayTextLLM: A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding. arXiv, 2024.
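
An illustrative prompt-construction sketch in the spirit of interleaving layout and text (as LayTextLLM does); the prompt format, field names, and coordinates are all assumptions, not a fixed API:

# Build a key-information-extraction prompt that pairs each OCR segment with
# its bounding box, then asks the LLM to answer in JSON.
def build_kie_prompt(segments, fields):
    """segments: list of (text, (x0, y0, x1, y1)); fields: keys to extract."""
    lines = [f"{t} @({x0},{y0},{x1},{y1})" for t, (x0, y0, x1, y1) in segments]
    return (
        "Extract the following fields from the document and answer in JSON.\n"
        f"Fields: {', '.join(fields)}\n"
        "Document (text @ bounding box):\n" + "\n".join(lines)
    )

prompt = build_kie_prompt(
    [("Invoice", (80, 40, 190, 60)), ("12345", (240, 40, 320, 60))],
    ["invoice_number", "date"],
)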

Potential Solutions: Document AI

Problems

  1. Current end-to-end models consist only of the landmark detection and OCR parts; they do not include a segmentation module.
  2. Licensing: many models are restricted to non-commercial use
  3. Latency?

What an end-to-end model solves

  1. Layout analysis (one-for-all documents)
  2. Joint optimization
  3. Unified evaluation
  4. Mitigating error propagation
  5. VQA?

Donut (Document understanding transformer)

[Figure: Donut]

Contributions:

  1. OCR-free approach
  2. Pre-training with SynthDoG (their own synthetic dataset generator)

Model:

  • Transformer-based visual encoder and textual decoder
  • Visual encoder (Swin Transformer) for extracting features from a given document image
  • Textual decoder (BART) maps the derived features into a sequence of subword tokens to construct a desired structured format (e.g. JSON)
  • Model input: teacher forcing during training; at inference, the decoder generates the token sequence given a task prompt
  • Output conversion: output token sequence --> JSON
  • Metrics: field-level F1 score, Tree Edit Distance

Example: id-card

{"predictions": [{
"surname": "Berthier",
"name": "Corinne",
"sex": "F",
"birthday": "12/06/1965",
"birthplace": "Paris 1er (75)"
}]}
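
A minimal inference sketch with the public Donut checkpoint fine-tuned on CORD (receipts); an ID-card output like the example above would need its own fine-tuned checkpoint and task prompt:

import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
prompt_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    out = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_length=512)

sequence = processor.batch_decode(out)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1)  # drop the task start token
print(processor.token2json(sequence))               # token sequence --> JSON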

LayoutLMv3

[Figure: LayoutLMv3]

Modifications:

  • Multi-task learning approach: attach heads for detection and OCR (see the sketch below)
  • Input: whole image, bounding boxes (with the processor's OCR toggled off), type, text
  • Output: bounding boxes (for scan-quality check), type, text
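
A hedged sketch of this modification, assuming a LayoutLMv3 backbone; the head shapes are placeholders and the OCR head is a toy stand-in for a real sequence decoder, not an existing transformers API:

import torch.nn as nn
from transformers import LayoutLMv3Model

class MultiTaskDocModel(nn.Module):
    def __init__(self, num_types, vocab_size, hidden=768):
        super().__init__()
        self.backbone = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")
        self.bbox_head = nn.Linear(hidden, 4)          # box regression (scan-quality check)
        self.type_head = nn.Linear(hidden, num_types)  # field-type classification
        self.ocr_head = nn.Linear(hidden, vocab_size)  # toy per-token text head

    def forward(self, **inputs):
        h = self.backbone(**inputs).last_hidden_state
        return self.bbox_head(h), self.type_head(h), self.ocr_head(h)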

Example:

[Figures: sample output; LayoutLMv3 example task]

Misc.

Datasets

  • MIDV
  • IAM

[Figure: benchmark datasets]

TO-DO

  • Metrics
  • Parameter size, latency (inference speed)

API services

  • Regula Forensics
  • Google Document AI
  • Azure AI Document Intelligence
  • Amazon Textract
