Opening the era of intelligence: In-depth analysis of the frontier and application of intelligent document analysis technology

This chapter introduces the theory behind document analysis technology, including the background of each task, the classification of algorithms, and the key ideas behind them. After studying this article, you should understand: 1. the categories and typical approaches of layout analysis; 2. the categories and typical approaches of table recognition; 3. the categories and typical approaches of information extraction.

Documents are carriers of information, and different layouts convey different kinds of information, for example checklists and ID cards. Document analysis is the automated process of reading, interpreting, and extracting information from documents. It usually includes the following research directions (a minimal pipeline sketch follows the list):

  1. Layout analysis module: divides each document page into different content regions. This module can be used not only to separate relevant from irrelevant regions, but also to classify the types of content it recognizes.
  2. Optical character recognition (OCR) module: locates and recognizes all text present in the document.
  3. Table recognition module: recognizes tables in the document and converts their content and structure into an Excel file.
  4. Information extraction module: uses the OCR results and image information to understand and identify the specific information expressed in the document, or the relationships between pieces of information.
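
To make the relationship between these modules concrete, here is a minimal pipeline sketch in Python. The function names and the `Region` type are hypothetical placeholders rather than any real library API, and each stage is left as a stub.

```python
# Illustrative document-analysis pipeline skeleton; all stage functions are hypothetical stubs.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    bbox: tuple          # (x0, y0, x1, y1) in image coordinates
    category: str        # e.g. "text", "title", "table", "figure", "list"
    text: str = ""       # filled in by OCR


def analyze_layout(image) -> List[Region]:
    """Layout analysis: split the page into typed content regions (stub)."""
    raise NotImplementedError


def run_ocr(image, regions: List[Region]) -> List[Region]:
    """OCR: detect and recognize the text inside each region (stub)."""
    raise NotImplementedError


def recognize_tables(image, table_regions: List[Region]) -> Dict[int, str]:
    """Table recognition: return an HTML/Excel-friendly structure per table region (stub)."""
    raise NotImplementedError


def extract_information(regions: List[Region]) -> Dict[str, str]:
    """Information extraction: map recognized text to key fields (stub)."""
    raise NotImplementedError


def analyze_document(image) -> Dict[str, object]:
    # The four modules run in sequence; later stages consume earlier results.
    regions = analyze_layout(image)
    regions = run_ocr(image, regions)
    tables = recognize_tables(image, [r for r in regions if r.category == "table"])
    fields = extract_information(regions)
    return {"regions": regions, "tables": tables, "fields": fields}
```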

Since the OCR module has been covered in detail in previous chapters, the following sections introduce the remaining three modules: layout analysis, table recognition, and information extraction. For each module, the classic or commonly used methods and datasets are presented.

1. Layout Analysis

1.1 Background introduction

Layout analysis is mainly used for document retrieval, key information extraction, content classification, and similar tasks. Its goal is to classify the content of document images, with content categories generally divided into plain text, title, table, figure, and list. However, the diversity and complexity of document layouts and formats, poor document image quality, and the lack of large-scale annotated datasets make layout analysis a very challenging task.
The visualization of the layout analysis task is shown in the figure below:

Figure 1: Layout analysis renderings

Existing solutions are generally based on object detection or semantic segmentation, treating the different elements of a document as objects to be detected or regions to be segmented.

Representative papers in these two categories are listed in the table below:

| Category | Main papers |
| --- | --- |
| Object detection based methods | Visual Detection with Context [1], Cross-Domain Object Detection [2] |
| Semantic segmentation based methods | High-Resolution Semantic Segmentation [3], VSR [4] |

1.2 Methods based on object detection

Building on the Faster R-CNN object detection algorithm, Soto Carlos [1] incorporates context information and exploits the inherent positional information of document content to improve region detection performance. Li Kai [2] et al. also proposed a document analysis method based on object detection. It introduces a feature pyramid alignment module, a region alignment module, and a rendering layer alignment module to address the cross-domain problem; the three modules complement each other and adapt the domain from both a general image perspective and a document-specific perspective, thus alleviating the mismatch between large labeled training datasets and the target domain. The figure below shows a flowchart of layout analysis based on the Faster R-CNN detection algorithm.

Figure 2: Flow chart of layout analysis based on Faster R-CNN
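
As an illustration of this detection formulation (not the authors' code), the sketch below builds a generic Faster R-CNN with torchvision and treats layout categories as detection classes; `num_classes=6` assumes five layout categories plus background, and the input is a dummy tensor.

```python
# Minimal sketch: layout analysis as object detection with a generic Faster R-CNN.
import torch
import torchvision

# 5 layout categories (text, title, list, table, figure) + background.
# Note: the ImageNet-pretrained backbone weights may be downloaded on first use.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=6)
model.eval()

page = torch.rand(3, 800, 600)          # dummy document image tensor with values in [0, 1]
with torch.no_grad():
    predictions = model([page])          # list with one dict per input image

boxes = predictions[0]["boxes"]          # (N, 4) predicted region coordinates
labels = predictions[0]["labels"]        # (N,) predicted layout categories
scores = predictions[0]["scores"]        # (N,) confidence scores
```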

1.3 Methods Based on Semantic Segmentation

Sarkar Mausoom [3] et al. proposed a prior-based segmentation mechanism that trains the document segmentation model on very high-resolution images, solving the problem that different structures in dense regions become indistinguishable and get merged when the original image is downscaled too much. Combining vision, semantics, and relations in documents, Zhang Peng [4] et al. proposed VSR (Vision, Semantics and Relations), a unified framework for document layout analysis. It uses a two-stream network to extract visual and semantic features and fuses them adaptively through an aggregation module, addressing the limitations of existing CV-based methods such as inefficient fusion of different modalities and the lack of relationship modeling between layout components.
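
For comparison, here is a minimal sketch of the segmentation view of layout analysis, using a generic FCN from torchvision rather than the VSR model itself; every pixel is assigned one of six assumed classes (five layout categories plus background).

```python
# Minimal sketch: layout analysis as per-pixel semantic segmentation.
import torch
from torchvision.models.segmentation import fcn_resnet50

# Note: the ImageNet-pretrained backbone weights may be downloaded on first use.
model = fcn_resnet50(num_classes=6)
model.eval()

page = torch.rand(1, 3, 512, 512)             # dummy document image batch
with torch.no_grad():
    out = model(page)["out"]                  # (1, 6, 512, 512) per-pixel class scores
layout_mask = out.argmax(dim=1)               # (1, 512, 512) predicted layout class per pixel
```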

1.4 Datasets

Although existing methods can solve layout analysis to some extent, they rely on large amounts of labeled training data. Recently, several datasets have been proposed for document analysis tasks.

  1. PubLayNet [5]: contains 500,000 document images, of which 400,000 are used for training, 50,000 for validation, and 50,000 for testing. Five types of regions are annotated: text, title, list, table, and figure (see the loading sketch after this list).
  2. HJDataset [7]: contains 2,271 document images; in addition to bounding boxes and masks of content regions, it also provides the hierarchy and reading order of layout elements.
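
PubLayNet distributes its annotations as COCO-format JSON, so a quick way to inspect the label distribution looks roughly like the sketch below; the local file path is an assumption about where the dataset was unpacked.

```python
# Minimal sketch: counting layout categories in a COCO-style PubLayNet annotation file.
import json
from collections import Counter

with open("publaynet/val.json") as f:           # hypothetical local path to the annotation file
    coco = json.load(f)

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}   # text, title, list, table, figure
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(counts)                                    # how many regions of each layout type
```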

A sample of the PubLayNet dataset is shown below:

Figure 3: PubLayNet sample

References:

[1]:Soto C, Yoo S. Visual detection with context for document layout analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3464-3470.

[2]:Li K, Wigington C, Tensmeyer C, et al. Cross-domain document object detection: Benchmark suite and method[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12915-12924.

[3]:Sarkar M, Aggarwal M, Jain A, et al. Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 649-666.

[4]:Zhang P, Li C, Qiao L, et al. VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations[J]. arXiv preprint arXiv:2105.06220, 2021.

[5]:Zhong X, Tang J, Yepes A J. Publaynet: largest dataset ever for document layout analysis[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022.

[6]:Li M, Xu Y, Cui L, et al. DocBank: A benchmark dataset for document layout analysis[J]. arXiv preprint arXiv:2006.01038, 2020.

[7]:Shen Z, Zhang K, Dell M. A large dataset of historical japanese documents with complex layouts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 548-549.

2. Table Recognition

2.1 Background introduction

Tables are common elements in all kinds of documents. With the explosive growth of documents, how to efficiently find tables in documents and obtain their content and structure, i.e. table recognition, has become an urgent problem. The main difficulties of table recognition are:

  1. Table types and styles are complex and diverse, e.g. different row and column combinations and different types of cell content.
  2. The styles of the documents themselves vary widely.
  3. Lighting and other capture conditions vary when documents are photographed.

The task of table recognition is to convert the table information in a document into an Excel file. The task is visualized as follows:

Figure 4: An example of table recognition; the original image is on the left, and the recognized result, presented as an Excel sheet, is on the right
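
Many modern methods first output the recognized table as HTML; a minimal sketch of turning such an HTML string into the Excel file mentioned above might look like the following (using pandas, which needs lxml or html5lib for parsing and openpyxl for writing). The HTML string here is a toy stand-in for a model prediction.

```python
# Minimal sketch: convert a predicted HTML table into an Excel file.
from io import StringIO

import pandas as pd

predicted_html = """
<table>
  <tr><th>Item</th><th>Qty</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td><td>1.50</td></tr>
  <tr><td>Notebook</td><td>1</td><td>3.20</td></tr>
</table>
"""

table = pd.read_html(StringIO(predicted_html))[0]   # parse the first table in the HTML
table.to_excel("recognized_table.xlsx", index=False)
```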

The existing table recognition algorithms can be divided into the following four categories according to how they reconstruct the table structure:

  1. Heuristic-based methods
  2. CNN-based methods
  3. GCN-based methods
  4. End-to-end methods

Representative papers in these four categories are listed in the table below:

| Category | Approach | Main papers |
| --- | --- | --- |
| Heuristic-based methods | Hand-crafted rules, connected-component detection and analysis | T-Rect, pdf2table |
| CNN-based methods | Object detection, semantic segmentation | CascadeTabNet, Multi-Type-TD-TSR, LGPMA, TabStruct-Net, CDeC-Net, TableNet, TableSense, DeepDeSRT, DeepTabStR, GTE, Cycle-CenterNet, FCN |
| GCN-based methods | Treat table recognition as graph reconstruction with graph neural networks | GNN, TGRNet, GraphTSR |
| End-to-end methods | Attention mechanisms | TableMaster |

2.2 Traditional algorithms based on heuristic rules

Early research on table recognition was mainly based on heuristic rules. For example, the T-Rect system proposed by Kieninger [1] et al. analyzes the connected components of a document image bottom-up and then merges them according to predefined rules to obtain logical text blocks. pdf2table, proposed by Yildiz [2] et al., was the first method for table recognition on PDF documents; it uses information unique to PDF files (such as text and drawing paths) to assist table recognition. In more recent work, Koci [3] et al. represent the layout regions of a page as a graph and then use the Remove and Conquer (RAC) algorithm to identify tables as subgraphs.

Figure 5: Schematic diagram of the heuristic algorithm
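
Here is a minimal sketch of the connected-component step that heuristic systems such as T-Rect build on: binarize the page with OpenCV, extract connected components, and keep their bounding boxes as candidate blocks for later rule-based merging. The input file name is a placeholder.

```python
# Minimal sketch: connected-component analysis as used by heuristic table recognition.
import cv2

page = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image path
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)

# stats[i] = (x, y, width, height, area); label 0 is the background component.
boxes = [tuple(stats[i][:4]) for i in range(1, num_labels) if stats[i][4] > 20]
print(f"{len(boxes)} candidate text components for rule-based merging")
```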

2.3 Method based on deep learning CNN

With the rapid development of deep learning technology in the fields of computer vision, natural language processing, and speech processing, researchers have applied deep learning technology to the field of table recognition and achieved good results.

In the DeepTabStR algorithm, Siddiqui Shoaib Ahmed [12] et al. formulate table structure recognition as an object detection problem and use deformable convolutions to better detect table cells. Raja Sachin [6] et al. proposed TabStruct-Net, which visually combines cell detection and structure recognition, addressing the recognition errors caused by large layout variations in earlier methods; however, this method still cannot handle tables with many empty cells spanning rows and columns.

Figure 6: Schematic diagram of the algorithm based on deep learning CNN
Figure 7: An example of an algorithm error based on a deep learning CNN

Previous table structure recognition methods generally approach the problem from elements of a single granularity (rows/columns or text regions) and tend to ignore the problem of merging empty cells. Qiao Liang [10] et al. proposed a new framework, LGPMA, which makes full use of local and global features through a mask re-scoring strategy to obtain more reliable aligned cell regions, and finally introduces a table structure recovery pipeline consisting of cell matching, empty cell search, and empty cell merging to complete table structure recognition.

In addition to the above algorithms that address table recognition alone, some methods complete table detection and table structure recognition within a single model. Schreiber Sebastian [11] et al. proposed DeepDeSRT, which uses Faster R-CNN for table detection and an FCN semantic segmentation model for detecting the rows and columns of the table structure; however, it uses two independent models to solve the two problems. Prasad Devashish [4] et al. proposed an end-to-end deep learning method, CascadeTabNet, which uses a Cascade Mask R-CNN HRNet model to perform table detection and structure recognition simultaneously, removing the need for two independent models. Paliwal Shubham [8] et al. proposed TableNet, a novel end-to-end deep multi-task architecture for table detection and structure recognition, and added extra spatial semantic features during training to further improve performance. Zheng Xinyi [13] et al. proposed GTE, a system framework for table recognition that uses a cell detection network to guide the training of the table detection network, together with a hierarchical network and a new cluster-based cell structure recognition algorithm; the framework can be attached to any object detection model, making it easy to train different table recognition algorithms. Previous research mainly focused on parsing well-aligned tables with simple layouts from scanned PDF documents, but tables in real-world scenes are generally complex and may suffer from severe deformation, bending, or occlusion. Long Rujiao [14] et al. therefore built WTW, a table recognition dataset of realistic complex scenes, and proposed Cycle-CenterNet, which uses a cycle-pairing module and a new pairing loss to accurately group discrete cells into structured tables, improving table recognition performance.

Figure 8: Schematic diagram of the end-to-end algorithm

CNN-based methods still struggle with cells that span multiple rows or columns, so follow-up work has pursued two further lines of research, described below, to handle such cells.

2.4 Method based on deep learning GCN

In recent years, with the rise of graph convolutional networks (GCN), some researchers have tried to apply graph neural networks to table structure recognition. Qasim Shah Rukh [20] et al. reformulated table structure recognition as a graph problem compatible with graph neural networks and designed a novel differentiable architecture that exploits both the feature-extraction strengths of convolutional neural networks and the effective interaction between graph neural network vertices; however, this method only uses the positional features of cells, not their semantic features. Chi Zewen [19] et al. proposed GraphTSR, a novel graph neural network for table structure recognition in PDF files; by predicting the relationships between cells to identify the table structure, it partially solves the problem of recognizing cells that span rows or columns. Xue Wenyuan [21] et al. reformulated table structure recognition as table graph reconstruction and proposed TGRNet, an end-to-end method comprising a cell detection branch and a cell logical-location branch; the two branches jointly predict the spatial and logical locations of cells, addressing the fact that previous methods ignored the logical locations of cells.

The GraphTSR table recognition algorithm is illustrated below:

Figure 9: Schematic diagram of GraphTSR table recognition algorithm

2.5 End-to-End Based Approach

Unlike other methods that rely on post-processing to reconstruct the table structure, end-to-end methods use the network to directly output an HTML representation of the table structure.

Figure 10: Input and output of the end-to-end approach
Figure 11: Example of image captioning

Most end-to-end methods use a Seq2Seq formulation borrowed from image captioning (see Figure 11) to predict the table structure, for example methods based on attention or Transformers.

Figure 12: Schematic diagram of Seq2Seq
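
Here is a minimal sketch of this image-captioning formulation (not TableMaster itself): a Transformer decoder attends over flattened visual features and greedily emits structure tokens such as `<tr>` and `<td>` until it produces `<eos>`. The vocabulary, feature shapes, and random features are toy assumptions, and the untrained model will of course emit arbitrary tokens.

```python
# Minimal sketch: Seq2Seq decoding of table-structure tokens from image features.
import torch
import torch.nn as nn

vocab = ["<pad>", "<sos>", "<eos>", "<tr>", "</tr>", "<td>", "</td>"]   # toy structure vocabulary
d_model = 256

token_emb = nn.Embedding(len(vocab), d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=3,
)
to_vocab = nn.Linear(d_model, len(vocab))

img_feats = torch.rand(1, 60, d_model)   # e.g. a flattened CNN feature map, shape (B, H*W, C)
tokens = torch.tensor([[1]])             # start decoding from <sos>

with torch.no_grad():
    for _ in range(20):                  # greedy autoregressive decoding
        tgt = token_emb(tokens)
        length = tokens.size(1)
        causal = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        out = decoder(tgt, img_feats, tgt_mask=causal)
        next_token = to_vocab(out[:, -1]).argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == vocab.index("<eos>"):
            break

print([vocab[i] for i in tokens[0].tolist()])   # predicted structure token sequence
```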

In TableMaster, Ye Jiaquan [22] et al. obtained a table-structure output model by improving the Transformer-based Master text recognition algorithm, and added an extra branch for cell box coordinate regression. Instead of splitting the model into two branches at the last layer, the authors decouple sequence prediction and box regression into two separate branches right after the first Transformer decoding layer. The comparison between this network structure and the original Master network is shown in the figure below:

Figure 13: Left: Master network; right: TableMaster network

2.6 Datasets

Since deep learning methods are data-driven, large amounts of labeled data are required to train models, and the small size of earlier datasets has been an important limiting factor. Several table datasets have therefore been proposed:

  1. PubTabNet [16]: contains 568k table images and their corresponding structured HTML representations.
  2. PubTables-1M [17]: a table structure recognition dataset with highly detailed structural annotations, containing 460,589 PDF page images for the table detection task and 947,642 table images for the table recognition task.
  3. TableBank [18]: a table detection and recognition dataset built from Word and LaTeX documents on the Internet, containing 417K high-quality annotated tables.
  4. SciTSR [19]: a table structure recognition dataset whose images are mostly converted from papers; it contains 15,000 tables from PDF files together with their structure labels.
  5. TabStructDB [12]: consists of 1,081 table regions densely annotated with row and column information.
  6. WTW [14]: a large-scale scene table detection and recognition dataset containing tables under various deformation, bending, and occlusion conditions, with 14,581 images in total.

Dataset examples:

Figure 14: Sample diagram of PubTables-1M dataset
Figure 15: Sample diagram of WTW dataset

3. Document VQA

3.1 Background introduction

In the VQA (Visual Question Answering) task, questions and answers are mainly about the image content. For text images, however, the information of interest is the text in the image, so such methods can be divided into Text-VQA for natural scenes and DocVQA for scanned documents. The relationship among the three is shown in the figure below.

Figure 16: VQA Hierarchy

Example graphs of VQA, Text-VQA and DocVQA are shown in the figure below.

| Task | VQA | Text-VQA | DocVQA |
| --- | --- | --- | --- |
| Description | Questions about the image content | Questions about the text in the image | Questions about the text content of document images |

Because DocVQA is closer to practical application scenarios, a large number of academic and industrial works have emerged. In common scenarios the questions asked in DocVQA are fixed; for example, the questions in an ID card scenario are generally:

  1. What is the citizenship number?
  2. What is the name?
  3. What is the ethnicity?
Figure 17: Example of ID card

Given such prior knowledge, DocVQA research has focused on the Key Information Extraction (KIE) task, and here we also mainly discuss KIE-related research. The KIE task extracts the key information needed from an image, for example extracting the name and citizenship number from an ID card.

KIE is usually divided into two subtasks:

  1. SER: Semantic Entity Recognition, which classifies each detected text region, for example into names and ID numbers (the black and red boxes in the figure below).
  2. RE: Relation Extraction, which first classifies each detected text region, for example into questions and answers, and then finds the corresponding answer for each question. In the figure below, the red and black boxes represent questions and answers respectively, and the yellow lines represent the correspondence between them.
Figure 18: Example of SER,RE tasks

General KIE methods are based on named entity recognition (NER) [4], but such methods only use the text information in the image and ignore visual and structural information, so their accuracy is limited. Building on this, recent methods have begun to fuse visual and structural information with text information. According to how the multimodal information is fused, these methods can be divided into the following four categories:

  1. Grid-based methods
  2. Token-based methods
  3. GCN-based methods
  4. End-to-end methods

Representative papers in these four categories are listed in the table below:

| Category | Approach | Main papers |
| --- | --- | --- |
| Grid-based methods | Fuse multimodal information (text, layout, image) at the image level | Chargrid |
| Token-based methods | Fuse multimodal information with BERT-style models | LayoutLM, LayoutLMv2, StrucTexT |
| GCN-based methods | Fuse multimodal information with graph network structures | GCN, PICK, SDMG-R, SERA |
| End-to-end methods | Unify OCR and key information extraction in one network | TRIE |

3.2 Grid-based method

Grid-based methods fuse multimodal information at the image level. Chargrid [5] first performs character-level text detection and recognition on the image, then builds the network input by filling the one-hot encoding of each character's category into the corresponding character region (the non-black parts of the right image in the figure below), and finally feeds it into an encoder-decoder CNN that performs coordinate detection and category classification of the key information.

Figure 19: Chargrid data example
Figure 20: Chargrid network
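
A minimal sketch of this Chargrid-style input construction, with toy OCR results: each character's class index is painted into its bounding box and then one-hot encoded, producing a grid that carries both textual content and layout. The character set and box values are assumptions for illustration.

```python
# Minimal sketch: build a Chargrid-style input tensor from character-level OCR results.
import numpy as np

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"          # index 0 is reserved for background
H, W = 64, 128
chargrid = np.zeros((H, W), dtype=np.int64)

# Toy (char, x0, y0, x1, y1) boxes, e.g. from a character-level OCR model.
ocr_chars = [("t", 10, 5, 14, 12), ("o", 15, 5, 19, 12), ("7", 40, 5, 45, 12)]

for ch, x0, y0, x1, y1 in ocr_chars:
    cls = CHARSET.index(ch.lower()) + 1                    # +1 keeps 0 for background
    chargrid[y0:y1, x0:x1] = cls

# One-hot encode to (num_classes, H, W) before feeding an encoder-decoder CNN.
one_hot = np.eye(len(CHARSET) + 1, dtype=np.float32)[chargrid].transpose(2, 0, 1)
print(one_hot.shape)    # (37, 64, 128)
```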

Compared with traditional text-only methods, this approach can use text and structural information at the same time, so it achieves some accuracy improvement. However, the fusion of text and structural information is just a simple embedding and does not combine the two modalities very well.

3.3 Token-based method

LayoutLM [6] encodes 2D position information together with text into the BERT model and, borrowing BERT's pre-training idea from NLP, pre-trains on large-scale datasets. In downstream tasks LayoutLM additionally incorporates image information to further improve performance. Although LayoutLM combines text, position, and image information, the image information is only fused during downstream fine-tuning, so the multimodal fusion of the three is not sufficient. Building on LayoutLM, LayoutLMv2 [7] fuses image, text, and layout information with Transformers already in the pre-training stage, and adds a spatially aware self-attention mechanism to the Transformer to help the model better integrate visual and textual features. Although LayoutLMv2 fuses text, position, and image information during pre-training, the visual features learned by the model are still not fine-grained enough due to the limitations of the pre-training tasks. StrucTexT [8] adds two new pre-training tasks, Sentence Length Prediction (SLP) and Paired Boxes Direction (PBD), on top of previous multimodal methods to help the network learn fine-grained visual features: the SLP task lets the model learn the length of text segments, and the PBD task lets the model learn the matching relationship between box directions. With these two new pre-training tasks, deep cross-modal fusion between text, visual, and layout information is accelerated.

Figure 21: Transformer algorithm flowchart
Figure 22: LayoutLMv2 algorithm flowchart
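
A minimal sketch of the core LayoutLM idea (not the released implementation): every token embedding is summed with embeddings of its normalized bounding-box coordinates, so the Transformer encoder sees both what the text says and where it sits on the page. The vocabulary size, hidden size, and toy inputs are assumptions.

```python
# Minimal sketch: LayoutLM-style fusion of token and 2D position embeddings.
import torch
import torch.nn as nn

vocab_size, hidden = 30522, 768
coord_bins = 1001                          # coordinates normalized to the range 0..1000

word_emb = nn.Embedding(vocab_size, hidden)
x_emb = nn.Embedding(coord_bins, hidden)   # shared by x0 and x1
y_emb = nn.Embedding(coord_bins, hidden)   # shared by y0 and y1

input_ids = torch.tensor([[101, 2054, 2003, 102]])            # toy token ids
bboxes = torch.tensor([[[0, 0, 0, 0],                         # (x0, y0, x1, y1) per token
                        [120, 40, 210, 60],
                        [220, 40, 300, 60],
                        [1000, 1000, 1000, 1000]]])

embeddings = (
    word_emb(input_ids)
    + x_emb(bboxes[..., 0]) + y_emb(bboxes[..., 1])
    + x_emb(bboxes[..., 2]) + y_emb(bboxes[..., 3])
)
print(embeddings.shape)    # (1, 4, 768), fed into a BERT-style Transformer encoder
```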

3.4 GCN-based methods

Although existing GCN-based methods [10] exploit textual and structural information, they do not make good use of image information. PICK [11] adds image information to the GCN network and proposes a graph learning module to automatically learn the edge types. SDMG-R [12] encodes the image as a dual-modality graph whose nodes carry the visual and textual information of text regions and whose edges represent the direct spatial relationships between adjacent texts; by iteratively propagating information along the edges and reasoning about node categories, SDMG-R solves the problem that existing methods cannot handle previously unseen templates.
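
Here is a minimal sketch of the graph view shared by these GCN-based methods: text regions become nodes, spatially adjacent regions are linked by edges, and one symmetric-normalized graph-convolution step mixes each node's features with its neighbours' before node classification. The adjacency matrix and features are toy values.

```python
# Minimal sketch: one graph-convolution step over text-region nodes for KIE.
import torch
import torch.nn as nn

num_nodes, feat_dim = 5, 64
node_feats = torch.rand(num_nodes, feat_dim)       # fused visual + text features per region

# Adjacency from spatial neighbourhood (hand-written here), including self-loops.
adj = torch.tensor([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=torch.float32)

deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
norm_adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)   # D^-1/2 A D^-1/2

gcn_layer = nn.Linear(feat_dim, feat_dim)
updated = torch.relu(norm_adj @ gcn_layer(node_feats))     # one propagation step
node_logits = nn.Linear(feat_dim, 4)(updated)              # e.g. key / value / header / other
print(node_logits.shape)                                    # (5, 4)
```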

The PICK flow chart is shown in the figure below:

Figure 23: PICK algorithm flow chart

SERA [10] introduces the biaffine parser from dependency parsing into document relation extraction and uses a GCN to fuse text and visual information.

Figure 24: SERA Algorithm Flowchart

3.5 End to End based method

Existing methods divide KIE into two independent tasks, text reading and information extraction; however, they mainly focus on improving information extraction and ignore the fact that text reading and information extraction are mutually related. TRIE [9] therefore proposed a unified end-to-end network that learns both tasks simultaneously so that they reinforce each other during training.

Figure 25: Trie algorithm flow chart

3.6 Datasets

The main datasets used for KIE are:

  1. SROIE: Task 3 of the SROIE dataset [2] aims to extract four predefined pieces of information from scanned receipts: company, date, address or total. There are 626 samples in the dataset for training and 347 samples for testing.
  2. FUNSD: The FUNSD dataset [3] is a dataset for extracting form information from scanned documents. It contains 199 annotated real scan forms. 149 of the 199 samples are used for training and 50 for testing. The FUNSD dataset assigns each word a semantic entity label: question, answer, title, or others.
  3. XFUN: the XFUN dataset is a multilingual dataset proposed by Microsoft; it covers 7 languages, with 149 training forms and 50 test forms per language.
Figure 26: SROIE example
Figure 27: XFUN example

Reference links:

https://aistudio.baidu.com/education/group/info/25207

https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.7

