Exploring Hidden Information in Image Data: A Wonderful Journey to Semantic Entity Recognition and Relation Extraction

1. Introduction

1.1 Background

Key Information Extraction (KIE) refers to extracting key information from text or images. As a downstream task of OCR, key information extraction from document images has many practical applications, such as form recognition, ticket information extraction, and ID card information extraction. However, extracting or collecting key information from these document images manually is time-consuming and labor-intensive. How to automatically fuse the visual, layout, text, and other features in the image to complete key information extraction is a problem that is both valuable and challenging.

For document images of a specific scene, the location and layout of the key information are relatively fixed. Therefore, early research produced many methods that extract key information by template matching. Because the process is relatively simple, this approach is still widely used in many scenarios today. However, when a template-matching method is applied to a different scenario, it takes a lot of effort to adjust and adapt the template, so the migration cost is high.

KIE in document images generally includes 2 subtasks, as shown in the figure below.

  • (1) SER: Semantic Entity Recognition, which classifies each detected text line, for example into name fields and ID number fields; these correspond to the black and red boxes in the figure below.
  • (2) RE: Relation Extraction, which first classifies each detected text line, for example into questions (keys) and answers (values), and then finds the corresponding answer for each question, which is equivalent to completing key-value matching. The red and black boxes in the figure below represent questions and answers respectively, and the yellow lines represent the correspondence between them.
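To illustrate the difference between the two subtasks, the sketch below shows, as plain Python data structures, the kind of output each one produces; the texts, categories, and variable names are made up for illustration and do not come from a real model.

# SER: every detected text line is assigned a semantic category.
ser_result = [
    {"text": "姓名", "category": "question"},    # a key field
    {"text": "张三", "category": "answer"},      # a value field
    {"text": "居民身份证", "category": "other"},  # background text
]

# RE: questions are paired with their answers (key-value matching).
re_result = [("姓名", "张三")]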

1.2 Mainstream methods based on deep learning

General-purpose KIE methods are based on Named Entity Recognition (NER), but this type of method only uses text information and ignores location and visual features, so its accuracy is limited. In recent years, most researchers have begun to fuse input information from multiple modalities and process the multi-modal information jointly, thereby improving the accuracy of KIE. The main methods are as follows.

  • (1) Grid-based methods: these mainly fuse multi-modal information at the image level. Text is usually handled at character granularity, and the embedding of text and structural information is relatively simple, as in Chargrid[1] and similar algorithms.
  • (2) Token-based methods: following BERT-style methods in NLP, these jointly encode location, visual, and other feature information into a multimodal model and pre-train on large-scale datasets, so that downstream tasks need only a small amount of labeled data to achieve good results. Examples include LayoutLM[2], LayoutLMv2[3], LayoutXLM[4], and StrucTexT[5].
  • (3) GCN-based methods: these try to learn the structural information between images and text so that they can handle open-set information extraction (templates not seen in the training set), as in GCN[6], SDMGR[7], and similar algorithms.
  • (4) End-to-end methods: these put the two existing tasks of OCR text recognition and KIE information extraction into a unified network for joint learning, so that the two tasks strengthen each other during training. Examples include Trie[8].

For a more detailed introduction to this series of algorithms, please refer to the sixth lecture of the "Hands-on OCR Ten Lectures" course: Document Analysis Theory and Practice.

2. Key information extraction task process

Token-based algorithms such as LayoutXLM are implemented in PaddleOCR. In PP-StructureV2, the network structure of the LayoutXLM multimodal pre-training model is further simplified: the visual backbone is removed, producing the visual-feature-independent VI-LayoutXLM. PP-StructureV2 also introduces a sorting logic consistent with human reading order and the UDML knowledge distillation strategy, which improves both the accuracy and the inference speed of the key information extraction model.

The following describes how to complete key information extraction tasks based on PaddleOCR.

In the non-end-to-end KIE approach, at least two steps are required to complete key information extraction: first use an OCR model to extract the text positions and contents, and then use a KIE model to extract the key information from the image, text positions, and text contents.
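To make the two-step flow concrete, here is a minimal Python sketch. The OCR step uses the standard PaddleOCR Python API; the KIE step is only a hypothetical placeholder (extract_key_info is not a real PaddleOCR function) standing in for a trained SER/RE model such as the VI-LayoutXLM described in section 2.2, and the image file name is made up.

from paddleocr import PaddleOCR  # pip install paddleocr

# Step 1: text detection + recognition with a general OCR model.
# The exact structure of the result depends on the PaddleOCR version, but it is
# essentially a list of (box, (text, score)) items for each text line.
ocr = PaddleOCR(lang="ch")
ocr_result = ocr.ocr("idcard.jpg")

# Step 2 (hypothetical placeholder): feed the image together with the text
# positions and contents into a trained SER/RE model to obtain key information.
def extract_key_info(image_path, ocr_result):
    """A trained SER/RE model (e.g. VI-LayoutXLM) would be called here."""
    return {}  # e.g. {"name": "张三", ...} once a KIE model is plugged in

key_info = extract_key_info("idcard.jpg", ocr_result)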

2.1 Training OCR model

2.1.1 Text detection

(1) data

Most of the models provided in PaddleOCR are general-purpose models. During text detection, adjacent text lines are generally separated according to their positional distance. As shown in the figure above, when the PP-OCRv3 general Chinese and English detection model is used for text detection, it is easy for the two different fields "ethnicity" and "Han" to be detected as one box, which increases the difficulty of the subsequent KIE task. Therefore, when working on a KIE task, it is recommended to first train a detection model on the document dataset.

When labeling the data, key information fields need to be annotated separately even when they sit very close together, like the three characters "民族汉" ("ethnicity Han") in the figure above. In this case, "ethnicity" and "Han" must be marked as two separate text detection boxes; otherwise the subsequent KIE task becomes more difficult.
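As an illustration, below is a minimal sketch of what one annotation line could look like in PaddleOCR's text detection label format (image path, a tab, then a JSON-encoded list of boxes). The file names and coordinates are made up; the point is that "民族" and "汉" are annotated as two separate boxes.

import json

# Two adjacent but separate boxes for "民族" (ethnicity) and "汉" (Han).
# File names and coordinates are purely illustrative.
boxes = [
    {"transcription": "民族", "points": [[120, 80], [180, 80], [180, 110], [120, 110]]},
    {"transcription": "汉", "points": [[200, 80], [230, 80], [230, 110], [200, 110]]},
]

# One annotation line: image path, a tab, then the JSON-encoded box list.
line = "images/idcard_0001.jpg\t" + json.dumps(boxes, ensure_ascii=False)
with open("det_train_list.txt", "a", encoding="utf-8") as f:
    f.write(line + "\n")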

For downstream tasks, generally speaking, 200~300 images of text training data are enough to guarantee a basic training effect. If you do not have much prior knowledge, you can label 200~300 images first and then train the text detection model.

(2) model

In terms of model selection, it is recommended to use PP-OCRv3_det. For more information on training the detection model, please refer to the OCR Text Detection Model Training Tutorial and the PP-OCRv3 Text Detection Model Training Tutorial.

2.1.2 Text recognition

Compared with natural scenes, text recognition in document images is generally less difficult (the background is relatively simple), so it is recommended to first try the PP-OCRv3 general text recognition model provided in PaddleOCR (PP-OCRv3 model library link).

(1) data

However, some document scenarios still pose challenges, such as rare characters on ID cards or special fonts on invoices, which increase the difficulty of text recognition. In such cases, if you want to ensure or further improve the accuracy of the model, it is recommended to fine-tune the PP-OCRv3 recognition model on a text recognition dataset built from the specific document scene.

During model fine-tuning, it is recommended to prepare at least 5,000 text recognition images from the target (vertical) scene, which is enough to guarantee a basic fine-tuning effect. If you want to further improve the accuracy and generalization ability of the model, you can synthesize more text recognition data similar to this scene, collect general real text recognition data from public datasets, and add both to the training task for this scene. During training, it is recommended that the ratio of real vertical-scene data, synthetic data, and general data in each epoch be around 1:1:1, which can be controlled by setting the sampling ratio of the different data sources. For example, if there are three training label files containing 1W, 2W, and 5W samples respectively (W = 10,000), the data can be configured as follows:

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list:
    - ./train_data/train_list_1W.txt
    - ./train_data/train_list_2W.txt
    - ./train_data/train_list_5W.txt
    ratio_list: [1.0, 0.5, 0.2]
    ...
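With the ratio_list above, each epoch samples roughly 1W images from each file (1W × 1.0, 2W × 0.5, 5W × 0.2), which gives the recommended 1:1:1 mix of real vertical-scene, synthetic, and general data.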

(2) model

In terms of model selection, it is recommended to use the general Chinese and English text recognition model PP-OCRv3_rec. For more information on training the text recognition model, please refer to the OCR Text Recognition Model Training Tutorial and the PP-OCRv3 text recognition model library and configuration files.

2.2 Training KIE model

There are two main methods for extracting key information from the recognized text.

(1) Use SER directly to obtain the categories of key information: for example, in the ID card scenario, mark "Name" and "Zhang San" as name_key and name_value respectively. The text field finally recognized as name_value is the key information we need.

(2) Combine SER and RE: in this method, first use SER to obtain all the keys and values in the image text, and then use RE to pair the keys and values and find the mapping relationships, thereby completing key information extraction.

2.2.1 SER

Taking the ID card scenario as an example, the key information generally includes fields such as 姓名 (name), 性别 (gender), and 民族 (ethnicity). We can directly mark the corresponding fields as specific categories, as shown in the figure below.

Note:

  • During the labeling process, text content that is not related to the KIE key information needs to be marked as the other category, which is equivalent to background information. For example, in the ID card scenario, if we do not care about gender information, we can mark both the "gender" and "male" fields as other.
  • During the labeling process, annotation is done in units of text lines; there is no need to label the position of individual characters.

In terms of data volume, generally speaking, for relatively fixed scenes, about 50 training images can achieve acceptable results, and PPOCRLabel can be used to complete the KIE labeling process.
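For reference, below is a minimal sketch of what one SER annotation line might look like, assuming the XFUND-style JSON label format used in PaddleOCR's KIE examples; the file names, coordinates, and field set are illustrative only, and the exact schema should follow PaddleOCR's KIE dataset documentation.

import json

# One SER annotation line: image path, a tab, then a JSON list of text lines,
# each with its transcription, quadrilateral box, and SER category label.
annotations = [
    {"transcription": "姓名", "label": "name_key",
     "points": [[10, 20], [90, 20], [90, 50], [10, 50]]},
    {"transcription": "张三", "label": "name_value",
     "points": [[100, 20], [200, 20], [200, 50], [100, 50]]},
    # A field we do not care about is labeled with the background category "other".
    {"transcription": "性别", "label": "other",
     "points": [[10, 60], [90, 60], [90, 90], [10, 90]]},
]
line = "images/idcard_0001.jpg\t" + json.dumps(annotations, ensure_ascii=False)
with open("ser_train_list.txt", "a", encoding="utf-8") as f:
    f.write(line + "\n")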

In terms of models, it is recommended to use the VI-LayoutXLM model proposed in PP-StructureV2. It is an improvement on the LayoutXLM model in which the visual feature extraction module is removed; with essentially no loss of accuracy, the inference speed is further improved. For more tutorials, please refer to: VI-LayoutXLM Algorithm Introduction and KIE Key Information Extraction Tutorial.

2.2.2 SER + RE

This approach includes two stages: SER and RE. The SER stage identifies all the keys and values in the document image, and the RE stage matches the keys with their values.

Taking the ID card scenario as an example, the key information generally includes fields such as 姓名 (name), 性别 (gender), and 民族 (ethnicity). In the SER stage, we need to identify all the questions (keys) and answers (values); the annotations are shown below. The label of each field can be question, answer, or other (fields not related to the key information to be extracted).

In the RE stage, you need to label the id and linking information of each field, as shown in the figure below.

Each text line field needs id and linking information. The id records the unique identifier of the text line; ids must not be repeated within the same image. The linking field is a list that records the connections between different text lines. For example, if the id of the field "Birth" is 0 and the id of the field "January 11, 1996" is 1, then both are labeled with linking [[0, 1]], indicating that the fields with id=0 and id=1 form a key-value pair (the name, gender, and other fields are labeled in the same way, so they are not repeated here).

Note:

  • During the labeling process, if a value spans multiple text lines, an additional key-value pair can be added in linking, such as [[0, 1], [0, 2]]; a labeling sketch follows this note.
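To make the id and linking fields concrete, here is a minimal sketch of one annotation line, again assuming the XFUND-style JSON label format used in PaddleOCR's KIE examples; the file names and coordinates are illustrative, and the "Birth" field with id 0 links to its answer with id 1, exactly as described above.

import json

# One SER+RE annotation line: every text line gets a unique "id", and
# "linking" lists the [question_id, answer_id] pairs it participates in.
annotations = [
    {"transcription": "出生", "label": "question", "id": 0, "linking": [[0, 1]],
     "points": [[10, 100], [90, 100], [90, 130], [10, 130]]},
    {"transcription": "1996年1月11日", "label": "answer", "id": 1, "linking": [[0, 1]],
     "points": [[100, 100], [260, 100], [260, 130], [100, 130]]},
    # If the value spanned a second text line with id 2, the linking list on the
    # question would become [[0, 1], [0, 2]].
]
line = "images/idcard_0001.jpg\t" + json.dumps(annotations, ensure_ascii=False)
with open("re_train_list.txt", "a", encoding="utf-8") as f:
    f.write(line + "\n")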

In terms of data volume and model selection, the same recommendations as in the SER section apply: for relatively fixed scenes, about 50 training images can achieve acceptable results, PPOCRLabel can be used to complete the KIE labeling process, and the VI-LayoutXLM model proposed in PP-StructureV2 is recommended. For more tutorials, please refer to: VI-LayoutXLM Algorithm Introduction and KIE Key Information Extraction Tutorial.

3. References

[1] Katti A R, Reisswig C, Guder C, et al. Chargrid: Towards understanding 2d documents[J]. arXiv preprint arXiv:1809.08799, 2018.

[2] Xu Y, Li M, Cui L, et al. LayoutLM: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200.

[3] Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020.

[4] Xu Y, Lv T, Cui L, et al. LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding[J]. arXiv preprint arXiv:2104.08836, 2021.

[5] Li Y, Qian Y, Yu Y, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers[C]//Proceedings of the 29th ACM International Conference on Multimedia. 2021: 1912-1920.

[6] Liu X, Gao F, Zhang Q, et al. Graph convolution for multimodal information extraction from visually rich documents[J]. arXiv preprint arXiv:1903.11279, 2019.

[7] Sun H, Kuang Z, Yue X, et al. Spatial Dual-Modality Graph Reasoning for Key Information Extraction[J]. arXiv preprint arXiv:2103.14470, 2021.

[8] Zhang P, Xu Y, Cheng Z, et al. Trie: End-to-end text reading and information extraction for document understanding[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1413-1422.

reference link

https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.7

