Innovation in OCR structuring: StrucTexT, Baidu's Chinese-English pre-trained model for OCR structured information extraction

Optical character recognition (OCR) is one of the most widely used visual AI technologies today. As OCR matures in industrial applications, real-world scenarios place new demands on it: moving from perception to cognition, OCR must not only recognize text but also understand it. Structuring has therefore become one of the core technologies for industrial OCR applications. Its goal is to quickly and accurately parse the structured text information in visually rich data such as cards, bills, and document images, and to extract the key content. OCR structuring typically has to solve two high-frequency tasks:

Entity classification: extract the text content corresponding to predefined entity labels (such as "name" and "date") from the OCR results;

Entity linking: analyze the relationships between text entities, for example whether two entities form a key-value pair, or whether they belong to the same row or column of a table (both outputs are illustrated in the sketch below).
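To make the two tasks concrete, here is a small, purely illustrative Python sketch of what structured output could look like for a simple document. The field list, labels, and links are invented for illustration and are not StrucTexT's actual interface.

```python
# Illustrative only: toy OCR output plus the two kinds of structuring results.

# OCR output: detected text fields with bounding boxes (x1, y1, x2, y2).
ocr_fields = [
    {"id": 0, "text": "Name",       "box": (40, 30, 120, 60)},
    {"id": 1, "text": "Zhang San",  "box": (140, 30, 260, 60)},
    {"id": 2, "text": "Date",       "box": (40, 80, 120, 110)},
    {"id": 3, "text": "2022-01-10", "box": (140, 80, 280, 110)},
]

# Task 1 - entity classification: assign each field a predefined label.
entity_labels = {0: "key", 1: "name", 2: "key", 3: "date"}

# Task 2 - entity linking: predict which pairs of fields form key-value relations.
entity_links = [(0, 1), (2, 3)]

for key_id, value_id in entity_links:
    print(f"{ocr_fields[key_id]['text']}: {ocr_fields[value_id]['text']}")
```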

The application requirements for OCR structuring are widespread. In business activities, processing documents such as reports, contracts, forms, and invoices has become an important part of daily office work. OCR structuring technology can help enterprises automatically understand and recognize the key information in documents and bills, reducing labor costs and improving operational efficiency.

Baidu proposed the OCR structuring model StrucTexT, which for the first time incorporates Chinese and English field-level multimodal features into OCR structured pre-training for feature enhancement, achieving the industry's best results on six OCR structuring benchmarks. At the same time, Baidu has built a digital medical claims solution based on StrucTexT, helping enterprises move toward paperless offices and digital transformation.

The industry's first Chinese-English field-level multimodal feature-enhanced OCR structuring model: StrucTexT

Existing OCR structuring schemes can be divided into text-based, image-based, and multimodal information extraction methods:

1. Text-based information extraction: based on natural language processing, extract the text sequence from the image and use named entity recognition to tag semantic entities in the text;

2. Image-based information extraction: based on computer vision tasks such as detection and segmentation, locate the image regions of text entities;

3. Multimodal information extraction: visually rich text images such as archives, bills, and cards carry multiple attributes at once: text, image (texture, color, font, etc.), and layout (spatial position). Methods of this type model these multimodal clues jointly and show better results (a minimal fusion sketch follows this list).
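As a rough illustration of how multimodal methods combine cues, the following PaddlePaddle sketch fuses a text field's token embeddings, its bounding-box layout, and a visual ROI feature into one representation. The layer sizes, mean pooling, and single linear fusion are simplifying assumptions for illustration and do not reproduce StrucTexT's actual architecture.

```python
import paddle
import paddle.nn as nn

class MultimodalFusion(nn.Layer):
    """Toy fusion of text, layout, and visual cues for one text field."""

    def __init__(self, vocab_size=30000, hidden=128):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, hidden)  # text modality
        self.layout_proj = nn.Linear(4, hidden)           # bounding box (x1, y1, x2, y2)
        self.visual_proj = nn.Linear(256, hidden)         # e.g. an ROI feature from a CNN
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, token_ids, boxes, roi_feats):
        text = self.text_emb(token_ids).mean(axis=1)      # pool the field's tokens
        layout = self.layout_proj(boxes)
        visual = self.visual_proj(roi_feats)
        return self.fuse(paddle.concat([text, layout, visual], axis=-1))

# Dummy inputs: a batch of 2 fields, 6 tokens each.
model = MultimodalFusion()
token_ids = paddle.randint(0, 30000, shape=[2, 6])
boxes = paddle.rand([2, 4])
roi_feats = paddle.rand([2, 256])
print(model(token_ids, boxes, roi_feats).shape)  # [2, 128]
```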

In the past two years, multimodal pre-training has brought significant gains in the performance and generalization of OCR structuring models. However, existing pre-trained models are mainly built at character granularity (single characters in Chinese, words in English), ignoring the visual text-line structure of text on images, which makes it difficult to represent document semantics and visual information efficiently.

To solve this problem, Baidu OCR proposes StrucTexT, a multimodal pre-training model that operates at both the character level and the field level:

1. The first field-level multimodal feature enhancement: introduces field-level document structure modeling and, combined with the text sequence, proposes three pre-training tasks (masked visual language model, field length prediction, and field orientation prediction) to understand visually rich documents more effectively.

2. Comprehensively leading results in both Chinese and English scenarios: covers more than 40,000 common Chinese and English words and uses the industry's largest pre-training corpus of 50 million Chinese and English OCR scene samples, deeply mining the semantic relationships between modalities.

3. Complete OCR field analysis capability: with a dual-granularity output framework and flexible choice of modeling granularity, it supports three structured information extraction tasks: character-level information extraction, field-level information extraction, and field linking prediction.

4. A single model supports multiple downstream tasks: it handles mixed Chinese-English OCR scenarios, and one model can serve multiple downstream tasks in parallel.

StrucTexT is a multimodal information extraction model based on a dual-granularity representation. In addition to modeling text at character granularity, StrucTexT organizes the document's visual cues by fields and builds a matching relationship between characters and fields to align image and text features. For multimodal representation, StrucTexT constructs features from text, image, and layout, and proposes three self-supervised pre-training tasks (masked visual language model, field length prediction, and field orientation prediction) to promote cross-modal feature interaction, helping the model learn the associations between modalities and enhancing its overall understanding of documents. In addition, StrucTexT supports bilingual Chinese-English encoding. With the dual-granularity representation, the model can perform information extraction at either character or field granularity, allowing flexible model selection and adaptation to different scenarios; a simplified sketch of the dual-granularity idea follows.
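The following is a minimal sketch of the dual-granularity idea: character/token features from the encoder are pooled into field features according to which field each token belongs to, so downstream heads (entity classification, linking, or the pre-training tasks) can operate at either granularity. The mean pooling and shapes here are simplifying assumptions, not StrucTexT's exact implementation.

```python
import paddle

def field_features_from_tokens(token_feats, token_to_field, num_fields):
    """Pool character/token features into field-level features by mean pooling.

    token_feats:    [num_tokens, hidden] encoder outputs at character granularity
    token_to_field: [num_tokens] index of the field each token belongs to
    """
    pooled = []
    for f in range(num_fields):
        idx = paddle.nonzero(token_to_field == f).flatten()
        pooled.append(paddle.index_select(token_feats, idx, axis=0).mean(axis=0))
    return paddle.stack(pooled)  # [num_fields, hidden]

# Toy example: 5 tokens grouped into 2 fields.
token_feats = paddle.rand([5, 8])
token_to_field = paddle.to_tensor([0, 0, 1, 1, 1])
print(field_features_from_tokens(token_feats, token_to_field, num_fields=2).shape)  # [2, 8]
```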

Multi-granularity modeling + multimodal features = StrucTexT's across-the-board leading results

Based on multi-granularity modeling and multimodal feature enhancement, StrucTexT has achieved industry-leading results across three OCR structuring tasks, on six leaderboards spanning four datasets.

1. Character-level information extraction: StrucTexT classifies at character granularity on top of the pre-trained model and achieves an excellent score of 99.30% on the Chinese examination-paper dataset EPHOIE.

2. Field-level entity classification: StrucTexT classifies entities using field features and achieves SOTA on three datasets: the receipt information extraction set SROIE, the English form dataset FUNSD, and the Chinese form dataset XFUND-CHN. It is worth mentioning that the latter two tasks use the same fine-tuned model, unifying Chinese and English application scenarios.

On SROIE, StrucTexT's field-level F1 score is 98.70%, ranking first on the leaderboard.
On the FUNSD and XFUND datasets, StrucTexT classifies the four predefined entity categories; the F1 scores of the large model on the two datasets reach 87.56% and 92.29%, respectively.
3. Entity linking prediction: judging whether a linking relationship exists between semantic entities. StrucTexT leads by a large margin of more than 8 percentage points on the FUNSD and XFUND datasets, setting a new SOTA.
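For reference, entity-level F1 on these benchmarks is generally computed from the sets of predicted and ground-truth labeled entities, roughly as in the simplified sketch below; the official evaluation scripts of each dataset differ in matching and normalization details and remain authoritative.

```python
def entity_f1(predicted, ground_truth):
    """Simplified entity-level F1: entities compared as exact (text, label) pairs."""
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = {("Zhang San", "name"), ("2022-01-10", "date")}
pred = {("Zhang San", "name"), ("2022-01-11", "date")}
print(round(entity_f1(pred, gt), 2))  # 0.5
```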

With strong support from StrucTexT, complex bill recognition in medical claims scenarios has also been cracked

Medical insurance claims are an important application scenario for OCR structured information extraction. The compound annual growth rate of China's commercial health insurance has exceeded 28% over the past decade, and health insurance premium income exceeded 800 billion yuan in 2020. The China Banking and Insurance Regulatory Commission has proposed that the commercial health insurance market should exceed 2 trillion yuan by 2025. With the rapid development of the health insurance business, insurance companies have to handle a growing number of claims cases.

Traditional insurance companies rely on manual underwriting: claims staff manually enter the information on each bill, ranging from a dozen to dozens of items. The claims entry and review teams grow year by year, and the heavy cost puts enormous pressure on company operations. To improve business efficiency and reduce operating costs, using artificial intelligence to achieve intelligent claims settlement has become the best way for insurance companies to improve the claims process.

To realize automated claims settlement, accurate recognition of medical image information is key. However, medical image recognition scenarios are complex, and accurate OCR structured information extraction has long been a problem for the industry:

1. Many types of bills: common itemized lists, invoices, and inspection reports alone number in the hundreds.

2. Varied bill formats: hospitals in different provinces and cities produce different layouts, and the forms are complicated. For convenience, medical institutions often do not print according to specifications, so the content layout is highly irregular, with strong interference such as occlusion, offset, and overlapping characters.

3. Irregular image capture: health insurance is a consumer-facing service, and users' photo-taking habits are not standardized. Documents may be damaged, bent, or deformed, and the quality of uploaded images is often low.

4. Complicated typesetting: medical receipts mix multiple types of text, including Chinese, English, numbers, and special symbols, making text recognition difficult.
In response to these problems, Baidu has worked with large insurance companies to build a digital medical claims solution, based on its industry-leading OCR recognition capabilities and StrucTexT's OCR structuring capabilities. Building on StrucTexT's general-format OCR structuring for a variety of complex medical images, and combining it with industry business terminology, the solution extracts structured information from medical images and outputs standardized professional terms to the upper layer, making the underwriting system intelligent. At present, the medical claims solution equipped with medical-image OCR structuring has been applied in the actual claims and underwriting business of many customers; for one top customer in the insurance industry, data collection efficiency has increased fourfold. A sketch of such a pipeline is given below, followed by examples of the document types covered.
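As a rough outline of such a pipeline, the sketch below strings the stages together; every function name and data value is a hypothetical placeholder, not Baidu's actual product API.

```python
# Hypothetical medical-claims pipeline sketch; all names and values are placeholders.

def ocr_recognize(image_path):
    # Placeholder OCR step: would return detected text fields with bounding boxes.
    return [{"text": "Outpatient invoice", "box": (10, 10, 200, 40)},
            {"text": "128.00", "box": (220, 10, 300, 40)}]

def structure_extract(ocr_fields):
    # Placeholder structuring step (where a StrucTexT-style model would run):
    # maps OCR fields to predefined entity labels.
    return {"document_type": ocr_fields[0]["text"], "amount": ocr_fields[1]["text"]}

def normalize_terms(entities, terminology):
    # Map free-form values onto standardized business terminology.
    return {k: terminology.get(v, v) for k, v in entities.items()}

def process_claim(image_path, terminology):
    ocr_fields = ocr_recognize(image_path)
    entities = structure_extract(ocr_fields)
    return normalize_terms(entities, terminology)

print(process_claim("invoice.jpg", {"Outpatient invoice": "OUTPATIENT_INVOICE"}))
```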

Examples of the document types covered include: medical invoices, inspection and diagnosis reports, expense settlement sheets, medical test reports, and discharge summaries, all processed with OCR structuring.

Conclusion

On September 22, 2020, China announced at the general debate of the 75th session of the United Nations General Assembly that it would scale up its nationally determined contributions, adopt more vigorous policies and measures, strive to peak carbon dioxide emissions before 2030, and achieve carbon neutrality before 2060. OCR structuring is a basic, core technology for digitizing information and making office work intelligent. Daily work involves large volumes of cards, bills, and rich document images that require OCR recognition and structured entry. StrucTexT, an OCR structuring model enhanced with Chinese and English field-level multimodal features, can digitize office workflows and the various documents and certificates used across all walks of life, laying a good foundation for the "dual carbon" goals.

At present, the StrucTexT model has been open-sourced on PaddlePaddle. For more technical details about StrucTexT, see the links below:

StrucTexT paper address:

https://arxiv.org/abs/2108.02923

StrucTexT open model:

https://github.com/PaddlePaddle
