Exploration and Reflections on Document Image Large Models at the 12th China Intelligent Industry Summit Forum (2023)

Preface

Recently, the 12th China Intelligent Industry Summit Forum 2023 (CIIS 2023) was successfully held in Nanchang, Jiangxi. The forum covered topics in AI large models, generative AI, unmanned systems, intelligent manufacturing, and digital security. What impressed me most was the special session on multimodal large models and intelligent document image understanding, presented by Mr. Ding Kai of Shanghai Hehe Information.


Hehe information

Before discussing the special session on multimodal large models and intelligent document image understanding, here is a brief introduction to Shanghai Hehe Information Technology Co., Ltd.

Shanghai Hehe Information Technology Co., Ltd. is an industry-leading artificial intelligence and big data company. It provides global enterprises and individuals with core technologies in intelligent text recognition and commercial big data, consumer-facing and business-facing products, and industry solutions, delivering innovative digital and intelligent services to users. You have probably heard of its products: the All-in-One Business Card King and the All-in-One Scanner.


Multimodal large models and intelligent understanding of document images

Multimodal large models are powerful neural network models that can process multiple types of data (such as images, text, and speech) simultaneously. They integrate inputs from multiple modalities and perform joint training and inference through a shared model structure.

Unlike traditional deep learning models, which usually model and process only one specific type of data, multimodal large models extend the model's capabilities so it can handle different types of data at the same time.

The core idea of multimodal large models is to fuse data from different modalities and let them interact, enabling more comprehensive and accurate task processing. For example, in image and document generation tasks, the model can accept both image and document inputs and generate output based on the association between the two. This joint training and generation approach yields richer and more diverse results.
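As a minimal, hypothetical sketch of this idea (toy feature vectors and a stub prediction head, not a real neural network), modality fusion can be illustrated as encoding each input separately, concatenating the features, and making a joint prediction:

```python
# Toy illustration of multimodal fusion: each modality is "encoded"
# into a feature vector, the vectors are fused by concatenation, and
# a shared head produces a joint output. Real models replace these
# stubs with deep networks; this only shows the data flow.

def encode_image(pixels):
    # stub image encoder: mean brightness and value spread
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]

def encode_text(tokens):
    # stub text encoder: token count and average token length
    return [len(tokens), sum(len(t) for t in tokens) / len(tokens)]

def fuse(image_feat, text_feat):
    # simplest fusion strategy: concatenation of modality features
    return image_feat + text_feat

def joint_head(features, weights):
    # shared prediction head over the fused representation
    return sum(f * w for f, w in zip(features, weights))

img_feat = encode_image([0.1, 0.5, 0.9])
txt_feat = encode_text(["invoice", "total", "42"])
fused = fuse(img_feat, txt_feat)
score = joint_head(fused, [0.25, 0.25, 0.25, 0.25])
print(len(fused), round(score, 3))
```

The point of the sketch is only the contract: two modalities in, one shared representation and one joint output.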

Technical difficulties in document image analysis, recognition and understanding

At the conference, according to Mr. Ding, the current technical difficulties in document image analysis, recognition and understanding mainly lie in the following areas:

  • Degraded document image quality makes images blurry; this problem is closely tied to document image scanning technology;
  • Text layouts can be very complex (as the example images shown in the talk illustrate), which poses huge challenges for layout analysis and text detection;
  • In text recognition, handwriting can be sloppy and the content to recognize is varied: besides text and formulas, there are also special symbols;

Based on these challenges, Hehe Information divides its research on document image analysis, recognition and understanding into the following six modules:

Document image analysis and preprocessing

This module mainly addresses document image quality problems. For example, a document image that the human eye can barely read can become very clear after processing with edge enhancement, moire removal, dewarping, image compression, PS (tampering) detection and other techniques, yielding high-quality images.
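One way to picture such a preprocessing module is as a pipeline of ordered stages. The sketch below is purely illustrative: the stage names follow the text, but each stage is a placeholder rather than a real image algorithm.

```python
# Toy document-image preprocessing pipeline. Each stage is a
# placeholder that records its name; a real system would transform
# pixel data (edge enhancement, moire removal, dewarping, ...).

def make_stage(name):
    def stage(image):
        # a real stage would modify pixels; here we just log the step
        image["applied"].append(name)
        return image
    return stage

PIPELINE = [
    make_stage("edge_enhancement"),
    make_stage("moire_removal"),
    make_stage("dewarping"),
    make_stage("compression"),
    make_stage("tamper_detection"),
]

def preprocess(image):
    for stage in PIPELINE:
        image = stage(image)
    return image

doc = preprocess({"pixels": [], "applied": []})
print(doc["applied"])
```

The design choice worth noting is ordering: quality restoration comes before recognition, so every later module sees a cleaned-up image.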

Document parsing and recognition

After analysis and preprocessing, the document image moves to the document parsing and recognition module, where text information is obtained through technologies such as text recognition, table recognition, and electronic document parsing.

Layout analysis and restoration

The text information obtained in the previous step is then processed with element detection, element classification, layout restoration and other techniques to identify titles, paragraphs, images and other document elements, and to restore the document's original layout structure for subsequent information extraction and understanding.
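A simplified sketch of the restoration step: detected elements (hypothetical bounding boxes and labels, not real detector output) are sorted into reading order and re-emitted as a Markdown skeleton.

```python
# Toy layout restoration: sort detected elements top-to-bottom,
# left-to-right, then render them back into a Markdown skeleton.
# Elements are (x, y, label, content) with y growing downward.

elements = [
    (50, 300, "paragraph", "Body text of the document..."),
    (50, 40, "title", "Annual Report"),
    (50, 150, "image", "figure1.png"),
]

def reading_order(elems):
    # primary key: vertical position; secondary key: horizontal
    return sorted(elems, key=lambda e: (e[1], e[0]))

def to_markdown(elems):
    lines = []
    for x, y, label, content in reading_order(elems):
        if label == "title":
            lines.append("# " + content)
        elif label == "image":
            lines.append("![](" + content + ")")
        else:
            lines.append(content)
    return "\n\n".join(lines)

md = to_markdown(elements)
print(md.splitlines()[0])
```

Real layout restoration must also handle multi-column pages and nested elements, but the output contract is the same: a structured document rather than a bag of boxes.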

Document information extraction and understanding

Computer techniques automatically extract useful information from documents and then understand, classify and summarize it. Document information extraction and understanding helps people manage and use large volumes of document data more effectively, improving work efficiency and decision quality. It has broad application prospects in digital archive management, enterprise knowledge management, search engines, automated customer service and other fields.
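As a rough illustration of the input/output contract (the field patterns and sample text below are made up for the example), key-value extraction from recognized text can be sketched with simple rules; production systems use learned models, but the shape is the same: raw text in, structured fields out.

```python
import re

# Toy information extraction: pull key fields from OCR text with
# regular expressions. The patterns and document are illustrative.

OCR_TEXT = """
Invoice No: INV-2023-0917
Date: 2023-09-21
Total: 1,280.00 CNY
"""

FIELD_PATTERNS = {
    "invoice_no": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total:\s*([\d,.]+)",
}

def extract_fields(text):
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            fields[name] = m.group(1)
    return fields

result = extract_fields(OCR_TEXT)
print(result["invoice_no"], result["date"])
```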

AI safety

In the process of document image analysis, recognition and understanding, users' data privacy and document image security are guaranteed through technologies such as tamper classification, tamper detection, synthesis detection, and AI generation detection.

Knowledge storage, retrieval and management

Effective organization, storage, retrieval and management of information and knowledge, extraction of useful knowledge from large amounts of data and information, and making it easy to access and utilize are of great significance for improving work efficiency, decision-making quality and innovation capabilities.


Analysis, recognition and understanding of document images and their relationship with large models

Mr. Ding believes that document image analysis, recognition and understanding on one hand, and large models on the other, are complementary.

For example, data and computing power are two key ingredients for large models. With the development of artificial intelligence and deep learning, training large models requires massive data and powerful computing resources, and some institutions have indeed predicted that the data available worldwide for large model training may eventually be exhausted.


The data volume of today's large models is already substantial, and many large model vendors have begun to look at electronic documents. As demand for large models and the importance of electronic documents grow, so will the demand for document image scanning and OCR technology. This may become a new data source and application area, providing more training data to support large models.

Progress in document image large models

LayoutLM

Discussions of document image large models generally cannot avoid Microsoft's LayoutLM series. Its working principle: run OCR on document images (or parse electronic documents directly), then feed the text content, its position information, and the corresponding image information together into a pre-trained model, which is then applied to downstream tasks.
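The input construction can be sketched roughly as follows. Everything here is a toy: LayoutLM uses learned embedding tables and a Transformer, while this sketch only shows the idea of summing a text embedding with a 2D position embedding derived from the token's bounding box.

```python
# Toy sketch of LayoutLM-style input construction: each token's
# representation is the sum of a text embedding and a 2D layout
# embedding from its bounding box. The "embeddings" are tiny
# hand-made vectors, not learned parameters.

def text_embedding(token):
    # stub: a 2-dim "embedding" from the first character and length
    return [ord(token[0]) % 10, len(token)]

def layout_embedding(box):
    # stub: normalized top-left corner of the bounding box
    x0, y0, x1, y1 = box
    return [x0 / 1000, y0 / 1000]

def input_vector(token, box):
    t = text_embedding(token)
    l = layout_embedding(box)
    # additive combination, mirroring LayoutLM's summed embeddings
    return [a + b for a, b in zip(t, l)]

vec = input_vector("Total", (100, 200, 180, 220))
print([round(v, 2) for v in vec])
```

The takeaway is that position is a first-class input: two identical words at different places on the page get different representations.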


UDOP

Microsoft released the unified document processing model UDOP in 2023, an end-to-end model. It uses a unified Vision-Text-Layout encoder to jointly encode text, visual, and layout information, and decodes with separate Text-Layout and Vision decoders.


Donut

NAVER released Donut, an OCR-free document image model, in 2022. It is a Transformer model for document understanding that requires no OCR step, processing images directly.


BLIP2

The multimodal model BLIP2 integrates the visual and language modalities very well. It encodes images with an Image Encoder, fuses the image and text modalities with a Q-Former, and then connects to a large language model.


Its characteristic is that it can not only understand images, but also make full use of the understanding capabilities of large language models.
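A much-simplified sketch of the Q-Former idea: a small set of query vectors attends over image features, producing a fixed number of fused vectors that can be handed to a language model. All numbers below are hand-made (the real module is a trained Transformer), so this only shows the attention mechanics.

```python
import math

# Toy Q-Former-style fusion: query vectors attend over image
# features via scaled dot-product attention. No training here;
# queries and features are fixed toy values.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    # scaled dot-product attention for a single query vector
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stub image encoder output
queries = [[2.0, 0.0], [0.0, 2.0]]                  # "learned" queries (fixed here)

fused = [attend(q, image_feats, image_feats) for q in queries]
print(len(fused), len(fused[0]))
```

The key property: however many image features come in, the output is always one vector per query, a fixed-size summary the language model can consume.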

Hehe Information, in cooperation with South China University of Technology, developed LiLT, a large model dedicated to document images. LiLT takes an innovative approach: it models the visual and language streams separately and integrates them through joint modeling. This decoupled design lets the model handle the textual and visual information in document images better, improving recognition and understanding accuracy.

To better integrate visual and language models, LiLT introduces the Bidirectional Complementary Attention Module (BiCAM). The function of this module is to enable the model to conduct two-way information transfer and interaction between vision and language, thereby better capturing the correlation between different elements in the document image.

LiLT performs excellently in multilingual few-shot and zero-shot scenarios. Even with limited data, the model can still carry out document image information extraction effectively, demonstrating its robustness with multiple languages and scarce data.

Exploration of large document image models

Document image large model design ideas

  • Define the various document image recognition and analysis tasks as a form of sequence prediction
    • Text, paragraphs, layout analysis, tables, formulas, etc.
  • Guide the model to complete different OCR tasks through different prompts
  • Support chapter-level document image recognition and analysis, outputting standard formats such as Markdown/HTML/Text
  • Leave document understanding work to the LLM
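The design ideas above can be sketched as a prompt-routed interface: one model entry point, where the prompt selects which task's output sequence is produced. The handlers and token names below are hypothetical stubs standing in for the real sequence decoder.

```python
# Toy sketch of prompt-guided OCR-as-sequence-prediction: the same
# image goes in, and the prompt decides which output sequence the
# "model" emits. Stub handlers replace the real decoder.

def detect_and_recognize(image):
    return ["<pt>", "120,80", "Hello"]          # point token + text

def table_structure(image):
    return ["<table>", "<tr>", "<td>", "cell",  # structure tokens
            "</td>", "</tr>", "</table>"]

TASKS = {
    "ocr": detect_and_recognize,
    "table": table_structure,
}

def run(image, prompt):
    # the prompt routes one image to different task-specific sequences
    return TASKS[prompt](image)

print(run("doc.png", "ocr")[0], run("doc.png", "table")[0])
```

Because every task's answer is just a token sequence, a single decoder can serve all of them; only the prompt changes.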


SPTS

The SPTS document image large model is designed mainly for scene text: end-to-end detection and recognition is formulated as an image-to-sequence prediction task, and a single-point annotation indicates each text's position, greatly reducing annotation cost. It needs no RoI sampling or complex post-processing, truly unifying detection and recognition.
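Conceptually, the target sequence for such a model interleaves a quantized point with the transcription. A toy version of the sequence construction (the bin count and special-token names here are assumptions for illustration, not the paper's exact values):

```python
# Toy construction of an SPTS-style target sequence: each text
# instance contributes one quantized (x, y) point followed by its
# characters, ending with a separator token.

BINS = 1000  # number of coordinate bins per axis (assumed)

def quantize(coord, size):
    # map a pixel coordinate into a discrete bin in [0, BINS-1]
    return min(BINS - 1, int(coord / size * BINS))

def build_sequence(instances, width, height):
    seq = []
    for (x, y), text in instances:
        seq.append(f"<x_{quantize(x, width)}>")
        seq.append(f"<y_{quantize(y, height)}>")
        seq.extend(list(text))
        seq.append("<sep>")
    seq.append("<eos>")
    return seq

instances = [((320, 240), "EXIT"), ((64, 480), "open")]
seq = build_sequence(instances, width=640, height=480)
print(seq[:3])
```

Turning coordinates into discrete tokens is what lets one autoregressive decoder handle position and text uniformly, and a single point per instance is far cheaper to annotate than a polygon.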


The V2 version addresses SPTS's slow inference by decoupling detection and recognition into autoregressive single-point detection and parallel text recognition: the Instance Assignment Decoder (IAD) autoregressively predicts each text's single-point coordinates from the visual encoder features, and the Parallel Recognition Decoder (PRD) obtains each text's recognition result in parallel from the IAD's single-point features.


After several rounds of iteration, the SPTS-based unified OCR model (SPTS v3) has extended the input from scene text to tables, formulas, chapter-level documents, and more. Multiple OCR tasks are defined as sequence prediction, with different prompts guiding the model to complete different tasks. The model keeps SPTS's image-to-sequence structure: CNN + Transformer encoder + Transformer decoder.


SPTS v3 task definition: Currently, it mainly focuses on tasks such as end-to-end detection and recognition, table structure recognition, and handwritten mathematical formula recognition.


Training platform: 10 × A100 GPUs


Experimental results


Outlook


The team hopes that in the future the input will no longer be a single formula image or table image, but a full document image containing text, formulas, tables, and pictures together. Different prompts will control exactly what is extracted, the model will output a token sequence, and a large model connected downstream will support practical applications across many different scenarios.

Hehe Information's research results are of great significance to the intelligent industry. They not only provide practical solutions for various industries, but also offer new ideas and directions for the industry's development. Through continued exploration and innovation, Hehe Information is expected to achieve more breakthroughs in intelligent image processing and related fields, advancing the application of artificial intelligence technology and the development of the intelligent industry.


Origin blog.csdn.net/Qingai521/article/details/133172087