The 12th China Intelligent Industry Summit Forum (2023): Future Prospects of Document Large Models

Preface

At the 12th China Intelligent Industry Summit Forum in 2023, Dr. Ding Kai, deputy general manager and senior engineer at Hehe Information, delivered a keynote sharing the latest research results on document large models and an outlook on the field's future.

Hehe Information is a leading artificial intelligence and big data technology company known for its innovative intelligent text recognition and commercial big data solutions. This article will introduce the content of Dr. Ding Kai’s speech at the conference, covering document image analysis, recognition, and the application and challenges of large models in this field.

Let's take a deeper look at the future prospects of document large models, and at Hehe Information's contributions to advancing the intelligent industry.

Technical challenges in document image analysis, recognition and understanding

  • Scene and style diversity: Documents vary widely in shape, style, and lighting conditions, which increases the complexity of image analysis, since every document may present different characteristics.
  • Capture device uncertainty: Documents may be captured by a variety of devices, including cameras, scanners, industrial robots, and intelligent robots, so algorithms and processing must adapt to different input sources.
  • Diversity of user needs: Different users have different requirements for document image recognition. In finance, for example, high-precision bill recognition is needed, while education, archives management, and office scenarios place more emphasis on understandability and structured document processing.
  • Document image quality degradation: Document images can degrade for many reasons, including noise, blur, and distortion; handling these problems requires powerful image preprocessing techniques.
  • Text detection and layout analysis: Detecting text in documents and analyzing layout structure are complex tasks that involve visual object detection and analysis.
  • Unconstrained text recognition: Under unconstrained conditions, such as handwritten text or irregularly typeset documents, recognition accuracy drops, which calls for more flexible models and algorithms.
  • Structured intelligent understanding: Understanding the structure and content of documents requires highly intelligent processing, including semantic understanding and information extraction.


Research topics on document image analysis, recognition and understanding

In order to solve the above technical problems, researchers have carried out extensive research in the field of document image analysis, recognition and understanding, mainly including the following topics:

Document image analysis and preprocessing

  • Edge trimming and enhancement
  • Moiré removal
  • Curvature correction (dewarping)
  • Image compression
  • PS (image tampering) detection
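
To make the preprocessing stage concrete, here is a minimal sketch, not from the talk, of a denoise-binarize-deskew pipeline built on OpenCV; the parameter values and the skew-estimation heuristic are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess_document(path: str) -> np.ndarray:
    """Illustrative cleanup: denoise, binarize, and deskew a scanned page."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    denoised = cv2.fastNlMeansDenoising(gray, h=10)  # suppress sensor noise
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, blockSize=31, C=15,
    )
    # Estimate global skew from the min-area rectangle around ink pixels.
    ink = np.column_stack(np.where(binary == 0)).astype(np.float32)
    angle = cv2.minAreaRect(ink)[-1]
    if angle > 45:  # OpenCV's angle convention varies by version; fold to (-45, 45]
        angle -= 90
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC, borderValue=255)
```

Real systems replace each step with far stronger learned models (e.g., for moiré removal and dewarping), but the stage boundaries are the same.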

Document parsing and recognition

  • Text recognition
  • Table recognition
  • Electronic document parsing

Layout analysis and restoration

  • Element detection
  • Element recognition
  • Layout restoration

Document information extraction and understanding

  • Information extraction
  • Question answering

AI safety

  • Tampering classification
  • Tampering detection
  • Synthesis detection
  • AI-generated content detection

Knowledge storage, retrieval, and management

  • Entity relationships
  • Document topics
  • ERP/OA
  • SAP


Application of multi-modal large models in document image processing

  • GPT-4: Multimodal large models such as GPT-4 have made significant progress in processing text and image data simultaneously, improving the performance of document image recognition and understanding and making it easier to handle many types of information, including text, images, and other media.
  • Google Bard: Google Bard is another multi-modal large model that also performs well on document images. This competition drives technological progress in the field and is expected to spur further innovation.
  • Document image large models: A series of large models specialized for document image processing has emerged, such as the LayoutLM series, LiLT (from INTSIG), UDOP, and Donut. These models use multi-modal Transformer encoders and can be applied to different document image processing tasks, covering text, tables, layout structure, and multi-language support.
  • Limitations of multimodal large models: Although multimodal large models perform well on text and images in general, they still have limitations, most notably poor performance on fine-grained text. This presents both challenges and opportunities for future research to further improve these models.

Performance of multi-modal GPT-4 on document images

Multimodal large-scale language models such as GPT-4 have made significant progress in document image analysis. They can process text and image data simultaneously, improving the performance of document image recognition and understanding.


Multimodal Google Bard performance on document images

Google Bard is another multi-modal large-scale language model that performs well in the field of document images.


Progress in document image large models

Proprietary large models for document images
LayoutLM series

The LayoutLM series is a group of models that have achieved great success in the field of document image processing. Their design ideas and technical applications are worthy of in-depth discussion. The following is a more detailed introduction to the LayoutLM series:

1. Multi-modal Transformer encoder as the foundation: The models of the LayoutLM series are all built on a multi-modal Transformer encoder. This core component combines the Transformer architecture with multi-modal processing capabilities, allowing the model to process text and image data simultaneously. The Transformer architecture has achieved outstanding success in natural language processing, and extending it to document images provides a powerful tool for modeling the relationship between text and images.
2. Pre-training and downstream fine-tuning: The LayoutLM models adopt a pre-train-then-fine-tune strategy. In the pre-training stage, the model is trained on large-scale document image data and learns representations of text and images and the connections between them, giving it general document image understanding capabilities. In the downstream fine-tuning stage, the model is further trained on specific tasks, such as text recognition, table detection, and layout analysis, to improve task performance.
3. Application to multi-modal tasks: LayoutLM models perform well on multi-modal tasks. They can not only recognize text content but also understand image information in documents. This multi-modal capability is an advantage when processing documents that mix text, charts, and pictures, such as annual reports, research reports, or financial documents.
4. Evolution across versions: The LayoutLM series includes multiple versions, such as LayoutLM, LayoutLMv2, LayoutLMv3, and LayoutXLM. These versions have evolved in their core architecture to suit different application scenarios and task requirements; for example, LayoutLMv3 may offer higher performance and efficiency in some respects, while LayoutXLM has advantages in multi-language support. This lets the LayoutLM family serve a wide variety of needs (a brief usage sketch follows this list).
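
As a concrete illustration, not something shown in the talk, here is a minimal sketch of running a LayoutLMv3 checkpoint for token classification with the Hugging Face `transformers` library; the image file and label count are placeholders.

```python
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image

# The processor runs OCR on the page image by default (apply_ocr=True, needs
# pytesseract) and packs words, normalized boxes, and pixels into one encoding.
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7  # hypothetical field-tag label set
)

image = Image.open("invoice.png").convert("RGB")  # placeholder document image
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
pred_label_ids = outputs.logits.argmax(-1)  # one predicted label per token
```

In practice a fine-tuned checkpoint would be loaded; the base model's classification head above is randomly initialized and shown only for the pipeline's structure.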

LiLT

1. Decoupled joint modeling of vision and language: LiLT adopts an innovative approach that models the visual and language streams separately and integrates them through joint modeling. This decoupled design lets the model better process textual and visual information in document images, improving recognition and understanding accuracy.
2. Bidirectional Complementary Attention Module (BiCAM): To better integrate the visual and language models, LiLT introduces the Bidirectional Complementary Attention Module (BiCAM). This module enables two-way information transfer and interaction between vision and language, better capturing the correlations between different elements in the document image.
3. Excellent multilingual few-shot/zero-shot performance: LiLT performs strongly in multilingual few-shot and zero-shot scenarios. Even with limited data, the model can still carry out document information extraction effectively, demonstrating robustness in multilingual and data-scarce settings.

UDOP

UDOP, as an important innovation in the field of document image processing, represents the emerging trend of unified models for document processing. This model is designed to make the document processing process more efficient and integrated to cope with the diverse needs in different fields and applications. The following are the main features of UDOP:

1. A "grand unified model" of document processing: UDOP is billed as the "grand unified model" of document processing, meaning it is designed as a general tool that can handle a variety of document image processing tasks, including text recognition, layout analysis, image processing, and more. This unified design simplifies document processing workflows, making them more efficient and flexible.
2. Unified Vision-Text-Layout encoder: UDOP uses a unified encoder to integrate visual information, text content, and layout structure. This encoder can process different types of input simultaneously, including text images, tables, and pictures, achieving comprehensive processing of multi-modal information.
3. Separate Text-Layout and Vision decoders: To better understand and process document images, UDOP uses separate decoders for text-layout and visual information. This separated architecture lets the model better capture the correlations between different elements, improving the accuracy and efficiency of document processing.
4. Multi-task support: UDOP is designed to support multiple tasks, including text recognition, table detection, and layout restoration. This lets it adapt to different fields and industries, from bill processing in finance to medical record management in healthcare.
5. Multilingual capability: UDOP can also process multilingual documents, which is important for international enterprises and cross-border collaboration. It handles documents in different languages with ease, serving users around the world.


Donut

Donut, as a Transformer model for document understanding, marks a revolutionary breakthrough in the field of document image processing. The way this model is designed and applied brings new possibilities to document understanding. The following is a more detailed introduction to Donut:

1. Document understanding without OCR: One of Donut's most notable features is that it needs no traditional OCR (Optical Character Recognition) step to process document images. Traditional OCR pipelines can be limited by image quality, fonts, and layout, whereas Donut understands the content and structure of a document directly through its Transformer model, without first converting the image into machine-readable text. This makes document understanding more efficient and accurate.
2. Application of the Transformer model: Donut adopts a Transformer as its core architecture. Transformers have achieved great success in natural language processing, and applying them end to end to document understanding is a newer direction. The model uses self-attention and multi-head attention to capture the correlations between different elements in the document, including text, images, and layout structure.
3. Multi-modal processing: Donut not only processes text content but can also understand image information in documents. This multimodal capability makes it excel on documents that contain multiple media elements, such as reports mixing text, charts, and pictures.
4. Document structure understanding: Donut attends not only to text content but also to document structure. This includes identifying different types of document elements, such as headings, paragraphs, lists, and tables, and understanding the hierarchical relationships between them, which helps dig deeper into a document's information.
5. Application fields: Donut can be used for a wide range of tasks, such as automated document processing, information extraction, and knowledge management. It can extract key information from documents, identify themes, analyze trends, and provide decision support for enterprises and research institutions.
6. Future potential: Donut represents a future trend in document image processing. Its OCR-free, multi-modal design brings new ideas to document understanding, and more innovative applications built on it can be expected (a brief usage sketch follows this list).
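
For illustration, not part of the talk, here is a minimal sketch of document visual question answering with the publicly released Donut checkpoint via Hugging Face `transformers`; the image path and question are placeholders.

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

# DocVQA-finetuned Donut checkpoint released by the model's authors.
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

image = Image.open("receipt.png").convert("RGB")  # placeholder document image
question = "What is the total amount?"
# The task is expressed as a prompt inside the target sequence -- no OCR pass.
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

generated = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
print(processor.token2json(processor.batch_decode(generated)[0]))
```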


Multimodal large models
BLIP2

BLIP-2 (Bootstrapping Language-Image Pre-training 2) adopts an innovative method that combines image encoding and language decoding to achieve efficient pre-training and representation learning on multi-modal data. Here is a more detailed introduction to BLIP-2:

1. Q-Former connects pre-trained components: BLIP-2 uses a Q-Former to connect a pre-trained image encoder (such as ViT, Vision Transformer) with an LLM (large language model) decoder (such as OPT or FlanT5). The Q-Former plays the key role, allowing the model to draw on information from images and text simultaneously; this connection takes full advantage of the Transformer architecture to integrate visual and language information effectively.
2. Only the Q-Former needs training: A notable feature is that BLIP-2 trains only the Q-Former, while the pre-trained image encoder and LLM stay frozen. The Q-Former carries the core task of fusing image and text information into rich multi-modal representations, so this strategy reduces the computational cost of training and improves training efficiency.
3. Multi-modal representation learning: The core goal of BLIP-2 is to learn multi-modal representations, meaning the model can understand images and text together and establish meaningful associations between the two. This matters for multi-modal tasks such as image captioning, text-to-image generation, and document image understanding. Through pre-training on large-scale multi-modal data, BLIP-2 learns general-purpose representations that serve as a strong foundation for many tasks.
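
A minimal sketch, not from the talk, of querying a public BLIP-2 checkpoint through Hugging Face `transformers`; the image path and question are placeholders, and a CUDA GPU is assumed.

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

# ViT image encoder + Q-Former + frozen OPT-2.7B decoder in one checkpoint.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("chart.png").convert("RGB")  # placeholder image
inputs = processor(
    images=image, text="Question: What does this chart show? Answer:",
    return_tensors="pt",
).to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```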


Flamingo

Flamingo is a model that has attracted wide attention for its innovative design for multi-modal information processing. Here is a more detailed introduction to Flamingo:

1. Gated attention layers: A notable feature of Flamingo is the introduction of gated attention layers into the LLM (large language model). These layers inject visual information and integrate it into the text processing stream. With gated attention, the model can selectively attend to text and image information, understanding multi-modal data better.
2. Multi-modal data understanding: One of Flamingo's design goals is to let the model effectively grasp the relationship between text and images. With gated attention, the model can adjust its focus to the task at hand; in image captioning, for example, it can condition the generated description on the image content, producing more accurate captions.
3. Enhanced task performance: With the gated attention layers, Flamingo performs well on multi-modal tasks. It not only handles image-text associations better but also improves performance on a range of tasks, including image captioning, visual question answering, and document image understanding, making it a powerful tool for multimodal data (a small sketch of the gating idea follows).
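
The gating idea can be sketched in a few lines of PyTorch. This is a simplified illustration of a tanh-gated cross-attention block in the spirit of Flamingo, not the full architecture (the real model also gates a feed-forward block and resamples visual features with a Perceiver module).

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text tokens attend to visual features; the result is scaled by
    tanh(alpha) with alpha initialized to 0, so the frozen LLM behaves
    unchanged at the start of training and visual grounding is learned
    gradually as the gate opens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return text + torch.tanh(self.alpha) * attended  # gated residual

# text: (batch, seq_len, dim) LLM hidden states
# visual: (batch, num_visual_tokens, dim) vision-encoder features
```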

LLaVA
  • Connects CLIP ViT-L and LLaMA through a fully connected (projection) layer
  • Uses GPT-4 and Self-Instruct to generate 158K high-quality instruction-following samples
MiniGPT-4
  • The visual part adopts ViT+Q-Former
  • The language model part uses Vicuna
  • The visual and language modules are connected using a fully connected layer.
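
Both LLaVA and MiniGPT-4 hinge on the same connector idea: a learned projection from vision-encoder feature space into the LLM's token embedding space, so image patches can be consumed as "soft tokens". A minimal sketch, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Linear map from vision features to LLM embedding space. Dimensions
    are illustrative (e.g., CLIP ViT-L features into a 7B-scale LLM)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_features)

visual_tokens = VisionToLLMProjector()(torch.randn(1, 256, 1024))
# These projected tokens are concatenated with text embeddings as LLM input.
```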
Limitations of using large multi-modal models in the field of OCR

Although multimodal large models perform well on salient text, they still have limitations: constrained by the input resolution of the visual encoder and by their training data, they perform poorly on fine-grained text.


Are document images more text-based or image-based?

In document image analysis there is a key question: should document images be treated primarily as text or as images? The answer shapes how the various elements in a document image are recognized and understood.


Pix2Seq large model series

Pix2Seq

Pix2Seq treats the object detection task as an image-to-sequence language modeling task: bounding boxes and class labels are quantized into discrete tokens and predicted autoregressively.
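
A minimal sketch of the sequence construction this implies; the bin count and vocabulary layout below are illustrative assumptions, not the paper's exact recipe.

```python
# Coordinates are quantized into discrete bins; class labels get their own
# token range after the coordinate bins.
NUM_BINS = 1000

def box_to_tokens(box, label_id, img_w, img_h):
    """Turn one (xmin, ymin, xmax, ymax) box into 5 discrete tokens."""
    xmin, ymin, xmax, ymax = box
    quantize = lambda v, size: int(v / size * (NUM_BINS - 1))
    return [
        quantize(xmin, img_w), quantize(ymin, img_h),
        quantize(xmax, img_w), quantize(ymax, img_h),
        NUM_BINS + label_id,
    ]

# Two objects flatten into one target sequence for the autoregressive decoder.
sequence = (box_to_tokens((48, 80, 320, 140), 3, 640, 480)
            + box_to_tokens((64, 200, 280, 260), 7, 640, 480))
```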


UniTAB

Multi-modal encoder (image & text) + autoregressive decoder completes various Vision-Language (VL) tasks.

Nougat

Maps a document image directly to a markup sequence (e.g., Markdown-style text with formulas) using a Swin Transformer encoder and a Transformer decoder.
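
As an illustration, not from the talk, the released Nougat checkpoint can be driven through Hugging Face `transformers` roughly as follows; the page image is a placeholder.

```python
from transformers import NougatProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

page = Image.open("paper_page.png").convert("RGB")  # placeholder page image
pixel_values = processor(images=page, return_tensors="pt").pixel_values

outputs = model.generate(pixel_values, max_new_tokens=1024)
markup = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(processor.post_process_generation(markup, fix_markdown=True))
```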


Document Image Large Model Exploration

Document image large model design ideas

The design of the document image large model rests on several key points, which play an important role in advancing document image recognition and understanding:

  • The tasks of document image recognition and analysis are defined as a form of sequence prediction, covering text, paragraphs, layout, tables, formulas, and other elements.
  • Different prompts guide the model to perform different OCR (Optical Character Recognition) tasks, improving the model's versatility and applicability (see the sketch after this list).
  • Chapter-level document image recognition and analysis is supported, with output in standard formats such as Markdown, HTML, or plain text, so the model performs well on complex documents.
  • Tasks related to document understanding are delegated to an LLM (large language model), a division of labor that improves efficiency and accuracy on structured documents.
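
A schematic sketch of the prompt-dispatch idea; the prompt strings and the `generate` interface below are hypothetical illustrations, not an actual API.

```python
# One image-to-sequence model, steered to different OCR tasks by a task
# prompt prepended to the decoder input (prompt strings are made up here).
TASK_PROMPTS = {
    "spotting": "<task:text_spotting>",    # points + transcriptions
    "table":    "<task:table_structure>",  # table structure tokens
    "formula":  "<task:formula>",          # LaTeX for handwritten math
    "layout":   "<task:layout>",           # layout elements in reading order
}

def run_ocr(model, image, task: str) -> str:
    """Prepend the task prompt and decode one output sequence."""
    return model.generate(image, prompt=TASK_PROMPTS[task])  # pseudo-interface
```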


SPTS document image large model

SPTS (Single-Point Text Spotting): SPTS is an important document image processing model that defines end-to-end text detection and recognition as an image-to-sequence prediction task. It indicates the location of each text instance with a single annotated point, reducing annotation costs and eliminating complex post-processing steps. This method offers a more efficient route for document image processing and can be applied to end-to-end scene text spotting, table structure recognition, and handwritten mathematical formula recognition.

SPTS
  • Defines end-to-end detection and recognition as an image-to-sequence prediction task
  • Uses single-point annotation to indicate text position, greatly reducing annotation costs
  • Needs no RoI sampling or complex post-processing, truly unifying detection and recognition (see the sketch below)
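
A minimal sketch of what a single-point target sequence might look like; the bin count and vocabulary layout are illustrative assumptions.

```python
# Each text instance becomes one quantized point followed by its transcription,
# so no box/polygon annotation or RoI sampling is required.
NUM_BINS = 1000

def instance_to_sequence(x, y, text, img_w, img_h, char_vocab):
    """Quantize the point, then append one token per character."""
    xs = int(x / img_w * (NUM_BINS - 1))
    ys = int(y / img_h * (NUM_BINS - 1))
    return [xs, ys] + [NUM_BINS + char_vocab[c] for c in text]

char_vocab = {c: i for i, c in enumerate("0123456789abcdefghijklmnopqrstuvwxyz")}
target = instance_to_sequence(412, 250, "total", 640, 480, char_vocab)
```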


OCR unified model based on SPTS (SPTS v3)
  • Define multiple OCR tasks as a form of sequence prediction
  • Guide the model to complete different OCR tasks through different prompts
  • The model follows the image-to-sequence structure of SPTS’s CNN+Transformer Encoder+Transformer Decoder.


SPTSv3 task definition
  • SPTSv3 defines a variety of OCR tasks as a form of sequence prediction, including end-to-end detection recognition, table structure recognition, and handwritten mathematical formula recognition. This model uses different prompts to guide the model to complete different OCR tasks, making it more flexible and versatile.


Experimental results show that SPTSv3 achieves excellent performance on various OCR tasks, showing its potential in document image processing. This provides an efficient solution for multi-tasking of document images and is expected to be used in a wide range of applications, including automated document processing, document search, and content extraction.

Training platform: 10× NVIDIA A100 GPUs


SPTSv3 was demonstrated on three tasks, with experimental results presented for each:

  • End-to-end detection and recognition of scene text
  • Table structure recognition
  • Handwritten mathematical formula recognition

Summary

At the 12th China Intelligent Industry Summit Forum in 2023, Dr. Ding Kai's speech took us deep into cutting-edge research on document large models. He shared the latest research results, introduced Hehe Information, and surveyed the challenges in document image analysis, recognition, and understanding, along with current technical hurdles and future research directions aimed at more flexible document image processing. The speech opens up new possibilities for document image processing, gives us full confidence in the development of the intelligent industry, and leaves us looking forward to more innovations and breakthroughs.
