[2023 CSIG Vertical Domain Large Model Session] In the era of large models, how can OCR be unified for IDP intelligent document processing?

From December 28 to 31, 2023, the 19th CSIG Young Scientists Conference, hosted by the China Society of Image and Graphics, was held in Guangzhou, China. The conference brought together experts and young scholars from academia and industry. Oriented toward the international academic frontier and national strategic needs, it focused on the latest cutting-edge technologies and hot topics, discussed frontier issues in the field of image and graphics, shared the latest research results and innovative ideas, and included a special session on large models in vertical domains. Dr. Ding Kai, Vice President of the Intelligent Technology Platform Division of Hehe Information and a senior engineer, delivered a keynote report titled "Thinking and Exploration of Document Image Large Models".

This article focuses on the following questions from the keynote, sharing the research problems and in-depth thinking on intelligent document image processing in the era of large models:

  • What inspiration can large models such as GPT-4V and Gemini bring to technical solutions and R&D paradigms in the IDP field?
  • Can we learn from the advantages of large models and propose a unified OCR model with good accuracy and strong generalization?
  • Can LLM be better combined with document recognition analysis engines to solve core problems in the IDP field?

1. Pixel-level OCR unified model: UPOCR

UPOCR is a pixel-level unified OCR model proposed in December 2023 by the Document Image Analysis, Recognition and Understanding Joint Laboratory of Hehe Information and South China University of Technology. UPOCR is built on a Vision Transformer (ViT)-based encoder-decoder architecture that unifies various OCR tasks into an image-to-image translation paradigm, and introduces learnable task prompts that push the general feature representations extracted by the encoder into task-specific spaces, making the decoder task-aware. Experiments show that a single model can handle different tasks and simultaneously perform pixel-level OCR tasks such as text erasure, text segmentation, and tampered text detection.

1.1. Why is UPOCR proposed?

The field of optical character recognition (OCR) currently faces several major problems that limit its wide application across domains.

  1. Fragmentation of task-specific models : Although many task-specific models have emerged in OCR research, each is optimized only for a particular task. The models are too fragmented and difficult to use collaboratively across tasks, which greatly limits cross-domain and multi-scene versatility.
  2. Lack of unified interfaces : Some existing general models rely on specific interfaces or decoding mechanisms such as VQGAN. This dependence limits the flexibility and adaptability of the model in pixel space and makes it difficult to bridge and implement different tasks.
  3. Pixel-level OCR challenges : Current models still face challenges in generating pixel-level text sequences. This is because text generation not only involves semantic understanding, but also needs to consider pixel-level details. Improving the model's ability to generate pixel-level text is still an important research direction.

1.2. What is UPOCR?

UPOCR is a general OCR model. It uses ViTEraser, from the South China University of Technology team's AAAI 2024 accepted paper, as the backbone network, and adopts SegMIM, a self-supervised document image pre-training method guided by masked image modeling (MIM) and segmentation maps, for pre-training. The model is then trained in a unified way with three different task prompts: text erasure, text segmentation, and tampered text detection.

After training, the model can be applied directly to downstream tasks without task-specific fine-tuning. The model is designed around three aspects: a unified paradigm, a unified architecture, and a unified training strategy.

1.2.1. Unified Paradigm

[Figure: the unified image-to-image paradigm for pixel-level OCR tasks]
As shown in the figure, the authors propose a unified paradigm for OCR tasks that converts various pixel-level OCR tasks into RGB-to-RGB translation problems. Although the goals of these tasks differ (e.g., image generation vs. segmentation), they can all be unified to operate in a shared feature space:

  1. Text erasure task : For text erasure, the output is the text-removed image corresponding to the input, which is naturally an RGB-to-RGB task.
  2. Text segmentation task : Text segmentation assigns each pixel to the foreground (i.e., text strokes) or the background. Under the unified image-to-image translation paradigm, UPOCR predicts an RGB image in white and black, and the category of each pixel is determined by comparing the distance between the generated RGB value and the predefined foreground/background RGB values.
  3. Tampered text detection task : Tampered text detection is defined as per-pixel classification into tampered text, real text, and background, to which UPOCR assigns red (255, 0, 0), green (0, 255, 0), and blue (0, 0, 255), respectively. During inference, the class of each pixel is determined by comparing the distance of the predicted RGB value to these three colors (see the sketch after this list).
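To make the color-prototype decoding concrete, here is a minimal PyTorch sketch of nearest-prototype pixel classification. It is an illustration of the idea described above, not the authors' released code; the prototype colors follow the red/green/blue assignment mentioned for tampered text detection.

```python
import torch

# Prototype colors for the tampered-text detection task, per the description above:
# tampered text = red, real text = green, background = blue.
PROTOTYPES = torch.tensor([
    [255.0, 0.0, 0.0],   # tampered text
    [0.0, 255.0, 0.0],   # real text
    [0.0, 0.0, 255.0],   # background
])

def rgb_to_classes(pred_rgb: torch.Tensor) -> torch.Tensor:
    """Assign each pixel of a predicted RGB image (3, H, W) to its nearest prototype color.

    Returns an (H, W) tensor of class indices.
    """
    _, h, w = pred_rgb.shape
    pixels = pred_rgb.permute(1, 2, 0).reshape(-1, 3)   # (H*W, 3)
    dists = torch.cdist(pixels, PROTOTYPES)             # distance to each prototype color
    return dists.argmin(dim=1).reshape(h, w)

# Example: classify a dummy "prediction" with values in [0, 255].
classes = rgb_to_classes(torch.rand(3, 64, 64) * 255)
```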

1.2.2. Unified Architecture

[Figure: the unified ViT-based encoder-decoder architecture with learnable task prompts]

As shown in the figure, the authors implement the unified image-to-image translation paradigm with a ViT-based encoder-decoder to handle various pixel-level OCR tasks. The encoder-decoder architecture uses ViTEraser as the backbone network. The encoder consists of four sequential blocks, each containing a patch embedding layer for downsampling and a Swin Transformer v2 block. The decoder consists of five sequential blocks, each containing a patch splitting layer for upsampling and a Swin Transformer v2 block.

In addition, the authors introduce learnable task prompts into the encoder-decoder architecture. The corresponding prompt is added to every pixel of the latent features produced by the encoder, pushing the general OCR-related representations toward the task-specific space. The decoder then converts the adjusted latent features into task-specific output images. With this architecture, UPOCR can handle multiple tasks simultaneously, simply and effectively, with minimal extra parameters and computational overhead.
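The prompt mechanism can be pictured with a small sketch. This is an illustrative assumption of how one learnable prompt vector per task might be broadcast-added to the encoder's latent features; the module and parameter names are invented for the example and do not come from the released implementation.

```python
import torch
import torch.nn as nn

class TaskPrompts(nn.Module):
    """Illustrative: one learnable prompt vector per task, added to every spatial
    position of the encoder's latent feature map before decoding."""

    def __init__(self, num_tasks: int = 3, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(num_tasks, dim))

    def forward(self, latent: torch.Tensor, task_id: int) -> torch.Tensor:
        # latent: (B, C, H, W) features from the shared encoder
        prompt = self.prompts[task_id].view(1, -1, 1, 1)   # (1, C, 1, 1)
        return latent + prompt                             # steer features toward the task-specific space

# Usage: the same encoder output, steered to different tasks before decoding.
feats = torch.randn(2, 768, 16, 16)
steer = TaskPrompts()
erase_feats = steer(feats, task_id=0)     # text erasure
segment_feats = steer(feats, task_id=1)   # text segmentation
tamper_feats = steer(feats, task_id=2)    # tampered text detection
```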

1.2.3. Unified Training Strategy


Since the model is trained under the image-to-image translation paradigm, the training objective only needs to minimize the difference between the generated image and the ground-truth image in pixel space and feature space, without considering the differences between tasks.

  1. Pixel space loss : The difference in pixel space is measured by the L1 distance between the output image and the ground-truth image: $L_{pix}=\sum_{i=1}^{3} \alpha_{i}\left\|\mathbb{I}_{out}^{i}-\mathbb{I}_{gt}^{i}\right\|_{1}$, where $\mathbb{I}_{out}^{i}$ denotes the output image and $\mathbb{I}_{gt}^{i}$ denotes the ground-truth image.
  2. Feature space loss : For tasks associated with realistic image generation, the output image and the ground-truth image also need to be aligned in feature space: $L_{feat}=0.01 \times L_{per}+120 \times L_{sty}$.
  3. Overall loss : The overall loss of the model is the sum of the pixel loss and the feature loss: $L_{total}=L_{pix}+L_{feat}$ (a rough code sketch follows this list).
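Following the formulas above, the combined objective might look roughly like the code below. The alphas, the number of output/ground-truth pairs, and the perceptual/style loss implementations are placeholders; only the 0.01 and 120 weights come from the text.

```python
import torch.nn.functional as F

def upocr_style_loss(outputs, targets, alphas=(1.0, 1.0, 1.0),
                     perceptual_fn=None, style_fn=None):
    """Sketch: weighted L1 loss in pixel space plus weighted perceptual and style
    losses in feature space. outputs/targets are matching lists of image tensors."""
    # Pixel-space loss: weighted L1 over the output/ground-truth image pairs.
    l_pix = sum(a * F.l1_loss(out, gt) for a, out, gt in zip(alphas, outputs, targets))

    # Feature-space loss: perceptual + style terms (e.g., computed on VGG features),
    # used for tasks that generate realistic images such as text erasure.
    l_feat = 0.0
    if perceptual_fn is not None and style_fn is not None:
        l_feat = 0.01 * perceptual_fn(outputs[-1], targets[-1]) \
                 + 120.0 * style_fn(outputs[-1], targets[-1])

    return l_pix + l_feat
```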

1.3. How effective is UPOCR?

[Tables: comparisons with SOTA methods on text erasure (upper left), text segmentation (upper right), and tampered text detection (lower left)]

The experimental results are shown in the three tables above. The upper-left table compares text erasure: even against fine-tuned models dedicated to erasure, the unified UPOCR model leads SOTA methods on most metrics. The upper-right table compares text image segmentation: UPOCR outperforms single-task segmentation methods on all metrics. The lower-left table covers tampered text detection, where UPOCR also achieves good results. Figure 5 of the paper shows that the task prompts learned by UPOCR can clearly distinguish different tasks. The figure below gives a visual comparison of text erasure, segmentation, and tamper detection against the SOTA methods of the individual subtasks.
[Figure: qualitative comparison with single-task SOTA methods on text erasure, segmentation, and tamper detection]

In summary, UPOCR provides a simple and effective unified pixel-level OCR interface. It adopts a ViT-based encoder-decoder that handles various tasks through learnable task prompts, and shows excellent performance on tasks such as text erasure, text segmentation, and tampered text detection.

2. A quick overview of cutting-edge research on OCR unified models

2.1. Donut: Transformer model for document understanding without OCR

Paper address: https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29

Project address: https://github.com/clovaai/donut

The Donut model is a novel OCR-free visual document understanding (VDU) model based on the Transformer architecture. Its synthetic document generator first produces layouts via simple rules and then applies image rendering techniques to simulate real documents; training proceeds in two stages, pre-training and fine-tuning. In the pre-training stage, the model uses the IIT-CDIP dataset for visual language modeling and learns to read text from images. In the fine-tuning stage, the model is trained to generate JSON-format output to solve downstream tasks such as document classification, document information extraction, and document visual question answering. Compared with OCR-based models, Donut does not rely on an external OCR engine and therefore runs faster with a smaller model size. Experiments on multiple public datasets show that Donut achieves strong performance on tasks such as document classification.
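For readers who want to try Donut, a rough usage sketch via the Hugging Face transformers integration is shown below. The checkpoint name and prompt format follow the public model card for the DocVQA-finetuned variant; treat these details as assumptions rather than the paper's official pipeline.

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Checkpoint name assumed from the authors' Hugging Face releases.
ckpt = "naver-clova-ix/donut-base-finetuned-docvqa"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("sample_document.png").convert("RGB")
question = "What is the invoice number?"
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False,
                                        return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
print(processor.batch_decode(outputs)[0])   # decoded answer sequence (JSON-like tags)
```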

2.2. Nougat: from document image to document sequence output

Paper address: https://arxiv.org/abs/2308.13418

Project address: https://github.com/facebookresearch/nougat

The Nougat model is an OCR-free model that maps document images to document sequences using a Swin Transformer encoder and a Transformer decoder. It follows Donut's end-to-end, OCR-free Transformer design and is trained with pre-training and fine-tuning: in pre-training, the model learns to read text from document images and their text annotations, predicting the next token from the image and the preceding text context; in fine-tuning, it learns to understand whole documents for the downstream task. Evaluations show it can convert scientific document images into structured markup sequences with strong accuracy.

2.3. SPTS v3: Unified OCR model based on SPTS

Paper address: https://arxiv.org/abs/2112.07917

Project address: https://github.com/shannanyinxiang/SPTS

SPTS (Single-Point Text Spotting) is a single-point text spotting technique. Its main innovation is that it trains with extremely low-cost single-point annotations, formalizing text spotting as a language modeling task: each text instance only needs to be annotated with a single point to train a scene text spotting model. SPTS is based on an autoregressive Transformer framework that simply generates results as a sequence of tokens, avoiding complex post-processing or dedicated sampling stages. With this concise framework, SPTS shows state-of-the-art performance on various datasets.
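To illustrate how single-point spotting can be cast as language modeling, here is a toy serialization of annotations into a token sequence. The coordinate binning, special tokens, and ordering are assumptions made for the example, not SPTS's exact vocabulary.

```python
# Toy illustration: single-point text spotting as sequence prediction.
NUM_BINS = 1000   # coordinates are quantized into discrete bins (assumed granularity)

def serialize_instances(instances, img_w, img_h):
    """instances: list of (x, y, text), where (x, y) is the single annotated point.
    Returns the token sequence an autoregressive decoder would be trained to emit."""
    tokens = []
    for x, y, text in instances:
        tokens.append(f"<x_{int(x / img_w * (NUM_BINS - 1))}>")   # quantized x coordinate
        tokens.append(f"<y_{int(y / img_h * (NUM_BINS - 1))}>")   # quantized y coordinate
        tokens.extend(list(text))                                 # transcription as character tokens
        tokens.append("<sep>")                                    # end of this instance
    tokens.append("<eos>")
    return tokens

# Example: two text instances in a 640x480 image.
print(serialize_instances([(100, 50, "EXIT"), (320, 400, "OPEN")], 640, 480))
```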

3. Intelligent document processing applications in the era of large models

3.1. Application of LLM and document recognition analysis

Large language models can understand natural language text and context. In document recognition and analysis applications, handing document-understanding work to an LLM enables automatic chapter-level document understanding and analysis, helping the system better grasp document content, including contextual relationships, entity recognition, and sentiment analysis. Currently, the most common and widespread applications are retrieval-augmented generation (RAG) and document question answering.


  1. Retrieval-augmented generation : Large language models are already being used to retrieve relevant information from large document collections and provide more detailed and accurate answers in a generative manner (see the sketch after this list). This has important application value in information retrieval scenarios.
  2. Document Q&A : LLM can be directly used to build a document Q&A system, allowing users to obtain relevant information in documents by asking questions. It can be applied to scenarios such as interpretation of legal documents, query of technical manuals, and understanding of knowledge bases.
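As a concrete (and deliberately simplified) sketch of the retrieval-augmented generation flow mentioned in item 1, the code below retrieves the document chunks most relevant to a question and feeds them to an LLM. The embedding function is a random stand-in and `llm_generate` is any text-generation callable you supply; both are placeholders, not a specific vendor's API.

```python
import numpy as np

def embed(texts):
    """Placeholder embedding: replace with any real text-embedding model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def answer_with_rag(question, document_chunks, llm_generate, top_k=3):
    """Retrieve the chunks most similar to the question, then ask the LLM to answer
    using only that retrieved context."""
    chunk_vecs = embed(document_chunks)
    q_vec = embed([question])[0]
    sims = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8)
    context = "\n".join(document_chunks[i] for i in np.argsort(-sims)[:top_k])
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm_generate(prompt)
```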

3.2. Intelligent document processing application products

Intelligent Document Processing (IDP) uses artificial intelligence and machine learning to automatically analyze and understand documents. By recognizing, parsing, and understanding document content and converting it into actionable data or information, it raises the level of automation in business processes, improving efficiency and reducing costs.

Dr. Ding Kai also shared Hehe Information's document image recognition and analysis products. Built on such intelligent document processing technology, these products can process large volumes of documents quickly and accurately, helping industries including banking, insurance, logistics, supply chain, and customer service achieve digital transformation and more efficient, reliable business process management.


Hehe Information's TextIn intelligent text recognition product is built on self-developed text recognition technology, computer graphics technology, and an intelligent image processing engine. It can quickly convert text in paper documents or images into computer-readable text formats and provides document management solutions for scenarios such as document digitization, office document/report recognition, educational text recognition, express waybill recognition, edge trimming and enhancement, curvature correction, shadow removal, seal detection, and handwriting erasure, helping enterprises achieve digital transformation and automated management.


Although multimodal large model technology represented by GPT-4V has greatly advanced document recognition and analysis, it has not completely solved the problems faced in document image processing. Many issues remain worth studying, and how to combine the capabilities of large models to better solve IDP problems deserves further thinking and exploration.

4. Lucky draw at the end of the article

Hehe Information is giving back to readers! Fill out the annual questionnaire: https://qywx.wjx.cn/vm/exOhu6f.aspx . On January 12, 10 participants will be randomly selected to receive a 50 yuan JD gift card. Welcome to participate!


Source: blog.csdn.net/air__Heaven/article/details/135407255