ChatGPT has been upgraded, and the field of document image recognition is ushering in technological innovation


Foreword

On December 31, 2023, the 19th China Society of Image and Graphics (CSIG) Young Scientists Conference was held in Guangzhou. Hosted by CSIG, the conference aimed to promote exchange and cooperation among young scientists and to raise the level of scientific research and innovation in China's image and graphics field.

In the "Vertical Field Large Model Forum" jointly organized by the Chinese Society of Image and Graphics and Shanghai Integrated Information Technology (INTSIG), the research direction of large model technology in the field of image graphics in the era of large language models represented by ChatGPT was discussed. In other words, whether the implementation of the application will be valuable and what the value is has been discussed in depth. Many industry experts, including Professor Ding Kai of Hehe Information, introduced new explorations in the field of document and image recognition in the era of large models.


ChatGPT ushered in a major upgrade

On September 25, 2023, OpenAI announced the launch of the new GPT-4V (Vision) multi-modal large model, and ChatGPT ushered in a major upgrade!

GPT-4V adds image and voice input to the original model, aiming to give users more diverse ways to interact and make ChatGPT's communication with people richer. Its main capabilities include: a voice function with five voice options and high-accuracy speech recognition and synthesis; image input, which lets users photograph things they are interested in and upload them to GPT-4V; mixed text-and-image input, from which the model produces text output; natural language tasks such as text summarization, question answering, text generation, sentiment analysis, and machine translation; and answering questions about places shown in user-supplied images. It also offers object detection, text recognition, face recognition, CAPTCHA solving, and more. GPT-4V is clearly powerful and has broad application prospects in many fields, including image and document recognition.
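To make the mixed text-and-image input concrete, here is a minimal sketch of calling GPT-4V through the OpenAI Python SDK (v1.x); the model name and message format follow the public API as of late 2023, and the prompt text and image URL are placeholders only.

```python
# A minimal sketch of sending mixed text + image input to GPT-4V via the
# OpenAI Python SDK (v1.x). The model name and message layout follow the
# API as it existed in late 2023; the prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",          # GPT-4V multimodal model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text in this document image and "
                         "describe its layout (titles, tables, paragraphs)."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample-invoice.png"}},
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)  # the model's text-only output
```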

So, with the emergence of the GPT-4V multi-modal large model, will it have a huge impact on the field of OCR document recognition? Dr. Ding Kai of Shanghai Hehe Information gave a detailed answer at the China Society of Image and Graphics (CSIG) Young Scientists Conference 2023...

Impact and opportunity coexist

It is undeniable that GPT-4V has achieved significant results in document recognition, but some core problems of OCR document recognition, such as image quality, text recognition, and layout analysis, remain unsolved. At the same time, GPT-4V will bring many changes to the field, so from a research perspective, impact and opportunity coexist.

Through detailed analysis and scenario testing of GPT-4V on document processing, it was found that GPT-4V performs very well on scene text recognition, handwritten document recognition, scenes combining geometric figures and text, formula recognition, table recognition, information extraction, and more, at a level that can be said to surpass any traditional technique.

[Figure: tests of scene text recognition, handwritten document recognition, and formula recognition]
Even at this level, however, GPT-4V does not solve every problem in OCR document recognition, and its shortcomings were obvious during testing. The first is Chinese recognition: for both handwritten and printed Chinese text, GPT-4V outputs large amounts of content unrelated to the actual document, and it also fails to recognize some simple handwritten formulas correctly.

In addition, long documents still depend on an upstream document parsing and recognition step. ChatGPT calls the open-source PyPDF2 library for this, but the plug-in performs poorly: its output does not support table structures, scanned documents, or complex layouts, and it cannot locate answers back in the original text.
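To see why this upstream dependency matters, the following sketch shows the kind of raw extraction PyPDF2 performs; the file name is a placeholder. The result is a flat text stream per page, which is exactly why table structure, scanned pages, complex layouts, and position information get lost.

```python
# A minimal sketch of the kind of extraction PyPDF2 provides. The result is a
# flat text stream: table structure, reading order on complex layouts, and
# positions in the original page are not preserved, and pages that are scanned
# images yield no text at all without a separate OCR step.
from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")          # placeholder file name
for page_number, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""      # empty for image-only (scanned) pages
    print(f"--- page {page_number} ---")
    print(text)
```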

In summary, GPT-4V's advantage is that it can solve recognition and understanding problems end to end; its cognitive ability and the range of document element types it can identify and understand far exceed traditional algorithms. For long documents, however, it must rely on an external OCR or document parsing engine, which means the performance of that external engine seriously affects GPT-4V's document-processing performance. Its weaknesses are also very clear: for pixel-level OCR tasks such as tamper detection, text segmentation and erasing, and element detection and recognition, GPT-4V has limited or no capability.

GPT-4V's ability to process large-scale data, together with its breakthroughs in language generation and understanding, allows it to handle and analyze different kinds of input, such as language, sound, and images, in a more natural and sophisticated way. However, GPT-4V is not specifically optimized for document image recognition, so what we should do is make full use of its potential while making appropriate adjustments and improvements for the specific needs and challenges of document recognition. At the same time, other OCR technologies and tools still have their own advantages and application scenarios, so GPT-4V will not completely replace them; rather, they will coexist and promote each other's development, and there remains ample research space in OCR document image recognition.

Thinking and Exploration in the Age of Large Models

The above analysis of GPT-4V and document recognition points to new directions for research in OCR document recognition. Higher recognition accuracy and processing efficiency have also become growing application requirements. Against this backdrop, three new directions have emerged: a pixel-level unified OCR model, a unified OCR model, and document recognition and analysis combined with LLM applications.


■ Pixel-level unified OCR model: UPOCR

A pixel-level unified OCR model is an advanced OCR approach that combines OCR and image-processing techniques, analyzing and processing images at the pixel level to achieve high-precision text recognition and image processing. It can be used for many types of recognition and processing tasks, such as license plate recognition, face recognition, and remote sensing image processing, and it can be customized and optimized for different application scenarios to meet the needs of different users.

UPOCR (Towards Unified Pixel-Level OCR Interface) is a universal OCR model that unifies the paradigm, architecture and training strategy of different pixel-level OCR tasks. It unifies pixel-level OCR tasks such as text erasure, segmentation, and tamper detection, and introduces learnable task prompts to guide the ViT-based encoder-decoder architecture. UPOCR's general capabilities have been extensively validated on text erasure, text segmentation, and tampered text detection tasks, significantly outperforming existing specialized models.
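Setting the paper's details aside, the skeleton below illustrates the general idea of learnable task prompts steering one shared encoder-decoder across several pixel-level tasks. It is an illustrative PyTorch sketch under simplifying assumptions (plain convolutional stand-ins for the ViT encoder and the pixel decoder), not the actual UPOCR implementation.

```python
# Illustrative PyTorch skeleton (not the official UPOCR code): a single
# encoder-decoder handles several pixel-level OCR tasks, and a learnable
# per-task prompt embedding steers the shared features toward one task.
import torch
import torch.nn as nn

class UnifiedPixelOCR(nn.Module):
    def __init__(self, dim=256, num_tasks=3):
        super().__init__()
        # stand-ins for a ViT-based encoder and a pixel decoder
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 4, stride=4), nn.GELU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 4, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim // 4, 3, 1),
        )
        # one learnable prompt vector per task (erasing / segmentation / tamper detection)
        self.task_prompts = nn.Parameter(torch.randn(num_tasks, dim))

    def forward(self, image, task_id):
        feats = self.encoder(image)                      # (B, dim, H/4, W/4)
        prompt = self.task_prompts[task_id].view(1, -1, 1, 1)
        feats = feats + prompt                           # prompt steers shared features
        return self.decoder(feats)                       # pixel-level prediction

model = UnifiedPixelOCR()
x = torch.randn(1, 3, 256, 256)
erased = model(x, task_id=0)        # e.g. task 0 = text erasing
mask   = model(x, task_id=1)        # e.g. task 1 = text segmentation
```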


■ Unified OCR model: SPTS v3

A unified OCR model can be understood as one that integrates multiple OCR algorithms and models to achieve more efficient and accurate text recognition. By combining the strengths of different approaches, it improves recognition accuracy and adaptability. Such a model usually incorporates a variety of methods, including rule-based, template-based, machine-learning, and deep-learning approaches, each of which has advantages in different scenarios and tasks.

The current document image recognition and analysis pipeline involves many tasks, including text recognition, paragraph recognition, layout analysis, table recognition, and formula recognition. The idea is to define all of these tasks as sequence prediction and use different prompts to guide the model to complete the different OCR tasks, supporting chapter-level document image recognition and analysis, producing standard output formats such as Markdown/HTML/plain text, and leaving document understanding to the LLM.
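To make "one model, different prompts" concrete, the sketch below shows how several OCR tasks could be serialized as token sequences selected by a task-prompt token; the token names and target formats are illustrative assumptions, not SPTS v3's actual vocabulary.

```python
# Illustrative only: framing several OCR tasks as sequence prediction, where a
# task-prompt token selects what the decoder should generate. The token names
# and target formats below are assumptions, not SPTS v3's actual vocabulary.

def build_target_sequence(task, annotation):
    """Serialize an annotation into the token sequence the model must predict."""
    if task == "<spotting>":
        # text spotting: a point location followed by the transcription
        x, y, text = annotation
        return [task, f"<x={x}>", f"<y={y}>", *list(text), "<eos>"]
    if task == "<table>":
        # table structure: emit structure tags (here, HTML-like) plus cell text
        return [task, *annotation, "<eos>"]
    if task == "<formula>":
        # handwritten formula: emit the LaTeX string token by token
        return [task, *list(annotation), "<eos>"]
    raise ValueError(f"unknown task prompt: {task}")

print(build_target_sequence("<spotting>", (120, 48, "Total")))
print(build_target_sequence("<formula>", r"\frac{a}{b}"))
```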

Based on this idea, SPTS v3, a unified OCR model built on SPTS, was developed. It defines multiple OCR tasks as sequence prediction and uses different prompts to guide the model to complete the different tasks.

SPTS v3 currently focuses on the following tasks: end-to-end detection and recognition, table structure recognition, and handwritten mathematical formula recognition.

Extensive training and analysis show that SPTS v3 achieves very good results across its performance metrics. However, the number of tasks it currently covers is still small, much work remains to be done, and there is plenty of room to expand both its functionality and its task scope.

■ Document recognition and analysis + LLM applications

For combining document recognition and analysis with LLM applications, Hehe Information proposes the following technical framework: the input document image first passes through document recognition and layout analysis to obtain document information, the document is then split into chunks and retrieved (recalled), and finally an LLM performs question answering.
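As a rough outline of that pipeline, the sketch below strings the stages together: document recognition (any OCR/layout engine), simple chunking, naive keyword retrieval, and an LLM call for the final answer. The function names, the ocr_engine interface, and the retrieval scheme are assumptions for illustration, not Hehe Information's actual implementation.

```python
# A rough, illustrative outline of the "document recognition + LLM Q&A"
# pipeline described above. The ocr_engine interface, the chunking and
# retrieval choices, and the LLM call are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()

def recognize_document(image_path, ocr_engine):
    """Step 1: document recognition and layout analysis -> plain text."""
    return ocr_engine(image_path)          # assumed to return recognized text

def split_into_chunks(text, size=800):
    """Step 2: split the recognized text into retrievable chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(chunks, question, top_k=3):
    """Step 3: naive keyword-overlap retrieval (a vector index would normally be used)."""
    scored = sorted(chunks, key=lambda c: -sum(w in c for w in question.split()))
    return scored[:top_k]

def answer(question, context_chunks):
    """Step 4: LLM Q&A grounded in the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Answer using only this document:\n{context}\n\nQ: {question}"}],
    )
    return resp.choices[0].message.content
```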

Combining document recognition technology with LLM (large language model) applications is indeed a promising field with many potential applications and directions to explore. For example:

  • Document summarization. Combining document recognition technology with large language models makes it possible to automatically summarize long documents and give users concise, key information;

  • Automated Q&A. A question-answering system built on document recognition technology can answer user questions based on document content;

  • Document classification and topic identification. Document recognition technology can be used to classify documents and identify their topics, supporting automatic document sorting, summarization, information extraction, and other tasks (a minimal classification sketch follows this list).
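As a small illustration of the last item, the sketch below asks an LLM to assign recognized document text to a category; the category list, prompt, and model choice are placeholders, and the text is assumed to come from an upstream OCR step.

```python
# A minimal sketch of document classification on top of recognized text.
# The categories and prompt are placeholders; `recognized_text` is assumed
# to come from an upstream OCR / document-recognition step.
from openai import OpenAI

client = OpenAI()

def classify(recognized_text, categories=("invoice", "contract", "resume", "report")):
    prompt = (f"Classify the following document into one of {list(categories)}. "
              f"Reply with the category name only.\n\n{recognized_text[:2000]}")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```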

Beyond these, bringing large language models into the field of document image recognition will generate many more research topics and directions, which in turn requires vendors and developers to keep exploring new technologies and methods.

Closing thoughts

Multi-modal large model technology, represented by GPT-4V, has greatly advanced document recognition and analysis, and has also challenged traditional IDP (intelligent document processing) technology. However, large models have not completely solved the problems faced in the IDP field, and many issues still deserve continued research.

How to combine the capabilities of large models to better solve IDP problems deserves further thought and exploration. The TextIn (Text Intelligence) research team at Hehe Information is a typical example. Focused on intelligent document processing, the team has, over 16 years of dedicated work, achieved remarkable results in intelligent document image recognition, text recognition, natural language processing, and more. Its extensive, in-depth research covers document image analysis and preprocessing, document parsing and recognition, layout analysis and restoration, document information extraction and understanding, AI security and knowledge construction, and storage, retrieval, and management, among other key technologies.

These research results have been integrated into Hehe Information's TextIn intelligent text recognition product, an intelligent document processing cloud platform offered to users and enterprises worldwide. You can visit textin.com to try its one-stop intelligent text recognition services.



Questionnaire lottery

Finally, you can fill out the questionnaire below to enter the lottery. According to Hehe Information, 10 participants will be selected to receive a 50-yuan JD gift card (the draw will be held on the 12th).

Questionnaire link: https://qywx.wjx.cn/vm/exOhu6f.aspx


Originally published at blog.csdn.net/weixin_53072519/article/details/135396178