[ECCV 2022] OCR-free Document Understanding Transformer (reproduced from CSIG Document Image Analysis and Recognition Committee)

Summary

This article briefly introduces the ECCV 2022 paper "OCR-free Document Understanding Transformer". Most previous document understanding algorithms depend on existing OCR results, but the OCR engine's high overhead, poor generalization, and error accumulation often degrade the performance of the document understanding module. To address these problems, the paper proposes Donut, a large-scale pre-trained document understanding model that does not rely on OCR. The model performs well on common benchmark datasets and offers faster inference. The paper also provides a multilingual, multi-format document data synthesizer to support pre-training. The code is open-sourced at https://github.com/clovaai/donut.

Research Background

Visual Document Understanding (VDU) aims to summarize, organize, and extract useful information from document images. The technology has a wide range of applications in daily life but remains challenging; its typical tasks include document classification, information extraction, and visual question answering. Most existing VDU models [1][2][3][4][5] adopt a two-stage scheme: 1) read the text from the document image; 2) understand the document based on the extracted text. They typically rely on optical character recognition (OCR) engines for the first step and focus their modeling on the second step. However, methods that rely on OCR have several problems. First, OCR brings additional overhead: even with off-the-shelf OCR engines, the extra inference time is not negligible; existing OCR engines also lack the flexibility to handle different languages or document formats and generalize poorly; and training a high-performance OCR model consumes considerable resources. Second, OCR errors accumulate and affect the subsequent stages. For languages with complex character sets, such as Korean or Chinese, OCR quality is often poor and the impact is correspondingly more severe. Although some methods [6][7][8] add a post-processing step for OCR error correction, such solutions increase the overhead of the whole system in deployment and have limited practical value.

Brief description of method principle

The Donut model proposed in this paper removes the dependence on OCR results and directly generates the result string in an end-to-end manner, avoiding the problems discussed in the previous section. Its structure, shown in Figure 1, is very simple: the input is a document image, an encoder module produces a sequence of features, and a Transformer-based decoder then generates the result string.
Figure 1 Donut flow chart
The encoder encodes a document image x ∈ R^{H×W×C} into a set of latent feature vectors {z_i ∈ R^d, 1 ≤ i ≤ n}, where n is the size of the output feature map (the number of image patches) and d is the latent feature dimension. This module can be a convolutional neural network or a Transformer-based vision model; after experimental comparison, the authors adopted Swin Transformer [9] as the backbone network.

Conditioned on the latent vectors {z_i}, the decoder generates a result token sequence (y_1, ..., y_m), where each y_i is a one-hot vector over a vocabulary of size v (the dictionary of the result string) and m, the maximum sequence length, is a hyperparameter. The authors use BART as the decoder and initialize its weights from a publicly released BART model pre-trained on multilingual data.
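To make the encoder-decoder description above concrete, the following is a minimal sketch of a Donut-like architecture, assuming the Hugging Face transformers SwinModel and MBartForCausalLM classes; the 224x224 input, vocabulary size, and sequence length are placeholders and differ from the official implementation at https://github.com/clovaai/donut.

```python
# A minimal, illustrative sketch of a Donut-like encoder-decoder (not the
# official implementation; hyperparameters are placeholders).
import torch.nn as nn
from transformers import SwinConfig, SwinModel, MBartConfig, MBartForCausalLM

class DonutLikeModel(nn.Module):
    def __init__(self, vocab_size=50000, max_len=768):
        super().__init__()
        # Swin-B visual encoder: document image -> n latent vectors z of dim 1024.
        self.encoder = SwinModel(SwinConfig(
            image_size=224, embed_dim=128,
            depths=[2, 2, 18, 2], num_heads=[4, 8, 16, 32]))
        # BART-style autoregressive decoder that cross-attends to z and emits
        # the result token sequence (y_1, ..., y_m); d_model matches the encoder.
        self.decoder = MBartForCausalLM(MBartConfig(
            vocab_size=vocab_size, d_model=1024, is_decoder=True,
            add_cross_attention=True, max_position_embeddings=max_len))

    def forward(self, pixel_values, decoder_input_ids):
        z = self.encoder(pixel_values=pixel_values).last_hidden_state  # (B, n, 1024)
        out = self.decoder(input_ids=decoder_input_ids,
                           encoder_hidden_states=z)
        return out.logits                                              # (B, m, v)
```

At inference time, decoding would start from a task prompt token and proceed autoregressively (greedy or beam search) until an end token is produced.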

In the pre-training stage, the authors design a text reading task: given a document image, the model outputs the text it contains. The labels used for supervision come from the authors' own OCR engine; since these labels inevitably contain some errors, the authors also call this pre-training task pseudo-OCR. Two kinds of data are used for pre-training: the real document dataset IIT-CDIP, with 11 million documents, and multilingual data synthesized by the authors, covering Chinese, Japanese, Korean, and English, with 2 million documents in total.
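As a sketch of how this pseudo-OCR objective could be trained with the model sketched above (the task token and tokenizer here are hypothetical, not the official ones), one teacher-forced training step might look like this:

```python
import torch.nn.functional as F

def pseudo_ocr_step(model, tokenizer, pixel_values, ocr_texts, optimizer):
    # Target sequences: a (hypothetical) task token followed by the text read
    # from the page; the text itself is an OCR pseudo-label.
    targets = ["<s_pseudo_ocr>" + t + "</s>" for t in ocr_texts]
    batch = tokenizer(targets, return_tensors="pt", padding=True)
    logits = model(pixel_values, batch.input_ids)           # (B, m, v)
    # Teacher forcing: predict token t+1 from tokens <= t; ignore padding.
    labels = batch.input_ids[:, 1:].clone()
    labels[batch.attention_mask[:, 1:] == 0] = -100
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```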

To produce the above synthetic data, the authors designed a data synthesis pipeline, SynthDoG, which decomposes a document into four components: background, paper texture, text, and layout. Backgrounds are sampled from ImageNet [10], paper textures come from document images collected by the authors, and the text is collected from Wikipedia. For the layout, the authors designed a set of rules that divide the document into multiple regions to simulate layout variation. Some synthetic samples are shown in Figure 2.
Figure 2 Some samples synthesized by SynthDoG
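To give a sense of the four-component recipe, here is a toy illustration written with PIL; this is only a sketch under assumed inputs (a background image, a paper-texture image, and some corpus lines), whereas the real SynthDoG pipeline in the official repository is a full synthesis engine with many more effects and layout rules.

```python
# Toy document synthesis: background + paper texture + text + one layout rule.
from PIL import Image, ImageDraw, ImageFont

def synth_doc(background_path, texture_path, corpus_lines, size=(480, 640)):
    # Background: a natural image (e.g. sampled from ImageNet).
    bg = Image.open(background_path).convert("RGB").resize(size)
    # Paper texture: a blank-paper photo pasted onto the background.
    paper = Image.open(texture_path).convert("RGB").resize(
        (int(size[0] * 0.85), int(size[1] * 0.9)))
    left = (size[0] - paper.width) // 2
    top = (size[1] - paper.height) // 2
    bg.paste(paper, (left, top))
    # Text + layout: stack corpus lines in a single text region
    # (a stand-in for the rule-based multi-region layouts in the paper).
    draw = ImageDraw.Draw(bg)
    font = ImageFont.load_default()
    y = top + 20
    for line in corpus_lines:
        draw.text((left + 20, y), line, fill=(20, 20, 20), font=font)
        y += 18
    return bg
```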
The fine-tuning process is illustrated by the red, blue, and green text boxes in Figure 1. The initial input to the decoder is a prompt indicating the type of task, and the output of the model is a hierarchical markup language similar to HTML. For document image classification, for example, the decoder is given a start token indicating the classification task; it then generates a token marking the category-name field, the category content (here, receipt), and a closing token marking the end of the task. In this way, the output can be further parsed into JSON-formatted text, which is convenient for downstream applications. It is worth noting that the same format can also handle multi-level (hierarchical) information extraction; this scenario has received little attention in academia but is a very common and pressing problem in industry.
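As an illustration of this parsing step (the tag names below are hypothetical; the official repository ships its own token-to-JSON conversion with fuller handling of nesting, lists, and special tokens), a minimal parser could look like:

```python
# Minimal parser turning tag-style decoder output into JSON.
import json
import re

def token2dict(seq: str) -> dict:
    """Parse '<s_key>value</s_key>' spans (possibly nested) into a dict."""
    out = {}
    for m in re.finditer(r"<s_(\w+)>(.*?)</s_\1>", seq, flags=re.S):
        key, value = m.group(1), m.group(2).strip()
        # Recurse when the value itself contains nested fields.
        out[key] = token2dict(value) if "<s_" in value else value
    return out

# Example: a (hypothetical) classification output parsed into JSON.
seq = "<s_class><s_name>receipt</s_name></s_class>"
print(json.dumps(token2dict(seq), ensure_ascii=False))
# -> {"class": {"name": "receipt"}}
```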

Main experimental results and visualization results

As shown in Table 1, the model is evaluated on document image classification on the RVL-CDIP [11] dataset. The results show that Donut achieves excellent accuracy with fast inference and fewer parameters than commonly used models. The marked entries in the table indicate the additional overhead introduced by the OCR engine.
Table 1 Experimental results of Donut's document image classification on the RVL-CDIP dataset
Table 2 lists the performance of the models on the CORD [12], EATEN [13], and the authors' internal datasets. Donut has advantages in accuracy, speed, and model size. It is worth noting that the metrics of the LayoutLM-series models differ from those reported in their official papers. The authors explained in an issue of the official repository that the metrics of LayoutLM and the other baselines in this paper are computed from OCR-engine outputs, which is closer to real-world usage, whereas the numbers in the original papers use the ground-truth labels of the datasets, hence the gap.
Table 2 Donut's performance on some visual information extraction tasks
Table 3 lists the metrics of the model on the DocVQA [14] dataset. Donut's performance on the original test set is not the best, but it performs well on handwritten documents, showing strong generalization ability. The authors attribute the gap to the low resolution of the DocVQA images, which prevents the model from understanding some small-scale text.
Table 3 Document Visual Question Answering Performance of Donut on DocVQA Dataset
Figure 3 Visualization results of the attention mechanism
Figure 3 shows the visualization of the model's attention mechanism. It can be seen that Donut has learned the alignment between the generated text and the corresponding image regions well.

Summary and Discussion

Donut, the model proposed in this paper, removes the dependence on OCR shared by most previous algorithms and achieves good performance on visual document understanding tasks. At the same time, its model size and inference speed compare favorably with past methods.

The main limitation of the model is its limited ability to understand small-scale text, which calls for further research in future work.

References

[1] Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: Bros: A pre-trained language model focusing on text and layout for better key information extraction from documents. Proceedings of the AAAI Conference on Artificial Intelligence 36(10), 10767–10775 (Jun 2022).

[2] Hwang, W., Kim, S., Yim, J., Seo, M., Park, S., Park, S., Lee, J., Lee, B., Lee, H.: Post-OCR parsing: building simple and robust parser via BIO tagging. In: Workshop on Document Intelligence at NeurIPS 2019 (2019).

[3] Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for semi-structured document information extraction. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 330–343. Association for Computational Linguistics, Online (Aug 2021).

[4] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. p. 1192–1200. KDD ’20, Association for Computing Machinery, New York, NY, USA (2020).

[5] Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., Zhou, L.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 2579–2591. Association for Computational Linguistics, Online (Aug 2021).

[6] Duong, Q., Hämäläinen, M., Hengchen, S.: An unsupervised method for OCR post-correction and spelling normalisation for Finnish. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). pp. 240–248. Linköping University Electronic Press, Sweden, Reykjavik, Iceland (Online) (May 31–2 Jun 2021).

[7] Rijhwani, S., Anastasopoulos, A., Neubig, G.: OCR Post Correction for Endangered Language Texts. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5931–5942. Association for Computational Linguistics, Online (Nov 2020).

[8] Schaefer, R., Neudecker, C.: A two-step approach for automatic OCR postcorrection. In: Proceedings of the The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. pp. 52–57. International Committee on Computational Linguistics, Online (Dec 2020).

[9] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10012– 10022 (October 2021).

[10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009).

[11] Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR). pp. 991–995 (2015).

[12] Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: CORD: A consolidated receipt dataset for post-OCR parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019).

[13] Guo, H., Qin, X., Liu, J., Han, J., Liu, J., Ding, E.: EATEN: Entity-aware attention for single shot visual text extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 254–259 (2019).

[14] Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200–2209 (2021).

Original Authors: Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han and Seunghyun Park

Source: blog.csdn.net/weixin_42280271/article/details/128214184