ICDAR Competition Technology Sharing

1. Background

ICDAR 2021 (the International Conference on Document Analysis and Recognition) was held in Switzerland on September 5-10, 2021. ICDAR is a top international conference in the field of document analysis and recognition. Held every two years, it covers the latest academic results and cutting-edge application trends in the field, attracting leading R&D teams, experts, and scholars from around the world. The algorithm competitions held alongside the conference are premier events in the field of text recognition (OCR). The Autohome Dealer Technology Department won second place in two subtasks of the Competition on Time-Quality Document Image Binarization (DIB).

Figure 1 Competition results and certificates

2. Competition Introduction

The DIB task of ICDAR 2021 is to binarize historical document images, separating the text from the background. The evaluation metric is a weighted combination of PSNR, DRD, F-Measure (FM), pseudo F-Measure (Fps), and Cohen's Kappa. The difficulty of the competition lies in the very complex backgrounds of historical document images and the many kinds of degradation, which make it hard for existing algorithms to perform well: page stains that occlude handwriting; faded characters that become too similar to the background; ink bleed-through, where writing on the back of the page soaks through to the front but must still be labeled as background in the ground truth; and fold marks whose dark color can be confused with text.

Figure 2 Various degradation examples of historical document image datasets
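For intuition, two of the evaluation metrics mentioned above (F-Measure and PSNR) can be computed roughly as in the sketch below. This is only an illustrative approximation, not the official DIB evaluation tool, which also scores pseudo F-Measure, DRD, and Cohen's Kappa and applies its own weighting.

```python
import numpy as np

def binarization_metrics(pred, gt):
    """Rough F-Measure and PSNR for binary maps (1 = text, 0 = background).

    Illustrative only; the official DIB evaluation additionally uses
    pseudo F-Measure, DRD, and Cohen's Kappa with its own weighting.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)

    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f_measure = 2 * precision * recall / (precision + recall + 1e-8)

    # PSNR treats the binary maps as 0/1 images (peak value = 1).
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    psnr = 10 * np.log10(1.0 / (mse + 1e-12))

    return f_measure, psnr
```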

3. Technical Scheme

Traditional binarization methods fall mainly into global thresholding, local thresholding, and hybrids of the two. Global thresholding segments the document image into text foreground and background with a single fixed threshold, as in the classic Otsu algorithm. Local thresholding computes a dynamic threshold from a local neighborhood window around each pixel and classifies that pixel as foreground text or background accordingly. Traditional methods achieve good accuracy when the background of the document image is not too complex, but they perform poorly when the image suffers from multiple kinds of degradation (such as page stains, bleed-through of writing from the back of the page, or uneven lighting).
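As a reference point, the two traditional families can be tried in a few lines with OpenCV (a sketch only; the file path is a placeholder, and thresholds and window sizes would need tuning per dataset):

```python
import cv2

# Load a document image in grayscale (the path is just a placeholder).
gray = cv2.imread("historical_doc.png", cv2.IMREAD_GRAYSCALE)

# Global thresholding: Otsu picks a single threshold for the whole image.
_, otsu_bin = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Local (adaptive) thresholding: the threshold is computed per pixel
# from a Gaussian-weighted neighborhood window (here 31x31).
adaptive_bin = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 10
)
```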
Methods that incorporate deep learning are more robust and perform better on complex backgrounds. Deep-learning-based methods treat document image binarization as an image segmentation task: a convolutional neural network classifies each pixel, producing a segmentation map of the whole document that separates foreground text from background regions, thereby realizing binarization [1]. For this competition, however, each historical document image has a large resolution (often around 3,000 pixels in width or height). Given GPU memory limits, neural network methods usually take as input image patches cropped from the full image (e.g., 128×128) rather than feeding the entire image into the network. This cropping strategy loses the global spatial information of the document image. In particular, when writing from the back of the page bleeds through, it is hard to distinguish the bleed-through from genuine foreground text within a small patch, so the bleed-through is mistaken for foreground text and binarization accuracy drops.
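The patch-based input pipeline described above can be sketched roughly as follows (a simplified illustration for a grayscale image; real training code would typically add overlap, padding strategies, and data augmentation):

```python
import numpy as np

def crop_patches(image, patch_size=128, stride=128):
    """Crop a large 2-D grayscale document image into square patches.

    A simplified sketch: the image is padded so its sides are multiples of
    the stride, then patches are returned together with their top-left
    coordinates so the predictions can be stitched back later.
    """
    h, w = image.shape
    padded = np.pad(image, ((0, (-h) % stride), (0, (-w) % stride)), mode="edge")

    patches, coords = [], []
    for y in range(0, padded.shape[0] - patch_size + 1, stride):
        for x in range(0, padded.shape[1] - patch_size + 1, stride):
            patches.append(padded[y:y + patch_size, x:x + patch_size])
            coords.append((y, x))
    return patches, coords
```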
Therefore, we designed a document image binarization method that combines global and local information, which achieved good results in the competition. The schematic diagram is shown below:

Figure 3 Binarization method combining global information and local information
Our proposed architecture consists of three U-Net branches: two local U-Nets with input sizes of 128×128 and 256×256, and a global U-Net with an input size of 512×512. The binarized maps produced by the two local U-Nets are first fused, and the result is then intersected with the binarized map produced by the global U-Net to obtain the final binarized image.
Local U-Net: A 128×128 sliding window crops the original image into local patches; a U-Net convolutional neural network [2] produces a classification probability map for each patch, and the patch results are then stitched back into a full-size map. U-Net is a deep-learning-based image segmentation model; we use the classic U-Net structure, which consists of an encoder and a decoder. The encoder is built from 4 repeated modules, each containing two 3×3 convolutional layers and one 2×2 pooling layer, with every convolutional layer followed by batch normalization and a ReLU activation. Along the encoder's downsampling path, the height and width of the feature maps are halved while the number of channels is doubled. The decoder mirrors the encoder: the height and width of the feature maps are doubled and the number of channels is halved. Skip connections between the encoder and decoder improve segmentation accuracy. Since binarization maps each pixel of the input image to 0 or 1, the last layer of the U-Net uses a Softmax activation to convert each image patch into a classification probability map of the same size. Because every value in the probability map lies in the interval [0, 1], the map is usually converted into a 0/1 binary map with an activation threshold; for example, with a threshold of 0.5, values greater than or equal to 0.5 become 1 and values below 0.5 become 0. To improve accuracy, we adopt a multi-scale model fusion strategy when extracting local information, fusing the results from 128×128 and 256×256 patches.
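Below is a compact PyTorch sketch of the classic U-Net structure described above. Hyperparameters such as the base channel count are illustrative assumptions, not the exact competition configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:                    # 4 encoder blocks: halve H/W, double channels
            self.encoders.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(chs[-1], chs[-1] * 2)

        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        prev = chs[-1] * 2
        for c in reversed(chs):          # 4 decoder blocks: double H/W, halve channels
            self.upconvs.append(nn.ConvTranspose2d(prev, c, 2, stride=2))
            self.decoders.append(conv_block(c * 2, c))  # *2 from the skip connection
            prev = c
        self.head = nn.Conv2d(chs[0], num_classes, 1)   # per-pixel class logits

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        # Softmax over the 2 classes gives the per-pixel probability map;
        # channel 1 can be read as P(pixel = text).
        return torch.softmax(self.head(x), dim=1)

# Usage: a 128x128 grayscale patch -> 2-channel probability map of the same size.
prob = UNet()(torch.randn(1, 1, 128, 128))
binary = (prob[:, 1] >= 0.5).float()     # activation threshold of 0.5
```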
Global U-Net: Because the local patch size is much smaller than the full image, the local branches produce classification probability maps based only on local information. To capture global spatial context while respecting model capacity limits, a straightforward approach is to downsample the original image (e.g., around 3000×3000) to a fixed lower resolution (e.g., 512×512). However, this has two drawbacks: first, different document images have different aspect ratios, so resizing them all to 512×512 distorts the aspect ratio and introduces errors; second, it reduces the number of training samples compared with the patch-based approach. We therefore downsample the original document image and then crop it with a fixed 512×512 sliding window; each resulting patch contains enough background and foreground text to carry global spatial context.
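The global-branch preprocessing could look roughly like this (a sketch; the downsampling factor is an assumption for illustration, and edge padding is omitted for brevity):

```python
import cv2

def global_patches(image, scale=0.25, window=512, stride=512):
    """Downsample the full document, then crop fixed-size windows.

    Downsampling by a uniform scale preserves the aspect ratio, and the
    512x512 windows still cover enough of the page to carry global
    context (bleed-through, stains, fold marks).
    """
    small = cv2.resize(image, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)
    h, w = small.shape[:2]
    patches = []
    for y in range(0, max(h - window, 0) + 1, stride):
        for x in range(0, max(w - window, 0) + 1, stride):
            patches.append(small[y:y + window, x:x + window])
    return patches
```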
Fusion: First, the results of the two local U-Nets are fused. The classification probability maps obtained from the 128×128 and 256×256 patches come from U-Net segmentation models with different receptive field sizes; averaging the two yields a classification probability map the same size as the original document image. With an activation threshold of 0.5, this probability map is converted into a binarized map based on the fusion of local information. This map is then intersected with the result of the global U-Net to obtain the final binarized image.
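In code, this fusion step is essentially an average followed by a logical AND. The sketch below assumes the three branch outputs have already been stitched and resized back to the original image resolution, with 1 denoting text.

```python
import numpy as np

def fuse_branches(prob_local_128, prob_local_256, binary_global, threshold=0.5):
    """Fuse the two local probability maps, then intersect with the global map.

    All inputs are assumed to be full-resolution arrays: the two local
    probability maps hold values in [0, 1], and binary_global is a 0/1 map
    from the global 512x512 branch.
    """
    # Average the multi-scale local probability maps and apply the threshold.
    local_prob = (prob_local_128 + prob_local_256) / 2.0
    binary_local = (local_prob >= threshold).astype(np.uint8)

    # Intersection: a pixel is foreground text only if both the fused local
    # result and the global result call it text.
    return binary_local & binary_global.astype(np.uint8)
```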

Figure 4 Binarization results of the sample
Figure 4 shows binarization results of this model on a printed document image from the competition dataset. When only local information is considered, i.e., when the binary map is obtained from local patches alone, text in the background area of the historical document image is easily mispredicted as foreground text. After combining global and local information, the model distinguishes the background from the foreground text much better and achieves better results.

4. Summary

In this competition, the Autohome Dealer Technology Department proposed an image binarization method that combines global and local features: a multi-level convolutional neural network extracts image features, the local branches accurately delineate text outlines, and the global branch better separates the complex background from the text foreground, greatly improving the binarization of text images. Image binarization is a crucial preprocessing step in image processing, and its quality strongly affects downstream OCR accuracy. This work has effectively improved binarization results and provides valuable experience for subsequent business scenarios such as image OCR and automatic image review. The Dealer Technology Department has extensive experience in image OCR and automatic image review: it recognizes more than 10 million tickets of various types per year, saving the company the cost of purchasing external OCR services and better protecting the personal data of the company's customers and users. In addition, technologies developed by the department with natural language processing, such as telephone robots, IM dialogue robots, and intelligent quality inspection, are widely used in smart products, marketing activities, and related Cheshanghui products, saving labor costs in lead cleaning, event invitations, and lead conversion; they are also applied to the sale of commercial products, helping to increase the company's revenue.

References
[1] Jorge Calvo-Zaragoza and Antonio-Javier Gallego. A selectional autoencoder approach for document image binarization. Pattern Recognition, 86:37–47, 2019.
[2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

Source: blog.csdn.net/autohometech/article/details/126510856