[Technology New Trends] Hehe Information: Understanding OCR Seal Recognition Technology in Complex Environments and Its R&D Trends


Summary

After research, reproduction, and comparative experiments, the Hehe Information technical team summarizes the seal recognition solutions discussed in this article as follows:

| Technical solution | Advantages | Disadvantages |
| --- | --- | --- |
| Seal text detection + text correction (optional) + text recognition | Detection and recognition modules can be optimized independently; applicable to different types of seals | Slow; high maintenance cost |
| Seal end-to-end detection and recognition | Simple model pipeline; fast and easy to maintain | Hard to train; not applicable to special seals such as square seals |
| Seal sequence prediction | Simple model pipeline; applicable to different types of seals | Prone to overfitting; poor interpretability; cannot give text positions |

Introduction

With the development of the social economy, seals serve as legally significant symbols and evidence for enterprises, institutions, social groups, government departments, and even the state, and play an important role in modern social life. As modern business activities grow, enterprises typically handle a large amount of contract signing and archiving work. In the past, contract photos were reviewed manually to determine whether both parties had applied their official seals, but such manual review carries high time and labor costs. Seal recognition automatically extracts the text of a seal, allowing the computer to replace manual review and comparison. This solves the high time and labor costs of manual review in contract management, reduces finance, taxation, and business risks in the contract signing process, and makes business transactions more efficient and convenient.

Common Seals

Common seals in daily work include the official seal, the financial seal, the legal representative's seal, the invoice seal, and the contract seal.

Technical Difficulties

Returning to the theme set out in the introduction, this article explains the differences between seal recognition and conventional text line recognition by comparing the two, so that readers can form a more concrete understanding.

| Comparison dimension | Regular text line recognition | Seal recognition |
| --- | --- | --- |
| Character shape | Rectangle/quadrilateral | Arbitrary shape |
| Text occlusion | Typically unoccluded, independent text lines | High probability of varying degrees of occlusion, background interference, and overlapping |
| Typesetting | Fixed, regular layout | Widely varying layouts, such as squares, ovals, and circles |
| Reading order | Natural human reading order, left to right | May be right to left or out of order; affected by stamping direction |

There is little research on seal text recognition in the OCR field, but after research and validation, the Hehe technical team believes that some natural scene text recognition techniques can be applied to seal recognition. This article introduces several technical solutions for seal recognition.

Mainstream seal recognition solutions

The input of the seal recognition system is a cropped seal image, and the output is the bounding boxes and recognition results for all text in the seal. The process is shown in the following figure:

Several technical solutions for seal recognition are introduced below.

Seal text detection + text correction (optional) + text recognition

The first seal recognition scheme is the traditional cascade system:

The cropped seal image first passes through a text detection model that supports curved text, producing text bounding polygons; these regions can be sent directly to the text recognition model to obtain the final result. Alternatively, a TPS[1] rectification module can first straighten all curved text into horizontal text line images before feeding them into the recognition model.
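The cascade above can be sketched as a simple orchestration function. This is a minimal sketch, not any specific library's API: `detect_text`, `recognize_text`, and `rectify_tps` are hypothetical stand-ins for real models.

```python
# Sketch of the cascade pipeline. detect_text, recognize_text and rectify_tps
# are hypothetical stand-ins for real models, not any specific library's API.

def recognize_seal(image, detect_text, recognize_text, rectify_tps=None):
    """Detection -> optional TPS rectification -> recognition."""
    results = []
    for region in detect_text(image):        # one entry per detected text instance
        crop = region["crop"]                # cropped (possibly curved) text image
        if rectify_tps is not None:
            crop = rectify_tps(crop)         # straighten curved text into a horizontal line
        results.append({"polygon": region["polygon"], "text": recognize_text(crop)})
    return results

# Stub demo standing in for real detector/recognizer models:
fake_detector = lambda img: [{"polygon": [(0, 0), (10, 0), (10, 5), (0, 5)], "crop": "raw"}]
fake_recognizer = lambda crop: "ACME CO. LTD"
print(recognize_seal("seal.png", fake_detector, fake_recognizer))
```

Because the stages only exchange crops and polygons, each module can be swapped or retrained independently, which is exactly the decoupling advantage noted in the summary table.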

The following briefly introduces the selection of key modules.

Text detection model: In recent years, significant progress has been made in the field of scene text detection based on deep learning. There are many mature curved text detection solutions to choose from, which can be roughly divided into two categories:

  • Regression-based detection models, such as Mask RCNN[2], EAST-like[3] and TextRay[4];

  • Segmentation-based detection models such as PSENet [5], CRAFT [6] and DBNet [7].

Let's take EAST-like as an example (the model has no official name, so we call it EAST-like) and give a brief introduction:

The figure above shows EAST-like's label generation scheme and network structure. This model extends the classic EAST[8] model to support polygon detection by increasing the number of regression points; multi-scale prediction is used to mitigate the problem of large variation in target scale; in addition, the loss function is improved.

Text recognition model: In terms of decoding method, the current mainstream text recognition models in academia fall into two categories: CTC-based and Attention-based.

  • The representative CTC-based model is CRNN[9], which is fast and stable, but can only process horizontal text line images (curved text must first be straightened by the TPS module);

  • There are two approaches for Attention-based recognition models. The first rectifies the image with an STN[10] and then feeds it into a 1D Attention recognition model for end-to-end training; the representative model is ASTER[11]. The second discards the STN module and performs (curved) text recognition directly with 2D Attention decoding, e.g. SAR[12], MASTER[13] and SATRN[14]. The recently popular multimodal text recognition models in academia, such as ABINet[15] and VisionLAN[16], can be regarded as extensions of the 2D Attention model.

The following is the network structure of the classic CRNN model:

The CRNN model consists of three parts: a convolutional layer, a recurrent layer, and a transcription layer. The convolutional layers extract image features; the recurrent layers perform sequence modeling with a BLSTM to further enrich the feature representation; finally, a linear classification layer produces per-frame predictions, which are decoded with CTC to obtain the final result.

To sum up, the Hehe Information Technology team believes that multi-model cascade is a relatively mature seal recognition solution. Its advantage is that there are abundant models to choose from in each link, and at the same time, the detection and recognition modules are decoupled so that their respective trainings do not affect each other. Its disadvantage is mainly the error accumulation problem of the cascaded system.

Seal end-to-end detection and recognition (End2End)

In the cascaded system of the previous section, the detection and recognition models are trained separately. On one hand, this makes the overall recognition system suboptimal; on the other hand, it misses the opportunity for the detection and recognition heads to share image features from a common backbone network.

In recent years, the academic community has been working on proposing an end-to-end text detection and recognition system. By sharing the backbone network with the detection and recognition heads, the complexity of the system can be effectively reduced, and the overall performance of the model can be further improved by using multi-task learning.

The process of the end-to-end seal detection and recognition model is as follows:

In the past two years, many end-to-end models supporting curved text recognition have been proposed. The following will briefly introduce ABCNet[17] and Mask TextSpotterv3[18] as examples.

The schematic diagram of ABCNet network structure and Bezier curve fitting is as follows:

The main highlight of ABCNet is the introduction of Bezier curves to model curved text bounding boxes. As shown in the figure above, a cubic Bezier curve (four control points) can flexibly fit a variety of curves.
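A cubic Bezier curve is fully determined by its four control points, which is why ABCNet only needs to regress eight coordinates per boundary side. A minimal evaluation of the closed-form cubic Bezier formula, with illustrative control points (not taken from the paper):

```python
# Evaluating a cubic Bezier curve from its four control points, as used by
# ABCNet to model one side of a curved text boundary. Control points below
# are illustrative only.

def cubic_bezier(p0, p1, p2, p3, t):
    """Closed-form B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)

# Sample the upper boundary of a gently arched text line at 5 points.
ctrl = [(0, 10), (30, 0), (70, 0), (100, 10)]
pts = [cubic_bezier(*ctrl, t=i / 4) for i in range(5)]
print(pts)  # the endpoints land exactly on p0 and p3
```

Sampling the top and bottom Bezier boundaries at matching `t` values is also what makes a BezierAlign-style warp of the curved region into a rectangle straightforward.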

In terms of network structure, ABCNet consists of a detection module + RoI_Transform + recognition module, similar to the early end-to-end model FOTS[19], except that the detection module regresses the control points of Bezier curves and the RoI_Transform part is replaced by BezierAlign.

The network structure of Mask TextSpotterv3 is shown in the figure below:

This model also consists of a detection module + RoI_Transform + recognition module, but the detection module is replaced by a segmentation model. Compared with regression-based detection, segmentation can flexibly model text of any shape and is insensitive to text length. The RoI_Transform part directly uses RoIAlign: each RoI text block is cropped with its horizontal bounding box, and non-foreground regions are zero-masked to prevent background interference. To match this RoI_Transform, the recognition module uses an attention-based recognizer.
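The zero-masking step can be illustrated in a few lines of NumPy. This is a hypothetical toy illustration of the idea, not the paper's implementation: the RoI is cropped with its horizontal bounding box, then everything outside the predicted text mask is set to zero.

```python
import numpy as np

# Toy illustration of the hard-masking idea in Mask TextSpotter v3:
# crop an RoI with its horizontal bounding box, then zero out pixels
# outside the predicted text mask so the recognizer sees no background.

roi = np.arange(16, dtype=np.float32).reshape(4, 4)   # cropped RoI (toy values)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0                                  # predicted foreground region
masked_roi = roi * mask                               # background zeroed out
print(masked_roi)
```

Because the recognizer only ever sees foreground pixels, neighboring text instances inside the same horizontal box cannot interfere with decoding.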

To sum up, the Hehe Information Technology team believes that the end-to-end detection and recognition algorithm has the following advantages:

  1. End-to-end training of the detection and recognition module improves the error accumulation problem of the cascaded system and achieves better performance

  2. By sharing the backbone network, the end-to-end detection and recognition model is faster

  3. Maintain a unified framework with the same dependencies, saving a lot of engineering effort

At the same time, the end-to-end seal detection and recognition algorithm has the following shortcoming:

  1. The amount of training data required differs between the detection and recognition tasks (recognition requires more), so the requirements on the training dataset are higher.

Seal sequence prediction scheme (Image2Sequence)

Whether using a cascade model or an end-to-end detection and recognition model, multi-directional text becomes a problem when recognizing square seals, as shown in the following figure:

In real square seal images, the reading direction of the text may be left to right or right to left, and the text may be horizontal or vertical. A text detection model relies only on visual information, so it easily detects horizontal text as vertical text. Moreover, the different text lines must also be spliced into a complete string in the correct semantic order.

The seal sequence prediction scheme discards the detection module entirely: given the cropped seal image as input, the model directly outputs the final string sequence, containing all the text lines of interest in the seal:

In theory, a sequence prediction scheme can handle all types of seals simultaneously. To distinguish different strings, a special separator symbol (such as '#') can be inserted between them, as shown in the following figure:

One need only stipulate a fixed reading order among the different strings in the seal, construct the corresponding ground-truth label strings, and leave the rest for the model to learn on its own. For square seals, where the reading order of strings is not fixed, this scheme greatly simplifies the processing flow.
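Label construction and prediction parsing for this scheme are a single join/split around the separator. A minimal sketch, with illustrative seal texts:

```python
# Building an img2seq ground-truth label with '#' separators, and splitting a
# predicted sequence back into individual strings. Seal texts are illustrative.

SEP = "#"

def build_label(strings):
    """Join seal texts in the agreed fixed reading order."""
    return SEP.join(strings)

def parse_prediction(sequence):
    """Recover the individual strings, dropping empty fragments."""
    return [s for s in sequence.split(SEP) if s]

label = build_label(["ACME TRADING CO.", "CONTRACT SEAL"])
print(label)                    # ACME TRADING CO.#CONTRACT SEAL
print(parse_prediction(label))
```

The only constraint is that the separator character must not appear in any seal text, or be escaped if it can.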

There are no special requirements for the choice of img2seq model. Considering the 2D spatial layout of characters in a seal, a 2D Attention-based text recognition model can complete the img2seq task. To help the model better learn global context features, a recognition model based on the transformer encoder/decoder is recommended, such as MASTER[13] and SATRN[14]. With the powerful self-attention mechanism, the img2seq model obtains a global receptive field and better relational modeling at the decoding stage.

The following takes MASTER as an example to briefly introduce the img2seq model, and its network structure is shown in the figure below:

The model is a typical encoder-decoder architecture. The encoder is a customized CNN with a global receptive field, outputting a feature map downsampled by a factor of 8; the decoder is a standard transformer decoder. The paper also uses a memory-cache technique to accelerate decoding.

To sum up, the Hehe Information Technology team believes that the img2seq model has the following advantages and disadvantages:

Its advantage is that a single model can solve the recognition problem for different types of seals, and deployment and maintenance are fairly simple.

Its disadvantage is that the model overfits easily and requires a large amount of training data, which can be alleviated by data synthesis. In addition, the img2seq model cannot provide the position of each text line.


References

  1. Bookstein, Fred L. "Principal warps: Thin-plate splines and the decomposition of deformations." IEEE Transactions on pattern analysis and machine intelligence 11.6 (1989): 567-585.

  2. He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.

  3. Li, XiaoQian, et al. "Learning to predict more accurate text instances for scene text detection." Neurocomputing 449 (2021): 455-463.

  4. Wang, Fangfang, et al. "Textray: Contour-based geometric modeling for arbitrary-shaped scene text detection." Proceedings of the 28th ACM International Conference on Multimedia. 2020.

  5. Wang, Wenhai, et al. "Shape robust text detection with progressive scale expansion network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  6. Baek, Youngmin, et al. "Character region awareness for text detection." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  7. Liao, Minghui, et al. "Real-time scene text detection with differentiable binarization." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.

  8. Zhou, Xinyu, et al. "East: an efficient and accurate scene text detector." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.

  9. Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

  10. Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in neural information processing systems 28 (2015).

  11. Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

  12. Li, Hui, et al. "Show, attend and read: A simple and strong baseline for irregular text recognition." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.

  13. Lu, Ning, et al. "Master: Multi-aspect non-local network for scene text recognition." Pattern Recognition 117 (2021): 107980.

  14. Lee, Junyeop, et al. "On recognizing texts of arbitrary shapes with 2D self-attention." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.

  15. Fang, Shancheng, et al. "Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

  16. Wang, Yuxin, et al. "From two to one: A new scene text recognizer with visual language modeling network." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

  17. Liu, Yuliang, et al. "Abcnet: Real-time scene text spotting with adaptive bezier-curve network." proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

  18. Liao, Minghui, et al. "Mask textspotter v3: Segmentation proposal network for robust scene text spotting." European Conference on Computer Vision. Springer, Cham, 2020.

  19. Liu, Xuebo, et al. "Fots: Fast oriented text spotting with a unified network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

Origin blog.csdn.net/INTSIG/article/details/125203307