A Plain-Language Look at a Classic Text Recognition Model: CRNN

In the previous article (see this blog post for details: CTPN, a Classic Model for Text Detection), I introduced the wide range of real-life applications of text recognition, along with its basic pipeline:

Among these steps, "text detection" and "text recognition" are the two key links. "Text detection" was covered in detail in the previous article; this article introduces CRNN, the classic model for "text recognition", and explains how it works.

 

Before introducing CRNN, let's first sort out what components a "text recognition" model requires:

(1) First, the model must read the input image and extract image features, so a convolutional layer is required. For the underlying principles, see the earlier article on this official account: Plain-Language Convolutional Neural Networks (CNN);

(2) Since text sequences are of indeterminate length, an RNN (recurrent neural network) must be introduced into the model; a bidirectional LSTM is generally used to handle variable-length sequence prediction. For the underlying principles, see the earlier article on this official account: Plain-Language Recurrent Neural Networks (RNN);

(3) To make the model broadly applicable, it is best not to require the input characters to be segmented, so that training can be done end to end; this avoids a great deal of segmentation and labeling work. For this, the CTC model (Connectionist Temporal Classification) is introduced to solve the problem of segmenting and aligning the samples.

(4) Finally, the model's raw output is corrected according to certain rules to produce the final result.

The above are several essential elements of the "text recognition" model.

The CRNN model to be introduced next is also basically composed of these parts.

 

1. What is CRNN

CRNN (Convolutional Recurrent Neural Network) is a text recognition model proposed by Huazhong University of Science and Technology in the paper "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition". It is mainly used to solve image-based sequence recognition problems, especially scene text recognition.

The main features of CRNN are:

(1) End-to-end training is possible;

(2) No character-level segmentation of the sample data is required, and text sequences of arbitrary length can be recognized;

(3) The model is fast, performs well, and is small (few parameters).

 

2. CRNN model structure

The structure of the CRNN model is as follows:

Matching the essential components of a "text recognition" model reviewed earlier, the CRNN model consists of the following three parts:

(1) Convolutional layer: extract the feature sequence from the input image;

(2) Recurrent layer: predicts the label distribution of the feature sequence obtained from the convolutional layer;

(3) Transcription layer: Convert the label distribution obtained from the recurrent layer into the final recognition result through operations such as deduplication and integration.
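To make the three stages concrete, here is a shape-only walkthrough in plain Python. It is an illustrative sketch, not the paper's exact configuration: the function name `crnn_shapes` and the simplification that the convolutional stack reduces height 32 to 1 and width W to W // 4 (via the VGG-style pooling) are my assumptions for illustration.

```python
def crnn_shapes(width, num_classes=37, hidden=256):
    """Trace tensor shapes through CRNN's three stages for a
    32-pixel-high grayscale image of the given width.
    Simplification: the conv stack reduces height 32 -> 1 and
    width W -> W // 4."""
    # (1) Convolutional layer: (1, 32, W) -> (512, 1, W // 4)
    seq_len = width // 4
    conv_out = (512, 1, seq_len)
    # (2) Recurrent layer: one 512-dim vector per column, processed by a
    # bidirectional LSTM into a seq_len x (2 * hidden) sequence
    rnn_out = (seq_len, 2 * hidden)
    # (3) Transcription layer: per-time-step class scores, fed to CTC
    logits = (seq_len, num_classes)
    return conv_out, rnn_out, logits
```

For example, a 32x100 input yields a sequence of 25 feature vectors, and hence 25 per-step label distributions for the transcription layer to collapse.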

 

The three layers are described below:

(1) Convolutional layer

① Preprocessing

CRNN first rescales the input image: all inputs are scaled to the same height (32 by default), while the width may be arbitrary.
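This preprocessing step can be sketched in a few lines; the helper name `rescale_size` is hypothetical, but the arithmetic (fix the height, scale the width proportionally) is the behavior described above.

```python
def rescale_size(orig_h, orig_w, target_h=32):
    """Compute the size CRNN rescales an input image to: the height is
    fixed (32 px by default) and the width is scaled by the same factor,
    so the aspect ratio is preserved and the width stays arbitrary."""
    new_w = max(1, round(orig_w * target_h / orig_h))
    return target_h, new_w
```

A 64x200 image becomes 32x100, while an image already 32 pixels high keeps its width unchanged.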

② Convolution operation

It consists of the convolutional and max-pooling layers of a standard CNN model, with a structure similar to VGG, as shown in the following figure:

As can be seen from the above figure, the convolutional layer is composed of a series of operations such as convolution, max pooling, and batch normalization.

③ Extract sequence features

The vectors of the extracted feature sequence are generated left to right over the feature map and serve as the input to the recurrent layer. Each feature vector represents the features over a certain width of the image; the default width is 1, i.e., a single column of pixels. Since CRNN has already scaled every input image to the same height, features only need to be extracted column by column along the width. As shown below:
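The column-by-column extraction above can be sketched in plain Python. This is a minimal illustration with nested lists standing in for a tensor; the function name `map_to_sequence` mirrors the layer's name in the article, but the exact implementation is an assumption.

```python
def map_to_sequence(feature_map):
    """Slice a (channels x 1 x width) feature map into a left-to-right
    sequence of column vectors, one per unit of width.
    `feature_map` is a nested list indexed [channel][row][col], where
    the height has already been reduced to a single row."""
    channels = len(feature_map)
    width = len(feature_map[0][0])
    # column c becomes one feature vector of length `channels`
    return [[feature_map[ch][0][c] for ch in range(channels)]
            for c in range(width)]
```

A feature map with 2 channels and width 3 thus yields a sequence of 3 feature vectors, each of dimension 2.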

(2) Recurrent layer

The recurrent layer consists of a bidirectional LSTM recurrent neural network that predicts the label distribution of each feature vector in the feature sequence.

Since an LSTM needs a time dimension, this model treats the width of the sequence as the LSTM's time steps.
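The bidirectional idea can be shown with a stripped-down recurrence in plain Python: run the same cell update left to right and right to left over the sequence, then concatenate the two states at each time step. This is a sketch of the control flow only; `step` is a stand-in for a real LSTM cell, and the names here are hypothetical.

```python
def bidirectional_pass(seq, step, h0):
    """Run a recurrence over the feature sequence in both directions and
    concatenate the per-step states, as a bidirectional LSTM does.
    `step(x, h)` returns the next hidden state (a list of floats)."""
    fwd, h = [], h0
    for x in seq:                # left to right: width = time steps
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(seq):      # right to left over the same sequence
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()
    # one concatenated (forward + backward) state per time step
    return [f + b for f, b in zip(fwd, bwd)]
```

Note that the output has the same length as the input sequence, with each output vector twice the hidden size; this is why a bidirectional LSTM with hidden size 256 produces 512-dimensional per-step features.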

Among these components, the custom "Map-to-Sequence" layer converts between the feature map and the feature sequence, acting as a bridge between the convolutional layer and the recurrent layer so that errors from the recurrent layer can be back-propagated to the convolutional layer.

 

(3) Transcription layer

The transcription layer integrates the results of the feature sequences predicted by the LSTM network and converts them into the final output.

In the CRNN model, a CTC model is attached after the final bidirectional LSTM layer, which makes end-to-end recognition possible. The CTC model (Connectionist Temporal Classification) mainly solves the alignment problem between the input data and the given labels; with it, the network can be trained end to end and output sequences of indeterminate length.

Because natural-scene text images vary in character spacing, deformation, and so on, the same word can appear in many different forms while still being the same text, as shown in the following figure:

CTC is introduced mainly to solve this problem. After training with the CTC model, blank separators and repeated characters are removed from the result (consecutive occurrences of the same character collapse into a single character, while a blank between them means the character genuinely appears more than once), as shown in the following figure:
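The collapsing rule just described is the standard CTC greedy decoding step, which can be written in a few lines of plain Python. This is a minimal sketch: the function name is hypothetical, and `"-"` stands in for the blank symbol.

```python
def ctc_greedy_decode(labels, blank="-"):
    """Collapse a per-time-step label sequence the CTC way: first merge
    consecutive repeats of the same character, then drop the blank
    separator. A blank between two identical characters keeps them as
    two genuine occurrences."""
    out, prev = [], None
    for ch in labels:
        if ch != prev:           # merge consecutive repeats
            if ch != blank:      # drop the blank after merging
                out.append(ch)
        prev = ch
    return "".join(out)
```

For example, the per-step output `s s - t t - a - t - e e` decodes to "state": the doubled `s`, `t`, and `e` collapse, while the blank before the second `t` preserves it as a separate character.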

 

That concludes the introduction to the CRNN text recognition model, which can be used to recognize English, digits, and Chinese. It is generally used in combination with CTPN: CTPN performs text detection, and CRNN performs text recognition.

Here is the result of using CTPN + CRNN to recognize Chinese text (private information hidden):

 

Strongly Recommended Reading

In 2015, Baoguang Shi et al. published the classic CRNN paper "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition", which introduces the ideas and technical principles of CRNN in detail. Reading the paper is recommended for a deeper understanding of the model.

Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab), then reply with the keyword "thesis" to read the classic papers online.

 
