Introduction to the Principle of CRNN for OCR

4. Introduction to the CRNN Principle

This article is a translated summary based on the paper "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition". CRNN can recognize text images of different sizes and different lengths. The paper also applies the model to musical scores; in principle the model can also recognize Chinese effectively, since it does not depend on any particular language.

4.1.1. CRNN Summary

Convolutional Recurrent Neural Network (CRNN), as the name suggests, is a combination of a CNN and an RNN, with a CTC layer added at the end.

4.1.2. CRNN Model Structure

As shown in the figure, the model consists of three parts; from bottom to top they are the convolutional layers, the RNN layer, and the transcription layer. The convolutional layers extract features from the image. The RNN layer uses LSTMs. Between the convolutional layers and the RNN layer there is a Map-to-Sequence layer. The transcription layer comes in two variants, lexicon-based and lexicon-free, and converts the RNN output into a label sequence. The model structure is shown below.

[Figure: CRNN network architecture]

[Figure: detailed network configuration]
In pooling layers 3 and 4, 1×2 rectangular windows are used instead of square windows. This fine-grained adjustment widens the feature maps, producing longer feature sequences.
Batch normalization is used.
All layers of CRNN use shared (convolutional or recurrent) weights and there are no fully connected layers, so the model has few parameters and a small memory footprint.
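To see how the rectangular pooling windows lengthen the feature sequence, here is a small sketch of the standard convolution/pooling output-size arithmetic, walked through a CRNN-style configuration. The exact kernel/stride/padding values below follow a common open-source implementation and are an assumption, not taken verbatim from this article:

```python
def out_size(hw, kernel, stride, pad=(0, 0)):
    """Standard conv/pool output-size formula: floor((n + 2p - k) / s) + 1."""
    h = (hw[0] + 2 * pad[0] - kernel[0]) // stride[0] + 1
    w = (hw[1] + 2 * pad[1] - kernel[1]) // stride[1] + 1
    return (h, w)

# Walk a 32x100 input through the stages that change spatial size.
# The 3x3 convolutions with padding 1 keep H and W unchanged, so they are omitted.
hw = (32, 100)
hw = out_size(hw, kernel=(2, 2), stride=(2, 2))              # pool1 -> 16x50
hw = out_size(hw, kernel=(2, 2), stride=(2, 2))              # pool2 -> 8x25
hw = out_size(hw, kernel=(2, 2), stride=(2, 1), pad=(0, 1))  # pool3 (rectangular) -> 4x26
hw = out_size(hw, kernel=(2, 2), stride=(2, 1), pad=(0, 1))  # pool4 (rectangular) -> 2x27
hw = out_size(hw, kernel=(2, 2), stride=(1, 1))              # final 2x2 conv -> 1x26
print(hw)  # (1, 26): height 1, so each of the 26 columns becomes one sequence frame
```

Note how the rectangular pooling halves the height but keeps the width almost unchanged, which is exactly what yields the longer feature sequence.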

4.1.3. CNN Feature Extraction

1. All fully connected layers are removed.
2. All input images are scaled to the same size; in this model the input is 100×32, which improves training efficiency.
3. Features are read from the feature maps column by column, each column 1 pixel wide. As shown below, the feature sequence is the concatenation of these columns.
[Figure: feature sequence extracted column by column from the feature maps]
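The column-wise readout described above amounts to a simple reshape of the final feature maps. A minimal NumPy sketch, assuming a C×H×W layout with H = 1 (the channel count and sequence length are illustrative values):

```python
import numpy as np

# Final CNN output: C channels, height 1, width T (one column per 1-pixel-wide slice).
C, H, T = 512, 1, 26
feature_maps = np.random.rand(C, H, T)

# Map-to-Sequence: each column of the feature maps becomes one frame of the
# sequence fed to the LSTM, i.e. a T-long sequence of C-dimensional vectors.
sequence = feature_maps.squeeze(1).transpose(1, 0)  # shape (T, C)
print(sequence.shape)  # (26, 512)

# Frame t is exactly the t-th column, concatenated over all channels.
assert np.array_equal(sequence[3], feature_maps[:, 0, 3])
```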

4.1.4. Transcription Layer (CTC)

The transcription layer converts the per-frame outputs of the LSTM layer into the final label sequence; the technique used is CTC.
CTC (Connectionist Temporal Classification) solves the problem that the input sequence and the output sequence are difficult to align one-to-one.
For an input/output pair (X, Y), the goal of CTC is to maximize the probability

p(Y | X) = Σ_{π ∈ B⁻¹(Y)} Π_{t=1..T} p_t(π_t | X)

To explain: in an RNN + CTC model, p_t is the probability distribution output by the RNN at time step t. The product multiplies the per-character probabilities along one alignment path π; the sum adds up multiple such paths. As noted above, the CTC alignment between input and output is many-to-one: for example, the paths `he-l-lo-` and `hee-l-lo` (where `-` is the blank) both correspond to the output "hello", i.e. they are two different paths for the same output. The conditional probabilities of all paths that map to the output are therefore summed.
The lexicon-based model builds on the CTC result above: after obtaining the result, it is looked up in a dictionary to further improve accuracy. The lexicon-free model simply takes the highest-probability sequence as the result, omitting the dictionary-lookup step.
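The many-to-one mapping can be made concrete with a tiny sketch: the collapse function B merges repeated characters and then removes blanks, and the CTC probability sums the path products over every path that collapses to the target. The alphabet, per-timestep probabilities, and sequence length below are made-up toy values:

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """CTC mapping B: merge consecutive repeats, then drop blanks."""
    out = []
    for ch in path:
        if not out or ch != out[-1]:
            out.append(ch)
    return "".join(c for c in out if c != BLANK)

# Both example paths from the text collapse to "hello".
assert collapse("he-l-lo-") == "hello"
assert collapse("hee-l-lo") == "hello"

# Brute-force p(Y|X): sum, over all length-T paths that collapse to Y, of the
# product of per-timestep probabilities p_t (toy 3-symbol alphabet).
alphabet = [BLANK, "a", "b"]
T = 3
# p_t[t][s]: probability of emitting symbol s at time t (each row sums to 1).
p_t = [
    {"-": 0.6, "a": 0.3, "b": 0.1},
    {"-": 0.2, "a": 0.7, "b": 0.1},
    {"-": 0.5, "a": 0.4, "b": 0.1},
]

def ctc_prob(target):
    total = 0.0
    for path in product(alphabet, repeat=T):
        if collapse(path) == target:
            prob = 1.0
            for t, s in enumerate(path):
                prob *= p_t[t][s]
            total += prob
    return total

print(round(ctc_prob("a"), 4))  # 0.645
```

Note that the path `a-a` is excluded from p("a" | X): it collapses to "aa", since the blank separates two runs of "a". Real implementations use dynamic programming rather than this exponential enumeration.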

4.1.5. Model Training

The model input is a pair (I, l), where I is the input image and l is the ground-truth text. Training minimizes the following objective:

O = − Σ_{(I_i, l_i) ∈ X} log p(l_i | y_i)

where y_i is the sequence produced from I_i by the CNN and RNN. The objective needs no manual preprocessing or character segmentation; it is computed directly from the inputs and outputs, so the model is end-to-end trainable.
Stochastic gradient descent (SGD) is used for training.
ADADELTA is used to adjust the learning rate automatically.
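ADADELTA adapts a per-parameter step size from running averages of squared gradients and squared updates, so no global learning rate needs hand-tuning. A minimal scalar sketch of the update rule, using typical default values for the decay rate ρ and the constant ε (these defaults are an assumption, not taken from this article):

```python
import math

def adadelta_step(grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA update: returns the parameter delta, mutating the state.

    state holds E[g^2] and E[dx^2], the decayed running averages of the
    squared gradient and the squared update.
    """
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad * grad
    delta = -math.sqrt(state["Edx2"] + eps) / math.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta * delta
    return delta

# Minimize f(x) = x^2 (gradient 2x) starting from x = 3 -- no learning rate set.
x, state = 3.0, {"Eg2": 0.0, "Edx2": 0.0}
for _ in range(500):
    x += adadelta_step(2 * x, state)
print(abs(x) < 3.0)  # True: the iterate moves toward the minimum at 0
```

The step size starts small and grows as the running averages fill in, which is the "automatic adjustment" the text refers to.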

4.1.6. Musical Score Recognition

Because there are fewer training samples, the model was trimmed: convolutional layers 4 and 6 were removed, and the 2-layer bidirectional LSTM was replaced with a 2-layer unidirectional LSTM.
The model also achieved excellent results on musical score recognition.



Source: blog.csdn.net/zephyr_wang/article/details/104445744