SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition --- Paper reading notes

Paper : https://arxiv.org/abs/2005.10977

Code : https://github.com/Pay20Y/SEED

To address problems such as blurred images, uneven lighting, and incomplete characters, the paper proposes SEED (Semantics Enhanced Encoder-Decoder framework), a model that can recognize low-quality scene text.

(Figure: SEED overview)

SEED basic process

  1. The image is fed into the rectification module, which rectifies irregularly shaped text into horizontal text;
  2. The rectified image is fed into the encoder (CNN + LSTM), which outputs the feature sequence $h$;
  3. Two linear functions transform $h$ into the semantic information $S$;
  4. The semantic information $S$ is used as the initial state of the decoder, and the encoder output $h$ is used as the decoder input to predict the result; a minimal sketch of the full pipeline follows this list.
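The overall data flow can be summarized in a short PyTorch-style sketch. Module names such as `rectifier`, `semantic_module`, and the `init_state=` argument are placeholders for illustration, not the actual classes or signatures in the released code:

```python
# Minimal sketch of the SEED forward pass (hypothetical module names).
import torch
import torch.nn as nn

class SEEDSketch(nn.Module):
    def __init__(self, rectifier, encoder, semantic_module, decoder):
        super().__init__()
        self.rectifier = rectifier              # e.g. TPS-based rectification network
        self.encoder = encoder                  # CNN + BiLSTM, outputs h of shape (B, L, C)
        self.semantic_module = semantic_module  # two linear layers -> S
        self.decoder = decoder                  # attentional GRU

    def forward(self, images, targets=None):
        rectified = self.rectifier(images)      # 1. rectify irregular text
        h = self.encoder(rectified)             # 2. visual features h = (h_1, ..., h_L)
        s = self.semantic_module(h)             # 3. global semantic information S
        # 4. S initializes the decoder state, h is the decoder input
        logits = self.decoder(h, init_state=s, targets=targets)
        return logits, s
```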

SEED General Framework

In the ordinary encoder-decoder and the attention-based encoder-decoder, the decoder relies only on local visual features and does not use global information. SEED adds a semantic module that learns semantic information to serve as the global information of the image. SEED mainly consists of four parts:

  1. Encoder: extracts visual features (CNN + LSTM);
  2. Semantic module: predicts semantic information, which serves as the global information of the image;
  3. Pre-trained language model: generates word embeddings that supervise the predicted semantic information;
  4. Decoder: predicts the result (attention + RNN).

The framework can be applied to any attention-based encoder-decoder model.

Semantic module

(Figure: structure of the semantic module)

Input: the encoder output $h$

Structure: two linear functions (only one is used in the released code)

Output: the semantic information $S$

A pre-trained FastText language model generates word embeddings that supervise the semantic information $S$ via a cosine embedding loss.

Pre-trained language model

SEED uses FastText as the pre-trained language model to generate word embeddings, and uses the generated embeddings to supervise the predicted semantic information. Because FastText builds word vectors from character n-grams, it also alleviates the out-of-vocabulary problem.
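As an illustration of how FastText handles out-of-vocabulary words, the sketch below assumes the `fasttext` Python package and an illustrative pre-trained model file (the file name is not taken from the paper):

```python
# Sketch: generating word embeddings with FastText to supervise the semantic
# information. The model file path is illustrative.
import fasttext

ft = fasttext.load_model("cc.en.300.bin")

# Subword (character n-gram) modelling lets FastText return a vector even for
# words that were never seen during training (the out-of-vocabulary case).
em_known = ft.get_word_vector("coffee")
em_oov   = ft.get_word_vector("covfefe")   # still gets an embedding from its n-grams
print(em_known.shape, em_oov.shape)        # e.g. (300,) (300,)
```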

A visual guide to FastText word embedding

Semantics Enhanced ASTER (SE-ASTER)

Taking ASTER as a concrete instantiation of the proposed framework yields SE-ASTER.

(Figure: architecture of SE-ASTER)

SE-ASTER contains four parts: the rectification module, the encoder, the semantic module, and the decoder.

First, the image is fed into the rectification module, which rectifies the text to be horizontal using a TPS (thin-plate spline) transformation.

Then the rectified image is fed into the encoder, which consists of a 45-layer ResNet and a BiLSTM with 256 hidden units. The encoder outputs $h = (h_1, \ldots, h_L)$ of size $L \times C$, where $L$ is the width of the final CNN feature map and $C$ is the depth.
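A minimal sketch of such an encoder, with a stand-in convolution in place of the 45-layer ResNet (dimensions and pooling are illustrative, not the exact SE-ASTER backbone):

```python
# Sketch of the encoder: a CNN backbone followed by a BiLSTM with 256 hidden
# units per direction, producing a feature sequence h of shape (B, L, C).
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, cnn_out_channels=512, lstm_hidden=256):
        super().__init__()
        # Stand-in for the 45-layer ResNet backbone.
        self.backbone = nn.Conv2d(3, cnn_out_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d((1, None))      # collapse height to 1
        self.rnn = nn.LSTM(cnn_out_channels, lstm_hidden,
                           bidirectional=True, batch_first=True)

    def forward(self, x):                                # x: (B, 3, H, W)
        f = self.backbone(x)                             # (B, C, H, W)
        f = self.pool(f).squeeze(2)                      # (B, C, L)
        f = f.permute(0, 2, 1)                           # (B, L, C)
        h, _ = self.rnn(f)                               # (B, L, 2 * lstm_hidden)
        return h
```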

The encoder output $h$ serves two purposes: it is used by the semantic module to predict the semantic information, and it is fed to the decoder as input.

To predict the semantic information, the feature sequence is first flattened into a $K$-dimensional vector $I$, where $K = L \times C$. Two linear functions then predict the semantic information $S$:

$$S = W_2 \, \sigma (W_1 I + b_1) + b_2 \tag{1}$$

where $\sigma$ is the ReLU activation function.
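A minimal PyTorch sketch of Eq. (1), with illustrative dimensions (the actual layer sizes in the released code may differ):

```python
# Semantic module of Eq. (1): flatten h (L x C) to a K-dimensional vector I,
# then apply two linear layers with ReLU in between.
import torch
import torch.nn as nn

class SemanticModule(nn.Module):
    def __init__(self, feat_len, feat_dim, embed_dim=300, hidden_dim=512):
        super().__init__()
        k = feat_len * feat_dim                      # K = L x C
        self.fc1 = nn.Linear(k, hidden_dim)          # W_1, b_1
        self.fc2 = nn.Linear(hidden_dim, embed_dim)  # W_2, b_2

    def forward(self, h):                            # h: (B, L, C)
        i = h.flatten(start_dim=1)                   # I: (B, K)
        return self.fc2(torch.relu(self.fc1(i)))     # S = W_2 ReLU(W_1 I + b_1) + b_2

sem = SemanticModule(feat_len=25, feat_dim=512)
s = sem(torch.randn(2, 25, 512))                     # -> (2, 300)
```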

The semantic information $S$ is then used as the initial state of the decoder, and the encoder output $h$ is used as the decoder input to predict the result. The decoder is a single-layer attentional GRU with Bahdanau attention.
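A rough sketch of such a decoder is shown below. It is a simplified illustration under assumed dimensions, not the exact SE-ASTER implementation (for instance, it feeds the previous output logits back instead of embedding the previous character):

```python
# Sketch: semantic information S initializes an attentional GRU decoder
# with Bahdanau-style additive attention over the encoder outputs h.
import torch
import torch.nn as nn

class AttnGRUDecoderSketch(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=300, hidden_dim=512, num_classes=97):
        super().__init__()
        self.init_proj = nn.Linear(embed_dim, hidden_dim)        # S -> initial hidden state
        self.attn_h = nn.Linear(feat_dim, hidden_dim, bias=False)
        self.attn_s = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.attn_v = nn.Linear(hidden_dim, 1, bias=False)
        self.cell = nn.GRUCell(feat_dim + num_classes, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, h, s, max_steps=25):                       # h: (B, L, C), s: (B, embed_dim)
        state = torch.tanh(self.init_proj(s))                    # decoder state initialized from S
        y = torch.zeros(h.size(0), self.out.out_features, device=h.device)
        logits = []
        for _ in range(max_steps):
            # Bahdanau (additive) attention over the encoder outputs h
            e = self.attn_v(torch.tanh(self.attn_h(h) + self.attn_s(state).unsqueeze(1)))
            alpha = torch.softmax(e, dim=1)                       # (B, L, 1)
            context = (alpha * h).sum(dim=1)                      # (B, feat_dim)
            state = self.cell(torch.cat([context, y], dim=-1), state)
            y = self.out(state)
            logits.append(y)
        return torch.stack(logits, dim=1)                         # (B, max_steps, num_classes)
```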

Loss Function

Total loss:
$$L = L_{rec} + \lambda L_{sem} \tag{2}$$

$L_{rec}$ is the cross-entropy loss between the prediction probabilities and the ground truth, and $L_{sem}$ is the cosine embedding loss between the predicted semantic information and the word embedding:

$$L_{sem} = 1 - \cos(S, em) \tag{3}$$

where $S$ is the predicted semantic information and $em$ is the word embedding generated by the pre-trained language model.
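A minimal sketch of this loss in PyTorch, with an illustrative value of $\lambda$ (the actual weighting is a hyperparameter of the paper, not fixed here):

```python
# Total loss of Eqs. (2)-(3): cross-entropy on the character predictions plus
# a cosine embedding loss between the predicted semantic information S and the
# FastText word embedding em.
import torch
import torch.nn.functional as F

def seed_loss(logits, targets, s, em, lam=1.0):
    # logits: (B, T, num_classes), targets: (B, T), s/em: (B, embed_dim)
    l_rec = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    l_sem = 1.0 - F.cosine_similarity(s, em, dim=-1).mean()   # Eq. (3)
    return l_rec + lam * l_sem                                 # Eq. (2)
```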

Examples of recognition results:

(Figure: examples of recognition results)

Summary

Since the semantic module contains fully connected layers, the feature dimensions at inference time must match those used during training. Therefore, input images at inference time must be resized to the same fixed size as the training images. For long images, this forced resizing introduces distortion and can hurt the prediction results.
