Paper : https://arxiv.org/abs/2005.10977
To address blurred images, uneven illumination, and incomplete characters, the paper proposes SEED (Semantics Enhanced Encoder-Decoder framework), a model for recognizing low-quality scene text.
SEED basic process
- The image is fed into the rectification module, which corrects irregularly shaped text into horizontal text;
- The rectified image is fed into the encoder (CNN + LSTM), which outputs the feature sequence h;
- Two linear functions convert h into the semantic information S;
- The semantic information S serves as the initial state of the decoder, and the encoder output h serves as the decoder input to predict the result.
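The dataflow above can be sketched end to end. The functions below are hypothetical stand-ins that return random tensors with plausible shapes; they are not SEED's implementation, and the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, EMB = 25, 512, 300   # feature-map width, depth, embedding dim (illustrative)

def rectify(image):
    """Stand-in for the TPS rectification module (irregular -> horizontal)."""
    return image

def encode(image):
    """Stand-in for the CNN + LSTM encoder: returns the feature sequence h."""
    return rng.standard_normal((L, C))

def semantic_module(h):
    """Stand-in for the two linear functions producing semantic information S."""
    return rng.standard_normal(EMB)

image = rng.standard_normal((64, 256, 3))   # toy input image
h = encode(rectify(image))                  # visual features, shape (L, C)
S = semantic_module(h)                      # global semantic information
# Decoding: S initializes the decoder's hidden state; h is attended at each step.
```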
SEED General Framework
In a plain encoder-decoder, and even in an attention-based encoder-decoder, the decoder relies only on local features to decode and does not use global information. SEED adds a semantic module that learns semantic information to serve as the global information of the image. SEED consists of four main parts:
- Encoder: extracts visual features; CNN + LSTM
- Semantic module: predicts the semantic information, which serves as the global information of the image
- Pre-trained language model: generates word embeddings that supervise the predicted semantic information
- Decoder: predicts the result; attention + RNN
The framework can be applied to any attention-based encoder-decoder model.
Semantic module
Input: the encoder output h
Structure: two linear functions (only one is used in the released code)
Output: the semantic information S
The word embeddings generated by the pre-trained language model FastText are used to supervise the semantic information S, with a cosine embedding loss.
Pre-trained language model
SEED uses FastText as the pre-trained language model to generate word embeddings, and uses the generated embeddings to supervise the predicted semantic information. FastText's subword modeling also mitigates the out-of-vocabulary problem.
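FastText represents a word as the average of its character n-gram vectors, which is why it can embed words outside the training vocabulary. The toy numpy sketch below imitates that mechanism with hashed pseudo-random n-gram vectors; it is an illustration of the idea, not real FastText:

```python
import numpy as np
from zlib import crc32

DIM = 300  # embedding dimension, as in common FastText models

def ngrams(word, n=3):
    """Character n-grams with FastText-style boundary markers < and >."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def toy_embedding(word):
    """Average of deterministic pseudo-random n-gram vectors (toy stand-in)."""
    vecs = [np.random.default_rng(crc32(g.encode())).standard_normal(DIM)
            for g in ngrams(word)]
    return np.mean(vecs, axis=0)

# Even a misspelled, out-of-vocabulary word gets an embedding close to the
# original word's, because most of their character n-grams are shared:
a, b = toy_embedding("hello"), toy_embedding("helllo")
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```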
Semantics Enhanced ASTER (SE-ASTER)
SE-ASTER takes ASTER as a concrete instantiation of the proposed framework. It contains four parts: the rectification module, the encoder, the semantic module, and the decoder.
First, the image is fed into the rectification module, which uses a Thin-Plate-Spline (TPS) transformation to correct the image to horizontal.
Then the rectified image is fed into the encoder, a 45-layer ResNet followed by a BiLSTM with 256 hidden units. The encoder outputs $h = (h_1, \dots, h_L)$ of size $L \times C$, where $L$ is the width of the final CNN feature map and $C$ is its depth.
The encoder output $h$ serves two purposes: predicting the semantic information through the semantic module, and acting as the input of the decoder.
To predict the semantic information, the feature sequence is first flattened into a $K$-dimensional vector $I$, with $K = L \times C$; two linear functions then predict the semantic information $S$:
$$S = W_2 \, \sigma(W_1 I + b_1) + b_2 \tag{1}$$
where $\sigma$ is the ReLU activation function.
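Equation (1) is just two linear layers with a ReLU in between. A minimal numpy sketch, with illustrative dimensions (only the 300-dimensional output is meant to match a FastText embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, HID, EMB = 25, 64, 256, 300   # illustrative sizes, not the paper's exact config

h = rng.standard_normal((L, C))     # encoder output
I = h.reshape(-1)                   # flatten: K = L * C
K = I.size

W1, b1 = rng.standard_normal((HID, K)) * 0.01, np.zeros(HID)
W2, b2 = rng.standard_normal((EMB, HID)) * 0.01, np.zeros(EMB)

# Eq. (1): S = W2 * ReLU(W1 * I + b1) + b2
S = W2 @ np.maximum(W1 @ I + b1, 0) + b2
```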
The semantic information $S$ then serves as the initial state of the decoder, and the encoder output $h$ as the decoder input to predict the result. The decoder is a single-layer GRU with Bahdanau (additive) attention.
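One decoding step with Bahdanau (additive) attention can be sketched as follows; the sizes and random projections are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(1)
L, C, HID = 25, 512, 256            # illustrative sizes

h = rng.standard_normal((L, C))     # encoder output, attended at every step
s_prev = rng.standard_normal(HID)   # previous GRU hidden state
                                    # (at step 0 it would be initialized from S)

# Additive attention: e_i = v^T tanh(W_h h_i + W_s s_prev)
W_h = rng.standard_normal((C, HID)) * 0.05
W_s = rng.standard_normal((HID, HID)) * 0.05
v = rng.standard_normal(HID) * 0.05

e = np.tanh(h @ W_h + s_prev @ W_s) @ v
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                # softmax: attention weights over L positions
context = alpha @ h                 # context vector fed to the GRU cell
```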
Loss Function
Total loss:
$$L = L_{rec} + \lambda L_{sem} \tag{2}$$
$L_{rec}$ is the cross-entropy loss between the predicted probabilities and the ground truth, and $L_{sem}$ is the cosine embedding loss between the predicted semantic information and the word embedding:
$$L_{sem} = 1 - \cos(S, em) \tag{3}$$
where $S$ is the predicted semantic information and $em$ is the word embedding generated by the pre-trained language model.
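Both loss terms are standard and easy to sketch in numpy with toy values; the weight `lam` and all sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
T, V, EMB = 8, 97, 300              # sequence length, vocab size, embedding dim
lam = 1.0                           # loss weight lambda (illustrative value)

logits = rng.standard_normal((T, V))   # decoder outputs, one row per time step
gt = rng.integers(0, V, size=T)        # ground-truth character indices
S = rng.standard_normal(EMB)           # predicted semantic information
em = rng.standard_normal(EMB)          # FastText word embedding of the GT word

# L_rec: cross entropy between predicted probabilities and the ground truth
log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
L_rec = -log_p[np.arange(T), gt].mean()

# L_sem (Eq. 3): cosine embedding loss between S and em
cos = S @ em / (np.linalg.norm(S) * np.linalg.norm(em))
L_sem = 1.0 - cos

# Eq. (2): total loss
L_total = L_rec + lam * L_sem
```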
Examples of recognition results are shown in the paper.
Summary
Because the semantic module contains fully connected layers, the feature dimensions at inference must match those used during training, so input images must be resized to the fixed training size. For long images, this forced scaling distorts the text and degrades the prediction.
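The distortion from forcing a long image to the fixed training size is easy to quantify; the training size below is an assumption for illustration, not a value from the paper:

```python
# Assumed fixed training input size (H x W); illustrative, not from the paper.
TRAIN_H, TRAIN_W = 64, 256

def horizontal_squash(h, w):
    """How much narrower characters become, relative to their height,
    after forcing an (h, w) image to the fixed training size."""
    return (TRAIN_W / w) / (TRAIN_H / h)

# A long text line (48 x 720) keeps only ~27% of its character width,
# which is exactly the kind of loss the summary above describes:
ratio = horizontal_squash(48, 720)
```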