CVPR 2021 | ABINet: Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Paper title: Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition (ABINet)

代码:https://github.com/FangShancheng/ABINet

Paper link: https://arxiv.org/abs/2103.06495

1. Abstract

However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge.

Reasons why linguistic knowledge is under-exploited in existing methods:

implicit language modeling;

unidirectional feature representation;

language model with noisy input.

Therefore, the authors propose:

blocking the gradient flow between the vision model and the language model to enforce explicit language modeling;

a bidirectional cloze network (BCN) as the language model;

an execution manner of iterative correction to reduce the impact of noisy input.

2. Introduction

Autonomous: Apply the autonomous principle to scene text recognition (STR) and decouple the recognition model into a vision model (VM) and a language model (LM), with each sub-model acting as an independent functional unit. In implicitly coupled models, by contrast, it is agnostic whether and how the LM learns character relationships, and it is infeasible to directly pre-train the LM on large-scale unlabeled text to obtain rich prior knowledge.

Bidirectional: A bidirectional LM captures roughly twice as much contextual information as a unidirectional one. A straightforward way to build a bidirectional model is to ensemble a left-to-right model and a right-to-left model, but since each of its linguistic features is still represented in a single direction, such an ensemble is less powerful and doubles the computation and parameters. BERT introduces deep bidirectional representation via masked modeling of text tokens, but directly applying BERT to STR would require masking every character of a text instance, which is extremely expensive because only one character can be masked at a time.

Iterative: An LM executed iteratively can refine the predictions from both visual and linguistic cues, which existing methods do not explore. To adapt to the Transformer architecture, autoregression is abandoned in favor of parallel prediction for efficiency. However, parallel prediction still suffers from noisy input: errors in the VM output directly affect the accuracy of the LM.

First, a decoupling method between the VM and the LM is explored by blocking gradient flow (BGF) (Fig. 1b), which forces the LM to learn linguistic rules explicitly. Furthermore, the VM and the LM are both autonomous units and can be pre-trained from images and text respectively. Second, a novel bidirectional cloze network (BCN) is designed as the LM, which eliminates the dilemma of combining two unidirectional models (Fig. 1c). The BCN is conditioned on the left and right contexts simultaneously, controlling access to the characters on both sides by specifying attention masks; in addition, access across time steps is not allowed, to prevent information leakage. Third, a manner of executing the LM with iterative correction is proposed (Fig. 1b). By repeatedly feeding the outputs of ABINet back into the LM, predictions are refined progressively and the length-misalignment problem is alleviated to a certain extent.

3. Proposed Method

1、 Vision Model

ResNet + position attention module

ResNet: 5 residual blocks in total, with downsampling after the 1st and 3rd blocks.

Position attention module: transcribes visual features into character probabilities in parallel.

The attention is computed as Fv = softmax(Q Kᵀ / √C) V, where Q is the positional encoding of the character order (T × C, with T the length of the character sequence), K is obtained by passing the visual features through a mini U-Net, and V is an identity mapping of the visual features.
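A minimal PyTorch-style sketch of this position attention, with the mini U-Net replaced by a single convolution for brevity; names and shapes are illustrative, not taken from the official code:

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of the position attention head: Q = positional embeddings of the
    character order, K = visual features passed through a (here simplified)
    mini U-Net, V = the visual features themselves (identity mapping)."""
    def __init__(self, max_len=26, channels=512):
        super().__init__()
        self.q_embed = nn.Parameter(torch.randn(max_len, channels))   # Q: (T, C)
        self.k_encoder = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the mini U-Net

    def forward(self, feat):                                  # feat: (B, C, H, W) visual features
        B, C, H, W = feat.shape
        k = self.k_encoder(feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        v = feat.flatten(2).transpose(1, 2)                   # (B, HW, C), identity mapping
        q = self.q_embed.unsqueeze(0).expand(B, -1, -1)       # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, T, HW)
        return attn @ v                                       # (B, T, C): one feature per character slot
```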

2、Language Model

Autonomous Strategy

1) The LM is treated as an independent spelling-correction model, taking the probability vectors of characters as input and outputting the probability distributions of the expected characters. 2) The training gradient flow is blocked at the input vectors (BGF). 3) The LM can be trained separately on unlabeled text data.
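A minimal sketch of the blocked gradient flow (BGF), assuming hypothetical vision_model and language_model modules; the essential point is only that the LM input is detached from the VM's computation graph:

```python
import torch

def forward_with_bgf(vision_model, language_model, images):
    """Autonomous strategy: the LM sees only character probability vectors,
    and the gradient is blocked (BGF) so no loss flows from the LM back into the VM."""
    vis_logits = vision_model(images)               # (B, T, num_classes)
    vis_probs = torch.softmax(vis_logits, dim=-1)   # character probability vectors
    lm_logits = language_model(vis_probs.detach())  # detach() blocks the gradient flow
    return vis_logits, lm_logits                    # both branches are supervised separately
```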

Bidirectional Representation

Given a text string y = (y1, …, yn) with text length n and number of character classes c, the conditional probability of yi is P(yi | y1, …, yi−1, yi+1, …, yn) in the bidirectional model, and P(yi | y1, …, yi−1) or P(yi | yi+1, …, yn) in the unidirectional models.

The available entropy of the bidirectional representation can be quantified as Hy = (n−1) log c, whereas each unidirectional representation only provides 1/2 Hy.
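A quick back-of-the-envelope example of what this bound means (numbers chosen for illustration, using the natural logarithm): for a text of length n = 7 over c = 36 character classes,

```latex
H_y = (n-1)\log c = 6\log 36 \approx 21.5 \text{ nats},
\qquad
\tfrac{1}{2}H_y \approx 10.8 \text{ nats},
```

so on average a single unidirectional model can access only half of the usable contextual information.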

One option is BERT-style masked language modeling (MLM), replacing yi with [MASK]. However, the MLM would have to be called n times per text instance, masking one character at a time, which is extremely inefficient. Instead, BCN is proposed, which specifies attention masks rather than masking the input characters:

BCN is a variant of an L-layer transformer decoder. The character vectors are fed into the multi-head attention blocks rather than into the first layer of the network. In addition, the attention masks in multi-head attention are designed to prevent each character from "seeing itself".

BCN does not use self-attention, in order to avoid information leakage across time steps. The attention operation inside each block can be formalized as:

Mij = 0 if i ≠ j, and −∞ if i = j

Ki = Vi = P(yi) Wl

Fmha = softmax(Q Kᵀ / √C + M) V

In the first layer, Q is the positional encoding of the character order; in subsequent layers it is the output of the previous layer. K and V are obtained from the character probabilities P(yi) through the linear mapping matrix Wl. M is the attention-mask matrix, which prevents attending to the current character. Stacking BCN layers into a deep architecture yields the bidirectional representation Fl of the text y.
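A simplified single-head sketch of this cloze attention in PyTorch (illustrative shapes and names; the repository linked above contains the reference implementation):

```python
import torch

def bcn_cloze_attention(q, char_probs, w_l):
    """One simplified BCN attention step.
    q:          (B, T, C) positional encodings (first layer) or previous-layer output
    char_probs: (B, T, num_classes) character probability vectors P(y_i)
    w_l:        (num_classes, C) linear mapping producing K and V
    The diagonal mask forbids each position from attending to itself, so y_i is
    predicted only from the characters on both sides of position i."""
    B, T, C = q.shape
    k = v = char_probs @ w_l                              # K_i = V_i = P(y_i) W_l
    mask = torch.zeros(T, T)
    mask.fill_diagonal_(float('-inf'))                    # M_ii = -inf, M_ij = 0 otherwise
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5 + mask, dim=-1)
    return attn @ v                                       # (B, T, C) bidirectional features
```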

BCN is able to elegantly learn bidirectional representations that are more powerful than an ensemble of unidirectional representations. In addition, thanks to its transformer-like architecture, BCN can be computed independently and in parallel. Furthermore, it is more efficient than an ensemble model, since only half of the computation and parameters are needed.

Iterative Correction

To cope with noisy input, the LM is executed iteratively (as shown in Figure 2): it is run M times, with a different assignment of y at each iteration. For the first iteration (i = 1), y is the probability prediction of the VM; for subsequent iterations (i ≥ 2), y is the probability prediction of the fusion model from the previous iteration. In this way, the LM iteratively revises the visual prediction.
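A sketch of the iteration loop, assuming a hypothetical language_model that maps character probabilities to linguistic features and a fuse function that turns visual and linguistic features into fused logits (a simplification of the pipeline; see the fusion subsection below):

```python
import torch

def iterative_correction(vis_feat, vis_probs, language_model, fuse, num_iters=3):
    """Run the LM M times: in the first iteration it reads the VM's probability
    prediction, afterwards the fused prediction from the previous iteration."""
    probs = vis_probs                                # iteration 1: y = VM prediction
    for _ in range(num_iters):
        lang_feat = language_model(probs.detach())   # BGF: the LM input is detached
        fused_logits = fuse(vis_feat, lang_feat)     # combine the two modalities
        probs = torch.softmax(fused_logits, dim=-1)  # y for the next iteration
    return fused_logits
```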

The length-misalignment issue is caused by the unavoidable padding masks, which filter out context beyond the text length. The iterative LM alleviates this problem by fusing the visual and linguistic features multiple times, so the predicted text length is gradually refined.

3、Fusion

The vision model is trained on images and the language model on text, so their features come from different modalities. To align the visual features and linguistic features, a simple gating mechanism is used for the final decision:

G = σ([Fv, Fl] Wf)

Ff = G ⊙ Fv + (1 − G) ⊙ Fl
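A minimal sketch of such a gated fusion (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """The gate G decides, per element, how much to trust the visual feature F_v
    versus the linguistic feature F_l: F_f = G * F_v + (1 - G) * F_l."""
    def __init__(self, channels=512):
        super().__init__()
        self.w_f = nn.Linear(2 * channels, channels)

    def forward(self, f_v, f_l):                      # both (B, T, C)
        gate = torch.sigmoid(self.w_f(torch.cat([f_v, f_l], dim=-1)))
        return gate * f_v + (1 - gate) * f_l          # fused feature F_f
```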

4、Supervised Training

Multi-task objectives are used for end-to-end training:

L = λv·Lv + (λl/M)·Σ_{i=1..M} Ll,i + (1/M)·Σ_{i=1..M} Lf,i

where Lv, Ll and Lf are the cross-entropy losses computed from Fv, Fl and Ff respectively, M is the number of iterations, and λv, λl are balancing factors.
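A sketch of how such a multi-task loss could be assembled, assuming per-iteration logits have been collected during the forward pass (variable names are hypothetical):

```python
import torch.nn.functional as F

def multitask_loss(vis_logits, lang_logits_list, fused_logits_list, targets,
                   lambda_v=1.0, lambda_l=1.0):
    """Cross-entropy on the vision branch, plus the language and fusion branches
    averaged over the M correction iterations. targets: (B, T) character indices."""
    ce = lambda logits: F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    M = len(lang_logits_list)
    loss_v = ce(vis_logits)
    loss_l = sum(ce(l) for l in lang_logits_list) / M
    loss_f = sum(ce(l) for l in fused_logits_list) / M
    return lambda_v * loss_v + lambda_l * loss_l + loss_f
```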

5、Semi-supervised Ensemble Self-training

The basic idea of self-training is to first generate pseudo-labels with the model itself and then retrain the model with the additional pseudo-labels. The key issue therefore lies in constructing high-quality pseudo-labels.

1) The minimum confidence over the characters of a text instance is taken as the certainty of that text. 2) The iterative predictions of each character are treated as an ensemble to smooth the impact of noisy labels. Filtering function:

C = min_{1≤t≤T} (1/M) Σ_{m=1..M} max Pm(yt)

Here C is the minimum certainty of the text instance, and Pm(yt) is the probability distribution of the t-th character at the m-th iteration.
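A sketch of this filtering rule, assuming probs_per_iter stacks the per-iteration character distributions of one text instance (the threshold value is illustrative):

```python
import torch

def keep_pseudo_label(probs_per_iter, threshold=0.9):
    """probs_per_iter: (M, T, num_classes), the distributions P_m(y_t) from the
    M correction iterations. Per-character confidences are averaged over the
    iterations (ensemble), and the minimum over characters is the certainty C."""
    max_probs = probs_per_iter.max(dim=-1).values  # (M, T): confidence of each character
    certainty = max_probs.mean(dim=0).min()        # C = min_t (1/M) sum_m max P_m(y_t)
    return certainty.item() >= threshold           # keep the pseudo-label only if C is high enough
```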

4. Experiment

