Paper reading and analysis: Watch, attend and parse: An end-to-end neural network based approach to HMER

HMER Paper Series
1. Paper reading and analysis: When Counting Meets HMER: Counting-Aware Network for HMER
2. Paper reading and analysis: Syntax-Aware Network for Handwritten Mathematical Expression Recognition
3. Paper reading and analysis: A Tree-Structured Decoder for Image-to-Markup Generation
4. Paper reading and analysis: Watch, attend and parse: An end-to-end neural network based approach to HMER
5. Paper reading and analysis: Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition
6. Paper reading and analysis: Mathematical formula recognition using graph grammar
7. Paper reading and analysis: Hybrid Mathematical Symbol Recognition using Support Vector Machines
8. Paper reading and analysis: HMM-based Handwritten Symbol Recognition using On-line and Off-line Features

Paper reading and analysis: Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition

Note: this paper is a good example of how to publish your own work when a parallel work has already appeared on arXiv: the parallel work is acknowledged in the related-work section, the differences are spelled out in the text, and in the final experiments the authors apply their own techniques to the parallel work and improve it as well, which demonstrates their contribution.

Main contributions:

1. An end-to-end neural network, Watch, Attend and Parse (WAP), that avoids explicit symbol segmentation and does not rely on a predefined ME grammar;

2. A fully convolutional network (FCN) as the watcher, which efficiently processes large images and variable input sizes, together with a coverage-based attention model that alleviates the lack-of-coverage problem when training WAP;

3. Attention visualization, which shows how WAP performs symbol segmentation and parses the two-dimensional structure.

WAP network structure


Fig. 1. Architectures of Watch, Attend, Parse for handwritten mathematical expression recognition

The network structure of WAP uses an FCN to extract image features and a GRU decoder with attention.

1. The FCN (the batch normalization layers and ReLU activation layers are not shown):


FCN configurations. The convolutional layer parameters are denoted as “conv(receptive field size)-[number of channels]”. For brevity, the batch normalization layers and ReLU activation functions are not shown.
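To make the interface between the watcher and the parser concrete, here is a minimal numpy sketch (hypothetical shapes, not the paper's code) of how the FCN's final feature map becomes the set of annotation vectors $\mathbf{a}_i$ that the decoder attends over:

```python
import numpy as np

# Hypothetical FCN output for one image: a feature map of shape
# (C, H', W') -- channels x reduced height x reduced width.
C, Hp, Wp = 128, 8, 16
feature_map = np.random.randn(C, Hp, Wp)

# The decoder attends over L = H' * W' annotation vectors a_i,
# each of dimension C, one per spatial location of the feature map.
annotations = feature_map.reshape(C, Hp * Wp).T  # shape (L, C)

L = Hp * Wp
assert annotations.shape == (L, C)
# annotations[i] is the C-dim feature vector at spatial position i.
```

Because the watcher is fully convolutional, $H'$ and $W'$ (and hence $L$) vary with the input image size; the attention mechanism handles this varying number of annotations naturally.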

2. Add attention to the GRU

The most basic GRU:
(Figure: a basic GRU unit)
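For reference, one step of a standard GRU can be sketched in numpy (a minimal illustration with hypothetical weight names, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: update gate z, reset gate r, candidate state h~."""
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde           # new hidden state

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
# Alternate input-to-hidden (d_h, d_in) and hidden-to-hidden (d_h, d_h) weights.
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0 else
          rng.standard_normal((d_h, d_h)) for i in range(6)]
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), *params)
assert h.shape == (d_h,)
```

In WAP the input `x` at each decoding step combines the embedding of the previous output symbol with the attention context vector.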

Add attention mechanism:

Intuitively, the part of the input image that the parser should attend to depends on the words of the output sequence that have already been generated.

(Figure: GRU decoder with attention)

Formally, the coverage-based attention is defined as:

$$
\begin{aligned}
\boldsymbol{\beta}_t &= \sum_{l}^{t-1} \boldsymbol{\alpha}_l \\
\mathbf{F} &= \mathbf{Q} * \boldsymbol{\beta}_t \\
e_{ti} &= \boldsymbol{\nu}_a^{\mathrm{T}} \tanh\left(\mathbf{W}_a \mathbf{h}_{t-1} + \mathbf{U}_a \mathbf{a}_i + \mathbf{U}_f \mathbf{f}_i\right)
\end{aligned}
$$

$\boldsymbol{\beta}_t$: the sum of past attention probability maps;

$\mathbf{f}_i$: the coverage vector of annotation $\mathbf{a}_i$ (the $i$-th vector of $\mathbf{F}$), initialized to zero;

$\mathbf{c}_t$: the context vector, the attention-weighted sum of the annotations.

For comparison, the standard attention mechanism (without the coverage term) is:

$$
\begin{aligned}
e_{ti} &= \boldsymbol{\nu}_a^{\mathrm{T}} \tanh\left(\mathbf{W}_a \mathbf{h}_{t-1} + \mathbf{U}_a \mathbf{a}_i\right) \\
\alpha_{ti} &= \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})} \\
\mathbf{c}_t &= \sum_{i}^{L} \alpha_{ti} \mathbf{a}_i
\end{aligned}
$$
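The attention equations above can be sketched in numpy. Note this is an illustrative simplification: in the paper $\mathbf{F} = \mathbf{Q} * \boldsymbol{\beta}_t$ is a convolution over the 2-D accumulated attention map, whereas here a per-position vector `q` (a 1×1 "filter") stands in for it, and all names and dimensions are hypothetical:

```python
import numpy as np

def softmax(e):
    ex = np.exp(e - e.max())  # subtract max for numerical stability
    return ex / ex.sum()

def coverage_attention(h_prev, A, beta, q, Wa, Ua, Uf, va):
    """One attention step with a coverage vector.
    A: (L, C) annotations; beta: (L,) accumulated past attention."""
    F = np.outer(beta, q)                                  # (L, d_f) coverage features
    e = np.tanh(h_prev @ Wa.T + A @ Ua.T + F @ Uf.T) @ va  # (L,) energies
    alpha = softmax(e)                                     # attention probabilities
    c = alpha @ A                                          # context vector c_t, (C,)
    return alpha, c

rng = np.random.default_rng(1)
L, C, d_h, d_att, d_f = 6, 5, 4, 7, 3
alpha, c = coverage_attention(
    rng.standard_normal(d_h), rng.standard_normal((L, C)),
    np.zeros(L), rng.standard_normal(d_f),
    rng.standard_normal((d_att, d_h)), rng.standard_normal((d_att, C)),
    rng.standard_normal((d_att, d_f)), rng.standard_normal(d_att))
assert np.isclose(alpha.sum(), 1.0) and c.shape == (C,)
```

With `beta` at zero (as at the first decoding step) the coverage term vanishes and this reduces to the standard attention; as `beta` accumulates, the coverage features let the model penalize regions it has already attended to.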

Five spatial relationships learned:


Fig. 5. The model learning procedure of determining five spatial relationships (horizontal, vertical, subscript, superscript and inside) through attention visualization.

experiment

1. Attention visualization with and without the coverage vector:


Examples of attention with and without the coverage vector. The recognized LaTeX sequences of the right side of the equation are printed below each image (the white areas in the images indicate the attended regions, and the underlined text in the LaTeX sequences indicates the corresponding words).


Attention visualization of a tested mathematical expression image whose LaTeX sequence is “( \sin ( x ) ) ^ { 2 } + ( \cos ( x ) ) ^ { 2 }”.

2. Experiment on CROHME 2014: CORRECT EXPRESSION RECOGNITION RATE (IN %)

(Table: expression recognition rates on CROHME 2014)

3. Experiment on CROHME 2016: CORRECT EXPRESSION RECOGNITION RATE (IN %)

(Table: expression recognition rates on CROHME 2016)

4. Comparison with the parallel work:

WYGIWYS: the authors found a parallel work similar to this study, submitted as an arXiv preprint [48] and named WYGIWYS (What You Get Is What You See), which decompiles a machine-printed mathematical expression into presentational markup.

The comparison adds, in turn, the deep FCN, the coverage-based attention, and trajectory information to WYGIWYS, each of which improves its results.

(Table: comparison with WYGIWYS)

References:

[1]:J. Zhang, J. Du and L. Dai, “Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition,” 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018, pp. 2245-2250, doi: 10.1109/ICPR.2018.8546031.

[2]: Jianshu Zhang, Jun Du, Shiliang Zhang, Dan Liu, Yulong Hu, Jinshui Hu, Si Wei, Lirong Dai, Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition, Pattern Recognition, Volume 71, 2017, Pages 196-206, ISSN 0031-3203.


Origin: blog.csdn.net/KPer_Yang/article/details/129483137