Paper Reading and Analysis: Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition

HMER paper series
1. Paper reading and analysis: When Counting Meets HMER: Counting-Aware Network for HMER
2. Paper reading and analysis: Syntax-Aware Network for Handwritten Mathematical Expression Recognition
3. Paper reading and analysis: A Tree-Structured Decoder for Image-to-Markup Generation
4. Paper reading and analysis: Watch, Attend and Parse: An End-to-end Neural Network Based Approach to HMER
5. Paper reading and analysis: Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition
6. Paper reading and analysis: Mathematical Formula Recognition Using Graph Grammar
7. Paper reading and analysis: Hybrid Mathematical Symbol Recognition Using Support Vector Machines
8. Paper reading and analysis: HMM-Based Handwritten Symbol Recognition Using On-line and Off-line Features

Main contributions:

1. Applies DenseNet, which was popular at the time, as the encoder for the HMER task;

2. Uses a multi-scale attention model that fuses high-resolution, low-semantic features with low-resolution, high-semantic features, so that small symbols such as decimal points can be recognized.

Core model:

1. Dense encoder with multi-scale attention:


k: the growth rate of each dense block.

D: the number of convolution layers in each block. For example, D = 32 means each block has 16 1×1 convolution layers and 16 3×3 convolution layers (a batch normalization layer and a ReLU activation layer follow each convolution layer).
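The channel bookkeeping implied by the growth rate k and depth D can be sketched in pure Python. This is an illustrative toy (the helper name, the growth rate 24, and the 64 input channels are made-up values, not taken from the paper):

```python
# A minimal sketch of the channel counts in a bottleneck dense block:
# D = 32 layers means 16 (1x1, 3x3) convolution pairs, and each pair
# appends k new feature maps to the running concatenation.

def dense_block_channels(c_in, k, depth):
    """Return the channel count after each (1x1, 3x3) pair.

    c_in  : channels entering the block
    k     : growth rate (new channels added by every 3x3 conv)
    depth : D, the total number of conv layers (half 1x1, half 3x3)
    """
    pairs = depth // 2              # e.g. D = 32 -> 16 bottleneck pairs
    channels = [c_in]
    for _ in range(pairs):
        # The 1x1 conv compresses the input (BN + ReLU after it); the
        # 3x3 conv then emits k maps concatenated onto the input.
        channels.append(channels[-1] + k)
    return channels

# With an illustrative growth rate k = 24 and D = 32, a block receiving
# 64 channels outputs 64 + 16 * 24 = 448 channels.
print(dense_block_channels(64, 24, 32)[-1])
```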

Decoding pipeline:

(1) A branch splits off from the standard DenseNet after the first pooling layer and is processed by dense block B, producing the high-resolution annotation $\mathbf{B}\in \mathbb{R}^{2H \times 2W \times C'}$ (twice the spatial resolution of the main-branch output $\mathbf{A}$).

(2) A GRU computes the prediction $\hat{\mathbf{s}}_t$ at step $t$:
$$\hat{\mathbf{s}}_t=\mathrm{GRU}\left(\mathbf{y}_{t-1},\mathbf{s}_{t-1}\right)$$
(3) Apply a single-scale coverage-based attention model to $\mathbf{A}$ and $\mathbf{B}$ separately:
$$\mathbf{cA}_{t}=f_{\mathrm{catt}}\left(\mathbf{A},\hat{\mathbf{s}}_{t}\right),\qquad \mathbf{cB}_{t}=f_{\mathrm{catt}}\left(\mathbf{B},\hat{\mathbf{s}}_{t}\right)$$
$f_{\mathrm{catt}}$ is the coverage-based attention computation:
$$\begin{aligned}
\mathbf{F}&=\mathbf{Q}*\sum_{l=1}^{t-1}\boldsymbol{\alpha}_l\\
e_{ti}&=\boldsymbol{\nu}_{\mathrm{att}}^{T}\tanh(\mathbf{U}_s\hat{\mathbf{s}}_t+\mathbf{U}_a\mathbf{a}_i+\mathbf{U}_f\mathbf{f}_i)\\
\alpha_{ti}&=\frac{\exp(e_{ti})}{\sum_{k=1}^L\exp(e_{tk})}\\
\mathbf{cA}_t&=\sum_{i=1}^L\alpha_{ti}\mathbf{a}_i
\end{aligned}$$
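To make $f_{\mathrm{catt}}$ concrete, here is a minimal NumPy sketch under toy dimensions. All weights ($\mathbf{U}_s$, $\mathbf{U}_a$, $\mathbf{U}_f$, $\boldsymbol{\nu}_{\mathrm{att}}$) are random stand-ins, and the convolution $\mathbf{Q}$ over the summed past attention maps is simplified to a per-position outer product rather than a real 2-D convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, n, n_att, q = 6, 8, 10, 12, 4   # positions, annotation dim, state dim, attention dim, coverage dim

A     = rng.standard_normal((L, C))   # annotations a_1 .. a_L
s_hat = rng.standard_normal(n)        # prediction \hat{s}_t
past  = rng.random((3, L))            # past attention maps alpha_1 .. alpha_{t-1}

U_s, U_a, U_f = (rng.standard_normal((n_att, d)) for d in (n, C, q))
Q  = rng.standard_normal(q)           # stand-in for the conv filter Q
nu = rng.standard_normal(n_att)       # nu_att

def f_catt(A, s_hat, past_alphas):
    # F = Q * sum_l alpha_l : one coverage vector f_i per position i
    cov = past_alphas.sum(axis=0)                      # (L,)
    F = np.outer(cov, Q)                               # (L, q)
    # e_ti = nu_att^T tanh(U_s s_hat + U_a a_i + U_f f_i)
    e = np.tanh(U_s @ s_hat + A @ U_a.T + F @ U_f.T) @ nu
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax over positions
    return alpha @ A, alpha                            # context cA_t, weights

c_t, alpha = f_catt(A, s_hat, past)
print(c_t.shape)                                       # (8,)
```

The coverage term lets positions that were already attended to score lower, which discourages over- or under-parsing of symbols.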

(4) Concatenate the two context vectors:
$$\mathbf{c}_t=[\mathbf{cA}_t;\mathbf{cB}_t]$$
(5) Compute the hidden state at step $t$:
$$\mathbf{s}_t=\mathrm{GRU}\left(\mathbf{c}_t,\hat{\mathbf{s}}_t\right)$$
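The five steps above can be sketched end to end in NumPy. This is a hypothetical toy, not the authors' implementation: the weights are random, and uniform averaging stands in for the two $f_{\mathrm{catt}}$ calls:

```python
import numpy as np

rng = np.random.default_rng(1)
n, e, C = 6, 4, 5                         # state dim, embedding dim, annotation dim

def make_gru(in_dim, hid):
    # One weight matrix per gate (update z, reset r, candidate h).
    return {k: rng.standard_normal((hid, in_dim + hid)) * 0.1 for k in "zrh"}

def gru_step(cell, x, h):
    xh = np.concatenate([x, h])
    z = 1 / (1 + np.exp(-cell["z"] @ xh))             # update gate
    r = 1 / (1 + np.exp(-cell["r"] @ xh))             # reset gate
    h_tilde = np.tanh(cell["h"] @ np.concatenate([x, r * h]))
    return (1 - z) * h + z * h_tilde

gru1, gru2 = make_gru(e, n), make_gru(2 * C, n)

y_prev, s_prev = rng.standard_normal(e), np.zeros(n)
A = rng.standard_normal((7, C))           # low-resolution annotations
B = rng.standard_normal((28, C))          # high-resolution (2H x 2W) annotations

s_hat = gru_step(gru1, y_prev, s_prev)    # \hat{s}_t = GRU(y_{t-1}, s_{t-1})
cA, cB = A.mean(axis=0), B.mean(axis=0)   # stand-ins for f_catt(A, .) and f_catt(B, .)
c_t = np.concatenate([cA, cB])            # c_t = [cA_t ; cB_t]
s_t = gru_step(gru2, c_t, s_hat)          # s_t = GRU(c_t, \hat{s}_t)
print(s_t.shape)                          # (6,)
```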

2. Decoder

Define the annotation sequence $\mathbf{A}$:
$$\mathbf{A}=\{\mathbf{a}_1,\ldots,\mathbf{a}_L\},\quad \mathbf{a}_i\in\mathbb{R}^C$$
where $L=H\times W$, assuming the CNN output has shape $H \times W \times C$.

Define the LaTeX string $\mathbf{Y}$:
$$\mathbf{Y}=\{\mathbf{y}_1,\ldots,\mathbf{y}_T\},\quad \mathbf{y}_i\in\mathbb{R}^K$$
$T$: length of the LaTeX string;

$K$: size of the symbol vocabulary; each $\mathbf{y}_i$ is a one-hot vector.

Note that both the annotation sequence $\mathbf{A}$ and the LaTeX string $\mathbf{Y}$ have variable length. How is this handled?

At decoding step $t$, the attention mechanism computes a fixed-length context vector $\mathbf{c}_t$:
$$\mathbf{c}_t=\sum_{i=1}^L\alpha_{ti}\mathbf{a}_i$$
The context vector is then used to compute the probability of each candidate symbol:
$$p(\mathbf{y}_t\mid\mathbf{y}_{t-1},\mathbf{X})=g\left(\mathbf{W}_o\,h(\mathbf{E}\mathbf{y}_{t-1}+\mathbf{W}_s\mathbf{s}_t+\mathbf{W}_c\mathbf{c}_t)\right)$$
$g$: softmax activation function;

$h$: maxout activation function;

$\mathbf{E}$: embedding matrix;

$\mathbf{W}_{s}\in\mathbb{R}^{m\times n}$: $m$ and $n$ are the embedding dimension and the GRU decoder state dimension, respectively;

$\mathbf{W}_{o}\in\mathbb{R}^{K\times\frac{m}{2}}$ (the maxout activation halves the dimension, hence $\frac{m}{2}$).
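The output layer can be sketched in NumPy with made-up toy dimensions; the weights are random stand-ins, and the maxout is implemented as a pairwise max, which halves the dimension as required by $\mathbf{W}_o\in\mathbb{R}^{K\times\frac{m}{2}}$:

```python
import numpy as np

rng = np.random.default_rng(2)
K, m, n, Cc = 9, 8, 6, 10                 # vocab size, maxout in-dim, state dim, |c_t|

E   = rng.standard_normal((m, K))         # embedding matrix
W_s = rng.standard_normal((m, n))
W_c = rng.standard_normal((m, Cc))
W_o = rng.standard_normal((K, m // 2))    # K x m/2, since maxout halves m

def maxout(x):                            # h: max over pairs, dim m -> m/2
    return x.reshape(-1, 2).max(axis=1)

def softmax(x):                           # g
    z = np.exp(x - x.max())
    return z / z.sum()

y_prev = np.eye(K)[3]                     # one-hot previous symbol
s_t, c_t = rng.standard_normal(n), rng.standard_normal(Cc)

# p(y_t | y_{t-1}, X) = g(W_o h(E y_{t-1} + W_s s_t + W_c c_t))
p = softmax(W_o @ maxout(E @ y_prev + W_s @ s_t + W_c @ c_t))
print(p.shape)                            # (9,)
```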

Tricks for improving performance:

1. Beam search: keep the 10 best partial hypotheses at each step; a hypothesis terminates when it emits the <eos> symbol.

2. Ensemble: train several identically-structured models with different initializations, and average their prediction probabilities at each step of beam search.
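The beam-search procedure can be sketched in plain Python. The scorer below is a made-up stub standing in for the real decoder (an ensemble would average the per-model probabilities inside it before ranking):

```python
# Minimal beam search: keep the best `beam_width` hypotheses per step,
# and move a hypothesis to `finished` once it emits <eos>.

EOS = "<eos>"
VOCAB = ["x", "+", "1", EOS]

def next_token_logprobs(prefix):
    # Stub scorer (hypothetical): real code would run the GRU decoder here.
    scores = {t: -float(len(prefix) + i + 1) for i, t in enumerate(VOCAB)}
    scores[EOS] = -0.1 if len(prefix) >= 3 else -99.0  # finish long prefixes
    return scores

def beam_search(beam_width=10, max_len=20):
    beams = [([], 0.0)]                    # (token sequence, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, tok_lp in next_token_logprobs(seq).items():
                candidates.append((seq + [tok], lp + tok_lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:beam_width]:
            (finished if seq[-1] == EOS else beams).append((seq, lp))
        if not beams:                      # every surviving hypothesis ended
            break
    return max(finished + beams, key=lambda c: c[1])[0]

print(beam_search())                       # -> ['x', 'x', 'x', '<eos>']
```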

Experiments:

1. Ablation study:


Comparison of recognition performance (in %) on CROHME 2014 and CROHME 2016 when employing the dense encoder and the multi-scale attention model.

2. Depth D of the multi-scale branch:


Comparison of recognition performance (in %) on CROHME 2014 and CROHME 2016 when increasing the depth of the dense block in the multi-scale branch.

3. Comparison with other models:

CROHME 2014 :


CROHME 2016 :


References:

[1]:J. Zhang, J. Du and L. Dai, “Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition,” 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018, pp. 2245-2250, doi: 10.1109/ICPR.2018.8546031.


[2]: Jianshu Zhang, Jun Du, Shiliang Zhang, Dan Liu, Yulong Hu, Jinshui Hu, Si Wei and Lirong Dai, "Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition," Pattern Recognition, Volume 71, 2017, pp. 196-206, ISSN 0031-3203.


Reposted from blog.csdn.net/KPer_Yang/article/details/129484160