Table of contents

1.Attention Summary
2.Model Architecture

2.1.Input embedding layer
2.2.Embedding encoder layer
2.3.Context-query attention layer
2.4.Model encoder layer
2.5.Output layer

3.Experiment
4.Source code
5.Summary & ramblings

5.1.My own take
5.2.About the datasets
5.3.Related work on MC

Reference links

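Editor's note: this post is organized around 徐阿衡 (Xu Aheng)'s article, with highlights from a few other articles and some of my own humble opinions added~
Annotated copy of the paper: zsweet github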

1.Attention Summary

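The attention in this paper is actually a bit involved, but the paper explains it very clearly, and involved does not mean inefficient~

The paper's main contribution is an improved attention mechanism. To motivate it, the authors summarize the three kinds of attention commonly used on MC tasks so far: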

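Attention Reader. Uses a dynamic attention mechanism to extract the relevant information from the text (a context vector), then makes the prediction from that information.
Representative papers: Bahdanau et al. 2015, Hermann et al. 2015, Chen et al. 2016, Wang & Jiang 2016
Attention-Sum Reader. Computes the attention weights only once and feeds them straight to the output layer for the final prediction. In other words, attention directly gives the probability of each position in the text being the answer. The idea is similar to a pointer network, and the result depends heavily on how the query is represented.
Representative papers: Kadlec et al. 2016, Cui et al. 2016
Multi-hop Attention. Computes attention multiple times.
Representative papers: Memory Network (Weston et al., 2015), Sordoni et al., 2016, Dhingra et al., 2016, Shen et al., 2016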

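Building on this, the authors improve the attention mechanism. BiDAF attention has the following characteristics:

The context is not encoded into a fixed-size vector (I don't quite understand this part: what exactly does this fixed vector refer to? Answers welcome). Instead, the vectors are allowed to flow, which reduces the information lost by an early weighted sum. That is, the attention layer is not used to map the context to a single fixed-size vector. Attention is computed at every time step, and the attended vector of each time step, together with the representations from the previous layers, is allowed to flow through to the later layers of the model. This reduces the error caused by summarizing the context into a fixed-size vector too early.
Memory-less. At each time step, attention is computed only from the query and the context paragraph at the current time step. It does not directly depend on the attention of the previous time step, so later attention is not affected by earlier incorrect attention.
Attention is computed in both directions, query-to-context (Q2C) and context-to-query (C2Q), on the view that the two complement each other. In the ablation on the dev set, removing C2Q and removing Q2C lowered the score by 12 and 10 points respectively, so attention in the C2Q direction is clearly the more important one.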

2.Model Architecture

The paper proposes a six-layer architecture:

Character Embedding Layer
Word Embedding Layer
Contextual Embedding Layer
Attention Flow Layer
Modeling Layer
Output Layer

Model architecture picture:
[Figure: BiDAF model architecture]
Since the first two layers are both embedding layers, I cover them together~

2.1.Input embedding layer

= Character Embedding Layer + Word Embedding Layer

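Much like other models: word embedding + character embedding. The word vectors are pre-trained; the OOV vectors and the character vectors are trainable, and the character vectors are trained with a CNN.
The representation of a word w is the concatenation of its word vector and its character vector, passed through a two-layer highway network. This gives the context matrix $X \in \mathbb{R}^{d \times T}$ and the query matrix $Q \in \mathbb{R}^{d \times J}$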
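To make the "concatenate, then two highway layers" step concrete, here is a minimal NumPy sketch. It is not the authors' code: the toy sizes, the random stand-ins for the word vectors and the char-CNN output, and the helper name `highway_layer` are all assumptions for illustration.

```python
import numpy as np

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = t * relu(x W_h + b_h) + (1 - t) * x."""
    transform = np.maximum(0.0, x @ w_h + b_h)        # candidate features
    gate = 1.0 / (1.0 + np.exp(-(x @ w_t + b_t)))     # transform gate t in (0, 1)
    return gate * transform + (1.0 - gate) * x        # the rest of x is carried through

rng = np.random.default_rng(0)
T, d_word, d_char = 7, 6, 4                 # toy sizes, chosen only for the demo
d = d_word + d_char

word_emb = rng.normal(size=(T, d_word))     # stand-in for pre-trained word vectors
char_emb = rng.normal(size=(T, d_char))     # stand-in for the char-CNN output
x = np.concatenate([word_emb, char_emb], axis=1)   # (T, d): one row per word

for _ in range(2):                          # two highway layers
    w_h = rng.normal(scale=0.1, size=(d, d))
    w_t = rng.normal(scale=0.1, size=(d, d))
    x = highway_layer(x, w_h, np.zeros(d), w_t, np.zeros(d))

X = x.T                                     # (d, T): one column per context word
print(X.shape)                              # (10, 7)
```

The gate decides, per dimension, how much of the transformed features to use and how much of the raw input to pass through unchanged.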

2.2.Embedding encoder layer

= Contextual Embedding Layer
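X and Q from the previous step are each encoded with a Bi-LSTM, which captures the local relations among the words within X and within Q. Concatenating the outputs of the two LSTM directions gives $H \in \mathbb{R}^{2d \times T}$ and $U \in \mathbb{R}^{2d \times J}$
These first two layers (three in the paper's numbering) capture features of the query and the context at different granularities (character, word, phrase)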

2.3.Context-query attention layer

= Attention Flow Layer

The attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vector at each time step, along with the embeddings from previous layers, are allowed to flow through to the subsequent modeling layer. This reduces the information loss caused by early summarization.

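The input is H and U. The output is G, the query-aware vectors of the context words, together with the contextual embeddings passed down from the previous layer. Attention is computed in two directions, context-to-query and query-to-context. The recipe is the usual one: compute a similarity matrix, normalize it into attention scores, then multiply with the original matrix to get the attended vectors.
C2Q and Q2C share one similarity matrix $S \in \mathbb{R}^{T \times J}$. Row $t$ of S holds the similarity between the $t$-th context word and every query word, and column $j$ of S holds the similarity between the $j$-th query word and every context word. The similarity is computed as:
$S_{tj}=\alpha(H_{:t}, U_{:j}) \in \mathbb{R}$

$\alpha(h,u)=w^T_{(S)}[h;u;h \odot u]$

where:

$S_{tj}$: the similarity between the t-th context word and the j-th query word. $S \in \mathbb{R}^{T \times J}$
$\alpha(h,u)$: a scalar function. $\alpha(h,u) \in \mathbb{R}$ is a real-valued score
$w_{(S)}$: a trainable weight vector. $w_{(S)} \in \mathbb{R}^{6d}$
$H_{:t}$: the t-th column vector of H. $H_{:t} \in \mathbb{R}^{2d}$
$U_{:j}$: the j-th column vector of U. $U_{:j} \in \mathbb{R}^{2d}$
$\odot$: element-wise multiplication
$[;]$: vector concatenation (the vectors are stacked into one longer vector)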
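The similarity function above can be sketched in a few lines of NumPy. This is a sketch under my own assumptions (toy sizes, random inputs, and the function name `similarity_matrix` are not from the paper). It uses the fact that a dot product with a concatenation splits into one term per block, so the big $[h;u;h \odot u]$ tensor is never materialized.

```python
import numpy as np

def similarity_matrix(H, U, w_s):
    """S[t, j] = w_s^T [h; u; h * u] with h = H[:, t] and u = U[:, j].

    H: (2d, T) context, U: (2d, J) query, w_s: (6d,) weight vector.
    """
    two_d = H.shape[0]
    w_h, w_u, w_hu = w_s[:two_d], w_s[two_d:2 * two_d], w_s[2 * two_d:]
    # w_s^T [h; u; h*u] = w_h^T h + w_u^T u + sum_k w_hu[k] * h[k] * u[k]
    return (w_h @ H)[:, None] + (w_u @ U)[None, :] + (H * w_hu[:, None]).T @ U

rng = np.random.default_rng(0)
two_d, T, J = 4, 5, 3                       # toy sizes for the demo
H = rng.normal(size=(two_d, T))
U = rng.normal(size=(two_d, J))
w_s = rng.normal(size=3 * two_d)            # 6d entries, since 2d = 4

S = similarity_matrix(H, U, w_s)
print(S.shape)                              # (5, 3)
```

The tiled version in the TensorFlow snippet further down computes the same scores by building the full concatenated tensor first, which is easier to read but uses more memory.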

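context-to-query attention (C2Q):

Works out, for each context word, which query words are most relevant to it. With the similarity matrix from above, softmax normalizes each row of S (over the J query words), and the weighted sum of the query vectors gives $\hat U$
$a_t=\mathrm{softmax}(S_{t:})$
$\hat U_{:t}=\sum_j a_{tj}U_{:j}$

where:

$a_t$: the normalized similarities between the t-th context word and every query word. $a_t \in \mathbb{R}^{J}$
$\hat U_{:t}$: the weighted sum of the query words, quantifying which query words are relevant to the t-th context word. $\hat U_{:t} \in \mathbb{R}^{2d}$, hence $\hat U \in \mathbb{R}^{2d \times T}$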
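As a quick check of the shapes in C2Q, here is a small NumPy sketch of the two formulas. The random S and U and the helper names are assumptions for the demo only.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def c2q_attention(S, U):
    """S: (T, J) similarities, U: (2d, J) query. Returns A (T, J) and U_hat (2d, T)."""
    A = softmax(S, axis=1)    # row t is a_t and sums to 1 over the J query words
    U_hat = U @ A.T           # column t is sum_j a_tj * U[:, j]
    return A, U_hat

rng = np.random.default_rng(0)
two_d, T, J = 4, 5, 3
U = rng.normal(size=(two_d, J))
S = rng.normal(size=(T, J))

A, U_hat = c2q_attention(S, U)
print(U_hat.shape)            # (4, 5)
```

Each context word gets its own mixture of query vectors, which is why $\hat U$ has T columns even though the query only has J words.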

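query-to-context attention (Q2C):
Works out which context words are most similar to some query word; these context words matter most for answering the question. Take the maximum of each row of the similarity matrix (the max over the J query words), apply softmax to the resulting T values, use them as weights for a weighted sum of the context vectors, then tile the result T times to get $\hat H \in \mathbb{R}^{2d \times T}$. Note that one query word may pick out several context words, because their similarities are all high. Concretely, each context word keeps the score of its most relevant query word, and these scores are then used as weights to sum over all context words.
$b=\mathrm{softmax}(\max_{col}(S))$

$\hat h=\sum_t b_t H_{:t} \in \mathbb{R}^{2d}$
where:

$\max_{col}(S)$: the maximum taken across the columns of S (the paper's wording: the maximum function (maxcol) is performed across the column), i.e. for each row t, the largest entry over all query words j. Hence $b \in \mathbb{R}^T$
$\hat h$: a weighted sum over all context words, which amounts to attention over the context. $\hat h \in \mathbb{R}^{2d}$, and after tiling (copying) it T times, $\hat H \in \mathbb{R}^{2d \times T}$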
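The direction of the max is easy to get wrong, so here is a tiny NumPy sketch with a hand-made S. The numbers and helper names are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # 1-D softmax, shifted for numerical stability
    return e / e.sum()

def q2c_attention(S, H):
    """S: (T, J) similarities, H: (2d, T) context. Returns b (T,) and H_hat (2d, T)."""
    b = softmax(S.max(axis=1))                          # max over query words, softmax over context words
    h_hat = H @ b                                       # (2d,) weighted sum of context columns
    H_hat = np.tile(h_hat[:, None], (1, H.shape[1]))    # same vector copied T times
    return b, H_hat

S = np.array([[0.0, 1.0],
              [5.0, 2.0],
              [0.0, 0.0]])                      # T = 3 context words, J = 2 query words
H = np.arange(12, dtype=float).reshape(4, 3)    # (2d, T) with 2d = 4

b, H_hat = q2c_attention(S, H)
print(b.argmax())     # 1: the second context word has the strongest match (5.0)
print(H_hat.shape)    # (4, 3)
```

Because every column of $\hat H$ is the same vector, Q2C contributes one global "what in the context matters for this query" summary to each position.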

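At this point $\hat U$ and $\hat H$ are both 2d x T matrices.
Finally, the contextual embeddings and the C2Q and Q2C results (three matrices) are combined to get G
$G_{:t}=\beta (H_{:t}, \hat U_{:t}, \hat H_{:t}) \in \mathbb{R}^{d_G}$

$\beta$ could be a multi-layer perceptron, but the simple concatenation below already works well.
$\beta(h, \hat u, \hat h)=[h;\hat u; h \odot \hat u; h \odot \hat h] \in \mathbb{R}^{8d}$ , so $G \in \mathbb{R}^{8d \times T}$

This gives the query-aware representation of each word in the context.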
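The concatenation form of $\beta$ is only a few lines in NumPy. In this sketch the inputs are random placeholders with the right shapes and the name `merge` is my own, so treat it as an illustration of the shapes rather than the reference implementation.

```python
import numpy as np

def merge(H, U_hat, H_hat):
    """beta(h, u_hat, h_hat) = [h; u_hat; h * u_hat; h * h_hat], applied to every column."""
    return np.concatenate([H, U_hat, H * U_hat, H * H_hat], axis=0)

rng = np.random.default_rng(0)
two_d, T = 4, 5
H = rng.normal(size=(two_d, T))                         # contextual embeddings
U_hat = rng.normal(size=(two_d, T))                     # placeholder for the C2Q output
H_hat = np.tile(rng.normal(size=(two_d, 1)), (1, T))    # placeholder for the tiled Q2C output

G = merge(H, U_hat, H_hat)
print(G.shape)    # (16, 5), i.e. (8d, T) with 2d = 4
```

Note that $\hat H$ never enters G on its own, only through the product $h \odot \hat h$.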
Attention source code:

import tensorflow as tf  # TensorFlow 1.x API (tf.layers)

# Fragment from inside a model method: h is the context [N,T,2d], u is the query [N,J,2d]
N = tf.shape(h)[0]  # batch size
T = tf.shape(h)[1]  # context length
J = tf.shape(u)[1]  # query length
h_expand = tf.tile(tf.expand_dims(h, 2), [1, 1, J, 1])  # [N,T,J,2d]  each context word copied J times
u_expand = tf.tile(tf.expand_dims(u, 1), [1, T, 1, 1])  # [N,T,J,2d]  the whole query copied T times
h_element_wise_u = tf.multiply(h_expand, u_expand)  # [N,T,J,2d]  h ⊙ u for every (context word, query word) pair
cat_data = tf.concat((h_expand, u_expand, h_element_wise_u), 3)  # [N,T,J,6d]
print("cat_data:{}".format(cat_data))
S = tf.layers.dense(cat_data, 1)  # [N,T,J,1]
S = tf.nn.dropout(S, self.dropout_keep_prob)  # !!! dropout is applied directly to the similarity scores
print("S:{}".format(S))
S = tf.squeeze(S, 3)  # [N,T,J]
print("S reshape result:{}".format(S))
# Context2Query
c2q = tf.matmul(tf.nn.softmax(S), u)  # (N, T, 2d) = bmm( (N, T, J), (N, J, 2d) )
# Query2Context
# b: attention weights on the context
b = tf.nn.softmax(tf.reduce_max(S, 2), axis=-1)  # (N, T)  for each context word, the score of its most relevant query word;
                                                 # several context words can score high at once,
                                                 # then softmax over the context and take the weighted sum
q2c = tf.matmul(tf.expand_dims(b, 1), h)  # (N, 1, 2d) = bmm( (N, 1, T), (N, T, 2d) )
print("q2c shape : {}".format(q2c))
q2c = tf.tile(q2c, [1, T, 1])  # (N, T, 2d)
#q2c.set_shape([self.batch_size,self.max_p_len,2*self.hidden_size])
print("q2c expand_dim shape : {}".format(q2c))
G = tf.concat((h, c2q, tf.multiply(h, c2q), tf.multiply(h, q2c)), 2)  # (N, T, 8d)

2.4.Model encoder layer

= Modeling Layer
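The input is G, which goes through a Bi-LSTM once more (the paper stacks two Bi-LSTM layers here) to give $M \in \mathbb{R}^{2d \times T}$. This captures the interaction among the context words conditioned on the query
Each column vector of M carries contextual information about the corresponding word with respect to the whole context and the query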

2.5.Output layer

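Predict the start position p1 and the end position p2
$p^1=\mathrm{softmax}(w^T_{(p^1)}[G; M]), \quad p^2=\mathrm{softmax}(w^T_{(p^2)}[G; M^2])$

where $w_{(p^1)}, w_{(p^2)} \in \mathbb{R}^{10d}$ are trainable weight vectors. M is passed through one more Bi-LSTM to get $M^2 \in \mathbb{R}^{2d \times T}$, which is used for the probability distribution over the end position

The final objective:
$L(\theta)=-\frac{1}{N} \sum^N_{i}[\log(p^1_{y_i^1})+\log(p^2_{y_i^2})]$

$y^1_i$ and $y^2_i$ are the ground-truth start and end positions of the i-th example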
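The output layer and the loss for a single example can be sketched as below. This is a NumPy illustration under assumed toy sizes with random G, M and $M^2$; the function name `span_loss` is hypothetical. Averaging the returned loss over a batch gives $L(\theta)$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # 1-D softmax, shifted for numerical stability
    return e / e.sum()

def span_loss(G, M, M2, w_p1, w_p2, y1, y2):
    """Start/end distributions and negative log-likelihood for one example."""
    p1 = softmax(w_p1 @ np.concatenate([G, M], axis=0))    # (T,) start distribution
    p2 = softmax(w_p2 @ np.concatenate([G, M2], axis=0))   # (T,) end distribution
    loss = -(np.log(p1[y1]) + np.log(p2[y2]))
    return p1, p2, loss

rng = np.random.default_rng(0)
d, T = 2, 6                                  # toy sizes for the demo
G = rng.normal(size=(8 * d, T))
M = rng.normal(size=(2 * d, T))
M2 = rng.normal(size=(2 * d, T))
w_p1 = rng.normal(size=10 * d)               # [G; M] has 8d + 2d = 10d rows
w_p2 = rng.normal(size=10 * d)

p1, p2, loss = span_loss(G, M, M2, w_p1, w_p2, y1=1, y2=3)
print(p1.shape, p2.shape)                    # (6,) (6,)
```

A handy sanity check: with all-zero weights both distributions are uniform, so the loss is exactly $2\log T$.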

3.Experiment

Some hyperparameters of the model, which matter quite a bit once you start coding:

Tokenizer: PTB Tokenizer
char-embedding: 100 1D filters, each with width 5
hidden size = 100, which is the d in the paper and in this post
Optimizer: AdaDelta, minibatch size = 60, initial learning rate = 0.5, for 12 epochs
Dropout with rate 0.2 is applied to the CNN, the LSTM layers, and the linear transformation before the softmax
During training, moving averages of all the weights of the model are maintained with an exponential decay rate of 0.999
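For readers who have not met exponential moving averages of weights before, the update rule is just one line. The sketch below uses plain Python with made-up weight values; `ema_update` is a hypothetical helper, not a function from the original code base. As I understand the paper, the averaged weights rather than the raw ones are used at test time.

```python
def ema_update(shadow, value, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * value."""
    return decay * shadow + (1.0 - decay) * value

# Hypothetical values of a single weight after successive training steps
weights = [1.0, 0.0, 0.5, 0.75]

shadow = weights[0]              # start the average at the first value
for w in weights[1:]:
    shadow = ema_update(shadow, w)
print(shadow)                    # stays close to 1.0 because the decay is so high
```

With a decay of 0.999 each new value only moves the average by 0.1% of the gap, so the shadow weights change slowly and smooth out the noise of individual updates.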

4.Source code

For more source code, see: zsweet github

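5.Summary & ramblings

5.1.My own take

Reading the paper, the first thing BiDAF innovates on is attention: it adds bidirectionality, and it adds the memory-less idea.
The other strong impression I got is that it borrows the idea of ResNet. Both $G_{:t}=\beta (H_{:t}, \hat U_{:t}, \hat H_{:t}) \in \mathbb{R}^{d_G}$ in the attention layer and $[G;M]$ , $[G; M^2]$ in the output layer reuse the results of earlier layers. To me this is a kind of ResNet-style "innovation" in NLP~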

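5.2.About the datasets

The related work section of the paper also mentions several Machine Comprehension (MC) datasets. They come up in many papers, and papers report results on them for reference, so here is a summary:

MCTest: an MC dataset proposed by Richardson et al. in 2013
CNN/DailyMail: Hermann et al. (2015)
Children's Book Test: Hill et al. (2016)
SQuAD: Rajpurkar et al. (2016). Given a passage, a question (query) is asked about it, and every answer is a span of the passage. This is without question the hottest RC dataset right now, bar none, and the release of 2.0 has kept it hot~
DuReader: a fairly popular Chinese dataset, with data from Baidu Zhidao and Baidu Search. There are two big differences from SQuAD. First, each question comes with many passages in which to look for the answer. Second, the answer is no longer a span of the passage but a human-annotated answer, although a fake answer, i.e. the relevant position in the passage, is still provided.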

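5.3.Related work on MC

One thing I find especially good about this paper is its summary of MC, or more precisely of attention. To keep the original flavor, I quote the relevant passages directly: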

Previous works in end-to-end machine comprehension use attention mechanisms in three distinct ways. The first group (largely inspired by Bahdanau et al. (2015)) uses a dynamic attention mechanism, in which the attention weights are updated dynamically given the query and the context as well as the previous attention. Hermann et al. (2015) argue that the dynamic attention model performs better than using a single fixed query vector to attend on context words on CNN & DailyMail datasets. Chen et al. (2016) show that simply using bilinear term for computing the attention weights in the same model drastically improves the accuracy. Wang & Jiang (2016) reverse the direction of the attention (attending on query words as the context RNN progresses) for SQuAD. In contrast to these models, BIDAF uses a memory-less attention mechanism.
The second group computes the attention weights once, which are then fed into an output layer for final prediction (e.g., Kadlec et al. (2016)). Attention-over-attention model (Cui et al., 2016) uses a 2D similarity matrix between the query and context words (similar to Equation 1) to compute the weighted average of query-to-context attention. In contrast to these models, BIDAF does not summarize the two modalities in the attention layer and instead lets the attention vectors flow into the modeling (RNN) layer.
The third group (considered as variants of Memory Network (Weston et al., 2015)) repeats computing an attention vector between the query and the context through multiple layers, typically referred to as multi-hop (Sordoni et al., 2016; Dhingra et al., 2016). Shen et al. (2016) combine Memory Networks with Reinforcement Learning in order to dynamically control the number of hops. One can also extend our BIDAF model to incorporate multiple hops.

Bi-Direction attention flow for machine reading (theory part)