
Deep learning study notes -- Attention Model learning summary

In deep learning, the Attention Model (AM) in fact simulates the attention mechanism of the human brain. For example, when we look at a painting we can take in the whole picture at a glance, but when we examine it carefully our eyes actually focus on only a small patch at a time, and at that moment the brain concentrates mainly on that small patch. In other words, the brain's attention to the whole image is not evenly balanced: different regions carry different weights. This is the core idea of the Attention Model in deep learning.

AM was indeed first applied in the image field, where it achieved very good results. Naturally, people then began to study how to introduce the AM into NLP. The most famous work is undoubtedly the paper "Neural machine translation by jointly learning to align and translate", which first proposed the Soft Attention Model and applied it to machine translation. Subsequent NLP papers that use the AM usually cite this article (it currently has thousands of citations!).

Machine translation is mainly built on the Encoder-Decoder model, and introducing AM on top of the basic Encoder-Decoder framework achieves good results.

Soft Attention Model:

In fact, this just spells out what was described above: the paper "Neural machine translation by jointly learning to align and translate" proposed the Soft Attention Model and applied it to machine translation. The so-called "Soft" means that, when computing the attention allocation for an input sentence X, every word is assigned a probability, yielding a full probability distribution over the input words.

That is, c_i is obtained by computing an attention probability for each word on the Encoder side and then taking the weighted sum of the corresponding states, as sketched below.
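Here is a minimal numpy sketch of that weighted-sum step. The bilinear scoring matrix W and all names are illustrative assumptions, not the parameterization of any particular paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_attention(decoder_state, encoder_states, W):
    # encoder_states: (T, d) -- one hidden state per source word
    # decoder_state:  (d,)   -- current decoder hidden state
    # W:              (d, d) -- a learned scoring matrix (one possible f_att)
    scores = encoder_states @ W @ decoder_state   # (T,) unnormalized alignment scores
    weights = softmax(scores)                     # attention distribution over source words
    context = weights @ encoder_states            # (d,) weighted sum = context vector c_i
    return context, weights
```

Every decoder step recomputes `weights`, so each target word attends to the source sentence with its own distribution.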

Alongside Soft AM there is a corresponding Hard AM. Soft AM assigns every input word an alignment probability; Hard AM instead directly picks, for each target word, the single input word it aligns to, and treats the alignment probability of every other input word as zero -- a hard alignment. Hard AM has proven useful for images, but not very useful for text, because demanding that exactly one word be aligned correctly is asking too much: a missed alignment has a strongly negative effect on subsequent processing.

However, a Stanford paper, "Effective Approaches to Attention-based Neural Machine Translation", presents a hybrid of Soft AM and Hard AM. It proposes two models: the Global Attention Model and the Local Attention Model. The Global Attention Model is in fact the Soft Attention Model, while the Local Attention Model is in essence a mix of Soft AM and Hard AM: it first predicts a rough alignment position p_t, and then applies a Soft-AM-style probability distribution within a window of size D around p_t.

Global Attention Model and Local Attention Model

Global AM is in fact Soft AM: at each time step of the Decoder, computing the context vector requires an attention weight for every word on the Encoder side, followed by the weighted sum.

 

Local AM first finds an alignment position, then computes attention weights within a window around that position, and finally takes the weighted sum to obtain the context vector. It is in fact a hybrid compromise between Soft AM and Hard AM; see the sketch below.
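A rough numpy sketch of Local AM in the spirit of Luong et al.'s predictive alignment. The names v_p, W_p and the bilinear window scoring are assumptions for illustration; the Gaussian with sigma = D/2 follows the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def local_attention(h_t, encoder_states, W, W_p, v_p, D=3):
    # h_t: (d,) decoder state; encoder_states: (T, d)
    T = len(encoder_states)
    # predicted alignment position: p_t = T * sigmoid(v_p . tanh(W_p h_t))
    p_t = T / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    lo, hi = max(0, int(p_t) - D), min(T, int(p_t) + D + 1)
    window = encoder_states[lo:hi]            # only the 2D+1 states around p_t
    weights = softmax(window @ W @ h_t)       # Soft-AM-style distribution inside the window
    # emphasize positions near p_t with a Gaussian (sigma = D/2)
    pos = np.arange(lo, hi)
    weights = weights * np.exp(-((pos - p_t) ** 2) / (2 * (D / 2.0) ** 2))
    return weights @ window                   # context vector from the window only
```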

 

Static AM

There is in fact another kind of AM called static AM. Static AM means that, for a document or sentence, we compute an attention probability distribution over its words and take the weighted sum to obtain one vector that represents the document or sentence. The difference from Soft AM is that Soft AM recomputes the attention distribution over all words at every Decoder step and then takes the weighted sum to get the context vector, whereas static AM computes the sentence's vector representation only once. (This is really a change made to suit different tasks.)
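A minimal sketch of that one-shot computation, assuming a single learned query vector q as the scorer (the parameterization is an assumption; any trainable scoring function would do):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def static_attention(word_states, q):
    # word_states: (T, d) hidden states of a sentence or document
    # q:           (d,)   a single learned query vector
    weights = softmax(word_states @ q)   # computed once, not per decoder step
    return weights @ word_states         # (d,) fixed vector for the whole sentence
```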

 

Forced-forward AM

When Soft AM generates the target sentence word by word, it proceeds from front to back, but it places no special requirement on each word's alignment model over the input sentence. Forced-forward AM adds a constraint: when generating target words, once an input word has been aligned with an output word, it is basically not considered again afterwards. Because both the input and the output move forward step by step, this looks like a forced alignment rule marching forward.
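One crude way to approximate this constraint in code is a coverage-style mask; the threshold-and-penalty scheme below is only an illustrative assumption, not the original formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def forward_constrained_step(scores, coverage, threshold=0.5, penalty=1e9):
    # scores:   (T,) raw alignment scores for the current target word
    # coverage: (T,) attention mass each source word has received so far
    # Source words whose accumulated attention exceeds the threshold are
    # treated as "already aligned" and masked out for later target words.
    masked = np.where(coverage > threshold, scores - penalty, scores)
    weights = softmax(masked)
    return weights, coverage + weights   # updated coverage for the next step
```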

 

Having seen so many AM models and variants, let us now look at how the AM is concretely implemented and what the formulas involved look like.

As we know, the attention mechanism is the most common way for a sequence-to-sequence model to attend to encoder states, and it can also be used to look back at the past states of a sequence model. With attention, the system obtains a context vector c_i based on the hidden states s_1, ..., s_m, which is then used together with the current hidden state h_i to make a prediction. The context vector c_i is the weighted average of the previous states, where the weights are the attention weights a_i:

c_i = Σ_j a_ij · s_j,  with  a_ij = exp(f_att(h_i, s_j)) / Σ_k exp(f_att(h_i, s_k))

The attention function f_att(h_i, s_j) computes an unnormalized alignment score between the current hidden state h_i and a previous hidden state s_j.

In practice the attention function has many variants. Next we discuss four of them: additive attention, multiplicative (dot-product) attention, self-attention, and key-value attention.

Additive attention

Additive attention is the classic attention mechanism (Bahdanau et al., 2015) [15]. It uses a feed-forward network with one hidden layer (fully connected) to compute the attention scores:

f_att(h_i, s_j) = v_a^T · tanh(W_1 h_i + W_2 s_j)

That is, equivalently:

f_att(h_i, s_j) = v_a^T · tanh(W_a [h_i; s_j])
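Both forms in a few lines of numpy (the weight shapes W_1, W_2: (k, d), W_a: (k, 2d), v_a: (k,) are assumptions for illustration):

```python
import numpy as np

def additive_score(h_i, s_j, W1, W2, v_a):
    # one-hidden-layer feed-forward scorer: v_a^T tanh(W1 h_i + W2 s_j)
    return v_a @ np.tanh(W1 @ h_i + W2 @ s_j)

def additive_score_concat(h_i, s_j, W_a, v_a):
    # the equivalent concatenation form: v_a^T tanh(W_a [h_i; s_j])
    return v_a @ np.tanh(W_a @ np.concatenate([h_i, s_j]))
```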

 

Multiplicative (dot-product) attention

Multiplicative attention (Luong et al., 2015) [16] simplifies the attention computation to the following function:

f_att(h_i, s_j) = h_i^T · W_a · s_j

Additive and multiplicative attention are similar in complexity, but multiplicative attention is usually faster and more memory-efficient in practice, because it can be implemented more efficiently with matrix operations. The two variants perform similarly for decoder states of low dimensionality d_h, but additive attention performs better at higher dimensions.
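That efficiency is easy to see in code: all pairwise scores collapse into a single matrix product. A small numpy sketch (the batched layout here is our own choice):

```python
import numpy as np

def softmax_rows(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multiplicative_attention(H, S, W):
    # H: (n, d) decoder states; S: (m, d) encoder states; W: (d, d)
    scores = H @ W @ S.T       # (n, m): every f_att(h_i, s_j) in one product
    A = softmax_rows(scores)   # one attention distribution per decoder step
    return A, A @ S            # (n, d) context vectors
```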

 

Self-attention

The attention mechanism is not limited to attending to encoder states or earlier hidden layers; it can also be used to obtain a distribution over other features, such as the word embeddings of a text in reading-comprehension tasks (Kadlec et al., 2017) [37]. However, attention is not directly applicable to classification tasks that require no additional information, such as sentiment analysis. In such models, the sentence is usually represented by the final hidden state of an LSTM or by an aggregation function such as max pooling or average pooling.

Self-attention usually uses no extra information either: the sentence attends to itself, and that alone extracts the relevant information (Lin et al., 2017) [18]. Self-attention, also called intra-attention, performs remarkably well on many tasks, such as reading comprehension (Cheng et al., 2016) [38], textual entailment (Parikh et al., 2016) [39], and abstractive summarization (Paulus et al., 2017) [40].
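A single-head sketch in the spirit of Lin et al. (2017); the shapes and the pooling to one sentence vector are simplifying assumptions (the paper uses several attention rows):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention_pool(H, W_s1, w_s2):
    # H: (T, d) -- the sentence's OWN hidden states; no decoder state involved
    # W_s1: (k, d), w_s2: (k,) -- a single-head variant of the scorer
    scores = w_s2 @ np.tanh(W_s1 @ H.T)   # (T,) one score per word
    a = softmax(scores)
    return a @ H                          # (d,) attention-pooled sentence vector
```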

 

Key-value attention

Key-value attention (Daniluk et al., 2017) [19] is a recent attention variant that separates form from function, keeping separate vectors for the attention computation. It has also proven very useful in a variety of text-modeling tasks (Liu & Lapata, 2017) [41]. Specifically, key-value attention splits each hidden vector h_i into a key k_i and a value v_i: [k_i; v_i] = h_i. The keys are used with additive attention to compute the attention distribution a_i:

a_i = softmax(v_a^T · tanh(W_1 [k_(i-L); ...; k_(i-1)] + (W_2 k_i) 1^T))

where L is the length of the attention window and 1 is a vector whose entries are all ones. The attention distribution then yields the context representation c_i:

c_i = [v_(i-L); ...; v_(i-1)] · a_i^T

The context vector c_i is then combined with the current value v_i for prediction.
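A rough numpy rendering of the key/value split, simplified from the paper's batched formulation (the window handling and weight shapes are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def key_value_attention(H, i, L, W1, W2, v_a):
    # H: (T, 2d) hidden states; each row splits as [k_i; v_i] = h_i
    # (assumes i >= 1 so the window below is non-empty)
    d = H.shape[1] // 2
    K, V = H[:, :d], H[:, d:]
    window = slice(max(0, i - L), i)        # the L positions preceding step i
    # additive scoring of each past key against the current key k_i
    scores = np.tanh(K[window] @ W1.T + K[i] @ W2.T) @ v_a   # (<= L,)
    a = softmax(scores)
    return a @ V[window]   # context from the values only; combined with v_i downstream
```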


Origin www.cnblogs.com/DENG012/p/10947899.html