Attention Mechanism Model
Model: it consists of an Encoder layer, an Attention layer, and a Decoder layer.
At each decoder time step, the previous hidden state s⟨t−1⟩ is copied Tx times and concatenated with all the encoder activations a (Tx time steps); this concatenation is the input to the Attention layer. The Attention layer computes Tx values α. Each α⟨t,t′⟩ is the attention weight on the corresponding word and is multiplied by the matching activation, i.e. α⋅a; the Tx products are then summed to form the Attention layer's output, which serves as the input to the Decoder layer at one time step.
The main idea: for one decoder time step, each word's encoder activation is multiplied by that word's own attention weight, and the weighted activations are summed.
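The weighted sum above can be sketched in a few lines of numpy. This is a minimal illustration, not the notebook's implementation; all sizes (`Tx`, `n_a`) and the random scores are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical sizes: Tx encoder time steps, activations of size n_a.
Tx, n_a = 5, 4
rng = np.random.default_rng(0)
a = rng.standard_normal((Tx, n_a))             # encoder activations, one row per time step

scores = rng.standard_normal(Tx)               # unnormalized attention "energies" (stand-in)
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights, sum to 1

context = (alpha[:, None] * a).sum(axis=0)     # weighted sum of alpha . a over the Tx steps

assert np.isclose(alpha.sum(), 1.0)
assert context.shape == (n_a,)
```

The softmax guarantees the weights are positive and sum to 1, so the context is a convex combination of the Tx activations.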
The figure below shows the Attention-to-context part of the figure above, i.e. the implementation of the attention mechanism:
Process:
There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it the pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps; the post-attention LSTM goes through Ty time steps.
Note: there are two LSTM layers in the model; they differ in whether they come before or after the attention mechanism and in which time steps they connect to. The bottom one is called the pre-attention Bi-LSTM and the top one the post-attention LSTM; they correspond to the Encoder and Decoder parts of a Seq2Seq model, respectively. Bi-LSTM means bidirectional LSTM. The pre-attention Bi-LSTM connects to the Tx input time steps, and the post-attention LSTM to the Ty output time steps.
The post-attention LSTM passes s⟨t⟩, c⟨t⟩ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-activation sequence model, so the state captured by the RNN was the output activation s⟨t⟩. But since we are using an LSTM here, the LSTM has both the output activation s⟨t⟩ and the hidden cell state c⟨t⟩. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-activation LSTM at time t will not take the specific generated y⟨t−1⟩ as input; it only takes s⟨t⟩ and c⟨t⟩ as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
Note: s denotes the hidden-state activation of the post-attention LSTM. The post-attention LSTM is not bidirectional because in this example (date translation) adjacent output characters are only weakly correlated. The initial hidden state s and cell state c of this layer are initialized as in an ordinary LSTM (usually zeros), and its inputs come from the Attention layer's computation.
We use a⟨t⟩ = [a→⟨t⟩; a←⟨t⟩] to represent the concatenation of the activations of both the forward-direction and backward-direction of the pre-attention Bi-LSTM.
Note: this is the notation for the bidirectional RNN activations.
The diagram on the right uses a RepeatVector node to copy s⟨t−1⟩'s value Tx times, and then Concatenation to concatenate s⟨t−1⟩ and a⟨t′⟩ to compute e⟨t,t′⟩, which is then passed through a softmax to compute α⟨t,t′⟩. We'll explain how to use RepeatVector and Concatenation in Keras below.
Note: here s⟨t−1⟩ is the previous hidden state of the post-attention (Decoder) LSTM, not an Encoder output. It is copied Tx times so that it can be concatenated with each of the Tx Encoder outputs a⟨t′⟩ (similar to forming an augmented matrix); the concatenated result is the input to the Attention layer.
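The shape bookkeeping of these two nodes can be mimicked with plain numpy (a sketch only; the sizes `Tx`, `n_a`, `n_s` are hypothetical, and `np.tile`/`np.concatenate` stand in for Keras's RepeatVector and Concatenate layers):

```python
import numpy as np

Tx, n_a, n_s = 5, 4, 3                        # hypothetical sizes
a = np.zeros((Tx, 2 * n_a))                   # Bi-LSTM outputs: forward+backward concatenated
s_prev = np.zeros(n_s)                        # decoder state s<t-1>

s_rep = np.tile(s_prev, (Tx, 1))              # what RepeatVector does: (Tx, n_s)
concat = np.concatenate([a, s_rep], axis=-1)  # what Concatenate does (along the last axis)

print(concat.shape)  # (5, 11): each row pairs one a<t'> with the same copy of s<t-1>
```

Every row of `concat` holds one encoder output next to the same decoder state, which is exactly what the per-time-step energy network needs as input.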
Model implementation: we implement two functions, one_step_attention() and model().
one_step_attention(): At step t, given all the hidden states of the Bi-LSTM ([a⟨1⟩, …, a⟨Tx⟩]) and the previous hidden state of the second LSTM (s⟨t−1⟩), one_step_attention() will compute the attention weights ([α⟨t,1⟩, …, α⟨t,Tx⟩]) and output the context vector context⟨t⟩ (see Figure 1 (right) for details):
Note that we are denoting the attention context in this notebook as context⟨t⟩. In the lecture videos, the context was denoted c⟨t⟩, but here we are calling it context⟨t⟩ to avoid confusion with the (post-attention) LSTM's internal memory cell variable, which is sometimes also denoted c⟨t⟩.
Note: at each time step t, the previous hidden state of the Decoder LSTM, s⟨t−1⟩ (t−1 rather than t because the current state has not been computed yet), is copied Tx times so that it can be concatenated with all the Encoder outputs as the Attention layer's input. The computation yields Tx outputs α; each α is multiplied by the corresponding time step's Encoder output (acting as an attention weight), and the products are summed to give the final context, which is the Decoder layer's input at that step.
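Putting the pieces together, one attention step can be sketched in numpy as follows. This is an illustrative stand-in for one_step_attention(), not the notebook's Keras code: the weights `W1`, `W2` and all sizes are hypothetical, and a one-hidden-layer tanh network plays the role of the small dense net that scores each time step:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def one_step_attention(a, s_prev, W1, W2):
    """numpy-only sketch of one attention step.
    a: (Tx, 2*n_a) Bi-LSTM activations; s_prev: (n_s,) decoder state s<t-1>;
    W1, W2: hypothetical weights of the small net that scores each step."""
    Tx = a.shape[0]
    s_rep = np.tile(s_prev, (Tx, 1))              # RepeatVector: copy s<t-1> Tx times
    concat = np.concatenate([a, s_rep], axis=-1)  # Concatenate with every a<t'>
    e = np.tanh(concat @ W1) @ W2                 # energies e<t,t'>, shape (Tx,)
    alpha = softmax(e)                            # attention weights alpha<t,t'>
    context = (alpha[:, None] * a).sum(axis=0)    # weighted sum -> context<t>
    return context, alpha

# Usage with hypothetical sizes:
rng = np.random.default_rng(1)
Tx, n_a, n_s, h = 5, 4, 3, 8
a = rng.standard_normal((Tx, 2 * n_a))
s_prev = rng.standard_normal(n_s)
W1 = rng.standard_normal((2 * n_a + n_s, h))
W2 = rng.standard_normal(h)
context, alpha = one_step_attention(a, s_prev, W1, W2)
print(context.shape, round(alpha.sum(), 6))  # (8,) 1.0
```

Note that the softmax runs across the Tx time-step axis, so the Tx weights for one decoder step always sum to 1.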
model(): Implements the entire model. It first runs the input through a Bi-LSTM to get back [a⟨1⟩, …, a⟨Tx⟩]. Then, it calls one_step_attention() Ty times (in a for loop). At each iteration of this loop, it gives the computed context vector context⟨t⟩ to the second LSTM, and runs the output of the LSTM through a dense layer with softmax activation to generate a prediction ŷ⟨t⟩.
Note: the model function first computes the Encoder layer, then uses the cached Encoder outputs for the Decoder-layer computation; after Ty time steps it produces the final output. Each of the Ty time steps calls the Attention layer once (and every call uses the cached Encoder outputs from all Tx time steps).
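The Ty-step decoder loop of model() can be sketched end to end in numpy. Everything here is a hypothetical stand-in: the sizes, the random weights, and especially the decoder cell, where a simple tanh recurrence replaces the post-attention LSTM so the loop structure stays visible:

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, Ty, n_a2, n_s, n_y, h = 5, 3, 8, 6, 11, 7   # hypothetical sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Pretend these are the cached Bi-LSTM outputs, computed once by the "Encoder".
a = rng.standard_normal((Tx, n_a2))

# Hypothetical parameters: attention net (W1, W2), toy decoder cell (Wd),
# and the output dense+softmax layer (Wy).
W1 = rng.standard_normal((n_a2 + n_s, h))
W2 = rng.standard_normal(h)
Wd = rng.standard_normal((n_a2 + n_s, n_s))
Wy = rng.standard_normal((n_s, n_y))

s = np.zeros(n_s)            # decoder state, initialized to zeros
outputs = []
for t in range(Ty):          # the Ty-iteration loop of model()
    # one attention step: repeat s, concatenate with a, score, softmax, weight
    concat = np.concatenate([a, np.tile(s, (Tx, 1))], axis=-1)
    alpha = softmax(np.tanh(concat @ W1) @ W2)
    context = (alpha[:, None] * a).sum(axis=0)
    # toy decoder step: a tanh cell standing in for the post-attention LSTM
    s = np.tanh(np.concatenate([context, s]) @ Wd)
    outputs.append(softmax(s @ Wy))   # dense + softmax prediction for this step

print(len(outputs), outputs[0].shape)  # Ty predictions, each of size n_y
```

The key structural point survives the simplification: the encoder runs once, while attention is recomputed from the same cached `a` at every one of the Ty decoder steps, using the freshly updated state `s`.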