DeepLearning.ai Code Notes 5: Sequence Models

The Attention Mechanism Model

Model: it consists of an Encoder layer, an Attention layer, and a Decoder layer.

The Decoder's previous hidden state s<t-1> is copied Tx times and concatenated with all Tx Encoder activations a<1>, ..., a<Tx>; this concatenation is the input to the Attention layer. The Attention layer then computes Tx attention weights α<t,t'>, one per input time step. Each Encoder activation a<t'> is multiplied by its weight, i.e. α<t,t'> · a<t'>, and the sum of these Tx products is the Attention layer's output (the context vector), which is fed into one time step of the Decoder layer.

The main idea is that, at each output time step, the Encoder activations of the different input time steps (i.e. the different words) are each multiplied by their own attention weight, so the model decides how much attention to pay to each input word.
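As a minimal numerical sketch of this weighted sum (not the assignment's code; the sizes Tx and n_a below are made up for illustration), the context vector for one output step is just the attention-weighted combination of the Encoder activations:

```python
import numpy as np

Tx, n_a = 30, 32                       # assumed: input length and Bi-LSTM units per direction
a = np.random.randn(Tx, 2 * n_a)       # encoder activations a<1..Tx> (forward and backward concatenated)
e = np.random.randn(Tx)                # unnormalized "energies" for one output time step t
alphas = np.exp(e) / np.exp(e).sum()   # softmax over the Tx input positions -> attention weights
context = (alphas[:, None] * a).sum(axis=0)   # context<t> = sum over t' of alpha<t,t'> * a<t'>
print(context.shape)                   # (64,) == (2 * n_a,)
```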

[Figure: the overall attention model]

The figure below shows the "Attention → context" part of the figure above; it is the implementation of the attention mechanism:

[Figure: the attention mechanism (one Attention-to-context step)]

The process:

  • There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps; the post-attention LSTM goes through Ty time steps.

    Translation: The model has two LSTM layers; they differ in whether they come before or after the attention mechanism and in which time steps they connect to. The lower one is called the pre-attention Bi-LSTM and the upper one the post-attention LSTM; they are, in effect, the Encoder and Decoder parts of a Seq2Seq model. Bi-LSTM means bidirectional LSTM. The pre-attention Bi-LSTM is connected to the Tx input time steps, and the post-attention LSTM to the Ty output time steps.

  • The post-attention LSTM passes s<t>, c<t> from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-activation sequence model, so the state captured by the RNN was the output activation s<t>. But since we are using an LSTM here, the LSTM has both the output activation s<t> and the hidden cell state c<t>. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-activation LSTM at time t will not take the specific generated y<t-1> as input; it only takes s<t> and c<t> as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.

    Translation: s<t> is the hidden-state activation of the post-attention LSTM. Its activation s and memory cell c are initialized the same way as in an ordinary LSTM (usually with zeros), while the per-step input comes from the Attention layer's computation. The previous prediction y<t-1> is not fed back in because, in this date-translation example, adjacent output characters are not strongly correlated.

  • We use a<t> = [→a<t> ; ←a<t>] to represent the concatenation of the activations of both the forward-direction and backward-direction of the pre-attention Bi-LSTM.

    Translation: this is just the notation for the bidirectional RNN — each a<t> concatenates the forward-pass and backward-pass activations.

  • The diagram on the right uses a RepeatVector node to copy s<t-1>'s value Tx times, and then Concatenation to concatenate s<t-1> and a<t'> to compute e<t,t'>, which is then passed through a softmax to compute α<t,t'>. We'll explain how to use RepeatVector and Concatenation in Keras below.

    Translation: In the Attention layer, s<t-1> is the previous hidden state of the post-attention (Decoder) LSTM. It is copied Tx times so that it can be concatenated with the Tx Encoder outputs a<1>, ..., a<Tx> (similar to building an augmented matrix), and the concatenated result is the input to the Attention layer; a small Keras sketch of this RepeatVector/Concatenate step follows this list.
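As a minimal Keras sketch of just the RepeatVector + Concatenate step described in the last item above (using tensorflow.keras here; the sizes Tx, n_a, n_s are assumptions for illustration, not taken from the notebook):

```python
from tensorflow.keras.layers import Input, RepeatVector, Concatenate
from tensorflow.keras.models import Model

Tx, n_a, n_s = 30, 32, 64                   # assumed: input length, Bi-LSTM units, decoder units

a = Input(shape=(Tx, 2 * n_a))              # all Tx encoder activations, shape (Tx, 2*n_a)
s_prev = Input(shape=(n_s,))                # decoder's previous hidden state s<t-1>

s_rep = RepeatVector(Tx)(s_prev)            # copy s<t-1> Tx times -> (Tx, n_s)
concat = Concatenate(axis=-1)([a, s_rep])   # concatenate with a -> (Tx, 2*n_a + n_s)

Model(inputs=[a, s_prev], outputs=concat).summary()
```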

Model implementation: implement two functions, one_step_attention() and model().

  • one_step_attention(): At step t, given all the hidden states of the Bi-LSTM ([a<1>, a<2>, ..., a<Tx>]) and the previous hidden state of the second LSTM (s<t-1>), one_step_attention() will compute the attention weights ([α<t,1>, α<t,2>, ..., α<t,Tx>]) and output the context vector (see Figure 1 (right) for details):

    (1)  context<t> = Σ_{t'=0}^{Tx} α<t,t'> · a<t'>

    Note that we are denoting the attention in this notebook context<t>. In the lecture videos, the context was denoted c<t>, but here we are calling it context<t> to avoid confusion with the (post-attention) LSTM's internal memory cell variable, which is sometimes also denoted c<t>.

    Translation: At each output time step t, the previous hidden state s<t-1> of the post-attention (Decoder) LSTM is copied Tx times (it is t-1 rather than t because the context for step t is computed before the Decoder produces s<t>) and concatenated with all of the Encoder layer's outputs as the input of the Attention layer. The computation yields Tx attention weights α; each α is multiplied with the corresponding Encoder output (acting as that word's attention weight), and the sum of the products is the context vector, which is the input to the Decoder layer at this step. A Keras sketch of this function is given after this list.

    [Figure: the one_step_attention computation]

  • model(): Implements the entire model. It first runs the input through a Bi-LSTM to get back [a<1>, a<2>, ..., a<Tx>]. Then, it calls one_step_attention() Ty times (in a for loop). At each iteration of this loop, it gives the computed context vector c<t> to the second LSTM, and runs the output of the LSTM through a dense layer with softmax activation to generate a prediction ŷ<t>.

    Translation: In the model function, the Encoder layer is computed first; its cached outputs are then used for the Decoder layer's computation. The final output is obtained after Ty time steps, and the Attention layer is called once at every time step (each call uses the cached Encoder outputs from all Tx time steps). Sketches of both functions follow below.
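Below is a hedged Keras sketch of one_step_attention() along the lines described above. It follows the shape of the computation in this list but is not the notebook's exact code; the shared-layer sizes (the Dense(10) energy network, n_a = 32, n_s = 64) are assumptions.

```python
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx, n_a, n_s = 30, 32, 64       # assumed sizes: input length, Bi-LSTM units, decoder units

# Shared layers: created once so the same weights are reused at every output step t
repeator     = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1      = Dense(10, activation="tanh")   # small network computing the energies e<t,t'>
densor2      = Dense(1, activation="relu")
activator    = Softmax(axis=1)                # softmax over the Tx positions -> alpha<t,t'>
dotor        = Dot(axes=1)                    # weighted sum over the Tx axis

def one_step_attention(a, s_prev):
    """a: (m, Tx, 2*n_a) encoder activations; s_prev: (m, n_s) previous decoder hidden state.
    Returns context<t> with shape (m, 1, 2*n_a)."""
    s_prev   = repeator(s_prev)               # (m, Tx, n_s): copy s<t-1> Tx times
    concat   = concatenator([a, s_prev])      # (m, Tx, 2*n_a + n_s)
    e        = densor1(concat)                # (m, Tx, 10)
    energies = densor2(e)                     # (m, Tx, 1)
    alphas   = activator(energies)            # (m, Tx, 1): attention weights, sum to 1 over Tx
    context  = dotor([alphas, a])             # (m, 1, 2*n_a): sum over t' of alpha<t,t'> * a<t'>
    return context
```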
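And a sketch of model() in the same spirit, reusing the one_step_attention() defined in the sketch above (the function name nmt_model and all sizes in the example instantiation are assumptions; note that the Tx passed here must match the Tx baked into the shared repeator layer above):

```python
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

def nmt_model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    # Shared decoder-side layers: created once, reused at every one of the Ty output steps
    post_attention_LSTM_cell = LSTM(n_s, return_state=True)
    output_layer = Dense(machine_vocab_size, activation="softmax")

    X  = Input(shape=(Tx, human_vocab_size))   # one-hot encoded input sequence
    s0 = Input(shape=(n_s,))                   # initial decoder hidden state s<0> (usually zeros)
    c0 = Input(shape=(n_s,))                   # initial decoder cell state c<0> (usually zeros)
    s, c = s0, c0
    outputs = []

    # Encoder: pre-attention Bi-LSTM over the Tx input time steps
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)

    # Decoder: Ty steps; each step attends over all Tx cached encoder activations
    for t in range(Ty):
        context = one_step_attention(a, s)                                # (m, 1, 2*n_a)
        s, _, c = post_attention_LSTM_cell(context, initial_state=[s, c])
        outputs.append(output_layer(s))                                   # prediction y_hat<t>

    return Model(inputs=[X, s0, c0], outputs=outputs)

# Example instantiation with assumed sizes:
model = nmt_model(Tx=30, Ty=10, n_a=32, n_s=64, human_vocab_size=37, machine_vocab_size=11)
model.summary()
```

When training such a model, the targets would need to be supplied as a list of Ty arrays (one per output time step), matching the list of outputs returned here.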


Reposted from blog.csdn.net/dod_jdi/article/details/79810703