Study Notes (2): Self-Attention and the Transformer

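Using an RNN as the encoder/decoder
Problems:
Sequential dependency: each step needs the previous hidden state, so the computation cannot be parallelized across time steps.
Vanishing and exploding gradients, caused by the recurrent computation. LSTM and GRU only mitigate this.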

Transformer (Attention Is All You Need)

Uses attention instead of an RNN.

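A single Encoder-Decoder pair

[Figure: a single encoder-decoder pair]
Note: in the multi-head attention layer in the middle of the decoder, the encoder output is multiplied by Wk and Wv to produce k and v, while the output of the Masked Multi-Head Attention layer is multiplied by Wq to produce q.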
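The note above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name cross_attention, the toy inputs, and the identity weight matrices are all made up here, and the scaling by sqrt(d_k) follows the scaled dot-product attention described in the paper. The point is just where q, k, and v come from.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def cross_attention(dec_states, enc_outputs, Wq, Wk, Wv):
    """Encoder-decoder attention: q from the decoder, k and v from the encoder."""
    Q = matmul(dec_states, Wq)    # queries: output of the masked attention layer
    K = matmul(enc_outputs, Wk)   # keys: encoder output
    V = matmul(enc_outputs, Wv)   # values: encoder output
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every encoder key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the encoder values.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy data (hypothetical): 3 encoder positions, 1 decoder position, 2 dimensions.
identity = [[1.0, 0.0], [0.0, 1.0]]
enc_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_states = [[1.0, 0.0]]
out = cross_attention(dec_states, enc_outputs, identity, identity, identity)
```

The output has one row per decoder position, even though the keys and values span all encoder positions.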

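Transformer structure

[Figure: the stacked encoders and decoders of the Transformer]
The first encoder takes the input sentence; every other encoder takes the output of the encoder below it. The output of the last encoder is sent to every decoder. In addition, the first decoder takes the output sequence (the "Outputs" part of the figure) as input, and every other decoder takes the output of the decoder below it.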
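The wiring described above can be traced with stub layers. This is a sketch of the data flow only: run_stack, make_encoder, and make_decoder are hypothetical names, the layers just build strings instead of computing anything, and two layers per stack are used purely for brevity.

```python
def make_encoder(name):
    """Stub encoder: records what it received instead of computing anything."""
    return lambda x: f"{name}({x})"

def make_decoder(name):
    """Stub decoder: takes the previous decoder output and the encoder memory."""
    return lambda y, memory: f"{name}({y}, {memory})"

def run_stack(src, tgt, encoders, decoders):
    # Each encoder consumes the output of the one below it.
    memory = src
    for enc in encoders:
        memory = enc(memory)
    # Every decoder sees the final encoder output; the first one also gets tgt.
    out = tgt
    for dec in decoders:
        out = dec(out, memory)
    return out

encoders = [make_encoder(f"enc{i}") for i in (1, 2)]
decoders = [make_decoder(f"dec{i}") for i in (1, 2)]
trace = run_stack("src", "tgt", encoders, decoders)
print(trace)
# dec2(dec1(tgt, enc2(enc1(src))), enc2(enc1(src)))
```

The printed trace shows the output of the last encoder appearing once per decoder, while the encoders themselves form a simple chain.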

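Structure of a single Encoder

[Figure: the inside of a single encoder]

"Thinking" and "Machines" are mapped by the embedding layer to x1 and x2.
Self-attention: the vectors go through self-attention together, so information is exchanged between positions.
Feed-forward neural network: each vector goes through the fully connected network separately, with no exchange of information between positions.
Compared with x1 and x2, which know nothing about each other, r1 and r2 already carry some information about the other word.
[Figure: attention for the word "it" in encoder #5]
When encoder #5 encodes the word "it", attention concentrates on "The Animal", and part of its representation is blended into the encoding of "it".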

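Self-Attention

[Figure: computing q, k, and v from the embeddings]
Multiplying x1 by the weight matrix WQ produces q1.
We create a query, a key, and a value projection for every word in the input sentence.
Keys and values can be thought of as paired one to one: how strongly a query matches a key determines the weight given to the corresponding value.
Note: x1 and x2 share the weight matrices WQ, WK, and WV.
The score is the dot product of q and k, divided by sqrt(d_k) before the softmax.
softmax (for scaled scores 14 and 12): e^14 / (e^14 + e^12) ≈ 0.88 for the first word, e^12 / (e^14 + e^12) ≈ 0.12 for the second.
z1 = 0.88*v1 + 0.12*v2, that is, the sum of the value vectors weighted by their softmax scores.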
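The softmax step and the weighted sum can be checked numerically. In this small sketch the scores 14 and 12 come from the note above, while the value vectors v1 and v2 are hypothetical numbers chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores of q1 against k1 and k2, as in the note.
weights = softmax([14.0, 12.0])

# Hypothetical 3-dimensional value vectors.
v1 = [1.0, 0.0, 2.0]
v2 = [0.0, 1.0, 4.0]

# z1 is the sum of the value vectors weighted by the softmax output.
z1 = [weights[0] * a + weights[1] * b for a, b in zip(v1, v2)]

print([round(w, 2) for w in weights])
# [0.88, 0.12]
```

Because the weights sum to 1, z1 always lies between v1 and v2, here much closer to v1.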

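Self-Attention in matrix form

[Figure: self-attention computed with matrices]
The first and second rows of X are x1 and x2. With Q = X*WQ, K = X*WK, and V = X*WV, the output is Z = softmax(Q*K^T / sqrt(d_k))*V.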
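A minimal sketch of the matrix form in plain Python, assuming the scaled dot-product formula from the paper, Z = softmax(Q*K^T / sqrt(d_k))*V. The matrices X, WQ, WK, and WV are made-up toy values, and the helper names are illustrative rather than taken from any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, WQ, WK, WV):
    """Return (Z, A): the attention output and the attention weights."""
    Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)  # one row of scores per query
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    A = softmax_rows(scaled)
    return matmul(A, V), A

# Toy inputs (hypothetical): 2 words, embedding size 4, d_k = d_v = 2.
X = [[1.0, 0.0, 1.0, 0.0],   # x1
     [0.0, 2.0, 0.0, 2.0]]   # x2
WQ = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
WK = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
WV = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

Z, A = self_attention(X, WQ, WK, WV)
```

Z has one row per input word, so the whole sentence is processed by a handful of matrix products instead of a loop over positions.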

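Multi-Headed Attention

The Transformer uses 8 attention heads. Each head has its own W matrices and therefore produces its own Q, K, and V, which gives 8 different Z matrices.
[Figure: the 8 attention heads and their Z matrices]
Combining the 8 attention heads: the 8 Z matrices are concatenated and multiplied by an additional weight matrix WO to give a single Z.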
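The combining step can be sketched as follows, assuming the concatenate-then-project scheme from the paper. The name combine_heads, the tiny dimensions, and the constant-valued head outputs and WO are hypothetical and chosen so the result is easy to check by hand.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def combine_heads(heads, WO):
    """Concatenate per-head outputs along the feature axis, then project with WO."""
    n_tokens = len(heads[0])
    concat = [[v for head in heads for v in head[t]] for t in range(n_tokens)]
    return matmul(concat, WO)

num_heads, d_v, d_model, n_tokens = 8, 3, 4, 2

# Hypothetical head outputs: head h is filled with the constant h + 1.
heads = [[[float(h + 1)] * d_v for _ in range(n_tokens)] for h in range(num_heads)]

# Hypothetical WO of shape (num_heads * d_v, d_model) that averages its inputs.
WO = [[1.0 / (num_heads * d_v)] * d_model for _ in range(num_heads * d_v)]

Z = combine_heads(heads, WO)  # shape: n_tokens x d_model
```

Whatever the number of heads, the projection brings the result back to d_model columns, so it can be fed to the feed-forward network like a single-head output.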

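Overview of Multi-Headed Attention

[Figure: multi-headed attention end to end]
In Encoder #1, Z is passed through the feed-forward network to give R; R then enters Encoder #2 as its X.
Of the two heads shown in orange and green, one focuses on "the animal" and the other on "tired"; multiple heads describe "it" from richer, complementary perspectives.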

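Positional Encoding

[Figure: positional encodings added to the embeddings]
The positional encoding has the same dimension as the embedding, so the two can be added.
The paper Attention Is All You Need gives the following formula for the positional encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Here pos is the position of the word in the sentence, and 2i and 2i+1 are the indices of the dimensions within the positional encoding vector.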

An example:

If the sentence length is length, pos takes integer values in [0, length-1].
Let the dimension of the positional encoding, d_model, be 4.
The formula then gives
PE(pos, 0) = sin(pos / 10000^(0/4)) = sin(pos)
PE(pos, 1) = cos(pos / 10000^(0/4)) = cos(pos)
PE(pos, 2) = sin(pos / 10000^(2/4)) = sin(pos/100)
PE(pos, 3) = cos(pos / 10000^(2/4)) = cos(pos/100)
So the positional encoding at position pos is
[sin(pos), cos(pos), sin(pos/100), cos(pos/100)]
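The worked example can be reproduced with a short function. This is a sketch assuming the sinusoidal formula from the paper; the function name positional_encoding and the choice of length 3 are illustrative only.

```python
import math

def positional_encoding(length, d_model):
    """Return a length x d_model table of sinusoidal positional encodings."""
    pe = []
    for pos in range(length):
        row = []
        for j in range(d_model):
            i = j // 2  # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sin, odd dimensions use cos.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(length=3, d_model=4)
# pe[pos] corresponds to [sin(pos), cos(pos), sin(pos/100), cos(pos/100)]
```

For pos = 0 every sine term is 0 and every cosine term is 1, which gives a quick sanity check on the indexing.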