Transformer model study notes

Foreword

The Google Research folks said, roughly: before the Transformer, tasks such as machine translation (in the paper's original wording, transduction models) were built as encoder-decoder models with a recurrent neural network (RNN) or convolutional neural network (CNN) as the basic unit. The results were good, but there was clearly a lot of room for improvement. With this much money, why not work on something new instead of grinding 996? Hence the Transformer model.

Review

A review of the overall line of thinking: why would anyone come up with something like the Transformer?
Want to do machine translation?
-> Then use a model with an encoder-decoder structure; seq2seq is the most common encoder-decoder model -> inside the model, use an RNN or CNN as the basic building block.
After training we find that long sentences are not remembered well: the model cannot retain information from earlier in the sentence, and the gradients vanish??
-> Use the LSTM, a variant of the RNN.
Translation quality is still not great? Different input words do not get their different influence and importance reflected in the output (this is the so-called "distracted model")?
-> Add an attention mechanism, so that each word attends to the information it should care about most, i.e. turn the "distracted model" into a "model with attention" (analogy: when listening to a lecture, let your mind wander with a purpose, rather than drifting off without restraint or listening intently from start to finish).
Any improvement over seq2seq?
-> In the original seq2seq model, the decoder's information comes entirely from the vector produced at the encoder's last time step; obviously a fixed-length vector can only record a limited amount of information. After adding attention, every word in the encoder carries additional information, and the decoder feeds this additional information into the model.

Now the quality is acceptable, but the attention mechanism still cannot be computed in parallel, so training is still very slow. How do we fix that (setting aside, for now, the attention relationship between the input sentence and the target sentence)?
-> The Transformer model!! In the Transformer, the self-attention mechanism allows parallel computation, and it replaces the seq2seq structure.

Reference material

Original paper: Attention Is All You Need
English walkthrough -> https://jalammar.github.io/illustrated-transformer/
Chinese walkthrough -> https://blog.csdn.net/yujianmin1990/article/details/85221271

Reading

1. A high-level look: roughly what the Transformer looks like

The whole model consists of an encoder part and a decoder part; as usual, the input goes into the encoder and the output comes out of the decoder.
[Figure]
Looking slightly more carefully, it looks like this:
[Figure]
Note that:
The 6 encoders have identical structure but do not share parameters.
The 6 decoders likewise have identical structure but do not share parameters.

With that read, now look at the concrete structure of each encoder ↓
[Figure]
The first layer inside the encoder is called self-attention. After the input vectors enter the encoder, they first go through the self-attention layer; this layer helps the encoder, while encoding (processing) a particular word, also attend to the information carried by the other words in the input sentence.
The output of the self-attention layer then flows into a feed-forward network layer (feed-forward neural network); the feed-forward network at each input position is independent and does not interfere with the others.
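As a rough structural sketch of one encoder layer (self-attention followed by a position-wise feed-forward network), here is a minimal, illustrative PyTorch version. It is not the paper's exact configuration: the class name, the sizes, the use of nn.MultiheadAttention, and the omission of Add & Norm (covered later) are all simplifications.

import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # self-attention sublayer: every word attends to all words in the sentence
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # feed-forward sublayer: applied independently at each position
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):               # x: (batch, seq_len, d_model)
        x, _ = self.self_attn(x, x, x)  # self-attention output
        return self.feed_forward(x)     # position-wise feed-forward output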

The decoder looks similar to the encoder and is structured as follows:
[Figure]
OK, that is the roughest outline of the whole Transformer.
The next section goes into the details of each component.

2. A detailed look: the specific structure, and what the input looks like

  • Let's start with the encoder part.
    In NLP, once we get a word, in most cases we convert it into a vector.
    Assume each word has already been converted into a 1*4 vector (also called a tensor), as in the figure below (in real settings the size is of course not that small; it is usually 1*512 or 1*1024).
    [Figure]
    These vectors form a list [x1, x2, x3, …]. The length of the list is a configurable hyperparameter, usually the length of the longest sentence in the training set (shorter sentences are padded at the end with [PAD]).

[Figure]
Here we can see a key property of the Transformer: the word at each position flows only through its own encoder path. In the self-attention layer these paths are pairwise dependent on one another. Here comes one of the key points!!! The feed-forward layer does not have these dependencies, so these paths can be executed in parallel when flowing through the feed-forward network. (Recall what was said earlier about why the Transformer can achieve parallel computation.)

[Figure]
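As a tiny sketch of this input preparation, the snippet below turns word IDs into a padded list of vectors; the vocabulary size, the toy d_model of 4, and the choice of index 0 for [PAD] are all made up for illustration.

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 1000, 4, 6   # d_model=4 matches the toy 1*4 vectors above
embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)  # index 0 stands in for [PAD]

word_ids = torch.tensor([[5, 23, 7, 0, 0, 0]])  # a 3-word sentence padded to max_len with [PAD]=0
x = embedding(word_ids)                         # shape (1, 6, 4): the list [x1, x2, x3, ...]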

3. The self-attention layer

What does self-attention improve over ordinary attention? Borrowing someone else's example, take this sentence:

The animal didn’t cross the street because it was too tired

Ordinary attention cannot capture the association between "it" and "animal", because the decoder draws on information from the encoder, but no association is modeled inside the encoder itself.
With self-attention, the encoder can draw on information from its own sentence, i.e. it can capture the relation between "it" and "animal" (I am using my own information).

Self-attention means K = V = Q: given an input sentence, every word in it performs an attention computation with all the words of that same sentence. The goal is to learn the word dependencies inside the sentence and capture its internal structure.

To emphasize once more why self-attention is needed: it achieves the parallel computation that seq2seq cannot, and it structurally replaces seq2seq.

How exactly is self-attention computed ===
(Someone else explained this part quite clearly, so it is copied here with only small edits and annotations.)
Let's first look at how self-attention is computed with individual vectors, and then how it is computed in matrix form.

Simple vector computation:
Step 1: from each of the encoder's input vectors, generate three vectors. That is, for each word vector, produce a query vector (query-vec), a key vector (key-vec), and a value vector (value-vec), by multiplying the word vector by three matrices that are learned during training. [Note: each word vector does not get its own 3 matrices; all inputs share the same 3 projection matrices. The weight matrices are projection matrices applied at every input position. Something worth trying: would it work even better if each word had its own projection matrices?]
Note that these new vectors have a smaller dimension than the input word vectors (512 -> 64). They do not have to be smaller; this is done to keep the multi-head attention computation more stable.

See the figure: queries are our query vectors, keys are the key vectors, and values are the value vectors; q1, k1, and v1 are all derived from the same input vector x1.
[Figure]
Step 2: computing attention means computing a score. Take the sentence "Thinking Machines" and compute the attention scores for "Thinking". We need a score between every word and "Thinking"; this score determines, when encoding "Thinking" (i.e. at that fixed position), how much focus to put on each input word.
The score is obtained by taking the dot product of the query vector of "Thinking" with the key vector of each word in turn. So when we process position #1, the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2.
[Figure]
Steps 3 and 4: divide by 8 (√64 = 8, where 64 is the dimension of the key vectors in the original paper).
According to the paper's authors, this makes the gradients more stable. Then apply a softmax to normalize.
[Figure]
The softmax scores determine how strongly each word is expressed (attended to) at this position. Naturally the word at this position itself gets the highest normalized score, but most of the time it also helps to attend to other words related to it.
Step 5: multiply each value vector by its softmax score. This keeps the values of the words we focus on and drowns out the values of irrelevant words.
Step 6: sum all the weighted value vectors to produce the self-attention output at this position.
[Figure] That is, z1 = v1*0.88 + v2*0.12.
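A minimal sketch of steps 2-6 for position #1 of a two-word sentence; the vectors here are random placeholders rather than the values in the figures.

import torch
import torch.nn.functional as F

d_k = 64
q1 = torch.randn(d_k)                          # query for word #1 ("Thinking")
k1, k2 = torch.randn(d_k), torch.randn(d_k)    # keys for words #1 and #2
v1, v2 = torch.randn(d_k), torch.randn(d_k)    # values for words #1 and #2

scores = torch.stack([q1 @ k1, q1 @ k2])          # step 2: dot products of q1 with every key
weights = F.softmax(scores / d_k ** 0.5, dim=-1)  # steps 3-4: divide by sqrt(64)=8, then softmax
z1 = weights[0] * v1 + weights[1] * v2            # steps 5-6: weight the values and sum them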

The computation in practice:
In practice this becomes matrix computation.
First, compute our query, key, and value matrices: pack all the input word vectors into an input matrix X and multiply it by the weight matrices Wq, Wk, and Wv respectively.
[Figure]
Then merge steps 2-6 into one formula that computes the output of the self-attention layer.
[Figure]
Here Q (the query matrix) times K^T (the transpose of the key matrix) gives a 2*2 matrix; that 2*2 matrix times V (the 2*3 value matrix) gives Z (the 2*3 output). The weighted vector sum described in the previous subsection is carried out automatically in the multiplication with V: the first row of Z is z1 from the previous subsection, and the second row is z2.
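A minimal sketch of this matrix form, softmax(Q K^T / sqrt(d_k)) V; the shapes follow the figures (2 words, projection size 3), and the projection matrices are random placeholders for the learned Wq, Wk, Wv.

import torch
import torch.nn.functional as F

X = torch.randn(2, 4)                      # 2 word vectors of size 4, stacked into the input matrix X
Wq, Wk, Wv = (torch.randn(4, 3) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv           # each is 2*3

d_k = K.size(-1)
Z = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V   # (2*2 attention weights) times (2*3 values) -> 2*3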

4. The multi-head mechanism (Multi-head)

What are the benefits of the multi-head mechanism?

  1. It improves the model's ability to attend to different positions.
    With a single attention head, the model may only manage to focus on one or two words; multi-head attention gives the model the chance to attend to more of the other words.

For example, take the sentence "I am eating at the KFC on the Bund".
Ask: "Where are you eating?" and "What are you eating?" Clearly these two questions require attending to different parts of the sentence.

2. It provides multiple "representation subspaces". As shown in the figure below, we set up 8 sets of query/key/value matrices (24 matrices in total = 8*3), each randomly initialized on its own. After training, the input vectors can be projected into different representation subspaces.
[Figure]
Difference: how is it done in BERT? A single Q, K, and V are initialized and then passed through (multiplied by) 8 different linear layers, whose parameters are randomly initialized. This achieves the same goal.
[Figure]
Self-attention is computed exactly as in the previous section, just 8 times, giving 8 output matrices z0, z1, z2, …, z7.
[Figure]
Since the feed-forward network cannot take eight matrices and expects a single matrix as input, the eight matrices are merged into one (a concat operation), as in the code sketch at the end of this section:
1) Concatenate the eight matrices.
2) Multiply the result by a weight matrix W0, which is trained along with the model.
3) With the sizes in the figure below, the 2*24 matrix times the 24*4 matrix gives the Z matrix (a 2*4 matrix), which is sent into the next step, the feed-forward layer.
[Figure]
The figure below shows the complete flow.
[Figure]
Note that the eight sets of Q/K/V could, in principle, end up identical after training.
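A minimal sketch of the concat + W0 step above; the sizes (8 heads, 2 words, head output size 3, model size 4) follow the figures, and the per-head outputs are faked with random tensors.

import torch

num_heads, seq_len, d_head, d_model = 8, 2, 3, 4
zs = [torch.randn(seq_len, d_head) for _ in range(num_heads)]  # z0 ... z7 from the 8 heads

Z_cat = torch.cat(zs, dim=-1)                  # 1) concatenate -> 2*24
W0 = torch.randn(num_heads * d_head, d_model)  # 2) 24*4 weight matrix, trained in practice
Z = Z_cat @ W0                                 # 3) 2*24 times 24*4 -> 2*4, fed to the feed-forward layer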

5. Word order in the input sentence (Positional Encoding)

Since this model uses neither recurrence nor convolution, we need extra information to express word order.
The Transformer adds an extra vector to each input word embedding (the small green x matrices shown earlier); these vectors follow a specific pattern that the model learns. They help a word determine its own position, or the distance between different words in the sentence. Intuitively, adding these vectors to the word embeddings provides meaningful distances between the embeddings once they are projected into the Q/K/V matrices and used in the attention dot products.
In plain words: before the word embeddings are fed into the encoder, a positional encoding is added to them so that they carry some position information.
[Figure]
The positional encoding looks roughly like this ↓ *(the numbers in the figure are questionable; they do not match what the formula gives, though I may have miscalculated)*
[Figure]
So how is this thing generated?
The paper uses the following formulas:
PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

In the formulas, pos is the position of the word in the input sentence (i.e. in the input sequence), and i indexes the dimensions of the word vector, so 2i denotes the even dimensions and 2i+1 the odd ones. d_model is the size of the word embedding (4 in the figure above).
PE is therefore a 2-D matrix whose number of columns equals the embedding size and whose number of rows equals the number of words in the input sentence (i.e. the sequence length). For example, PE(0,3) is the 4th component of the positional encoding of the first word in the sentence.
So the formulas above say: add a sine term at the even dimensions of each word's vector, and a cosine term at the odd dimensions. Note that i starts from 0.
Explanation reference: https://blog.csdn.net/Flying_sfeng/article/details/100996524
Code implementation:

import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0.0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # sine at the even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # cosine at the odd dimensions
        pe = pe.unsqueeze(0)                          # shape (1, max_len, d_model)
        self.register_buffer('pe', pe)                # stored with the model but not trained

    def forward(self, x):
        # Add the (non-trainable) positional encoding, truncated to the sequence length.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
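A hypothetical quick check of the module above, with made-up sizes (a batch of 2 sentences, 10 words each, embedding size 512):

pos_enc = PositionalEncoding(d_model=512, dropout=0.1)
embeddings = torch.randn(2, 10, 512)   # pretend word embeddings
out = pos_enc(embeddings)              # same shape (2, 10, 512), now with position info added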

6. The remaining pieces

To be more specific, every self-attention layer and every feed-forward layer is actually followed by a normalization step.
[Figure]
Looking at the Add & Normalize layer in a bit more detail, it looks like this: X and Z are added together, then a layer normalization is applied:
[Figure]
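A minimal sketch of this Add & Norm step: a residual connection followed by LayerNorm. The names and sizes here are illustrative; "sublayer_output" stands for Z (from self-attention) or for the feed-forward output.

import torch
import torch.nn as nn

d_model = 4
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(2, d_model)                # input to the sublayer (X)
sublayer_output = torch.randn(2, d_model)  # output of the sublayer (Z)
out = layer_norm(x + sublayer_output)      # Add (residual connection) & Normalize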
The same happens in the decoder:
[Figure]
That is all for the encoder part ================================

7. The decoder (Decoder)

See the GIF:
[Figure]
After encoding comes the decoding process; at each decoding step, one element of the output sequence is produced, as the animation makes quite clear.
[Figure]
The self-attention layer in the decoder is slightly different from the one in the encoder: in the decoder, the self-attention layer is only allowed to attend to positions earlier than the current output position. This is achieved by blocking the future positions before the softmax (setting them to -inf, i.e. masking).
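A minimal sketch of this masking: future positions are set to -inf before the softmax, so their attention weights become 0 after it (the sequence length is made up).

import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)            # stands in for Q K^T / sqrt(d_k) in the decoder

mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()  # True above the diagonal = future
scores = scores.masked_fill(mask, float('-inf'))  # block future positions
weights = F.softmax(scores, dim=-1)               # each row attends only to current and earlier positions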

8. Finally, the output layer

The decoder output then goes through a linear layer and a softmax layer. The linear layer is a fully connected (FC) layer that produces a score for every word in the vocabulary; the softmax turns these scores into probabilities, and the word with the highest probability is taken as the output.
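A minimal sketch of this final step; the model size and vocabulary size are made up, and in practice the linear layer's weights are learned rather than random.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000
linear = nn.Linear(d_model, vocab_size)     # the fully connected layer

decoder_out = torch.randn(1, d_model)       # decoder output for one position
probs = F.softmax(linear(decoder_out), dim=-1)
predicted_word_id = probs.argmax(dim=-1)    # index of the highest-probability word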
