Transformer (a hands-on deep learning walkthrough)

The Transformer was proposed by a Google team in 2017, in the paper "Attention Is All You Need" by Ashish Vaswani et al., and it is now used very widely in NLP. I recently watched Professor Hung-yi Lee's video lectures and gained a basic understanding of the Transformer. On machine translation tasks the Transformer outperforms RNNs and CNNs: using only an encoder-decoder structure and an attention mechanism it achieves very good results, and its biggest advantage is that it can be parallelized efficiently. So why can the Transformer be parallelized so efficiently? To answer that, we have to start from its architecture.
In fact, the Transformer is simply a seq2seq model equipped with a self-attention mechanism. You are probably familiar with seq2seq models: the input is a sequence and the output is also a sequence. The figure below shows the Transformer architecture:
transformer
Let's imagine replacing self-attention with an RNN. To output b2, the RNN must first read the previous output b1; to output b3 it must first read b1 and b2; and in general, producing the current output requires feeding in all previous outputs. So if an RNN is used in a seq2seq model, the output sequence cannot be produced efficiently in parallel. Some people will point out that a CNN can parallelize the output sequence, and a CNN can indeed process the sequence in parallel, but its receptive field is limited: it cannot properly take the surrounding context into account, or rather it needs many layers before it can see relationships across the sequence, so CNNs also have their shortcomings. With all that groundwork laid, we can now look at the self-attention architecture in detail.
First, suppose we have an input sequence of four elements x1, x2, x3 and x4. We begin with a weight multiplication, a^i = W x^i, to embed each input. Each a^i is then multiplied by three matrices to give a query q^i = W^q a^i, a key k^i = W^k a^i and a value v^i = W^v a^i. To compute the first output, q1 is matched against every key to obtain attention scores α (scaled and passed through a softmax): positions that are more relevant to the query get a high α value, and less relevant ones the opposite. Weighting the values by these scores and summing them, attention with q1 gives the whole output b1; in the same way we can use q2, q3 and q4 to do attention and obtain b2, b3 and b4. Moreover, b1, b2, b3 and b4 are computed in parallel, with no ordering dependence between them. The entire intermediate computation can therefore be viewed as a self-attention layer whose inputs are x1, x2, x3 and x4 and whose outputs are b1, b2, b3 and b4, as shown in the figure:
transformer
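Below is a minimal NumPy sketch of the computation just described; the dimensions and the random matrices (W, W_q, W_k, W_v) are illustrative assumptions, not values from the post, but the flow matches the figures: all four outputs come out of a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_model = 8, 16                        # input and embedding sizes (assumed)
x = rng.standard_normal((4, d_in))           # the four inputs x1..x4, one per row

W   = rng.standard_normal((d_in, d_model))   # a^i = W x^i
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

a = x @ W                                    # embeddings a1..a4
q, k, v = a @ W_q, a @ W_k, a @ W_v          # queries, keys, values

# Every query attends to every key at once, so b1..b4 are produced
# by one matrix product -- this is the parallelism the post talks about.
scores = q @ k.T / np.sqrt(d_model)
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)   # softmax over each row

b = alpha @ v                                # rows are b1, b2, b3, b4
print(b.shape)                               # (4, 16)
```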
From the flow charts above we can clearly see how the whole self-attention process is carried out. The picture below shows even more clearly how the whole operation proceeds:
transformer
Having walked through the whole process, this simplified view of self-attention should now be quite clear. Once we understand how self-attention works, we can plug the mechanism into a seq2seq model: a seq2seq model has an encoder and a decoder, and all we need to do is replace the original intermediate layers with self-attention layers. Next, let's look at what the whole Transformer architecture looks like:

transformer
As the figure shows, the whole Transformer architecture can be split into two halves: the left half is the encoder and the right half is the decoder. Let's look at the encoder first. The input comes in from the bottom and goes through an embedding operation, and the embedded vectors are then given a positional encoding. So what is positional encoding? It simply means adding a vector e^i of the same dimension to a^i, as shown below:
transformer
As shown in the figure, when generating q, k and v we add a vector e^i of the same dimension to a^i. This e^i is set by hand: each a^i gets a different e^i, and e^i carries the position information, so every position has its own distinct e^i. After the positional encoding come a multi-head attention and an add & norm operation. Multi-head attention simply means generating several sets of q, k and v at the same time and doing attention with each set separately; the benefit is that more contextual information can be captured, since each head can focus on its own aspects of the sequence.
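The post only says that e^i is hand-crafted; the sinusoidal encoding used in the paper is one common concrete choice. The sketch below, with assumed sizes (d_model = 64, 4 heads, a length-10 sequence), adds such an encoding to the embeddings and runs them through PyTorch's nn.MultiheadAttention.

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 64, 4, 10      # illustrative sizes (assumed)

def sinusoidal_encoding(seq_len, d_model):
    # e^i has the same dimension as a^i; every position gets a distinct vector.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (dim / d_model))
    e = torch.zeros(seq_len, d_model)
    e[:, 0::2] = torch.sin(angles)
    e[:, 1::2] = torch.cos(angles)
    return e

a = torch.randn(1, seq_len, d_model)          # embeddings a^i (batch of 1)
x = a + sinusoidal_encoding(seq_len, d_model) # positional encoding: a^i + e^i

# Multi-head attention: num_heads sets of q, k, v attend in parallel,
# each head looking at the sequence from its own subspace.
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
out, weights = mha(x, x, x)                   # self-attention: q = k = v = x
print(out.shape)                              # torch.Size([1, 10, 64])
```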
The add & norm operation then adds the input and output of the multi-head attention together and applies layer normalization. At this point a question may come up: what exactly is layer normalization? It is essentially the opposite of batch normalization: batch normalization normalizes each dimension across the samples in a batch, whereas layer normalization normalizes across all the dimensions of each individual sample.
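To make the contrast concrete, here is a small sketch (with an assumed batch of 8 samples and 16 features) showing the two normalizations side by side using PyTorch's built-in layers.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)          # 8 samples in a batch, 16 features each (assumed)

# BatchNorm: for each of the 16 features, normalize across the 8 samples.
bn = nn.BatchNorm1d(16)
x_bn = bn(x)                    # mean/std taken over the batch dimension

# LayerNorm: for each of the 8 samples, normalize across its 16 features.
ln = nn.LayerNorm(16)
x_ln = ln(x)                    # mean/std taken over the feature dimension

# In an add & norm step, the residual sum of a sublayer's input and output
# is passed through LayerNorm, e.g. ln(x + sublayer(x)).
print(x_bn.shape, x_ln.shape)   # torch.Size([8, 16]) torch.Size([8, 16])
```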
Next come a feed-forward and another add & norm operation. The feed-forward layer operates on every position of the input sequence, and is again followed by an add & norm. With that, the encoder's work is done.
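A minimal sketch of this sublayer, assuming d_model = 64 and a hidden width of 256 (the paper uses 4 × d_model): the feed-forward network is applied to every position independently and is wrapped in add & norm.

```python
import torch
import torch.nn as nn

d_model, d_ff, seq_len = 64, 256, 10     # illustrative sizes (assumed)

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
norm = nn.LayerNorm(d_model)

x = torch.randn(1, seq_len, d_model)     # output of the attention sublayer
y = norm(x + feed_forward(x))            # add & norm around the feed-forward
print(y.shape)                           # torch.Size([1, 10, 64])
```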
Now let's look at the decoder. The output of the previous step is fed into the decoder as input. Again there is first a positional encoding, followed by a masked multi-head attention operation: masked multi-head attention means attending only over the sequence that has already been generated. This is followed by an add & norm; then the encoder's output and the result of the previous add & norm go through another multi-head attention and add & norm; finally there is one more feed-forward and add & norm. This whole block can be repeated N times.
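Here is a sketch of the two attention steps inside one decoder layer, under assumed sizes: a causal (upper-triangular) mask keeps each position from attending to later positions, and the second attention uses the encoder output as keys and values.

```python
import torch
import torch.nn as nn

d_model, num_heads, tgt_len, src_len = 64, 4, 6, 10   # illustrative sizes (assumed)

tgt = torch.randn(1, tgt_len, d_model)      # embeddings of previously generated outputs
memory = torch.randn(1, src_len, d_model)   # encoder output

self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

# Causal mask: position i may only attend to positions <= i (True = blocked).
mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

h, _ = self_attn(tgt, tgt, tgt, attn_mask=mask)   # masked multi-head attention
h = norm1(tgt + h)                                # add & norm

h2, _ = cross_attn(h, memory, memory)             # attend to the encoder output
h = norm2(h + h2)                                 # add & norm
print(h.shape)                                    # torch.Size([1, 6, 64])
```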
At the output, the result passes through a linear layer and a softmax layer to produce the final prediction. That is the complete Transformer architecture. Taken as a whole it looks quite complex, but by understanding each part and building up from the pieces to the whole, the architecture becomes very clear.
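As a wrap-up, the whole stack can be sketched with PyTorch's built-in nn.Transformer plus a linear layer and a softmax over an assumed vocabulary of 1000 tokens; all sizes here are illustrative, not values from the post.

```python
import torch
import torch.nn as nn

d_model, vocab_size, src_len, tgt_len = 64, 1000, 10, 6   # assumed sizes

# N stacked encoder and decoder layers (N = 2 here), as in the architecture figure.
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=256, batch_first=True)
out_proj = nn.Linear(d_model, vocab_size)

src = torch.randn(1, src_len, d_model)   # encoder input (already embedded + positional encoding)
tgt = torch.randn(1, tgt_len, d_model)   # decoder input (previous outputs)

tgt_mask = model.generate_square_subsequent_mask(tgt_len)   # causal mask for the decoder
dec = model(src, tgt, tgt_mask=tgt_mask)                    # encoder + decoder stacks
probs = torch.softmax(out_proj(dec), dim=-1)                # distribution over the vocabulary
print(probs.shape)                                          # torch.Size([1, 6, 1000])
```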


