AutoCV extra: Transformer

Transformer

Notes

1. Update on May 16, 2023

Added the hands-on Transformer part, a handwritten from-scratch reproduction, i.e., the content of Section 3

Foreword

In the courses that follow AutoCV you will need to learn BEVFormer, so let's first build a basic understanding of the Transformer.

The content on Self-attention and the Transformer comes from Li Hongyi's video lectures

Video links: [Machine Learning 2021] Self-attention mechanism (Self-attention), [Machine Learning 2021] Transformer


1. Self-attention

Before talking about Transformer, we need to talk about Self-attention.

The following content comes from Li Hongyi's video lecture on Self-attention

Video link: [Machine Learning 2021] Self-attention mechanism (self-attention)

1.1 Prerequisite knowledge

Self-attention handles inputs whose number of vectors, i.e. the sequence length, varies from sample to sample. So which real-world applications take a vector set as input?

Figure 1-1 Complex input scenarios

Figure 1-1 contrasts two cases: a simple vector as input versus a complex vector set as input. Common applications with a simple vector input include, say, predicting Shanghai housing prices for 2023, or deciding which animal appears in a given image. Applications whose input is a vector set include chatbots, where the user asks a different question each time, as well as machine translation and sentiment analysis.

In text processing the input is often a sentence or a paragraph. For instance, when a user asks chatGPT a question, the question differs every time, so the input differs every time. Self-attention is designed to handle exactly this kind of input.

Before formally introducing self-attention, note that, to make processing convenient for the computer and to let it capture the semantic relationships between words, we usually represent each word as a vector. So how do we describe a word as a vector?

Common approaches are one-hot encoding and Word Embedding, as shown in Figure 1-2

Figure 1-2 Representing a word as a vector

One-hot encoding

  • The first idea is probably one-hot encoding: consider all the words, say the 3000 common ones, and encode each word as a 3000-dimensional vector with a 1 at that word's position and 0 everywhere else
  • This method has a serious problem: it assumes all words are unrelated, so it cannot express semantic similarity. From a one-hot encoding you cannot tell that cat and dog are both animals and thus similar, or that cat and apple, one an animal and one a plant, are dissimilar. The vector carries no semantic information, and its very high dimension also makes computation and storage expensive

Word Embedding

  • Word embedding assigns a vector to every word, so a sentence becomes a sequence of vectors whose length varies. The concrete implementation is not the focus here; you can study it on your own
  • More details: https://www.youtube.com/watch?v=X7PH3NuYW0Q

Given that the input is a vector set, the output can take the following forms, see Figure 1-3

  • Each vector has a label
    • Every input vector has a corresponding label, e.g., 4 input vectors produce 4 labels
    • For example, in POS tagging (part-of-speech tagging), each word is assigned its part of speech: for I saw a saw the corresponding output is N V DET N
  • The whole sequence has a label
    • The entire sequence has only one label
    • For example, in Sentiment analysis the machine must say whether a paragraph is positive or negative
  • Model decides the number of labels itself
    • The model decides by itself how many labels to output
    • Sequence-to-sequence tasks, such as chatGPT

Figure 1-3 Possible output

For now we only consider the first case, where each vector has a label. This task is also called Sequence Labeling. So how do we solve it?

Figure 1-4 Sequence labeling solution

An obvious idea is to feed each vector into a fully connected network and handle them one by one, as shown in Figure 1-4, but this has a big problem. Take the POS tagging of I saw a saw: to the fully connected network the first saw and the second saw are exactly the same input, the same word, so there is no reason for the network to give different outputs. Yet you want the first saw to be tagged as a verb and the second as a noun, which is impossible for the FC network.

Then what should we do? Can we make the FC network take context information into account?

Yes: we can concatenate the neighbouring vectors on both sides and feed them together into the Fully-Connected Network, as shown in Figure 1-5


Figure 1-5 sequence labeling solution optimization

So we can give the Fully-Connected Network the information of the entire window, so that it can consider some context and consider the information of other vectors adjacent to the current vector.

But if we have a task that can be solved not by considering the context of a window, but by considering the context of an entire sequence , what should we do?

Some may think this is easy: just open the window a bit wider, wide enough to cover the entire sequence. But the sequences fed to the model may be long or short and may differ every time, so you would have to inspect your training set, find the longest sequence, and open a window larger than that. Such a large window means your Fully-connected Network needs a huge number of parameters; besides the heavy computation, it is also prone to overfitting.

So is there a better way to consider the information of the entire input sequence? This is the self-attention technology that will be introduced next.

1.2 Self-attention mechanism

Self-attention works by taking in the information of a whole sequence: however many vectors you feed in, it outputs that many vectors.

Figure 1-6 How self-attention operates

Each of the 4 vectors coming out of self-attention is obtained only after considering the entire sequence. These sequence-aware vectors are then thrown into a Fully-Connected Network to produce the final output. This way, your Fully-Connected Network no longer considers only a very small range or a small window, but the information of the whole sequence, before deciding what to output. That is self-attention.

Of course you can apply self-attention several times, alternating self-attention with Fully-Connected Networks: self-attention handles the information of the whole sequence, while the Fully-Connected Network focuses on processing a single position.

Figure 1-7 Alternating self-attention and Fully-Connected Network

The best-known article on self-attention is Attention is all you need, in which Google proposed the now-familiar Transformer network architecture, and the most important module inside the Transformer is self-attention.

After talking for so long, how does self-attention work?

The input of self-attention is a sequence of vectors, which may be the input of your whole Network or the output of some hidden layer. We write it as $\boldsymbol a$ instead of $\boldsymbol x$ to indicate that it may already have gone through some processing. Given a row of $\boldsymbol a$ vectors, self-attention outputs another row of $\boldsymbol b$ vectors, and each $\boldsymbol b$ is generated after considering all of the $\boldsymbol a$ vectors.

Figure 1-8 Internal implementation of self-attention

Next we explain how the vector $\boldsymbol b^1$ is generated. Once you know how to generate $\boldsymbol b^1$, you also know how $\boldsymbol b^2$, $\boldsymbol b^3$, $\boldsymbol b^4$ are produced. So how do we produce $\boldsymbol b^1$?

First, based on $\boldsymbol a^1$, we find the other vectors in the sequence that are relevant to $\boldsymbol a^1$. The purpose of self-attention is to consider the whole sequence without putting all the information into one window, so we have a special mechanism that finds which parts of the long sequence are important, i.e., which parts matter for deciding the label of $\boldsymbol a^1$.

The degree of relevance between each vector and $\boldsymbol a^1$ is represented by a value $\alpha$. The next question is how the self-attention module automatically decides the relevance between two vectors: given two vectors $\boldsymbol a^1$ and $\boldsymbol a^4$, how does it decide how relevant they are and assign a value $\alpha$?

Figure 1-9 Self-attention relevance mechanism

Here you need a module that computes attention: it takes two vectors as input and directly outputs the value $\alpha$, which you can treat as the degree of relevance between the two vectors. How is this $\alpha$ computed? The two most common approaches are dot-product and Additive.

dot-product

  • Multiply the two input vectors by two different matrices $W^q$ and $W^k$: the left vector is multiplied by $W^q$ and the right vector by $W^k$, giving the two vectors $\boldsymbol q$ and $\boldsymbol k$
  • Then take the dot product of $\boldsymbol q$ and $\boldsymbol k$, i.e., multiply them element-wise and sum, to get $\alpha$

Additive

  • Likewise multiply the two input vectors by two different matrices $W^q$ and $W^k$ to get $\boldsymbol q$ and $\boldsymbol k$
  • Then concatenate them, pass them through a $\tanh$ function, and apply a further transform to get $\alpha$

In short, there are many ways to compute attention, i.e., to compute this relevance value $\alpha$, but in the following discussion we will only use the dot-product method on the left; it is the most common one and the one used in the Transformer.
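As a minimal sketch of the dot-product score described above (toy dimensions and randomly initialized $W^q$, $W^k$; all names here are illustrative):

import torch

torch.manual_seed(0)
d_in, d_k = 8, 4                # toy input and query/key dimensions (assumed)
W_q = torch.randn(d_k, d_in)    # projection producing the query
W_k = torch.randn(d_k, d_in)    # projection producing the key

a1, a4 = torch.randn(d_in), torch.randn(d_in)   # two input vectors
q1 = W_q @ a1                   # q = W^q a
k4 = W_k @ a4                   # k = W^k a
alpha_14 = q1 @ k4              # dot-product attention score between vector 1 and vector 4
print(alpha_14.item())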

Figure 1-10 Computing attention

How do we apply this inside self-attention? For $\boldsymbol a^1$ we need to compute its relevance to $\boldsymbol a^2$, $\boldsymbol a^3$ and $\boldsymbol a^4$, i.e., the $\alpha$ between them. How?

You multiply $\boldsymbol a^1$ by $W^q$ to get $\boldsymbol q^1$. This $\boldsymbol q$ is called the Query, as in a search query. Then each of $\boldsymbol a^2$, $\boldsymbol a^3$, $\boldsymbol a^4$ is multiplied by $W^k$ to get a vector $\boldsymbol k$, called the Key. Taking the inner product of the Query $\boldsymbol q^1$ with the Key $\boldsymbol k^2$ gives $\alpha_{1,2}$, the relevance between vector 1 and vector 2, where the Query comes from vector 1 and the Key from vector 2. This relevance $\alpha$ is also called the attention score.

Similarly we can compute $\alpha_{1,3}$ and $\alpha_{1,4}$, as shown in Figure 1-11

Figure 1-11 Computing the attention scores

In practice $\boldsymbol q^1$ usually also computes relevance with itself, so you also multiply $\boldsymbol a^1$ by $W^k$ to get $\boldsymbol k^1$ and compute the self-relevance $\alpha_{1,1}$

Figure 1-12 Computing the attention scores (including self-relevance)

After computing the relevance between $\boldsymbol a^1$ and every vector, we apply a softmax, which outputs a set of $\alpha'$. Softmax is not mandatory here; you can try something else, such as ReLU, and that is fine too.

Figure 1-13 Attention scores after softmax

Once we have $\alpha'$, we use it to extract the important information from the sequence, i.e., we pull out what matters according to these attention scores. How?

We multiply each of $\boldsymbol a^1$ to $\boldsymbol a^4$ by $W^v$ to get new vectors, denoted $\boldsymbol v^1$, $\boldsymbol v^2$, $\boldsymbol v^3$, $\boldsymbol v^4$. Then each of $\boldsymbol v^1$ to $\boldsymbol v^4$ is multiplied by its attention score $\alpha'$, and everything is summed up to get $\boldsymbol b^1$

Figure 1-14 Extract information based on attention score

If some vector gets a higher score, say $\boldsymbol a^1$ is highly correlated with $\boldsymbol a^2$ so that $\alpha'_{1,2}$ is large, then the $\boldsymbol b^1$ obtained from the Weighted Sum will be closer to $\boldsymbol v^2$: whichever vector has the higher attention score, its $\boldsymbol v$ will dominate the result you extract.

OK! So far we have explained how to obtain $\boldsymbol b^1$ from a whole sequence

Figure 1-15 Calculation of b1

It should be emphasized that $\boldsymbol b^1$ to $\boldsymbol b^4$ do not need to be generated sequentially: you do not have to finish $\boldsymbol b^1$ before computing $\boldsymbol b^2$, then $\boldsymbol b^3$, then $\boldsymbol b^4$. $\boldsymbol b^1$ to $\boldsymbol b^4$ are all computed at once.

1.3 Understanding from the perspective of matrix multiplication

Going from $\boldsymbol a^1$–$\boldsymbol a^4$ to $\boldsymbol b^1$–$\boldsymbol b^4$ is the whole operation of self-attention. Next we retell how the self-attention we just described works, from the perspective of matrix multiplication.

We know that every $\boldsymbol a$ must generate $\boldsymbol q$, $\boldsymbol k$, $\boldsymbol v$. What does this look like as a matrix operation?
$$\boldsymbol q^{\boldsymbol i} = W^q \boldsymbol a^{\boldsymbol i}$$
Every $\boldsymbol a$ is multiplied by $W^q$ to get $\boldsymbol q$, so we can stack $\boldsymbol a^1$ to $\boldsymbol a^4$ into a matrix $I$ whose four columns are $\boldsymbol a^1$ to $\boldsymbol a^4$. Multiplying $I$ by $W^q$ gives another matrix, which we call $Q$, and $Q$ consists of $\boldsymbol q^1$ to $\boldsymbol q^4$.

Figure 1-16 Generating q from the matrix perspective

So going from $\boldsymbol a^1$–$\boldsymbol a^4$ to $\boldsymbol q^1$–$\boldsymbol q^4$ amounts to multiplying the input matrix $I$ by another matrix $W^q$, where $W^q$ is a parameter of the Network. $I$ times $W^q$ gives $Q$, and the four columns of $Q$ are $\boldsymbol q^1$ to $\boldsymbol q^4$.

How are $\boldsymbol k$ and $\boldsymbol v$ generated? The operation is exactly the same as for $\boldsymbol q$, as shown in Figure 1-17:

Figure 1-17 Generating k and v from the matrix perspective

So how does each $\boldsymbol a$ get its $\boldsymbol q$, $\boldsymbol k$, $\boldsymbol v$? You simply multiply the input vector set by three different matrices, and you obtain $\boldsymbol q$, $\boldsymbol k$ and $\boldsymbol v$.

Next, every $\boldsymbol q$ computes an inner product with every $\boldsymbol k$ to get the attention scores. From the matrix point of view, what does computing the attention scores look like?

For example $\alpha_{1,i} = (\boldsymbol k^{\boldsymbol i})^T \boldsymbol q^{\boldsymbol 1}$: we can stack $\boldsymbol k^1$ to $\boldsymbol k^4$ as the four rows of a matrix, multiply this matrix by $\boldsymbol q^1$, and obtain another vector whose entries are the attention scores $\alpha_{1,1}$ to $\alpha_{1,4}$.

Figure 1-18 Attention scores from the matrix perspective (1)

Not only does $\boldsymbol q^1$ compute attention scores against $\boldsymbol k^1$–$\boldsymbol k^4$; $\boldsymbol q^2$, $\boldsymbol q^3$, $\boldsymbol q^4$ also compute their attention scores following the same procedure.

Figure 1-19 Attention scores from the matrix perspective (2)

So how are these attention scores computed? You can view it as the multiplication of two matrices $K$ and $Q$: one matrix whose rows are $\boldsymbol k^1$ to $\boldsymbol k^4$, and another whose columns are $\boldsymbol q^1$ to $\boldsymbol q^4$. Their product is the matrix $A$, and $A$ stores the attention scores between $Q$ and $K$.

Next we normalize the attention scores by applying softmax to each column so that each column sums to 1. As mentioned before, softmax is not the only option; you are free to choose other operations, such as ReLU, and the results are not necessarily worse.

Figure 1-20 Attention scores from the matrix perspective (3)

Next we compute $\boldsymbol b$ from the attention score matrix $A'$. How is $\boldsymbol b$ computed? You stack $\boldsymbol v^1$ to $\boldsymbol v^4$ as the four columns of a matrix $V$, then multiply $V$ by $A'$ to get the final output matrix $O$.

Figure 1-21 Producing the output b from the matrix perspective

Every column of the matrix $O$ is an output of self-attention, i.e., $\boldsymbol b^1$ to $\boldsymbol b^4$.

So the whole self-attention operation first produces $\boldsymbol q$, $\boldsymbol k$, $\boldsymbol v$, then uses $\boldsymbol q$ to find the relevant positions, and then takes a weighted sum over $\boldsymbol v$; this series of operations is nothing but a sequence of matrix multiplications.

Figure 1-22 The whole self-attention process from the matrix perspective

To summarize: $I$ is the input of self-attention, a vector set stacked as the columns of the matrix $I$. This input is multiplied by three matrices $W^q$, $W^k$, $W^v$ to get the three matrices $Q$, $K$, $V$. Next $Q$ and $K^T$ are multiplied to get the matrix $A$; after some processing $A$ becomes $A'$, which we call the Attention Matrix. Finally $A'$ is multiplied by $V$ to get $O$, and $O$ is the output of the self-attention layer. So self-attention takes $I$ as input and produces $O$ as output, and the only parameters that need to be learned in self-attention are $W^q$, $W^k$, $W^v$, which are found from the training data.
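As a quick sanity check, here is a minimal sketch of this matrix-form computation in PyTorch, using the row convention (each row of I is one input vector); all dimensions and weights are made-up illustrations:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_in, d_k, d_v = 4, 8, 6, 6        # 4 input vectors, toy dimensions (assumed)
I = torch.randn(n, d_in)              # a^1..a^4 stacked as rows

W_q = torch.randn(d_in, d_k)
W_k = torch.randn(d_in, d_k)
W_v = torch.randn(d_in, d_v)

Q, K, V = I @ W_q, I @ W_k, I @ W_v   # project the inputs
A = Q @ K.T                           # attention scores (row i holds the scores for query i)
A_prime = F.softmax(A, dim=-1)        # normalize each row
O = A_prime @ V                       # weighted sum of the v vectors
print(O.shape)                        # torch.Size([4, 6]): one output b per input a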

1.4 Multi-head Self-attention

Self-attention has a more advanced version called Multi-head Self-attention, which is very widely used today. In machine translation and speech recognition, using more heads often gives better results; how many heads to use is another hyperparameter you have to tune yourself.

So why do we need more heads? In self-attention we use $\boldsymbol q$ to find the relevant $\boldsymbol k$, but relevance comes in many different forms with many different definitions. So perhaps a single $\boldsymbol q$ is not enough: we should have multiple $\boldsymbol q$, with different $\boldsymbol q$ responsible for different kinds of relevance.

To do Multi-head Self-attention, you first multiply $\boldsymbol a$ by a matrix to get $\boldsymbol q$, then multiply $\boldsymbol q$ by two other matrices to get $\boldsymbol q^{i,1}$ and $\boldsymbol q^{i,2}$, where $i$ denotes the position and 1, 2 index the $\boldsymbol q$ at that position.

Figure 1-23 2 heads self-attention (1)

Figure 1-23 shows the case with two heads: we believe there are two different kinds of relevance in this problem, so we generate two different heads to capture them. Since there are two $\boldsymbol q$, the corresponding $\boldsymbol k$ and $\boldsymbol v$ also come in pairs; doing the same at the other position, you get two $\boldsymbol q$, two $\boldsymbol k$ and two $\boldsymbol v$.

Figure 1-24 2 heads self-attention (2)

So how is self-attention done now? Exactly the same as before, except that everything of type 1 is processed together and everything of type 2 is processed together. That is, when $\boldsymbol q^{i,1}$ computes its attention scores it no longer cares about the type-2 keys; it only deals with the type-1 keys.

Figure 1-25 2 heads self-attention (3)

So $\boldsymbol q^{i,1}$ computes attention with $\boldsymbol k^{i,1}$, and $\boldsymbol q^{i,1}$ computes attention with $\boldsymbol k^{j,1}$, giving the attention scores. When computing the weighted sum, likewise ignore the type-2 values and only look at $\boldsymbol v^{i,1}$ and $\boldsymbol v^{j,1}$: multiply the attention scores by $\boldsymbol v^{i,1}$ and by $\boldsymbol v^{j,1}$ and sum them, which gives $\boldsymbol b^{i,1}$, as shown in Figure 1-25

This used only one of the heads; you can do exactly the same thing with the other head to compute $\boldsymbol b^{i,2}$, as shown in Figure 1-26

Figure 1-26 2 heads self-attention (4)

If you have 8 heads or 16 heads the operation is the same; here two heads were used as the example.

Next you may concatenate $\boldsymbol b^{i,1}$ and $\boldsymbol b^{i,2}$ and pass them through a transform to get $\boldsymbol b^i$, which is sent to the next layer.

Figure 1-27 2 heads self-attention (5)

This is Multi-head attention, a variant of self-attention
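As a rough sketch of the two-head computation described above (all shapes and weight matrices are made-up illustrations, not the lecture's exact setup), one common implementation splits the projected q, k, v into per-head chunks and runs the same attention on each chunk:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = torch.randn(n, d_model)                     # inputs a^1..a^4 as rows
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
W_o = torch.randn(d_model, d_model)             # final transform after concatenation

# project, then split the last dimension into (heads, d_head)
q = (x @ W_q).view(n, n_heads, d_head).transpose(0, 1)   # (heads, n, d_head)
k = (x @ W_k).view(n, n_heads, d_head).transpose(0, 1)
v = (x @ W_v).view(n, n_heads, d_head).transpose(0, 1)

scores = q @ k.transpose(-2, -1)                 # each head attends independently
attn = F.softmax(scores, dim=-1)
heads = attn @ v                                 # (heads, n, d_head)

b = heads.transpose(0, 1).reshape(n, d_model) @ W_o   # concat the heads, then transform
print(b.shape)                                   # torch.Size([4, 8])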

1.5 Positional Encoding

Up to this point, you may notice that this self-attention layer is missing one piece of possibly important information: position. For self-attention, whether an input appears at the very front of the sequence or at the very end makes no difference; it simply has no such information.

For self-attention, positions 1 and 4 are not particularly far apart, and positions 2 and 3 are not particularly close. All positions are at the same distance from each other: no position is farther away and no position is nearer.

But such a design may be problematic, because positional information is sometimes important. For example, in POS tagging, you may know that a verb is unlikely to appear at the beginning of a sentence, so if a word sits at the beginning, the probability that it is a verb is lower. Might this kind of positional information often be useful?

The self-attention operation described so far contains no positional information at all, so if you think position matters for your task, you can inject the positional information into it.

How do we inject positional information? Here we use a technique called positional encoding. You set a vector for each position, called the positional vector and denoted $\boldsymbol e^i$, where the superscript $i$ indicates the position. Each different position has a different vector: $\boldsymbol e^1$ is a vector, $\boldsymbol e^2$ is a vector, $\boldsymbol e^{128}$ is a vector. Every position has its own exclusive $\boldsymbol e$, and you simply add this $\boldsymbol e$ to $\boldsymbol a^i$, and that is it.

Figure 1-28 Positional encoding

This effectively tells self-attention about positions: if it sees that $\boldsymbol a^i$ has $\boldsymbol e^i$ added to it, it knows the current input appears at position $i$. What does $\boldsymbol e^i$ look like? In the earliest Transformer, i.e., the Attention is all you need paper, the $\boldsymbol e^i$ used is shown in Figure 1-29

Figure 1-29 The positional vectors e^i

Each column represents one $\boldsymbol e$: the first position is $\boldsymbol e^1$, the second position is $\boldsymbol e^2$, and so on. Every position has its own exclusive $\boldsymbol e$, and the hope is that by giving each position a different $\boldsymbol e$, the model can know the positional information of the current input while processing it.

Such a positional vector is hand-crafted, i.e., set by a human, which raises problems. Suppose I only define vectors up to position 128; what if the sequence length turns out to be 129? This particular problem does not arise in Attention is all you need, because there the vectors are generated by a rule, a rather magical sin-cos function. But you do not have to generate them this way; positional encoding is still an open research problem, you can invent new methods, and the encoding can even be learned from data.
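A minimal sketch of the idea of adding a positional vector to each input; here a learned embedding table is used instead of the sin-cos rule, and all names and sizes are illustrative:

import torch
import torch.nn as nn

max_len, d_model = 128, 8                    # assumed maximum length and feature size
pos_table = nn.Embedding(max_len, d_model)   # one learnable e^i per position

a = torch.randn(4, d_model)                  # a^1..a^4 as rows
positions = torch.arange(a.size(0))          # 0, 1, 2, 3
a_with_pos = a + pos_table(positions)        # a^i + e^i, then fed to self-attention
print(a_with_pos.shape)                      # torch.Size([4, 8])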

1.6 Applications

Self-attention is of course used very widely, for example in the Transformer, and in NLP there is BERT, which also uses self-attention, so the NLP applications of self-attention are familiar to everyone

But self-attention is not limited to NLP; it can also be used on many other problems such as speech recognition and image tasks, for example Truncated Self-attention, Self-Attention GAN and DEtection Transformer (DETR)

Next, let's compare the differences and connections between Self-attention and CNN

If we use self-attention to process an image, it means that for the pixel being processed, that pixel produces a query and the other pixels produce keys. When you take the inner products, you consider not a small region but the information of the whole image.

Figure 1-30 Self-attention on images

When doing CNN we draw a receptive field, and each filter only considers the information inside that receptive field. So comparing CNN with self-attention, we can say that CNN is a simplified self-attention: in a CNN we only consider the information inside the receptive field, while in self-attention we consider the whole image. So CNN is a simplified version of self-attention

Or, put the other way around, self-attention is a more complex CNN. In a CNN we must define the receptive field, each filter only considers the information inside it, and the range and size of the receptive field are decided by humans. With self-attention, we use attention to find the relevant pixels, as if the receptive field were learned automatically: the network decides the shape of the receptive field and, centered on a given pixel, which pixels actually need to be considered and which are relevant. So the extent of the receptive field is no longer drawn by hand but learned by the machine itself.

More details: On the Relationship between Self-Attention and Convolutional Layers

The paper above shows rigorously, in a mathematical way, that CNN is in fact a special case of self-attention: with appropriately chosen parameters, self-attention can do exactly what CNN does. So self-attention is a more flexible CNN, and CNN is self-attention with restrictions; with certain designs and restrictions, self-attention becomes CNN

Since CNN is a subset of self-attention and self-attention is more flexible, the more flexible model of course needs more data: if your data is insufficient it may overfit, whereas a smaller, more restricted model suits the small-data regime and is less likely to overfit

Next, let's compare Self-attention with RNN

Like self-attention, an RNN handles the case where the input is a sequence; its outputs are also vectors, and the row of output vectors is likewise passed to a Fully-Connected Network for further processing

Figure 1-31 Self-attention vs. RNN (1)

So how do self-attention and RNN differ? One very obvious difference is that self-attention considers the vectors of the whole sequence, whereas an RNN only considers the vectors already fed in on the left, not those on the right. However, an RNN can also be bidirectional, and a bidirectional RNN can indeed be seen as considering the whole sequence as well

Figure 1-32 Self-attention vs. RNN (2)

But if we compare the output of an RNN with the output of self-attention, there is still a difference even with a bidirectional RNN, as shown in Figure 1-32. For the RNN, if the rightmost vector must take the leftmost input into account, the leftmost input has to be stored in memory and carried along, without being forgotten, all the way to the right before it can be considered at the last time step. Self-attention has no such problem: the leftmost position only needs to emit a query and the rightmost position a key, and once they match the information can be extracted easily. This is one way RNN and self-attention differ.

Figure 1-33 Self-attention vs. RNN (3)

Another, more important difference is that an RNN cannot be parallelized when it maps an input sequence to an output sequence: you must produce the previous vector before you can produce the next one; it cannot process all the outputs in one parallel pass. Self-attention has the advantage that it can produce all its outputs in parallel: when you feed in a row of vectors, the output vectors are produced simultaneously, without waiting for any particular one to finish first. So in terms of computation speed, self-attention is more efficient than RNN.

Many applications are now gradually replacing RNN architectures with self-attention architectures

More details: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Self-attention has many variants. Its computation cost is very large, and how to reduce that cost is a key direction going forward. Self-attention was first used in the Transformer, and when people talk about the Transformer they often actually mean this self-attention; in a broad sense, Transformer refers to self-attention, which is why the later self-attention variants are all called xxxformer, such as Linformer, Performer, Reformer, and so on. Which kind of self-attention is both fast and good is still an unresolved problem

More details: Long Range Arena: A Benchmark for Efficient Transformers Efficient Transformers: A Survey

2. Transformer

After talking about Self-attention, we can finally talk about Transformer

The following content comes from the video explanation of Transformer by Mr. Li Hongyi

Video link: [Machine Learning 2021] Transformer

2.1 Prerequisite knowledge

What are Transformers? Transformer is a Sequence-to-sequence (Seq2seq) model , so what is the Seq2seq model?

When we discussed self-attention earlier: if your input is a sequence, there are several possibilities for the output. One is that the input and output have the same length; another is that a single item is output; and the remaining case is the one the Seq2seq model has to solve, where we do not know how long the output is and the model decides the output length by itself

So which applications need this kind of Seq2seq model, where the input is a sequence and the output is also a sequence, but we do not know the output length and the model must decide it by itself? See Figure 2-1

Figure 2-1 Applications of the Seq2seq model

A good example is speech recognition: the input is an audio signal and the output is the corresponding text, and of course we cannot infer the output length from the length of the input speech. What then? The model decides by itself: it listens to a piece of speech and decides how many characters to output. Another application is machine translation, where the model reads a sentence in one language and outputs a sentence in another language. There are even more complex applications such as speech translation: given a piece of speech saying machine learning, the output is not English text; the model directly translates the speech signal it hears into Chinese

Seq2seq models can also be used for text. For example, you can train a chatbot with a Seq2seq model, which requires collecting a large amount of dialogue training data. All kinds of NLP problems can be viewed as Question Answering (QA) problems, and QA problems can be solved with a Seq2seq model. It must be emphasized, though, that for most NLP tasks and most speech-related tasks, you usually get better results with models customized for those tasks

The Seq2seq model is a very powerful and very useful model; we will now learn how seq2seq is done. A typical Seq2seq model is divided into two parts, an Encoder and a Decoder: your input sequence is processed by the Encoder, the processed result is handed to the Decoder, and the Decoder decides what sequence to output. Later we will elaborate on the internal architecture of the Encoder and the Decoder


Figure 2-2 Seq2seq model composition

2.2 Encoder

Next, let's talk about the Encoder. What the Encoder of a Seq2seq model does is map one row of vectors to another row of vectors. Many models can do this, such as self-attention, RNN and CNN; the Encoder in the Transformer uses self-attention

Figure 2-3 Encoder of Seq2seq model

We illustrate the Encoder with the simplified picture in Figure 2-4. The Encoder is divided into many, many blocks, each of which takes a row of vectors as input and outputs a row of vectors. Each block is not a single layer of the neural network, but several layers working together. What each block does inside the Transformer's Encoder is shown in Figure 2-4


Figure 2-4 Encoder of simplified Transformer

After the sequence is input, a self-attention layer considers the information of the entire sequence and outputs another row of vectors, which are then thrown into a Fully-Connected feed-forward network to output yet another row of vectors, and this row of vectors is the output of one block.

What happens in the original Transformer is more complicated. As shown in Figure 2-5, the Transformer adds an extra design: we do not simply take the vector output by self-attention; we add its input to that vector to get the new output. This kind of network architecture is called a residual connection. After the residual structure, one more thing is done, namely normalization, and what is used here is not batch normalization but layer normalization

Figure 2-5 Encoder of the original Transformer

Layer normalization is simpler than batch normalization: it does not need to consider batch information. It takes one vector in and outputs one vector: it computes the mean and standard deviation over the different dimensions of the same sample's features, and then normalizes.

The normalized output is then fed into the FC network, which also has a residual connection around it. Finally, the residual result goes through layer normalization once more, and that is the output of one Encoder block in the Transformer.
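A minimal sketch of one Encoder block as just described (the post-norm arrangement of the original paper); the attention and feed-forward modules come from PyTorch and all names and sizes are illustrative assumptions:

import torch
import torch.nn as nn

class PostNormEncoderBlock(nn.Module):
    # residual + layer norm around self-attention, then around the feed-forward net
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention over the whole sequence
        x = self.norm1(x + attn_out)          # residual connection, then layer norm
        x = self.norm2(x + self.ffn(x))       # the same pattern around the FC network
        return x

block = PostNormEncoderBlock()
print(block(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])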

The Encoder in the Transformer paper is shown in Figure 2-6; the whole structure is exactly the part we just explained. Positional Encoding is added to the input to provide position information, Multi-Head Attention is the self-attention block, and Add & Norm is the Residual plus Layer norm. This block is also used in BERT: BERT is the Transformer Encoder


Figure 2-6 Encoder of Transformer paper

Having learned this, you may still have many questions: why is the Encoder designed this way? Couldn't it be designed differently?

OK!!! It does not have to be designed like this. This Encoder network architecture follows the original paper, but the original design is not necessarily the best

To Learn more…

2.3 AT Decoder

There are two kinds of Decoder: the Autoregressive Decoder and the Non-Autoregressive Decoder

How does the Autoregressive Decoder work? Take speech recognition as the example. The speech recognition pipeline is shown in Figure 2-7: speech recognition means taking a piece of audio as input and outputting a piece of text; the input audio signal goes through the Encoder, which outputs a row of vectors, and these are then sent into the Decoder to produce the recognition result.

Figure 2-7 Speech recognition pipeline (1)

So how does the Decoder produce this recognition result? First you give it a special token that represents the start. After the Decoder consumes START it spits out a vector whose length equals the Vocabulary Size; every row of this vector, i.e., every Chinese character, corresponds to a value. After a softmax these values form a distribution summing to 1, and the character with the highest score is the output.

Next you take the Decoder's first output as a new input to the Decoder: besides the special START token, that output is also fed in, and from these two inputs the Decoder produces another output. You keep feeding each new output back in; from the three inputs the Decoder produces the next output, and so on, until we finally obtain the recognition result 机器学习, as shown in Figure 2-8

Figure 2-8 Speech recognition pipeline (2)
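A minimal sketch of this autoregressive loop with a dummy decoder standing in for the real model; the token ids, vocabulary size and stand-in decoder are all assumptions for illustration (the END token used to stop is introduced a little further below):

import torch

vocab_size, start_id, end_id = 1000, 0, 1       # assumed special token ids

def dummy_decoder(encoder_out, tokens):
    # stand-in for the real Decoder: returns one score per vocabulary entry
    return torch.randn(vocab_size)

def greedy_decode(encoder_out, max_len=20):
    tokens = [start_id]                          # start with the START token
    for _ in range(max_len):
        logits = dummy_decoder(encoder_out, torch.tensor(tokens))
        next_id = int(logits.softmax(dim=-1).argmax())   # pick the highest-scoring token
        if next_id == end_id:                    # stop when END is produced
            break
        tokens.append(next_id)                   # feed the output back as the next input
    return tokens[1:]

print(greedy_decode(encoder_out=None))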

There is a key point here, marked with the red dashed line: the input the Decoder sees is its own output from the previous time step, i.e., the Decoder treats its own output as its next input. So when our Decoder generates a sentence, it may actually see something wrong; for example, it may misrecognize a character, mistaking the 器 of 机器 for the 气 of 天气. The Decoder then sees a wrong recognition result and still has to find a way to produce the expected correct output from it

Figure 2-9 Decoder error identification

Will it cause problems if the Decoder produces a wrong output and that wrong output is then consumed by the Decoder itself? Could it cause Error Propagation? It could, so when training the model we need to feed the Decoder some wrong things as well; this will be mentioned in the training part

Then let's look at the internal structure of the Decoder, as shown in Figure 2-10

Figure 2-10 Decoder internal structure

Let's compare the Encoder and the Decoder. If we cover up the middle part, there is not much difference between them. One slight difference is that on the Decoder side the Multi-Head Attention block also gets a Masked prefix


Figure 2-11 Encoder vs Decoder

In the original self-attention, a row of vectors is input and another row of vectors is output, and every output vector gets to see the complete input before it is produced. So the output $\boldsymbol b^1$ is in fact based on all the information of $\boldsymbol a^1$ to $\boldsymbol a^4$.

What changes when we go from self-attention to Masked self-attention? The difference is that we can no longer look at the right-hand part: when producing $\boldsymbol b^1$ we may only consider the information of $\boldsymbol a^1$ and no longer $\boldsymbol a^2$, $\boldsymbol a^3$, $\boldsymbol a^4$; likewise, when producing $\boldsymbol b^2$ we may only consider the information of $\boldsymbol a^1$, $\boldsymbol a^2$ and no longer $\boldsymbol a^3$, $\boldsymbol a^4$; and so on. This is Masked self-attention, as shown in Figure 2-12

Figure 2-12 Self-attention vs. Masked Self-attention

To be more concrete, when we want to produce $\boldsymbol b^2$, we only take the Query at the second position and compute Attention with the Keys of the first and second positions, ignoring the third and fourth positions and not computing Attention with them


Figure 2-13 Masked Self-attention instance

So why do we need to add the Mask? This is actually very intuitive. Recall that the Decoder operates by outputting tokens one by one: its outputs are generated one after another, so there is first $\boldsymbol a^1$, then $\boldsymbol a^2$, then $\boldsymbol a^3$, then $\boldsymbol a^4$. This is unlike the original self-attention, where $\boldsymbol a^1$ to $\boldsymbol a^4$ are fed into your model all at once; when we discussed the Encoder, it reads $\boldsymbol a^1$ to $\boldsymbol a^4$ in one go. For the Decoder, however, $\boldsymbol a^1$ exists before $\boldsymbol a^2$, which exists before $\boldsymbol a^3$, which exists before $\boldsymbol a^4$. So when you have $\boldsymbol a^2$ and want to compute $\boldsymbol b^2$, there is no $\boldsymbol a^3$ or $\boldsymbol a^4$ yet, and there is no way to take them into account. That is why the original Transformer paper emphasizes that the Decoder's self-attention is a Self-Attention with Mask
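A minimal sketch of how such a mask is typically applied to the attention scores (a causal, lower-triangular mask; shapes and values are illustrative):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_k = 4, 8
Q, K = torch.randn(n, d_k), torch.randn(n, d_k)

scores = Q @ K.T / d_k ** 0.5                              # raw attention scores
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))    # position i may see positions <= i
scores = scores.masked_fill(~causal, float("-inf"))        # block the future positions
attn = F.softmax(scores, dim=-1)                           # masked entries become probability 0
print(attn)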

OK, so far we have covered how the Decoder operates, but there is still a very critical issue: the Decoder has to decide the length of the output Sequence by itself, yet what that length should be, we do not know

We expect the model to learn by itself, when given an Input Sequence, how long the output sequence should be. But under the Decoder's current operating mechanism the model does not know when it should stop. So how do we make the model stop?

We need to prepare a special end symbol so that the Decoder can output this end symbol, as shown in Figure 2-14

Figure 2-14 Decoder adds End

We expect that after the Decoder has generated the whole sequence, and that sequence is fed back as the Decoder's input, the Decoder will output END, as shown in Figure 2-15; at that point the whole process of the Decoder generating the Sequence is finished


Figure 2-15 Decoder adds End (2)

The above is the way Autoregressive(AT) Decoder works

2.4 NAT Decoder

Next, let's briefly talk about how the Decoder of the Non-Autoregressive (NAT) model operates. Unlike the AT Decoder, which generates one token at a time, the NAT Decoder generates the whole sentence at once. How can a whole sentence be generated in one go? The NAT Decoder may consume a whole row of START Tokens and produce a whole row of Tokens in one shot, and that is it: a single step is enough to generate a sentence

Figure 2-16 AT vs NAT

You may ask a question here: didn't we just say that we do not know the length of the output sequence? Then how do we know how many START tokens to feed the NAT Decoder? There are several methods. The first is to train a Classifier that eats the Encoder's input and outputs a number representing the length of the Decoder's output; that is one possible approach.

Another possible approach is simply to give it a big bunch of START Tokens regardless, then look at where the special END symbol appears in the output, treat everything to the right of END as if it were not output, and that is it.

The advantages of the NAT Decoder are of course parallelization and speed, plus better control over the output length. Take speech synthesis: if you suddenly want your system to speak faster, you can divide the Classifier's output by two to shorten the NAT Decoder's output. Speech synthesis can indeed be done with a Seq2seq model; the best known is a model called Tacotron, which uses an AT Decoder, and there is another model called FastSpeech, which uses an NAT Decoder

The NAT Decoder is currently a hot research topic. Although it seems to have all sorts of advantages, especially parallelization, its performance is often worse than that of the AT Decoder. Why? There is a problem called Multi-Modality, which we will not go into here

2.5 Cross attention

Next we discuss how the Encoder and the Decoder communicate, i.e., the part we covered up earlier. It is called Cross attention, and it is the bridge connecting the Encoder and the Decoder, as shown in Figure 2-17

Figure 2-17 Cross attention in the Transformer

You will notice that two of this module's inputs come from the Encoder's output. So how does this module actually work? Let's walk through it with Figure 2-18

Figure 2-18 How Cross attention operates

First the Encoder takes a row of vectors in and outputs a row of vectors, which we call $\boldsymbol a^1$, $\boldsymbol a^2$, $\boldsymbol a^3$. The Decoder consumes START and obtains a vector through self-attention; this vector is then transformed, i.e., multiplied by a matrix, to get a Query called $\boldsymbol q$. Then $\boldsymbol a^1$, $\boldsymbol a^2$, $\boldsymbol a^3$ all generate Keys, giving $\boldsymbol k^1$, $\boldsymbol k^2$, $\boldsymbol k^3$, and $\boldsymbol q$ and $\boldsymbol k$ compute Attention scores to get $\alpha'_1$, $\alpha'_2$, $\alpha'_3$ (you may of course also apply a softmax to normalize them a little). Next $\alpha'_1$, $\alpha'_2$, $\alpha'_3$ are multiplied by $\boldsymbol v^1$, $\boldsymbol v^2$, $\boldsymbol v^3$ and summed as a weighted sum to get $\boldsymbol v$, which is then thrown into the Fully-Connected Network for the subsequent processing. In this step $\boldsymbol q$ comes from the Decoder while $\boldsymbol k$ and $\boldsymbol v$ come from the Encoder, and this step is called Cross Attention. So the Decoder produces a $\boldsymbol q$ and uses it to extract information from the Encoder as its subsequent input. That is how Cross Attention operates
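A minimal sketch of that step, with the query coming from the Decoder side and the keys and values from the Encoder side (all shapes and weights are illustrative):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 8
enc_out = torch.randn(3, d_model)          # a^1, a^2, a^3 from the Encoder
dec_vec = torch.randn(1, d_model)          # Decoder vector after masked self-attention

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q = dec_vec @ W_q                          # query comes from the Decoder
k = enc_out @ W_k                          # keys come from the Encoder
v = enc_out @ W_v                          # values come from the Encoder

alpha = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)   # attention scores α'
out = alpha @ v                            # weighted sum, passed on to the FC network
print(out.shape)                           # torch.Size([1, 8])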

2.6 Training

We have covered the Encoder, the Decoder, and how they interact; now let's briefly talk about training

Suppose we want to do a speech recognition task. First you need a large amount of training data, then you train a Seq2seq model so that its final output is as close as possible to the ground-truth label, i.e., the difference between the two distributions is as small as possible. Concretely, you compute the cross entropy between your model's predictions and the Ground truth and minimize that cross entropy, which of course is very similar to a classification task

One thing deserves our attention here: what is the Decoder's input during training? The Decoder's input is the Ground Truth, the correct answer!!! During training we show the Decoder the correct answers; this is called Teacher Forcing: using the ground truth as input. At inference time, however, the Decoder has no correct answers to look at and may see wrong things, so there is a Mismatch between the two

During training the Decoder only ever sees correct answers, but at test time it may see wrong things. This inconsistency is called Exposure Bias, and it means that once the Decoder sees something wrong it keeps producing errors, because during training all it ever saw were correct answers; so at inference time one wrong step leads to every following step being wrong. How do we address this? One direction worth considering is to add some errors to the Decoder's input. Yes, it is exactly that intuitive, and this trick is called Scheduled Sampling
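A minimal sketch of the teacher-forcing loss described above, with a stand-in decoder and made-up names: the decoder is fed the ground-truth tokens shifted right, and cross entropy is computed against the ground truth.

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, start_id = 100, 16, 0

embed = nn.Embedding(vocab_size, d_model)
decoder = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for the real Decoder
to_vocab = nn.Linear(d_model, vocab_size)

gt = torch.randint(1, vocab_size, (1, 5))               # ground-truth token ids
start = torch.full((1, 1), start_id)
dec_in = torch.cat([start, gt[:, :-1]], dim=1)          # teacher forcing: shifted ground truth
                                                        # (Scheduled Sampling would randomly swap
                                                        #  some of these for model predictions)
hidden, _ = decoder(embed(dec_in))
logits = to_vocab(hidden)                               # (1, 5, vocab_size)

loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), gt.view(-1))
print(loss.item())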

Next, a few small training Tips. The first is the Copy Mechanism, which the Decoder may need. For example, when building a chat-bot, copying from the dialogue is a necessary technique and ability: if you tell the machine 你好,我是周童鞋, it might reply 周童鞋你好,很高兴认识你. The machine does not need to create the text 周童鞋 from scratch; what it should learn is to Copy some words from the user's input into its output. Summarization is another case where you may need this Copy skill: summarization means training a model to read an article and produce its summary, and for that task, copying some information out of the article is probably a key ability

There are other tricks such as Guided Attention and Beam Search, which we will not go into one by one

Good, that wraps up the Transformer and its various training techniques

3. Hands-on practice

Transformer hands-on: a handwritten, from-scratch reproduction

Reference blog: The Annotated Transformer

Reference code: https://github.com/harvardnlp/annotated-transformer/

The Annotated Transformer presents an annotated version of the original Transformer paper in the form of a line-by-line implementation; reading the original post is strongly recommended!!!

For more details see the official PyTorch implementation: the Pytorch Transformer source code and the Pytorch Transformer API

3.1 Prelims

Import the necessary packages and libraries

import torch
import torch.nn as nn
from torch.nn.functional import log_softmax
import copy
import math
import time

3.2 Model Architecture


Figure 3-1 Transformer overall architecture

The overall structure of the Transformer (Figure 3-1) should already be clear from Li Hongyi's explanation above. In simple terms, the Transformer is composed of an Encoder and a Decoder: the Encoder maps an input sequence $\boldsymbol x = (x_1,...,x_n)$ to another sequence $\boldsymbol z = (z_1,...,z_n)$; given $\boldsymbol z$, the Decoder produces an output sequence $\boldsymbol y = (y_1,...,y_m)$. It is worth noting that when the Decoder outputs the next token, it takes the token output at the previous step as an additional input

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and other models
    """
    
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process the masked src and target sequences"
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

class Generator(nn.Module):
    "Define the standard linear + softmax generation step"
    
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

As a whole, Transformer = Encoder + Decoder + Generator (Linear + log_softmax), i.e., it consists of these three parts

3.2.1 Encoder

The Encoder is a stack of $N$ identical layers ($N = 6$)

def clones(module, N):
    "Produce N identical layers"
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    "The Encoder is a stack of N layers"
    
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm   = LayerNorm(layer.size)
    
    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn"
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

Next we need to add a residual connection around each of the two sub-layers, followed by layer normalization

class LayerNorm(nn.Module):
    "Construct a LayerNorm module"

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std  = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

The output of each sub-layer is $\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout to the output of each sub-layer, then add the sub-layer's input and normalize

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\mathrm{model}} = 512$

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm
    (note that for simplicity the norm is applied to the sub-layer input first)
    """
    
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm    = LayerNorm(size)
        self.dropout =  nn.Dropout(dropout)
    
    def forward(self, x, sublayer):
        "Apply a residual connection to any sublayer with the same size"
        return x + self.dropout(sublayer(self.norm(x)))

Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network

class EncoderLayer(nn.Module):
    "An Encoder layer is made up of self-attention and feed forward"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn    = self_attn
        self.feed_forward = feed_forward
        self.sublayer     = clones(SublayerConnection(size, dropout), 2)
        self.size         = size

    def forward(self, x, mask):
        "Follow the left-hand connections in Figure 3-1"
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

Figure 3-2 shows how the parts of the code correspond to the Encoder diagram

Figure 3-2 Encoder code correspondence

3.2.2 Decoder

The Decoder is also a stack of $N$ identical layers ($N = 6$)

class Decoder(nn.Module):
    "Generic N-layer decoder with masking"

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm   = LayerNorm(layer.size)
    
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In addition to the two sub-layers of the Encoder, the Decoder inserts a third sub-layer, which is the bridge between Encoder and Decoder we discussed earlier, Cross-attention. It is also a multi-head attention, but it consumes the Encoder's output: the Decoder produces $\boldsymbol q$ and extracts information through the Encoder's $\boldsymbol k$ and $\boldsymbol v$. As with the Encoder, each sub-layer of the Decoder also gets a residual connection and layer normalization

class DecoderLayer(nn.Module):
    "A Decoder layer is made up of self-attention, cross-attention and feed forward"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size         = size
        self.self_attn    = self_attn
        self.src_attn     = src_attn
        self.feed_forward = feed_forward
        self.sublayer     = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow the right-hand connections in Figure 3-1"
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

The mask created for Masked Multi-Head Self-attention ensures that the Decoder only sees the outputs of previous time steps and does not take later time steps into account

def subsequent_mask(size):
    "Mask out subsequent positions"
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return subsequent_mask == 0
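For instance, a quick check of the mask for a length-4 sequence (each row is a target position; True marks the positions it may attend to):

print(subsequent_mask(4))
# tensor([[[ True, False, False, False],
#          [ True,  True, False, False],
#          [ True,  True,  True, False],
#          [ True,  True,  True,  True]]])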

The attention mask below shows, for each tgt word (row), the positions (columns) it is allowed to look at, ensuring that during training the Decoder cannot attend to future words

Figure 3-3 mask visualization

Figure 3-4 shows how the parts of the code correspond to the Decoder diagram


Figure 3-4 Decoder code corresponding diagram

3.2.3 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key

We call this particular attention Scaled Dot-Product Attention. Its input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax to obtain the weights on the values, as shown in Figure 3-5

Figure 3-5 Scaled Dot-Product Attention

The above is exactly the self-attention computation. Let's briefly revisit Li Hongyi's matrix-based view of the whole self-attention process (see Section 1.3 for details)

First we stack the inputs into a matrix $I$ and multiply it by the three different matrices $W^q$, $W^k$, $W^v$ to get $Q$, $K$, $V$, as shown in Figure 3-6

Figure 3-6 Producing q, k, v from the matrix perspective

Next, every $\boldsymbol q$ computes a dot product with every $\boldsymbol k$ to get the attention scores. From the matrix perspective this is the product of the matrices $K$ and $Q$: the rows of the $K$ matrix are $\boldsymbol k^1$ to $\boldsymbol k^4$ and the columns of the $Q$ matrix are $\boldsymbol q^1$ to $\boldsymbol q^4$; the product is the matrix $A$, and $A$ stores the attention scores between $Q$ and $K$

Then we normalize the attention scores, applying softmax to each column so that each column sums to 1, as shown in Figure 3-7

Figure 3-7 Attention scores from the matrix perspective

Next we need to compute the output from the attention score matrix $A'$. How? You simply multiply $V$ by $A'$, as shown in Figure 3-8

Figure 3-8 Producing the output from the matrix perspective

In practical implementations, the output matrix we compute is:
$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k    = query.size(-1) # query dimension
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # fill masked positions with a large negative number so that softmax
        # turns their scores into probabilities close to zero
        scores = scores.masked_fill(mask == 0, -1e9) 
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

The two most common ways to compute Attention are Dot-product and Additive, which we already discussed earlier (see Section 1.2), so we won't repeat them here. The dot-product algorithm is the same as the one described before except for the scaling factor $\frac{1}{\sqrt{d_k}}$. We use dot-product attention: although the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, because it can be implemented with highly optimized matrix multiplication code

Next, a brief word on why the dot product needs to be scaled in practice

First, when $d_k$ is small, the dot-product and additive algorithms perform similarly, because a small $d_k$ does not push the dot product to too large or too small a magnitude, and the additive algorithm needs no extra scaling

However, when $d_k$ is large, the dot products grow large in magnitude. Why? Assume the components of $\boldsymbol q$ and $\boldsymbol k$ are independent random variables with mean 0 and variance 1. Their dot product $q \cdot k =\sum_{i=1}^{d_k}q_ik_i$ then has mean 0 and variance $d_k$, so as $d_k$ increases, the magnitude of the dot product also increases

When the dot products become large in magnitude, the softmax function is pushed into regions where the resulting attention scores have extremely small gradients, even close to zero. This makes gradients vanish and the model hard to train

To counteract the effect of these large dot products, we scale them by $\frac{1}{\sqrt{d_k}}$. The scaling keeps the magnitude of the dot products under control, keeping the softmax in a region with sizable gradients and avoiding extremely small gradients
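A quick numerical illustration of this point (purely a sanity check with random vectors):

import torch

torch.manual_seed(0)
for d_k in (4, 64, 512):
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.var().item())        # the variance grows roughly like d_k

# with large unscaled scores, softmax saturates to an almost one-hot distribution
scores = torch.tensor([30.0, 20.0, 10.0])
print(scores.softmax(dim=-1))            # nearly one-hot
print((scores / 10).softmax(dim=-1))     # scaled scores give a softer distribution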

Self-attention has the more advanced version called Multi-head Self-attention, shown in Figure 3-9, which is also used in the Transformer. It is computed as:
$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head_1},...,\mathrm{head_h})W^O \\ \mathrm{where}\ \mathrm{head_i} = \mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V)$$
where the parameter matrices are $W_i^Q\in\mathbb{R}^{d_\mathrm{model}\times d_k}$, $W_i^K\in\mathbb{R}^{d_{\mathrm{model}}\times d_k}$, $W_i^V\in\mathbb{R}^{d_{\mathrm{model}}\times d_v}$ and $W^O\in\mathbb{R}^{hd_v \times d_{\mathrm{model}}}$

Figure 3-9 Multi-head attention

In what follows we use $h = 8$ attention heads, with $d_k = d_v = d_{\mathrm{model}}/h = 64$

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in the number of heads h and the model size d_model"
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0  # sanity check
        # we assume d_v always equals d_k
        self.d_k = d_model // h
        self.h   = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn    = None
        self.dropout = nn.Dropout(p=dropout)
    
    def forward(self, query, key, value, mask=None):
        "Implements Figure 3-9"
        if mask is not None:
            # the same mask is applied to all heads
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
    
        # 1) linear projections: d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) apply attention to the projected vectors
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) concat the heads
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

Let's analyze the following input transformation code:

query, key, value = [
    lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    for lin, x in zip(self.linears, (query, key, value))
]

What exactly does this code do? It applies the linear projections to the input query, key and value and performs some dimension reshaping, so that the input is arranged into the form required by the multi-head attention mechanism. It corresponds to the contents of the red box in Figure 3-10


Figure 3-10 Multi-head attention mechanism input transformation

Let's analyze the multi-head attention splicing code again:

x = (
    x.transpose(1, 2)
    .contiguous()
    .view(nbatches, -1, self.h * self.d_k)
)

In the code above, the concatenation of the heads is realized through dimension transformations. x is the result of the attention computation, with shape (nbatches, h, seq_len, d_k), where nbatches is the batch size, h the number of heads, seq_len the sequence length and d_k the per-head dimension.

First, transpose swaps dimensions, turning (nbatches, h, seq_len, d_k) into (nbatches, seq_len, h, d_k), i.e., each head's attention output is placed on adjacent dimensions.

Then, contiguous ensures the data is laid out contiguously in memory, so that the subsequent view is valid.

Finally, view reshapes the result from (nbatches, seq_len, h, d_k) to (nbatches, seq_len, h * d_k), i.e., the attention results of all heads are stitched together along the last dimension.
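A small shape walkthrough of MultiHeadedAttention with dummy tensors (the batch size, sequence length and mask are made up for illustration):

mha  = MultiHeadedAttention(h=8, d_model=512)
x    = torch.randn(2, 10, 512)                 # (nbatches, seq_len, d_model)
mask = subsequent_mask(10)                     # causal mask, broadcast over the batch

out = mha(x, x, x, mask=mask)                  # self-attention: query = key = value = x
print(out.shape)                               # torch.Size([2, 10, 512])
print(mha.attn.shape)                          # torch.Size([2, 8, 10, 10]), per-head weights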

3.2.4 Position-wise Feed-Forward Networks

In Encoder and Decoder, in addition to the attention layer, it also includes a fully connected feed-forward network,

Does it contain the following two linear transformations, with a ReLU activation
FFN ( x ) = max ⁡ ( 0 , x W 1 + b 1 ) W 2 + b 2 \mathrm{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2FFN(x)=max(0,xW1+b1)W2+b2
It can be regarded as two convolution operations with kernel size equal to 1, and the input and output dimensions are dmodel = 512 d_{\mathrm{model}}=512dmodel=512 middle layer dimension isdff = 2048 d_{ff} = 2048dff=2048

class PositionwiseFeedForward(nn.Module):
    "FFN(feed-forward network) 的实现"

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))
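
As a quick shape check (an illustrative sketch, not from the original post), the FFN maps each position independently from d_model back to d_model:

import torch

ff = PositionwiseFeedForward(d_model=512, d_ff=2048, dropout=0.1)
x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
print(ff(x).shape)                 # torch.Size([2, 10, 512]); applied position-wise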

3.2.5 Embeddings and Softmax

Similar to other Seq2seq models, the Transformer uses embeddings to convert the input tokens and output tokens into $d_{\mathrm{model}}$-dimensional vectors. We also use a linear layer followed by softmax to convert the Decoder output into the probability of the predicted next token.

In the Transformer, the two Embedding layers of the Encoder and Decoder and the linear transformation layer before the softmax share the same weight matrix, and the embedding weights are multiplied by $\sqrt{d_{\mathrm{model}}}$ so that the value range of the embedding vectors is better matched.

Specifically, for the embedding layers of the encoder and decoder, they both use the same weight matrix to convert the input token indices into embedding vectors. This means that the encoder and decoder share the same representation of word embeddings, allowing them to better understand and process the same vocabulary. ( from chatGPT )

In addition, the same weight matrix is also used in the linear transformation layer before the softmax. The role of this layer is to convert the output of the Decoder into a probability distribution for predicting the next token. By sharing the weight matrix, the representations used by the encoder and decoder are kept more consistent and coherent, improving the performance and generalization of the model.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
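
Note that the Embeddings class above does not itself implement the weight sharing described earlier. A minimal sketch of how the three matrices could be tied is shown below, assuming the attribute names of the reference implementation this post follows (model.src_embed[0].lut, model.tgt_embed[0].lut, model.generator.proj) and a shared source/target vocabulary; these names are assumptions, not something shown in the code above:

# hypothetical weight tying; requires src_vocab == tgt_vocab
model = make_model(11, 11, N=2)

# share one embedding matrix between the Encoder and Decoder embeddings ...
model.tgt_embed[0].lut.weight = model.src_embed[0].lut.weight
# ... and tie it with the pre-softmax linear layer (shapes match: [vocab, d_model])
model.generator.proj.weight = model.tgt_embed[0].lut.weight

The make_model function used in this post does not perform this tying; the sketch only illustrates the sharing described in the paper.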

3.2.6 Positional Encoding

Since the Transformer contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must add some information about the relative or absolute position of the tokens in the sequence. For this we add "positional encodings", which have the same dimension $d_{\mathrm{model}}$ as the embeddings so that the two can be summed ( see Section 1.5 for details )

Of course, there are many possible implementations of positional encoding; here we use sine and cosine functions of different frequencies:

$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{\mathrm{model}}}) \qquad PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{\mathrm{model}}})$$

where $pos$ is the position and $i$ is the dimension; that is, each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen because it was hypothesized to allow the model to easily learn to attend by relative position, since for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
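
As a quick sanity check on that last claim (a standard derivation, not part of the original post), the angle-addition formulas show why a fixed offset $k$ acts linearly on each $(\sin, \cos)$ pair:

$$
\begin{aligned}
PE_{(pos+k,\,2i)}   &= \sin\big(\omega_i(pos+k)\big) = \sin(\omega_i\,pos)\cos(\omega_i k) + \cos(\omega_i\,pos)\sin(\omega_i k) \\
PE_{(pos+k,\,2i+1)} &= \cos\big(\omega_i(pos+k)\big) = \cos(\omega_i\,pos)\cos(\omega_i k) - \sin(\omega_i\,pos)\sin(\omega_i k)
\end{aligned}
\qquad \omega_i = 1/10000^{2i/d_{\mathrm{model}}}
$$

That is, $[PE_{(pos+k,2i)},\ PE_{(pos+k,2i+1)}]$ is a fixed rotation of $[PE_{(pos,2i)},\ PE_{(pos,2i+1)}]$ that depends only on $k$.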

In addition, dropout is applied to the sum of the embeddings and the positional encodings in both the Encoder and the Decoder, with $P_{drop} = 0.1$

class PositionalEncoding(nn.Module):
    "位置编码实现"

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # compute the positional encodings once in log space
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1) # [5000,1]
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        # take every 2nd dimension starting from dimension 0 (even dimensions)
        pe[:, 0::2] = torch.sin(position * div_term)
        # take every 2nd dimension starting from dimension 1 (odd dimensions)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0) # [1,5000,512]
        self.register_buffer("pe", pe)
    
    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

Part of the code for position encoding is as follows:

pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1) # [5000,1]
div_term = torch.exp(
    torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
)
# take every 2nd dimension starting from dimension 0 (even dimensions)
pe[:, 0::2] = torch.sin(position * div_term)
# take every 2nd dimension starting from dimension 1 (odd dimensions)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0) # [1,5000,512]

In the code above, div_term is computed in log space: torch.exp() is applied to $-(2i/d_{\mathrm{model}})\ln 10000$ instead of directly computing $10000^{-2i/d_{\mathrm{model}}}$. The purpose is to avoid the numerical overflow or underflow that can occur in direct exponentiation.

At the same time, the positional encoding alternates between sine and cosine. Specifically, assuming d_model is 512, the positional encoding also has 512 dimensions: the even dimensions (0, 2, 4, ...) are generated with the sine function and the odd dimensions (1, 3, 5, ...) with the cosine function, matching the pe[:, 0::2] and pe[:, 1::2] slices in the code. The purpose is to give adjacent positions vector representations with different properties, so that the model can better capture the order and position relationships of the word sequence, as shown in Figure 3-11.

insert image description here

Figure 3-11 Position coding example
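
A small sanity check (a hedged sketch, not from the original post) confirms that the log-space computation matches the closed-form definition:

import math
import torch

d_model, max_len = 512, 5000
position = torch.arange(0, max_len).unsqueeze(1).float()
i2 = torch.arange(0, d_model, 2).float()                      # 2i

div_term = torch.exp(i2 * -(math.log(10000.0) / d_model))     # log-space version
direct   = 1.0 / torch.pow(10000.0, i2 / d_model)             # direct 10000^(-2i/d_model)
print(torch.allclose(div_term, direct))                       # expected: True

# even dimensions of position pos are sin(pos * div_term), odd dimensions are cos(pos * div_term)
pe_even = torch.sin(position * div_term)
print(pe_even.shape)                                          # torch.Size([5000, 256])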

3.2.7 Full Model

Define a function to build a complete model

def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    "从超参数中构建模型"
    c    = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff   = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This step is very important
    # initialize the parameters with Glorot / fan_avg (Xavier uniform)
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

3.2.8 Inference

We make predictions through the forward propagation of the model. Of course, since the model has not been trained, its output is randomly generated. Next, we'll build the Loss function and try to train our model to memorize numbers from 1 to 10

def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)

def run_tests():
    for _ in range(10):
        inference_test()

run_tests()

The output is as follows:

Example Untrained Model Prediction: tensor([[ 0,  6,  2,  3, 10, 10,  1,  1,  1,  1]])
Example Untrained Model Prediction: tensor([[0, 1, 3, 0, 1, 0, 1, 0, 1, 3]])
Example Untrained Model Prediction: tensor([[ 0,  4,  2,  9,  3, 10,  6,  9,  3, 10]])
Example Untrained Model Prediction: tensor([[0, 0, 0, 0, 2, 2, 2, 2, 2, 2]])
Example Untrained Model Prediction: tensor([[0, 2, 2, 2, 2, 3, 2, 2, 3, 3]])
Example Untrained Model Prediction: tensor([[ 0,  4,  8,  3,  7, 10,  3,  2,  2,  2]])
Example Untrained Model Prediction: tensor([[0, 5, 0, 2, 5, 0, 2, 5, 0, 2]])
Example Untrained Model Prediction: tensor([[0, 7, 9, 4, 3, 3, 3, 9, 4, 5]])
Example Untrained Model Prediction: tensor([[0, 6, 6, 6, 6, 6, 6, 6, 6, 6]])
Example Untrained Model Prediction: tensor([[0, 5, 5, 5, 5, 5, 5, 5, 5, 5]])

So far, the network structure of the model has been built. Next, let’s briefly talk about part of the model training.

3.3 Model Training

In this section we only introduce some details of the model training process; please refer to the original text for the full details.

Let's pause for a moment and first introduce some of the tools needed to train the Transformer model. First, we define a Batch object that holds the src and target sentences used for training, together with the corresponding masks.

3.3.1 Batches and Masking

class Batch:
    """在训练时用于保存带 mask 的数据对象."""

    def __init__(self, src, tgt=None, pad=2):
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if tgt is not None:
            self.tgt   = tgt[:, :-1]
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens  = (self.tgt_y != pad).data.sum()
    
    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words"
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(
            tgt_mask.data
        )
        return tgt_mask

In the sample code above, src_mask is the mask for the input: it is generated by marking every element of src that equals pad as False. Its role is to shield the padding tokens in the input sequence, ensuring that the model pays no attention to the padded positions. In the Transformer, padding tokens are usually used to align sentences of different lengths so that they can be trained in batches; by setting the padded positions to False in src_mask, the model ignores them when computing attention.

For example, if your input sequence is machine learning and the model expects a fixed maximum length, say 512, then the sequence needs to be padded up to that length. The padded part is marked with a specific padding token (e.g. pad = 2), and these positions are then set to False in src_mask to ensure that the model does not attend to the padded part.

It should be noted that the padding operation itself is usually done in the data preprocessing stage, while src_mask is generated during training, so that a mask can be built dynamically for each batch to handle input sequences of different lengths.

self.tgt is the input of the Decoder with the last token removed, i.e. the content of the red box in Figure 3-12: START machine learning. It needs a mask, because the Decoder must not see future results.

For tgt_mask, the Decoder not only has to mask the padding tokens as src_mask does, but also has to mask the subsequent positions; this is why the subsequent_mask function is used to hide information from future positions.

self.tgt_y is the expected output, machine learning; it is used to count the number of non-padding tokens self.ntokens, and it does not need a mask.

insert image description here

Figure 3-12 self.tgt
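
A tiny example (a sketch assuming the Batch class and subsequent_mask function defined above, with pad = 2; the token ids are hypothetical) makes the two masks concrete:

import torch

# one source and one target sentence already converted to token indices; index 2 is the padding token
src = torch.LongTensor([[1, 5, 6, 2, 2]])   # "machine learning" + 2 pads
tgt = torch.LongTensor([[0, 5, 6, 7, 2]])   # START + tokens + pad

batch = Batch(src, tgt, pad=2)
print(batch.src_mask)   # shape [1, 1, 5]: False at the padded positions
print(batch.tgt_mask)   # shape [1, 4, 4]: subsequent-position mask combined with the padding mask
print(batch.ntokens)    # number of non-padding tokens in tgt_y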

Next, we will create a training status object to track some information during the training process

3.3.2 Training Loop

class TrainState:
    """跟踪 number of steps, example, and tokens processed"""

    step: int = 0           # Steps in the current epoch
    accum_step: int = 0     # Number of gradient accumulation steps
    samples: int = 0        # total # of examples used
    tokens:  int = 0        # total # of tokens processed

def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    mode="train",
    accum_iter=1,
    train_state=TrainState()
):
    "Train a single epoch"
    start = time.time()
    total_tokens = 0
    total_loss   = 0
    tokens  = 0
    n_accum = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        # loss_node = loss_node / accum_iter
        if mode == "train" or mode == "train+log":
            loss_node.backward()
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens  += batch.ntokens
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            scheduler.step()
        
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 40 == 1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]
            elapsed = time.time() - start
            print(
                (
                    "Epoch Step: %6d | Accumulation Step: %3d | Loss: %6.2f "
                    + "| Tokens / Sec: %7.1f | Learning Rate: %6.1e"
                )
                % (i, n_accum, loss / batch.ntokens, tokens / elapsed, lr)
            )
            start = time.time()
            tokens = 0
        del loss
        del loss_node
    return total_loss / total_tokens, train_state

3.3.3 Optimizer

We use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$

The learning rate is updated according to the following formula:

$$lrate = d_{\mathrm{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$$

where $warmup\_steps = 4000$

Note : This part is very important, you need to use this setting to train the model

def rate(step, model_size, factor, warmup):
    """
    we have to default the step to 1 for LambdaLR function
    to avoid zero raising to negative power
    """
    if step == 0:
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )
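
A hedged sketch of how this schedule can be wired up with Adam and PyTorch's LambdaLR (the toy vocab sizes, base lr and factor here are illustrative choices, not taken from the code above):

import torch
from torch.optim.lr_scheduler import LambdaLR

d_model = 512
model = make_model(11, 11, N=2)   # toy vocab sizes, as in the inference test above

# base lr of 1.0 so that the effective lr is exactly the value returned by rate()
optimizer = torch.optim.Adam(
    model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9
)
# LambdaLR multiplies the base lr by the value returned by lr_lambda at each scheduler.step()
lr_scheduler = LambdaLR(
    optimizer=optimizer,
    lr_lambda=lambda step: rate(step, d_model, factor=1.0, warmup=4000),
)

# inside the training loop (see run_epoch above): optimizer.step(), then lr_scheduler.step()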

3.3.4 Regularization

Label smoothing with $\epsilon_{ls} = 0.1$ is employed: a KL divergence loss against a smoothed target distribution is used instead of a one-hot target distribution

class LabelSmoothing(nn.Module):
    "Implement label smoothing"

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion   = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence  = 1.0 - smoothing
        self.smoothing   = smoothing
        self.size      = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2)) # -2: one slot for the true label, one for the padding index
        # set the position of each target index in true_dist to self.confidence
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())

When computing the loss, we use the KL divergence loss instead of the traditional cross-entropy loss. In the LabelSmoothing class, the two inputs of the KL divergence criterion self.criterion are the model output x (which nn.KLDivLoss expects to be log-probabilities) and true_dist, the smoothed distribution built from the real labels.
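
A small toy example (a sketch using the LabelSmoothing class above; the numbers are arbitrary) shows how the smoothed target distribution is built:

import torch

# vocabulary of size 5, padding index 0, smoothing 0.4
crit = LabelSmoothing(size=5, padding_idx=0, smoothing=0.4)
predict = torch.log(torch.FloatTensor(
    [[1e-9, 0.2, 0.7, 0.1, 1e-9],
     [1e-9, 0.2, 0.7, 0.1, 1e-9]]
))
target = torch.LongTensor([2, 1])

loss = crit(predict, target)
print(loss)
print(crit.true_dist)
# each row: confidence 0.6 at the target index, 0.4 / (5 - 2) elsewhere,
# and column 0 (the padding index) forced to 0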

At this point, the hands-on part of the Transformer comes to an end. For training on real data you can look into the relevant material yourself; this post only walks through the details needed to implement the model.

3.4 Hands-on summary

Since the blogger does not know much about NLP-related tasks, things were confusing at the beginning and it was only possible to follow the reference blog step by step. However, by working through the network structure of the Transformer and reviewing Mr. Li Hongyi's lectures again, it gradually became familiar.

Start with the network structure. The Transformer model is divided into two parts, Encoder + Decoder; the reference blog builds the Linear + softmax part at the end of the Decoder as a separate Generator class. Both the Encoder and the Decoder have residual connections, implemented mainly through the SublayerConnection class, which wraps LayerNorm + sublayer + dropout. In the Encoder the sublayers are Multi-Head Attention and the fully connected feed-forward network; the Decoder is similar, except that its Multi-Head Attention has a mask.

The Multi-Head Attention part is the hardest piece of the whole network to build, but it becomes relatively easy when combined with Mr. Li Hongyi's matrix view of attention. In addition, the positional encoding part is also worth noting; this implementation encodes position mainly by using sine and cosine functions of different frequencies.

There are also some details worth our attention in the model training part. The first is mask generation: tgt_mask has to take into account that the Decoder cannot see information from subsequent positions, so it is generated slightly differently from src_mask. In addition, to reduce the model's sensitivity to noise in the training data and to mitigate overfitting, a label-smoothed distribution is used instead of the traditional one-hot target distribution, and the KL divergence loss measures the difference between the model output and the smoothed distribution.

Generally speaking, we can grasp the overall direction according to the network structure diagram. For some details, you can ask chatGPT

Summary

The blogger was never interested in NLP-related content before, and although he had heard about the Transformer for a long time, he had never found an explanation that suited him. This time, wanting to learn BEVFormer, he supplemented his knowledge of the Transformer, and after listening to Mr. Li Hongyi's very accessible explanation he now has an overall understanding of it.

To put it simply, the Transformer is a Seq2seq model consisting of three parts: the Encoder, the Decoder, and the Cross attention bridge between them. The Encoder uses Self-attention; the Decoder is very similar to the Encoder, except that its Multi-Head Attention is masked; and in the Cross attention bridge, the $\boldsymbol q$ generated by the Decoder attends over the Encoder's $\boldsymbol k$ and $\boldsymbol v$ to extract information, which then serves as the input for the next stage.

This brief understanding of the Transformer lays a foundation for learning BEVFormer later, and at the same time it has sparked some interest in NLP and ASR tasks. Many thanks to Mr. Li Hongyi!
