[Long read] In-depth analysis of the Transformer and attention mechanism (with complete code implementation)


In the article "Graphic NLP Model Development: From RNN to Transformer" , I introduced the development and evolution of the NLP model, and showed you the architecture and shortcomings of each technology in an intuitive graphical way. Some readers reported that although the graphic method is intuitive, it lacks depth. Considering that Transformer is the cornerstone of a large model, this article will focus on an in-depth analysis of Transformer and attention mechanism.


Figure 1. The development of large language models after Transformer

This is my longest article so far, covering almost everything you need to know about the Transformer and the attention mechanism, including self-attention, query, key, value, multi-head attention, masked multi-head attention, the Transformer architecture, and a complete PyTorch implementation. I hope that after reading it you will have a solid understanding of the Transformer.

The background and significance of the Transformer

Let's start with the history of the Transformer and the attention mechanism. The attention mechanism actually appeared earlier than the Transformer. Attention was first used in computer vision in 2014, to try to understand what a neural network was looking at when it made a prediction. This was one of the first attempts to interpret the output of a convolutional neural network (CNN). In 2015, attention first appeared in natural language processing (NLP), for alignment in machine translation. Finally, in 2017, attention was added to the Transformer network for language modeling. Since then, Transformers have surpassed RNNs in prediction accuracy and become the state of the art in NLP.

The problem with RNNs

Transformers have displaced RNNs in NLP ultimately because RNNs have several problems that the Transformer solves.

Problem 1. Long-range dependency problem

RNNs suffer from the long-range dependency problem and are not well suited to long texts. Transformer networks, by contrast, rely almost entirely on attention modules. Attention can connect any two positions in the sequence, so there is no long-range dependency problem: for the Transformer, long-range dependencies are handled the same way as short-range ones.

Problem 2. Vanishing and exploding gradients

RNNs suffer from vanishing and exploding gradients, while the Transformer has almost no such problems. In a Transformer network, the entire sequence is processed simultaneously rather than step by step, so vanishing or exploding gradients are rarely an issue.

Problem 3. Low training performance

RNNs require more training steps to reach a local or global optimum. An RNN can be seen as a very deep unrolled network whose depth depends on the length of the sequence. This produces many parameters, most of which are interrelated, leading to longer training times and more optimization steps. The Transformer requires fewer training steps than an RNN.

Problem 4. Unable to parallelize

RNNs cannot be computed in parallel. An RNN is a sequential model: every computation in the network happens in order, and each step depends on the output of the previous step, so it is hard to parallelize. The Transformer network allows parallel computation and can take full advantage of GPU parallelism.

attention mechanism

word embedding

Computers cannot work with natural-language text directly, so the first step in NLP is to convert words into vectors. Converting the words in a text into fixed-length vectors is called embedding. Each dimension of the embedding has a potential meaning. For example, the first dimension could characterize the "masculinity" of a word: the larger the value in that dimension, the more likely the word is associated with men. This is just an example for intuition; in practice, the meaning of individual dimensions is rarely interpretable.

There is no universal standard for word embeddings. The embedding of the same word differs across neural networks, and even across training stages. Embeddings start as random values and are continually adjusted during training to minimize the network's error.

Stacking the embeddings of every word in a sentence yields an embedding matrix, in which each row represents one word's embedding.
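As a minimal illustration of how a sentence of token ids becomes such an embedding matrix in PyTorch (the vocabulary size, embedding size, and token ids below are all made up for demonstration):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 8       # hypothetical vocabulary and embedding sizes
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([12, 47, 5, 308, 9])   # a "sentence" of 5 token ids (made up)

A = embedding(token_ids)               # embedding matrix: one row per word
print(A.shape)                         # torch.Size([5, 8])
```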

self-attention

For example, consider the following sentence:

"小美长得很漂亮而且人还很好" (Xiaomei is very beautiful, and she is also a very nice person).

If we look only at the word "人" (person) in this sentence, its two nearest neighbors are "而且" (and) and "还" (also), but these two words carry no contextual information. The words "小美" (Xiaomei) and "好" (nice) are much more closely related to "人": the second half of the sentence is saying that Xiaomei has a good character. This example tells us that the proximity of words is not always related to meaning; context matters more.

When this sentence is fed to a computer, the program treats each word as a token $t$, and each token has a word embedding $A$. But these word embeddings carry no context. The idea of the attention mechanism is therefore to apply some kind of weight or similarity so that the initial word embedding $A$ absorbs contextual information, producing the final word embedding $Y$.


Figure 2. Self-attention example explanation

In the embedding space, similar words appear closer or have similar embeddings. For example the word "programmer" has more to do with "code" and "development" than it does with "lipstick". Likewise, "lipstick" has more to do with "eye shadow" and "foundation" than with the word "rocket."

So, intuitively, if the word "programmer" appears at the beginning of a sentence and the word "code" appears at the end, they should provide context for each other. We use this idea to find a weight vector $W$ by multiplying word embeddings with each other (dot products) to obtain more context. So, in the sentence "Xiaomei is pretty and nice", instead of using the word embeddings as they are, we multiply the embedding of each word with the embeddings of every other word. The calculation below illustrates this.

  1. Find the weights

$$
\begin{cases}
a_1 \cdot a_1 = w_{11} \\
a_1 \cdot a_2 = w_{12} \\
a_1 \cdot a_3 = w_{13} \\
\quad\vdots \\
a_1 \cdot a_n = w_{1n}
\end{cases}
\qquad \xrightarrow{\text{normalize}} \qquad
\begin{matrix}
w_{11} \\ w_{12} \\ w_{13} \\ \vdots \\ w_{1n}
\end{matrix}
\qquad \text{(weights of the first word, recomputed)} \tag{1}
$$

  2. Compute word embeddings with context

$$
\begin{aligned}
w_{11}a_1 + w_{12}a_2 + w_{13}a_3 + \dots + w_{1n}a_n &= y_1 \\
w_{21}a_1 + w_{22}a_2 + w_{23}a_3 + \dots + w_{2n}a_n &= y_2 \\
&\;\;\vdots \\
w_{n1}a_1 + w_{n2}a_2 + w_{n3}a_3 + \dots + w_{nn}a_n &= y_n
\end{aligned} \tag{2}
$$

As the formulas above show, we first take the dot product of the first word's initial embedding with the embeddings of all the other words in the sentence to obtain a new set of weights. This set of weights ($w_{11}$ to $w_{1n}$) is then normalized (softmax is commonly used). Next, the normalized weights are multiplied with the initial embeddings of all the words in the sentence and summed:

$$ w_{11}a_1 + w_{12}a_2 + w_{13}a_3 + \dots + w_{1n}a_n = y_1 \tag{3} $$

The weights $w_{11}$ to $w_{1n}$ record the context of the first word $a_1$. When we apply these weights to each word, we are in effect re-weighting all the other words toward the first word. In a sense, the word "Xiaomei" now leans more toward "pretty" and "nice" than toward the words that merely sit next to it. This provides context, in a way.

Repeating this process for every word gives each word in the sentence contextual information from the other words. In vector form the process is compact; see the formula below.
$$
\begin{aligned}
\text{softmax}(A \cdot A^T) &= W \\
W \cdot A &= Y
\end{aligned} \tag{4}
$$
Note that the weights here are not trained, and the order or proximity of the words has no influence on them. Moreover, the process is independent of sentence length: the number of words does not matter. This way of adding context to the words of a sentence is called self-attention.
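A minimal sketch of this parameter-free self-attention in PyTorch (the embedding matrix and its sizes below are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

A = torch.randn(5, 8)            # embedding matrix: 5 words, embedding size 8 (made-up values)

W = F.softmax(A @ A.T, dim=-1)   # formula (4): row-normalized similarity weights
Y = W @ A                        # contextual word embeddings

print(W.shape, Y.shape)          # torch.Size([5, 5]) torch.Size([5, 8])
```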

query, key, value

The problem with self-attention is that nothing in it is trained. So a natural thought is that if we add some trainable parameters, the network should be able to learn patterns that provide even better context. This is where the idea of Query, Key, and Value comes in.

Let's reuse the earlier example, "Xiaomei is pretty and nice". In the self-attention formulation, the initial word embedding $A$ appears three times: the first two times it is used to compute the weights between a word and the other words in the sentence (including itself), and the third time it is multiplied by the weights to produce the final contextual word embedding. We give the embedding $A$ in these three places three names: Query, Key, and Value.

Suppose we want to measure how similar all the words are to the first word $a_1$. We can treat $a_1$ as the query. Then we take the dot product of this query with every word in the sentence ($a_1$ to $a_n$); here $a_1$ to $a_n$ act as the keys. The combination of query and key gives us the weights. These weights are then multiplied with all the words acting as values ($a_1$ to $a_n$). That is the query, key, and value. The formula below marks which parts play the role of query, key, and value.
$$
\begin{aligned}
\text{softmax}(\underbrace{A}_{\text{Query}} \cdot \underbrace{A^T}_{\text{Key}}) &= W \\
W \cdot \underbrace{A}_{\text{Value}} &= Y
\end{aligned} \tag{5}
$$
So where do we add the trainable parameter matrices? It is actually quite simple. We know that multiplying a $1 \times k$ vector by a $k \times k$ matrix produces another $1 \times k$ vector. So we multiply each key among $A_1$ to $A_n$ (each key has shape $1 \times k$) by a $k \times k$ matrix $W^K$ (the key matrix). Similarly, we multiply the query vector by a matrix $W^Q$ (the query matrix) and the value vector by a matrix $W^V$ (the value matrix). The matrices $W^K$, $W^Q$, and $W^V$ can all be learned by the neural network and provide better context than plain self-attention.

After adding the trainable parameter matrices, the query, key, and value vectors can be written as:

$$ Q = AW^Q, \qquad K = AW^K, \qquad V = AW^V \tag{6} $$

Substituting into formula (5) gives the new expression:

$$
\begin{aligned}
\text{softmax}(Q \cdot K^T) &= W \\
W \cdot V &= Y
\end{aligned}
$$

Combining the two parts yields:

$$ Y = \text{softmax}(Q \cdot K^T) \cdot V $$
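A sketch of the same computation with trainable projections $W^Q$, $W^K$, $W^V$, implemented here as bias-free linear layers (all sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k = 8                               # embedding dimension (illustrative)
W_q = nn.Linear(k, k, bias=False)   # W^Q
W_k = nn.Linear(k, k, bias=False)   # W^K
W_v = nn.Linear(k, k, bias=False)   # W^V

A = torch.randn(5, k)               # 5 word embeddings
Q, K, V = W_q(A), W_k(A), W_v(A)    # formula (6)

Y = F.softmax(Q @ K.T, dim=-1) @ V  # Y = softmax(Q K^T) V
print(Y.shape)                      # torch.Size([5, 8])
```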

attention

With the basic concepts of Query, Key, and Value in place, let's look at the formal steps and formulas behind the attention mechanism. For convenience, I will explain attention through the analogy of a database query.

In a database, given a query $q$ and keys $k_i$, we want to retrieve certain values $v_i$: the query is used to identify the key that corresponds to a particular value. The figure below shows the steps of retrieving data from a database. Suppose we send a query to the database; some operation determines which key in the database is most similar to the query. Once that key is found, the value corresponding to it is returned as the output. In the figure, the operation finds that the query is most similar to Key 4, so it outputs Value 4, the value corresponding to Key 4.


Figure 2. Database value retrieval process

Attention is similar to this database-retrieval technique, but in a probabilistic manner:

$$ \text{attention}(q, k, v) = \sum_i \text{similarity}(q, k_i)\, v_i \tag{7} $$

  1. The attention mechanism measures the similarity between the query $q$ and each key $k_i$.
  2. It returns a weight for each key, representing this similarity.
  3. Finally, it returns a weighted combination of all the values in the database as the output.

In a sense, the only difference between attention and database retrieval is that in database retrieval we get back one specific value, whereas in attention we get a weighted combination of values. For example, if in the attention mechanism a query is most similar to Key 1 and Key 4, then those two keys receive the most weight, and the output is a combination of Value 1 and Value 4.
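The contrast can be made concrete with a toy sketch: a hard lookup returns exactly one value, while attention returns a similarity-weighted mix of all values (all tensors below are random and purely illustrative):

```python
import torch
import torch.nn.functional as F

keys   = torch.randn(4, 8)                  # Key 1 .. Key 4 (made-up vectors)
values = torch.randn(4, 8)                  # Value 1 .. Value 4
query  = torch.randn(8)

# hard retrieval: pick the single most similar key and return its value
best = torch.argmax(keys @ query)
hard_result = values[best]

# attention: weight every value by the softmaxed similarity of its key
weights = F.softmax(keys @ query, dim=-1)   # one weight per key, summing to 1
soft_result = weights @ values              # weighted combination of all values
```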

The figure below shows the steps required to obtain the final attention value from the query, key and value.


Figure 3. Steps to get attention value

Each step is explained in detail below.

first step

The first step involves the keys, the query, and a corresponding similarity measure. We compute the similarity between the query and each key, where both the query and the keys are embedding vectors. The similarity $S$ is defined as some function of the query $q$ and a key $k$, and it can be computed in various ways. Some common similarity functions are listed below:
$$
S_i = f(q, k_i) = \begin{cases}
q^T \cdot k_i & \text{dot product} \\
q^T \cdot k_i / \sqrt{d} & \text{scaled dot product ($d$ is the dimension of the key vector)} \\
q^T \cdot W \cdot k_i & \text{general dot product ($W$ is a weight matrix that projects the query into a new space)} \\
\text{kernel method} & \text{a nonlinear function maps the vectors $q$ and $k$ into a new space}
\end{cases}
$$
The similarity can be a simple dot product of the query and a key, or a scaled dot product, where the dot product of $q$ and $k$ is divided by the square root of the dimension $d$. These are the two most commonly used techniques for computing similarity. In the general dot product, a weight matrix $W$ projects the query into a new space before taking the dot product with the key $k$. Kernel methods, on the other hand, use nonlinear functions as the similarity.

second step

The second step is to find the weights $a$. This is usually done with softmax. The formula is:
$$ a_i = \frac{\exp(S_i)}{\sum_j \exp(S_j)} \tag{8} $$
Here the similarity is connected to the weights, like a fully connected layer.

third step

The third step takes a weighted combination of the softmax result $a$ and the corresponding values $V$: the first element of $a$ is multiplied by the first element of $V$, then added to the product of the second element of $a$ and the second element of $V$, and so on. The final output is the attention value we need:

$$ \text{attention value} = \sum_i a_i V_i \tag{9} $$
Summary

To sum up these three steps: with the help of the query $q$ and the keys $k$, we obtain the attention value, which is a weighted sum (linear combination) of the values $V$, with the weights coming from some similarity between the query and the keys.

For ease of demonstration, the explanation above worked with individual vectors. Written in matrix form, the formula is more concise:
$$ \text{Attention}(Q, K, V) = \text{softmax}(QK^T)\,V \tag{10} $$
In the original paper, the researchers also divide by the square root of the dimension of $Q$ (or $K$, $V$) to prevent the inner products from becoming too large:
$$ \text{Attention}(Q, K, V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d}}\Big)V \tag{11} $$

Neural Network Representation of Attention Mechanism


Figure 4. Neural network representation of the attention module

The figure above shows the neural-network representation of the attention module. The word embeddings are first passed through linear layers that have no bias term, so they do nothing but matrix multiplication. One layer represents the keys, another the queries, and the last the values. A matrix multiplication between keys and queries, followed by normalization, gives the weights. These weights are then multiplied by the values and summed to obtain the final attention vector. This module can be used in a neural network and is called an "attention block". Multiple such attention blocks can be stacked to provide more context. The biggest advantage of attention blocks is that gradients backpropagate through them, so the key, query, and value weights are all updated during training.

masked attention

In machine translation or text-generation tasks, we predict the next word one step at a time, so at any given step the model may only see the words generated so far. Attention must therefore be restricted to the current word and the words before it, never the words that come after. In short, the attention matrix must not have nonzero entries above the diagonal.

We can modify the attention by adding a mask matrix that removes the network's knowledge of the future:
$$ \text{Attention}(Q, K, V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d}} + M\Big)V \tag{12} $$
where $M$ is the mask matrix, defined as:

$$ M = (m_{i,j})_{i,j=0}^{n}, \qquad m_{i,j} = \begin{cases} 0 & i \ge j \\ -\infty & i < j \end{cases} \tag{13} $$
The entries of $M$ above the diagonal are set to negative infinity, so that softmax maps them to 0.
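A sketch of how such a mask can be built and applied in PyTorch (the sequence length is illustrative); the full implementation later in this article uses the same idea via `torch.triu`:

```python
import torch
import torch.nn.functional as F

n = 5                                       # sequence length (illustrative)
scores = torch.randn(n, n)                  # stands in for QK^T / sqrt(d)

# M: 0 on and below the diagonal, -inf strictly above it (formula (13))
M = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)

weights = F.softmax(scores + M, dim=-1)     # future positions receive weight 0
```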

multi-headed attention

To overcome some of the limitations of a single attention head, researchers introduced multi-head attention. Let's return to the original example, "小美长得很漂亮而且人还很好" (Xiaomei is very beautiful and she is also a very nice person). Here the word "人" is grammatically and semantically linked to "小美" and "好": in this sentence "人" should be understood as "character" (人品), meaning that Xiaomei has a good character. A single attention mechanism may fail to capture the relationship between these three words correctly. In such cases, using multiple attentions better represents the words related to "人". This relieves a single attention of the burden of finding every important word and increases the chance of finding more relevant ones.

To do this, we add more linear layers for the keys, queries, and values. These linear layers are trained in parallel and have independent weights. As shown in the figure below, each value, key, and query now produces 3 outputs instead of one. The 3 sets of keys and queries give 3 different sets of weights, which are then matrix-multiplied with the 3 values, yielding 3 outputs. Finally, the 3 attention outputs are concatenated to give one final attention output.


Figure 5. Multi-head attention with 3 linear layers

The number 3 in the demonstration above is not fixed; it is just a number chosen for illustration. In practice there can be any number of linear layers, each of which is called a "head" ($h$). That is, there can be any number $h$ of linear layers producing $h$ attention outputs, which are then concatenated. This is exactly where the name multi-head attention comes from. The figure below is a simplified version of multi-head attention with $h$ heads.


Figure 6. Multi-head attention with h heads

Once you understand how multi-head attention works, its formula is very simple and essentially mirrors the structure in the figure above:

$$
\begin{aligned}
\text{head}_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \\
\text{MultiHead} &= \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)\,W^O
\end{aligned} \tag{14}
$$
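Formula (14) reads directly as code: run $h$ independent attention heads, concatenate their outputs, and project with $W^O$. The sketch below uses per-head linear layers for clarity; the full implementation later in this article uses an equivalent but more efficient reshape. All sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, h = 512, 8
d_k = d_model // h

# one set of projection matrices W_i^Q, W_i^K, W_i^V per head
heads = nn.ModuleList([
    nn.ModuleDict({
        'W_q': nn.Linear(d_model, d_k, bias=False),
        'W_k': nn.Linear(d_model, d_k, bias=False),
        'W_v': nn.Linear(d_model, d_k, bias=False),
    }) for _ in range(h)
])
W_o = nn.Linear(h * d_k, d_model)    # output projection W^O

x = torch.randn(10, d_model)         # 10 tokens (illustrative)
outs = []
for head in heads:
    Q, K, V = head['W_q'](x), head['W_k'](x), head['W_v'](x)
    outs.append(F.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V)   # head_i

y = W_o(torch.cat(outs, dim=-1))     # Concat(head_1, ..., head_h) W^O
```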
So far we have covered the mechanisms and ideas behind query, key, value, attention, and multi-head attention, which are all the important building blocks of the Transformer network. Next, let's see how these blocks are combined to form a Transformer.

Transformer network

The Transformer comes from Google's 2017 paper Attention Is All You Need and attracted enormous attention as soon as it was published. Today the Transformer has replaced the RNN as the best-performing model in NLP and even in computer vision (Vision Transformers); ChatGPT, which is hugely popular right now, also evolved from the Transformer.

The figure below shows the Transformer network architecture:


Figure 7. Transformer network

The Transformer network consists of two parts: an encoder and a decoder.

In NLP tasks, the encoder encodes the input sentence, while the decoder generates the processed sentence. The Transformer encoder can process the entire sentence in parallel, which makes it faster and better than an RNN, which can only process one word of the sentence at a time.

Encoder


Figure 8. The encoder part of the Transformer network

The encoder starts with the input. The whole sentence is fed into the network at once and passed through the "input embedding" block. A "positional encoding" is then added to each word of the sentence. Positional encoding is essential for understanding the position of each word: without it, the model would treat the whole sentence as a bag of words with no order or meaning.

Input embedding

Each word in the sentence uses the embedding space to obtain its vector embedding. Embedding simply converts a word in any language into its vector representation. For example, as shown in Figure 9, similar words have similar embeddings: the words "cat" and "meow" land close together in the embedding space, while "cat" and "chip" land farther apart.


Figure 9. Input embedding

Positional encoding

The same word can mean different things in different sentences. For example, in "你人真好" the word "人" (position 2) refers to a person's character, while in "你是个好人" the word "人" (position 5) refers to a human being. The two sentences use almost the same characters, yet the meanings differ. To help capture such semantics, researchers introduced positional encoding: a vector that carries information about a word's context and position in the sentence.

In any sentence, the fact that words appear one after another carries important meaning; if the words of a sentence are scrambled, the sentence most likely loses its meaning. But when the Transformer loads a sentence, it does not load it sequentially; it loads it in parallel. Since the architecture does not account for word order when loading in parallel, we must explicitly define the position of each word in the sentence. This helps the Transformer understand the positions of words relative to one another, and this is where positional embeddings come in. A positional embedding is a vector encoding of a word's position, and it is added to the input embedding before entering the attention network. Figure 10 gives an intuitive view of the input embedding and the positional embedding before they enter the attention network.


Figure 10. Intuition for positional embeddings

There are various ways to define positional embeddings. In the original paper Attention Is All You Need, the authors use alternating sine and cosine functions, as shown in formula (15):
$$
\begin{aligned}
PE_{(pos, 2i)} &= \sin\Big(\frac{pos}{10000^{2i/d_{model}}}\Big) \\
PE_{(pos, 2i+1)} &= \cos\Big(\frac{pos}{10000^{2i/d_{model}}}\Big)
\end{aligned} \tag{15}
$$

where $pos$ is the position and $i$ is the dimension.

This embedding scheme works well for text data but does not carry over directly to images, so there are multiple ways to embed the positions of objects (text or images), and they can be fixed or learned during training. The basic idea is that positional embeddings allow the Transformer architecture to understand where words sit in a sentence, instead of scrambling the words and losing the meaning.

Once the input embedding and positional embedding are done, the result flows into the most important part of the encoder, which contains two key blocks: "multi-head attention" and a "feed-forward network".

multi-headed attention

The principle of multi-head attention was explained in detail earlier; if anything is unclear, you can go back and review that section.

The multi-head attention block receives as input a vector (sentence) containing subvectors (words in the sentence) and then computes the attention between each position and all other positions of the vector.


Figure 11. Scaled dot product attention

The figure above shows scaled dot-product attention. It is very similar to self-attention, except that a scaling step (Scale) and a masking step (Mask) are added after the first matrix multiplication (MatMul). The scaling is defined in the original paper as follows:
$$ \text{Scale} = 1/\sqrt{d}, \qquad \text{output} = Q^TK/\sqrt{d} \tag{16} $$

where $Q^TK$ is the product of the query and key matrices, and $d$ is the dimensionality of the word embeddings.

The scaled result is passed to the mask layer. Masking layers are optional and useful for tasks like text generation, machine translation, etc.

The network structure of the attention module was covered earlier in the neural-network representation of the attention mechanism, so I won't repeat it here.

Multi-head attention accepts multiple keys, queries, and values, produces multiple attention outputs through multiple scaled dot-product attention blocks, and finally concatenates these attentions into one final attention output. Multi-head attention was also explained in detail earlier; see the multi-head attention section.

In simple terms: the main vector (sentence) contains sub-vectors (words) - each word has a positional embedding. The attention computation treats each word as a "query" and finds the "keys" corresponding to other words in the sentence, and then performs a convex combination on the corresponding "values". In multi-head attention, multiple values, queries, and keys are selected, providing multiple attention (better word embeddings and context). These multiple attentions are concatenated to give the final attention value (context combination of all words from all multiple attentions), which works better than using a single attention block.

Add & Norm and Feedforward

The next module is Add & Norm, which takes a residual connection from the original word embedding, adds it to the multi-head attention output, and then layer-normalizes the result to zero mean and unit variance.

The result of Add & Norm is passed to the Feedforward module, and another Add & Norm block follows the Feedforward module.
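The Add & Norm pattern takes only a couple of lines; here is a quick sketch of how the two sublayers are wrapped (this is the same pattern used by the `EncoderLayer` class later in this article; the tensors are illustrative stand-ins):

```python
import torch
import torch.nn as nn

d_model = 512
attn_norm = nn.LayerNorm(d_model)
ffn_norm = nn.LayerNorm(d_model)

x = torch.randn(10, d_model)          # token embeddings (illustrative)
attn_out = torch.randn(10, d_model)   # stand-in for the multi-head attention output

x = attn_norm(x + attn_out)           # Add & Norm after attention

ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = ffn_norm(x + ffn(x))              # Add & Norm after the feed-forward block
```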

The entire multi-head attention and feed-forward module is repeated $n$ times in the encoder, where $n$ is a hyperparameter.

decoder


Figure 12. The decoder part of the Transformer network

The output of the encoder is also a sequence of embeddings, one per position, where each position embedding contains not only the embedding of the original word at that position, but also information about other words it has learned using attention.

The output of the encoder is sent to the decoder part of the Transformer network, as shown in Figure 12. The purpose of a decoder is to produce output. In the original paper Attention is All You Need , the decoder is used for sentence translation (eg from Chinese to English). So the encoder will take the Chinese sentence and the decoder will translate it into English. In other applications, the decoder part of the Transformer network is not necessary, so I won't elaborate on it too much.

The Transformer decoder works as follows (take the machine translation task in the original paper as an example):

  1. In the machine translation task (Chinese to English), the decoder takes the English (target) sentence. As with the encoder, a word embedding and a positional embedding are added first and fed to the masked multi-head attention block.
  2. The self-attention module generates an attention vector for each word in the English sentence, representing how related each word is to the other words in the sentence.
  3. The attention vectors of the English sentence are then compared with the attention vectors from the Chinese sentence (the encoder output). This is where the Chinese-to-English word mapping happens.
  4. In the final layers, the decoder predicts the most likely English word as the translation.
  5. The whole process is repeated multiple times to obtain the translation of the entire text.

The corresponding relationship between each of the above steps and the decoder network module is as follows:


Figure 13. The role of different decoder blocks in sentence translation

Most of the modules in the decoder have been seen in the encoder before, so I won't go into details here.

Implement Transformer with PyTorch

To build our own Transformer model from scratch, the following steps need to be followed:

  1. Import necessary libraries and modules
  2. Define the basic modules: multi-head attention, position feedforward network, position encoding
  3. Build encoder and decoder layers
  4. Combine the encoder and decoder layers to build a complete Transformer model
  5. Prepare sample data
  6. Train the model

We will complete the above work step by step.

Import necessary libraries and modules

Let's start by importing the necessary libraries and modules. Building Transformer requires the following libraries and modules:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

Define the basic modules

Next, we will define the basic building blocks of the Transformer model.

multi-headed attention

Multi-head attention has been discussed in detail before, and its structure is shown in Figure 6. In simple terms, the multi-head attention mechanism computes the attention between each pair of positions in the sequence. It consists of multiple "attention heads" that capture different aspects of the input sequence.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

The MultiHeadAttention class initializes the module with the input parameters and linear transformation layers. It computes the attention scores, reshapes the input tensors into multiple heads, and combines the attention outputs of all heads. The forward() method computes multi-head self-attention, allowing the model to focus on several different aspects of the input sequence.
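A quick shape check of how this module might be called (random tensors, illustrative sizes; it reuses the imports from the beginning of the article):

```python
mha = MultiHeadAttention(d_model=512, num_heads=8)

x = torch.randn(2, 10, 512)   # (batch size, sequence length, d_model)
out = mha(x, x, x)            # self-attention: Q, K, and V all come from x
print(out.shape)              # torch.Size([2, 10, 512])
```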

Position-wise feed-forward network

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

The PositionWiseFeedForward class extends PyTorch's nn.Module and implements a position-wise feed-forward network. It is initialized with two linear transformation layers and a ReLU activation function. The forward() method applies these transformations and the activation in sequence to compute the output, giving the model an extra non-linear transformation that is applied to each position independently.

Positional encoding

Position encoding is used to inject the position information of each token in the input sequence. It uses sine and cosine functions of different frequencies to generate position codes.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

The PositionalEncoding class is initialized with the input parameters d_model and max_seq_length and creates a tensor to store the positional-encoding values. Using div_term, it fills the even indices with sine values and the odd indices with cosine values. The forward() method adds the stored positional encodings to the input tensor, enabling the model to capture the positional information of the input sequence.
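A quick usage sketch (random input, illustrative sizes): the module simply adds the first `seq_len` rows of the precomputed encoding to every sequence in the batch.

```python
pos_enc = PositionalEncoding(d_model=512, max_seq_length=100)

x = torch.randn(2, 10, 512)   # embedded tokens: (batch size, seq_len, d_model)
out = pos_enc(x)              # adds pe[:, :10] to each sequence via broadcasting
print(out.shape)              # torch.Size([2, 10, 512])
```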

With these basic blocks, we can start to build the encoder layer and decoder layer.

encoder layer

The structure of the Transformer encoder layer is shown in Figure 8 (the encoder part of the Transformer network). An encoder layer consists of a multi-head attention layer, a position-wise feed-forward layer, and two layer-normalization layers.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

The EncoderLayer class is initialized with the input parameters and components: a multi-head attention module, a position-wise feed-forward module, two layer-normalization modules, and a dropout layer. The forward() method computes the encoder layer's output by applying self-attention, adding the attention output to the input tensor, and normalizing the result. It then computes the position-wise feed-forward output, combines it with the normalized self-attention output, and normalizes the final result before returning the processed tensor.

decoder layer

The structure of the Transformer decoder layer is shown in Figure 12 (the decoder part of the Transformer network). A decoder layer consists of two multi-head attention layers, a position-wise feed-forward layer, and three layer-normalization layers.

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

The DecoderLayer class is initialized with the input parameters and components: multi-head attention modules for masked self-attention and cross-attention, a position-wise feed-forward module, three layer-normalization modules, and a dropout layer.

The forward() method computes the decoder layer's output through the following steps:

  1. Compute the masked self-attention output and add it to the input tensor, followed by dropout and layer normalization.
  2. The cross-attention output between the decoder and encoder outputs is computed and added to the normalized masked self-attention output, followed by dropout and layer normalization.
  3. The position feed-forward output is computed and combined with the normalized cross-attention output, followed by dropout and layer normalization.
  4. Return the processed tensor.

These operations enable the decoder to generate target sequences from the input and the encoder output.

Transformer model

With the encoder and decoder layers in place, we can combine them to create the complete Transformer model.

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output

The Transformer class combines the previously defined modules into a complete Transformer model. During initialization it sets the input parameters and builds the various components: embedding layers for the source and target sequences, the positional-encoding module, stacked encoder and decoder layers, a final linear layer, and a dropout layer.

The generate_mask() method creates binary masks for the source and target sequences, which are used to ignore padding tokens and to prevent the decoder from attending to future tokens. The forward() method computes the Transformer output through the following steps:

  1. Generate the source and target masks with generate_mask().
  2. Compute the source and target embeddings, and apply positional encoding and dropout.
  3. Pass the source sequence through the encoder layers, updating the enc_output tensor.
  4. Pass the target sequence through the decoder layers, using enc_output and the masks, and updating the dec_output tensor.
  5. Apply the final linear projection layer to the decoder output to obtain the output.

The above steps enable the Transformer model to process an input sequence and generate an output sequence based on the combined features of its components.

Prepare sample data

src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# generate random sample data
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  

For demonstration purposes, I generate random sample data here. In real development you would use a larger dataset and preprocess the text first.

Train the model

Once the data is ready, the model can be trained.

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {
      
      epoch+1}, Loss: {
      
      loss.item()}")
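The loop above trains with teacher forcing (the shifted target sequence is fed to the decoder). For completeness, here is a minimal greedy-decoding sketch for inference; since the sample data is random and has no real vocabulary, the start-of-sequence id (1) and end-of-sequence id (2) below are arbitrary assumptions:

```python
transformer.eval()
with torch.no_grad():
    src = src_data[:1]                              # one source sequence
    generated = torch.ones(1, 1, dtype=torch.long)  # assumed start-of-sequence id = 1
    for _ in range(max_seq_length - 1):
        logits = transformer(src, generated)        # (1, current length, tgt_vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == 2:                  # assumed end-of-sequence id = 2
            break
print(generated)
```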

The above is how to build a simple Transformer from scratch with PyTorch.

Summary

All large language models are trained with Transformer encoder or decoder blocks, so a deep understanding of the Transformer network is essential. I hope this article has helped you.


Origin blog.csdn.net/jarodyv/article/details/130867562