Write Your Own ChatGPT: The Principle and Implementation of the Attention Mechanism

The success of large models such as ChatGPT rests on an algorithmic breakthrough: the attention mechanism. Attention lets a neural network extract and recognize the patterns hidden in language far more effectively, and it also supports parallel computation across many channels. Compared with earlier natural language processing algorithms, training can therefore be accelerated by hundreds of times through concurrency, which makes it feasible to combine massive training data with ultra-large-scale computing power and drive network training with massive parallelism. Before this mechanism was invented, the older algorithms could hardly be parallelized at all, so their training efficiency was extremely low. This is why networks built on the attention mechanism far surpass the original models.

So what is the attention mechanism? If you search online you will find plenty of related material, but in my view very little of it explains the idea clearly, and understanding the underlying principles properly requires a solid mathematical foundation. Here I will try to explain it in plainer language. Without rigorous mathematical derivation, a description in everyday words is bound to be imprecise, but the main goal is to give everyone a basic, intuitive understanding. Later we will reinforce that understanding rationally through hands-on code.

In essence, the attention mechanism is a kind of "mud mixing", as the Chinese idiom goes: splitting the difference among several candidates instead of committing to one. Computers are built on binary 0s and 1s, where a result is either 0 or 1 with no gray area, but such black-and-white conclusions do not fit deep learning. Anyone with deep-learning experience knows that a network's answers are usually given as probabilities, and in practice we pick the conclusion with the highest probability as the final result. The essence of the attention mechanism is to assign a ratio to each of several possible answers, multiply every answer by its ratio, and add the products up to form the final answer.

Let's take a concrete example. Suppose we want to predict the growth status of a couple's child at age 10. First we need a mathematical way to describe this "growth status". In deep learning we usually describe an object with a vector, so we use a vector for the child's growth status, for example V(child) = {height, weight, face shape, heart rate, mental state, health status, ...}; in other words, the child's situation is described by a vector made up of a series of indicators. The question now is how to obtain the values of each component of V(10-year-old child). A straightforward method is to take the corresponding vectors of the child's father, mother, paternal and maternal grandparents, uncles, aunts and other relatives at age 10, sum these vectors according to "certain ratios", and use the resulting vector as the child's growth-status vector at age 10.

The problem now is how to determine those "certain ratios": what ratio should the father's vector take, what ratio the mother's, and what ratios the grandparents and the other relatives? Obviously, as the direct parents, the father and mother should carry larger ratios, and the more distant the blood relationship, the smaller the ratio should be. Another issue is whether the child we want to predict is a boy or a girl; if the gender differs, the ratio assigned to each relative's vector has to change accordingly. We use the variable query to represent the child whose information is to be predicted: query(boy) stands for a boy's growth status at age 10, and query(girl) stands for a girl's.

Suppose we now have a function f that can compute the ratio for each relative's vector. For example, f(query(boy), V(father)) = a1 means that when we predict the boy's growth status at age ten, the ratio given to the father's growth-status vector at age 10 is a1. In the same way we compute f(query(boy), V(mother)) = a2, ..., f(query(boy), V(aunt)) = a11, where a1 + a2 + ... + a11 = 1. The boy's growth-status vector at age 10 can then be computed as a1*V(father) + a2*V(mother) + ... + a11*V(aunt).
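As a minimal sketch of this weighted blend (every vector and every ratio below is a made-up illustrative number, not real data):

import numpy as np

# Hypothetical "growth status" vectors for three relatives at age 10
# (say height, weight, heart rate); the actual values do not matter here.
v_father = np.array([175.0, 68.0, 72.0])
v_mother = np.array([162.0, 55.0, 75.0])
v_aunt   = np.array([160.0, 52.0, 78.0])

# Hypothetical blending ratios a1, a2, a3; they must sum to 1.
ratios = np.array([0.5, 0.4, 0.1])

# The predicted vector is the ratio-weighted sum of the relatives' vectors.
prediction = ratios[0] * v_father + ratios[1] * v_mother + ratios[2] * v_aunt
print(prediction)  # [168.3, 61.2, 73.8]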

In the algorithm above, the vectors such as V(father) and V(mother) that are fed into f to compute the ratios are called keys, while the same vectors being blended together are called values. Here the set of objects used as keys and the set used as values are identical, and this situation is called self-attention. In many application scenarios the objects used as keys and as values can be different. Natural language models such as ChatGPT use the self-attention mechanism, that is, the keys and the values come from the same set.

In fact, the algorithm above still has a big gap: many influencing factors have not been taken into account, for example the boy's country, ethnicity, era, socioeconomic development, national culture, the nationality of the parents and relatives, the family's financial situation, family education, and so on. A child who grows up in Switzerland will surely turn out differently from a child who grows up in Afghanistan, and these differences come from all kinds of external factors. The question is how to account for their impact. The first problem is that we cannot know, let alone exhaustively list, every external factor that affects a child's growth. In deep learning, a matrix is therefore used to stand in for these unknown factors: the larger the matrix, the more potential influences it can capture, and the more accurate the prediction can be. We use Wq to represent the impact of the unknown factors on the child. These factors also affect the ratio contributed by each relative, so we use Wk to represent the influence applied to each relative when computing the ratios. The expression f(query(boy), V(father)) above thus becomes f(Wq*query(boy), Wk*V(father)), and the ratio calculations for the other relatives are multiplied by the same matrix in the same way. Finally, these factors also affect the blending step itself; we use Wv to represent their impact on the vectors being blended, so the result becomes a1*(Wv*V(father)) + a2*(Wv*V(mother)) + ... + a11*(Wv*V(aunt)).
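To make the roles of Wq, Wk and Wv a little more concrete, here is a small sketch in NumPy. It assumes one common choice for f, namely a dot product between the transformed query and each transformed key followed by softmax; every matrix and vector in it is a random placeholder rather than anything learned:

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)

d = 3                            # length of each "growth status" vector (illustrative)
query = rng.random(d)            # the child we want to predict (the query)
relatives = rng.random((4, d))   # hypothetical vectors for 4 relatives (keys == values here)

# Wq, Wk, Wv stand in for the unknown external factors; in a real network
# they would be learned from data, here they are just random placeholders.
Wq = rng.random((d, d))
Wk = rng.random((d, d))
Wv = rng.random((d, d))

# One common choice for f: dot product of the transformed query with each
# transformed key, normalized by softmax so the ratios sum to 1.
scores = (Wq @ query) @ (relatives @ Wk.T).T   # one score per relative
ratios = softmax(scores)                       # a1 ... a4, summing to 1

# Blend the transformed value vectors with those ratios.
prediction = ratios @ (relatives @ Wv.T)
print(ratios, prediction)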

The process described above is a plain-language description of the attention mechanism. The algorithm's task is to learn Wq, Wk and Wv from a large amount of data; once these parameters are determined, the corresponding results can be obtained by computation. In natural language processing, the meaning of a word in a sentence actually depends on the other words in the same sentence. Take the word apple: should it correspond to the fruit or to the technology giant? Obviously we have to judge from the context. In the sentence "please buy a bag of apple and orange", apple is the fruit. Why can we be so sure? Because the presence of bag and orange pins down its meaning. Here apple plays the role of the query, and all of the words serve as keys and values. Among them, bag and orange matter the most to apple, while the other words have little to do with it. So for the computer to understand apple here, it computes a1*v(please) + a2*v(buy) + ... + a4*v(bag) + ... + a6*v(apple) + ... + a8*v(orange). Since words like please and buy contribute little to the meaning of apple, their coefficients need to be small, whereas bag and orange have a strong influence, so their coefficients will be relatively large. The word apple itself does not help the computer much in disambiguating it, so its own coefficient will be small. Likewise, in the sentence "apple released new phone", the word phone strongly influences the understanding of apple: through it we can determine that apple refers to the technology company rather than the fruit, so during blending the coefficient for phone will be very large.

Based on the discussion above, we now simulate the whole self-attention process in code. First we represent the words of a sentence as vectors. In the previous chapter we saw that the vector corresponding to a word can be very long; here we only need to understand the principle, so we use vectors of length 4 to represent words:

import numpy as np
print("步骤 1,随机生成 3 个长度为 4 的向量来表示含有三个单词的句子")
#向量的数值不重要
x = np.array(
    [
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 2.0, 0.0, 2.0],
        [1.0, 1.0, 1.0, 1.0]
    ]
)
print(x)

The result after running the above code is:

Step 1: randomly generate 3 vectors of length 4 to represent a sentence with three words
[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]

We will also illustrate the process with diagrams; following the changes in the diagrams makes it easier to grasp the algorithm. The first step is to initialize the three word vectors:
[Figure: initialization of the three word vectors]

The second step initializes the Wq, Wk and Wv described earlier. Again, the values of their entries do not matter at all; the neural network will determine their actual values through training:

print("步骤 2,确认 Wq, Wv, Wk,由于要跟上面向量做乘法,因此他们的行数是 4,列数可以任意取值,注意 w_query 和 w_key 列数取值要相同")
w_query = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1]
])
print(f"Wq is: {
      
      w_query}")

w_key = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0]
])
print(f"Wk is :{
      
      w_key}")

w_value = np.array([
    [0, 2, 0],
    [0, 3, 0],
    [1, 0, 3],
    [1, 1, 0]
])

print(f"Wv is :{
      
      w_value}")

The three transformation matrices just introduced are shown in the figure below:
[Figure: the three transformation matrices Wq, Wk, Wv]

In natural language processing, every word in a sentence takes on the role of query, and every word is also used as a key and a value. Therefore, to prepare for computing the blending ratios in the next step, we multiply every word vector by w_query:

print("每个向量都会作为 query 使用,因此他们都要乘以 w_query 为下一步计算分配比率做准备")
Q = np.matmul(x, w_query)
print(f"query matrix is: {
      
      Q}")

Since every word also serves as a key and a value, each word vector must likewise be multiplied by w_key and w_value respectively. The code is as follows:

print("每个向量都会作为 key 使用,因此他们也需要乘以 w_key")
K = np.matmul(x, w_key)
print(f"key matrix is : {
      
      K}")

print("每个向量都会作为 value 使用,因此也需要乘以 w_value")
V = np.matmul(x, w_value)
print(f"value matrix is {
      
      V}")

The above calculation process is shown in the figure below:
[Figure: word vectors multiplied by Wq, Wk, Wv to produce Q, K, V]

As the figure shows, word vector 1 is multiplied by the matrices Wq, Wk and Wv to obtain Q1, K1 and V1. Similarly, word vector 2 multiplied by Wq, Wk, Wv gives Q2, K2, V2, and word vector 3 gives Q3, K3, V3. To keep the lines from getting too messy, word vectors 2 and 3 are not connected to the three multiplication symbols in the figure, but we should understand that Q2, K2, V2 and Q3, K3, V3 come from applying the same operations to word vectors 2 and 3.

Next we calculate the "distribution ratios" using the following formula:

softmax(Q * K^T / sqrt(d_k))

Here d_k is the dimension of the key vectors, which in our example is 3; its square root is about 1.73, which we simply round to 1, so the distribution ratios are computed as follows:

from scipy.special import softmax
print("Computing the distribution ratios")
'''
d_k is the dimension of the key vectors. Here it is 3, and sqrt(3) is roughly 1.73,
which we simply approximate as 1.
'''
k_d = 1
attention_scores = Q @ K.transpose() / k_d
print(attention_scores)

Pay attention to Q @ K.transpose() here: it multiplies each of Q1, Q2, Q3 with each of K1, K2, K3 to obtain the blending ratios for every vector. For example, the raw ratios for the first word vector are (Q1 * K1^T, Q1 * K2^T, Q1 * K3^T), where K1^T is the transpose of K1, that is, the row vector turned into a column vector so that the two vectors can be multiplied.
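As a quick sanity check (reusing the Q, K and attention_scores computed above), we can reproduce the first row by hand, and also show what the exact sqrt(d_k) scaling from the formula would look like instead of the rounded k_d = 1 used above:

# The first row of Q @ K.T is the query of word 1 dotted with every key in turn.
row0_manual = np.array([Q[0] @ K[0], Q[0] @ K[1], Q[0] @ K[2]])
print(row0_manual)   # matches the first row of attention_scores printed above

# With the exact sqrt(d_k) scaling (d_k = 3, the key dimension) rather than k_d = 1:
scaled_scores = Q @ K.transpose() / np.sqrt(3)
print(scaled_scores)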

Let's look at the blending ratios obtained for each vector after this operation, as shown in the figure below:
[Figure: the raw blending ratios for each word vector]
Next, we normalize each word's distribution ratios with softmax, that is, we make each row of ratios sum to 1. The relevant code is as follows:

print("通过 softmax 将分配比率正规化,也就是使得各比率之和为 1")
attention_scores[0] = softmax(attention_scores[0])
attention_scores[1] = softmax(attention_scores[1])
attention_scores[2] = softmax(attention_scores[2])
#第一个单词对应的 value 分配比率
print(attention_scores[0])
#第二个单词对应的 value 分配比率
print(attention_scores[1])
#第三个单词对应的 value 分配比率
print(attention_scores[2])

The above operation corresponds to the following figure:
[Figure: softmax normalization of the distribution ratios]
Finally, we multiply the distribution ratios by V[0], V[1] and V[2] and add the products up to obtain the result vector for each word. Below we carry out this operation for word 1; the operation for the other words is exactly the same:

print("计算和稀泥结果")
print(V[0])
print(V[1])
print(V[2])

'''
计算第二,第三个单词分配比率时,只要把 attention_scores[0][i](i=1,2,3)换成
attention_scores[1][i], attention_scores[2][i]即可
'''

attention1 = attention_scores[0].reshape(-1, 1)
print(f"第一个单词的和稀泥分配比为:{
      
      attention1}")

print("第一个向量和稀泥给第一个单词的数量为:")
attention1 = attention_scores[0][0] * V[0]
print(attention1)

print("第二个向量和稀泥给第一个单词的数量为:")
attention2 = attention_scores[0][1] * V[1]
print(attention2)
print("第三个向量和稀泥给第一个单词的数量为")
attention3 = attention_scores[0][2] * V[2]
print(attention3)

print("将上面 3 个 attention 加总就是针对第一个单词和稀泥的结果")
attention_input1 = attention1 + attention2 + attention3
print(f"第一个单词的和稀泥结果:{
      
      attention_input1}")

The above operation corresponds to the following figure:
[Figure: weighting the value vectors and summing them for word 1]
In the process above, the matrices Wq, Wk and Wv are the parameters the network learns during training. This is the basic flow of the self-attention mechanism. In the model used by ChatGPT, the attention mechanism differs slightly from the above: it is called multi-head attention, which means the process above is split into 8 parallel branches that proceed simultaneously.

In our simulation the word vectors have length 4, while in ChatGPT's application the length is at least 512, and after ChatGPT 3.5 it is surely even longer; the algorithmic flow, however, is similar. Suppose the word vector length is 512. "Multi-head" means splitting the 512-dimensional word vector into 8 sub-vectors of length 64, running the procedure described above on each sub-vector, and executing these 8 computations concurrently, as sketched below.
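Here is a minimal sketch of that splitting step (assuming a 3-word sentence with 512-dimensional word vectors; note that real multi-head implementations also apply separate learned projection matrices per head, which is omitted here):

import numpy as np

print("Split 512-dimensional word vectors into 8 heads of 64 dimensions each")
seq = np.random.random((3, 512))    # 3 words, each represented by a 512-dim vector

# np.split along the last axis yields 8 sub-matrices of shape 3 x 64;
# each one would then go through the attention steps shown above, in parallel.
heads = np.split(seq, 8, axis=1)
print(len(heads), heads[0].shape)   # 8 heads, each 3 x 64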

Let's also simulate the multi-head process. Assume the 8 sub-vectors of length 64 have each completed the operations described above; the results are simulated as follows:

import numpy as np

print("Simulate the results of the attention operation on 8 sub-vectors of length 64; here the columns of Q, K, V are all 64, so each result is a vector of length 64")
head1 = np.random.random((3, 64))
head2 = np.random.random((3, 64))
head3 = np.random.random((3, 64))
head4 = np.random.random((3, 64))
head5 = np.random.random((3, 64))
head6 = np.random.random((3, 64))
head7 = np.random.random((3, 64))
head8 = np.random.random((3, 64))

print("Concatenate the eight 3x64 matrices horizontally into a 3x512 matrix")
output_attention = np.hstack((head1, head2, head3, head4, head5, head6, head7, head8))
print(output_attention)

The code above simulates the results of the 8 sub-vectors after going through the procedure described earlier, and then horizontally concatenates the eight 3x64 results into a 3x512 matrix. According to the transformer architecture we described earlier, the next step is to normalize this result:
[Figure: the transformer block with Multi-Head Attention, Add & Norm, and Feed Forward layers]

The Multi-Head Attention box in the figure corresponds to what we have just computed. Now let's see what Add & Norm does. It applies a function called LayerNormalization, which is computed as follows:
LayerNormalization(v) = r * (v - u) / a + b
Here v is the input vector to the function. If v has dimension d, that is, v = (v1, v2, ..., vd), then u = (v1 + v2 + ... + vd) / d is the mean of its components and a = sqrt(((v1 - u)^2 + ... + (vd - u)^2) / d) is their standard deviation. r and b are parameters learned by the network during training, where b is a vector with the same dimension as v.

Note that the Add & Norm layer receives two inputs: one is the input to the multi-head attention layer, which is the sum of the word vectors and the position vectors, and the other is the output of multi-head attention. The reason is that after the word vectors pass through the attention layer, some of their information may be lost, so adding the original input to the attention output ensures that the information contained in the word vectors is not lost. This addition is the "Add" in "Add & Norm".
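A minimal sketch of this Add & Norm step, following the LayerNormalization formula above; the vectors, as well as the parameters r and b, are placeholders here (in a trained network r and b would be learned):

import numpy as np

def layer_normalization(v, r, b, eps=1e-6):
    # u is the mean of v's components, a their standard deviation,
    # following LayerNormalization(v) = r * (v - u) / a + b;
    # a small eps avoids division by zero.
    u = v.mean()
    a = v.std()
    return r * (v - u) / (a + eps) + b

# Hypothetical 512-dim vectors: x is the input to multi-head attention
# (word vector plus position vector), attn_out is what the attention layer produced.
x = np.random.random(512)
attn_out = np.random.random(512)

# Placeholder parameters; in a trained network r and b are learned.
r = np.ones(512)
b = np.zeros(512)

# "Add": the residual connection keeps the original information;
# "Norm": layer normalization of the summed vector.
out = layer_normalization(x + attn_out, r, b)
print(out.shape, out.mean(), out.std())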

Above that is the feed-forward (FFN) layer, which is simply a two-layer fully connected network: the first layer has 2048 nodes, the second has 512, and the activation function is ReLU. It takes vectors of length 512 as input and outputs vectors of length 512. Its computation is:
FFN(x) = max(0, x * W1 + b1) * W2 + b2
where W1 holds the connection weights between the input and the first layer, and W2 the connection weights between the first and second layers.
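A minimal sketch of this feed-forward layer with the shapes given above (512 -> 2048 -> 512); the weight matrices below are random placeholders rather than trained parameters:

import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2 -- ReLU between two linear layers.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Placeholder weights; in a real transformer these are learned.
W1 = np.random.random((512, 2048))
b1 = np.zeros(2048)
W2 = np.random.random((2048, 512))
b2 = np.zeros(512)

x = np.random.random((3, 512))        # three word vectors after Add & Norm
print(ffn(x, W1, b1, W2, b2).shape)   # (3, 512)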

This has been our brief introduction to the transformer model; many important details are hard to convey in words alone. Later we will use the transformer architecture to build two language translation models, and through that hands-on practice we may gain a deeper understanding of the theory. For more information, please search for "coding disney" on Bilibili.

Origin blog.csdn.net/tyler_download/article/details/134505471