Attention and Self-Attention [a 10,000-word deep dive into the attention mechanism]

In the previous article, From RNN to Attention, we introduced the Attention mechanism under the RNN Encoder-Decoder framework to address the vanishing-gradient and performance-bottleneck problems of RNN models, as shown in the following figure:

[Figure: RNN Encoder-Decoder framework with Attention]

The figure above shows the Encoder-Decoder framework with the Attention mechanism added. There is no longer a single semantic code C; instead there are multiple codes C1, C2, C3, and so on. When predicting Y1, the model's attention is on C1, so C1 is used as the semantic code; when predicting Y2, attention is focused on C2, so C2 is used instead; and so on, simulating the human attention mechanism.

Although we covered the principle of Attention in the context of RNNs, that introduction was tied to the Encoder-Decoder framework. The Attention mechanism, however, does not have to be based on Encoder-Decoder. So what is its essential idea?



1. Attention

The Attention mechanism was first applied in computer vision and later adopted in NLP, where it truly flourished: after BERT and GPT delivered surprisingly strong results in 2018, the Transformer and the Attention mechanism at its core became the focus of widespread interest.

If we use a diagram to show where Attention sits, it looks roughly like this:

[Figure: the position of Attention]

Attention, as its name suggests, borrows from the human attention mechanism. Its core logic is to go "from focusing on everything to focusing on the key points":

Concentrate limited attention on the key information, thereby saving resources and quickly obtaining the most useful information.


The Attention mechanism handles long-distance dependencies in sequences better and supports parallel computation.

Moreover, Attention does not have to be used within the Encoder-Decoder framework; it can be detached from it entirely.

  • The figure below illustrates the principle once Attention is detached from the Encoder-Decoder framework:

[Figure: Attention detached from the Encoder-Decoder framework]

  • 3-step decomposition of the Attention principle:

[Figure: the three-step decomposition of Attention]
Use the Query (the querying object) to filter out the important information from the Values (the queried objects); in short, compute the degree of correlation between the Query and each element of the Values.

Based on the figure above, Attention can be described as mapping a Query (Q) and a set of key-value pairs (the Values are split into Key-Value pairs) to an output, where the query, each key, and each value are all vectors. The output is a weighted sum of all the values in V (the queried object), where the weight of each value is computed from the Query and the corresponding Key. The computation takes three steps:

Step 1: compute the similarity between the Query and each Key to obtain the similarity scores s.

Step 2: apply softmax to the scores s, turning them into a probability distribution [a1, a2, ..., an] over [0, 1].

Step 3: use [a1, a2, ..., an] as weights to compute a weighted sum of the Values, which gives the final Attention value.

The general formula is as follows:

    Attention(Query, Source) = Σ_i softmax( Similarity(Query, Key_i) ) · Value_i

[Figure: the three computation stages: similarity, softmax, weighted sum]
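
A minimal NumPy sketch of these three steps, assuming dot-product similarity (the function name and toy shapes are purely illustrative):

```python
import numpy as np

def attention(query, keys, values):
    """Three-step attention: similarity -> softmax -> weighted sum.

    query:  (d,)     the query vector
    keys:   (n, d)   one key vector per element of the source
    values: (n, dv)  one value vector per element of the source
    """
    # Step 1: similarity between the Query and every Key (dot product here)
    scores = keys @ query                    # shape (n,)
    # Step 2: softmax turns the scores into weights a1..an in [0, 1]
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()        # shape (n,)
    # Step 3: weighted sum of the Values gives the final Attention value
    return weights @ values                  # shape (dv,)

# Toy example: a source with 3 elements and 4-dimensional vectors
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(attention(q, K, V))
```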

A short story to help you understand:

Story 1

  • A library (the source) holds many books (the values). For ease of searching, the books are given index numbers (the keys). When we want to learn about Marvel (the query), we might read books related to anime, movies, and even World War II (Captain America).
  • To work efficiently, we do not read every book carefully. For the Marvel query, the anime- and movie-related books are read carefully (higher weight), while the World War II book only needs a quick skim (lower weight).
  • After reading all of them, we end up with a comprehensive understanding of Marvel.

2. Self-Attention

2.1 The difference between Attention and Self-Attention

1. Attention:

The traditional Attention mechanism occurs between the elements of Target and all elements in Source .
In the Encoder-Decoder framework of general tasks, the content of the input Source and the output Target are different. For example, for English-Chinese machine translation, the Source is an English sentence, and the Target is the corresponding translated Chinese sentence.

2. Self-Attention

Self-Attention, as the name implies, is not the Attention between Target and Source but the Attention that occurs among the internal elements of Source (or among the internal elements of Target). The calculation procedure is exactly the same; only the objects being compared change, which is equivalent to Query = Key = Value in origin.
(For example, when computing the weights in the Transformer, converting the text vectors into the corresponding Q, K, and V only requires matrix operations on the Source; no information from the Target is used.)

The self-attention mechanism is a variant of the attention mechanism that reduces the dependence on external information and is better at capturing the internal correlations within the data or features.

Applied to text, the self-attention mechanism mainly addresses long-distance dependence by computing the mutual influence between words.
The following figure is an example of self-attention:
[Figure: attention weights for the word "its" in an example sentence]

Suppose we want to know what "its" refers to in this sentence and which words are related to it. We can take "its" as the Query and the whole sentence as the Keys and Values, compute the attention values, and find the words most related to "its". Through self-attention we find that the words most relevant to "its" in this sentence are "Law" and "application".

To summarize the differences:

  1. The key point of Self-attention is that Q, K, and V all come from the same X: we use X to find the key points within X itself. "Equal" here means same origin; all three are obtained by linear transformations of the word vectors. It is not literally Q = K = V = X; rather, Q, K, and V are obtained from X through the linear transformations W q , W k , and W v .
  2. Attention uses a query variable Q to find the important information in V, with K derived from V. Q·K^T gives A (the attention scores), and A (after softmax) times V gives Z (the attention value). Z is in effect another representation of V, a new "word vector" for V enriched with syntactic and semantic features.
  3. In other words, self-attention adds two constraints on top of attention (see the sketch after this list):
    (1) Q, K, and V come from the same source (Q = K = V in origin)
    (2) Q, K, and V still follow the standard attention computation
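
To make the "same source" constraint concrete, here is a minimal sketch contrasting self-attention with ordinary (cross) attention. The shapes, the random matrices, and the scaling by sqrt(d_k) are illustrative assumptions, not taken from the text above:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 8

# learned projection matrices (randomly initialized here just for illustration)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(5, d_model))   # source sequence, 5 tokens
T = rng.normal(size=(3, d_model))   # target sequence, 3 tokens

# Self-attention: Q, K, V all come from the same X
Q, K, V = X @ W_q, X @ W_k, X @ W_v
Z_self = softmax(Q @ K.T / np.sqrt(d_k)) @ V      # shape (5, d_k)

# Ordinary (cross) attention: Q comes from the Target, K and V from the Source
Q_t = T @ W_q
Z_cross = softmax(Q_t @ K.T / np.sqrt(d_k)) @ V   # shape (3, d_k)
```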

2.2 The purpose of introducing the self-attention mechanism

A neural network may receive many vectors of different sizes as input, and there are certain relationships among these vectors; however, ordinary training cannot fully exploit these relationships, which leads to poor results.
For example:

Machine translation (a sequence-to-sequence problem, where the model itself decides how many labels to output)
Part-of-speech tagging (one vector corresponds to one label)
Semantic analysis (multiple vectors correspond to one label), and other text-processing problems

2.3 Detailed explanation of Self-Attention

[Figure: self-attention takes a1~a4 as input and outputs b1~b4]

For each input vector a, the self-attention block (the blue part in the figure) outputs a vector b. Each b is obtained by taking into account the influence of all the input vectors on the corresponding a. With four input word vectors a1~a4, four vectors b1~b4 are output.

As shown below:

[Figure: computing b1 from a1 and all of a1~a4]

  • The picture above looks complicated, but it is really just computing the similarity between a1 and each of [a1, a2, a3, a4], and finally producing b1. The vectors a1~a4 may be the model input or the output of a hidden layer.
  • a1~a4 carry the information of the whole source; this step computes the relationships across the entire source.

[Figure: computing q1, the keys k1~k4, and the attention scores α1,1~α1,4]

  • Taking a1 as an example, multiply it by the two parameter matrices W q and W k to obtain q1 and k1 (q = query, k = key).
  • α1,1 represents the degree of similarity or relevance between a1 and a1; α1,2 represents the degree of similarity or relevance between a1 and a2; likewise for α1,3 and α1,4.
  • After obtaining the degree of relevance between a1 and each vector, apply softmax to get an attention distribution. This normalizes the relevance scores, and from the values we can see which vectors are most related to a1.

[Figure: the weighted sum of the value vectors v1~v4 gives b1]

v is obtained in the same way as q and k: v1 = W v · a1, and so on.

If the relevance between a1 and a2 is high, then α1,2 is large and the resulting output b1 will be relatively close to v2; in other words, the attention score determines each vector's weight in the result.
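
A small NumPy sketch of this per-vector computation of b1 (random matrices and dot-product similarity are illustrative assumptions):

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

a = rng.normal(size=(4, d))      # input vectors a1..a4 as rows

q1 = a[0] @ W_q                  # query for a1
k = a @ W_k                      # keys k1..k4
v = a @ W_v                      # values v1..v4

alpha = k @ q1                   # alpha_{1,i}: relevance between a1 and each a_i
alpha = np.exp(alpha - alpha.max())
alpha = alpha / alpha.sum()      # softmax -> attention distribution

b1 = alpha @ v                   # weighted sum of the values
print(b1)
```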

Analysis in matrix form

[Figure: stacking a1~a4 into the matrix I and computing Q, K, V]

Stack the 4 inputs a1~a4 as the 4 columns of a matrix I. Multiplying I by the corresponding weight matrices W gives the matrices Q, K, and V, which hold the queries, keys, and values respectively.
The three W matrices (W q , W k and W v ) are the parameters we need to learn.

[Figure: computing the attention matrix A from Q and K, and A' via softmax]

  • Use the obtained Q and K to compute the relevance between every pair of input vectors, i.e. the attention values α. There are many ways to compute α; the dot product is the most common.
  • Each entry of the matrix A records the attention score α between the corresponding pair of input vectors, and A' is the matrix obtained by normalizing A with softmax.

[Figure: computing the output from A' and V]

Using the obtained A' and V, compute the output vector b of the self-attention layer for each input vector a.

[Figure: the overall self-attention computation, from input I to output O]

To summarize the self-attention operation as a whole: the input is the matrix I and the output is the matrix O, and the only parameters to learn are W q , W k , and W v .
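
A compact matrix-form sketch of the I → O pipeline. Here tokens are stored as rows rather than columns (a layout choice), and the scores are scaled by sqrt(d_k), which the text above does not mention but is standard practice:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, W_q, W_k, W_v):
    """Matrix-form self-attention: input I (n_tokens x d), output O (n_tokens x d_v)."""
    Q = I @ W_q                            # queries
    K = I @ W_k                            # keys
    V = I @ W_v                            # values
    A = Q @ K.T / np.sqrt(K.shape[-1])     # attention scores (scaled dot product)
    A_prime = softmax(A, axis=-1)          # row-wise softmax -> attention weights
    return A_prime @ V                     # output O

rng = np.random.default_rng(3)
d = 4
I = rng.normal(size=(4, d))                # four input vectors a1..a4 as rows
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
O = self_attention(I, W_q, W_k, W_v)
print(O.shape)                             # (4, 4): one output vector b per input a
```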


3. Multi-head Self-attention

Multi-head Self-attention is the advanced version of Self-attention: the multi-head self-attention mechanism.

Relevance comes in many different forms and has many different definitions, so sometimes one q is not enough; we need multiple q vectors, with different q's responsible for different kinds of relevance. In language, for example, grammatical and semantic features are very complex, and a single set of Q, K, V cannot handle such complex tasks on its own.

[Figure: two-head self-attention]

In the figure above there are two heads, representing two different kinds of relevance for this problem.

  • Similarly, there need to be multiple k and v vectors. They are computed the same way as q: first compute k_i and v_i, then multiply each by two different weight matrices to get one k and one v per head.

  • So how do we perform self-attention once q, k, and v are computed?

  • The process is the same as before, except that the first head's vectors are processed together and the second head's are processed together: two independent computations that produce two outputs, b_i1 and b_i2.

  • This is just a two-head example; with more heads the process is the same, and each head's b is computed separately.

  • Finally, concatenate b_i1 and b_i2 and multiply by a weight matrix W to obtain b^i, which is the self-attention output for the input vector a_i, as shown in the figure below (a code sketch follows):

[Figure: concatenating the two heads' outputs and multiplying by W to get b^i]
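
A rough sketch of multi-head self-attention under these assumptions: each head has its own projection matrices, and the output projection matrix is named W_o here purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(I, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o: output projection."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = I @ W_q, I @ W_k, I @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(A @ V)                    # one b matrix per head
    concat = np.concatenate(outputs, axis=-1)    # concatenate the heads' outputs
    return concat @ W_o                          # project back to the model dimension

rng = np.random.default_rng(4)
d, d_head, n_heads = 8, 4, 2
I = rng.normal(size=(5, d))                      # 5 tokens as rows
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d))
print(multi_head_self_attention(I, heads, W_o).shape)   # (5, 8)
```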


4. Positional Encoding

When training self-attention, positional information is actually missing: for the model there is no difference between "before" and "after". The a1, a2, a3 mentioned above do not encode the order of the inputs, only how many input vectors there are. This is unlike an RNN, which has an obvious sequential order; self-attention takes in all positions and produces all outputs at the same time.

How to reflect location information in Self-Attention?

To compensate for the order information that Attention loses, the authors of the Transformer proposed Positional Embedding: before the Attention computation on the input X, position information is added to X's word vectors, that is, the word vector of X becomes:
X_final_embedding = Embedding + Positional Embedding
[Figure: adding the positional embedding to the word embedding]
Another representation:
[Figure: an alternative illustration of the same addition]
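
As a concrete and commonly used choice, the Transformer paper's sinusoidal positional encoding can be sketched as below; whether the figures above showed exactly this variant is an assumption:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# X_final_embedding = Embedding + Positional Embedding
rng = np.random.default_rng(5)
embeddings = rng.normal(size=(10, 16))             # 10 tokens, d_model = 16
x_final = embeddings + sinusoidal_positional_encoding(10, 16)
print(x_final.shape)                               # (10, 16)
```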


5. Expansion of Self-Attention

5.1 Self-attention vs RNN

[Figure: Self-attention vs RNN]
After introducing Self Attention, it is easier to capture long-distance interdependent features in sentences.

An RNN or LSTM must process the sequence step by step. For two features that depend on each other across a long distance, connecting them requires accumulating information over many time steps, and the longer the distance, the less likely the dependency is to be captured effectively.

  • During its computation, Self-Attention directly connects any two words in the sentence in a single step, so the effective distance between long-range dependent features is drastically shortened, which makes these features much easier to exploit.

  • In addition, Self-Attention directly increases the parallelism of the computation. It makes up for exactly these two shortcomings of RNNs, which is the main reason Self-Attention is becoming so widely used.

5.2 Self-attention vs CNN

[Figure: Self-attention vs CNN]

Self-Attention can actually be regarded as a CNN based on global information.

  • A traditional CNN's convolution kernel is fixed in advance and can only extract the information that falls inside the kernel when extracting image features, whereas Self-Attention attends to the internal feature information of the whole source and can "learn" the most suitable "convolution kernel" from a global perspective, maximizing the image feature information it extracts.
  • With a small amount of data, Self-Attention trains poorly and does worse than a CNN;
  • With a large amount of data, Self-Attention trains better than a CNN.

[Figure: performance of Self-Attention vs CNN as the amount of training data grows]

5.3 Advantages of Self-attention

  1. Fewer parameters: compared with CNN and RNN, it has fewer parameters and lower complexity, so the demands on computing power are smaller.
  2. Faster: Attention solves the problem that RNNs and their variants cannot be computed in parallel. Each step of the Attention computation does not depend on the result of the previous step, so it can be parallelized just like a CNN.
  3. Better results: before the Attention mechanism was introduced, one problem had long troubled everyone: long-distance information gets weakened, just as someone with a poor memory cannot recall events from long ago.

6. Masked Self-attention

The Transformer uses Masked Self-Attention, which is best explained together with the Transformer's dynamic decoding process, so it will be covered in detail in the next article, on the Transformer.

[Figure: the masked attention matrix; the gray area above the diagonal is masked out]

The mask covers the gray area above the diagonal so that the model cannot see future information (in practice the masked scores are pushed to a very negative value so that their softmax weights become 0), as shown in the figure above; after softmax, each row still sums to 1.

In a nutshell: the Decoder applies a mask so that the behavior of the training phase and the testing phase stays consistent, with no gap between them, and the model never sees future information, which avoids overfitting to it.
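
A minimal sketch of this causal masking (adding a large negative value to the positions above the diagonal before softmax is an implementation assumption; the effect is that those weights become 0):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # mask out the "future": positions above the diagonal get a very negative score
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    weights = softmax(scores, axis=-1)        # each row still sums to 1
    return weights @ V

rng = np.random.default_rng(6)
Q = K = V = rng.normal(size=(4, 4))
print(np.round(masked_self_attention(Q, K, V), 3))
```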

Origin blog.csdn.net/weixin_68191319/article/details/129218551