This article aims to answer the following questions:
1. What is Self-Attention?
2. What are Q, K and V?
Okay, without further ado, let’s get straight to the point.
Q, K, and V stand for query, key, and value respectively. These names naturally bring to mind Python's dictionary data structure. Suppose we have a dictionary that stores the height (key) to weight (value) mapping for a class:
{'160':50,'166':57,'173':65,'177':67,...}
Here we take the first three pairs and display them in table form:

| key (height, cm) | 160 | 166 | 173 |
|---|---|---|---|
| value (weight, kg) | 50 | 57 | 65 |
Suppose you receive a query with a height of 162, which is not in the dictionary, and you want to predict the weight for this height from the existing data. The intuitive guess is that the weight lies somewhere between 50 and 57. But if we want a quantitative prediction, we can compute a weighted average.
The distance between 162 and 160 is 2, the distance between 162 and 166 is 4, and the distance between 160 and 166 is 6, so 162→160 gets a weight of 4/6 and 162→166 gets a weight of 2/6.
Mapping these weights onto the values, the estimated weight for 162 is about 50 × (4/6) + 57 × (2/6) ≈ 52.33, which is close to 50 and matches our expectation.
Because 162 lies inside [160, 166], it is natural to give these two keys more weight, i.e. pay more attention to them, and the closer the key, the larger the weight: here they receive attention weights of 2/3 and 1/3 respectively.
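The inverse-distance weighting above can be sketched in a few lines of Python (a minimal sketch mirroring the two-key case in the text; NumPy is used only for convenience):

```python
import numpy as np

# The two known (height, weight) pairs that bracket the query
keys = np.array([160.0, 166.0])
values = np.array([50.0, 57.0])
query = 162.0

# Inverse-distance weighting for two points: the closer key gets the
# larger weight. Distances are |162-160| = 2 and |162-166| = 4, with a
# total of 6, so the weights are 4/6 and 2/6 respectively.
dists = np.abs(query - keys)                   # [2., 4.]
weights = (dists.sum() - dists) / dists.sum()  # [4/6, 2/6]

estimate = (weights * values).sum()
print(round(estimate, 2))                      # 52.33
```

Note that this `(total - d) / total` trick only produces a proper weighting for exactly two points; the general mechanism introduced next handles any number of key-value pairs.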
But the other key-value pairs in the dictionary may also carry information about the query, and so far we have ignored them. How can we use all the data in the dictionary to make the estimate more accurate?
Assume a function $\alpha(q, k_i)$ represents the attention weight between the query $q$ and the key $k_i$. Then the predicted weight $f(q)$ can be derived as:

$$f(q) = \sum_i \alpha(q, k_i)\, v_i$$
where $\alpha$ is built from a scoring function $s(q, k_i)$ that measures the correlation between $q$ and $k_i$. Taking the Gaussian kernel as an example, $s(q, k_i) = -\frac{1}{2}(q - k_i)^2$, and

$$\alpha(q, k_i) = \mathrm{softmax}\big(s(q, k_i)\big) = \frac{\exp\!\left(-\frac{1}{2}(q - k_i)^2\right)}{\sum_j \exp\!\left(-\frac{1}{2}(q - k_j)^2\right)}$$
where $s(q, k_i)$ is the attention score, and $\alpha(q, k_i)$ is the attention weight.
Plugging the dictionary's key-value pairs into the formula for the query 162 (these numbers are scalars rather than tensors, so × denotes plain multiplication):

$$f(162) \approx 0.9975 \times 50 + 0.0025 \times 57 + 0 \times 65 + 0 \times 67 \approx 50.02$$
In this way, every element in the dictionary contributes to estimating weight from height. This is the attention mechanism.
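As a sketch, the Gaussian-kernel attention pooling above can be computed over the whole dictionary with NumPy (only the four visible pairs are used; the "..." entries are omitted):

```python
import numpy as np

def softmax(x):
    x = x - x.max()            # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# All key-value pairs from the dictionary
keys = np.array([160.0, 166.0, 173.0, 177.0])
values = np.array([50.0, 57.0, 65.0, 67.0])
query = 162.0

# Gaussian-kernel attention scores: s(q, k_i) = -0.5 * (q - k_i)^2
scores = -0.5 * (query - keys) ** 2
weights = softmax(scores)      # attention weights, sum to 1

estimate = (weights * values).sum()
print(round(estimate, 2))      # 50.02
```

The distant keys (173 and 177) receive weights that are effectively zero, so the estimate is dominated by the nearest neighbor, just as the worked calculation shows.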
The above shows the case where the input data is one-dimensional. The situation is similar when the query, key, and value are multi-dimensional.
Before introducing the multi-dimensional case, let's look at common ways to compute the attention score $s(q, k)$:

- Additive model: $s(q, k) = v^\top \tanh(W q + U k)$, where $W$, $U$, and $v$ are learnable parameters.
- Dot-product model: $s(q, k) = q^\top k$.
- Scaled dot-product model: $s(q, k) = \dfrac{q^\top k}{\sqrt{d}}$, where $d$ is the feature dimension.
Most current models use the scaled dot product. The Transformer adopted scaled dot-product attention instead of the additive model because it performs better in practice: it is more computationally efficient (it reduces to matrix multiplication), it scales well to long sequences, and it has more stable gradient behavior during training. Below we take the dot-product model as the example.
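To make the two scoring functions concrete, here is a minimal sketch with random vectors standing in for real data; in a real model, $W$, $U$, and $v$ in the additive score would be learned parameters, not random ones:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
q = rng.standard_normal(d)   # a query vector
k = rng.standard_normal(d)   # a key vector

# Additive (Bahdanau-style) score: v^T tanh(W q + U k)
W = rng.standard_normal((d, d))
U = rng.standard_normal((d, d))
v = rng.standard_normal(d)
additive_score = v @ np.tanh(W @ q + U @ k)

# Scaled dot-product score: q . k / sqrt(d)
sdp_score = (q @ k) / np.sqrt(d)

# Both reduce a (query, key) pair to a single scalar score
print(float(additive_score), float(sdp_score))
```

The key practical difference: the dot-product score needs no extra parameters and batches into one matrix multiplication, while the additive score requires its own small feed-forward computation per query-key pair.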
For $q_1$ and $k_1$, the dot-product score is

$$s_{11} = q_1 k_1^\top$$

The remaining pairs $q_i$ and $k_j$ are handled in the same way, giving $s_{ij} = q_i k_j^\top$. Collecting all the scores, you get:

$$S = \begin{bmatrix} q_1 k_1^\top & q_1 k_2^\top & \cdots \\ q_2 k_1^\top & q_2 k_2^\top & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

Converting this to matrix form:

$$S = Q K^\top$$

The scores are also divided by $\sqrt{d_k}$, the square root of the feature dimension, to scale them down and keep gradients stable; otherwise large dot products would push the softmax into regions with extremely small gradients. That is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
This is the common scaled dot-product model. After the softmax, large scores are amplified and small ones suppressed, which lets the model focus on the positions with larger weights.
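The scaled dot-product formula translates almost line-for-line into NumPy; here is a minimal sketch (the shapes are illustrative, not from the text):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))   # 3 queries, d_k = 8
K = rng.standard_normal((5, 8))   # 5 keys
V = rng.standard_normal((5, 8))   # 5 values
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                  # (3, 8)
```

Each output row is a weighted average of the value rows, exactly as in the one-dimensional height/weight example, just with vectors instead of scalars.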
If Q, K, and V are all the same matrix, then it is a self-attention model.
Using $X$ to represent that matrix, we get

$$\mathrm{Attention}(X, X, X) = \mathrm{softmax}\!\left(\frac{X X^\top}{\sqrt{d}}\right) X$$
Using $X$ directly like this is not very useful. In an actual Transformer, $Q$, $K$, and $V$ are obtained from $X$ through linear layers (with learnable parameters) that map it into different linear subspaces, and they are split into multiple heads; each head can learn different patterns, which increases the diversity of features and gives the model more expressive power.
Then we get

$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

That is the formula of the self-attention model.
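Putting it all together, here is a minimal multi-head self-attention sketch in NumPy. The projections $W^Q$, $W^K$, $W^V$ and the head split follow the description above; note that a real Transformer also applies an output projection after concatenating the heads, which is omitted here:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # learned linear projections
    d_h = d // n_heads                         # per-head dimension
    # Split Q, K, V into heads: (n_heads, n, d_h)
    Qh = Q.reshape(n, n_heads, d_h).transpose(1, 0, 2)
    Kh = K.reshape(n, n_heads, d_h).transpose(1, 0, 2)
    Vh = V.reshape(n, n_heads, d_h).transpose(1, 0, 2)
    # Scaled dot-product attention inside each head: (n_heads, n, n)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)
    out = softmax(scores) @ Vh                 # (n_heads, n, d_h)
    # Concatenate the heads back together: (n, d)
    return out.transpose(1, 0, 2).reshape(n, d)

rng = np.random.default_rng(0)
n, d = 4, 8                                    # 4 tokens, model dim 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = multi_head_self_attention(X, Wq, Wk, Wv, n_heads=2)
print(out.shape)                               # (4, 8)
```

Because each head attends over its own $d_h$-dimensional slice of the projected features, the two heads here can assign different attention weights to the same tokens, which is exactly the diversity the multi-head design is after.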