Understanding the essence of the Transformer attention mechanism (Self-Attention) in one article

This article aims to answer the following questions:

1. What is Self-Attention?

2. What are Q, K and V?

Okay, without further ado, let’s get straight to the point.

Q, K, and V stand for query, key, and value respectively, which naturally brings to mind Python's dictionary data structure. Suppose we have a dictionary that stores the height (key) to weight (value) mapping for a class:

{'160':50,'166':57,'173':65,'177':67,...}

Here we take the first three pairs and display them as a table:

key (height, cm):    160    166    173
value (weight, kg):   50     57     65

Suppose you receive a query with a height of 162, which is not in the dictionary, and you want to predict the weight for this height from the existing dictionary data. The intuitive guess is that the weight lies somewhere between 50 and 57. But to compute a quantitative prediction, we can use a weighted average.

The distance between 162 and 160 is 2, the distance between 162 and 166 is 4, and the distance between 160 and 166 is 6. So 162 assigns weight 4/6 to the key 160 and weight 2/6 to the key 166: the closer key gets the larger weight.

Mapping these weights onto the values, the estimated weight for 162 is about 50*(4/6) + 57*(2/6) = 52.33, which is close to 50 and in line with expectations.
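As a quick sanity check, here is a minimal Python sketch of this two-key weighted average (my own illustration, using the numbers from the example):

```python
# A two-key weighted average, using the numbers from the example.
heights = [160, 166]   # the two nearest keys
weights_kg = [50, 57]  # their values (body weights)
query = 162

dists = [abs(query - h) for h in heights]   # [2, 4]
total = sum(dists)                          # 6

# The closer key gets the larger weight: (total - dist) / total -> [4/6, 2/6]
attn = [(total - d) / total for d in dists]

estimate = sum(w * v for w, v in zip(attn, weights_kg))
print(round(estimate, 2))  # 52.33
```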

Because 162 lies in the interval [160, 166], it is natural to give these two keys more weight, that is, to pay more attention to them; the closer the key, the larger its weight. Here they receive attention weights of 2/3 and 1/3 respectively.

But the other key-value pairs in the dictionary may also carry information relevant to the query, and so far we have not used them. How can we use all the data in the dictionary to make the estimate more accurate?

Assume a function \alpha(q,k_{i}) represents the attention weight between q and k_{i}. Then the predicted weight f(q) can be derived as follows:

f(q)=\alpha(q,k_{1}) v_{1} + \alpha(q,k_{2}) v_{2} + \alpha(q,k_{3}) v_{3} = \sum_{i=1}^{3}\alpha(q,k_{i}) v_{i}

where \alpha is a function that measures similarity. Taking the Gaussian kernel as an example:

\alpha(q,k_{i})=softmax(-\frac{1}{2}(q-k_{i})^{2})=\frac{e^{-\frac{1}{2}(q-k_{i})^{2}}}{\sum_{j=1}^{3}e^{-\frac{1}{2}(q-k_{j})^{2}}}

where -\frac{1}{2}(q-k_{i})^{2} is the attention score, and softmax(-\frac{1}{2}(q-k_{i})^{2}) is the attention weight.

Substituting the corresponding values into the formula (the numbers here are scalars rather than tensors, so \times is used for multiplication):

f(162)=\alpha(162,160) \times 50 + \alpha(162,166) \times 57 + \alpha(162,173) \times 65 = \sum_{i=1}^{3}\alpha(162,k_{i})\times v_{i} 

In this way, we can use all the elements in the dictionary to estimate weight from height. This is the attention mechanism.
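Here is a minimal NumPy sketch of this Gaussian-kernel calculation (my own illustration of the formula above):

```python
import numpy as np

# Gaussian-kernel attention over the whole dictionary.
keys = np.array([160.0, 166.0, 173.0])    # heights
values = np.array([50.0, 57.0, 65.0])     # body weights
query = 162.0

scores = -0.5 * (query - keys) ** 2                 # attention scores
weights = np.exp(scores) / np.exp(scores).sum()     # softmax -> attention weights

f_q = (weights * values).sum()
# Prints ~50.02: the Gaussian kernel decays fast, so almost all of the
# attention lands on the nearest key (160), which is still close to 50.
print(f_q)
```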

The above shows the case where the input data is one-dimensional. The situation is similar when the query, key, and value are multi-dimensional.

Before introducing the multi-dimensional case, let us look at how the attention score \alpha(q,k_{i}) can be computed. Common choices include the additive model and the dot-product model.

Most current models use the scaled dot-product form. The Transformer adopts scaled dot-product attention rather than the additive model because it performs better in practice: it is more computationally efficient, suits long sequences, and propagates gradients more stably, making training more robust. Here we take the dot-product model as an example.

For q_{1} and k_{1}:

\alpha(q_{1},k_{1})=softmax(q_{1}\cdot k_{1}^{T})

The remaining pairs q_{2}, k_{2} and q_{3}, k_{3} are handled in the same way:

\alpha(q_{2},k_{2})=softmax(q_{2}\cdot k_{2}^{T})

\alpha(q_{3},k_{3})=softmax(q_{3}\cdot k_{3}^{T})

Then you can get: 

f(q)=\alpha(q_{1},k_{1}^{T}) v_{1} + \alpha(q_{2},k_{2}^{T}) v_{2} + \alpha(q_{3},k_{3}^{T}) v_{3} = \sum_{i=1}^{3}\alpha(q_{i},k_{i}^{T}) v_{i}

Converting this to matrix form (stacking the q_{i}, k_{i}, and v_{i} as the rows of matrices Q, K, and V), we get:

f(Q) = softmax(Q\cdot K^{T})\cdot V

The score is also divided by \sqrt{d_{k}}, the square root of the feature dimension, to scale it and keep gradients stable; without scaling, large dot products can cause gradient problems.

That is:

f(Q) = softmax(Q\cdot K^{T}/\sqrt{d_{k}})\cdot V
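To see why the scaling helps, here is a small demonstration (my own illustration, not part of the original derivation): for random vectors with unit-variance entries, the standard deviation of the raw dot product grows like \sqrt{d_{k}}, while the scaled version stays near 1, which keeps the softmax from saturating.

```python
import numpy as np

# For q, k with i.i.d. unit-variance entries, Var(q . k) grows linearly
# with d_k, so unscaled scores push the softmax into saturation.
rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)
    # Unscaled std grows like sqrt(d_k); after dividing by sqrt(d_k) it stays ~1.
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```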

This is the common scaled dot-product model. After softmax, large values are amplified and small values are suppressed, which lets the model focus on the positions with larger weights.
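Putting the matrix form into code, here is a minimal NumPy sketch of scaled dot-product attention (the shapes below are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """f(Q) = softmax(Q K^T / sqrt(d_k)) V, as in the formula above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values

# Toy usage: 3 queries and 3 keys of dimension 4, values of dimension 2.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
V = rng.standard_normal((3, 2))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 2)
```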

If Q, K, and V are all the same matrix, then this is the self-attention model.

Use X to represent this matrix:

So f(X) = softmax(X\cdot X^{T}/\sqrt{d_{k}})\cdot X

Doing this directly is not very meaningful. In an actual Transformer, Q, K, and V are each obtained from X by a linear transformation (with learnable parameter matrices W), which maps X into different linear spaces; the result is also split into multiple heads, and each head can learn different things, increasing the diversity of features and giving the model more expressive power.

Then you can get

f(X) = softmax(XW_{q}\cdot (XW_{k})^{T}/\sqrt{d_{k}})\cdot XW_{v}

This is the formula of the self-attention model.
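Finally, a minimal single-head NumPy sketch of this self-attention formula (the W matrices here are random placeholders standing in for the learned parameters; a real Transformer learns them during training and also splits them across multiple heads):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 5, 8                      # sequence length, model dimension (illustrative)
X = rng.standard_normal((n, d_model))

# Random placeholders for the learnable projection matrices.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
d_k = K.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

out = weights @ V   # f(X), shape (5, 8)
print(out.shape)
```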
