Understanding attention in the Transformer

I gained some new insights while learning about attention in the Transformer, so let me record them here. First, let's look at how attention in the Transformer is computed, as shown below.
[Figure: the attention computation pipeline, i.e. Q·K^T, scaling by 1/√d_k, softmax, then a weighted sum of V]

The process above is simply the execution of the following formula.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
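As a minimal sketch, this is the formula implemented in NumPy (the shapes and random inputs here are purely illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # correlation between queries and keys
    # softmax over the key dimension, with max-subtraction for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V               # weighted sum of the values

# Example: 2 queries attending over 3 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 8)
```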
So how should we understand Q, K, and V?
Let's use shopping as an example to get an intuition for what Q, K, and V mean, and then carry it over to object detection.
Q is the search condition, and each K is a characteristic attribute: one K may emphasize low price, another quality, another design.
V holds the concrete values of the attributes that its K describes.
When we search, we look at the correlation between the search conditions and the attributes; in matrix form, this correlation is computed with a dot product.
Dividing by √d_k scales these scores down; without it, large dot products would push the softmax into saturated regions where the gradients are extremely small.
Softmax then normalizes the scores into proportions that quantify the correlation. (In the figure above, K1 best matches what we are looking for, so its weight comes out largest.)
Next, multiplying by V reads out the concrete values of the attributes that each K describes; in other words, it determines how much attention should be paid to each K, V pair. The attention output is thus the joint result of Q, K, and V.
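To make the shopping analogy concrete, here is a small worked example of these steps; the attribute axes (price, quality, design) and all the numbers are made up for illustration:

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # dot-product correlation, scaled
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over the keys
    return w, w @ V

# Axes: [cares about low price, cares about quality, cares about design]
Q = np.array([[1.0, 0.2, 0.1]])  # this shopper mainly wants a low price

K = np.array([
    [0.9, 0.1, 0.0],  # K1: emphasizes low price
    [0.1, 0.9, 0.1],  # K2: emphasizes quality
    [0.0, 0.1, 0.9],  # K3: emphasizes design
])
V = np.array([
    [19.9, 2.0, 1.0],  # V1: concrete attribute values behind K1
    [99.0, 9.0, 5.0],  # V2
    [59.0, 5.0, 9.0],  # V3
])

weights, out = attention(Q, K, V)
print(weights.round(3))  # [[0.431 0.299 0.27]]: K1 gets the largest weight
print(out.round(2))      # the attended summary of the values
```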
As for Q: to find what it wants more quickly, Q must keep making its search conditions clearer.
K and V, in turn, represent attribute features. In order to be noticed (within a single epoch, the K and V features produced by the encoder do not change, but over multiple epochs they do), a key gradually sheds its irrelevant attributes and becomes more distinctive. That way, the next time a query comes shopping, attention lets Q go straight to the right K to learn richer feature descriptions, so Q's own description becomes clearer and its conditions stricter.

Extending this to object detection: Q is the object to be found. At first, Q1 says it is looking for a horse. K1 says "I am a horse," and K2 says "I am also a horse"; both have horse attributes. But K1 may be occluded, leaving only horseshoe features, while K2 has horseshoes, a horse tail, and a horse head. So when the correlations are computed, the correlation between Q1 and K2 comes out larger, and then we look at the concrete values. Through this continual learning, Q1 becomes responsible for finding horses, and the features of the horse it looks for grow more and more distinct. In DETR, this Q is trained, so Q1 will be responsible for finding horses from then on, and the remaining queries are handled in the same way.
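As a rough sketch of this idea (not DETR's actual code; the sizes and the random encoder features here are stand-ins), the learnable object queries play the role of Q, while the encoder output supplies K and V in the decoder's cross-attention:

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100  # illustrative sizes, as in the DETR paper

# Q: the learnable "search conditions", refined over the course of training
object_queries = nn.Embedding(num_queries, d_model)

# K, V: features produced by the encoder (random stand-ins for image features)
memory = torch.randn(1, 49, d_model)  # e.g. a 7x7 feature map, flattened

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
q = object_queries.weight.unsqueeze(0)  # (1, num_queries, d_model)
out, attn_weights = cross_attn(q, memory, memory)
print(out.shape)  # (1, 100, 256): one refined output per object query
```

During training, each query's embedding is updated so that, like Q1 above, it specializes in attending to (and thus detecting) a particular kind of object.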


Original post: blog.csdn.net/pengxiang1998/article/details/129893837