Detailed explanation of the attention mechanism (Attention), self-attention (Self-Attention) and multi-head self-attention (Multi-Head Self-Attention)

References

Thanks to my Internet mentor: the programmer of the water paper
Reference materials and picture sources:
Transformer, GPT, BERT: the past and present of pre-trained language models
[Transformer Series (2)] A super-detailed explanation of the attention mechanism, self-attention mechanism, multi-head attention mechanism, channel attention mechanism, and spatial attention mechanism
[Deformable DETR paper + source code interpretation] Deformable Transformers for End-to-End Object Detection

1. The Attention mechanism

Principle

I (the query Q) look at -> a picture (the queried object V)

When I look at this picture, at first glance I judge which things in it are more important to me and which are less important (that is, I compute the importance of each part of V with respect to Q).

This importance calculation is really a similarity calculation (the more similar, the more important), and the similarity here is measured by the dot product, i.e., the inner product.

Calculation process

Queried object: $V = (v_1, v_2, v_3, \dots)$

In the Transformer, K == V.

  1. Calculate similarity: $Q \cdot k_1,\ Q \cdot k_2,\ \dots = s_1, s_2, \dots, s_n$

  2. Normalize to get the probabilities: $\mathrm{softmax}(s_1, s_2, s_3, \dots) = a_1, a_2, a_3, \dots, a_n$

  3. Update V to V': $V' = a_1 \cdot v_1 + a_2 \cdot v_2 + \dots + a_n \cdot v_n$

This yields a new V', which replaces V. Besides representing K and V (K == V), this new V' also carries Q's information: it tells us which parts of K matter most to Q, i.e., where Q's attention on K is focused.
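As a minimal illustration of steps 1-3 (the dimensions, random vectors and variable names below are my own, not from the original post), the whole procedure is only a few lines of NumPy:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                # toy feature dimension
Q = rng.normal(size=d)               # the query q ("me")
V = rng.normal(size=(4, d))          # the queried object, v1..v4 ("the picture")
K = V                                # here K == V, as in the description above

s = K @ Q                            # step 1: similarities s1..s4 (dot products)
a = softmax(s)                       # step 2: attention weights a1..a4 (sum to 1)
V_new = a @ V                        # step 3: V' = a1*v1 + a2*v2 + a3*v3 + a4*v4

print(a.round(3), V_new.shape)       # where Q's attention falls, and the new V'
```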

2. Self-attention mechanism


2.1 The key point of self-attention!!

K, V, and Q all come from the same X; they share the same origin, which is why it is called self-attention.

How are K, V and Q obtained? By multiplying x with three parameter matrices ($W^K$, $W^V$, $W^Q$). These three matrices are also what we need to learn.
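A minimal sketch of this step (the shapes 512 and 64 are common Transformer choices, not values given in the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 2, 512, 64            # 2 tokens, model width 512, projection width 64

X = rng.normal(size=(n, d_model))       # the same X is the source of Q, K and V
W_Q = rng.normal(size=(d_model, d_k))   # the three parameter matrices to be learned
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # "self"-attention: all three come from X
print(Q.shape, K.shape, V.shape)        # (2, 64) each
```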


2.2 Implementation steps

1. Get Q, K and V

There is a sentence "Thinking Machines". It contains two words, whose vectors are x1 and x2. Multiplying them by the three matrices ($W^K$, $W^V$, $W^Q$) gives six vectors: q1, q2, k1, k2, v1, v2.

2. MatMul

Dot-multiply q1 with k1 and k2 respectively to obtain the scores; these tell us which of x1 and x2 carries the important information for q1.

3. Scale + softmax normalization

scale: divide the scores by $\sqrt{d_k}$ to keep them in a reasonable range and prevent problems during gradient descent.
softmax: normalize to probabilities, obtaining a1 and a2.
After Softmax normalization, each value is a weight coefficient between 0 and 1, and the weights sum to 1. The result can be understood as a weight matrix W.

This W is the attention weight; it captures how each word relates to the rest of the sentence and which parts are more important.
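As a concrete illustration with made-up scores (the values 112 and 96 below are chosen only so the weights come out near the 0.88/0.12 used in the next step; they are not from the original post):

```python
import numpy as np

scores = np.array([112.0, 96.0])        # illustrative scores for q1·k1 and q1·k2
d_k = 64
scaled = scores / np.sqrt(d_k)          # scale: divide by sqrt(d_k) = 8

e = np.exp(scaled - scaled.max())       # softmax, with the max subtracted for stability
weights = e / e.sum()
print(weights.round(2))                 # -> [0.88 0.12], and the weights sum to 1
```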

4. Weighted sum

Multiply the weights [0.88, 0.12] by [v1, v2] and sum: $z_1 = a_1 \cdot v_1 + a_2 \cdot v_2$


The resulting vector z1 is the new word vector for "Thinking"; it contains the similarity and relevance information between "Thinking" and every word in the sentence "Thinking Machines".

In the same way we obtain z2, the new word vector for "Machines".
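Putting steps 1-4 together for the two-word example, here is a minimal NumPy sketch (the weight matrices are random here, so the concrete numbers such as 0.88/0.12 will differ):

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
X = rng.normal(size=(2, d_model))            # x1, x2 for "Thinking" and "Machines"

W_Q = rng.normal(size=(d_model, d_k))        # the three learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # step 1: q1,q2 / k1,k2 / v1,v2
scores = Q @ K.T                             # step 2: MatMul, scores[i,j] = qi·kj
weights = softmax(scores / np.sqrt(d_k))     # step 3: scale + softmax -> weight matrix W
Z = weights @ V                              # step 4: weighted sum -> z1, z2

print(weights.round(2))                      # each row sums to 1
print(Z.shape)                               # (2, 64): new vectors for both words
```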

2.3 Defects of the self-attention mechanism

  1. Although the self-attention mechanism considers all input vectors, it ignores their positional information. In real language tasks a word's role can depend on its position; for example, verbs appear less often at the beginning of a sentence. (Solution: introduce positional encoding; see the sketch after this list.)
  2. When encoding the current position, the model tends to focus excessively on that position itself, which weakens its ability to capture useful information elsewhere. (Solution: introduce multi-head attention.)
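One common way to add position information is the sinusoidal positional encoding from the original Transformer paper; the post does not specify which encoding it has in mind, so the sketch below is just one possible choice:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine/cosine positional encoding as in "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions get cosine
    return pe

X = np.random.default_rng(0).normal(size=(2, 512))      # the two word vectors x1, x2
X_pos = X + sinusoidal_positional_encoding(2, 512)      # inject position before attention
```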

3. Multi-head self-attention mechanism

The model may need to attend several times before it really picks up the useful information in the picture, so it performs multi-head attention and then concatenates (concat) and fuses the values from the different heads.

3.1 Introduction

In simple terms: several groups of self-attention run in parallel, and their results are finally concatenated together.

3.2 Implementation steps

  1. Define multiple groups of $W^Q$, $W^K$ and $W^V$, generating multiple groups of Q, K and V.
  2. Apply self-attention to each group separately, obtaining multiple outputs $z_0, \dots, z_n$.
  3. Concatenate (concat) the outputs $z_0, \dots, z_n$, then multiply by a matrix W (a linear transformation that reduces the dimension) to obtain the final Z, as in the sketch below.
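A compact NumPy sketch of the three steps above (the head count of 8, the dimensions, and the output matrix name W_O are my own illustrative choices):

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, heads = 2, 512, 8
d_k = d_model // heads

X = rng.normal(size=(n, d_model))

# step 1: one group of W^Q, W^K, W^V per head -> multiple groups of Q, K, V
W_Q = rng.normal(size=(heads, d_model, d_k))
W_K = rng.normal(size=(heads, d_model, d_k))
W_V = rng.normal(size=(heads, d_model, d_k))
W_O = rng.normal(size=(heads * d_k, d_model))     # final linear layer

# step 2: run self-attention independently in each group -> z_0 ... z_7
zs = []
for m in range(heads):
    Q, K, V = X @ W_Q[m], X @ W_K[m], X @ W_V[m]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    zs.append(A @ V)                              # (n, d_k) output of head m

# step 3: concat the heads, then a linear transformation to get the final Z
Z = np.concatenate(zs, axis=-1) @ W_O             # (n, d_model)
print(Z.shape)
```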

3.3 Formula

[Figure: (a) the single-head attention computation; (b) the multi-head attention module with its Linear output layer]
Here x is the input feature; z denotes the query, obtained from x by the linear transformation $W_q$; k indexes the keys and q indexes the queries; M is the number of attention heads and m indexes a single head. $A_{mqk}$ is the attention weight of the m-th head (the process up to the SoftMax in panel (a)); $W'_m x_k$ is effectively the value, and everything inside the brackets [ ] corresponds to the whole process of panel (a); $W_m$ applies a linear transformation to the attended values (the Linear in panel (b)) to produce each head's output; $\Omega_k$ denotes the set of all keys.
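Written out in the notation of the Deformable DETR paper (my reconstruction from the description above), the formula in the figure reads:

$$
\mathrm{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W'_m x_k \Big],
\qquad \text{with } \sum_{k \in \Omega_k} A_{mqk} = 1 .
$$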


Source: blog.csdn.net/weixin_45662399/article/details/134384186