Deep understanding of the attention mechanism

This article is mainly meant to push myself to learn the attention mechanism systematically (I had skimmed it vaguely many times without really learning it... I only realized I needed to fill in this knowledge after reading papers).
It is mainly divided into the following points:

PART1----What problem does the attention mechanism solve?

PART2----Mathematical principle of soft attention mechanism (soft attention)

PART3----soft attention mechanism, Encoder-Decoder framework, Seq2Seq

PART4----The principle of the self-attention model

PART5----Multi-head Attention

PART6----Application of Attention Mechanism (Computer Vision Field)

PART7----Cross Attention


PART1 What problem does the attention mechanism solve?

In neural network learning, generally speaking, the more parameters a model has, the stronger its expressive power and the more information it can store, but this also brings the problem of information overload. By introducing an attention mechanism that focuses on the information most critical to the current task among the many inputs, reduces the attention paid to the rest, and even filters out irrelevant information, the information-overload problem can be alleviated and both the efficiency and the accuracy of task processing improve.
This is similar to the human visual attention mechanism: by quickly scanning the whole image, we locate the target region that deserves focus, commonly called the focus of attention, and then devote more attention to that region to obtain more detail about the target while ignoring other unimportant, irrelevant information.
The attention mechanism in deep learning is essentially similar to human selective visual attention: the core goal is to select, from a large amount of information, the pieces that are most critical to the current task.


PART2 Mathematical Principles of Soft Attention Mechanism

(1) The most original model

Let $X=[x_1,x_2,...,x_N]$ denote $N$ pieces of input information. To improve the efficiency of a neural network, it is not necessary to let the network process all $N$ inputs; it only needs to select some task-relevant information from $X$ for its computation. (A note on the difference between soft and hard attention: the soft attention mechanism does not select a single item out of the $N$ inputs; instead it computes a weighted average over the $N$ inputs and feeds that into the network. The hard attention mechanism selects information at one specific position in the input sequence, for example by sampling one item at random or by picking the item with the highest probability. Soft attention is generally used in neural networks.)
Here we first introduce the calculation of the attention value on its own; the complete workflow of a neural network with an attention mechanism is given in PART3.

The calculation of the attention value can be divided into two steps:

(1) Calculate the attention distribution on all input information

Consider the following scenario: the input information vector $X$ is regarded as a memory of stored information, and a query vector $q$ is given, used to find and select certain information in $X$; we therefore need to know the index position of the selected information. A soft selection mechanism is adopted: rather than picking out a single piece of the stored information, it extracts something from all of the pieces, extracting more from the most relevant ones.
Define an attention variable $z\in[1,N]$ to represent the index position of the selected information; $z=i$ means the $i$-th input is selected. Then, given $q$ and $X$, compute the probability $\alpha_i$ of selecting the $i$-th input:
$\alpha_i=p(z=i|X,q)=\mathrm{softmax}\big(s(x_i,q)\big)=\frac{\exp(s(x_i,q))}{\sum_{j=1}^N\exp(s(x_j,q))}$
(A note on the softmax function: softmax does not single out one maximum value; instead it assigns a probability to every output class, indicating how likely each class is. Binary classification with a single output node usually applies a sigmoid to that node, while binary or multi-class problems with two or more output nodes usually apply softmax.) The probability distribution vector formed by the $\alpha_i$ is called the attention distribution. $s(x_i,q)$ is the attention scoring function, which has the following common forms:

[Figure: attention scoring functions $s(x_i,q)$, e.g. the additive model $s(x_i,q)=v^{T}\tanh(Wx_i+Uq)$ and the scaled dot-product model $s(x_i,q)=\frac{x_i^{T}q}{\sqrt{d}}$]
Here $W$, $U$ and $v$ are learnable network parameters, and $d$ is the dimension of the input information.

(2) Calculate the weighted average of the input information according to the attention distribution

The attention distribution $\alpha_i$ indicates how strongly the $i$-th piece of input information in $X$ is correlated with the query $q$. Under the soft selection mechanism, the answer to the query is obtained by summarizing the input information with a weighted average, which gives the attention value:
$att(X,q)=\sum_{i=1}^N\alpha_i x_i$
The following figure shows the overall process of calculating the attention value:
[Figure: overall process of calculating the attention value]
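To make the two steps concrete, here is a minimal NumPy sketch of soft attention over $N$ input vectors, assuming the additive scoring function with randomly initialized $W$, $U$, $v$ (in a real network these would be learned); the function name and the dimensions are illustrative only.

```python
import numpy as np

def soft_attention(X, q, W, U, v):
    """X: (N, d) input vectors, q: (d,) query vector.
    Scores use the additive model s(x_i, q) = v^T tanh(W x_i + U q)."""
    scores = np.array([v @ np.tanh(W @ x_i + U @ q) for x_i in X])  # s(x_i, q)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # attention distribution alpha_i (softmax)
    return alpha, alpha @ X               # weighted average of the inputs = attention value

# toy example with random (untrained) parameters
rng = np.random.default_rng(0)
N, d, d_hid = 4, 6, 8
X = rng.normal(size=(N, d))
q = rng.normal(size=d)
alpha, att = soft_attention(X, q,
                            rng.normal(size=(d_hid, d)),
                            rng.normal(size=(d_hid, d)),
                            rng.normal(size=d_hid))
print(alpha.sum(), att.shape)             # alpha sums to 1.0; att has shape (6,)
```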

(2) Key-value pair attention mode

More generally, the input information can be represented as key-value pairs, so the $N$ pieces of input information can be written as
$(K,V)=[(k_1,v_1),(k_2,v_2),...,(k_N,v_N)]$, where the keys are used to compute the attention distribution $\alpha_i$ and the values are used to compute the aggregated information according to that distribution.
The attention mechanism can then be viewed as a soft addressing operation: the input information $X$ is regarded as content stored in memory, where each element consists of an address (Key) and a value (Value). Given a query with Key=Query (a representation of the target task), the goal is to retrieve the corresponding Value from memory, that is, the attention value. In soft addressing, the condition Key=Query does not have to hold exactly for the stored information to be retrieved; instead, the similarity between the Query and each element's Key determines how much of that element's Value is taken out. The Values at every address are then extracted and summed: the weight of each Value is computed from the Query-Key similarity, and the weighted sum of the Values gives the final Value, i.e. the attention value.

It can be summarized into three processes:
step1: Compute the similarity between the Query and each Key, using the additive model, dot-product model or cosine similarity listed above, to get the attention score $s_i=F(Q,k_i)$
step2: Use the softmax function to transform the attention scores. On the one hand this normalizes them into a probability distribution whose weights sum to 1; on the other hand, the exponential form of softmax amplifies the gaps between scores and so highlights the weights of the important elements:
$\alpha_i=\mathrm{softmax}(s_i)=\frac{\exp(s_i)}{\sum_{j=1}^N\exp(s_j)}$

step3: Weight and sum the Values according to these weight coefficients: $\mathrm{Attention}((K,V),Q)=\sum_{i=1}^N\alpha_i v_i$

The diagram is as follows:
[Figure: the three steps of key-value attention]
Expressed as a single formula, the three steps are: $\mathrm{attention}((K,V),q)=\sum_{i=1}^N\alpha_i v_i=\sum_{i=1}^N\frac{\exp(s(k_i,q))}{\sum_{j=1}^N\exp(s(k_j,q))}v_i$
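As a minimal illustration of these three steps, here is a NumPy sketch of key-value attention using the dot-product scoring function (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def kv_attention(K, V, q):
    """K: (N, d_k) keys, V: (N, d_v) values, q: (d_k,) query."""
    s = K @ q                          # step1: attention scores s_i = F(q, k_i) (dot product)
    alpha = np.exp(s - s.max())
    alpha /= alpha.sum()               # step2: alpha_i = softmax(s_i)
    return alpha @ V                   # step3: Attention((K, V), q) = sum_i alpha_i * v_i

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))            # five keys
V = rng.normal(size=(5, 3))            # five values
q = rng.normal(size=8)
print(kv_attention(K, V, q).shape)     # (3,): one aggregated value vector
```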


PART3 soft attention mechanism and Encoder-Decoder framework

The attention mechanism is a general idea that does not depend on a specific framework, but most attention models are currently used in conjunction with the Encoder-Decoder framework . It can be summed up simply in two steps:

  • Calculate the attention distribution on the given information (that is, judge what information is important and what information is not important, and give different weights respectively)
  • Calculate the weighted average of all input information according to the attention distribution

The following figures show how the two are combined in the field of text processing (the Encoder-Decoder framework without attention, and then with attention):
[Figure: Encoder-Decoder framework without attention]

[Figure: Encoder-Decoder framework with attention]
The Encoder-Decoder framework in text processing can be understood intuitively as a general processing model for turning one sentence (or passage) into another sentence (or passage). For a sentence pair <Source, Target>, the goal is, given the input sentence Source, to generate the target sentence Target through the Encoder-Decoder framework. Source and Target may be in the same language or in two different languages, and each consists of its own word sequence. The Encoder encodes the input sentence Source, converting it through nonlinear transformations into an intermediate semantic representation $c$. The Decoder's task is to generate the words that come next from $c$ and the history of words that have already been generated.

The Encoder-Decoder framework is a common pattern in deep learning and is widely used in text processing, speech recognition and image processing. The encoder and decoder are not tied to one specific neural network model; different models are used for different tasks. For example, RNN models are common in text processing, while CNN models are generally used in image processing.

The Encoder-Decoder framework with RNN as the encoder and decoder is also called an asynchronous sequence-to-sequence model, that is, the Seq2Seq model .

The following is the RNN Encoder-Decoder framework without the attention mechanism:
[Figure: RNN Encoder-Decoder framework without the attention mechanism]
Take the Seq2Seq model as an example to compare the model without the attention mechanism and the model with the attention mechanism.

(1) RNN Encoder-Decoder without attention mechanism

When processing sequence data, the RNN Encoder-Decoder framework without attention first uses the encoder to encode a variable-length sequence $X$ into a fixed-length vector representation $c$, and then uses the decoder to decode this vector representation into another variable-length sequence $y$; the input sequence $X$ and the output sequence $y$ may have different lengths.
The paper "Learning phrase representations using RNN encoder-decoder for statiscal machine translation" proposes a RNN Encoder-Decoder structure, as shown in the figure below. In addition, this article first proposed the GRU, a commonly used LSTM variant structure.

[Figure: the RNN Encoder-Decoder structure proposed in the paper]
Using this structure for text processing, given an input sequence $X=[x_1,x_2,...,x_N]$ (a sentence represented as a word sequence), the encoding-decoding process amounts to finding the conditional probability of another variable-length sequence $y=[y_1,y_2,...,y_T]$: $p(y)=p(y_1,y_2,...,y_T|x_1,x_2,...,x_N)$. After decoding, this conditional probability can be factorized into a product: $p(y)=\prod_{t=1}^T p(y_t|\{y_1,...,y_{t-1}\},c)$
So, given the representation vector $c$ and all previously predicted words $(y_1,y_2,...,y_{t-1})$, the model can predict the $t$-th word $y_t$, i.e. compute the conditional probability $p(y_t|\{y_1,...,y_{t-1}\},c)$.

Referring to Figure 1 above, we compute this conditional probability in three steps:
step1: Feed the elements of the input sequence $X$ into the Encoder RNN step by step to obtain the hidden states $h_t$, then integrate all hidden states $[h_1,h_2,...,h_T]$ into the semantic representation vector $c$:
$h_t=f_1(x_t,h_{t-1})\qquad c=q(\{h_1,h_2,...,h_T\})$

step2: At each time step $t$, the Decoder RNN outputs a prediction $y_t$. From the semantic representation vector $c$, the word $y_{t-1}$ predicted at the previous step, and the Decoder's hidden state $s_{t-1}$, compute the hidden state at the current time step: $s_t=f_2(y_{t-1},s_{t-1},c)$
step3: From the semantic vector $c$, the previously predicted word $y_{t-1}$ and the Decoder hidden state $s_t$, predict the $t$-th word $y_t$, i.e. compute the conditional probability:
$p(y_t|\{y_1,...,y_{t-1}\},c)=g(y_{t-1},s_t,c)$
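To make the three steps concrete, here is a minimal NumPy sketch; it substitutes a plain tanh RNN cell for the GRU used in the paper and omits the output layer $g(\cdot)$, so every name and dimension is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, N, T = 8, 16, 5, 4
X = rng.normal(size=(N, d_in))                 # input sequence x_1 ... x_N

def rnn_step(x, h_prev, Wx, Wh, b):
    # one step of a simple tanh RNN cell (a stand-in for the GRU used in the paper)
    return np.tanh(Wx @ x + Wh @ h_prev + b)

# encoder: h_t = f1(x_t, h_{t-1}); here c = q({h_1,...,h_T}) simply keeps the last state
Wx_e, Wh_e, b_e = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
h = np.zeros(d_h)
for x_t in X:
    h = rnn_step(x_t, h, Wx_e, Wh_e, b_e)
c = h                                          # fixed semantic vector, reused at every step

# decoder: s_t = f2(y_{t-1}, s_{t-1}, c); note that the SAME c appears at every step
Wy_d, Ws_d, Wc_d, b_d = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)),
                         rng.normal(size=(d_h, d_h)), np.zeros(d_h))
s = np.zeros(d_h)
y_prev = np.zeros(d_h)                         # embedding of the previously generated word
for t in range(T):
    s = np.tanh(Wy_d @ y_prev + Ws_d @ s + Wc_d @ c + b_d)
    # y_t would then be drawn from g(y_{t-1}, s_t, c), e.g. a softmax over the vocabulary
```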

It can be seen that when generating each word of the target sentence, the same semantic vector $c$ is used, which means that when predicting any word $y_t$, every input word has the same importance to it; attention is spread evenly. In other words, the model does not generate multiple different semantic representations $[c_1,c_2,...,c_T]$, one for each output word.

(2) RNN Encoder-Decoder with attention mechanism

Building on the paper above, the paper "Neural Machine Translation by Jointly Learning to Align and Translate" proposes a new neural machine translation architecture: an attention mechanism is added to the RNN Encoder-Decoder framework. The encoder in this paper is a bidirectional GRU, and the decoder is again an RNN that generates the sentence.
Using this model for machine translation: given a sentence $X=[x_1,x_2,...,x_N]$, the encoding-decoding process generates a target sentence $y=[y_1,y_2,...,y_T]$ in another language. That is, to search for the most likely word, we compute the conditional probability of each candidate word:
$p(y_i|y_1,...,y_{i-1},x)=g(y_{i-1},s_i,c_i)$

The process of generating the $t$-th word is shown below:
[Figure: generating the $t$-th word with the attention mechanism]
Compared with the RNN Encoder-Decoder framework without attention: on the one hand, in the conditional probability formula for $y_i$, the semantic representation inside the nonlinear function $g(\cdot)$ is $c_i$, which changes with the output $y_i$, rather than a constant $c$; on the other hand, as the figure shows, for every generated word $y_t$ a semantic representation vector $c_i$ is recomputed from the original input sequence $X$ and other information. The key point of the attention-based RNN Encoder-Decoder framework is therefore that the fixed semantic vector $c$ is replaced by a semantic representation $c_i$ that keeps changing with the word currently being generated.

Referring to the figure above, we compute the conditional probability of the generated word $y_i$ in four steps:
step1: Given the original input sequence (a sentence) $X=[x_1,x_2,...,x_N]$, feed the words one by one into the Encoder RNN and compute the hidden state $h_t$ for each input. The encoder here is a bidirectional RNN, so the hidden states of the forward and backward recurrent layers are computed separately and then concatenated:
$h_t^{(1)}=f(U^{(1)}h_{t-1}^{(1)}+W^{(1)}x_t+b^{(1)})\qquad h_t^{(2)}=f(U^{(2)}h_{t+1}^{(2)}+W^{(2)}x_t+b^{(2)})\qquad h_t=h_t^{(1)}\oplus h_t^{(2)}$

step2: In the Decoder RNN, at time step $t$, compute the current hidden state from the semantic representation vector $c_t$, the word $y_{t-1}$ predicted at the previous step, and the Decoder's previous hidden state $s_{t-1}$: $s_t=f_2(y_{t-1},s_{t-1},c_t)$ (but what is the value of $c_t$?)
step3: To obtain $c_t$, we first need $s_{t-1}$:
$e_{ij}=a(s_{i-1},h_j)\qquad \alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}\qquad c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j$
Here $e_{ij}$ is the unnormalized attention score. The nonlinear function $a(\cdot)$ is called the alignment model; its role is to compare the hidden state $h_j$ of each encoder word $x_j$ with the decoder hidden state $s_{i-1}$ of the word preceding $y_i$, so as to compute how well each input word $x_j$ matches the generated word $y_i$. The higher the match, the higher the attention score, and the more attention that input word receives when generating $y_i$.
After obtaining the attention scores $e_{ij}$, normalize them with the softmax function to get the attention distribution $\alpha_{ij}$. This distribution serves as the weight of each input word $x_j$; taking the weighted sum of the corresponding hidden states $h_j$ yields the semantic representation vector $c_i$ for each generated word $y_i$, which is exactly the attention value.
step4: Our goal is not the attention value itself but the conditional probability of the generated word $y_i$: $p(y_i|\{y_1,...,y_{i-1}\},x)=g(y_{i-1},s_i,c_i)$
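Here is a minimal NumPy sketch of one attention step (step3 above): compute the alignment scores $e_{ij}$, normalize them into $\alpha_{ij}$, and take the weighted sum of the encoder states to get $c_i$. The additive alignment model and all parameter names are illustrative stand-ins for learned parameters:

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: (d_s,) previous decoder state s_{i-1}; H: (T_x, d_h) encoder hidden states."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij = a(s_{i-1}, h_j)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()               # alpha_ij: attention distribution over the input words
    c_i = alpha @ H                    # c_i = sum_j alpha_ij * h_j  (the attention value)
    return alpha, c_i

rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 6, 10, 12, 8
H = rng.normal(size=(T_x, d_h))        # concatenated forward/backward encoder states
s_prev = rng.normal(size=d_s)
alpha, c_i = attention_context(s_prev, H,
                               rng.normal(size=(d_a, d_s)),
                               rng.normal(size=(d_a, d_h)),
                               rng.normal(size=d_a))
print(alpha.round(3), c_i.shape)       # weights sum to 1; c_i has shape (10,)
```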


PART4 self-attention model

(1) Intuitive explanation

Let's compare it with the Soft Attention Encoder-Decoder model to get acquainted with the Self-Attention Model.

In the soft-attention Encoder-Decoder model (more concretely, in a machine translation model), the input sequence and the output sequence differ in content and even in length, and the attention computation happens between the encoder and the decoder, that is, between the input source sentence and the generated output sequence; the Query comes from the Target, while the Key and Value come from the Source. In the self-attention model, the attention computation happens inside the input sequence or inside the output sequence; it can be understood as the special case Target = Source, so Query, Key and Value all come from the Source (or all from the Target). This makes it possible to extract relationships between words that are far apart in the same sentence, such as syntactic features (phrase structures spanning some distance).

In a pure RNN, hidden states and outputs are computed step by step over the input sequence, so for features that are far apart but interdependent, the network is unlikely to capture the connection between them. Once the self-attention mechanism is introduced inside the sequence, any two words in the sentence can be related directly through a single computation, which makes it much easier to capture interdependent features.

(2) Theoretical analysis

Having gained a general understanding of self-attention, we now define the self-attention model with formulas.
The self-attention model sits between the input and the output of the same network layer (not the model's final output): it uses the attention mechanism to dynamically generate the weights of the different connections and thereby produce the layer's output.
As mentioned earlier, the self-attention model can establish long-distance dependencies within a sequence. In principle this could also be done with a fully connected neural network, but the problem is that the number of connections in a fully connected network is fixed, so it cannot handle variable-length sequences. The self-attention model generates the connection weights dynamically: how many weights there are and how large they are can both change, and when a longer sequence comes in, it simply generates more connections. As shown in the figure below, the dashed connections are generated dynamically.

[Figure: fully connected model vs. self-attention model; the dashed connections are generated dynamically]
Expressing the self-attention mechanism mathematically: suppose the input sequence of a neural layer is $X=[x_1,x_2,...,x_N]$ and the output sequence is $H=[h_1,h_2,...,h_N]$ of the same length. First, $X$ is projected into three different spaces by linear transformations (meaning that three different matrices are learned), giving three vector sequences:
$Q=W_QX\in R^{d_3\times N}\qquad K=W_KX\in R^{d_3\times N}\qquad V=W_VX\in R^{d_3\times N}$

Here $Q$, $K$, $V$ are the query vector sequence $[q_1,...,q_N]$, the key vector sequence $[k_1,...,k_N]$ and the value vector sequence $[v_1,...,v_N]$ respectively, and $W_Q$, $W_K$, $W_V$ are learnable parameter matrices.
The output vector $h_i$ is expressed as:
$h_i=\mathrm{attention}((K,V),q_i)=\sum_{j=1}^N\alpha_{ij}v_j=\sum_{j=1}^N\mathrm{softmax}\big(s(k_j,q_i)\big)v_j$

where $i,j\in[1,N]$ index positions in the output and input sequences, and the connection weights $\alpha_{ij}$ are generated dynamically by the attention mechanism.
The overall diagram is:
[Figure: overall structure of the self-attention computation]
In this figure, the input is assumed to be a sequence of three vectors (N=3), each of length $D_x$. Three query vectors are then obtained, and each query vector produces one output; putting these outputs together keeps the same N as the input $X$, but the length changes from $D_x$ to $D_v$ (here $D_v$ can be chosen freely, since it is determined by the projection matrix $W_V$ that maps $X$ to the values $V$). In other words, through the operations above, suitable weights are generated dynamically for a variable-length sequence, and each element of dimension $D_x$ is converted to a fixed length $D_v$.
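Here is a minimal NumPy sketch of this computation, written in the row-vector convention (each row of $X$ is one input vector, so the projections are $XW$ rather than $WX$); the scaled dot-product scoring function and all dimensions are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (N, D_x) input sequence; returns H: (N, D_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # query / key / value sequences
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # s(k_j, q_i), scaled dot product
    A = softmax(scores, axis=-1)                 # alpha_ij: one distribution per query
    return A @ V                                 # h_i = sum_j alpha_ij * v_j

rng = np.random.default_rng(0)
N, D_x, D_k, D_v = 3, 8, 6, 5
X = rng.normal(size=(N, D_x))
H = self_attention(X,
                   rng.normal(size=(D_x, D_k)),
                   rng.normal(size=(D_x, D_k)),
                   rng.normal(size=(D_x, D_v)))
print(H.shape)    # (3, 5): N is unchanged, the feature length becomes D_v
```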

The self-attention mechanism can be used as a layer of a neural network, or it can be used to replace a convolutional layer or a recurrent layer, or it can be cross-stacked with a convolutional layer or a recurrent layer.

In the figure below, the input and output sequences are the same sentence. Looking at the weights dynamically generated by the self-attention mechanism, the weight between 'making' and 'more difficult' is relatively large (the color is dark), so the connection between these words (which form a phrase) is captured. Clearly, after introducing self-attention, it becomes easier to capture long-distance interdependent features in a sentence: with an RNN or LSTM, computation proceeds sequentially, and connecting two long-distance features requires information to accumulate over many time steps, so the farther apart they are, the less likely the dependency is to be captured effectively.

During the computation, however, self-attention relates any two words in the sentence directly in a single step, so the distance between long-range dependent features is drastically shortened, which makes these features easier to exploit. Moreover, self-attention directly increases the parallelism of the computation, mainly because it can be expressed as matrix multiplications. These are the main reasons why self-attention is becoming so widely used.

[Figure: self-attention weights within a sentence; a darker color means a larger weight]


PART5 Multi-head Attention

The idea of multi-head attention is straightforward: use multiple query vectors $Q=[q_1,...,q_M]$ to search the input for the desired information in parallel. Intuitively, solving a task requires information about many different aspects, and each query vector $q_i$ probes a different aspect. Different $q_i$ therefore score the importance of the input information from different angles, and the results are then aggregated to some extent.
That is, the attention computation is repeated h times and the results are concatenated. Taking the common scaled dot-product attention as an example, there are roughly two steps:
step1: map Q, K and V through parameter matrices and then apply attention; repeat this process h times
step2: concatenate the outputs of the multiple heads
$\mathrm{MultiHead}((K,V),q)=\mathrm{Concat}\big(\mathrm{att}((K,V),q_1),...,\mathrm{att}((K,V),q_M)\big)$

So-called multi-head attention simply does the same thing several times (with unshared parameters) and then concatenates the results. Recall the self-attention of the previous section: the input $X$ has dimension $D_x$ and there are N vectors in total; projecting them with one matrix $W_Q$ yields N query vectors at once, which guarantees that there are still N outputs (only the other dimension is determined by the value dimension $D_v$). Multi-head attention asks whether several query matrices can be used to capture what is important in $X$ from different angles: multiple matrices $W_{Q_i}$ project $X$ into m different spaces, giving m query matrices and $m*N$ query vectors in total; the outputs are computed separately and finally concatenated to give the result.
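A minimal NumPy sketch of multi-head attention along these lines: each head has its own (unshared) projection matrices, the head outputs are concatenated, and, following the usual Transformer convention (an assumption; the text above stops at concatenation), a final linear map $W_O$ brings the result back to the model dimension. All names and sizes are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head (parameters are not shared)."""
    outs = [scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
            for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O   # concatenate the m heads, then project

rng = np.random.default_rng(0)
N, D_x, D_k, D_v, m = 4, 8, 4, 4, 2
X = rng.normal(size=(N, D_x))
heads = [(rng.normal(size=(D_x, D_k)), rng.normal(size=(D_x, D_k)), rng.normal(size=(D_x, D_v)))
         for _ in range(m)]
W_O = rng.normal(size=(m * D_v, D_x))
print(multi_head_attention(X, heads, W_O).shape)  # (4, 8)
```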

The figure "Single-Head Attention VS Multi-Head Attention" below compares the computation of single-head attention with that of multi-head attention.
[Figure: Single-Head Attention vs. Multi-Head Attention]


PART6 Application of the attention mechanism (in the field of computer vision)

Can the attention mechanism help computer vision? Attention is essentially a set of weights, and weighting means that different pieces of information can be fused. CNNs have an inherent limitation: each operation only sees the information near the convolution kernel (local information) and cannot fuse distant (non-local) information. Attention can weight and integrate distant information, playing a complementary role.
[Figure: image-captioning examples with the attended image regions]
As shown in the figure above, in each example the upper left is the original input image, the sentence below it is the description generated automatically by the system, and the upper right shows the image region the system focuses on when generating the underlined word of the sentence. For example, when outputting the word "dog", the system pays more attention to the part of the picture where the puppy is.

In computer vision, the applications can be roughly divided into two types:
(1) Learning a weight distribution: different parts of the input data or feature map receive different degrees of attention

  • The weighting can keep all components with their weights (soft attention), or sample some components from the distribution according to a sampling strategy (hard attention); the latter is often trained with reinforcement learning.
  • The weighting can be applied to the original image, as in "Recurrent Models of Visual Attention" (RAM) and "Multiple Object Recognition with Visual Attention" (DRAM), or to feature maps, as in many later works (e.g. the image-captioning paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention").
  • The weighting can act on the spatial scale, weighting different spatial regions; or on the channel scale, weighting different channel features; or even on every element of the feature map (a channel-weighting sketch follows this list).
  • The weighting can also be applied to historical features at different time steps, as in machine translation.
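As a toy illustration of channel-scale weighting, here is a small NumPy sketch in the spirit of squeeze-and-excitation-style channel attention: pool each channel to a scalar, pass the result through a tiny two-layer gate, and rescale the channels. It is not the method of any specific paper cited above, and every name and size is illustrative:

```python
import numpy as np

def channel_attention(fmap, W1, W2):
    """fmap: (C, H, W) feature map. Returns the channel-reweighted feature map."""
    squeeze = fmap.mean(axis=(1, 2))                            # global average pool: one value per channel
    gate = 1.0 / (1.0 + np.exp(-(W2 @ np.tanh(W1 @ squeeze))))  # per-channel weights in (0, 1)
    return fmap * gate[:, None, None]                           # reweight each channel of the feature map

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
fmap = rng.normal(size=(C, H, W))
out = channel_attention(fmap, rng.normal(size=(4, C)), rng.normal(size=(C, 4)))
print(out.shape)    # (8, 16, 16): same shape, channels rescaled by their attention weights
```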

(2) Task focus: by decomposing the task and designing different network structures (or branches) to focus on different subtasks, the learning capacity of the network is redistributed, which reduces the difficulty of the original task and makes the network easier to train.


PART7 Cross Attention

Cross attention only changes the inputs of self attention: the inputs of cross-attention come from different sequences, while the inputs of self-attention come from the same sequence; that is the "different input". Concretely, self-attention takes a single embedding sequence as input, whereas cross-attention asymmetrically combines two embedding sequences of the same dimension: one sequence serves as the query Q input, while the other serves as the key K and value V inputs. There are exceptions: in SelfDoc's cross-attention, one sequence supplies the queries and values and the other supplies the keys. In any case, Q, K and V are built from two sequences rather than a single one.
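Here is a minimal NumPy sketch of cross attention under the usual layout (one sequence provides the queries, the other provides the keys and values); the sequence names are just examples and are not tied to any particular model:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(S_q, S_kv, W_Q, W_K, W_V):
    """S_q: (N_q, d) sequence providing the queries;
    S_kv: (N_kv, d) sequence providing the keys and values."""
    Q, K, V = S_q @ W_Q, S_kv @ W_K, S_kv @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # each query attends over the other sequence
    return A @ V                                          # (N_q, d_v): one output per query position

rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(5, 8))   # e.g. the output of the decoder's masked self-attention
encoder_states = rng.normal(size=(7, 8))   # e.g. the encoder output
out = cross_attention(decoder_states, encoder_states,
                      rng.normal(size=(8, 6)), rng.normal(size=(8, 6)), rng.normal(size=(8, 6)))
print(out.shape)    # (5, 6)
```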

Cross attention is ultimately used in the Transformer decoder, shown on the right side of the figure below. The decoder has three inputs: input1, input2 and input3. It first adds input1 (the decoder's outputs from previous time steps, fed back recursively) and input2 (the positional encoding carrying position information); after masked multi-head attention, cross attention is computed with input3 (the encoder output). Cross attention is therefore usually used as a decoder module, paired with self attention in the encoder.

[Figure: the Transformer architecture; the decoder on the right contains the cross-attention layer]
In this usage, the Query for cross attention comes from the output of the (self-attention) encoder, while the Key and Value come from the initial input, i.e. the encoder's input. In other words, the encoder output is taken as a representation of the sentence being predicted and is used to query its similarity to each word of the original input sentence. What cross attention does is use the key/value information to represent the query information, or introduce key/value information into the query information (a residual connection adds the result back onto the original query); the result is the relevance of the query to the key (query attending to key, e.g. a vehicle attending to lanes, and vice versa).

Cross attention is widely used in decoders. Typically, after the encoder applies self attention, the decoder first uses a cross-attention layer to obtain the attention values and then attaches an MLP or LSTM layer to predict the target; this works much better than using an MLP or LSTM directly as the decoder. Cross attention can also reintroduce information from any earlier layer of the network, similar to a residual connection but more flexible.

In addition, the Query and the Key/Value of cross attention can come from inputs of two different modalities, for example an image and its corresponding text, in order to find the correlation between the two (image-text tasks); this was one of the original motivations for the module.


