A super-detailed explanation of the self-attention mechanism (Self-attention)

Original paper: Attention Is All You Need

For self-study use only

The problem Self-attention wants to solve: so far, the input to a model has been a single vector and the output a value or a class. How do we handle the case where the input is a sequence of vectors, and the number of input vectors can vary from example to example?

1. The input is a sequence of vectors

Examples: an English sentence, an audio signal, an image

For example, if the input is an English sentence, the number of words differs from sentence to sentence, so the number of input vectors differs each time.

How to represent words as vectors:

(1) One-hot Encoding

Each word can be represented as a vector. For example, if there were only 100 words in the world, we could create a 100-dimensional vector in which each dimension corresponds to one word; setting a different position to 1 for each word gives 100 distinct vectors. The problem is that these vectors carry no information about how the words relate to each other.
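
As a minimal sketch, assuming a hypothetical five-word vocabulary, one-hot encoding looks like this:

```python
# Minimal sketch of one-hot encoding for a hypothetical 5-word vocabulary.
import numpy as np

vocab = ["cat", "dog", "tree", "flower", "saw"]   # illustrative vocabulary
one_hot = np.eye(len(vocab))                      # identity matrix: one row per word

print(one_hot[vocab.index("cat")])   # [1. 0. 0. 0. 0.]
print(one_hot[vocab.index("dog")])   # [0. 1. 0. 0. 0.]

# Any two distinct one-hot vectors have dot product 0, so the encoding
# carries no information about how the words relate to each other.
print(one_hot[0] @ one_hot[1])       # 0.0
```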

(2) Word Embedding

Give each word a vector that carries semantic information: if you plot the embeddings, words for animals cluster together in one lump and words for plants cluster together in another.
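
A minimal sketch of an embedding lookup, assuming a hypothetical five-word vocabulary and 4-dimensional vectors (random here; in practice they are learned):

```python
# Minimal sketch of a word-embedding lookup. The table is random here;
# in practice it is learned (or pre-trained, e.g. word2vec / GloVe).
import numpy as np

vocab = ["cat", "dog", "tree", "flower", "saw"]
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))   # one 4-d vector per word

def embed(word):
    return embedding_table[vocab.index(word)]

# After training, semantically related words end up close together;
# these random vectors only illustrate the lookup itself.
print(embed("cat"))
```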

2. Output

2.1 The input and output lengths are the same: each input vector has a corresponding label (Sequence Labeling)

Handle each vector independently: feed each vector into a fully connected network, and the network produces a corresponding output. But this has problems. Take part-of-speech tagging on the sentence "I saw a saw": the two occurrences of "saw" are identical inputs, yet we want different parts of speech; to the fully connected network the two "saw"s look the same, so their outputs will be the same. We therefore want the fully connected network to take context into account. Feeding a window of neighbouring vectors into the fully connected network together (as sketched below) helps a lot, but it is still not good enough. One could keep enlarging the window until it covers the entire sequence, but sequences vary in length, and covering the whole sequence would blow up the number of parameters of the fully connected network. A better method is self-attention.
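
Below is a minimal sketch of the window idea, with illustrative sizes (a window of one neighbour on each side, 4-dimensional vectors, two output classes), not taken from the original post:

```python
# Minimal sketch of the sliding-window idea: concatenate each vector with
# its neighbours and feed the window to a fully connected layer.
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))        # 6 word vectors, 4 dims each
W = rng.normal(size=(3 * 4, 2))      # FC layer: window of 3 vectors -> 2 label scores
b = np.zeros(2)

outputs = []
for i in range(len(seq)):
    left  = seq[i - 1] if i > 0 else np.zeros(4)        # zero-pad at the edges
    right = seq[i + 1] if i < len(seq) - 1 else np.zeros(4)
    window = np.concatenate([left, seq[i], right])      # the context window
    outputs.append(window @ W + b)                      # per-position prediction

print(np.stack(outputs).shape)       # (6, 2): one prediction per input vector
```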

Self-attention looks at the whole sequence: it considers all four inputs at once and produces one output per input, and feeding those outputs into the fully connected network gives better results. Self-attention and fully connected layers can also be stacked and alternated.

2.2 The whole sequence outputs only one label

2.3 The machine decides for itself how many labels to output (sequence to sequence)

3. How Self-attention computes its output

This part explains how self-attention turns its input vectors into output vectors (for convenience: the inputs a^{1}, a^{2}, a^{3}, a^{4} pass through self-attention to produce the outputs b^{1}, b^{2}, b^{3}, b^{4}).

Starting from a^{1}, find the other vectors in the sequence that are related to a^{1}. The degree of relevance between each vector and a^{1} is denoted α. This raises a question: how does the self-attention module automatically determine how relevant two vectors are to each other?

This requires a module that computes attention scores: it takes two vectors as input and directly outputs the value of α.

3.1 Dot-product

The two input vectors are multiplied by two different matrices, W^{q} and W^{k}, to obtain the vectors q and k; the dot product of q and k then gives α.
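
A minimal sketch of the dot-product score, with illustrative shapes (4-dimensional inputs, 3-dimensional q and k) and random matrices:

```python
# Minimal sketch of the dot-product attention score.
import numpy as np

rng = np.random.default_rng(0)
a1, a2 = rng.normal(size=4), rng.normal(size=4)   # the two input vectors
Wq = rng.normal(size=(3, 4))                      # W^q
Wk = rng.normal(size=(3, 4))                      # W^k

q = Wq @ a1        # q = W^q a^1
k = Wk @ a2        # k = W^k a^2
alpha = q @ k      # dot product of q and k gives the score alpha
print(alpha)
```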

3.2 Additive

The two input vectors are again multiplied by two different matrices to obtain q and k, but here q and k are added together, passed through a tanh activation, and then through one more transform to obtain α.
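
A minimal sketch of the additive (Bahdanau-style) score under the same illustrative shapes; the extra weight vector w that maps the tanh output to a scalar is an assumption of this sketch:

```python
# Minimal sketch of an additive attention score.
import numpy as np

rng = np.random.default_rng(0)
a1, a2 = rng.normal(size=4), rng.normal(size=4)
Wq = rng.normal(size=(3, 4))
Wk = rng.normal(size=(3, 4))
w = rng.normal(size=3)          # assumed projection from the hidden vector to a scalar

q, k = Wq @ a1, Wk @ a2
alpha = w @ np.tanh(q + k)      # add, pass through tanh, then project to alpha
print(alpha)
```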

3.3 How α is computed in Self-Attention:

Take a^{1} as an example and compute its relevance α to each of a^{1}, a^{2}, a^{3}, a^{4}:

q: the query, which will be matched against every k
k: the key, which will be matched by every q
v: the value, i.e. the information extracted from a
 

After obtaining the four α, we apply a softmax to them to obtain the four normalized scores α', which also adds nonlinearity to the model. (The softmax here can be replaced by something else, such as ReLU.)

Then we use α' to extract the important information in the sequence: each a^{i} is multiplied by W^{v} to obtain v^{i}, each v^{i} is weighted by the corresponding α', and the weighted vectors are summed to give b^{1}, i.e. b^{1} = Σ_i α'_{1,i} v^{i}. The more relevant a^{i} is to a^{1}, the more its v^{i} contributes to b^{1}. Running the same calculation for the remaining a gives the other b. Each b is computed by taking the relevance between all of the a into account, and this relevance is what we call the attention mechanism; a layer that performs this calculation is called a self-attention layer.
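
A minimal sketch of this per-vector computation for b^{1}, with illustrative shapes (four 4-dimensional inputs, 3-dimensional q/k/v) and random weights:

```python
# Minimal sketch: compute b^1 from a^1..a^4 with dot-product self-attention.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))      # rows are a^1..a^4
Wq = rng.normal(size=(3, 4))
Wk = rng.normal(size=(3, 4))
Wv = rng.normal(size=(3, 4))

q1 = Wq @ A[0]                   # query from a^1
ks = A @ Wk.T                    # keys k^1..k^4   (shape 4 x 3)
vs = A @ Wv.T                    # values v^1..v^4 (shape 4 x 3)

alpha = ks @ q1                  # alpha_{1,i} = q^1 . k^i
alpha_prime = softmax(alpha)     # alpha'_{1,i} after softmax
b1 = alpha_prime @ vs            # b^1 = sum_i alpha'_{1,i} v^i
print(b1)
```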

From the matrix perspective: I is the input matrix; multiplying I by W^{q}, W^{k}, W^{v} gives Q, K, V; multiplying Q by the transpose of K gives A; applying softmax to A gives A', which we call the attention matrix; and multiplying A' by V gives O, the output of Self-Attention. The only parameters of Self-Attention that need to be learned are W^{q}, W^{k}, W^{v}.
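
The same computation in matrix form, as a minimal sketch (shapes illustrative; note that the Transformer paper additionally scales A by 1/√d_k before the softmax, which this sketch omits):

```python
# Minimal sketch of the matrix form: I -> Q, K, V -> A -> A' -> O.
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
I = rng.normal(size=(4, 4))     # input matrix, one row per a^i
Wq = rng.normal(size=(4, 3))    # W^q
Wk = rng.normal(size=(4, 3))    # W^k
Wv = rng.normal(size=(4, 3))    # W^v

Q, K, V = I @ Wq, I @ Wk, I @ Wv   # Q = I W^q, K = I W^k, V = I W^v
A = Q @ K.T                        # raw attention scores
A_prime = softmax_rows(A)          # attention matrix A' (softmax over each row)
O = A_prime @ V                    # output of self-attention
print(O.shape)                     # (4, 3): one output vector per input vector
```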

4. Advanced version: multi-head attention

In some cases there is more than one reasonable notion of relevance, which is where multi-head attention comes in: use several sets of q, k, v, compute a separate attention result for each head, then concatenate the head outputs and multiply them by a matrix to obtain the final output.
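
A minimal sketch of two-head attention, with illustrative shapes (per-head q/k/v dimension 2) and an assumed output matrix W^o that mixes the concatenated heads:

```python
# Minimal sketch of 2-head self-attention.
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
I = rng.normal(size=(4, 4))                      # 4 input vectors of dimension 4
heads = []
for _ in range(2):                               # one set of W^q, W^k, W^v per head
    Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
    Q, K, V = I @ Wq, I @ Wk, I @ Wv
    heads.append(softmax_rows(Q @ K.T) @ V)      # each head attends on its own

concat = np.concatenate(heads, axis=-1)          # splice the head outputs together
Wo = rng.normal(size=(4, 4))                     # assumed final mixing matrix W^o
O = concat @ Wo
print(O.shape)                                   # (4, 4)
```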

5. Positional encoding

The problem is that the mechanism described so far ignores the positions within the input sequence: positions 1, 2, 3, 4 all look the same to it, and shuffling the inputs makes no difference. Inputting "A hits B" or "B hits A" would therefore have the same effect, so we need to add position information, which self-attention does with positional encoding.

Each position i has a positional vector e^{i}; adding e^{i} to a^{i} tells the model that this input sits at position i.
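
A minimal sketch of adding e^{i} to a^{i}; the sinusoidal scheme below is the one from the Transformer paper, and learned position vectors are an equally valid choice:

```python
# Minimal sketch of positional encoding: add a position vector e^i to each a^i.
import numpy as np

def sinusoidal_encoding(seq_len, dim):
    pos = np.arange(seq_len)[:, None]                      # positions 0..seq_len-1
    i = np.arange(dim)[None, :]                            # dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))         # a^1..a^4, dimension 8
E = sinusoidal_encoding(4, 8)       # e^1..e^4
A_with_pos = A + E                  # a^i + e^i is what self-attention actually sees
print(A_with_pos.shape)             # (4, 8)
```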

If the sequence is very long, you can use truncated self-attention, which limits the range of positions considered when computing the attention scores.
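
A minimal sketch of truncated self-attention via masking, assuming a window of one position on each side (the window size and the masking approach are illustrative):

```python
# Minimal sketch of truncated self-attention: each position only attends to
# neighbours within a fixed window; everything outside the window is masked out.
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
I = rng.normal(size=(6, 4))                       # 6 input vectors
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
Q, K, V = I @ Wq, I @ Wk, I @ Wv

scores = Q @ K.T
pos = np.arange(len(I))
mask = np.abs(pos[:, None] - pos[None, :]) > 1    # True where |i - j| > window size
scores = np.where(mask, -np.inf, scores)          # block attention outside the window
O = softmax_rows(scores) @ V
print(O.shape)                                    # (6, 3)
```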

6. Self-Attention comparison

6.1 Self-Attention vs. CNN

Reference: https://arxiv.org/abs/1911.03584
CNN can be regarded as a simplified version of self-attention: a CNN only computes interactions within its receptive field, while self-attention considers the entire image. Conversely, self-attention is a more flexible CNN whose receptive field is learned automatically, so CNN is a special case of self-attention. The trade-off is that self-attention needs more training data, while CNN needs comparatively less.

6.2 Self-Attention vs. RNN

Reference:  https://arxiv.org/abs/2006.16236

RNNs suffer from forgetting long-range information, while self-attention does not. An RNN also produces its outputs serially, whereas self-attention processes all positions in parallel and produces all outputs at once, so self-attention is more efficient to compute.


Origin: blog.csdn.net/Zosse/article/details/124838923