Attention mechanism, Encoder, Decoder

attention mechanism

At the beginning, let's think about a question: Suppose I want to find a job, how can I know how much salary I may get in a company?
Average the salary of everyone in the company? This is obviously unreasonable.
A reasonable idea is to find the average salary of people in this company who have the same or similar positions as me and have similar education and work experience to me. Such an average is obviously closer to the salary I may get.
In this analogy, each employee's experience in the company and the corresponding salary are a Key and a Value (K_i, V_i),
I am the Query (Q), and the salary I want to estimate is the Value being requested.

The purpose of the attention mechanism is to find a function F[Q, (K_i, V_i)] that uses the Query to locate similar Keys, assigns each Key a weight according to its similarity to the Query, and then combines these weights with the corresponding Values to compute the value for the target Query.
This narrowing down to the relevant (K_i, V_i) is the attention.

Attention mechanism: a neural network receives many vectors of different sizes as input, and there are relationships between these vectors. In ordinary training, however, these relationships cannot be fully exploited, which leads to poor results. The attention mechanism forces the model to learn to focus on specific parts of the input sequence when decoding, rather than relying solely on the decoder's own input.

Definition: the ability to select a small amount of useful information from a large amount of input, focus on it, and ignore the rest is called attention.

Calculation:

  1. Input the Query, Key, and Value;
  2. Compute the correlation/similarity between the Query and each Key (common choices are the dot product and cosine similarity; the dot product is the usual one) to get the attention scores;
  3. Scale the attention scores (divide by the square root of the dimension), then normalize with softmax to obtain the weight coefficients;
  4. Use the weight coefficients to take a weighted sum of the Values, which gives the Attention Value (at this point V carries attention information: more important information receives more attention and unimportant information is ignored). A minimal code sketch follows the formulas below;

$$f\left(\mathbf{q},\left(\mathbf{k}_{1}, \mathbf{v}_{1}\right), \ldots,\left(\mathbf{k}_{m}, \mathbf{v}_{m}\right)\right)=\sum_{i=1}^{m} \alpha\left(\mathbf{q}, \mathbf{k}_{i}\right) \mathbf{v}_{i} \in \mathbb{R}^{v}$$

$$\alpha\left(\mathbf{q}, \mathbf{k}_{i}\right)=\operatorname{softmax}\left(a\left(\mathbf{q}, \mathbf{k}_{i}\right)\right)=\frac{\exp \left(a\left(\mathbf{q}, \mathbf{k}_{i}\right)\right)}{\sum_{j=1}^{m} \exp \left(a\left(\mathbf{q}, \mathbf{k}_{j}\right)\right)} \in \mathbb{R}$$
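
As a concrete illustration of the four steps and the two formulas above, here is a minimal NumPy sketch of scaled dot-product attention. The function name and the toy Q, K, V below are chosen for this example only and are not from the original post.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v) -> (n_q, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # steps 2-3: dot-product similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax -> weight coefficients
    return weights @ V                                  # step 4: weighted sum of the Values

# toy example: one query against three (key, value) pairs
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [100.0], [2.0]])
print(scaled_dot_product_attention(Q, K, V))  # the dissimilar second key's value is down-weighted
```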

[Figure: attention pooling, in which a scoring function a combines the query with each key to produce weights over the values]

As shown in the figure above, choosing a different attention scoring function a leads to a different attention pooling operation.

Some points to note:

  1. Why do we scale before the softmax, and why divide by the square root of the dimension?
  2. We scale because softmax normalization otherwise misbehaves: when one element is much larger than the rest before scaling, softmax assigns almost all of the probability to it, producing a nearly one-hot vector, and backpropagation through such a softmax gives vanishing gradients. Scaling before the softmax alleviates this problem (see the sketch after this list).
  3. We divide by the square root of the dimension because we want the input to the softmax to have mean 0 and variance 1.
  4. K and V are generally the same, or at least closely related.
  5. The new vector, the Attention Value, represents the Key and the Value (Key and Value are usually the same), and it also carries information from Q. In short: use the Query to scan the Keys, find the important ones, and combine them with the Values into a new vector that represents the Key.
  6. Why can't the similarity be obtained by multiplying the Key with itself? Why create a new Q?
  7. If the similarity were computed by multiplying the Key with itself, the result would be a symmetric matrix, which amounts to projecting the Key into a single space and weakens the generalization ability.
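
A quick numerical sketch of points 2 and 3, with arbitrary toy numbers: unscaled dot products of d-dimensional random vectors have variance on the order of d, so the softmax tends to collapse toward a one-hot vector, while dividing by the square root of d keeps the weights spread out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 512
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal((4, d))

scores = k @ q                       # unscaled dot products, variance roughly d
print(softmax(scores))               # usually close to one-hot -> near-zero gradients through softmax
print(softmax(scores / np.sqrt(d)))  # scaled scores have variance roughly 1 -> smoother weights
```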

self-attention mechanism

From the attention mechanism above we can see that there is no essential difference between Key and Query, except that one is known and the other is unknown. In the self-attention mechanism, Query, Key, and Value all come from the same known sequence X, and the model learns, in a cloze-like fashion, the relationship between each X_i and every other X_k (k ≠ i).

The difference between Attention and Self-Attention:

  1. In Attention, K and V often come from the same source (they may also come from different ones), and there is no requirement on Q at all, so attention is a very broad concept: there is no rule on where Q, K, and V come from, and any procedure that multiplies Q with K to compute similarity counts as an attention mechanism (hence there are also channel attention and spatial attention mechanisms);
  2. Self-Attention is a special case of Attention that requires Q, K, and V to come from the same source; they all still represent X and are essentially equal, being the same word vectors X multiplied by different parameter matrices, i.e. put through a spatial transformation;

Calculation:

  1. Obtain Q, K, V by multiplying X with the shared parameter matrices W_Q, W_K, W_V;
  2. The following steps are exactly the same as in the attention mechanism;

[Figures: Q, K, V are produced from X with W_Q, W_K, W_V, and self-attention combines them into Z]

Z is essentially still a representation of X, but it carries new information: each row of Z encodes the similarity between the corresponding vector and all the vectors in X, which lets x1, x2, ... focus on the more important word vectors. A toy sketch is given below.
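
To make this concrete, here is a minimal single-head self-attention sketch in NumPy; the toy shapes (4 tokens, dimension 8) and variable names are assumptions for illustration. Q, K, and V all come from the same X, and Z has one row per input vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8              # 4 tokens, embedding size 8 (toy sizes)

X = rng.standard_normal((n, d_model))  # the input sequence
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # step 1: Q, K, V are all derived from X

scores = Q @ K.T / np.sqrt(d_k)        # step 2: same as the plain attention mechanism
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
Z = weights @ V                        # each row of Z mixes information from every x_i
print(Z.shape)                         # (4, 8): one new vector per input vector
```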

Positional encoding

Why is positional encoding needed?
Self-attention is computed in parallel: it calculates the correlation between each position and every other position at the same time, which means there is no sequential relationship between the words. So if you shuffle a sentence, the word vectors produced for that sentence stay the same; there is no positional relationship at all.

Summary: unlike an RNN, self-attention has no built-in ordering from which the position of each element can be read off; it is only responsible for computing the correlation between each position and the others. Positional encoding was proposed to solve this problem, as the small sketch below illustrates.
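
A small sketch of this point, reusing the toy self-attention above (the helper function and shapes are assumptions): shuffling the input rows only shuffles the output rows in the same way, so self-attention by itself carries no order information.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) for _ in range(3)]

perm = np.array([2, 0, 3, 1])            # "mess up the sentence"
Z, Z_shuffled = self_attention(X, *W), self_attention(X[perm], *W)
print(np.allclose(Z[perm], Z_shuffled))  # True: only the order changed, not the vectors
```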

How to do position encoding?
[Figures: word embeddings and positional encodings are added element-wise]

  1. Encode the word to generate the embedding x1 (shape1);
  2. Generate the corresponding positional encoding t1 (the same shape1); as above, t1 carries the positional relationship between x1 and x2 and between x1 and x3;
  3. Add the two (embedding + positional encoding) to produce the final input feature, an embedding with a time signal (shape1);

How the positional encoding is generated
1. Sine and cosine generation
$$\begin{array}{c} PE_{(pos,\,2i)}=\sin\left(pos / 10000^{2i/d}\right) \\ PE_{(pos,\,2i+1)}=\cos\left(pos / 10000^{2i/d}\right) \end{array}$$

With the above formula we can obtain the $d_{model}$-dimensional position vector for a given position, and by the properties of trigonometric functions
$$\left\{\begin{array}{l} \sin(\alpha+\beta)=\sin\alpha\cos\beta+\cos\alpha\sin\beta \\ \cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\alpha\sin\beta \end{array}\right.$$
we obtain:
$$\left\{\begin{array}{l} PE(pos+k,\,2i)=PE(pos,\,2i)\times PE(k,\,2i+1)+PE(pos,\,2i+1)\times PE(k,\,2i) \\ PE(pos+k,\,2i+1)=PE(pos,\,2i+1)\times PE(k,\,2i+1)-PE(pos,\,2i)\times PE(k,\,2i) \end{array}\right.$$
It can be seen that, for any fixed dimension $2i$ or $2i+1$, the position vector at position $pos+k$ can be expressed as a linear combination of the position vectors at positions $pos$ and $k$; such a linear combination means that the relative position information is contained in the position vector.

We said that positional encoding lets t1 carry the positional relationship between x1 and x2, and between x1 and x3. Why is that, seen from the formulas?

sin(pos+k) = sin(pos)*cos(k) + cos(pos)*sin(k)  # sin is used for the even dimensions
cos(pos+k) = cos(pos)*cos(k) - sin(pos)*sin(k)  # cos is used for the odd dimensions

These formulas tell us that position $pos+k$ is a linear combination of positions $pos$ and $k$; in other words, the encoding of position $pos+k$ already contains information about positions $pos$ and $k$.
Therefore, if the order of the words is shuffled, the positional encodings change accordingly, which solves the Transformer's lack of order information. A NumPy sketch of this encoding follows.
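
A minimal NumPy sketch of the sine/cosine formula above, with toy values for max_len and d_model; the last two lines show the "embedding + positional encoding" addition described earlier.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]          # dimension pairs (2i, 2i+1)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
embeddings = np.random.randn(50, 8)               # stand-in word embeddings
inputs_with_position = embeddings + pe            # embedding + positional encoding
```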

2. Self-learning generation
This method is simple: randomly initialize a parameter with the same shape as x1 and let it be learned automatically while training the network; a sketch follows.
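
One common way to implement this learned variant (a sketch, not the original post's code) is a PyTorch nn.Embedding table that is randomly initialized and then trained along with the rest of the network.

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 768
pos_embedding = nn.Embedding(max_len, d_model)   # randomly initialized, learned during training

token_embeddings = torch.randn(1, 10, d_model)   # (batch, seq_len, d_model) stand-in embeddings
positions = torch.arange(10).unsqueeze(0)        # positions 0 .. 9
x = token_embeddings + pos_embedding(positions)  # same "embedding + positional encoding" addition
```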

Multi-Head Attention Mechanism

The new word vectors obtained by multi-head self-attention are a further improvement over the word vectors obtained by single-head self-attention.

What is multi-head? (8 heads are usually used)
[Figure: multi-head attention]

Theoretical approach:

  1. Input X;
  2. Use 8 single heads, i.e. 8 groups of W_Q, W_K, W_V, and run self-attention with each group to obtain Z_0 - Z_7;
  3. Concatenate Z_0 - Z_7;
  4. Apply one more linear transformation (dimension reduction) to get Z (see the sketch after this list);
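
A minimal NumPy sketch of this "theoretical" version with toy dimensions (4 tokens, d_model = 64, 8 heads); the helper name attention and the output matrix W_O are chosen for this illustration.

```python
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d_model, h = 4, 64, 8
d_head = d_model // h                                # each head works in a smaller subspace

X = rng.standard_normal((n, d_model))
heads = []
for _ in range(h):                                   # 8 single heads = 8 groups of W_Q, W_K, W_V
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # Z_0 ... Z_7, each (n, d_head)

W_O = rng.standard_normal((h * d_head, d_model))
Z = np.concatenate(heads, axis=-1) @ W_O             # concatenate, then a final linear transformation
print(Z.shape)                                       # (4, 64)
```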

Source-code approach:

  1. Input X;
  2. Generate Q, K, V from X with W_Q, W_K, W_V;
  3. Split Q into q_0 - q_7, K into k_0 - k_7, and V into v_0 - v_7;
  4. Heads head_0 - head_7 each run self-attention on their own (q_i, k_i, v_i);
  5. Concatenate z_0 - z_7;
  6. Apply one more linear transformation (dimension reduction) to get Z (see the sketch after this list);
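
And a sketch of the "source code" version with the same toy dimensions: one large W_Q, W_K, W_V each, then Q, K, V are split into 8 heads, attended per head, concatenated, and linearly transformed (again, names and shapes are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 4, 64, 8
d_head = d_model // h

X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) for _ in range(3))
W_O = rng.standard_normal((d_model, d_model))

# one projection each, then split into h heads: (n, d_model) -> (h, n, d_head)
Q = (X @ W_Q).reshape(n, h, d_head).transpose(1, 0, 2)
K = (X @ W_K).reshape(n, h, d_head).transpose(1, 0, 2)
V = (X @ W_V).reshape(n, h, d_head).transpose(1, 0, 2)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # per-head attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
heads = weights @ V                                     # (h, n, d_head)

Z = heads.transpose(1, 0, 2).reshape(n, d_model) @ W_O  # concatenate heads, final linear map
print(Z.shape)                                          # (4, 64)
```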

Why multiple heads? What is the effect?
The essence of machine learning, $y = \sigma(wx + b)$, is a nonlinear transformation: it transforms data x (in an unsuitable form) into data y (in a suitable one).
The essence of a nonlinear transformation is a spatial transformation: it changes the position coordinates in space.
The essence of self-attention is to map point X in the original data space to point Z in a new space through a nonlinear transformation.
Multi-head self-attention takes the input X, maps it into 8 different subspaces through nonlinear transformations, and then uses these 8 subspaces to find the final point Z in the new space. This captures richer feature information and gives better results.

Simply put, multiple heads make it possible to observe the relationship between Query and Key from more different perspectives, and therefore to compute better weights between Q and K.


Origin blog.csdn.net/Lc_001/article/details/129437318