Super Detailed Illustration of Self-Attention

ConquerJ

I first came into contact with the Transformer a year ago. At the time, I only felt that the model was complicated and the steps were convoluted. After studying the paper for many days, I still did not fully understand the reasoning; I just memorized some terms, the internal mechanism remained completely opaque, and the formulas were quickly forgotten.

Self-Attention is the core idea of the Transformer. I have re-read the paper in recent days and have some new thoughts, so I am writing this article to share them with readers.

When the author first came into contact with Self-Attention, what I understood least were the three matrices Q, K, V and the "query vector" that is so often mentioned. Looking back, the reason should be that I was stumped by high-dimensional, complicated matrix operations and did not really grasp the core meaning of matrix multiplication. Therefore, before this article starts, the author first reviews some basic knowledge; the article will then revisit how the ideas contained in this knowledge are reflected in the model.

Some basics

  1. What is the inner product of two vectors, how is it computed, and, most importantly, what is its geometric meaning? (A small code sketch follows this list.)
  2. What does it mean to multiply a matrix X by its own transpose?
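As a quick warm-up for the first question, here is a minimal sketch (the two vectors are made up purely for illustration) showing that the inner product equals the product of the norms and the cosine of the angle between the vectors, i.e. it measures how far one vector projects onto the other:

import torch

# Two made-up 2-D vectors, purely for illustration
a = torch.tensor([3.0, 0.0])
b = torch.tensor([2.0, 2.0])

dot = torch.dot(a, b)              # inner product of a and b
cos = dot / (a.norm() * b.norm())  # cosine of the angle between a and b
proj = dot / a.norm()              # length of b's projection onto a

print(dot.item())   # 6.0
print(cos.item())   # ~0.7071, i.e. a 45-degree angle
print(proj.item())  # 2.0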

1. Key-value pair attention

In this section, we first analyze the most central part of the Transformer. We start from the formula and draw each step as a figure to make it easier for readers to understand.

The core formula of key-value-pair attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, is shown in the figure below. This formula actually contains quite a few ideas; let's go through them one by one. Readers are invited to follow my train of thought, starting from the most central part, and the details will suddenly become clear.

If the formula above is hard to understand, can readers tell what the following, simpler formula means?

Let's put the three matrices Q, K, V aside for now; the most primitive form of self-attention is actually just softmax(XXᵀ)X. So what exactly does this formula mean?

Let's go through it step by step.

What does XXᵀ represent?

What is the result of multiplying a matrix by its own transpose, and what is the point of doing so?

 

We know that a matrix can be regarded as a stack of row vectors, so multiplying a matrix by its own transpose amounts to computing the inner product of each of these vectors with every other vector. (Recall how matrix multiplication works: the first row times the first column, the first row times the second column... and the first column of the transpose is just the first row of the original matrix. So "first row times first column" is the inner product of the first row vector with itself, "first row times second column" is the inner product of the first row vector with the second row vector, "first row times third column" is the inner product of the first row vector with the third row vector, and so on.)
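This claim is easy to check numerically; the matrix below is made up just for illustration. Every entry (i, j) of XXᵀ is exactly the inner product of row i and row j of X:

import torch

# A made-up 3 x 2 matrix: three row vectors in 2-D space
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

G = X @ X.T  # 3 x 3 matrix of pairwise inner products

print(G[0, 1].item())                # 11.0: entry (0, 1) of X @ X.T
print(torch.dot(X[0], X[1]).item())  # 11.0: the inner product of row 0 and row 1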

Recalling the question raised at the beginning of our article, what is the geometric meaning of the inner product of vectors?

Answer: it characterizes the angle between two vectors and represents the projection of one vector onto the other.

Keeping this knowledge point in mind, let's work through a super detailed example:

We assume that X is a two-dimensional matrix and x_i is one of its row vectors (in fact, many textbooks default to column vectors; for the convenience of the example, please bear with the author's use of row vectors). In the figure below, x_1 corresponds to the embedding of the word "early", and so on.

 

The following operation simulates one step of this process, namely multiplying the first row vector x_1 by Xᵀ. Let's see what the result means.

First, the row vector x_1 takes the inner product with itself and with the other two row vectors (i.e. the inner products of "early" with "early", "up" and "good" are computed), giving a new vector. Recall from above that the inner product of two vectors characterizes the angle between them and the projection of one onto the other. So what does this new vector mean? It is the projection of the row vector x_1 onto itself and onto the other two row vectors. Now think: what does a large projection value mean? And what if the projection value is small?

The larger the projection value, the higher the correlation between the two vectors.

Consider: if the angle between the two vectors is 90 degrees, then the two vectors are linearly independent and have no correlation at all!

Furthermore, these vectors are word vectors, i.e. the numerical mapping of words into a high-dimensional space. What does a high correlation between word vectors imply? Does it mean, to a certain extent (not entirely), that when attending to word A, more attention should also be paid to word B?

The figure above shows the result for a single row vector; so what is the meaning of the full matrix XXᵀ?

The matrix XXᵀ is a square matrix. Understood from the perspective of row vectors, it stores the result of the inner product of each row vector with itself and with every other row vector.

So far, we understand the meaning of XXᵀ in the formula softmax(XXᵀ)X. Let's go a step further: what is the significance of the softmax? Please see the figure below.

Let's recall the softmax formula, softmax(z)_i = exp(z_i) / Σ_j exp(z_j). What is the significance of the softmax operation?

Answer: normalization

As the figure above shows, after the softmax the numbers in each row sum to 1. Now think again: what is the core of the attention mechanism?

weighted sum

So where do the weights come from? They are exactly these normalized numbers. When we attend to the word "early", we should allocate 0.4 of the attention to the word itself, 0.4 to "up", and 0.2 to "good". Of course, in the actual Transformer this weighting is applied to vectors, but that comes later.
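A minimal sketch of this step (the raw scores below are invented so that they reproduce roughly the 0.4 / 0.4 / 0.2 split described above): softmax turns one row of inner-product scores into weights that sum to 1.

import torch
import torch.nn.functional as F

# Made-up inner-product scores of "early" against "early", "up", "good"
scores = torch.tensor([0.693, 0.693, 0.0])

weights = F.softmax(scores, dim=-1)
print(weights)        # approximately tensor([0.4000, 0.4000, 0.2000])
print(weights.sum())  # tensor(1.), i.e. normalized attention weights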

Does this start to look familiar? In a heatmap plotted in Python, doesn't the matrix likewise store similarity results?

The fog seems to be lifting: we already understand half of the formula softmax(XXᵀ)X. What is the point of the final X? What does the complete formula express? Let's continue the previous calculation; please see the figure below.

 

Let us take one row vector of softmax(XXᵀ) as an example. What does multiplying this row vector by a column vector of X represent?

Looking at the figure above, multiplying the row vector by the first column of X gives one element of a new row vector; multiplying it by all of X gives the complete new row vector, which has the same dimension as a row vector of X.

In the new vector, the value in each dimension is the weighted sum of the three word vectors' values in that dimension. This new row vector is exactly the representation of the word "early" after the attention-weighted sum.
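Putting the pieces together on a toy example (all numbers are invented): one row of softmax(XXᵀ) weights the rows of X, and the weighted sum is the new representation of that word.

import torch
import torch.nn.functional as F

# Three made-up word vectors ("early", "up", "good") stacked as rows of X
X = torch.tensor([[1.0, 0.0],
                  [0.8, 0.2],
                  [0.1, 0.9]])

A = F.softmax(X @ X.T, dim=-1)  # each row: attention weights over the three words
out = A @ X                     # each row: weighted sum of the word vectors

# The first row of `out` is the new representation of "early"
print(out[0])
print(A[0, 0] * X[0] + A[0, 1] * X[1] + A[0, 2] * X[2])  # the same weighted sum, written out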

A more vivid picture looks like this: the color depth in the right half of the figure is precisely the size of the values in the yellow vector from the figure above, and its meaning is the correlation between words (recall the earlier content: that correlation is essentially measured by the inner product of the vectors)!

If you have persisted in reading this far, I believe you now have a deeper understanding of the formula softmax(XXᵀ)X.

Next, we explain some of the finer details in the original formula.

2. The Q, K, V matrices

Q, K and V did not appear in our previous example because they are not the most essential part of the formula.

What exactly are Q, K and V? Let's look at the figure below.

In fact, the Q, K, V matrices and the query vectors mentioned in so many articles are simply the products of X with three weight matrices (commonly written W^Q, W^K, W^V); in essence, they are all linear transformations of X.

Why not use X directly, but instead apply a linear transformation to it?

Naturally, to improve the model's fitting capacity: the weight matrices are trainable and act as a buffer.
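A minimal sketch of this idea (the dimensions are chosen arbitrarily): Q, K and V are nothing more than X multiplied by three trainable weight matrices, written here as plain matrix products rather than nn.Linear layers.

import torch
import torch.nn.functional as F

seq_len, d_model, d_k = 3, 4, 4      # arbitrary toy dimensions
X = torch.randn(seq_len, d_model)

# Three trainable weight matrices (here just randomly initialized)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # Q, K, V are linear transformations of X

out = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V
print(out.shape)                     # torch.Size([3, 4])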

 

If you have really understood the preceding content and the meaning of the matrix softmax(XXᵀ), I believe you also understand the meaning of the so-called query vector.

3. The significance of √d_k

Assume that the elements of Q and K have mean 0 and variance 1. Then the elements of A = QKᵀ have mean 0 and variance d_k. When d_k becomes large, the variance of the elements of A also becomes very large, and when the variance of the elements of A is large, the distribution of softmax(A) tends to become steep (a distribution with large variance concentrates its mass in the regions of large absolute value). In short, the distribution of softmax(A) depends on d_k. Dividing every element of A by √d_k brings the variance back to 1. This decouples the "steepness" of the distribution of softmax(A) from d_k, so that the gradient values remain stable during training.
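This variance argument is easy to verify numerically (toy dimensions, random data): without scaling, the variance of the entries of QKᵀ is roughly d_k; dividing by √d_k brings it back to roughly 1.

import torch

d_k = 512
Q = torch.randn(1000, d_k)  # entries with mean 0 and variance 1
K = torch.randn(1000, d_k)

scores = Q @ K.T
print(scores.var().item())                 # roughly d_k, i.e. about 512
print((scores / d_k ** 0.5).var().item())  # roughly 1 after dividing by sqrt(d_k)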

So far, the core content of Self-Attention has been explained. For more details about the Transformer, please refer to my answer:

Transformer - Attention is all you need


Finally, I would like to add that self-attention performs attention between every pair of input vectors, so the order of the input sequence is not taken into account. Put more plainly, in the calculation above every word vector computes inner products with the other word vectors, and the result loses the order information of the original text. By contrast, an LSTM encodes the text's sequence information through the order in which it processes and outputs the word vectors, whereas our calculation above never refers to the order of the sequence at all: if you shuffle the order of the word vectors, each word still ends up with exactly the same representation.
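This order-blindness can be checked directly (random toy data): shuffling the rows of X only shuffles the rows of softmax(XXᵀ)X in the same way, so each word's new representation is unchanged and no positional information is used.

import torch
import torch.nn.functional as F

X = torch.randn(5, 8)     # five toy "word vectors"
perm = torch.randperm(5)  # a random shuffle of the sequence

attn = lambda X: F.softmax(X @ X.T, dim=-1) @ X

out1 = attn(X)[perm]               # attend first, then shuffle the outputs
out2 = attn(X[perm])               # shuffle the inputs, then attend
print(torch.allclose(out1, out2))  # True (up to floating-point error)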

This touches on the Transformer's positional encoding, which we will not cover here.

Code Implementation of Self-Attention

# Implementation of the Self-Attention mechanism
from math import sqrt
import torch
import torch.nn as nn


class Self_Attention(nn.Module):
    # input : batch_size * seq_len * input_dim
    # q : nn.Linear(input_dim, dim_k)
    # k : nn.Linear(input_dim, dim_k)
    # v : nn.Linear(input_dim, dim_v)
    def __init__(self,input_dim,dim_k,dim_v):
        super(Self_Attention,self).__init__()
        self.q = nn.Linear(input_dim,dim_k)
        self.k = nn.Linear(input_dim,dim_k)
        self.v = nn.Linear(input_dim,dim_v)
        self._norm_fact = 1 / sqrt(dim_k)
        
    
    def forward(self,x):
        Q = self.q(x) # Q: batch_size * seq_len * dim_k
        K = self.k(x) # K: batch_size * seq_len * dim_k
        V = self.v(x) # V: batch_size * seq_len * dim_v
         
        atten = nn.Softmax(dim=-1)(torch.bmm(Q, K.permute(0, 2, 1)) * self._norm_fact) # softmax(Q * K^T / sqrt(dim_k)) : batch_size * seq_len * seq_len
        
        output = torch.bmm(atten,V) # Q * K.T() * V # batch_size * seq_len * dim_v
        
        return output
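
# Quick sanity check of the module above, with made-up dimensions
x = torch.randn(4, 10, 64)  # batch_size = 4, seq_len = 10, input_dim = 64
model = Self_Attention(input_dim=64, dim_k=32, dim_v=16)
print(model(x).shape)       # torch.Size([4, 10, 16])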

# Implementation of the Multi-head Attention mechanism
from math import sqrt
import torch
import torch.nn as nn


class Self_Attention_Muti_Head(nn.Module):
    # input : batch_size * seq_len * input_dim
    # q : nn.Linear(input_dim, dim_k)
    # k : nn.Linear(input_dim, dim_k)
    # v : nn.Linear(input_dim, dim_v)
    def __init__(self,input_dim,dim_k,dim_v,nums_head):
        super(Self_Attention_Muti_Head,self).__init__()
        assert dim_k % nums_head == 0
        assert dim_v % nums_head == 0
        self.q = nn.Linear(input_dim,dim_k)
        self.k = nn.Linear(input_dim,dim_k)
        self.v = nn.Linear(input_dim,dim_v)
        
        self.nums_head = nums_head
        self.dim_k = dim_k
        self.dim_v = dim_v
        self._norm_fact = 1 / sqrt(dim_k)
        
    
    def forward(self,x):
        batch_size, seq_len, _ = x.shape
        dk_per_head = self.dim_k // self.nums_head
        dv_per_head = self.dim_v // self.nums_head

        # split the projections into nums_head heads : batch_size * nums_head * seq_len * (dim per head)
        Q = self.q(x).reshape(batch_size, seq_len, self.nums_head, dk_per_head).permute(0, 2, 1, 3)
        K = self.k(x).reshape(batch_size, seq_len, self.nums_head, dk_per_head).permute(0, 2, 1, 3)
        V = self.v(x).reshape(batch_size, seq_len, self.nums_head, dv_per_head).permute(0, 2, 1, 3)

        # scaled dot-product attention per head : batch_size * nums_head * seq_len * seq_len
        atten = nn.Softmax(dim=-1)(torch.matmul(Q, K.permute(0, 1, 3, 2)) * self._norm_fact)

        # weighted sum of V, then concatenate the heads : batch_size * seq_len * dim_v
        output = torch.matmul(atten, V).permute(0, 2, 1, 3).reshape(batch_size, seq_len, self.dim_v)

        return output
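
# A similar check for the multi-head version (dim_k and dim_v must be divisible by nums_head)
x = torch.randn(4, 10, 64)  # batch_size = 4, seq_len = 10, input_dim = 64
model = Self_Attention_Muti_Head(input_dim=64, dim_k=32, dim_v=16, nums_head=4)
print(model(x).shape)       # torch.Size([4, 10, 16])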


On the basis of this article, the author implemented the Transformer model from scratch. Interested readers are welcome to take a look~

After staying up all night, I implemented the Transformer model from scratch, walking you through the code.




Origin blog.csdn.net/sinat_37574187/article/details/132412961