Several questions about Transformer

      Transformer was proposed in "Attention Is All You Need" and its structure is as follows:

[Figure: the Transformer architecture from "Attention Is All You Need"]

       There are many articles explaining Transformer, so I won’t repeat them here. You can refer to Document 1 and Document 2.

      This article just records some of the problems I encountered while reading the Transformer paper, together with my understanding of them.

Question 1: Why divide by $\sqrt{d_k}$?

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
      When $d_k$ increases, the dot products between q and k become larger in magnitude; as soon as one $q_i \cdot k_i$ is slightly larger than the others, most of the values after softmax become very small, close to 0, making their gradients very small.

      We assume that the components of q and k both follow the standard normal distribution, i.e. their mean is 0 and their variance is 1. After the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, the result has mean 0 and variance $d_k$ (this can be deduced from the properties of the mean and variance of a sum of independent variables). When $d_k$ is very large, the variance of $q \cdot k$ is very large and the values spread out over a region of large absolute values, which causes the outputs of softmax() to become polarized.

      As a simple experiment, take $d_k$ equal to 5 and 100 respectively, randomly generate q and k of dimension $d_k$ from the standard normal distribution, take their dot product and pass it through the softmax function. The resulting images are as follows:
      When $d_k = 5$:
[Figure: softmax output when $d_k = 5$]
      When $d_k = 100$:
[Figure: softmax output when $d_k = 100$]
      When $d_k = 100$ and the dot product of q and k is additionally divided by $\sqrt{d_k}$, the image becomes:
[Figure: softmax output when $d_k = 100$ with the $1/\sqrt{d_k}$ scaling]
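      The post does not include the script behind these figures, so here is a minimal sketch of the same experiment (one query scored against 10 keys; the softmax helper and all variable names are my own, assuming numpy and matplotlib):

# Effect of 1/sqrt(d_k) scaling on softmax, for d_k = 5 and d_k = 100
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def softmax(x):
    e = np.exp(x - x.max())              # subtract the max for numerical stability
    return e / e.sum()

for d_k in (5, 100):
    q = np.random.randn(d_k)             # query ~ N(0, 1)
    K = np.random.randn(10, d_k)         # 10 keys ~ N(0, 1)
    scores = K @ q                       # dot products, variance roughly d_k
    plt.bar(range(10), softmax(scores), label='unscaled')
    plt.bar(range(10), softmax(scores / np.sqrt(d_k)), alpha=0.5, label='scaled')
    plt.title(f'd_k = {d_k}')
    plt.legend()
    plt.show()

      With $d_k = 100$ the unscaled softmax should collapse almost entirely onto a single key, while the scaled version stays much smoother, which is what the figures above show.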
Question 2: The role of multi-head attention

      Attention maps query and key into the same high-dimensional space to compute similarity. Multi-head attention instead splits query and key into h smaller pieces and maps them into different subspaces of that high-dimensional space to compute similarity. The advantage is that, with the total number of parameters unchanged, attention follows different distributions in the different subspaces, and concatenating the heads diversifies the information in the attention layer and increases its expressive power.
[Figure: multi-head attention]
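
      As an illustration of this splitting into subspaces, here is a small shape-only sketch (d_model = 512 and h = 8 follow the paper; the tensor names are my own, assuming PyTorch):

# Splitting a 512-dimensional representation into 8 heads of 64 dimensions each
import torch

batch_size, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h                                   # 64 dimensions per head

x = torch.randn(batch_size, seq_len, d_model)
# reshape so that every head works in its own 64-dimensional subspace
heads = x.view(batch_size, seq_len, h, d_k).transpose(1, 2)
print(heads.shape)                                   # torch.Size([2, 8, 10, 64])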

Question 3: positional encoding

      Because the Transformer deals with sequences but has no built-in ability to capture positions in the data, additional position information has to be added. This position information can be absolute or relative, trainable or fixed. In the paper, the authors propose using sin and cos to represent the relative position information of the sequence:
$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
      where pos is the position and i is the dimension.

      Sin and cos are periodic functions, and for any input value the function value is uniquely determined. When a position $pos$ is offset by $k$ units (written $pos + k$), $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, because $\sin(\omega(pos + k)) = \sin(\omega\,pos)\cos(\omega k) + \cos(\omega\,pos)\sin(\omega k)$ and $\cos(\omega(pos + k)) = \cos(\omega\,pos)\cos(\omega k) - \sin(\omega\,pos)\sin(\omega k)$.
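
      Below is a minimal sketch of this sinusoidal encoding (assuming PyTorch; the class name, the max_len default and the use of register_buffer are my own choices, not taken from the post):

# Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()            # pos
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))              # 1 / 10000^(2i / d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))                         # shape (1, max_len, d_model)

    def forward(self, x):
        # add the encodings of the first seq_len positions to the embeddings
        return x + self.pe[:, :x.size(1)]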

Question 4: Layer Norm and Batch Norm

      Assume the input is [batch_size,channels,H,W]
[Figure: the dimensions normalized by Batch Norm (first picture) and by Layer Norm (second picture)]

      Batch Norm, as shown in the first picture, computes the mean and variance per channel over all samples in the mini-batch (batch_size), and then normalizes with them. Batch Norm therefore depends on batch_size: when batch_size is small, the statistics of a mini-batch are not enough to represent the distribution of the whole dataset, and the computed running mean and variance require additional storage space. It is better suited to fixed-depth DNNs and CNNs.

      Layer Norm, as shown in the second picture, computes the mean and variance within each individual sample and normalizes with them. It does not depend on batch_size or on the length of the input sequence, and is better suited to RNNs.
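
      A small sketch of which dimensions each method averages over, for the [batch_size, channels, H, W] input above (assuming PyTorch; the concrete shapes are arbitrary):

# Batch Norm vs Layer Norm on an input of shape [batch_size, channels, H, W]
import torch
import torch.nn as nn

x = torch.randn(4, 3, 8, 8)                  # [batch_size, channels, H, W]

bn = nn.BatchNorm2d(3)                       # statistics per channel, over N, H, W
ln = nn.LayerNorm([3, 8, 8])                 # statistics per sample, over C, H, W

# the same statistics computed "by hand"
bn_mean = x.mean(dim=(0, 2, 3))              # one mean per channel -> shape (3,)
ln_mean = x.mean(dim=(1, 2, 3))              # one mean per sample  -> shape (4,)

print(bn(x).shape, ln(x).shape)              # both keep the shape [4, 3, 8, 8]
print(bn_mean.shape, ln_mean.shape)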

2023/5/18: some code notes from learning the Transformer again.
The basic reference for this part of the code and text is https://b23.tv/Vw6IYII

mask tensor

A mask tensor generally consists of 0s and 1s that indicate whether each position is masked. In the Transformer, the mask tensor is mainly used in attention: some values in the attention tensor could otherwise be computed with knowledge of the future, because during training the whole target sequence is embedded and processed at once. In general, however, the decoder cannot produce the final result in one go; it has to build it up step by step from previous outputs, so future information must not be used in advance. A mask tensor is therefore needed to hide future positions.

# Build the mask tensor for subsequent positions
import numpy as np
import torch
import matplotlib.pyplot as plt

def subsequent_mask(size):
    atten_shape = (1, size, size)
    # the upper triangle above the diagonal is 1: these are the "future" positions
    subsequent_mask = np.triu(np.ones(atten_shape), k=1).astype('uint8')
    # invert so that 1 means visible and 0 means masked
    return torch.from_numpy(1 - subsequent_mask)

plt.figure(figsize=(5, 5))
plt.imshow(subsequent_mask(20)[0])
plt.show()

[Figure: visualization of subsequent_mask(20)]
The figure above illustrates the mask tensor: yellow marks the occluded part and purple the visible part; the horizontal axis is the position of the target word, and the vertical axis is the positions it can view. For example, at x = 5, y = 4, the word at the fifth position can only see the first four words.

attention code

# Attention mechanism
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    '''
    args:
        query, key, value: input tensors
        mask: optional mask tensor
        dropout: optional dropout layer
    '''
    d_k = query.size(-1)  # vector dimension (e.g. 512, or d_model / h inside multi-head attention)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        # the mask broadcasts against scores (it is usually built from a triangular matrix);
        # positions where mask == 0 are filled with a large negative value so that
        # softmax pushes their weights to (almost) zero
        scores = scores.masked_fill(mask == 0, -1e9)

    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)

    return torch.matmul(p_attn, value), p_attn

An intuitive explanation of Q, K and V:

Suppose we are given a passage of text and need to describe it with some keywords:
key represents hint information given in advance, value is the information you guess in your mind after first seeing the hints, and query represents the given passage of text.
Generally key and value take the same values; this is ordinary attention. When key = value = query, it becomes self-attention.
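
A quick sanity check of the attention function above, reusing subsequent_mask from earlier (the shapes are my own choice; this is self-attention since query = key = value):

# calling attention() with random tensors; self-attention because query = key = value
query = key = value = torch.randn(2, 4, 512)    # [batch, seq_len, d_model]
mask = subsequent_mask(4)                       # shape (1, 4, 4), broadcast over the batch
out, p_attn = attention(query, key, value, mask=mask)
print(out.shape, p_attn.shape)                  # torch.Size([2, 4, 512]) torch.Size([2, 4, 4])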

multi-head attention mechanism

import copy
import torch.nn as nn

def clone(module, nums):
    # each copy needs its own parameters, hence deepcopy instead of reusing the same module
    return nn.ModuleList([copy.deepcopy(module) for _ in range(nums)])


class MultiHeadedAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        '''
        head: number of attention heads
        embedding_dim: word-embedding dimension
        '''
        super(MultiHeadedAttention, self).__init__()
        assert embedding_dim % head == 0  # must divide evenly

        self.d_k = embedding_dim // head  # embedding dimension assigned to each head
        self.head = head

        # four linear layers: three for the query/key/value projections, one for the output
        self.linears = clone(nn.Linear(embedding_dim, embedding_dim), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # add a head dimension so the same mask broadcasts to every head

        batch_size = query.size(0)  # number of samples

        # project, then split into heads: (batch, seq_len, head, d_k) -> (batch, head, seq_len, d_k)
        query, key, value = [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
                             for model, x in zip(self.linears, (query, key, value))]

        # run scaled dot-product attention on every head
        x, self.attn = attention(query, key, value, mask, self.dropout)

        # merge the heads back: (batch, head, seq_len, d_k) -> (batch, seq_len, head * d_k)
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)

        return self.linears[-1](x)
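
A quick usage check of the class above (head = 8 and embedding_dim = 512 follow the paper; the other values are arbitrary):

# self-attention through MultiHeadedAttention with a subsequent-position mask
mha = MultiHeadedAttention(head=8, embedding_dim=512)
x = torch.randn(2, 4, 512)                      # [batch, seq_len, embedding_dim]
mask = subsequent_mask(4)                       # shape (1, 4, 4)
out = mha(x, x, x, mask=mask)
print(out.shape)                                # torch.Size([2, 4, 512])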

      The above are the problems I encountered when reading the Transformer paper and some of my own thoughts. If there is anything written incorrectly or the expression is not clear, please give me some advice.

Origin blog.csdn.net/qq_44846512/article/details/114364559