Transformer encoder implementation, part 1 (mask tensor + attention mechanism)

Description: partly from online tutorials

Tutorial link: 2.3.1 Mask tensor-part1_哔哩哔哩_bilibili

1. Mask tensor

A mask tensor masks (covers up) values in another tensor: "mask" means to cover, and the "code" refers to the values in the tensor. Its size is variable, and its contents generally consist only of 0s and 1s; whether the positions holding 0 or the positions holding 1 are the ones being covered is up to the implementation.

Code to generate mask tensor:

import numpy as np
import torch

def subsequent_mask(size):
    '''
    Generate a mask tensor. The parameter size gives the last two dimensions,
    which together form a square matrix.
    '''
    attn_shape = (1, size, size)    # shape of the mask tensor
    # Use np.triu to build an upper-triangular matrix; cast to uint8 to save space.
    # k=0 keeps the main diagonal; k=1 shifts the boundary one step above the diagonal.
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    # Invert (1 - mask) and return as a torch tensor
    return torch.from_numpy(1 - subsequent_mask)
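
As a quick check, here is what the mask looks like for a hypothetical size of 4 (the values shown assume the function above): row i only allows attention to positions up to i.

# Example: a (1, 4, 4) mask where row i allows attention only to positions <= i
print(subsequent_mask(4))
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]], dtype=torch.uint8)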

2. Attention mechanism

Attention: when we observe something, we automatically focus on its most critical parts, so we can make quick judgments.

Attention calculation rule: three inputs are required, Q (query), K (key), and V (value), and the attention result is obtained through a formula. There are quite a few calculation rules; here we introduce one of them:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Suppose we want to summarize a piece of text; that text is Q. Some keywords are given to us in advance as hints; those keywords are K. And V represents what comes to mind after we see these keywords.

When K = V ≠ Q, this is called the attention mechanism; when K = V = Q, it is called the self-attention mechanism. In that case the keywords are the text itself.
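
As a minimal sketch of this calculation rule (tensor sizes chosen purely for illustration), the formula can be evaluated directly with a few tensor operations; setting Q = K = V gives the self-attention case:

import math
import torch
import torch.nn.functional as F

# Tiny self-attention example: Q = K = V, a "sentence" of 3 words with dimension 4
q = k = v = torch.randn(3, 4)

d_k = q.size(-1)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)   # (3, 3) similarity scores
p_attn = F.softmax(scores, dim=-1)                               # each row sums to 1
output = torch.matmul(p_attn, v)                                 # (3, 4) attention result
print(p_attn.shape, output.shape)                                # torch.Size([3, 3]) torch.Size([3, 4])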

The attention mechanism is the carrier in a deep learning network where the attention calculation rule is applied. Besides the attention calculation rule itself, it also includes some necessary fully connected layers and related tensor processing. An attention mechanism that uses the self-attention calculation rule is called a self-attention mechanism.

Attention mechanism code implementation:

import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    # The last dimension of query is the word embedding dimension, e.g. the 512 in (2, 4, 512)
    # mask: the mask tensor
    # dropout: an instantiated Dropout module used to randomly zero out values

    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    # Fill masked positions with a very large negative value so softmax gives them ~0 weight
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    p_attn = F.softmax(scores, dim=-1)

    if dropout is not None:
        p_attn = dropout(p_attn)

    # Return the attention representation of query and the attention tensor itself
    return torch.matmul(p_attn, value), p_attn
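
A small usage sketch, reusing subsequent_mask from part 1 and assuming an embedded input of shape (2, 4, 512):

query = key = value = torch.randn(2, 4, 512)   # batch=2, sequence length=4, d_model=512
mask = subsequent_mask(4)                      # (1, 4, 4), broadcast across the batch

attn_output, p_attn = attention(query, key, value, mask=mask)
print(attn_output.shape)   # torch.Size([2, 4, 512])
print(p_attn.shape)        # torch.Size([2, 4, 4])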

3. Multi-head attention mechanism

 The structure diagram of the multi-head attention mechanism is shown in the figure.

The multi-head attention mechanism essentially splits the word embedding dimension. For example, if the embedded input has shape (2, 4, 512), the word embedding dimension is 512; splitting it into two heads gives each head an input of shape (2, 4, 256), which is fed into the attention mechanism. The number of heads is the h in the figure.
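
A small sketch of that split (shapes are only illustrative): the embedding dimension is reshaped into (head, d_k) and the head axis is moved next to the batch axis, so each head works on its own slice:

import torch

x = torch.randn(2, 4, 512)   # batch=2, sequence length=4, embedding dim=512
head = 2
d_k = 512 // head            # 256 features per head

# view splits the 512 features into (head, d_k); transpose moves the head axis forward
x_heads = x.view(2, -1, head, d_k).transpose(1, 2)
print(x_heads.shape)         # torch.Size([2, 2, 4, 256]): each head sees a (2, 4, 256) slice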

The role of the multi-head attention mechanism:

This structural design lets each head optimize a different feature subspace of each word, which balances the bias a single attention head might produce and allows word meanings to be expressed in more diverse ways, improving the model's performance.

# copy is used for deep-copying network layers
import copy
import torch
import torch.nn as nn

# First define a clone function: the multi-head attention implementation uses
# several linear layers with identical structure, and clones initializes them
# together inside one ModuleList.
def clones(module, N):
    '''
    Produce N identical copies of a network layer. module is the layer to clone,
    and N is the number of copies.
    '''
    # Deep-copy module N times and store the copies in an nn.ModuleList
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

# Multi-head attention implementation
class MultiHeadedAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        # head: number of heads
        # embedding_dim: word embedding dimension
        # dropout: dropout probability

        super(MultiHeadedAttention, self).__init__()
        # The embedding dimension must be divisible by the number of heads
        assert embedding_dim % head == 0
        self.d_k = embedding_dim // head
        self.head = head
        self.embedding_dim = embedding_dim

        # Four linear layers: three for Q, K, V and one for the final output
        self.linears = clones(nn.Linear(embedding_dim, embedding_dim), 4)

        # The attention tensor, filled in during forward
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):

        if mask is not None:
            # Add a head dimension so the same mask broadcasts over every head
            mask = mask.unsqueeze(1)
        # batch size
        batch_size = query.size(0)
        # zip pairs each linear layer with its input; the outputs are reshaped
        # to (batch, head, seq_len, d_k) before attention
        query, key, value = \
            [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
             for model, x in zip(self.linears, (query, key, value))]
        # Feed every head's projection into the attention function
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # Merge the heads back into a (batch, seq_len, embedding_dim) tensor
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)

        # Pass x through the last linear layer to get the final output: the
        # multi-head attention representation of query
        return self.linears[-1](x)
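
A minimal usage sketch (head count and tensor shapes are assumed for illustration), combining the pieces above:

head, embedding_dim = 8, 512
mha = MultiHeadedAttention(head, embedding_dim)

query = key = value = torch.randn(2, 4, embedding_dim)   # batch=2, sequence length=4
mask = subsequent_mask(4)                                # (1, 4, 4) mask from part 1

out = mha(query, key, value, mask=mask)
print(out.shape)   # torch.Size([2, 4, 512])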
        


Source: blog.csdn.net/APPLECHARLOTTE/article/details/127231042