Transformer input part implementation

Note: Parts of this post are drawn from online tutorials. If any content infringes your rights, please contact me and I will remove it.

Tutorial link: 2.2 Input part implementation-part1_哔哩哔哩_bilibili

The input section contains:

1. Source Text Embedding Layer and Position Encoder

2. Target Text Embedding Layer and Position Encoder

The role of the embedding layer: it acts as an encoder that maps each token in the vocabulary to a vector (a typical example of such an encoding is the one-hot vector).

Embedding layer code implementation:

First you need to understand the nn.Embedding module: 

torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None,
                   max_norm=None, norm_type=2.0, scale_grad_by_freq=False,
                   sparse=False, _weight=None)

Parameter explanation (from the Jianshu author top_small soy sauce):

num_embeddings (python:int) – the size of the dictionary, for example, if there are 1000 words in total, then enter 1000.
embedding_dim (python:int) – The dimension of the embedding vector, that is, how many dimensions are used to represent a symbol.

padding_idx (python:int, optional) – the id used for padding. For example, if inputs are padded to a fixed length of 100 but individual sentences are shorter, they are filled with a uniform id; specifying that id here means that when the network encounters the padding id, it does not learn any correlation between it and other symbols (its embedding vector is initialized to zeros and is not updated during training).

max_norm (python:float, optional) – the maximum norm, if the norm of the embedding vector exceeds this limit, it will be renormalized.

norm_type (python:float, optional) – which p-norm to use when comparing against max_norm; the default is the 2-norm.

scale_grad_by_freq (boolean, optional) – scale the gradient according to the frequency of words in the mini-batch. The default is False.

sparse (bool, optional) – if True, the gradient with respect to the weight matrix will be a sparse tensor.
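
As a quick illustration of these parameters (a minimal sketch, not from the original tutorial; the sizes below are arbitrary), a dictionary of 10 ids can be embedded into 3-dimensional vectors, with id 0 reserved for padding:

import torch
import torch.nn as nn

# A dictionary of 10 ids, 3-dimensional embeddings, id 0 reserved for padding
emb = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)

# A batch of 2 sentences, each padded to length 4 with id 0
ids = torch.LongTensor([[5, 2, 7, 0],
                        [3, 0, 0, 0]])
out = emb(ids)
print(out.shape)    # torch.Size([2, 4, 3])
print(out[0, 3])    # the padding position maps to the all-zero vector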

With these parameters understood, the embedding layer can be implemented as follows:

import torch
import torch.nn as nn
import math
from torch.autograd import Variable

# Define the Embeddings class to implement the text embedding layer
class Embeddings(nn.Module):

    def __init__(self, d_model, vocab):
        # d_model: the dimensionality of the word embeddings; vocab: the size of the vocabulary
        super(Embeddings, self).__init__()

        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # x: the tensor of token ids obtained by mapping the input text through the vocabulary
        # Scale the embeddings by sqrt(d_model), as done in the Transformer paper
        return self.lut(x) * math.sqrt(self.d_model)
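
A minimal usage sketch of this class (the embedding dimension, vocabulary size and token ids below are illustrative values, not fixed by the tutorial):

d_model = 512    # assumed embedding dimension
vocab = 1000     # assumed vocabulary size

emb = Embeddings(d_model, vocab)
x = torch.LongTensor([[100, 2, 421, 508],
                      [491, 998, 1, 221]])
embr = emb(x)
print(embr.shape)    # torch.Size([2, 4, 512])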

Let's look at the position encoder:

Because the Transformer's encoder structure itself does no processing of token position information (self-attention treats the sequence as an unordered set), a position encoder needs to be added after the Embedding layer.

Position encoder code:

# Define the positional encoder class
class PositionalEncoding(nn.Module):
    def __init__(self, dims, dropout, max_len=5000):
        # dims: the word embedding dimension; max_len: the maximum sentence length
        super(PositionalEncoding, self).__init__()

        self.dropout = nn.Dropout(p=dropout)

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_len, dims)
        # Define an absolute position matrix of shape (max_len, 1); unsqueeze adds the extra dimension
        position = torch.arange(0, max_len).unsqueeze(1)

        # To fill the encoding matrix from the absolute positions we need a frequency term
        # with one entry per pair of dimensions, computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, dims, 2) * -(math.log(10000.0) / dims))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # pe is now the positional encoding matrix; to combine it with the embedding
        # output we still need to add a batch dimension
        pe = pe.unsqueeze(0)

        # Finally, register pe as a buffer: it helps the model but is neither a parameter nor
        # a structural hyperparameter, so it is not updated during optimization. A registered
        # buffer is saved and reloaded together with the model's structure and parameters.
        self.register_buffer('pe', pe)


    def forward(self, x):
        # x: the word-embedding representation of the text sequence
        # Add the positional encodings for the first x.size(1) positions; they are not trained
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)
        return self.dropout(x)
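
Written out, the pe matrix built above is the standard sinusoidal encoding from the Transformer paper, which div_term computes in log space: for position pos and dimension pair 2i / 2i+1,

PE(pos, 2i)   = sin( pos / 10000^(2i / dims) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / dims) )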

After the embedding layer and position encoder, the final output is a word embedding tensor with position encoding information added.
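
Continuing the sketch above, the embedding output can be fed through the positional encoder (dropout = 0.1 and max_len = 60 are assumed, illustrative values):

dropout = 0.1
max_len = 60

pos_enc = PositionalEncoding(d_model, dropout, max_len)
pe_result = pos_enc(embr)    # embr: output of the Embeddings sketch above
print(pe_result.shape)       # torch.Size([2, 4, 512]); same shape, position information added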

The advantage of the position encoder: it ensures that the embedding vector of the same word changes according to the position it appears in, so identical words at different positions are represented differently.

The values of the sine and cosine waves both lie in the range -1 to 1, which keeps the magnitude of the added encoding under control and helps gradients to be computed quickly.
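
A quick check of both claims using the pos_enc instance from the sketch above (not part of the original post):

buf = pos_enc.pe[0]                         # registered buffer, shape (max_len, dims)
print(torch.equal(buf[0], buf[1]))          # False: different positions receive different offsets
print(buf.min().item(), buf.max().item())   # both values lie within [-1, 1]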

Original article: blog.csdn.net/APPLECHARLOTTE/article/details/127206083