Note: Part of this content comes from online tutorials. If there is any infringement, please contact me and I will delete the relevant content.
Tutorial link: 2.2 Input part implementation-part1 (bilibili)
The input section contains:
1. Source Text Embedding Layer and Position Encoder
2. Target Text Embedding Layer and Position Encoder
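The role of the embedding layer: it acts as an encoder that maps each word in the vocabulary to a vector. One-hot encoding is the typical baseline for this kind of word-to-vector mapping; nn.Embedding instead learns a dense vector for each word.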
Embedding layer code implementation:
First you need to understand the nn.Embedding module:
torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None,
                   max_norm=None, norm_type=2.0, scale_grad_by_freq=False,
                   sparse=False, _weight=None)
Parameter explanation (adapted from a Jianshu post by the user "top_small soy sauce"):
num_embeddings (int) – the size of the dictionary. For example, if there are 1000 words in total, enter 1000.
embedding_dim (int) – the dimension of the embedding vector, that is, how many dimensions are used to represent each symbol.
padding_idx (int, optional) – the padding id. For example, if inputs are padded to length 100 but the sentences have different lengths, the shorter ones are filled with one fixed id, which is specified here. The embedding vector at this index is initialized to zeros and receives no gradient updates, so padding positions do not contribute to training.
max_norm (float, optional) – the maximum norm. If the norm of an embedding vector exceeds this limit, the vector is renormalized so that its norm equals max_norm.
norm_type (float, optional) – specifies which p-norm is computed when checking against max_norm. The default is the 2-norm.
scale_grad_by_freq (bool, optional) – if True, scale the gradients by the inverse of the frequency of the words in the mini-batch. The default is False.
sparse (bool, optional) – if True, the gradient with respect to the weight matrix is a sparse tensor. The default is False.
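To make the parameters above concrete, here is a minimal sketch of nn.Embedding in use. The vocabulary size (10), the embedding dimension (4), and the token ids are made-up values chosen only for illustration; id 0 is assumed to be the padding id.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: 10-word vocabulary, 4-dimensional vectors, id 0 reserved for padding
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# A batch of 2 sentences, each padded with id 0 to length 5
ids = torch.tensor([[1, 2, 3, 0, 0],
                    [4, 5, 6, 7, 0]])
out = emb(ids)

print(out.shape)   # torch.Size([2, 5, 4]): one 4-dimensional vector per token
print(out[0, 3])   # a padding position: all zeros

# The row of the weight matrix at padding_idx receives no gradient
out.sum().backward()
print(emb.weight.grad[0])   # all zeros
```

The output gains one trailing dimension of size embedding_dim, and the padding row stays at zero throughout training.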
Embedding layer code implementation:
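import torch
import torch.nn as nn
import math
from torch.autograd import Variable

# Define the Embeddings class to implement the text embedding layer
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        # Two parameters: d_model is the dimension of the word embeddings,
        # vocab is the size of the vocabulary
        super(Embeddings, self).__init__()
        # lut (look-up table): one learnable d_model-dimensional vector per vocabulary entry
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # x: tensor of token ids obtained by mapping the input text through the vocabulary
        # The looked-up vectors are scaled by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)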
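A short usage sketch may help here. The class is repeated so that the snippet runs on its own; the sizes (d_model = 512, vocab = 1000) and the token ids are assumed example values, not values taken from the tutorial.

```python
import math
import torch
import torch.nn as nn

class Embeddings(nn.Module):
    """Same layer as above, repeated so this snippet is self-contained."""
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

d_model, vocab = 512, 1000   # assumed example sizes
emb = Embeddings(d_model, vocab)

# Made-up token ids: a batch of 2 sentences with 4 tokens each
x = torch.tensor([[100, 2, 421, 508],
                  [491, 998, 1, 221]])
embr = emb(x)
print(embr.shape)   # torch.Size([2, 4, 512])

# The output is exactly the raw table lookup multiplied by sqrt(d_model)
print(torch.allclose(embr, emb.lut(x) * math.sqrt(d_model)))   # True
```

The input of shape (batch, seq_len) becomes a tensor of shape (batch, seq_len, d_model).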
Let's look at the position encoder:
Because the Transformer's encoder structure has no mechanism of its own for handling word position information, a position encoder must be added after the Embedding layer so that word order is visible to the model.
Position encoder code:
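# Define the position encoder class
class PositionalEncoding(nn.Module):
    def __init__(self, dims, dropout, max_len=5000):
        # dims: dimension of each word vector (assumed even);
        # dropout: dropout probability; max_len: maximum sentence length
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Initialize the position encoding matrix, shape (max_len, dims)
        pe = torch.zeros(max_len, dims)
        # Absolute position matrix of shape (max_len, 1); unsqueeze adds a dimension
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # To write the positions into pe we need a scaling vector of shape (dims/2,):
        # div_term[i] = 1 / 10000^(2i/dims), computed via exp and log
        div_term = torch.exp(torch.arange(0, dims, 2, dtype=torch.float) * -(math.log(10000.0) / dims))
        pe[:, 0::2] = torch.sin(position * div_term)  # even columns
        pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
        # pe is now the position encoding matrix; add a leading batch dimension
        # so that it can be combined with the embedding output
        pe = pe.unsqueeze(0)
        # Register pe as a buffer: it helps the model but is neither a parameter
        # nor a hyperparameter, so it is not updated during optimization.
        # A registered buffer is saved and reloaded together with the model's parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: word embedding representation of the text sequence, shape (batch, seq_len, dims)
        # Add the encodings of the first seq_len positions. Buffers never require
        # gradients, so no Variable wrapper is needed here.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)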
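The div_term line is the least obvious step in the code, so it may help to write out the formula it implements. The following is a sketch in notation assumed here (pos for the position, i for the index of a sine/cosine column pair); it follows the sinusoidal encoding of the original Transformer paper, which this code appears to reproduce.

```latex
\begin{aligned}
PE(pos,\, 2i)   &= \sin\!\left(\frac{pos}{10000^{2i/\mathrm{dims}}}\right), \\
PE(pos,\, 2i+1) &= \cos\!\left(\frac{pos}{10000^{2i/\mathrm{dims}}}\right), \\
\mathrm{div\_term}_i &= \exp\!\left(2i \cdot \frac{-\ln 10000}{\mathrm{dims}}\right)
                      = \frac{1}{10000^{2i/\mathrm{dims}}}.
\end{aligned}
```

So position * div_term is the argument pos / 10000^(2i/dims), and the exp/log form is just another way of computing the power.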
After the embedding layer and the position encoder, the final output is a word embedding tensor with position encoding information added.
The advantage of the position encoder: it ensures that the embedding vector of the same word changes with the word's position in the sentence.
The sine and cosine functions both take values between -1 and 1, which keeps the magnitude of the position encoding bounded relative to the embedding values and helps gradients be computed quickly.
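Both properties can be checked with a few lines of plain Python. This is only an illustrative sketch: sinusoidal_pe is a hypothetical helper that recomputes the same table with the math module instead of torch, and the 8-dimensional word vector is made up.

```python
import math

def sinusoidal_pe(max_len, dims):
    """Return a max_len x dims table: sin in even columns, cos in odd columns (dims assumed even)."""
    table = [[0.0] * dims for _ in range(max_len)]
    for pos in range(max_len):
        for col in range(0, dims, 2):
            angle = pos / (10000 ** (col / dims))
            table[pos][col] = math.sin(angle)
            table[pos][col + 1] = math.cos(angle)
    return table

pe = sinusoidal_pe(max_len=50, dims=8)

# Every entry stays within [-1, 1]
print(all(-1.0 <= v <= 1.0 for row in pe for v in row))   # True

# The same (made-up) word vector placed at positions 0 and 3 gives different inputs
word = [0.5, -1.2, 0.3, 0.8, -0.4, 1.1, 0.0, -0.7]
at_pos0 = [w + p for w, p in zip(word, pe[0])]
at_pos3 = [w + p for w, p in zip(word, pe[3])]
print(at_pos0 != at_pos3)   # True
```

Position 0 always encodes to the pattern 0, 1, 0, 1, ... (sin 0 and cos 0), while later positions get different values in every column pair.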