Background
In the Transformer architecture, each word vector is added to a position encoding for the corresponding word before it is fed into the model for training. How is this position encoding actually implemented? This post walks through the steps.
Position encoding formula
There are many ways to encode the position of word vectors. Here we introduce the formula that uses trigonometric functions:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
PE stands for position encoding (also called position embedding). pos is the position of the word in the sequence and d_model is the dimension of the word vector. i indexes pairs of dimensions, so 2i and 2i+1 are the even and odd dimensions of the word vector.
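To make the formula concrete, here is a minimal scalar sketch that evaluates it one entry at a time. The helper name pe_value and the choice d_model = 4 are assumptions made for illustration only; they are not part of the implementation in this post.

```python
import math

def pe_value(pos, k, d_model):
    """Return the encoding for position pos and dimension k (hypothetical helper)."""
    i = k // 2  # dimensions 2i and 2i+1 share the same frequency
    angle = pos / (10000 ** (2 * i / d_model))
    # Even dimensions use sine, odd dimensions use cosine
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

d_model = 4  # small illustrative dimension
for pos in range(3):
    row = [round(pe_value(pos, k, d_model), 4) for k in range(d_model)]
    print(pos, row)
```

At pos = 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for any implementation of the formula.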
Next, we implement position encoding in code according to the formula.
Code
Required libraries
import torch
import math
import numpy as np
import matplotlib.pyplot as plt
Define a function to obtain positional encoding information
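def generate_word_embeding(max_len, d_model):
    # Initialize the position indices, shape [max_len, 1]
    pos = torch.arange(max_len).unsqueeze(1)
    # Initialize the position encoding matrix, shape [max_len, d_model]
    result = torch.zeros(max_len, d_model)
    # Compute the term 1 / 10000^(2i / d_model) from the formula (d_model must be even)
    coding = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0)) / d_model)
    result[:, 0::2] = torch.sin(pos * coding)
    result[:, 1::2] = torch.cos(pos * coding)
    # Add a leading dimension so the result can be added directly to
    # word embeddings of shape [B, seq_len, d_model]
    return result.unsqueeze(0)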
Suppose max_len is 100 and d_model is 20. Then pos has shape [100, 1], result has shape [100, 20], and coding has shape [10] (that is, d_model/2), so pos*coding broadcasts to shape [100, 10]. result[:, 0::2] assigns values to every other column of result starting from column 0, which corresponds to PE(pos, 2i) in the formula; similarly, result[:, 1::2] corresponds to PE(pos, 2i+1).
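The shapes described above can be checked directly. The sketch below repeats the same tensor operations outside the function and then adds the encoding to a batch of word vectors. The batch size of 8, the sequence length of 32, and the random word_vectors tensor are hypothetical stand-ins for real embeddings.

```python
import math
import torch

max_len, d_model = 100, 20

pos = torch.arange(max_len).unsqueeze(1)                # [100, 1]
coding = torch.exp(
    torch.arange(0, d_model, 2) * (-math.log(10000.0)) / d_model
)                                                       # [10]
angles = pos * coding                                   # broadcasts to [100, 10]

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(angles)   # even columns
pe[:, 1::2] = torch.cos(angles)   # odd columns
pe = pe.unsqueeze(0)                                    # [1, 100, 20]

# Hypothetical batch of word vectors: 8 sentences of 32 tokens each
batch, seq_len = 8, 32
word_vectors = torch.randn(batch, seq_len, d_model)

# Slice the encoding to the actual sequence length; the leading
# dimension of size 1 broadcasts over the batch
x = word_vectors + pe[:, :seq_len, :]

print(pos.shape, coding.shape, angles.shape, pe.shape, x.shape)
```

Because the encoding is computed once for max_len positions, any shorter sequence can reuse it by slicing, as long as seq_len does not exceed max_len.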
Visualizing the position encoding
We plot the position encoding to get a more intuitive feel for it.
d = 6
pos_code = generate_word_embeding(100,d)
print(pos_code.shape)
plt.plot(np.arange(100),pos_code[0,:,0:d])
plt.legend(['dim=%d'%p for p in range(d)])
plt.show()
Here the word-vector dimension d is set to 6 and the sequence length to 100. The plot shows the position encoding value of each dimension at every position in the sequence.
Each dimension follows its own sine or cosine curve across the positions, so each position receives its own pattern of values. Once the encoding is added to the word vectors and fed into the model for training, the model can learn to make use of this position information.
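The curves oscillate at different speeds because, by the formula, the pair of dimensions 2i and 2i+1 completes one full cycle every 2π · 10000^(2i / d_model) positions. As a small sketch, assuming the same d = 6 as in the plot, these wavelengths can be computed as follows (the variable name wavelengths is only illustrative):

```python
import math

d = 6  # word-vector dimension used in the plot

# One cycle spans 2*pi * 10000^(2i/d) positions for dimensions 2i and 2i+1
wavelengths = [2 * math.pi * 10000 ** (2 * i / d) for i in range(d // 2)]

for i, w in enumerate(wavelengths):
    print(f"dims {2 * i} and {2 * i + 1}: about {w:.1f} positions per cycle")
```

With d = 6 the wavelengths are roughly 6, 135, and 2916 positions, so over 100 positions only the first pair of dimensions oscillates many times, while the last pair changes very slowly.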
Comments and discussion are welcome~