Implementing the Transformer Model, Part 1: The Input Embedding Module and the Positional Embedding Module


1. Introduction to article generation models

The most common article generation models include the following:

  1. RNN: Recurrent Neural Network. It can handle sequences of varying length, such as natural language text. An RNN passes information along the time dimension through the recurrent connections in its hidden layer, so the current step can refer to information from previous steps. However, this model suffers from exploding and vanishing gradients, so it can only handle simple generation tasks.
  2. LSTM: Long Short-Term Memory network. It controls the flow of information by introducing a gating mechanism, which effectively mitigates the vanishing and exploding gradient problems. An LSTM can handle more complex generation tasks.
  3. Transformer: currently the most popular option, a neural network model based on the self-attention mechanism. The main difference from the models above is that training in an RNN or LSTM is serial: the current token must be processed before the next one can be. In a Transformer, all tokens of a sequence are processed at the same time, that is, in parallel. This makes it more efficient, and it also works better because every position can attend to information from the whole sequence.

It is worth mentioning that the value of these models is not limited to article generation. Any application scenario that needs to learn from "experience" is a suitable candidate. For example, in 2019 I tried to use an LSTM to drive an IoT car autonomously: operating instructions were converted into text codes to implement behaviors such as automatic cruising, obstacle avoidance, and reversing away from walls. The results were quite good, and I believe replacing it with an attention mechanism would work even better.

This article has no intention of reinventing the wheel. It is purely interest-driven learning and an attempt to reproduce the model-building process. The environment used in this article is Python 3.9 + PyTorch, and the reference paper is Google's "Attention Is All You Need" (2017). Feel free to reach out and discuss.

For the implementation code of RNN and LSTM, please see the related articles on my blog.

1.1 Transformer structure diagram

[Figure: Transformer structure diagram. The left side is the original version from the paper; the right side is the translated version.]
The Transformer model is mainly divided into two parts, the Encoder and the Decoder. The encoder is responsible for mapping the input language sequence into a hidden representation, and the decoder then maps that hidden representation into another natural language sequence. In the original paper, both the encoder and the decoder are stacked 6 layers deep (N = 6). This 6 has no special meaning; it is simply a number chosen from experience to balance training cost and accuracy.
The data needs to be preprocessed before the input sentence enters the encoder. That preprocessing is the main content of this chapter: the implementation of the Embedding modules.
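For orientation only, here is a minimal sketch (not code from this article) that uses PyTorch's built-in nn.Transformer just to show the N = 6 encoder/decoder stacking described above. The hyperparameter values (d_model=512, nhead=8) follow the original paper and are assumptions for this illustration; the rest of this series builds the individual parts by hand instead.

import torch
import torch.nn as nn

# A stock Transformer with 6 encoder layers and 6 decoder layers (N = 6).
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 2, 512)   # (source length, batch, d_model)
tgt = torch.rand(20, 2, 512)   # (target length, batch, d_model)
out = model(src, tgt)
print(out.shape)               # torch.Size([20, 2, 512])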

2. Implementation of the Input Embedding (character encoding) module

Character encoding is essentially a mapping: it mathematically maps real-world objects into the computer. Taking a translation task as an example, we need to prepare data in two different languages and use indexes to match the tokens one-to-one. For example, the English tokens [i, eat, shit] are paired with their Chinese counterparts; this is equivalent to knowing both the question and the answer, and what remains is to train the parameters of the hidden layers.
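As a concrete illustration (not code from the original article), a minimal sketch of such an index mapping might look like this; the vocab dictionary and the encode helper are hypothetical names used only for this example.

# Map each token of the source language to an integer index.
vocab = {"<pad>": 0, "i": 1, "eat": 2, "shit": 3}

def encode(tokens):
    # Turn a list of tokens into a list of indices that an Embedding layer can consume.
    return [vocab[t] for t in tokens]

print(encode(["i", "eat", "shit"]))   # [1, 2, 3]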

In NLP, in order to make characters computable, the input vocabulary must first be turned into numbers. In traditional language processing, one-hot encoding is commonly used: an array of a fixed size is defined, and a single position is set to 1 to identify a particular token, as in the example and the short sketch below.
One-hot encoding example:
[1, 0, 0, 0] = i
[0, 1, 0, 0] = eat
[0, 0, 1, 0] = shit
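A quick sketch of one-hot encoding in PyTorch (an illustration, not from the original article), using torch.nn.functional.one_hot; the vocabulary size of 4 matches the example above.

import torch
import torch.nn.functional as F

# Indices 0, 1, 2 stand for "i", "eat", "shit" in a 4-token vocabulary.
indices = torch.tensor([0, 1, 2])
print(F.one_hot(indices, num_classes=4))
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 1, 0]])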
One-hot encoding is simple and clear, but it cannot express the similarity between two values, and it cannot be reduced in dimensionality. Therefore the Transformer uses multi-dimensional word vectors to represent the encoding of words: one vector represents one word, and multiple words together form a matrix. Compared with one-hot encoding, word vectors make it easy to compute the similarity between words (via dot product) and also allow dimensionality reduction; a short similarity sketch follows the example below.
Word vector example:
[11, 23, 31, 32]
[23, 21, 31, 23]
[13, 32, 33, 93]
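To make the similarity claim concrete, here is a small sketch (not from the original article) that computes dot-product similarities between the example vectors above.

import torch

# The three example word vectors above, stacked into a matrix (one row per word).
words = torch.tensor([[11., 23., 31., 32.],
                      [23., 21., 31., 23.],
                      [13., 32., 33., 93.]])

# Dot product of every word with every other word; larger values mean more similar.
similarity = words @ words.T
print(similarity)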

There are many ways to obtain word embeddings. For example, they can be pre-trained with algorithms such as Word2Vec and GloVe, or they can be trained inside the Transformer itself. The following Python script obtains Embedding vectors and can be copied and run directly.

import torch
import torch.nn as nn

# padding_idx: when sentences have different lengths, index 0 pads the blanks
# (the vector at padding_idx stays zero). The vocabulary size and embedding
# dimension below are example values.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=512, padding_idx=0)
# Look up 8 word vectors (2 sentences of 4 indices each) by index.
input = torch.LongTensor([[1, 2, 3, 4], [11, 12, 13, 13]])
print(embedding(input))
print(embedding(input).shape)
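With the example values above, the printed shape is torch.Size([2, 4, 512]): 2 sentences, 4 tokens each, and one 512-dimensional vector per token.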

3. Implementation of the Positional Embedding position encoding module

The positional encoding module is responsible for writing positional information into the word vectors of the input sequence. The sentences fed into the Transformer carry no order information by themselves, so positional information must be added to the input sequence, computed from the sentence length and each word's position within it. The original Transformer authors use sine and cosine encoding.

Positional Embedding is denoted PE, and PE has the same dimensions as the word Embedding, so the two can be added together. PE can either be learned through training or computed with a fixed formula; the Transformer uses the latter, and the calculation formula is given below.

So how is the vector that actually enters the model obtained?
Input vector = word embedding + positional encoding
For example, for the sentence "i eat shit", each of the three word vectors gets its own positional encoding added to it before entering the encoder.

Position encoding calculation formula

Even index: $PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d}\right)$
Odd index: $PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$
where $pos$ is the position of the word in the sequence, $i$ is the dimension index, and $d$ is the embedding dimension.

import math
import torch
import seaborn as sns
import matplotlib.pyplot as plt

d_model = 512   # embedding dimension (d in the formula)
max_len = 100   # maximum sequence length to pre-compute

pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)   # shape (max_len, 1)
# 1 / 10000^(2i/d) for every even dimension index 2i
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions

# Visualize the encoding as a heatmap: positions on the y axis, dimensions on the x axis.
sns.heatmap(pe.numpy())
plt.show()
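In a complete model, a matrix like pe is usually registered as a fixed (non-trainable) buffer inside an nn.Module, and the first seq_len rows are added to the output of the word Embedding layer; that sum is exactly the formula used in the next section.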

4. Obtain preprocessed data

After obtaining the character encoding and the position encoding, they can be added together and multiplied by the weight matrices to obtain the query, key, and value vectors that carry positional information.

Formula: $[q, k, v] = (\text{Input Embedding} + \text{Positional Embedding}) \times [W_q, W_k, W_v]$
where q is the query vector, k is the key vector, and v is the value vector.
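As a rough sketch of that formula (an illustration, not the article's own code), the three projections can be written as linear layers; the dimensions and variable names here are assumptions for the example.

import torch
import torch.nn as nn

d_model = 512
seq_len, batch = 4, 2

# Stand-in for (Input Embedding + Positional Embedding): one d_model vector per token.
x = torch.rand(batch, seq_len, d_model)

# Wq, Wk, Wv are learned weight matrices, implemented here as linear layers without bias.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)
print(q.shape, k.shape, v.shape)   # three tensors of shape (2, 4, 512)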






After obtaining these values, you can proceed to the next step: passing them into the Transformer's Encoder for encoding.

Origin blog.csdn.net/lengyoumo/article/details/132529600