Artificial Intelligence (PyTorch): Build a Transformer Model, Run It End to End, and Deeply Understand the Transformer Architecture

Hello everyone, I am Weixue AI. Today I will show you how to build a Transformer model by hand with PyTorch. As we know, the Transformer is a relatively complex deep learning model that models sequences with a self-attention mechanism. Compared with RNNs and CNNs, the Transformer is more efficient and easier to parallelize, and it is widely used in tasks such as neural machine translation, text generation, and question answering.

1. The Transformer model

The Transformer is a deep neural network model for sequence-to-sequence (seq2seq) learning. It was originally applied to machine translation, but has since been widely used in other natural language processing tasks such as text summarization and language generation.

The innovation of the Transformer is that it models sequence data without using a recurrent neural network (RNN) such as an LSTM or GRU. This gives it several advantages over RNNs: better parallelism, faster training, and the ability to capture longer-range dependencies in a sequence.

2. The structure of the Transformer model

The main components of the Transformer model are the self-attention mechanism and the feed-forward neural network. With self-attention, the model produces, for each position in the input sequence, a vector representation computed from the information at every position, so the output has the same length as the input. This representation captures the relationships between each position and all other positions in the sequence, giving the model a better way to understand the input.
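
To make this concrete, here is a minimal sketch of single-head scaled dot-product self-attention, the computation that multi-head attention repeats in parallel. The function name, the random weight matrices, and the toy dimensions are illustrative only and are not part of the model built later in this article:

import math
import torch

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) -- one sequence, no batch dimension for simplicity
    q, k, v = x @ w_q, x @ w_k, x @ w_v                         # project input into queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))    # similarity of every position to every other
    weights = torch.softmax(scores, dim=-1)                     # attention weights sum to 1 over positions
    return weights @ v                                          # weighted sum of values, same length as input

# toy usage: a sequence of 5 positions with model dimension 8
d_model = 8
x = torch.randn(5, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = scaled_dot_product_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])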

In the Transformer, the input sequence is processed by a stack of encoder blocks. In each block, the self-attention mechanism and the feed-forward neural network are combined, and multiple such blocks form the complete encoder. To preserve the sequence information, the Transformer also uses attention to pass information from each position of the input sequence through to the output sequence.

The Transformer model includes the following parts:

Word embedding layer: each word is mapped to a vector representation called an embedding vector; the embedding layer can also be initialized with pre-trained embedding vectors.

Positional encoding: since the Transformer has no recurrence, it needs another way to represent the position of each word in the sequence. Positional encodings are a set of vectors added to the embedding vectors so that the model can tell where each word sits in the sequence (the sinusoidal formulas implemented by the code in the next section are given right after this list).

Multi-head self-attention mechanism: the core part of the Transformer; it lets every position in the input sequence exchange information with every other position and produces a vector representation of the same length as the input sequence.

Feed-forward neural network: after self-attention, a position-wise feed-forward network applies a nonlinear transformation to the representation of each position.

Residual connection: residual (skip) connections are added around the self-attention and feed-forward sub-layers, which helps information and gradients flow through the deep stack of layers.

Normalization layer: normalization comes in two common forms. (1) Layer normalization operates along the feature dimension of each sample, computing the mean and variance over all of that sample's activations and using them as the normalization parameters. (2) Batch normalization operates within each mini-batch, normalizing all samples along the same feature dimension and using the mini-batch's mean and variance along that dimension as the normalization parameters. The Transformer uses the first form, layer normalization.

Encoder layer: consists of multiple (usually 6 to 12) identical blocks; each block contains a self-attention mechanism, a feed-forward neural network, and residual connections, and together the blocks encode the input sequence.

Decoder layer: in translation tasks, a decoder is also needed to generate the target-language sequence from the already-encoded source-language sequence.
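
For reference, the sinusoidal positional encoding mentioned above follows the standard formulas from the original Transformer paper, which the PositionalEncoding class in the next section implements. Here pos is the position in the sequence, i indexes the embedding dimensions, and d_model is the embedding size:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Because each pair of dimensions uses a sine and cosine at a different wavelength, every position receives a unique pattern of values.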

3. Building the Transformer model

import math
import torch
import torch.nn as nn
import torch.optim as optim

# Positional encoding: adds sinusoidal position information to the token embeddings
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=0.1)
        # Precompute the sinusoidal encoding table: shape (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
        pe = pe.unsqueeze(0).transpose(0, 1)  # reshape to (max_len, 1, d_model) for (seq, batch, d_model) inputs
        self.register_buffer("pe", pe)  # not a trainable parameter, but saved and moved with the module

    def forward(self, x):
        # x: (seq_len, batch, d_model); add the encoding for the first seq_len positions
        x = x + self.pe[: x.size(0), :]
        return self.dropout(x)

# Transformer model definition
class TransformerModel(nn.Module):
    def __init__(self, ntoken, ninp, nhead, nhid, nlayers):
        super(TransformerModel, self).__init__()

        # Word embedding layer
        self.embedding = nn.Embedding(ntoken, ninp)

        # Positional encoding
        self.pos_encoder = PositionalEncoding(ninp)

        # Encoder: a stack of nlayers identical TransformerEncoderLayer blocks
        encoder_layers = nn.TransformerEncoderLayer(ninp, nhead, nhid)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)
        # Linear output layer projecting back to the vocabulary (a simple stand-in for a decoder)
        self.decoder = nn.Linear(ninp, ntoken)
        self.init_weights()

    def init_weights(self):
        init_range = 0.1
        self.embedding.weight.data.uniform_(-init_range, init_range)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-init_range, init_range)

    def forward(self, x):
        # x: (batch, seq_len) token ids
        x = self.embedding(x).transpose(0, 1)  # -> (seq_len, batch, ninp), the layout nn.TransformerEncoder expects by default
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)
        x = self.decoder(x)       # -> (seq_len, batch, ntoken)
        return x.transpose(0, 1)  # back to (batch, seq_len, ntoken)

def data_gen(batch_size=20, seq_len=10, limit=500):
    # Toy data: random token ids in [1, 9]; the target for each token is simply twice its value
    for _ in range(limit):
        data = torch.randint(1, 10, (batch_size, seq_len))
        targets = data * 2  # stays below ntokens=20, so every target is a valid class index
        yield data, targets

if __name__ == "__main__":
    ntokens = 20   # vocabulary size
    emsize = 200   # embedding dimension (d_model)
    nhead = 2      # number of attention heads
    nhid = 200     # hidden size of the feed-forward network
    nlayers = 2    # number of encoder blocks
    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers)

    criterion = nn.CrossEntropyLoss()
    lr = 0.001
    optimizer = optim.Adam(model.parameters(), lr=lr)

    num_epochs = 5
    for epoch in range(num_epochs):
        for i, (data, targets) in enumerate(data_gen()):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output.reshape(-1, ntokens), targets.reshape(-1))
            loss.backward()
            optimizer.step()

            if i % 50 == 0:
                print(f"Epoch: {epoch}, Loss: {loss.item():.6f}")

    # Testing on some data
    test_data = torch.tensor([[3, 6, 9], [2, 4, 6]])
    print("Test input:", test_data)
    test_output = torch.argmax(model(test_data), dim=2)
    print("Test output:", test_output)

To make it easier for everyone to grasp, the encoding step here directly uses nn.TransformerEncoderLayer (with a plain linear layer playing the role of the decoder). Following the structure of the Transformer encoder, each such layer already contains the multi-head self-attention mechanism, the feed-forward neural network, residual connections, and normalization layers, so in a real project we can simply reuse and configure it.
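
If you want to see those pieces spelled out, below is a rough, hand-written sketch of one encoder block assembled from nn.MultiheadAttention, a feed-forward network, residual connections, and nn.LayerNorm. The class name ManualEncoderBlock and its exact layout (a post-norm arrangement) are an illustrative assumption, not the verbatim internals of nn.TransformerEncoderLayer:

import torch
import torch.nn as nn

class ManualEncoderBlock(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)        # multi-head self-attention
        x = self.norm1(x + self.dropout(attn_out))   # residual connection + layer norm
        ffn_out = self.ffn(x)                        # position-wise feed-forward network
        x = self.norm2(x + self.dropout(ffn_out))    # residual connection + layer norm
        return x

# quick shape check: 10 positions, batch of 4, d_model 200
block = ManualEncoderBlock(d_model=200, nhead=2, dim_feedforward=200)
print(block(torch.randn(10, 4, 200)).shape)  # torch.Size([10, 4, 200])

Stacking several such blocks is exactly what nn.TransformerEncoder does for us in the model above.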

So that everyone can actually run the Transformer model end to end, the task is deliberately trivial: every integer in the input sequence is mapped to twice its value. The data generator, the data_gen function, produces random sequences for training, and a small vocabulary size of 20 is used, which means there are not many distinct "words".
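
As a quick sanity check (a small illustrative snippet that assumes the script above has already been run in the same session), you can peek at one batch from data_gen:

# Peek at one training batch: inputs are random ids in [1, 9], targets are the doubled values
data, targets = next(data_gen())
print(data.shape, targets.shape)  # torch.Size([20, 10]) torch.Size([20, 10])
print(data[0])     # e.g. tensor([3, 7, 1, ...])
print(targets[0])  # the same row multiplied by 2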

After training completes, a simple test is run. The test data consists of two sequences, [3, 6, 9] and [2, 4, 6], and the model should output each input integer doubled. This example is only meant to show how to build a Transformer model and complete simple training on a real task, so that we can get genuinely hands-on with the Transformer.
