[Transformers 03] Using PyTorch to start building transformers

 

1. Description

        In this tutorial, we will use PyTorch to build a basic transformer model from scratch. The Transformer model introduced by Vaswani et al. in the paper "Attention is All You Need" is a deep learning architecture designed for sequence-to-sequence tasks such as machine translation and text summarization. It is based on the self-attention mechanism, which has been the basis of many state-of-the-art natural language processing models, such as GPT and BERT.

2. Preparation

        To build the transformer model, we will follow these steps:

  1. Import the necessary libraries and modules
  2. Define the basic building blocks: multi-head attention, position-wise feed-forward networks, positional encoding
  3. Build the encoder and decoder layers
  4. Combine the encoder and decoder layers to create the complete transformer model
  5. Prepare sample data
  6. Train the model

        Let's start by importing the necessary libraries and modules.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

We will now define the basic building blocks of the transformer model.

3. Multi-head attention

Figure 2. Multi-head attention (Source: Image created by the author)

        The multi-head attention mechanism computes the attention between each pair of positions in the sequence. It consists of multiple "attention heads" to capture different aspects of the input sequence.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

        The MultiHeadAttention class initializes the module with the input parameters and four linear projection layers. It computes attention scores, reshapes the input tensors into multiple heads, and combines the attention outputs from all heads. The forward method computes multi-head attention, allowing the model to focus on different aspects of the input sequence.
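
As a quick sanity check (the sizes below are illustrative, not taken from the paper), we can pass a random tensor through the module as queries, keys, and values and confirm that the output keeps the input shape:

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.rand(2, 10, 512)   # (batch_size, seq_length, d_model)
out = mha(x, x, x)           # self-attention: Q, K, and V are the same tensor
print(out.shape)             # torch.Size([2, 10, 512])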

4. Position-wise feed-forward network

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

The PositionWiseFeedForward class extends PyTorch's nn.Module and implements a position-wise feed-forward network. It is initialized with two linear transformation layers and a ReLU activation function. The forward method applies these transformations and the activation sequentially to compute the output. This lets the model apply an additional non-linear transformation to each position of the sequence independently.
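
For example (with illustrative sizes), the network expands each position from d_model to d_ff and projects it back, so the output shape matches the input shape:

ffn = PositionWiseFeedForward(d_model=512, d_ff=2048)
x = torch.rand(2, 10, 512)   # (batch_size, seq_length, d_model)
print(ffn(x).shape)          # torch.Size([2, 10, 512])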

5. Positional encoding

        Positional encoding is used to inject the position of each token into the input sequence. It uses sine and cosine functions of different frequencies to generate the positional encodings.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

The PositionalEncoding class is initialized with the d_model and max_seq_length input parameters and creates a tensor to store the positional encoding values. The sine function is applied to the even indices and the cosine function to the odd indices, scaled by div_term. The forward method adds the stored positional encoding values to the input tensor, enabling the model to capture the position information of the input sequence.
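
A small illustration (sizes are made up for the example): the encoding is simply added element-wise to the embeddings, and sequences shorter than max_seq_length are handled by slicing the stored buffer:

pos_enc = PositionalEncoding(d_model=512, max_seq_length=100)
emb = torch.rand(2, 20, 512)   # a batch of embedded sequences of length 20
print(pos_enc(emb).shape)      # torch.Size([2, 20, 512])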

Now, we will build the encoder and decoder layers.

6. Encoder layer

Figure 3. The encoder part of the transformer network (Source: image from the original paper)

The encoder layer consists of a multi-head attention layer, a position-wise feed-forward layer, and two layer normalization layers.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

The EncoderLayer class is initialized with the input parameters and its components: a MultiHeadAttention module, a PositionWiseFeedForward module, two layer normalization modules, and a dropout layer. The forward method computes the encoder layer output by applying self-attention, adding the attention output to the input tensor, and normalizing the result. It then computes the position-wise feed-forward output, combines it with the normalized self-attention output, and normalizes the final result before returning the processed tensor.
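
A minimal shape check for the encoder layer (the hyperparameters are illustrative, and the mask is omitted by passing None):

enc_layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.rand(2, 10, 512)            # (batch_size, seq_length, d_model)
print(enc_layer(x, mask=None).shape)  # torch.Size([2, 10, 512])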

7. Decoder layer

Figure 4. The decoder part of the transformer network (Source: image from the original paper)

The decoder layer consists of two multi-head attention layers, a position-wise feed-forward layer, and three layer normalization layers.

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

The decoder layer is initialized with the input parameters and its components: two multi-head attention modules (one for masked self-attention and one for cross-attention with the encoder output), a PositionWiseFeedForward module, three layer normalization modules, and a dropout layer.

The forward method computes the decoder layer output by performing the following steps:

  1. Compute the masked self-attention output and add it to the input tensor, followed by dropout and layer normalization.
  2. Compute the cross-attention output between the decoder representation and the encoder output, and add it to the normalized masked self-attention output, followed by dropout and layer normalization.
  3. Compute the position-wise feed-forward output and combine it with the normalized cross-attention output, followed by dropout and layer normalization.
  4. Return the processed tensor.

These operations enable the decoder to generate target sequences from the input and the encoder output.
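
To illustrate (with made-up sizes and no masks), the decoder layer takes both the target-side tensor and the encoder output, and the source and target lengths do not have to match:

dec_layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
enc_output = torch.rand(2, 10, 512)   # encoder output: (batch_size, src_len, d_model)
tgt = torch.rand(2, 12, 512)          # target-side input: (batch_size, tgt_len, d_model)
out = dec_layer(tgt, enc_output, src_mask=None, tgt_mask=None)
print(out.shape)                      # torch.Size([2, 12, 512])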

Now, let's combine the encoder and decoder layers to create the full transformer model.

8. Transformer model

Figure 5. The Transformer network (Source: image from the original paper)

Now we merge everything together:

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output

The Transformer class combines the previously defined modules to create the complete transformer model. During initialization, the Transformer module stores the input parameters and initializes its components: embedding layers for the source and target sequences, the positional encoding module, stacks of EncoderLayer and DecoderLayer modules, a linear layer for projecting the decoder output onto the target vocabulary, and a dropout layer.

The generate_mask method creates binary masks for the source and target sequences to ignore padding tokens and to prevent the decoder from attending to future tokens. The forward method computes the output of the transformer model through the following steps:

  1. Generate the source and target masks using the generate_mask method.
  2. Compute the source and target embeddings, and apply positional encoding and dropout.
  3. Process the source sequence through the encoder layers, updating the enc_output tensor.
  4. Process the target sequence through the decoder layers, using enc_output and the masks, and updating the dec_output tensor.
  5. Apply the linear projection layer to the decoder output, producing the output logits.

These steps enable the Transformer model to process an input sequence and generate an output sequence based on the combined functionality of its components.
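
To see what generate_mask produces, here is a small, illustrative example with a tiny Transformer instance (the hyperparameters are arbitrary; 0 is assumed to be the padding index, as in the code above):

tiny = Transformer(src_vocab_size=10, tgt_vocab_size=10, d_model=8, num_heads=2,
                   num_layers=1, d_ff=16, max_seq_length=5, dropout=0.0)
src = torch.tensor([[5, 7, 9, 0, 0]])   # one source sequence with two padding tokens
tgt = torch.tensor([[3, 4, 0, 0]])      # one target sequence with two padding tokens
src_mask, tgt_mask = tiny.generate_mask(src, tgt)
print(src_mask.shape)  # torch.Size([1, 1, 1, 5]) -> hides the padded source positions
print(tgt_mask.shape)  # torch.Size([1, 1, 4, 4]) -> padding mask combined with the no-peek mask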

9. Prepare sample data

        In this example, we will create a toy dataset for demonstration purposes. In practice, you would use a larger dataset, preprocess the text, and create vocabulary mappings for the source and target languages.

src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# Generate random sample data
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

10. Training the model

        Now, we will train the model using the sample data. In practice, you would use a larger dataset and split it into training and validation sets.

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

        Using this approach, we can build a simple transformer from scratch in PyTorch. All large language models are built from these transformer encoder or decoder blocks, so it is important to understand the network that started it all. I hope this article helps anyone who wants to understand LLMs in depth.
