Decoding the Transformer: A detailed description and code implementation of the self-attention mechanism and the encoder-decoder structure

This article comprehensively discusses the Transformer and its derivative models: it analyzes the self-attention mechanism and the encoder and decoder structures in depth, walks through code implementations to deepen understanding, and finally surveys Transformer-based models such as BERT and GPT. The aim is to explain in depth how the Transformer works and to demonstrate its broad impact in the field of artificial intelligence.

The author, TechLead, has more than 10 years of experience in Internet service architecture, AI product development, and team management. He studied at Tongji University and holds a master's degree from Fudan University, is a member of the Fudan Robot Intelligence Laboratory, an Alibaba Cloud certified senior architect, a certified project management professional, and the head of product development for an AI product line with billion-level revenue.

1. The Background of Transformer

The emergence of Transformer marks a milestone in the field of natural language processing. The following will comprehensively explain its background from three aspects: technical challenges, the rise of the self-attention mechanism, and Transformer's impact on the entire field.

1.1 Technical challenges and limitations of previous solutions

RNNs and LSTMs

Early sequence models such as RNNs and LSTMs, although they performed well in some scenarios, encountered many challenges in practice:

  • Computational efficiency : Due to the recurrent structure of RNNs, elements in a sequence must be processed one by one, so the computation cannot be parallelized.
  • Long-distance dependency problem : RNNs have difficulty capturing long-distance dependencies in sequences; LSTMs improve on this but are still not perfect.

Attempts at sequence processing with Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) can capture local dependencies and, by stacking multiple convolutional layers, extend the range of dependencies they can model. However, the fixed convolution window size limits the range of dependencies a single layer can capture, and the handling of global dependencies is not flexible enough, as the sketch below illustrates.
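
To make the window-size limitation concrete, here is a minimal PyTorch sketch (channel sizes and sequence length are example values, not from the article): with kernel_size=3, each output position of a single convolutional layer depends on only three neighboring inputs, so capturing longer-range dependencies requires stacking many layers.

import torch
import torch.nn as nn

# A single 1D convolution with a window of 3: each output position only
# sees 3 neighboring input positions (example sizes).
conv = nn.Conv1d(in_channels=8, out_channels=8, kernel_size=3, padding=1)
sequence = torch.rand(1, 8, 20)   # (batch, channels, sequence_length)
output = conv(sequence)           # receptive field per output position: 3 tokens
print(output.shape)               # torch.Size([1, 8, 20])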

1.2 The rise of self-attention mechanism

The self-attention mechanism addresses the above challenges:

  • Parallelized Computation : By observing all elements in a sequence simultaneously, the self-attention mechanism allows the model to process the entire sequence in parallel.
  • Capturing long-distance dependencies : The self-attention mechanism can effectively capture dependencies in a sequence regardless of how far apart the elements are.

The introduction of this mechanism makes the Transformer model a technological breakthrough.

1.3 Revolutionary impact of Transformer

The emergence of Transformer has had a profound impact on the entire field:

  • Setting New Standards : Transformer sets new performance benchmarks across multiple NLP tasks.
  • Promote new research and applications : Transformer's structure promotes many new research directions and practical applications, such as the birth of advanced models such as BERT and GPT.
  • Cross-field impact : In addition to natural language processing, Transformer has also had an impact on other fields such as bioinformatics and image processing.

2. Self-attention mechanism

2.1 Concept and working principle

The self-attention mechanism is a technique for capturing the relationships between elements within a sequence. It computes the similarity of each element with every other element, enabling the capture of global dependencies.

  • Weight calculation : Assigns a different weight to each element by calculating the similarities between the elements of the sequence.
  • Global Dependency Capture : Capable of capturing dependencies at arbitrary distances in a sequence, breaking through the limitations of previous models.

Element weight calculation

  • Query, Key, Value structure : Each element in the sequence is expressed as three parts: Query, Key, and Value.
  • Similarity measure : Use the dot product of Query and Key to calculate the similarity between elements.
  • Weight assignment : The similarities are converted into weights by the Softmax function.

For example, consider the weight calculation for an element:

import torch
import torch.nn.functional as F

# Query and Key (floating-point tensors so they can be multiplied together)
query = torch.tensor([1.0, 0.5])
key = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

# Similarity calculation (dot product)
similarity = query.matmul(key)

# Weight assignment
weights = F.softmax(similarity, dim=-1)
# Output: tensor([0.6225, 0.3775])

Weighted sum

The self-attention mechanism uses the calculated weights to compute a weighted sum of the Values, producing a new representation of each element.

value = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
output = weights.matmul(value)
# Output: tensor([1.7551, 2.7551])

The difference between self-attention and traditional attention

The main differences between the self-attention mechanism and traditional attention are:

  • Self-referencing : In self-attention, the sequence attends to itself rather than to an external sequence.
  • Global dependency capture : It is not limited by the local window and can capture dependencies at any distance in the sequence.

Computational efficiency

The self-attention mechanism processes the entire sequence in parallel rather than element by element, which yields remarkable computational efficiency (although its memory and compute cost still grow quadratically with sequence length).

  • Parallelization advantage : Self-attention computations for all positions can be performed simultaneously, improving the speed of training and inference.

Application in Transformer

In Transformer, the self-attention mechanism is a key component:

  • Multi-head attention : Through multi-head attention, the model can learn different dependencies at the same time, which enhances the expressiveness of the model.
  • Weight Visualization : Self-attention weights can be used to explain how the model works, increasing interpretability.

Cross-domain applications

The influence of the self-attention mechanism goes far beyond natural language processing:

  • Image Processing : Applications to tasks such as image segmentation and recognition.
  • Speech Recognition : Helps capture temporal dependencies in speech signals.

Future Trends and Challenges

Despite the remarkable success of self-attention, there is still room for research:

  • Computational and storage requirements : The quadratic complexity in sequence length creates memory and computational challenges for long sequences.
  • Interpretability and theoretical understanding : A deep understanding of the attention mechanism remains to be further explored.

2.2 Calculation process

Input representation

The input to the self-attention mechanism is a sequence, usually consisting of a set of word vectors or other elements. These elements will be converted into three parts: Query, Key, and Value respectively.

import torch.nn as nn

embedding_dim = 64
query_layer = nn.Linear(embedding_dim, embedding_dim)
key_layer = nn.Linear(embedding_dim, embedding_dim)
value_layer = nn.Linear(embedding_dim, embedding_dim)

Similarity Calculation

Through the dot product calculation of Query and Key, the similarity matrix between each element is obtained.

import torch

embedding_dim = 64

# Assume a sequence containing three elements
sequence = torch.rand(3, embedding_dim)

query = query_layer(sequence)
key = key_layer(sequence)
value = value_layer(sequence)

def similarity(query, key):
    return torch.matmul(query, key.transpose(-2, -1)) / (embedding_dim ** 0.5)

Weight assignment

The similarity matrix is normalized into weights with the Softmax function.

def compute_weights(similarity_matrix):
    return torch.nn.functional.softmax(similarity_matrix, dim=-1)

Weighted sum

The weight matrix is used to compute a weighted sum of the Values, giving the output.

def weighted_sum(weights, value):
    return torch.matmul(weights, value)
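
Putting the three steps together on the example sequence above (query, key, and value were computed earlier from the same three-element sequence), a quick sanity check might look like this:

similarity_matrix = similarity(query, key)     # (3, 3) scaled dot products
weights = compute_weights(similarity_matrix)   # each row sums to 1
output = weighted_sum(weights, value)          # new representation of each element
print(output.shape)                            # torch.Size([3, 64])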

Multi-head self-attention

In practical applications, multi-head attention is usually used to capture multi-faceted information in sequences.

class MultiHeadAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.embedding_dim = embedding_dim
        self.num_heads = num_heads
        self.head_dim = embedding_dim // num_heads
        
        self.query_layer = nn.Linear(embedding_dim, embedding_dim)
        self.key_layer = nn.Linear(embedding_dim, embedding_dim)
        self.value_layer = nn.Linear(embedding_dim, embedding_dim)
        self.fc_out = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, query, key, value):
        N = query.shape[0]
        query_len, key_len, value_len = query.shape[1], key.shape[1], value.shape[1]

        # Split into multiple heads
        queries = self.query_layer(query).view(N, query_len, self.num_heads, self.head_dim)
        keys = self.key_layer(key).view(N, key_len, self.num_heads, self.head_dim)
        values = self.value_layer(value).view(N, value_len, self.num_heads, self.head_dim)

        # Similarity calculation (scaled dot product)
        similarity_matrix = torch.einsum("nqhd,nkhd->nhqk", [queries, keys]) / (self.head_dim ** 0.5)

        # Weight assignment
        weights = torch.nn.functional.softmax(similarity_matrix, dim=-1)

        # Weighted sum
        attention = torch.einsum("nhql,nlhd->nqhd", [weights, values])

        # Concatenate the outputs of all heads (the einsum result is already ordered as (N, query_len, heads, head_dim))
        attention = attention.reshape(N, query_len, self.embedding_dim)

        # Integrate the output through a final linear layer
        output = self.fc_out(attention)

        return output
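
As a quick usage check (with example hyperparameters), self-attention is simply the case where query, key, and value all come from the same sequence:

embedding_dim, num_heads = 64, 8
attention = MultiHeadAttention(embedding_dim, num_heads)
x = torch.rand(2, 10, embedding_dim)   # (batch, seq_len, embedding_dim)
print(attention(x, x, x).shape)        # torch.Size([2, 10, 64])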


3. The structure of Transformer

3.1 Encoder

The encoder is one of the core components of the Transformer; its main task is to understand and process the input data. By combining the self-attention mechanism, a feed-forward neural network, normalization layers, and residual connections, the encoder builds a powerful sequence-to-sequence mapping tool. The self-attention mechanism enables the model to capture complex relationships within the sequence, the feed-forward network provides nonlinear computing capability, and the normalization layers and residual connections help stabilize the training process.
Below are the individual components of the encoder and their detailed descriptions.

3.1.1 Self-attention layer

The first part of the encoder is the self-attention layer. As mentioned earlier, self-attention enables the model to attend to all positions in the input sequence and encode each position based on this information.

class SelfAttentionLayer(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super(SelfAttentionLayer, self).__init__()
        self.multi_head_attention = MultiHeadAttention(embedding_dim, num_heads)
    
    def forward(self, x):
        return self.multi_head_attention(x, x, x)

3.1.2 Feedforward neural network

After the self-attention layer, the encoder includes a Feed-Forward Neural Network (FFNN). This network consists of two linear layers and an activation function.

class FeedForwardLayer(nn.Module):
    def __init__(self, embedding_dim, ff_dim):
        super(FeedForwardLayer, self).__init__()
        self.fc1 = nn.Linear(embedding_dim, ff_dim)
        self.fc2 = nn.Linear(ff_dim, embedding_dim)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

3.1.3 Normalization layer

In order to stabilize the training and speed up the convergence, each self-attention layer and feed-forward layer is followed by a normalization layer (Layer Normalization).

layer_norm = nn.LayerNorm(embedding_dim)

3.1.4 Residual connection

The Transformer also uses residual connections, adding each sub-layer's input to its output. This helps mitigate vanishing gradients and makes deep stacks of layers easier to train.

output = layer_norm(self_attention(x) + x)
output = layer_norm(feed_forward(output) + output)
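
For concreteness, here is a runnable version of this pattern, reusing the SelfAttentionLayer and FeedForwardLayer defined earlier (the hyperparameters are example values):

embedding_dim, num_heads, ff_dim = 64, 8, 256
self_attention = SelfAttentionLayer(embedding_dim, num_heads)
feed_forward = FeedForwardLayer(embedding_dim, ff_dim)
layer_norm1 = nn.LayerNorm(embedding_dim)
layer_norm2 = nn.LayerNorm(embedding_dim)

x = torch.rand(2, 10, embedding_dim)                  # (batch, seq_len, embedding_dim)
output = layer_norm1(self_attention(x) + x)           # attention sub-layer with residual
output = layer_norm2(feed_forward(output) + output)   # feed-forward sub-layer with residual
print(output.shape)                                   # torch.Size([2, 10, 64])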

3.1.5 Entire structure of encoder

The final encoder is formed by stacking N such layers, each bundling the sub-layers above together with residual connections and layer normalization.

class EncoderLayer(nn.Module):
    def __init__(self, embedding_dim, num_heads, ff_dim):
        super(EncoderLayer, self).__init__()
        self.self_attention = SelfAttentionLayer(embedding_dim, num_heads)
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.feed_forward = FeedForwardLayer(embedding_dim, ff_dim)
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and layer normalization
        x = self.norm1(self.self_attention(x) + x)
        # Feed-forward sub-layer with residual connection and layer normalization
        x = self.norm2(self.feed_forward(x) + x)
        return x


class Encoder(nn.Module):
    def __init__(self, num_layers, embedding_dim, num_heads, ff_dim):
        super(Encoder, self).__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(embedding_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
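
A short usage sketch (example hyperparameters): the encoder maps a batch of embedded sequences to representations of the same shape.

encoder = Encoder(num_layers=2, embedding_dim=64, num_heads=8, ff_dim=256)
x = torch.rand(2, 10, 64)   # (batch, seq_len, embedding_dim)
print(encoder(x).shape)     # torch.Size([2, 10, 64])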

3.2 Decoder

The decoder is responsible for generating the target sequence based on the encoder's output and the portion of the output sequence generated so far. It adopts a structure similar to the encoder's, but adds a masked self-attention layer and an encoder-decoder attention layer. The mask ensures that, when generating the output at each position, the decoder only uses earlier positions; the encoder-decoder attention layer lets the decoder make use of the encoder's output. With this structure, the decoder can generate target sequences consistent with both the context and the source sequence, providing a powerful solution for many complex sequence generation tasks.
Below are the main components of the decoder and how they work.

3.2.1 Self-attention layer

The first part of the decoder is a masked self-attention layer. This layer is similar to the self-attention layer in the encoder, but adds a mask that prevents each position from attending to positions after it.

def mask_future_positions(size):
    mask = (torch.triu(torch.ones(size, size)) == 1).transpose(0, 1)
    return mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))

sequence_length = 10  # example target sequence length
mask = mask_future_positions(sequence_length)
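
To see the effect of the mask (random example scores below), it is added to the raw attention scores before Softmax: future positions become -inf and therefore receive zero weight.

scores = torch.rand(sequence_length, sequence_length)   # raw query-key similarities
masked_weights = torch.softmax(scores + mask, dim=-1)
print(masked_weights)   # row i has non-zero weights only for positions 0..i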

3.2.2 Encoder-Decoder Attention Layer

The decoder also includes an encoder-decoder attention layer that allows the decoder to pay attention to the output of the encoder.

class EncoderDecoderAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super(EncoderDecoderAttention, self).__init__()
        self.multi_head_attention = MultiHeadAttention(embedding_dim, num_heads)
    
    def forward(self, queries, keys, values):
        return self.multi_head_attention(queries, keys, values)

3.2.3 Feedforward neural network

The decoder also has a feed-forward neural network with the same structure as the feed-forward neural network in the encoder.

3.2.4 Normalization layer and residual connection

These components are also the same as in the encoder and are used after each sublayer.

3.2.5 The complete structure of the decoder

Each decoder layer consists of a masked self-attention layer, an encoder-decoder attention layer, a feed-forward neural network, normalization layers, and residual connections; the decoder usually stacks N such layers.

class DecoderLayer(nn.Module):
    def __init__(self, embedding_dim, num_heads, ff_dim):
        super(DecoderLayer, self).__init__()
        self.self_attention = SelfAttentionLayer(embedding_dim, num_heads)
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.encoder_decoder_attention = EncoderDecoderAttention(embedding_dim, num_heads)
        self.norm2 = nn.LayerNorm(embedding_dim)
        self.feed_forward = FeedForwardLayer(embedding_dim, ff_dim)
        self.norm3 = nn.LayerNorm(embedding_dim)

    def forward(self, x, encoder_output):
        # Self-attention sub-layer with residual connection (in a full implementation,
        # the causal mask from 3.2.1 would be applied to the attention scores here)
        x = self.norm1(self.self_attention(x) + x)
        # Encoder-decoder attention: queries come from the decoder, keys and values from the encoder
        x = self.norm2(self.encoder_decoder_attention(x, encoder_output, encoder_output) + x)
        # Feed-forward sub-layer with residual connection
        x = self.norm3(self.feed_forward(x) + x)
        return x


class Decoder(nn.Module):
    def __init__(self, num_layers, embedding_dim, num_heads, ff_dim):
        super(Decoder, self).__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(embedding_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])

    def forward(self, x, encoder_output):
        for layer in self.layers:
            x = layer(x, encoder_output)
        return x
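
Finally, a minimal end-to-end sketch (example hyperparameters and random inputs) wiring the Encoder and Decoder above together:

embedding_dim, num_heads, ff_dim, num_layers = 64, 8, 256, 2
encoder = Encoder(num_layers, embedding_dim, num_heads, ff_dim)
decoder = Decoder(num_layers, embedding_dim, num_heads, ff_dim)

source = torch.rand(2, 12, embedding_dim)   # embedded source sequence (batch, src_len, dim)
target = torch.rand(2, 9, embedding_dim)    # embedded target sequence generated so far

encoder_output = encoder(source)
decoder_output = decoder(target, encoder_output)
print(decoder_output.shape)                 # torch.Size([2, 9, 64])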

4. Various models based on Transformer

Transformer-based models continue to emerge, providing powerful tools for a variety of NLP and other sequence processing tasks. From generating text to understanding context, these models have different advantages and characteristics, and together promote the rapid development of the field of natural language processing. What these models have in common is that they all adopt the core concept of the original Transformer and make various innovations and improvements on this basis. In the future, more Transformer-based models can be expected to continue to emerge, further expanding its application scope and influence.

4.1 BERT (Bidirectional Encoder Representations from Transformers)

BERT is a Transformer encoder-based model for generating context-sensitive word embeddings. Unlike traditional word embedding methods, BERT can understand the specific meaning of words in sentences.

Main features

  • Bidirectional training that captures contextual information from both directions
  • Extensive pre-training for a variety of downstream tasks
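
To make the "context-sensitive" point concrete, here is a small sketch that assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is used elsewhere in this article); the same word receives different embeddings in different sentences.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for text in ["He sat by the river bank.", "She deposited cash at the bank."]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    # Locate the token "bank" and print the first few dimensions of its contextual embedding
    bank_index = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    print(text, hidden_states[0, bank_index, :3])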

4.2 GPT (Generative Pre-trained Transformer)

Unlike BERT, GPT focuses on generating text using Transformer decoders. GPT is pre-trained as a language model and fine-tuned for various generative tasks.

Main features

  • Generates text from left to right
  • High flexibility in a variety of generation tasks

4.3 Transformer-XL (Attentive Language Models Beyond a Fixed-Length Context)

Transformer-XL addresses the context length limitation of the original Transformer model by introducing a reusable memory mechanism.

Main features

  • Captures longer context dependencies
  • Memory mechanism improves efficiency

4.4 T5 (Text-to-Text Transfer Transformer)

The T5 model treats all NLP tasks as text-to-text problems. This unified framework makes it very easy to switch between different tasks.

Main features

  • Versatility, suitable for a variety of NLP tasks
  • Simplifies the need for task-specific architectures

4.5 XLNet

XLNet is a generalized autoregressive pre-training model that uses permutation language modeling to combine the bidirectional context of BERT with the autoregressive advantages of GPT.

Main features

  • Combines bidirectional context modeling with autoregressive generation
  • Provides an effective pre-training method

4.6 DistilBERT

DistilBERT is a lightweight version of the BERT model that retains most of its performance but with a significantly reduced model size.

Main features

  • Fewer parameters and calculations
  • Suitable for scenarios with limited resources

4.7 ALBERT (A Lite BERT)

ALBERT is another optimization of BERT that reduces the number of parameters while improving training speed and model performance.

Main features

  • Cross-layer parameter sharing
  • Faster training

5. Summary

Since its introduction, Transformer has profoundly changed the face of natural language processing and many other sequence processing tasks. Through its unique self-attention mechanism, Transformer overcomes many limitations of previous models, enabling higher parallelization and more flexible dependency capture.

In this article, we explored the following aspects of the Transformer in detail:

  1. Background : How the Transformer was born out of the limitations of RNNs and CNNs, and how it processes sequences through the self-attention mechanism.
  2. Self-attention mechanism : Explains in detail the calculation process of the self-attention mechanism and how it allows the model to establish dependencies between different positions.
  3. Structure of Transformer : Gain an in-depth look at the structure of Transformer's encoder and decoder, and how the various components work together.
  4. Various models based on Transformer : Discuss a series of models based on Transformer, such as BERT, GPT, T5, etc., and understand their characteristics and applications.

Transformer not only promotes research and applications in the field of natural language processing, but also demonstrates its potential in other fields, such as bioinformatics, image analysis, etc. Many modern state-of-the-art models are based on Transformer, taking advantage of its flexible and efficient structure to solve previously intractable problems.

In the future, we can expect Transformer and its derivative models to continue to play an important role in a wider range of fields, continuously innovating and promoting the development of the field of artificial intelligence.


Origin blog.csdn.net/magicyangjay111/article/details/132253257