Attention Mechanism (5): Principles and Implementation of the Transformer Architecture, with a Hands-On Machine Translation Example


attention mechanism

Attention Mechanism is an artificial intelligence technology that allows neural networks to focus on key information while ignoring unimportant parts when processing sequence data. The attention mechanism has been widely used in natural language processing, computer vision, speech recognition and other fields.

The main idea of the attention mechanism is to make the model pay more attention to important inputs by assigning different weights to input signals at different positions when processing sequence data. For example, when processing a sentence, the attention mechanism can adjust the model's attention to each word according to the importance of each word. This technique can improve the performance of the model, especially when dealing with long sequence data.

In deep learning models, attention mechanisms are often implemented by adding additional network layers that learn how to compute weights and apply those weights to the input signal. Common attention mechanisms include self-attention, multi-head attention, etc.

In conclusion, the attention mechanism is a very useful technique that can help neural networks to better process sequential data and improve the performance of the model.



self-attention

Self-attention is an attention mechanism in deep learning. It was first proposed in the paper "Attention Is All You Need" to solve sequence-to-sequence problems in natural language processing, such as machine translation.

The self-attention mechanism allows the model to automatically learn the dependencies between different positions in a text sequence, and thus better understand the relationships between different parts of the sequence. In self-attention, each element of the input sequence is represented as a vector, from which a query (Query), a key (Key) and a value (Value) are derived.

In self-attention, each query is dotted with all keys to obtain attention weights that represent the relevance of the query to each key. These attention weights are then used to compute a weighted sum of the values, which gives the output of self-attention.

The advantage of the self-attention mechanism is that it can handle long sequence inputs because, unlike a recurrent neural network, it does not need to carry all the historical information through a hidden state. Self-attention can also learn more complex relationships, such as long-range dependencies, which are very important for tasks such as text generation and speech recognition.

The following is the formula for the self-attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Suppose the input sequence is $X = [x_1, x_2, ..., x_n]$, where $x_i$ denotes the $i$-th element of the input sequence. Each input vector $x_i$ is a $d$-dimensional vector.

First, each input vector $x_i$ is mapped to three vectors $q_i$, $k_i$, $v_i$, all of dimension $d$. Specifically, for each $i$:

$q_i = W_q x_i$

$k_i = W_k x_i$

$v_i = W_v x_i$

where $W_q$, $W_k$, $W_v$ are three learnable weight matrices.

Then, the dot product of each query vector $q_i$ with every key vector $k_j$ is computed and normalized by a softmax function to obtain the attention weights $\alpha_{i,j}$:

$\alpha_{i,j} = \text{softmax}(q_i^\top k_j / \sqrt{d})$

where the division by $\sqrt{d}$ alleviates the excessively large values that the dot product may produce.

Finally, the attention weights $\alpha_{i,j}$ are used to form a weighted sum of all value vectors $v_j$, giving the output vector $o_i$ corresponding to each query vector $q_i$:

$o_i = \sum_{j=1}^{n} \alpha_{i,j} v_j$

The output of self-attention is the collection of the output vectors $o_i$ corresponding to all query vectors $q_i$, which can be written as a matrix $O = [o_1, o_2, ..., o_n]$.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, input_dim, num_heads):
        super(SelfAttention, self).__init__()
        self.num_heads = num_heads
        self.q_linear = nn.Linear(input_dim, input_dim)
        self.k_linear = nn.Linear(input_dim, input_dim)
        self.v_linear = nn.Linear(input_dim, input_dim)
        self.output_linear = nn.Linear(input_dim, input_dim)

    def forward(self, x):
        # x shape: batch_size x seq_len x input_dim

        batch_size, seq_len, input_dim = x.size()

        # Project the input vectors to queries, keys, and values
        queries = self.q_linear(x).view(batch_size, seq_len, self.num_heads, input_dim // self.num_heads).transpose(1, 2)
        keys = self.k_linear(x).view(batch_size, seq_len, self.num_heads, input_dim // self.num_heads).transpose(1, 2)
        values = self.v_linear(x).view(batch_size, seq_len, self.num_heads, input_dim // self.num_heads).transpose(1, 2)

        # Compute the dot product of queries and keys
        dot_product = torch.matmul(queries, keys.transpose(-2, -1)) / (input_dim // self.num_heads) ** 0.5

        # Apply the softmax function to obtain attention weights
        attention_weights = torch.softmax(dot_product, dim=-1)

        # Compute the weighted sum of values
        weighted_sum = torch.matmul(attention_weights, values)

        # Reshape the output and apply a linear transformation
        weighted_sum = weighted_sum.transpose(1, 2).contiguous().view(batch_size, seq_len, input_dim)
        output = self.output_linear(weighted_sum)

        return output

First, we define a PyTorch module called SelfAttention that includes four linear layers: q_linear, k_linear, v_linear, and output_linear. These layers map the input vector $x$ to the query, key and value vectors and to the output vector, respectively.

In the forward method, we first read the shape of the input x as (batch_size, seq_len, input_dim), where batch_size is the batch size, seq_len is the sequence length, and input_dim is the dimension of the input vector.

Then, we pass the input x through the q_linear, k_linear and v_linear layers and reshape their outputs to (batch_size, seq_len, num_heads, input_dim // num_heads). Here, num_heads is the number of attention heads: the input_dim dimensions of the input are split into num_heads sub-vectors, and attention weights are computed for each sub-vector independently. The dimension of each sub-vector is therefore input_dim // num_heads.

Next, we transpose the seq_len and num_heads dimensions, so that queries, keys and values have shape (batch_size, num_heads, seq_len, input_dim // num_heads), which is convenient for the subsequent batched matrix multiplications.

Then, we compute the dot product of queries and keys and divide the result by $\sqrt{d_k}$, where $d_k$ = input_dim // num_heads is the dimension of each head. Dividing by $\sqrt{d_k}$ prevents the dot products from becoming too large or too small.

Then, we apply the softmax operation to dot_product to obtain the attention weights attention_weights. The softmax is taken over the last dimension, i.e. over the key positions for each query.

Next, we use the attention weights attention_weights to compute a weighted sum of values, which gives the weighted vector weighted_sum.

Finally, we reshape weighted_sum back to (batch_size, seq_len, input_dim) and pass it through the output_linear layer to obtain the final output vector output.

In summary, this code implements a self-attention layer with a multi-head attention mechanism: it maps the input vector $x$ to an output vector $y$, computing separate attention weights for each of the num_heads sub-vectors of the input, so that different sub-spaces are weighted differently. This helps the model attend to different kinds of information in the input. At the same time, the multi-head mechanism improves the representational capacity of the model and makes it easier to capture long-distance dependencies.
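
As a quick sanity check of the module above (a minimal sketch with toy sizes chosen only for illustration), we can run it on a random tensor and confirm that the output shape matches the input shape:

# Toy sizes (assumed for illustration); input_dim must be divisible by num_heads
batch_size, seq_len, input_dim, num_heads = 2, 5, 16, 4
self_attn = SelfAttention(input_dim, num_heads)
x = torch.randn(batch_size, seq_len, input_dim)
print(self_attn(x).shape)  # torch.Size([2, 5, 16])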

Positional encoding

Why Use Positional Encoding

In the Transformer model, the input is a sequence of words in a sentence. For humans, it is easy to see the position of each word in the sentence, that is, the positional information, for example:

(1) Absolute position information. a1 is the first token, a2 is the second token...
(2) Relative position information. a2 is one digit behind a1, a4 is two digits behind a2...
(3) The distance between different positions. a1 and a3 are two places apart, a1 and a4 are three places apart...

But this is very difficult for a machine. The self-attention in the Transformer can learn the correlations between the words in a sentence and attend to the important information, but it cannot learn the positional information of each word, so we need to supply the token position information to the model.

The Transformer model abandons RNNs and CNNs as the basic building blocks for sequence learning. A recurrent neural network is itself a sequential structure, which inherently carries the positional information of the words in a sequence. When the recurrent structure is discarded and attention is used exclusively, the word-order information is lost, and the model has no way of knowing the relative and absolute positions of the words in a sentence. Therefore, a word-order signal must be added to the word vectors to help the model learn this information, and positional encoding is the method used to solve this problem.

Positional encoding re-represents each word in the sequence together with its positional information, so that the input data carries position information and the model can discover positional patterns. As mentioned above, the Transformer model does not have the ability to learn word order the way an RNN does; the word-order information must be fed to the model explicitly. The original input of the model is a word vector without word-order information, and positional encoding combines the word-order information with the word vector to form a new representation that is fed to the model, giving the model the ability to use word order.

Computing the positional encoding

Suppose the input representation $X \in \mathbb{R}^{n \times d}$ contains the $d$-dimensional embeddings of the $n$ tokens of a sequence. Positional encoding uses a positional embedding matrix $P \in \mathbb{R}^{n \times d}$ of the same shape and outputs $X + P$, where row $i$ of $P$ is the positional encoding of the token at position $i$:

$$P_{i,2j} = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \qquad P_{i,2j+1} = \cos\!\left(\frac{i}{10000^{2j/d}}\right)$$

Implementation of position encoding

import math

from d2l import torch as d2l

#@save
class PositionalEncoding(nn.Module):
    """Positional encoding"""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(dropout)
        # Create a long enough P
        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(max_len, dtype=torch.float32).reshape(
            -1, 1) / torch.pow(10000, torch.arange(
            0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X)

    def forward(self, X):
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        return self.dropout(X)
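
To see the different frequencies, the snippet below (a small sketch following the d2l plotting convention; encoding_dim and the column range 6-10 are arbitrary illustration choices) plots a few columns of the positional encoding matrix:

encoding_dim, num_steps = 32, 60
pos_encoding = PositionalEncoding(encoding_dim, 0)
pos_encoding.eval()
X = pos_encoding(torch.zeros((1, num_steps, encoding_dim)))
P = pos_encoding.P[:, :X.shape[1], :]
d2l.plot(torch.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)',
         figsize=(6, 2.5), legend=["Col %d" % d for d in torch.arange(6, 10)])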

Absolute positional encoding

You might be wondering, how can a combination of sine and cosine represent a position/order?

It's actually very simple. Consider how a number is represented in binary:

(figure: binary representations of consecutive integers)

The bit at position $i$ alternates once every $2^i$ numbers: the lowest bit flips the fastest, and higher bits flip more and more slowly. The sine and cosine functions of decreasing frequency play the same role in the positional encoding.

The figure below used sine/cosine encoding with a sentence length of 50 (vertical axis) and an encoding dimension of 128 (horizontal axis). The alternation frequency gradually slows down from left to right. Each row is the positional encoding of one token, and we can clearly see that the positional encodings of the first word and the last word are completely different.

(figure: heatmap of the positional encoding matrix)

relative position information

In addition to capturing absolute positional information, the positional encoding above also allows a model to learn relative positional information in the input sequence. This is because, for any fixed position offset $a$, the positional encoding at position $i+a$ can be represented as a linear projection of the positional encoding at position $i$.

Mathematically, let $\omega_j = 1/10000^{2j/d}$. For any fixed offset $a$, any pair $(p_{i,2j}, p_{i,2j+1})$ can be linearly projected to $(p_{i+a,2j}, p_{i+a,2j+1})$, as the derivation below shows.
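
Writing the projection out with the angle-addition identities (a short worked derivation added here for completeness) gives

$$
\begin{bmatrix} \cos(a\omega_j) & \sin(a\omega_j) \\ -\sin(a\omega_j) & \cos(a\omega_j) \end{bmatrix}
\begin{bmatrix} p_{i,2j} \\ p_{i,2j+1} \end{bmatrix}
=
\begin{bmatrix} \cos(a\omega_j)\sin(i\omega_j) + \sin(a\omega_j)\cos(i\omega_j) \\ \cos(a\omega_j)\cos(i\omega_j) - \sin(a\omega_j)\sin(i\omega_j) \end{bmatrix}
=
\begin{bmatrix} \sin((i+a)\omega_j) \\ \cos((i+a)\omega_j) \end{bmatrix}
=
\begin{bmatrix} p_{i+a,2j} \\ p_{i+a,2j+1} \end{bmatrix},
$$

where the $2\times 2$ projection matrix depends only on the offset $a$ and the frequency $\omega_j$, not on the position $i$.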


Transformer architecture

Model

Transformer is a neural network model based on self-attention mechanism for processing sequence-to-sequence (Sequence-to-Sequence) tasks, such as machine translation, text summarization, etc. It was proposed by Google and is considered to be one of the most advanced models in the field of natural language processing.

The most important component in the Transformer model is the self-attention mechanism, which can capture the relationship between different positions in the input sequence, thereby improving the performance of the model when processing long sequences. In addition, Transformer also uses techniques such as residual connection and layer normalization to make model training more stable and efficient.

The Transformer model mainly includes the following parts:

Input embedding layer: maps the words in the input sequence to a continuous vector space to facilitate subsequent processing.

Position encoding layer: In order to consider the relative position information between different positions in the sequence, each position in the input sequence needs to be encoded to obtain a position encoding vector.

Self-attention layer: Self-attention calculation is performed for each position in the input sequence, so as to capture the dependencies between different positions.

Feedforward network layer: A simple feedforward network process is performed on the self-attention output vector of each position, thereby enhancing the nonlinear ability of the model.

Output layer: Linearly transform the output vector of the last layer, and use the softmax function to get the probability distribution of each output word.

During the training process, the Transformer model uses an attention mechanism and a mask mechanism to avoid leakage of future information, and uses a cross-entropy loss function to evaluate the performance of the model. During inference, the Transformer model uses the Beam Search algorithm to generate the optimal output sequence.
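
As an illustration of the masking idea (a generic sketch only; the implementation later in this article achieves the same effect with valid lengths, dec_valid_lens), a causal mask sets the attention scores of future positions to negative infinity before the softmax:

# Generic causal-mask illustration; not part of the d2l code used below
num_steps = 5
causal_mask = torch.triu(
    torch.ones(num_steps, num_steps, dtype=torch.bool), diagonal=1)
scores = torch.randn(num_steps, num_steps)               # raw attention scores
scores = scores.masked_fill(causal_mask, float('-inf'))  # hide future positions
attn = torch.softmax(scores, dim=-1)                     # row t attends only to positions <= t
print(attn)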

In short, the Transformer model performs well in sequence-to-sequence tasks, has good scalability and applicability, and has become one of the important research directions in the field of natural language processing.

(figure: the overall Transformer architecture, with the encoder stack on the left and the decoder stack on the right)
The architecture of the Transformer is outlined in the figure. From a macro point of view, the Transformer encoder is a stack of multiple identical layers, and each layer has two sublayers. The first sublayer is multi-head self-attention pooling; the second sublayer is a position-wise feed-forward network. Specifically, when computing the encoder's self-attention, the queries, keys, and values all come from the output of the previous encoder layer. Each sublayer uses a residual connection followed by layer normalization.

The Transformer decoder is also a stack of multiple identical layers, with residual connections and layer normalization inside each layer. In addition to the two sublayers described for the encoder, the decoder inserts a third sublayer between them, called the encoder-decoder attention layer. In encoder-decoder attention, the queries come from the output of the previous decoder layer, while the keys and values come from the output of the entire encoder. In decoder self-attention, the queries, keys, and values all come from the output of the previous decoder layer; however, each position in the decoder can only attend to positions up to and including that position. This masked attention preserves the auto-regressive property, ensuring that predictions only depend on the output tokens that have already been generated.

Position-Based Feedforward Networks

First, we implement the position-wise feed-forward network sublayer shown in the architecture diagram.

The position-based feed-forward network transforms the representation at every position in the sequence with the same multilayer perceptron (MLP), which is why it is called positionwise. In the implementation below, an input X of shape (batch size, number of time steps or sequence length, number of hidden units or feature dimension) is transformed by a two-layer perceptron into an output tensor of shape (batch size, number of time steps, ffn_num_outputs); that is, only the last dimension changes.

#@save
class PositionWiseFFN(nn.Module):
    """基于位置的前馈网络"""
    def __init__(self, ffn_num_input, ffn_num_hiddens, ffn_num_outputs,
                 **kwargs):
        super(PositionWiseFFN, self).__init__(**kwargs)
        self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_outputs)

    def forward(self, X):
        return self.dense2(self.relu(self.dense1(X)))

The example below shows that changing the size of the innermost dimension of a tensor changes the output size of a position-based feedforward network. Because the same multilayer perceptron is used to transform the input at all positions, when the input to all these positions is the same, their output is also the same.

ffn = PositionWiseFFN(4, 4, 8)
ffn.eval()
ffn(torch.ones((2, 3, 4)))[0]

(output: a tensor of shape (3, 8); because the inputs at all positions are identical, the three output rows are identical)

Residual connections and layer normalization

Let us now focus on the addition and normalization (add & norm) component in the figure. As mentioned at the beginning of this section, it consists of a residual connection followed by layer normalization. Both are key to building effective deep architectures.

Layer normalization is a normalization method commonly used in neural networks; its purpose is to improve the training efficiency and performance of the network. Unlike batch normalization, layer normalization normalizes a single sample rather than a batch of samples.

Specifically, for an input $x = (x_1, x_2, ..., x_d)$ with $d$ features, layer normalization normalizes the features of each sample so that they have zero mean and unit standard deviation:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where $\mu$ is the mean of the features of $x$, $\sigma^2$ is their variance, and $\epsilon$ is a small constant (usually $10^{-5}$) used to avoid division by zero. Layer normalization then scales and shifts each feature:

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable parameters used to scale and shift each feature. In this way, layer normalization keeps the features on a common scale, which improves the generalization performance of the model.

Although batch normalization is widely used in computer vision, in natural language processing tasks (where the input is usually a sequence of variable length) batch normalization is usually not as good as layer normalization.

The following code compares the effects of layer normalization and batch normalization for different dimensions.

ln = nn.LayerNorm(2)
bn = nn.BatchNorm1d(2)
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Compute the mean and variance of X in training mode
print('layer norm:', ln(X), '\nbatch norm:', bn(X))

(output: layer normalization normalizes each row of X to zero mean and unit variance, while batch normalization normalizes each column)
The AddNorm class can now be implemented using a residual connection followed by layer normalization. Dropout is also applied as a regularization method:

#@save
class AddNorm(nn.Module):
    """残差连接后进行层规范化"""
    def __init__(self, normalized_shape, dropout, **kwargs):
        super(AddNorm, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        return self.ln(self.dropout(Y) + X)
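
The residual connection requires the two inputs to have the same shape, so the output shape equals the input shape. A quick check (toy sizes assumed, following the d2l example):

add_norm = AddNorm([3, 4], 0.5)
add_norm.eval()
print(add_norm(torch.ones((2, 3, 4)), torch.ones((2, 3, 4))).shape)
# torch.Size([2, 3, 4])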

Multi-head attention

Next, we implement the multi-head attention component.
This part of the code was explained in detail in a previous post of this series: https://blog.csdn.net/qq_51957239/article/details/129732592?spm=1001.2014.3001.5502

def transpose_qkv(X, num_heads):
    """Transposition for parallel computation of multiple attention heads.

    Defined in :numref:`sec_multihead-attention`"""
    # Shape of input `X`:
    # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`).
    # Shape of output `X`:
    # (`batch_size`, no. of queries or key-value pairs, `num_heads`,
    # `num_hiddens` / `num_heads`)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)

    # Shape of output `X`:
    # (`batch_size`, `num_heads`, no. of queries or key-value pairs,
    # `num_hiddens` / `num_heads`)
    X = X.permute(0, 2, 1, 3)

    # Shape of `output`:
    # (`batch_size` * `num_heads`, no. of queries or key-value pairs,
    # `num_hiddens` / `num_heads`)
    return X.reshape(-1, X.shape[2], X.shape[3])


def transpose_output(X, num_heads):
    """Reverse the operation of `transpose_qkv`.

    Defined in :numref:`sec_multihead-attention`"""
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)
class MultiHeadAttention(nn.Module):
    """Multi-head attention.

    Defined in :numref:`sec_multihead-attention`"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 num_heads, dropout, bias=False, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=bias)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=bias)
        self.W_v = nn.Linear(value_size, num_hiddens, bias=bias)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=bias)

    def forward(self, queries, keys, values, valid_lens):
        # Shape of `queries`, `keys`, or `values`:
        # (`batch_size`, no. of queries or key-value pairs, `num_hiddens`)
        # Shape of `valid_lens`:
        # (`batch_size`,) or (`batch_size`, no. of queries)
        # After transposing, shape of output `queries`, `keys`, or `values`:
        # (`batch_size` * `num_heads`, no. of queries or key-value pairs,
        # `num_hiddens` / `num_heads`)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)

        if valid_lens is not None:
            # On axis 0, copy the first item (scalar or vector) for
            # `num_heads` times, then copy the next item, and so on
            valid_lens = torch.repeat_interleave(
                valid_lens, repeats=self.num_heads, dim=0)

        # Shape of `output`: (`batch_size` * `num_heads`, no. of queries,
        # `num_hiddens` / `num_heads`)
        output = self.attention(queries, keys, values, valid_lens)

        # Shape of `output_concat`:
        # (`batch_size`, no. of queries, `num_hiddens`)
        output_concat = transpose_output(output, self.num_heads)
        return self.W_o(output_concat)
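
A quick shape check of the multi-head attention layer (toy sizes assumed, following the d2l example): the output keeps the number of queries of the input and has num_hiddens features.

num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens,
                               num_hiddens, num_heads, 0.5)
attention.eval()
batch_size, num_queries, num_kvpairs = 2, 4, 6
valid_lens = torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))
print(attention(X, Y, Y, valid_lens).shape)  # torch.Size([2, 4, 100])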

Encoder

encoder block

Next, we implement the Transformer encoder block.

#@save
class EncoderBlock(nn.Module):
    """Transformer编码器块"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, use_bias=False, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout,
            use_bias)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        self.ffn = PositionWiseFFN(
            ffn_num_input, ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(norm_shape, dropout)

    def forward(self, X, valid_lens):
        Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))  # Self-attention
        return self.addnorm2(Y, self.ffn(Y))

This code defines a class EncoderBlock for a Transformer encoder block. In the __init__ function, it defines the following sublayers:

self.attention: a multi-head attention layer that uses the input as query, key, and value. It passes the input as three arguments to d2l.MultiHeadAttention and uses valid_lens (a 1D tensor in which each element is the valid length of the corresponding sequence) to mask invalid padding.
self.addnorm1: an add & norm layer that sums the input with the output of the multi-head attention and applies layer normalization.
self.ffn: a position-wise feed-forward network that processes the output of the previous sublayer.
self.addnorm2: another add & norm layer that sums the output of the position-wise feed-forward network with its input and applies layer normalization.

In the forward function, the input X and the valid lengths valid_lens are passed to the multi-head attention layer; its output is combined with X by the first add & norm layer, and the result is then passed through the position-wise feed-forward network and the second add & norm layer. The function returns the output of the second add & norm layer, as the quick check below confirms.
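
A quick shape check under assumed toy dimensions (following the d2l example): the output of a Transformer encoder block has the same shape as its input.

X = torch.ones((2, 100, 24))
valid_lens = torch.tensor([3, 2])
encoder_blk = EncoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5)
encoder_blk.eval()
print(encoder_blk(X, valid_lens).shape)  # torch.Size([2, 100, 24])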

Encoder

Next, we stack the encoder blocks to build the complete Transformer encoder.

#@save
class TransformerEncoder(d2l.Encoder):
    """Transformer编码器"""
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, use_bias=False, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module("block"+str(i),
                EncoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, use_bias))

    def forward(self, X, valid_lens, *args):
        # Because the positional encoding values are between -1 and 1,
        # the embedding values are multiplied by the square root of the
        # embedding dimension to rescale them before adding the positional encoding.
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            X = blk(X, valid_lens)
            self.attention_weights[
                i] = blk.attention.attention.attention_weights
        return X

Specifically, the TransformerEncoder class inherits from the d2l.Encoder class. It defines a word embedding layer and a sequential container of multiple EncoderBlocks. Each EncoderBlock consists of two sublayers, a multi-head attention layer and a position-wise feed-forward network, which process the input in turn.

In the forward function, the input is first passed through the embedding layer for word vector encoding, then multiplied by the square root of the embedding dimension for scaling, and finally the position encoding is added. The positional encoding is implemented by the PositionalEncoding class, which is used to encode each position in the sequence to help the model learn the relative positional relationship of elements in the sequence.

Then, the input is processed by multiple EncoderBlocks in turn, the output of each EncoderBlock serving as the input of the next one. Inside each EncoderBlock, the input first goes through the multi-head attention layer for self-attention, the result is combined with the input by residual connection and layer normalization, and it is then passed through the position-wise feed-forward network followed by another residual connection and layer normalization. Finally, the function returns the output of the last EncoderBlock as the output of the entire encoder.

Note that in each EncoderBlock, the attention weights of the multi-head attention layer will be recorded in the self.attention_weights list, which can be used for visualization and debugging.

The shape of the Transformer encoder output is (batch size, number of time steps, num_hiddens), as the check below confirms.
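
A quick check with a toy two-layer encoder (vocabulary size 200 and the other sizes assumed only for illustration):

encoder = TransformerEncoder(
    200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
encoder.eval()
valid_lens = torch.tensor([3, 2])
print(encoder(torch.ones((2, 100), dtype=torch.long), valid_lens).shape)
# torch.Size([2, 100, 24])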

decoder

decoder block

Next, we implement the Transformer decoder block.

class DecoderBlock(nn.Module):
    """解码器中第i个块"""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
                 dropout, i, **kwargs):
        super(DecoderBlock, self).__init__(**kwargs)
        self.i = i
        self.attention1 = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout)
        self.addnorm1 = AddNorm(norm_shape, dropout)
        self.attention2 = d2l.MultiHeadAttention(
            key_size, query_size, value_size, num_hiddens, num_heads, dropout)
        self.addnorm2 = AddNorm(norm_shape, dropout)
        self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,
                                   num_hiddens)
        self.addnorm3 = AddNorm(norm_shape, dropout)

    def forward(self, X, state):
        enc_outputs, enc_valid_lens = state[0], state[1]
        # During training, all the tokens of the output sequence are processed
        # at the same time, so state[2][self.i] is initialized to None.
        # During prediction, the output sequence is decoded token by token,
        # so state[2][self.i] contains the representations decoded by the
        # i-th block up to the current time step.
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = torch.cat((state[2][self.i], X), axis=1)
        state[2][self.i] = key_values
        if self.training:
            batch_size, num_steps, _ = X.shape
            # Shape of dec_valid_lens: (batch_size, num_steps), where every
            # row is [1, 2, ..., num_steps]
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None

        # Self-attention
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        Y = self.addnorm1(X, X2)
        # Encoder-decoder attention.
        # Shape of enc_outputs: (batch_size, num_steps, num_hiddens)
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state

This is one block of the decoder. Each block receives as input the output of the previous decoder block (or the embedded target sequence, for the first block) together with the current state, which holds the encoder outputs, the encoder valid lengths, and the outputs already decoded by this block; it returns the block's output and the updated state.

The block is structured as follows:

First, the input of the current block is concatenated with the historical outputs stored in the state to form the keys and values of the decoder self-attention. During training, the tokens of the whole output sequence are processed at the same time, so the history in the state is initialized to None; during inference, the output sequence is decoded token by token, so the history in the state contains all the outputs decoded so far by this block.

Next, the input is processed with multi-head self-attention.

The output of the self-attention is added to the input and layer normalization is performed.

The result is then processed with encoder-decoder attention. Here the queries come from the output of the previous sublayer, while the keys and values come from the encoder output.

The output of the encoder-decoder attention is added to its input, and layer normalization is performed again.

Finally, the result is processed by the position-wise feed-forward network, followed by one more addition and normalization. The block returns its output and the updated state; a shape check follows below.
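
A quick shape check (reusing encoder_blk and valid_lens from the encoder-block check above, with the block index i set to 0): the output of the decoder block keeps the shape of its input.

decoder_blk = DecoderBlock(24, 24, 24, 24, [100, 24], 24, 48, 8, 0.5, 0)
decoder_blk.eval()
X = torch.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
print(decoder_blk(X, state)[0].shape)  # torch.Size([2, 100, 24])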

decoder

Next, we stack the decoder blocks to build the complete Transformer decoder.

class TransformerDecoder(d2l.AttentionDecoder):
    def __init__(self, vocab_size, key_size, query_size, value_size,
                 num_hiddens, norm_shape, ffn_num_input, ffn_num_hiddens,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_layers):
            self.blks.add_module("block"+str(i),
                DecoderBlock(key_size, query_size, value_size, num_hiddens,
                             norm_shape, ffn_num_input, ffn_num_hiddens,
                             num_heads, dropout, i))
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        return [enc_outputs, enc_valid_lens, [None] * self.num_layers]

    def forward(self, X, state):
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]
        for i, blk in enumerate(self.blks):
            X, state = blk(X, state)
            # Decoder self-attention weights
            self._attention_weights[0][
                i] = blk.attention1.attention.attention_weights
            # Encoder-decoder attention weights
            self._attention_weights[1][
                i] = blk.attention2.attention.attention_weights
        return self.dense(X), state

    @property
    def attention_weights(self):
        return self._attention_weights

The following is a detailed explanation of each part:

Constructor __init__()
vocab_size: Vocabulary size, that is, the number of distinct words in the vocabulary.
key_size, query_size, value_size: Dimension sizes of key, query and value in Transformer model.
num_hiddens: Dimension size of hidden units.
norm_shape: The shape of the normalization layer.
ffn_num_input, ffn_num_hiddens: The input and hidden layer dimensions of the feedforward layer.
num_heads: The number of heads in the multi-head attention mechanism.
num_layers: The number of Transformer layers in the decoder.
dropout: The probability of the dropout layer.
**kwargs: Additional arguments.

init_state() function: used to initialize the state of the decoder. It returns a list whose first element is the encoder output, whose second element is the valid lengths of the encoder input, and whose third element is a list holding the cached state of each decoder layer.

forward() function: this function takes the input X and the state and outputs the decoder output and the updated decoder state. Here X is the decoder input and state contains the state information of the decoder. The function first passes the input X through the embedding layer and the positional encoding layer to obtain a representation that carries positional information. Then, for each layer of the decoder, X and state are fed into the corresponding Transformer decoder block, producing the block output X and the new state. During processing, the decoder self-attention weights and the encoder-decoder attention weights are recorded and stored in the first and second sublists of the _attention_weights list, respectively. Finally, X is fed into the fully connected layer to obtain the output of the decoder.

attention_weights property: returns the _attention_weights list, which holds the self-attention weights and the encoder-decoder attention weights of each decoder layer.

Overall, this code defines a Transformer decoder class and provides functions for initializing the state, forward propagation, and getting attention weights.
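
A quick shape check of the full decoder (a toy configuration reusing the encoder and valid_lens from the encoder check above): the last dimension of the output equals the vocabulary size.

decoder = TransformerDecoder(
    200, 24, 24, 24, 24, [100, 24], 24, 48, 8, 2, 0.5)
decoder.eval()
X = torch.ones((2, 100), dtype=torch.long)
state = decoder.init_state(encoder(X, valid_lens), valid_lens)
print(decoder(X, state)[0].shape)  # torch.Size([2, 100, 200])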

Practice: Machine Translation

data set

import os

#@save
d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')

#@save
def read_data_nmt():
    """载入“英语-法语”数据集"""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r',
             encoding='utf-8') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:75])

data preprocessing

#@save
def preprocess_nmt(text):
    """预处理“英语-法语”数据集"""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # 使用空格替换不间断空格
    # 使用小写字母替换大写字母
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # 在单词和标点符号之间插入空格
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)

text = preprocess_nmt(raw_text)
print(text[:80])
#@save
def tokenize_nmt(text, num_examples=None):
    """词元化“英语-法语”数据数据集"""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target

source, target = tokenize_nmt(text)
source[:]
#@save
def show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist):
    """绘制列表长度对的直方图"""
    d2l.set_figsize()
    _, _, patches = d2l.plt.hist(
        [[len(l) for l in xlist], [len(l) for l in ylist]])
    d2l.plt.xlabel(xlabel)
    d2l.plt.ylabel(ylabel)
    for patch in patches[1].patches:
        patch.set_hatch('/')
    d2l.plt.legend(legend)

show_list_len_pair_hist(['source', 'target'], '# tokens per sequence',
                        'count', source, target);
import collections
class Vocab:
    """Vocabulary for text."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        """Defined in :numref:`sec_text_preprocessing`"""
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort according to frequencies
        counter = count_corpus(tokens)
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                   reverse=True)
        # The index for the unknown token is 0
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}
        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    @property
    def unk(self):  # Index for the unknown token
        return 0

    @property
    def token_freqs(self):  # Token frequencies sorted in decreasing order
        return self._token_freqs

def count_corpus(tokens):
    """Count token frequencies.

    Defined in :numref:`sec_text_preprocessing`"""
    # Here `tokens` is a 1D list or 2D list
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flatten a list of token lists into a list of tokens
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)
#@save
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a text sequence"""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad

src_vocab = Vocab(source, min_freq=2,
                  reserved_tokens=['<pad>', '<bos>', '<eos>'])
truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>'])
#@save
def build_array_nmt(lines, vocab, num_steps):
    """将机器翻译的文本序列转换成小批量"""
    lines = [vocab[l] for l in lines]#token to id
    lines = [l + [vocab['<eos>']] for l in lines]# 加上eos代表结束
    array = torch.tensor([truncate_pad(
        l, num_steps, vocab['<pad>']) for l in lines])# 转换为数组
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)#有效长度
    return array, valid_len
#@save
from torch.utils import data
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator.

    Defined in :numref:`sec_linear_concise`"""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

def load_data_nmt(batch_size, num_steps, num_examples=600):
    """返回翻译数据集的迭代器和词表"""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab
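
A quick look at one minibatch produced by the functions above (batch_size=2 and num_steps=8 are illustration values; the actual training setup below uses different ones):

train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size=2, num_steps=8)
for X, X_valid_len, Y, Y_valid_len in train_iter:
    print('X:', X.type(torch.int32))
    print('valid lengths for X:', X_valid_len)
    print('Y:', Y.type(torch.int32))
    print('valid lengths for Y:', Y_valid_len)
    break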

Model definition

The complete model is assembled from the TransformerEncoder and TransformerDecoder defined above; the hyperparameters and the vocabularies src_vocab and tgt_vocab used below are created in the next section.

class EncoderDecoder(nn.Module):
    """The base class for the encoder-decoder architecture.

    Defined in :numref:`sec_encoder-decoder`"""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
encoder = TransformerEncoder(
    len(src_vocab), key_size, query_size, value_size, num_hiddens,
    norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
    num_layers, dropout)
decoder = TransformerDecoder(
    len(tgt_vocab), key_size, query_size, value_size, num_hiddens,
    norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,
    num_layers, dropout)
net = EncoderDecoder(encoder, decoder)

hyperparameter setting

num_hiddens, num_layers, dropout, batch_size, num_steps = 32, 2, 0.1, 64, 10
lr, num_epochs, device = 0.005, 200, d2l.try_gpu()
ffn_num_input, ffn_num_hiddens, num_heads = 32, 64, 4
key_size, query_size, value_size = 32, 32, 32
norm_shape = [32]

train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size, num_steps)

train

def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a model for sequence to sequence.

    Defined in :numref:`sec_seq2seq_decoder`"""
    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])
    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = d2l.MaskedSoftmaxCELoss()
    net.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
        for batch in data_iter:
            optimizer.zero_grad()
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                               device=device).reshape(-1, 1)
            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)
            l = loss(Y_hat, Y, Y_valid_len)
            l.sum().backward()  # Make the loss scalar for `backward`
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')

train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)

(output: the training loss curve recorded during training)

predict

def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """Predict for sequence to sequence.

    Defined in :numref:`sec_seq2seq_training`"""
    # Set `net` to eval mode for inference
    net.eval()
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # Add the batch axis
    enc_X = torch.unsqueeze(
        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # Add the batch axis
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # We use the token with the highest prediction likelihood as the input
        # of the decoder at the next time step
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        # Save attention weights (to be covered later)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # Once the end-of-sequence token is predicted, the generation of the
        # output sequence is complete
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, dec_attention_weight_seq = d2l.predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device, True)
    print(f'{eng} => {translation}, ',
          f'bleu {d2l.bleu(translation, fra, k=2):.3f}')

(output: the translated sentences and their BLEU scores)
From the results we can see that the quality of the machine translation is quite good.

attention visualization
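
The tensor enc_attention_weights used below must first be assembled from the attention weights recorded by the encoder during the last prediction; a minimal sketch following the d2l convention (using the num_layers, num_heads and num_steps defined above) is:

enc_attention_weights = torch.cat(net.encoder.attention_weights, 0).reshape(
    (num_layers, num_heads, -1, num_steps))
print(enc_attention_weights.shape)  # (num_layers, num_heads, num_steps, num_steps)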

d2l.show_heatmaps(
    enc_attention_weights.cpu(), xlabel='Key positions',
    ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)],
    figsize=(7, 3.5))

(figure: attention-weight heatmaps for the four encoder self-attention heads)
It can be seen that the model pays more attention to the first few tokens of a sequence.

Although the Transformer architecture is proposed for sequence-to-sequence learning, Transformer encoders or Transformer decoders are usually used alone in different deep learning tasks.
