[Original] Implementing the Encoder-Decoder of the Transformer Model Behind ChatGPT

Author: Night Passerby

Time: July 2023

Transformer Block (generic block) implementation

Looking at the overall architecture diagram above, we can clearly see that the Encoder is built from several major components. The core function of each layer is as follows (a minimal sketch of how they are wired together is given after this list):

  • Multi-head self-attention (attention mechanism layer): computes several attention functions in parallel and concatenates their results, which improves the expressive power of the model; it mainly captures the correlations between words, including long-distance dependencies.
  • Normalization layer: normalizes each hidden unit so that its features have zero mean and unit variance, which mitigates exploding and vanishing gradients. Normalization compresses the data into a reasonable range, avoids extremely large or small values, helps training, improves the generalization ability of the model, speeds up convergence, and reduces sensitivity to how the parameters are initialized.
  • Feed-forward network: applies a further transformation to the attention output.
  • Another normalization layer: a second layer normalization applied around the feed-forward sub-layer, again aimed at stabilizing activations and mitigating vanishing gradients.
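To make the wiring concrete, here is a minimal sketch (not part of the original post's code; the names are illustrative) of how one pre-norm Transformer block combines these pieces, which is also the arrangement the EncoderLayer further below uses:

# Illustrative sketch only: shows how one pre-norm block combines the sub-layers.
# attention_layer, norm_1, norm_2, feed_forward and dropout are assumed to be the
# modules implemented in the sections below.
def transformer_block(x, attention_layer, norm_1, norm_2, feed_forward, dropout):
    # 1) self-attention sub-layer with a residual connection around it
    a = norm_1(x)                                # normalize first (pre-norm variant)
    x = x + dropout(attention_layer(a, a, a))    # add the attention output back to the input
    # 2) feed-forward sub-layer with a second residual connection
    f = norm_2(x)
    x = x + dropout(feed_forward(f))
    return x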

Attention implementation

Drawing on other references, we can see that the self-attention layer is the core of this block. Let's peel back the layers one by one and see how this core layer should be implemented.

Let's look directly at the computation process of attention:

The approximate process represented by this figure is:

For each token, first generate three vectors: query, key, and value.

The query vector is analogous to a question. A token asks: "How relevant is each of the other tokens to me?"

The key vector is analogous to an index. A token says: "I have compressed my answer to every possible query and stored it in my key."

The value vector is analogous to the answer. A token says: "I have distilled another layer of the information I carry and stored it in my value."
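Putting the three vectors together, the scaled dot-product attention that the `attention` function below implements can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $$d_k$$ is the dimension of the key vectors; dividing by $$\sqrt{d_k}$$ keeps the dot products in a reasonable range before the softmax.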

Attention calculation code:

import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


def attention(query: Tensor,
              key: Tensor,
              value: Tensor,
              mask: Optional[Tensor] = None,
              dropout: Optional[nn.Dropout] = None):
    """
    Defines how the attention scores are computed.
    Args:
        query: shape (batch_size, num_heads, seq_len, k_dim)
        key: shape (batch_size, num_heads, seq_len, k_dim)
        value: shape (batch_size, num_heads, seq_len, v_dim)
        mask: shape (batch_size, num_heads, seq_len, seq_len). Under our assumption, the shape here is
              (1, 1, seq_len, seq_len).
    Returns:
        out: shape (batch_size, num_heads, seq_len, v_dim), the output of the attention heads.
        attention_score: shape (batch_size, num_heads, seq_len, seq_len), the attention scores.
    """
    k_dim = query.size(-1)

    # shape (seq_len, seq_len); each row holds one token's attention scores over all tokens
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(k_dim)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e10)

    attention_score = F.softmax(scores, dim=-1)

    if dropout is not None:
        attention_score = dropout(attention_score)

    out = torch.matmul(attention_score, value)

    return out, attention_score  # shapes: (..., seq_len, v_dim), (..., seq_len, seq_len)
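As a quick sanity check (the tensor sizes below are arbitrary, chosen only for illustration), the function can be called on small random tensors:

# Illustrative usage: batch_size=1, num_heads=2, seq_len=4, k_dim=8 are arbitrary.
q = torch.rand(1, 2, 4, 8)
k = torch.rand(1, 2, 4, 8)
v = torch.rand(1, 2, 4, 8)
causal_mask = torch.tril(torch.ones(1, 1, 4, 4))   # lower-triangular mask, as a decoder would use
out, score = attention(q, k, v, mask=causal_mask)
print(out.shape)    # torch.Size([1, 2, 4, 8])
print(score.shape)  # torch.Size([1, 2, 4, 4])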

Take the token a2 in the figure as an example:

It generates a query, and this query is combined in a "certain way" (the scaled dot product above) with the key of every token. The result is called the attention score (that is, $$\alpha $$ in the figure), and in this example four attention scores are obtained. (Attention scores are also called attention weights.)

Multiplying these four scores by each token's value gives four vectors carrying the extracted information.

Adding these four vectors together gives b2, the final result of a2 passing through the attention layer (a toy numeric version of this is sketched below).
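As a toy numeric version of this (all numbers are made up, purely for illustration), the four attention scores are just a softmax over the four scaled dot products, and b2 is the score-weighted sum of the four value vectors:

import torch
import torch.nn.functional as F

# Made-up 4-token example with 2-dimensional q/k/v vectors.
q2 = torch.tensor([1.0, 0.0])                        # query of token a2
keys = torch.tensor([[1.0, 1.0], [2.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
values = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])

scores = keys @ q2 / (2 ** 0.5)                      # four raw attention scores (the alphas)
weights = F.softmax(scores, dim=-1)                  # normalized attention weights
b2 = (weights.unsqueeze(-1) * values).sum(dim=0)     # weighted sum of the values -> b2
print(weights, b2)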

For the whole multi-head attention layer, a simple implementation of this logic in code is:

class MultiHeadedAttention(nn.Module):
    def __init__(self,
                 num_heads: int,
                 d_model: int,
                 dropout: float = 0.1):
        super(MultiHeadedAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        # Assume v_dim is always equal to k_dim
        self.k_dim = d_model // num_heads
        self.num_heads = num_heads
        self.proj_weights = clones(
            nn.Linear(d_model, d_model), 4)  # W^Q, W^K, W^V, W^O
        self.attention_score = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self,
                query: Tensor,
                key: Tensor,
                value: Tensor,
                mask: Optional[Tensor] = None):
        """
        参数:
            query: shape (batch_size, seq_len, d_model)
            key: shape (batch_size, seq_len, d_model)
            value: shape (batch_size, seq_len, d_model)
            mask: shape (batch_size, seq_len, seq_len). 由于我们假设所有数据都使用相同的掩码,因此这里的形状也等于(1,seq_len,seq-len)

        返回:
            out: shape (batch_size, seq_len, d_model). 多头注意力层的输出
        """
        if mask is not None:
            mask = mask.unsqueeze(1)
        batch_size = query.size(0)

        # 1) Apply W^Q, W^K, W^V to generate the new query, key, value
        query, key, value \
            = [proj_weight(x).view(batch_size, -1, self.num_heads, self.k_dim).transpose(1, 2)
                for proj_weight, x in zip(self.proj_weights, [query, key, value])]  # -1 equals to seq_len

        # 2) Compute the attention scores and the output
        out, self.attention_score = attention(query, key, value, mask=mask,
                                              dropout=self.dropout)

        # 3) "Concat" 输出
        out = out.transpose(1, 2).contiguous() \
            .view(batch_size, -1, self.num_heads * self.k_dim)

        # 4) Apply W^O to obtain the final output
        out = self.proj_weights[-1](out)

        return out
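The `clones` helper used above is not shown in this post; a minimal sketch following the usual deep-copy pattern could look like this (the `get_clones` helper used by the Encoder and Decoder later is assumed to be the same function under another name):

import copy
import torch.nn as nn

def clones(module: nn.Module, n: int) -> nn.ModuleList:
    """Produce n independent deep copies of a module."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])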

Norm (normalization layer) implementation

# Normalization layer: implements the standard layer-normalization formula
class NormLayer(nn.Module):
    def __init__(self, features, eps=1e-6):
        super(NormLayer, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
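For reference, PyTorch also ships a built-in layer normalization, so the hand-written layer can be cross-checked against `nn.LayerNorm` (the sizes below are arbitrary; the two layers differ slightly in how eps is applied, so outputs are close but not identical):

x = torch.rand(2, 5, 512)            # (batch_size, seq_len, d_model), arbitrary sizes
norm = NormLayer(512)
print(norm(x).shape)                 # torch.Size([2, 5, 512])
print(nn.LayerNorm(512)(x).shape)    # torch.Size([2, 5, 512])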

Feedforward Neural Network Implementation

class FeedForward(nn.Module):

    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()

        # d_ff defaults to 2048
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x
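A quick shape check (the sizes are again arbitrary) confirms that the feed-forward block maps back to the model dimension:

ff = FeedForward(d_model=512)
x = torch.rand(2, 5, 512)   # (batch_size, seq_len, d_model)
print(ff(x).shape)          # torch.Size([2, 5, 512])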

Encoder (encoder layer) implementation

The Encoder assembles and stacks the components introduced above, converting the source sequence into an intermediate representation.


class EncoderLayer(nn.Module):

    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = NormLayer(d_model)
        self.norm_2 = NormLayer(d_model)
        self.attn = MultiHeadedAttention(heads, d_model, dropout=dropout)
        self.ff = FeedForward(d_model, dropout=dropout)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)

    def forward(self, x, mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x

class Encoder(nn.Module):

    def __init__(self, vocab_size, d_model, N, heads, dropout):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model, dropout=dropout)
        self.layers = get_clones(EncoderLayer(d_model, heads, dropout), N)
        self.norm = NormLayer(d_model)

    def forward(self, src, mask):
        x = self.embed(src)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, mask)
        return self.norm(x)
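The `Embedder`, `PositionalEncoder` and `get_clones` helpers are not shown in this post either; assuming the imports introduced with the attention function, a minimal sketch under the usual assumptions (scaled token embeddings plus fixed sinusoidal positional encodings) could be:

get_clones = clones  # assumed to be the same deep-copy helper sketched earlier

class Embedder(nn.Module):
    """Token embedding scaled by sqrt(d_model)."""
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embed(x) * math.sqrt(self.d_model)

class PositionalEncoder(nn.Module):
    """Adds fixed sinusoidal positional encodings, then applies dropout."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # shape (1, max_len, d_model)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)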

Decoder (decoder layer) implementation

The Decoder part is very similar to the Encoder part; it mainly converts the intermediate representation generated by the Encoder into the target sequence.

class DecoderLayer(nn.Module):

    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = NormLayer(d_model)
        self.norm_2 = NormLayer(d_model)
        self.norm_3 = NormLayer(d_model)

        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)

        self.attn_1 = MultiHeadedAttention(heads, d_model, dropout=dropout)
        self.attn_2 = MultiHeadedAttention(heads, d_model, dropout=dropout)
        self.ff = FeedForward(d_model, dropout=dropout)

    def forward(self, x, e_outputs, src_mask, trg_mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs,
                                           src_mask))
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x

class Decoder(nn.Module):

    def __init__(self, vocab_size, d_model, N, heads, dropout):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model, dropout=dropout)
        self.layers = get_clones(DecoderLayer(d_model, heads, dropout), N)
        self.norm = NormLayer(d_model)

    def forward(self, trg, e_outputs, src_mask, trg_mask):
        x = self.embed(trg)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, e_outputs, src_mask, trg_mask)
        return self.norm(x)
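The `trg_mask` passed into the Decoder combines padding masking with a look-ahead (subsequent-position) mask, so that each target position can only attend to earlier positions; a minimal way to build both masks (the helper name and the pad_token value are illustrative) is:

def create_masks(src, trg, pad_token=0):
    # Source mask: 1 where the token is real, 0 where it is padding.
    src_mask = (src != pad_token).unsqueeze(-2)              # (batch_size, 1, src_len)

    # Target mask: padding mask AND a lower-triangular "no peeking ahead" mask.
    trg_pad_mask = (trg != pad_token).unsqueeze(-2)          # (batch_size, 1, trg_len)
    trg_len = trg.size(1)
    look_ahead = torch.tril(torch.ones(1, trg_len, trg_len)).bool()
    trg_mask = trg_pad_mask & look_ahead                     # (batch_size, trg_len, trg_len)
    return src_mask, trg_mask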

Transformer implementation

Combining the whole pipeline, including the Encoder and Decoder, finally gives a basic MVP implementation of the Transformer framework.

class Transformer(nn.Module):

    def __init__(self, src_vocab, trg_vocab, d_model, N, heads, dropout):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads, dropout)
        self.decoder = Decoder(trg_vocab, d_model, N, heads, dropout)
        self.out = nn.Linear(d_model, trg_vocab)

    def forward(self, src, trg, src_mask, trg_mask):
        e_outputs = self.encoder(src, src_mask)
        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)
        output = self.out(d_output)
        return output
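A dummy forward pass (all hyperparameters and token ids below are made up) ties the pieces together:

# Illustrative end-to-end check using the create_masks helper sketched above.
model = Transformer(src_vocab=1000, trg_vocab=1000, d_model=512, N=6, heads=8, dropout=0.1)

src = torch.randint(1, 1000, (2, 10))    # (batch_size, src_len) of random token ids
trg = torch.randint(1, 1000, (2, 9))     # (batch_size, trg_len)
src_mask, trg_mask = create_masks(src, trg)

logits = model(src, trg, src_mask, trg_mask)
print(logits.shape)                      # torch.Size([2, 9, 1000]): one score per target-vocab token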

Code description

If you want to read through the complete code, visit the black-transformer project on GitHub:

GitHub - heiyeluren/black-transformer: black-transformer is a lightweight simulation of the outline code of the Transformer model implementation, used to understand the entire Transformer working mechanism

What replaces you is not AI, but someone who knows AI better than you and can use AI better!

##End##


Reprinted from: blog.csdn.net/heiyeshuwu/article/details/131771173