The Annotated Transformer

The Transformer from “Attention is All You Need” has been on a lot of people’s minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly.

In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.

To follow along you will first need to install PyTorch. The complete notebook is also available on github or on Google Colab with free GPUs.

Prelims

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building blocks, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution.

Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step the model is auto-regressive (cite), consuming the previously generated symbols as additional input when generating the next.

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many 
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask,
                            tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)
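
As a quick sanity check (this snippet is an addition, not part of the original notebook), the Generator simply projects d_model-dimensional vectors to log-probabilities over the target vocabulary, so exponentiating any output row gives a distribution that sums to one:

gen = Generator(d_model=512, vocab=10)   # hypothetical toy vocabulary of 10 symbols
x = torch.randn(2, 5, 512)               # (batch, sequence length, d_model)
log_probs = gen(x)                       # (2, 5, 10) log-probabilities
print(log_probs.shape)                   # torch.Size([2, 5, 10])
print(log_probs[0, 0].exp().sum())       # ~1.0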

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
[Figure 1: The Transformer model architecture — encoder (left) and decoder (right)]

Encoder and Decoder Stacks

Encoder

The encoder is composed of a stack of N=6 identical layers.

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

We employ a residual connection (cite) around each of the two sub-layers, followed by layer normalization (cite).
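
The code in this excerpt references a LayerNorm module without defining it. Below is a minimal sketch matching the description here (normalize each position over its features, then apply a learned gain and bias); torch.nn.LayerNorm could be used instead:

class LayerNorm(nn.Module):
    "Construct a layernorm module (a minimal sketch of the missing definition)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # Normalize each position over its feature dimension, then rescale and shift.
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2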

That is, the output of each sub-layer is LayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

Decoder

The decoder is also composed of a stack of N=6 identical layers.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
 
    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0
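
As a quick illustration (added here, not part of the original excerpt), the mask for a length-3 target allows each position to attend only to itself and to earlier positions:

# On recent PyTorch versions this prints a (1, 3, 3) boolean tensor:
# [[[ True, False, False],
#   [ True,  True, False],
#   [ True,  True,  True]]]
print(subsequent_mask(3))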
