周三九's Paper Notes (1) -- Attention Is All You Need

First, the link to the original paper:

https://arxiv.org/pdf/1706.03762.pdf

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

The dominant sequence transduction models are mostly built from encoders and decoders based on complex recurrent or convolutional networks. The best-performing models additionally connect the encoder and decoder through an attention mechanism. The authors propose a new, simple architecture, the Transformer, which needs no recurrence or convolution and relies solely on attention. Experiments show the model is superior in quality, more parallelizable, and much faster to train. (The abstract then reports results on machine translation benchmarks, giving both absolute scores and relative gains, plus training cost and time.) The Transformer also generalizes well to other tasks and performs well with both large and limited training data.

Rough structure of the abstract: mainstream approaches in the field + an overview of the proposed model + the model's key properties + experimental evidence (absolute scores + relative gains + training cost) + an outlook for the model.

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_{t}, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

RNNs, in particular LSTMs and GRUs, have been established as the state-of-the-art approaches for sequence modeling and transduction tasks such as language modeling and machine translation.

Recurrent sequence models: poor parallelism and slow training; severe loss of information over long sequences. Despite mitigation work (factorization tricks, conditional computation), the fundamental problems remain.

Attention mechanisms play an important role in sequence modeling, but in almost all existing cases they are used in conjunction with an RNN.

This work relies solely on attention; its advantages are... (much the same points as in the abstract).

The pattern of the introduction: the current state of research, what contributions prior work has made, what shortcomings remain, and how the proposed model manages to make up for those shortcomings.

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34]. To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

Some researchers have replaced RNNs with CNNs to improve parallelism, but this makes it harder to represent relations between distant positions (long sequences become difficult to model). In the Transformer, such relations are captured with a constant number of operations; although this averaging reduces the effective resolution of attention, Multi-Head Attention (analogous to the multiple output channels of a CNN) counteracts the loss.

Self-attention has already been used successfully in a variety of NLP tasks.

End-to-end memory networks are based on a recurrent attention mechanism... (I don't fully understand this, and it is not essential here.)

To the authors' knowledge, the Transformer is the first... (the claims from the abstract restated).

What goes into a background section: papers related to my research area or to the techniques I use + how they connect to my work + how my work differs from them.

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

The encoder takes an input sequence X and produces a corresponding sequence of representations Z; given Z, the decoder outputs its result one element at a time. (The input and output sequences may have different lengths.) The model is auto-regressive: the output produced at each step is also fed back as input for the next step.

The Transformer's main building blocks are self-attention layers and point-wise (per-position) fully connected layers.
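A rough sketch of the auto-regressive generation described above (model.encode / model.decode, bos_id and eos_id are hypothetical placeholders, not the paper's API; greedy decoding is used here for simplicity, whereas the paper reports beam search):

```python
def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Auto-regressive generation: each new token is conditioned on all previous outputs."""
    z = model.encode(src)                    # (x_1, ..., x_n) -> z = (z_1, ..., z_n)
    ys = [bos_id]                            # start-of-sequence token
    for _ in range(max_len):
        logits = model.decode(z, ys)         # predictions given z and the tokens generated so far
        next_id = int(logits[-1].argmax())   # greedily pick the most likely next token
        ys.append(next_id)                   # this output becomes part of the next step's input
        if next_id == eos_id:
            break
    return ys
```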

Overall architecture diagram: see Figure 1 in the paper.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_{model} = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Encoder: a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization. (LayerNorm normalizes over the features of each individual sample, whereas BatchNorm normalizes each feature across the batch; LayerNorm is used here.) Every sub-layer, as well as the embedding layers, produces an output of dimension d_{model} = 512 for each token.

Decoder: compared with an encoder layer, each decoder layer has one extra sub-layer, which takes the encoder's output as part of its input (so it is no longer pure self-attention). In addition, the first sub-layer, the self-attention layer, is modified with masking to guarantee that when predicting the token at position i, the model cannot see any information from tokens after position i.
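A minimal PyTorch sketch of the residual-plus-LayerNorm wrapper shared by all of these sub-layers (the class name is my own; the dropout on the sub-layer output follows the paper's Section 5.4, which applies dropout before the residual addition and normalization):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Implements LayerNorm(x + Sublayer(x)) around a given sub-layer."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # sublayer is a callable, e.g. a self-attention or feed-forward module
        return self.norm(x + self.dropout(sublayer(x)))
```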

 3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

An attention function maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, and the weight given to each value comes from the similarity (compatibility) between the query and that value's key.

3.2.1 Scaled Dot-Product Attention

d_{k}: the dimension of queries and keys; they share the same dimension because their similarity is computed with a dot product
d_{v}: the dimension of the values in the key-value pairs
Q: the query matrix
K: the key matrix
V: the value matrix

The formula is:

Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V

Here softmax serves as the normalization function. The distinctive detail of this attention mechanism is the scaling by \sqrt{d_{k}}: as the dimension d_{k} grows, the dot products grow in magnitude and spread out more, so feeding them directly into softmax pushes the resulting weight distribution toward the extremes, which makes the gradients very small and hurts training. Dividing by \sqrt{d_{k}} counteracts this effect.
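A minimal PyTorch sketch of the formula above (the function name and the optional mask argument, which anticipates the masked attention of Section 3.2.3, are my own):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5          # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)                    # weights over the keys sum to 1
    return weights @ v                                     # weighted sum of the values
```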

3.2.2 Multi-Head Attention

Because a single scaled dot-product attention gives the model only one chance to learn a feature representation (one dot product produces one set of weights), the paper introduces another of its key ideas: multi-head attention. Multi-head attention projects the matrices h times with learned projection matrices W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}, mapping the original embeddings into lower-dimensional vectors; each projection goes through the attention layer above, and the results are concatenated at the end. This effectively gives the model h chances to learn, so it can pick up more kinds of features across the whole sequence.

W_{i}^{Q}: the query projection matrix, of shape d_{model} \times d_{k}
W_{i}^{K}: the key projection matrix, of shape d_{model} \times d_{k}
W_{i}^{V}: the value projection matrix, of shape d_{model} \times d_{v}

MultiHead(Q,K,V)=Concat(head_{1},...,head_{h})W^{O}

head_{i}=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})

For convenience of computation, d_{k} and d_{v} are given the same value. Because the outputs of the h heads are concatenated before the final linear layer, the paper sets d_{q}=d_{v}=d_{k}=d_{model}/h (with h = 8, each head works in 64 dimensions).
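A minimal PyTorch sketch of multi-head attention, reusing the scaled_dot_product_attention helper from the sketch above (the class layout is my own; a single d_model × d_model linear layer per Q/K/V is equivalent to h separate d_model × d_k projections):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # stacks the h matrices W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # stacks the h matrices W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks the h matrices W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)
        def split(x, w):  # (batch, seq, d_model) -> (batch, h, seq, d_k)
            return w(x).view(batch, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out = scaled_dot_product_attention(q, k, v, mask)   # run all heads in parallel
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(out)                                # Concat(head_1, ..., head_h) W^O
```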

3.2.3 Applications of Attention in our Model

At this point, refer back to the architecture diagram from earlier. Attention is used in three places:

1. Multi-head self-attention in the encoder: the Q, K, and V matrices all come from the encoder's input sequence (the Input Embedding).

2. Masked self-attention in the decoder: the Q, K, and V matrices all come from the decoder's input sequence (the Output Embedding). Because the prediction for the token at position i must not see information from tokens after position i, when computing the attention output for position i the weights for all later positions are set to -INF, which become 0 after the softmax (see the mask sketch after this list).

3. Encoder-decoder multi-head attention: the K and V matrices come from the encoder's output, while the Q matrix comes from the output of the decoder's first (masked) sub-layer described in item 2.
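A small sketch of the causal mask mentioned in item 2 (the helper name causal_mask is my own; it plugs into the mask argument of the attention sketches above, where masked positions are filled with -inf before the softmax):

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```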

 3.3 Position-wise Feed-Forward Networks

The network proposed here is simply a two-layer fully connected feed-forward network applied to each position of the sequence separately, with ReLU as the activation function. The hidden layer's output dimension is four times d_{model} (i.e. 2048).

FFN(x)=max(0,xW_{1}+b_{1})W_{2}+b_{2}
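A minimal PyTorch sketch of this position-wise feed-forward network (the class name is my own; nn.Linear applied to a (batch, seq_len, d_model) tensor already acts independently on every position):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # x W_1 + b_1
            nn.ReLU(),                  # max(0, .)
            nn.Linear(d_ff, d_model),   # (.) W_2 + b_2
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)
```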

 3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_{model}. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by \sqrt{d_{model}}.

Learned embeddings convert the tokens in a sequence into vectors of dimension d_{model}, and a learned linear transformation followed by softmax converts the decoder output into next-token probabilities. The model shares the same weight matrix between the two embedding layers and the pre-softmax linear layer. In the embedding layers, the weights are multiplied by \sqrt{d_{model}}.
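A small sketch of the embedding scaling (the class name is my own; the trailing comment only hints at where the weight sharing with the pre-softmax linear layer would happen):

```python
import math
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Token embedding whose output is multiplied by sqrt(d_model)."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)

    def forward(self, tokens):
        return self.emb(tokens) * self.scale

# Weight sharing: the pre-softmax linear layer can reuse the same matrix,
# e.g. logits = decoder_output @ shared_embedding.emb.weight.T
```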

 3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_{model} as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

Because the model uses neither an RNN nor a CNN, positional encodings are introduced so that the sequence order is available during training. The positional encoding becomes part of the input: it is added to the initial embedding (the two have the same dimension d_{model}) before entering the encoder or decoder.

PE_{(pos,2i)}=sin(pos/10000^{2i/d_{model}})

PE_{(pos,2i+1)}=cos(pos/10000^{2i/d_{model}})

Here, pos is the position of the token within the sequence and i indexes the dimension of the embedding.
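A minimal sketch of these sinusoidal encodings (the function name is my own; even dimensions use sin, odd dimensions use cos):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    """Returns a (max_len, d_model) matrix of positional encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)     # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)          # 0, 2, 4, ...
    div = torch.exp(-math.log(10000.0) * two_i / d_model)             # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                                # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos * div)                                # PE(pos, 2i+1)
    return pe

# Added to the (scaled) token embeddings before the first encoder/decoder layer.
```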

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x_1, ..., x_n) to another sequence of equal length (z_1, ..., z_n), with x_i, z_i ∈ R^d, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

This section compares self-attention layers with the recurrent and convolutional layers commonly used to map one variable-length sequence of representations to another of equal length, e.g. a hidden layer in a typical encoder or decoder. Three criteria motivate the use of self-attention.

First, the total computational complexity per layer. Second, the degree of parallelism, measured by the minimum number of sequential operations required to process the sequence.

Third, the maximum path length a signal must traverse between any pair of distant tokens whose dependency has to be learned. Learning long-range dependencies has always been a key challenge in sequence transduction, and path length is one of the key factors: the shorter the path, the easier it is to learn the dependency.

In short, this part argues for the Transformer's advantages by comparing it with traditional RNNs and CNNs along these three dimensions.

5 Training

This section describes the size of the training datasets, the hardware and training time required, the choice of optimizer, and the regularization used.

The regularization techniques were concepts I encountered here for the first time.

Dropout is a mechanism for preventing overfitting. It works by modifying the network itself: during training, a fraction P_{drop} of the hidden units is randomly "dropped", i.e. treated as if they did not exist.
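A quick usage sketch (the base Transformer uses P_{drop} = 0.1; the tensor shape here is only an example):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)            # P_drop = 0.1 in the base Transformer
hidden = torch.randn(2, 4, 512)     # e.g. a batch of sub-layer outputs
out = drop(hidden)                  # training: each unit zeroed with prob. 0.1, the rest scaled by 1/(1-0.1)
drop.eval()                         # evaluation: dropout becomes a no-op
```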

With plain cross-entropy on hard labels, only the position of the correct label contributes to the loss while the positions of the wrong labels are ignored, which pushes the model toward overconfident predictions. Label smoothing mitigates this, with \epsilon as the smoothing factor.

y'=(1-\epsilon)y+\epsilon u

Here u is an artificially introduced fixed distribution (it can be viewed as noise drawn from a fixed distribution, typically uniform over the labels), and \epsilon controls its weight.
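A small sketch of this smoothing formula, assuming u is the uniform distribution over the label classes (the function name and default eps are illustrative; the paper uses ε_ls = 0.1):

```python
import torch
import torch.nn.functional as F

def smooth_labels(labels, num_classes, eps=0.1):
    """y' = (1 - eps) * y + eps * u, with u uniform over the classes."""
    y = F.one_hot(labels, num_classes).float()          # hard one-hot targets
    u = torch.full_like(y, 1.0 / num_classes)           # uniform "noise" distribution
    return (1.0 - eps) * y + eps * u

print(smooth_labels(torch.tensor([2]), num_classes=5))
# tensor([[0.0200, 0.0200, 0.9200, 0.0200, 0.0200]])
```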

6 Results

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.

In this work, the authors present the Transformer, the first sequence transduction model based entirely on attention, which replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-head self-attention.

For translation tasks, the Transformer trains significantly faster than recurrent or convolutional architectures and reaches a new state of the art.

The authors are very optimistic about the future of this model and plan to apply it to other tasks and modalities beyond text, including images, audio, and video.

The code is open source.

Acknowledgements.

The end!
