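In June 2017, Google published the paper "Attention Is All You Need" on arXiv. It proposes the Transformer model for sequence-to-sequence problems, replacing LSTMs with a structure built entirely on attention. It abandons the established pattern in which encoder-decoder models had to be combined with a CNN or an RNN and uses attention alone, which is simplicity at its best. The paper's main goal is to reduce computation and improve parallelism without hurting the final results. Its key contributions are two attention mechanisms, called Scaled Dot-Product Attention and Multi-Head Attention.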

Transformer model architecture

The model architecture is shown in the figure below:

Like most seq2seq models, the Transformer consists of an encoder and a decoder.

Encoder

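The encoder is a stack of N identical layers. A layer is the unit on the left side of the figure above, marked "Nx" at its far left; the paper uses N = 6. Each layer consists of two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization, so the output of a sub-layer can be written as: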

$sub\_layer\_output = LayerNorm(x+(SubLayer(x))) \\$
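The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the learned gain and bias of layer normalization are omitted, `eps` is an assumed stabilizing constant, and the linear map standing in for the sub-layer is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean, unit variance.

    Simplification: the learned gain and bias are left out.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + SubLayer(x)) for a shape-preserving sub-layer."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 512))             # 5 positions, d_model = 512
W = rng.normal(size=(512, 512)) * 0.02    # hypothetical stand-in sub-layer
out = residual_sublayer(x, lambda h: h @ W)
print(out.shape)
```

Because the residual branch adds `x` back, the sub-layer must return a tensor of the same shape as its input, which is why every sub-layer in the model outputs `d_model` dimensions.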

The two sub-layers, the multi-head self-attention mechanism and the fully connected feed-forward network, are explained below.

Multi-head self-attention

Attention can be written in the following form [Deep Learning: Attention Model: The Essence of the Attention Mechanism]:

$attention\_output = Attention(Q, K, V) \\$

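Note: Relating this to earlier models, Q is the decoder hidden state (e.g. $z_0, z_1$, or $s_i$), K is the encoder hidden states (e.g. $h_1, h_2, h_3, h_4$), and V is also the encoder hidden states. The encoder has no prediction-side input, so it uses self-attention: Q, K and V are identical and all come from the encoder's hidden states.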

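Multi-head attention projects Q, K and V with h different linear transformations, runs attention on each projection, and finally concatenates the results (much like the idea of using multiple filters in a CNN):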

$MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O \\$

$head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) \\$

The paper computes attention with the scaled dot-product, i.e.:

$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V \\$
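As a rough illustration, the formula above maps directly onto a few NumPy operations. The helper names and the toy shapes (3 queries, 4 key/value pairs, $d_k = 64$) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))   # 3 queries
K = rng.normal(size=(4, 64))   # 4 keys
V = rng.normal(size=(4, 64))   # 4 values, one per key
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each output row is a weighted average of the rows of V, with the weights given by how strongly the corresponding query matches each key.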

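The authors also mention additive attention, another method with similar theoretical complexity. For small $d_k$ the two perform similarly; for large $d_k$, additive attention outperforms dot-product attention if the dot products are not scaled. The authors attribute this to large dot products pushing the softmax into regions where its gradients are extremely small, and dividing by $\sqrt{d_k}$ counteracts the effect. Compared with additive attention, dot-product attention is faster and more space-efficient in practice.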
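A small numerical check may help show why the scaling matters. It assumes, as the paper's argument does, that the components of q and k are independent with mean 0 and variance 1; the sample size and seed are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10000, d_k))   # 10000 random query vectors
k = rng.normal(size=(10000, d_k))   # 10000 random key vectors

dots = (q * k).sum(axis=1)          # one dot product per (q, k) pair
unscaled_std = dots.std()           # grows like sqrt(d_k)
scaled_std = (dots / np.sqrt(d_k)).std()   # stays close to 1

print(unscaled_std, scaled_std)
```

Without scaling, the scores fed into the softmax have a standard deviation of roughly $\sqrt{d_k}$, so one score tends to dominate and the softmax saturates; after dividing by $\sqrt{d_k}$ the scores stay on the order of 1 regardless of $d_k$.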

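Note: an intuition for why scaled dot-product attention works. The core idea of the Transformer's attention mechanism is to compute, for every word in a sentence, its relationship to all words in that sentence, and to treat these pairwise relationships as reflecting, to some extent, how related and how important the words are to one another. The relationships are then used as weights to recombine the word representations, giving each word a new representation. This new representation captures not only the word itself but also its relationships to the other words, so it is a more global representation than a plain word vector.

The Transformer produces the final text representation by repeatedly stacking such attention layers, alternating them with ordinary non-linear layers.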
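The MultiHead and $head_i$ formulas given earlier can be sketched as follows. This is an illustrative version under stated assumptions: the weights are random rather than trained, the per-head projections are stored as stacked arrays (a layout chosen for this example), and the sizes h = 8 and d_model = 512 follow the paper's base configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    heads = [
        attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])   # head_i
        for i in range(W_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_o          # Concat(...) W^O

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
d_k = d_model // h                                       # 64 per head
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.02 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model)) * 0.02

y = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)    # self-attention: Q = K = V
print(y.shape)
```

Because each head works in a reduced dimension of d_model / h, the total cost is similar to that of a single head with full dimensionality.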

Position-wise feed-forward networks

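Position-wise feed-forward network: a variant of an MLP. The second sub-layer is a fully connected network. It is called position-wise because the same network is applied separately to the attention output at each position i. It uses two dense layers with a ReLU activation between them, and can be viewed as two 1D convolutions with kernel size 1. The hidden size changes as 512 -> 2048 -> 512.
In other words, the position-wise feed-forward network is simply an MLP. Each $d_{model}$-dimensional vector $x$ in the output of the first sub-layer is first mapped by $xW_1+b_1$ to a $d_{ff}$-dimensional vector $x'$, and then $max(0, x')W_2+b_2$ maps it back to $d_{model}$ dimensions. A residual connection follows. The output size is still $[sequence\_length, d_{model}]$.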
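A minimal NumPy sketch of this sub-layer, using the 512 -> 2048 -> 512 sizes mentioned above. The random initialization and the sequence length of 10 are assumptions for the example.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # (seq_len, d_ff), ReLU
    return hidden @ W2 + b2                 # (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 10, 512, 2048
x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

y = position_wise_ffn(x, W1, b1, W2, b2)
# Running a single position on its own gives the same result for that row.
y3 = position_wise_ffn(x[3:4], W1, b1, W2, b2)[0]
print(y.shape, np.allclose(y[3], y3))
```

The second call illustrates the "position-wise" property: no information flows between positions inside this sub-layer, so each row can be processed independently.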

Decoder

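The decoder has a structure similar to the encoder, but with one additional attention sub-layer.

With reference to the figure above, first clarify the decoder's input, output and decoding process: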

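Output: the probability distribution over output words for position i
Input: the encoder output and the decoder output up to position i-1. The middle attention is therefore not self-attention: its K and V come from the encoder, and its Q comes from the decoder side, which is driven by the decoder output at the previous positions
Decoding: encoding can be computed in parallel, producing all encoder outputs at once, but decoding does not produce the whole sequence in one pass. At inference time it generates one position at a time, like an RNN, because the output at the previous position is needed as input to form the attention query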
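The step-by-step decoding described above can be sketched as a greedy loop. Everything here is a toy assumption made for illustration: `toy_decoder_step` is a hypothetical stand-in for a real decoder (it simply copies the source tokens), `memory` stands in for the encoder output, and the `BOS`/`EOS` ids are arbitrary.

```python
import numpy as np

BOS, EOS = 0, 1   # hypothetical start / end token ids

def toy_decoder_step(memory, prefix, vocab_size=10):
    """Hypothetical stand-in for a decoder: logits for the next token.

    It "attends" to memory by copying the source token at the current step,
    then emits EOS once the source is exhausted.
    """
    t = len(prefix) - 1                      # tokens generated so far
    target = memory[t] if t < len(memory) else EOS
    logits = np.full(vocab_size, -1.0)
    logits[target] = 1.0
    return logits

def greedy_decode(memory, max_len=20):
    """Generate one token at a time, feeding each output back as input."""
    prefix = [BOS]
    while len(prefix) < max_len:
        logits = toy_decoder_step(memory, prefix)
        token = int(np.argmax(logits))       # greedy choice
        prefix.append(token)
        if token == EOS:
            break
    return prefix[1:]

memory = [5, 7, 3]          # toy "encoder output": here just source token ids
decoded = greedy_decode(memory)
print(decoded)              # [5, 7, 3, 1]
```

The encoder side (`memory`) is computed once, while the loop body runs once per generated token, which is the sequential part of inference.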

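The first multi-head attention sub-layer in the decoder needs masking so that the prediction at position i depends only on outputs at positions before i. During training the decoder is fed the ground-truth output sequence, so the mask ensures that predicting position i never touches future information. (For comparison, in BERT's masked LM described later, the input is fed through a deep Transformer encoder and the final hidden states corresponding to the masked positions are used to predict the masked words, just as when training a language model.)

The principle of masked attention is shown in the figure (multi-head attention, an extension of dot-product attention, is also shown):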
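A minimal sketch of the masking idea, assuming the decoder input has already been shifted right by one position (as in the paper), so that letting row i attend to columns up to and including i exposes only outputs before position i. Setting masked scores to negative infinity before the softmax is one common way to implement this.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # block future positions
    weights = softmax(scores, axis=-1)                  # exp(-inf) = 0
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))
out, weights = masked_self_attention(x, x, x)
print(np.round(weights, 2))   # lower-triangular weight matrix
```

With the mask in place, the whole target sequence can be processed in one pass during training while each position still sees only its past.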

Positional Encoding

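Besides the main encoder and decoder, there is also an input preprocessing step. The Transformer drops the RNN, yet the RNN's greatest strength is modeling order along the time axis. The authors therefore consider two positional encoding methods; the positional encoding is summed with the input embedding to inject relative position information.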

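The two positional encoding methods are:

Compute it directly with sine and cosine functions of different frequencies
Learn a positional embedding (see reference)

Experiments showed that the two give nearly identical results, so the first method was chosen. The formulas are: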

$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) \\$

$PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) \\$
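The two formulas above can be computed for all positions at once. This sketch assumes an even `d_model`; the function name and the sizes used in the example are choices made here for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding of shape (max_len, d_model).

    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # values of 2i: 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)

# The encoding is simply added to the (here random, stand-in) embeddings.
embeddings = np.random.default_rng(0).normal(size=(50, 512))
model_input = embeddings + pe
```

Since the table depends only on position and dimension index, it can be precomputed once and reused for every input.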

Method 1 has two advantages:

1. For any fixed offset k, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, because trigonometric functions satisfy:

$cos(\alpha+\beta) = cos(\alpha)cos(\beta)-sin(\alpha)sin(\beta) \\$

$sin(\alpha+\beta) = sin(\alpha)cos(\beta) + cos(\alpha)sin(\beta) \\$

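2. A learned positional embedding may be limited in the same way a word-embedding table is limited by vocabulary size: it can only learn representations such as "position 2 corresponds to the vector (1,1,1,2)". The trigonometric formula is not limited by sequence length, so it can represent sequences longer than any encountered during training.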
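To make the first advantage concrete, here is a short derivation from the angle-addition identities. The shorthand $\omega_i = 1/10000^{2i/d_{model}}$ is introduced only for this sketch. Setting $\alpha = \omega_i \cdot pos$ and $\beta = \omega_i k$ gives, for each pair of dimensions $(2i, 2i+1)$:

$\begin{pmatrix} PE_{(pos+k, 2i)} \\ PE_{(pos+k, 2i+1)} \end{pmatrix} = \begin{pmatrix} cos(\omega_i k) & sin(\omega_i k) \\ -sin(\omega_i k) & cos(\omega_i k) \end{pmatrix} \begin{pmatrix} PE_{(pos, 2i)} \\ PE_{(pos, 2i+1)} \end{pmatrix} \\$

The matrix depends only on the offset k and the dimension index i, not on pos, so shifting by k positions is the same linear map (a rotation) everywhere in the sequence.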

Introduction to BERT by its author Jacob, from Reddit

(link to the original post)

The basic idea is very simple. For several years, people have been getting very good results "pre-training" DNNs as a language model and then fine-tuning on some downstream NLP task (question answering, natural language inference, sentiment analysis, etc.).

Language models are typically left-to-right, e.g.:

"the man went to a store"

P(the | <s>)*P(man|<s> the)*P(went|<s> the man)*...

The problem is that for the downstream task you usually don't want a language model, you want the best possible contextual representation of each word. If each word can only see context to its left, clearly a lot is missing. So one trick that people have done is to also train a right-to-left model, e.g.:

P(store|</s>)*P(a|store </s>)*...

Now you have two representations of each word, one left-to-right and one right-to-left, and you can concatenate them together for your downstream task.

But intuitively, it would be much better if we could train a single model that was deeply bidirectional.

It's unfortunately impossible to train a deep bidirectional model like a normal LM, because that would create cycles where words can indirectly "see themselves," and the predictions become trivial.

What we can do instead is the very simple trick that's used in de-noising auto-encoders, where we mask some percent of words from the input and have to reconstruct those words from context. We call this a "masked LM" but it is often called a Cloze task.

Task 1: Masked LM

Input:
the man [MASK1] to [MASK2] store
Label:
[MASK1] = went; [MASK2] = a

In particular, we feed the input through a deep Transformer encoder and then use the final hidden states corresponding to the masked positions to predict what word was masked, exactly like we would train a language model.

The other thing that's missing from an LM is that it doesn't understand relationships between sentences, which is important for many NLP tasks. To pre-train a sentence relationship model, we use a very simple binary classification task, which is to concatenate two sentences A and B and predict whether B actually comes after A in the original text.

Task 2: Next Sentence Prediction

Input:
the man went to the store [SEP] he bought a gallon of milk
Label:
IsNext

Input:
the man went to the store [SEP] penguins are flightless birds
Label:
NotNext
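A sketch of how such sentence pairs might be built. The 50/50 split between real and random next sentences follows the BERT paper and is not stated in the post above; drawing the random sentence from the same small list (rather than from another document) and the function names are simplifications made for this example.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one (A, B, label) example from a list of consecutive sentences."""
    a = sentences[index]
    if rng.random() < 0.5:
        return a, sentences[index + 1], "IsNext"      # the true next sentence
    # Otherwise pick any sentence other than A itself or its true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return a, rng.choice(candidates), "NotNext"

def format_pair(a, b):
    """Join the two sentences with the separator token."""
    return f"{a} [SEP] {b}"

sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
rng = random.Random(0)
examples = [make_nsp_example(sentences, i % 3, rng) for i in range(30)]
a, b, label = examples[0]
print(format_pair(a, b), "->", label)
```

The label is fully determined by how the pair was constructed, so no manual annotation is needed, which is what makes the task usable on large unlabeled corpora.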

Then we just train a very big model for a lot of steps on a lot of text (we used Wikipedia + a collection of free ebooks that some NLP researchers released publicly last year). To adapt to some downstream task, you just fine-tune the model on the labels from that task for a few epochs.

By doing this we got pretty huge improvements over SOTA on every NLP task that we tried, with almost no task-specific changes to our model needed.

But for us the really amazing and unexpected result is that when we go from a big model (12 Transformer blocks, 768-hidden, 110M parameters) to a really big model (24 Transformer blocks, 1024-hidden, 340M parameters), we get huge improvements even on very small datasets (small == less than 5,000 labeled examples).

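What are BERT's "counter-intuitive" settings?

ELMo's setup is arguably the most intuitive pre-training recipe: language models in two directions are a natural fit for pre-training a BiLSTM and are easy to understand. But it is limited by the capacity of LSTMs, which are hard to make deeper. So how can a Transformer be pre-trained on unlabeled data? The most obvious way is the GPT approach, which has proved to work well. Is there a "better" way? Intuitively, no. BERT found one using two counter-intuitive moves.

(1) Pre-train with tasks that are simpler than language modeling. Intuitively, a deeper model calls for a task harder than language modeling, yet BERT chose two seemingly simpler tasks: cloze (fill in the blank) and sentence-pair prediction.

(2) The cloze task looks like a poor pre-training task for other tasks. It requires masking some words, so the pre-trained model has a flaw: downstream tasks never mask words. BERT addresses this flaw with randomization: 80% of the selected words are replaced with the mask token, 10% are replaced with a random word, and 10% keep the original word. This gives the model the ability to transfer.

It feels as though the author, Jacob Devlin, had a hammer and went looking for a nail. Since the Transformer had already been shown to handle large data, he designed tasks for which large data is available, even if they are "simple" tasks. In theory a BiLSTM could also perform BERT's two tasks, but BERT has the advantage on large data.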
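The 80% / 10% / 10% rule can be sketched as below. The 15% selection rate comes from the BERT paper rather than from this post, and the function name, the use of `None` for unselected positions, and the toy vocabulary are assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None at positions not selected."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                    # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(tok)                # 10%: keep the original word
        else:
            inputs.append(tok)
            labels.append(None)                   # not a prediction target
    return inputs, labels

tokens = "the man went to a store".split() * 50
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab)
print(inputs[:12])
```

Because some selected words are left unchanged or swapped for random words, the model cannot assume that every unmasked token is correct, which narrows the gap between pre-training inputs and downstream inputs that contain no mask token.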

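BERT differs from AI2's ELMo and OpenAI's fine-tuned Transformer in the following ways:

When training the bidirectional language model, it replaces a small number of words, with a small probability, with the mask token or another random word. My personal feeling is that the purpose is to force the model to rely more on its memory of the context. As for the exact probabilities, my guess is that Jacob picked them off the top of his head.
It adds a loss for predicting the next sentence.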

Evaluation of the Transformer model

Advantages

The authors mainly discuss the following three points:

1. Total computational complexity per layer

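2. Amount of computation that can be parallelized, as measured by the minimum number of sequential operations required

The authors measure how much computation can be parallelized by the minimum number of sequential operations required. That is, for a sequence $x_1, x_2, ..., x_n$, self-attention can compute the dot product of any pair $x_i, x_j$ directly, whereas an RNN must process the sequence in order from $x_1$ to $x_n$.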

3. Path length between long-range dependencies in the network

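Here, path length means the length of the path a signal must travel to relate positions across a sequence of length n. A CNN needs more convolutional layers to widen its receptive field and an RNN must step through positions 1 to n one by one, whereas self-attention needs only a single matrix computation. This also shows that self-attention can handle long-range dependencies better than an RNN. Of course, if the computation becomes too expensive, for example when the sequence length n is greater than the representation dimension d, self-attention can be restricted to a window to limit the amount of computation.

4. In addition, the examples the authors give in the appendix suggest that self-attention models are more interpretable: the attention distributions indicate that the model has learned some syntactic and semantic information.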
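The first two criteria can be compared numerically using the asymptotic figures from Table 1 of the paper: $O(n^2 \cdot d)$ per layer for self-attention, $O(n \cdot d^2)$ for a recurrent layer, and $O(k \cdot n \cdot d^2)$ for a convolutional layer with kernel width k. The sketch below drops constant factors, so the numbers are only orders of magnitude; the function names and the default k = 3 are choices made for this example.

```python
def per_layer_cost(n, d, k=3):
    """Order-of-magnitude operation counts per layer (constants dropped)."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }

def sequential_ops(n):
    """Minimum number of sequential operations per layer."""
    return {"self_attention": 1, "recurrent": n, "convolutional": 1}

short = per_layer_cost(n=50, d=512)     # typical sentence: n < d
long = per_layer_cost(n=1000, d=512)    # long sequence: n > d

print(short["self_attention"], short["recurrent"])   # 1280000 13107200
print(long["self_attention"], long["recurrent"])     # 512000000 262144000
print(sequential_ops(50))
```

For sentence-length inputs, where n is smaller than d, self-attention is cheaper per layer than a recurrent layer; once n exceeds d the comparison flips, which is the case where restricting attention to a window becomes attractive.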

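Disadvantages

The disadvantages are not mentioned in the original paper; they were pointed out later in Universal Transformers. I add them here. There are two main points:

In practice: some problems that an RNN solves easily are not handled well by the Transformer, such as copying a string, especially when the sequence is longer than those seen in training
In theory: Transformers are not computationally universal (Turing complete), (I believe) because they cannot implement a "while" loop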

from: http://blog.csdn.net/pipisorry/
