7. Introduction to Transformer-XL Principles

1. Language Model

2. Attention Is All You Need (Transformer) Algorithm Principles Explained

3. ELMo Algorithm Principles Explained

4. OpenAI GPT Algorithm Principles Explained

5. BERT Algorithm Principles Explained

6. Understanding the Essence of Attention from Encoder-Decoder (Seq2Seq)

7. Introduction to Transformer-XL Principles

1. Introduction

In June 2017, Google Brain proposed the Transformer architecture in the paper "Attention Is All You Need". It completely abandons the recurrent mechanism of RNNs and instead processes the whole sequence globally with self-attention. I have also introduced it in my blog post Attention Is All You Need (Transformer) Algorithm Principles Explained.
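To make "global processing with self-attention" concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the shapes and variable names are arbitrary choices of mine, not code from the paper.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X:          [seq_len, d_model] input token representations
    Wq, Wk, Wv: [d_model, d_k] projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every position attends to every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V

# Toy usage: 5 tokens, model dimension 8, head dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (5, 4)
```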

Features of the Transformer architecture:

  1. It relies entirely on the self-attention mechanism for attention.
  2. On top of self-attention, it introduces two multi-head attention mechanisms: Multi-Head Attention and Masked Multi-Head Attention.
  3. The network consists of multiple layers, each made up of a multi-head attention mechanism and a feed-forward network.
  4. Because the attention mechanism is computed globally, it ignores the positional information that is most important in a sequence; positional encodings (Position Encoding) are therefore added, built with sine functions, producing a position vector for each position (see the sketch after the figure below).

[Figure: Transformer architecture]
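Point 4 in the list above can be sketched as follows: a generic sinusoidal position-encoding table in NumPy (the sizes are arbitrary examples and not tied to any particular model):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Build the [seq_len, d_model] table of sine/cosine position vectors."""
    pos = np.arange(seq_len)[:, None]               # positions 0 .. seq_len-1
    dim = np.arange(0, d_model, 2)[None, :]         # even embedding dimensions
    angle = pos / np.power(10000.0, dim / d_model)  # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions use cosine
    return pe

pe = sinusoidal_position_encoding(seq_len=512, d_model=128)  # added to the token embeddings
```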

2. Vanilla Transformer

Vanilla Transformer is a transitional algorithm between the Transformer and Transformer-XL, so before introducing Transformer-XL let us first get to know Vanilla Transformer.

Schematic of Vanilla Transformer:

[Figure: Vanilla Transformer schematic]

The Vanilla Transformer paper uses a 64-layer model that is limited to relatively short inputs of 512 characters. It therefore splits the input into segments and learns from each segment separately, as shown in the schematic above. At test time, when longer inputs must be handled, the model shifts the input one character to the right at every step, predicting a single character at a time.

The three drawbacks of Vanilla Transformer:

  • Limited context length: the maximum dependency distance between characters is bounded by the input length, so the model cannot see words that appeared several sentences earlier.
  • Context fragmentation: text longer than 512 characters is split into segments that are each trained from scratch, independently. With no contextual dependency between segments, training is inefficient and model performance also suffers.
  • Slow inference: at test time, every prediction of the next word requires rebuilding the context and computing it from scratch, which is very slow.
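The fixed-length segmenting during training and the one-character-at-a-time sliding window during evaluation described above can be sketched like this (`model` stands for any trained character-level language model; the helper names are mine, not the paper's):

```python
SEG_LEN = 512  # the fixed context length used by Vanilla Transformer

def make_training_segments(token_ids):
    """Training: split the text into independent segments; no context flows between
    segments, which is the 'context fragmentation' problem above."""
    return [token_ids[i:i + SEG_LEN] for i in range(0, len(token_ids), SEG_LEN)]

def predict_sliding_window(model, token_ids):
    """Evaluation: shift the window right by one character per step and recompute
    the full context every time, which is why inference is so slow."""
    predictions = []
    for i in range(SEG_LEN, len(token_ids)):
        context = token_ids[i - SEG_LEN:i]   # rebuilt and re-processed from scratch
        predictions.append(model(context))   # predicts only the next character
    return predictions
```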

3. Transformer-XL

The Transformer-XL architecture introduces two innovations on top of the vanilla Transformer to overcome its drawbacks:

  1. Recurrence Mechanism
  2. Relative Positional Encoding

Compared with Vanilla Transformer, another advantage of Transformer-XL is that it can be used for both word-level and character-level language modeling.

3.1 Recurrence Mechanism

Transformer-XL still models the input segment by segment, but the essential difference from Vanilla Transformer is that it introduces a recurrence mechanism between segments, so that when modeling the current segment the model can use information from the previous segments to capture long-term dependencies, as shown in the figure below:

[Figure: Transformer-XL segment-level recurrence]

During training, when a later segment is processed, each hidden layer receives two inputs:

  1. The outputs of the preceding nodes within the current segment, as in Vanilla Transformer (the gray lines in the figure above).
  2. The outputs of the nodes in the preceding segment (the green lines in the figure above), which let the model build long-term dependencies. These outputs are passed along through a cache mechanism, so they do not take part in gradient computation. In principle, as long as GPU memory allows, the method can draw on information from arbitrarily many earlier segments.

At the prediction stage:

To predict \(x_{11}\), we simply take the already computed results for [\(x_1\), \(x_2\), ..., \(x_{10}\)] and predict directly. Likewise, when predicting \(x_{12}\), we compute directly on top of [\(x_1\), \(x_2\), ..., \(x_{10}\), \(x_{11}\)], instead of recomputing a fixed number of preceding tokens for every single prediction as Vanilla Transformer does.
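A minimal sketch of this segment-level recurrence (assumed shapes, and `attn_layer` is just a placeholder callable; this illustrates the idea rather than the official implementation):

```python
import torch

def attend_with_memory(attn_layer, hidden_cur, memory_prev):
    """One layer's attention with segment-level recurrence.

    hidden_cur:  [cur_len, batch, d_model] hidden states of the current segment
    memory_prev: [mem_len, batch, d_model] cached hidden states of the previous segment
    attn_layer:  any callable taking (query, key, value) -- a placeholder here
    """
    # The cached states are reused but detached, so no gradients flow into them.
    mem = memory_prev.detach()
    # Keys and values span the cached segment plus the current one;
    # queries come only from the current segment.
    kv = torch.cat([mem, hidden_cur], dim=0)
    output = attn_layer(query=hidden_cur, key=kv, value=kv)
    # The current segment's hidden states become the cache for the next segment.
    new_memory = hidden_cur.detach()
    return output, new_memory
```

At inference time the same cache is simply extended, so predicting \(x_{12}\) reuses the stored states for \(x_1, x_2, \ldots, x_{11}\) instead of recomputing them.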

3.2 Relative Positional Encoding

An important aspect of the Transformer is that it takes the positional information of the sequence into account. In the segmented setting, if we simply keep using the Transformer's positional encoding within each segment, i.e. the same position in different segments gets the same positional encoding, problems arise. For example, the first position of segment \(i-2\) and of segment \(i-1\) would have identical positional encodings, yet their importance for modeling segment \(i\) is clearly not the same (the first position in segment \(i-2\), say, probably matters less). These positions therefore need to be distinguished.

The attention computation in Transformer-XL can be divided into the following four parts (written out as a formula below):

  1. Content-based addressing, i.e. the raw attention score without any positional encoding added.
  2. Content-dependent positional bias, i.e. the positional offset relative to the current content.
  3. Global content bias, which measures the importance of the key.
  4. Global positional bias, which adjusts the importance according to the distance between the query and the key.
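For reference, this is exactly the four-term decomposition of the relative attention score between query position \(i\) and key position \(j\) in the Transformer-XL paper (\(\mathbf{E}\) are token embeddings, \(\mathbf{R}_{i-j}\) is the relative positional encoding, and \(u\), \(v\) are learned global bias vectors):

\[
\mathbf{A}_{i,j}^{\mathrm{rel}} =
\underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\mathbf{E}_{x_j}}_{(a)\ \text{content addressing}}
+ \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\mathbf{R}_{i-j}}_{(b)\ \text{content-dependent positional bias}}
+ \underbrace{u^{\top}\mathbf{W}_{k,E}\mathbf{E}_{x_j}}_{(c)\ \text{global content bias}}
+ \underbrace{v^{\top}\mathbf{W}_{k,R}\mathbf{R}_{i-j}}_{(d)\ \text{global positional bias}}
\]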

4. Summary

4.1 Advantages

  1. It achieves state-of-the-art language modeling results on several different datasets (large/small, character-level/word-level, etc.).
  2. It combines two important concepts in deep learning, the recurrence mechanism and the attention mechanism, allowing the model to learn long-term dependencies, and this capability may extend to other areas of deep learning that need it, such as audio analysis (e.g. speech data at 16k samples per second).
  3. Inference is very fast: 300 to 1800 times faster than the previous state-of-the-art Transformer-based language modeling method.

4.2 Shortcomings

  1. It has not yet been applied to concrete NLP tasks such as sentiment analysis, QA, and so on, and no comparison is given of its advantages over other Transformer-based models such as BERT.
  2. Training requires a large amount of TPU resources.
