Paper | Attention Is All You Need

[This is a reading note on a paper with 4000+ citations. Although the blogger does not work on NLP, I am still very interested in it. Of course, the blogger's understanding and translation of this paper are rather rough.]

Motivation: the attention mechanism is very important; the current SOTA NLP models connect the encoder and decoder through attention mechanisms.

Contribution: in this paper, the authors propose the Transformer. This architecture has nothing to do with recurrent or convolutional networks and relies only on attention mechanisms. Overall, the model trains faster, is easier to parallelize, and achieves the best results.

1. Motivation in detail

Current recurrent networks follow the same sequential logic that humans use when describing a sequence: computation advances along the symbol positions, i.e., the hidden state \(h_t\) at time \(t\) is a function of the previous hidden state \(h_{t-1}\) and the input at the current time step.
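
To make this sequential dependence concrete, here is a minimal NumPy sketch of a vanilla RNN step (the function name and the toy sizes are the blogger's own illustration, not from the paper): each hidden state has to wait for the previous one, so the loop over time cannot be parallelized.

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b):
    """Vanilla RNN: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)."""
    h = np.zeros(W_h.shape[0])
    hidden_states = []
    for x_t in x_seq:                        # strictly sequential: step t needs step t-1
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hidden_states.append(h)
    return np.stack(hidden_states)

# Toy sizes, chosen only for illustration.
T, d_in, d_hid = 5, 8, 16
rng = np.random.default_rng(0)
h_all = rnn_forward(rng.normal(size=(T, d_in)),
                    rng.normal(size=(d_hid, d_hid)) * 0.1,
                    rng.normal(size=(d_hid, d_in)) * 0.1,
                    np.zeros(d_hid))
print(h_all.shape)  # (5, 16)
```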

However, this sequential logic has a drawback: it is not conducive to parallelization. When the sequence is long, this disadvantage is magnified.

This is where the attention mechanism enters the authors' view. Attention mechanisms are already an integral part of many sequence modeling and transduction models, and they can model dependencies regardless of their distance. However, so far attention has always been used together with a recurrent network.

The proposed Transformer frees attention from the recurrent-network framework. Once this is done, the Transformer can be parallelized.

2. Related Work

There has been some work attempting to reduce sequential computation, such as ByteNet and ConvS2S. They are all built on convolutional neural networks. Their common problem: as the distance between two positions grows, the amount of computation needed to relate them grows as well, linearly for ConvS2S and logarithmically for ByteNet.

What the Transformer achieves: the number of operations becomes a fixed constant, albeit at the cost of some loss in resolution, which the authors counteract with multi-head attention.

Self-attention: an attention mechanism that relates different positions within a single sequence, used to compute a representation of the sequence itself.

[End-to-end memory networks: the blogger did not understand this part and did not look into it.]

The authors claim that the Transformer is the first transduction model based entirely on self-attention, without any RNN or convolutional structure.

3. The Transformer architecture

[Figure: the Transformer model architecture]

The left side is the encoder; the right side is the decoder.

As shown on the left, the encoder is a stack of \(N = 6\) identical blocks, each of which contains two sub-layers: a multi-head attention sub-layer (orange) and a fully connected feed-forward sub-layer (blue). Each sub-layer is wrapped in a shortcut (residual) connection followed by layer normalization. All sub-layers produce outputs of dimension \(d_{model} = 512\).

The right side is the decoder. The decoder is also a stack of 6 identical blocks, but each block inserts an additional multi-head attention sub-layer (the middle one), which attends over the output of the encoder. In addition, the multi-head attention sub-layer at the bottom is modified: attention is performed only over outputs at earlier positions, ignoring later ones. This modification is implemented with a mask.
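
A minimal NumPy sketch of the "sub-layer + shortcut connection + layer normalization" pattern mentioned above, i.e. LayerNorm(x + Sublayer(x)); the function names, the epsilon, and the dummy sub-layer are the blogger's own choices, not from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's d_model-dimensional vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Shortcut (residual) connection around the sub-layer, then layer normalization.
    return layer_norm(x + sublayer(x))

# Example: wrap a dummy sub-layer around a (seq_len, d_model) input.
x = np.random.default_rng(0).normal(size=(10, 512))
out = add_and_norm(x, lambda h: 0.5 * h)   # placeholder for attention or the feed-forward net
print(out.shape)  # (10, 512)
```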

3.1 The attention mechanism in detail

The attention mechanism is essentially quite simple: the input is a query and a set of key-value pairs; the output is a weighted combination of the values; the weights are computed by a compatibility function of the query with the corresponding keys.

3.1.1 Scaled dot-product attention

The authors call their proposed attention mechanism scaled dot-product attention. See the left side of the figure:

[Figure: scaled dot-product attention (left) and multi-head attention (right)]

Take the dot product of the query with the keys, divide by \(\sqrt{d_k}\) (the scaling), pass the result through a softmax, and use it to weight the values; this completes the attention step.

In practice this process is carried out in matrix form, i.e., the queries, keys, and values are first packed into matrices \(Q\), \(K\), and \(V\), and then one computes:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
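
A minimal NumPy sketch of this formula; the function name and the optional mask argument are the blogger's additions (the mask will be used later for the decoder), so treat this as an illustration rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional (n_q, n_k) boolean array; True marks positions
          that must NOT be attended to (their scores become -inf).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V

# Toy example with d_k = d_v = 64.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (10, 64)
```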

Besides the dot-product form used here, there is another commonly used attention strategy: additive attention. Additive attention computes the compatibility function with a single-layer feed-forward network. Although the two are theoretically similar in computational complexity, dot-product attention is more efficient in practice because matrix multiplication has highly optimized implementations. In terms of quality, dot-product attention underperforms additive attention when \(d_k\) is large. A possible reason (footnote 4 in the paper): when the dimension is high, the dot products can become very large, so the gradients of the softmax become very small and training becomes difficult. This is why the dot products are divided by \(\sqrt{d_k}\).

3.1.2 Multi-head attention

To summarize the scaled dot-product attention of the previous section: there is a single attention function whose inputs are the keys, queries, and values and whose output is the weighted values. Note that both input and output are \(d_{model} = 512\)-dimensional.

Beyond this, the authors go one step further, as shown on the right of the figure above: the queries and keys are first linearly projected to \(d_k = 64\) dimensions and the values to \(d_v = 64\) dimensions, and the attention mechanism of the previous section is then applied to obtain a \(d_v\)-dimensional output. This is done \(h = 8\) times, each time with different linear projections and hence a different attention function. Finally, the \(h\) outputs of \(d_v\) dimensions each are concatenated and passed through one more linear projection to obtain the final \(d_{model}\)-dimensional output. This is the so-called multi-head attention.
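
A minimal NumPy sketch of this multi-head procedure, reusing the scaled_dot_product_attention sketch from 3.1.1. The projection matrices here are random placeholders standing in for the learned weights, and the function signature is the blogger's own; only the project-attend-concatenate-project structure follows the paper.

```python
import numpy as np
# relies on scaled_dot_product_attention from the sketch in 3.1.1

def multi_head_attention(X_q, X_kv, h=8, d_model=512, mask=None, seed=0):
    """Project to h heads of d_k = d_v = d_model / h dims, attend, concat, project back."""
    d_k = d_model // h                               # 512 / 8 = 64
    rng = np.random.default_rng(seed)
    # Random stand-ins for the learned projections W_i^Q, W_i^K, W_i^V and W^O.
    W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.05 for _ in range(3))
    W_o = rng.normal(size=(h * d_k, d_model)) * 0.05
    heads = []
    for i in range(h):                               # one attention function per head
        head = scaled_dot_product_attention(X_q @ W_q[i], X_kv @ W_k[i], X_kv @ W_v[i], mask)
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ W_o      # back to d_model dimensions

# Self-attention over a toy (seq_len, d_model) input.
X = np.random.default_rng(1).normal(size=(10, 512))
print(multi_head_attention(X, X).shape)  # (10, 512)
```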

What does multi-head gain over the earlier "single head"? The blogger's thoughts:

  1. The keys and queries are not only processed jointly; the keys, queries, and values are also each transformed separately. Some projections of the keys and queries may be informative in their own right.

  2. Eight different "single heads" are computed and their results are then combined with a final reweighting. This in effect lets each of the eight attention functions be responsible for a different representation subspace. For instance, some heads may pay more attention to subjects, others to verbs, and so on.

  3. Because the dimension of each "single head" is reduced, the overall amount of computation does not increase.

Going back to the first figure. In the Transformer, the authors use multi-head attention in three places:

  1. Between the encoder and the decoder: the queries come from the previous decoder layer, while the keys and values come from the output of the encoder. That is, every position in the decoder can attend over all positions in the input sequence. This is the typical encoder-decoder attention mechanism.

  2. In the encoder: the keys, values, and queries all come from the same place, namely the output of the previous encoder layer. That is, each position in a layer attends over every position in the previous layer.

  3. In the decoder: again the keys, values, and queries all come from the output of the previous layer, and each position attends over the positions of the previous layer. Note that leftward information flow must be prevented, so the masked-out connections are set to negative infinity (completely unrelated) before the softmax; see the sketch after this list.
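
A minimal sketch of this masking, again using the scaled_dot_product_attention function from 3.1.1. The boolean-mask representation is the blogger's choice; the idea of setting the illegal connections to negative infinity before the softmax is from the paper.

```python
import numpy as np
# relies on scaled_dot_product_attention from the sketch in 3.1.1

# Causal (look-ahead) mask: True above the diagonal marks the "future" positions
# that each query position is forbidden to attend to.
seq_len = 5
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
print(causal_mask.astype(int))
# [[0 1 1 1 1]
#  [0 0 1 1 1]
#  [0 0 0 1 1]
#  [0 0 0 0 1]
#  [0 0 0 0 0]]

# In the decoder's self-attention, the masked scores become -inf, so after the
# softmax each position only receives weight from itself and earlier positions.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(seq_len, 64))
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # (5, 64)
```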

3.2 The feed-forward network

The fully connected feed-forward network consists of two linear transformations with a ReLU non-linearity in between. The input and output are 512-dimensional, and the inner layer has dimension 2048.
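
A minimal NumPy sketch of this two-layer network, FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently; the weights here are random placeholders rather than learned parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.normal(size=(10, d_model))                  # (seq_len, d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```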

Embeddings: the blogger is not familiar with these, so this part was not read.

3.3 Positional encoding

Since the Transformer contains neither recurrence nor convolution, position information has to be encoded explicitly. Details omitted.
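
Although the details are omitted above, for reference here is a minimal NumPy sketch of the sinusoidal positional encoding the paper uses, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the encoding is added to the input embeddings. The function name and toy sequence length are the blogger's own.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices: sine
    pe[:, 1::2] = np.cos(angles)                       # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(10)
print(pe.shape)  # (10, 512)
```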

The rest is omitted.

Overall, Google has broken the entrenched assumption of the earlier models: that attention must be built on top of an RNN or CNN. This breakthrough is the biggest contribution of this paper.


Origin: www.cnblogs.com/RyanXing/p/Transformer.html