【Paper 01】《Attention is all you need》

Attention is all you need

Bilibili video link:
https://www.bilibili.com/video/BV1pu411o7BE/?spm_id_from=333.788&vd_source=fab4cd66aafcb3b54c4bc627c1dcaac1

Authors


Jakob: Proposed replacing RNNs with self-attention.

Ashish, Illia: Designed and implemented the first Transformer models.

Noam: Proposed scaled dot-product attention, multi-head attention, and the parameter-free position representation.

Niki: Designed, implemented, tuned and evaluated countless model variants in the original codebase and in tensor2tensor.

Llion: Experimented with novel model variants, was responsible for the initial codebase, and worked on efficient inference and visualization.

Lukasz and Aidan: Designed parts of tensor2tensor, which replaced the earlier codebase.

Summary

The dominant sequence transduction models (models that transform one sequence into another) are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new, simple network architecture, the Transformer, based entirely on attention mechanisms, dispensing with recurrence and convolutions altogether. Experiments on two machine translation tasks show that these models are superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results by more than 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on 8 GPUs, a small fraction of the training cost of the best models in the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text, and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video. Making generation less sequential is another research goal of ours. The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

1 Introduction

Recurrent neural networks, in particular long short-term memory [13] and gated recurrent [7] neural networks, are the state-of-the-art approaches for sequence transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences (step by step: if the sequence is a sentence, it is processed word by word). Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$ (the t-th word produces hidden state $h_t$), where $h_t$ is determined by the previous hidden state $h_{t-1}$ and the input at position t itself. The drawback is that this cannot be computed in parallel: memory constraints limit batching across examples, and historical information has to be accumulated step by step.

Attention mechanisms have mainly been used to let the encoder pass information to the decoder more effectively, and have so far been used together with recurrent networks. In this work we propose the Transformer, which dispenses with recurrence and instead relies entirely on an attention mechanism. The Transformer allows for significantly more parallelization.

2 Related Work

Prior work asked how to replace RNNs with convolutional neural networks as the basic building block. It is difficult to model relatively long sequences with convolutional networks, because a convolution only sees a small window at a time, whereas the Transformer can attend to every position in the sequence within a single layer. Borrowing the idea of multiple output channels from convolutional networks, the Transformer introduces multi-head self-attention (the multi-head attention mechanism).

3 Model Architecture

Most competing neural sequence transduction models have an encoder-decoder structure [5, 2, 35].

Encoder: maps an input sequence of symbol representations $(x_1, \ldots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$. Here $x_t$ denotes the t-th word and $z_t$ is the corresponding vector representation of $x_t$.

Decoder: generates an output sequence of symbols $(y_1, \ldots, y_m)$, one element at a time. [Note that n and m may differ.] At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next symbol.

The Transformer follows this overall architecture, using stacked self-attention and position-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sublayers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. Each sublayer uses a residual connection [11] followed by layer normalization [1]; that is, the output of each sublayer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sublayer itself. To facilitate these residual connections, all sublayers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$. (This dimension is fixed throughout.)
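As a rough illustration of this sublayer pattern, here is a minimal PyTorch sketch (the class name SublayerConnection and the dropout placement follow common re-implementations, not the authors' tensor2tensor code):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # `sublayer` is any callable mapping (batch, seq, d_model) -> (batch, seq, d_model),
        # e.g. a self-attention block or a position-wise feed-forward block.
        return self.norm(x + self.dropout(sublayer(x)))
```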

Difference from BatchNorm:

In a mini-batch, BatchNorm normalizes each feature (each column) to mean 0 and variance 1 (subtract that column's mean and divide by its standard deviation).

LayerNorm:

LayerNorm instead normalizes each sample (each row) to mean 0 and variance 1, independently of the other samples in the mini-batch.

The input to the Transformer is a sequence, so the activations form a three-dimensional tensor of shape (batch, sequence length, feature). BatchNorm computes statistics for each feature across the batch and sequence dimensions, whereas LayerNorm computes statistics within each token (or each sample) across its features, which makes it more robust when sequence lengths vary.
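A small PyTorch demo of the different normalization axes, assuming an activation tensor of shape (batch, seq_len, d_model); it only illustrates where the statistics are computed:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 4, 8
x = torch.randn(batch, seq_len, d_model)

# LayerNorm: statistics over the last (feature) dimension, separately per token.
y_ln = nn.LayerNorm(d_model)(x)                                    # (2, 4, 8)

# BatchNorm1d expects (batch, channels, length), so move features to dim 1;
# statistics are computed per feature, across batch and sequence positions.
y_bn = nn.BatchNorm1d(d_model)(x.transpose(1, 2)).transpose(1, 2)  # (2, 4, 8)

print(y_ln.mean(dim=-1)[0])     # close to 0 for every token (per-row normalization)
print(y_bn.mean(dim=(0, 1)))    # close to 0 for every feature (per-column normalization)
```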

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sublayers found in each encoder layer, the decoder inserts a third sublayer, a masked multi-head attention over the decoder's own outputs. The mask ensures that position t cannot see positions after t, so that the behavior at training time matches the behavior at prediction time. As in the encoder, we employ residual connections around each sublayer, followed by layer normalization.

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values (so the output has the same dimension as the values), where the weight assigned to each value is computed by a compatibility (similarity) function between the query and the corresponding key.

3.2.1 Scaled Dot-Product Attention

The queries and keys have the same dimension $d_k$, and the values have dimension $d_v$; the output therefore also has dimension $d_v$. We compute the dot product (inner product) of the query with each key: the larger the inner product, the more similar the two vectors, and an inner product of 0 means the two vectors are orthogonal. Each inner product is then divided by $\sqrt{d_k}$. With n key-value pairs, one query yields n scores; passing them through a softmax produces n non-negative weights that sum to 1, and applying these weights to the values gives the output.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

The two most commonly used attention functions are additive attention [2] and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm except for the scaling factor of $1/\sqrt{d_k}$; additive attention computes the compatibility function using a feed-forward network with a single hidden layer. For small values of $d_k$ the two mechanisms perform similarly, but additive attention outperforms dot-product attention for larger values of $d_k$ [3]. We suspect that for large $d_k$ the dot products grow large in magnitude, pushing the softmax function into regions where its gradients are extremely small. To counteract this effect, we scale the dot products by $1/\sqrt{d_k}$.

(Figure 2, left, in the paper: the Scaled Dot-Product Attention block, consisting of MatMul of Q and K, Scale, Mask (opt.), SoftMax, and a final MatMul with V.)

The Mask (opt.) step is what prevents position t from seeing content that comes after position t.
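A minimal PyTorch sketch of scaled dot-product attention with an optional mask (the function name and signature are mine, not from the paper's code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., n, d_k); k: (..., m, d_k); v: (..., m, d_v). Returns (..., n, d_v)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., n, m) similarities
    if mask is not None:
        # Positions where mask == 0 (illegal connections) get -inf,
        # so softmax assigns them a weight of approximately 0.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                      # non-negative, each row sums to 1
    return weights @ v                                       # weighted sum of the values
```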

3.2.2 Multi-Head Attention

The queries, keys, and values are first passed through linear layers (projections to a lower dimension), then scaled dot-product attention is applied. This is done h times, producing h outputs (heads); the h output vectors are concatenated and passed through a final linear projection, which gives the output of multi-head attention.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$


In this work we employ h = 8 parallel attention layers, or heads. For each head we use $d_k = d_v = d_{model}/h = 64$ (the model dimension divided by h: 512/8 = 64). Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
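Putting the pieces together, a compact multi-head attention module might look as follows (an illustrative sketch, not the tensor2tensor implementation; it reuses the scaled_dot_product_attention function sketched above):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads                  # 512 / 8 = 64 per head
        # Linear projections for queries, keys, values, and the output projection W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch, n, d_model = q.shape

        def split_heads(x):
            # (batch, len, d_model) -> (batch, num_heads, len, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split_heads(self.w_q(q)), split_heads(self.w_k(k)), split_heads(self.w_v(v))
        out = scaled_dot_product_attention(q, k, v, mask)               # (batch, heads, n, d_k)
        out = out.transpose(1, 2).contiguous().view(batch, n, d_model)  # concatenate the heads
        return self.w_o(out)                                            # final linear projection
```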

3.2.3 Applications of Attention in our Model

Transformer uses multi-head attention in three different ways:

  1. In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanism in sequence-to-sequence models such as [38, 2, 9].
  2. The encoder contains self-attention layers. In a self-attention layer, all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
  3. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections (see the sketch after this list). See Figure 2.

Suppose the sentence length is n: the input is then n vectors of dimension d (assume a batch size of 1). This input is used three times, as the keys, the queries, and the values, which is why it is called self-attention: the keys, queries, and values are all the same thing.
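As a concrete (hypothetical) usage of the two sketches above, self-attention with the decoder's causal mask looks like this: Q, K, and V are all the same tensor, and a lower-triangular mask marks the legal connections.

```python
import torch

n, d_model = 5, 512
x = torch.randn(1, n, d_model)                # one sentence of n tokens, batch size 1

# Lower-triangular mask: position t may attend only to positions <= t.
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))

self_attn = MultiHeadAttention(d_model=d_model, num_heads=8)
y = self_attn(x, x, x, mask=causal_mask)      # self-attention: Q = K = V = x
print(y.shape)                                # torch.Size([1, 5, 512])
```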


3.3 Position-wise Feed-Forward Networks

In addition to the attention sublayers, each layer in our encoder and decoder contains a fully connected feed-forward network (an MLP), which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.

(The same MLP is applied once to every position, i.e. to every word; this is what "position-wise" means.)

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)W_2 + b_2$$

(Here x is a vector of length 512; $W_1$ projects it from 512 up to 2048 dimensions, expanding the dimension by a factor of four, and $W_2$ projects the 2048 dimensions back down to 512. The formula is therefore a single-hidden-layer MLP whose hidden layer is four times wider, with the output returning to the size of the input.)
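A short PyTorch sketch of this position-wise feed-forward network, using the dimensions from the paper as defaults (the class name is mine):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # 512 -> 2048: expand by a factor of four
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # 2048 -> 512: project back to the model dimension
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so the same two matrices are applied to every position independently.
        return self.net(x)
```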


3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.

The input is a token, which needs to be mapped to a vector; the embedding learns a vector of length $d_{model}$ (here 512) to represent it. Both the encoder and the decoder need an embedding, and the linear layer in front of the softmax uses the same weight matrix.
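A rough sketch of the weight tying and the $\sqrt{d_{model}}$ scaling (variable names are mine; the vocabulary size is just an example):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 37000
embedding = nn.Embedding(vocab_size, d_model)

# The pre-softmax linear transformation shares the embedding weight matrix (tied weights).
generator = nn.Linear(d_model, vocab_size, bias=False)
generator.weight = embedding.weight

tokens = torch.tensor([[5, 42, 7]])                  # (batch, seq_len) token ids
x = embedding(tokens) * math.sqrt(d_model)           # multiply embedding outputs by sqrt(d_model)
probs = torch.softmax(generator(x), dim=-1)          # predicted next-token probabilities
```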

3.5 Positional Encoding

Attention by itself contains no information about order: the output is a weighted sum of the values, with weights given by the similarity between keys and queries, so the attention output does not depend on the positions of the tokens. Order (timing) information therefore has to be added to the input, which is done by adding an encoding of position i to the input embedding.

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$
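These formulas are commonly implemented as a precomputed table that is added to the token embeddings; a sketch under that assumption (not the authors' code):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table with PE(pos, 2i) = sin(...) and PE(pos, 2i+1) = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)            # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                        # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)    # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * div)    # odd dimensions use cosine
    return pe

# Added to the (scaled) embeddings before the first layer:
# x = embedding(tokens) * math.sqrt(d_model) + sinusoidal_positional_encoding(seq_len, d_model)
```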

4. Why Self-Attention

Compare:

(Table 1 in the paper: complexity per layer, minimum number of sequential operations, and maximum path length. Self-attention: $O(n^2 \cdot d)$ per layer, $O(1)$ sequential operations, $O(1)$ maximum path length; recurrent: $O(n \cdot d^2)$, $O(n)$, $O(n)$; convolutional with kernel size k: $O(k \cdot n \cdot d^2)$, $O(1)$, $O(\log_k n)$; restricted self-attention with window r: $O(r \cdot n \cdot d)$, $O(1)$, $O(n/r)$.)

5. Experiment

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding (BPE) [3], which has a shared source-target vocabulary of about 37,000 tokens (rather than building separate dictionaries for English and German, a single shared dictionary is used, so the embedding weights can be shared). For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36 million sentences, and split tokens into a 32,000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs with approximately 25,000 source tokens and 25,000 target tokens.

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models, using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps, or 12 hours. For our big models (described on the bottom line of Table 3), the step time was 1.0 seconds; the big models were trained for 300,000 steps (3.5 days).

(TPUs are well suited to relatively large matrix multiplications.)

5.3 Optimizer

We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ε = 10^−9. We varied the learning rate over the course of training, according to the formula:

$$lrate = d_{model}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)$$

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
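The schedule is easy to write down directly; a minimal sketch (the function name is mine):

```python
def transformer_lrate(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for warmup_steps steps, then decay proportional to 1/sqrt(step_num)."""
    step_num = max(step_num, 1)          # avoid 0 ** -0.5 on the very first step
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Peaks at step 4000, then decays with the inverse square root of the step number.
print(transformer_lrate(100), transformer_lrate(4000), transformer_lrate(100000))
```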

5.4 Regularization

Residual Dropout: We apply dropout [33] to the output of each sublayer, before it is added to the sublayer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$. (Dropout is used in many places.)

Label Smoothing: During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This makes the model more uncertain (and hurts perplexity), but improves accuracy and BLEU score.
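In current PyTorch this regularizer is available directly on the cross-entropy loss; a minimal example (the vocabulary size and batch are made up):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # epsilon_ls = 0.1

logits = torch.randn(8, 37000)               # (batch, vocab_size) decoder outputs
targets = torch.randint(0, 37000, (8,))      # gold next-token ids
loss = criterion(logits, targets)            # target distribution: ~0.9 on the gold token,
                                             # the remaining 0.1 spread over the vocabulary
```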

6. Results

6.3 Model Variations

Comparison between different hyperparameters:

(Table 3 in the paper: variations on the base Transformer architecture, varying the number of layers, $d_{model}$, $d_{ff}$, the number of heads, $d_k$, $d_v$, dropout, label smoothing, and the positional encoding, with the resulting perplexity and BLEU on the development set.)
