Abstract
In a sequence transduction model, an "attention mechanism" is used between the encoder and the decoder. Experiments on two machine translation tasks show that the model outperforms other architectures.
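The attention between encoder and decoder is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. A minimal NumPy sketch (shapes and variable names are illustrative, not the paper's code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values
```

In encoder-decoder attention, the queries come from the decoder and the keys/values from the encoder output, so each decoder position attends over the whole source sentence.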
Conclusion
The Transformer is the first sequence transduction model based purely on the attention mechanism. On machine translation tasks, the Transformer trains much faster than other architectures and achieves better results.
Introduction
Describes the shortcomings of RNNs: an RNN compresses all previous information into a hidden state, and each step depends on the previous one, so computation cannot be parallelized across time, which limits performance.
This paragraph explains that the Transformer no longer uses recurrent layers but is purely based on attention, so its degree of parallelism is much higher and it reaches better results in less training time.
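The sequential bottleneck is visible in a vanilla RNN forward pass: each hidden state needs the previous one, so the time loop cannot be parallelized. A small sketch (weight names are illustrative):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b):
    """Vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b).
    h_t depends on h_{t-1}, so the loop is strictly sequential over t."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in x_seq:                       # cannot be parallelized over time
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)                     # (seq_len, hidden_dim)
```

Self-attention removes this chain of dependencies: every position attends to every other in one matrix multiplication, which is why the Transformer parallelizes well.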
Background
Discusses prior work that replaces the recurrent network with a convolutional neural network to reduce sequential computation. It also notes that a convolution has multiple output channels, and each output channel can be viewed as recognizing a different pattern.
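The "one filter per output channel" idea can be shown with a minimal 1-D convolution over a sequence (valid padding, no bias; a sketch, not any library's implementation):

```python
import numpy as np

def conv1d(x, kernels):
    """1-D convolution with valid padding.
    x: (seq_len, in_ch) input sequence; kernels: (out_ch, k, in_ch).
    Each output channel applies its own filter, so each channel can
    learn to detect a different local pattern in the sequence."""
    out_ch, k, in_ch = kernels.shape
    seq_len = x.shape[0]
    out = np.empty((seq_len - k + 1, out_ch))
    for t in range(seq_len - k + 1):
        window = x[t:t + k]  # (k, in_ch) local window at position t
        # contract over kernel width and input channels -> (out_ch,)
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out
```

Note the contrast with attention: a convolution only sees a window of k positions at a time, so relating two distant positions takes a stack of layers, whereas attention connects them in one step.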
Training
Describes the source and preprocessing of the training datasets.
Hardware: training ran on 8 NVIDIA P100 GPUs, taking about 12 hours for the base model.
Regularization: a large number of dropout layers are used to regularize the model.
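Dropout zeroes each activation with probability p during training and rescales the survivors so the expected value is unchanged ("inverted dropout"). A minimal sketch:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: during training, drop each element with
    probability p and scale the rest by 1/(1-p); at inference,
    return x unchanged."""
    if not training or p == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each element with prob 1-p
    return x * mask / (1.0 - p)
```

The 1/(1-p) rescaling means no extra correction is needed at inference time, which is why the same network can simply run dropout-free when evaluating.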