The sixth work discusses the utility of the Transformer in NLP tasks, in addition to some basic tasks such as text classification, question answering, and sequence

Author: Zen and the Art of Computer Programming

1 Introduction

Transformer

Overview

The Transformer is a neural network architecture based on the self-attention mechanism, proposed at NIPS 2017, and it represents an important advance in machine learning. It can model both long and short sequences, and it processes text by applying attention within an encoder-decoder structure. Compared with earlier RNN and CNN models, it has clear advantages in sequence modeling, especially in tasks such as translation, text summarization, and language modeling. Its main features are as follows:
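At the core of this architecture is scaled dot-product attention, in which each position weighs every other position by the similarity between its query and their keys. Below is a minimal NumPy sketch of that computation; the function name and the toy shapes are illustrative assumptions, not code from the original post.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                      # (batch, seq, d_k)

# Toy example: a batch of 2 sequences, 5 tokens each, 64-dimensional heads.
Q = np.random.randn(2, 5, 64)
K = np.random.randn(2, 5, 64)
V = np.random.randn(2, 5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 5, 64)
```

In the full model this operation is run once per attention head, and the heads' outputs are concatenated and projected back to the model dimension.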

  • The model structure is flexible: it replaces convolutional layers with fully connected layers to reduce compute consumption; it introduces a multi-head attention mechanism to make full use of information in different representation subspaces; and it uses residual connections to stabilize gradient propagation and speed up training.
  • Adaptive activation selection: the activation range of the softmax and sigmoid functions is limited by the value of the input features, which hurts model performance, so a smoother nonlinear activation such as the GELU function is used (a sketch appears after this list).
  • Positional encoding: positional encodings are added to the input embeddings so that the model can learn absolute position information (see the sketch after this list).
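As a companion to the last two bullets, here is a hedged NumPy sketch of the sinusoidal positional encoding from the original paper and the common tanh approximation of GELU; the function names and the seq_len/d_model values are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] uses cos."""
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe  # added to the token embeddings before the first layer

def gelu(x):
    """Tanh approximation of GELU, roughly x * Phi(x)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
print(pe.shape)                           # (128, 512)
print(gelu(np.array([-1.0, 0.0, 1.0])))   # smooth, nonzero gradient for x < 0
```

Because each position gets a distinct, deterministic pattern of sines and cosines, the model can recover order information even though self-attention itself is permutation-invariant.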

Why use Transformers?

1. Scale controllable

The training time complexity and number of parameters of the Transformer are much smaller than those of RNNs and similar models, so it can achieve better results on large-scale pre-training tasks. Because each GPU only needs to process one batch of data at a time, and multiple worker threads can be used to accelerate training, memory and hardware constraints are much less of a concern.

2. Parallelizable

The parallel design of the Transformer model makes it straightforward to distribute computation across multiple GPUs, which can greatly reduce training time, as in the sketch below.
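As a rough illustration of this point (not code from the post), the sketch below wraps PyTorch's built-in nn.TransformerEncoder in nn.DataParallel, assuming PyTorch is installed and more than one GPU is visible, so each forward pass splits the batch across devices; the layer sizes and batch shape are arbitrary.

```python
import torch
import torch.nn as nn

# Minimal data-parallel sketch: a 6-layer Transformer encoder replicated on
# every visible GPU; each replica processes a slice of the batch.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=6)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# Dummy batch: (batch size, sequence length, model dimension).
device = next(model.parameters()).device
x = torch.randn(32, 128, 512, device=device)
out = model(x)  # self-attention over all positions is computed in parallel
print(out.shape)  # torch.Size([32, 128, 512])
```

Unlike an RNN, which must step through a sequence one token at a time, self-attention computes all positions at once, which is what makes this kind of multi-GPU data parallelism pay off.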
