Title: Improving Language Understanding by Generative Pre-Training

Large unlabeled text corpora are abundant, but labeled corpora are scarce, and training a separate model from scratch for each task rarely achieves good results. The paper shows that generative pre-training on an unlabeled corpus, followed by fine-tuning for each task, works well.
Introduction
It makes sense to learn text representations from unlabeled text, just as word-embedding pre-training did before. Existing pre-training methods have problems: the model must be adjusted per task, the learning schemes are complex, and auxiliary objective functions are required. In short: cumbersome.
This paper explores a semi-supervised approach for language understanding tasks: unsupervised pre-training + supervised fine-tuning.
The goal is to learn a general representation that can be used for a wide range of tasks with only minor changes.
The model is a Transformer. Compared with RNNs, the Transformer can capture long-range dependencies in text and transfers more robustly across different tasks.
The validation experiments use four kinds of tasks: natural language inference, question answering, semantic similarity, and text classification.
Unsupervised pre-training
The training data is an unlabeled corpus of tokens U = {u_1, ..., u_n}.

The training objective is to maximize the following formula:

L_1(U) = Σ_i log P(u_i | u_{i−k}, ..., u_{i−1}; Θ)

where k is the size of the context window. The objective can be understood as: predict each token from the k tokens that precede it.
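Under this objective, L_1 is just the sum of the log-probabilities the model assigns to each next token. A minimal numeric sketch (the helper name and the toy probabilities are illustrative, not from the paper):

```python
import math

def lm_objective(token_log_probs):
    """L_1(U): sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the corpus."""
    return sum(token_log_probs)

# Toy per-token probabilities a hypothetical model assigns to each next token.
probs = [0.5, 0.25, 0.125]
L1 = lm_objective(math.log(p) for p in probs)
print(L1)  # log(0.5) + log(0.25) + log(0.125) ≈ -4.159
```

Maximizing L_1 pushes these probabilities toward 1, i.e., toward confident next-token predictions.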
For example, given U = "the weather is nice today" (tokenized by word), the model needs the following prediction abilities:

given "the", predict the next token "weather"
given "the weather", predict the next token "is"
given "the weather is", predict the next token "nice"
given "the weather is nice", predict the next token "today"
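Building these (context, target) training pairs from a token sequence can be sketched as follows (the helper name is made up for illustration; word-level tokens are used here):

```python
def next_token_pairs(tokens, k):
    """Build (context, target) pairs: each token is predicted from
    at most the k tokens preceding it, matching the L_1 objective."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]  # window of up to k preceding tokens
        pairs.append((context, tokens[i]))
    return pairs

U = ["the", "weather", "is", "nice", "today"]
for context, target in next_token_pairs(U, k=3):
    print(context, "->", target)
```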
Θ in the objective function denotes the model parameters, so the formula searches for the Θ that maximizes L_1(U). Training uses stochastic gradient descent.
The model is a multi-layer Transformer decoder. The overall computation is:

h_0 = U W_e + W_p
h_l = transformer_block(h_{l−1})  for l = 1, ..., n
P(u) = softmax(h_n W_e^T)

W_e is the token embedding matrix, which maps tokens to vectors.
W_p is the position embedding matrix, which maps positions to vectors.
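The forward pass above can be sketched with NumPy. Note that transformer_block is a stand-in here (a real decoder block applies masked self-attention plus a feed-forward network); the embedding and softmax steps follow the formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers, max_len = 50, 16, 2, 8

W_e = 0.02 * rng.normal(size=(vocab, d_model))    # token embedding matrix
W_p = 0.02 * rng.normal(size=(max_len, d_model))  # position embedding matrix

def transformer_block(h):
    # Placeholder: a real block applies masked self-attention + feed-forward.
    return h

def forward(token_ids):
    h = W_e[token_ids] + W_p[:len(token_ids)]     # h_0 = U W_e + W_p
    for _ in range(n_layers):
        h = transformer_block(h)                  # h_l = transformer_block(h_{l-1})
    logits = h @ W_e.T                            # output layer reuses W_e
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)           # P(u) = softmax(h_n W_e^T)

P = forward(np.array([3, 1, 4]))
print(P.shape)  # (3, 50): a next-token distribution at each position
```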
Supervised fine-tuning
Use a labeled dataset C, where each instance is a sequence of input tokens x^1, ..., x^m with a corresponding label y. h_l^m is the output of the last transformer block for the sequence; a new linear layer W_y is added on top, followed by a softmax to obtain the final prediction:

P(y | x^1, ..., x^m) = softmax(h_l^m W_y)
The goal of the fine-tuning phase is to maximize:

L_2(C) = Σ_{(x, y)} log P(y | x^1, ..., x^m)
In practice, a combined objective is found to work better:

L_3(C) = L_2(C) + λ · L_1(C)

Including the auxiliary language-modeling term (1) lets the model keep learning unsupervised next-token prediction during fine-tuning and (2) helps convergence.
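The combined objective is just a weighted sum of the two log-likelihoods; the paper sets λ = 0.5. A small sketch (function name and toy probabilities are illustrative):

```python
import math

def combined_objective(label_probs, next_token_probs, lam=0.5):
    """L_3(C) = L_2(C) + lam * L_1(C).
    label_probs: P(y | x^1..x^m) per labeled example (supervised term L_2).
    next_token_probs: LM probabilities on the same inputs (auxiliary term L_1)."""
    L2 = sum(math.log(p) for p in label_probs)
    L1 = sum(math.log(p) for p in next_token_probs)
    return L2 + lam * L1

L3 = combined_objective([0.9, 0.8], [0.5, 0.5, 0.5], lam=0.5)
print(L3)
```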
The input token design for each task (using special start, delimiter, and extract tokens) is shown in the figure below: