GPT-1 paper reading

Title and abstract

Title: Improving Language Understanding by Generative Pre-Training

Unlabeled text corpora are abundant, but labeled corpora are scarce, and it is hard to get good results by training a separate model for each task from scratch. The paper shows that generative pre-training on an unlabeled corpus, followed by fine-tuning for each specific task, works well.


Introduction

It makes sense to learn text representations from unlabeled text, just as word-embedding pre-training did before. The problems with existing pre-training methods are that the model has to be adapted to each task, the learning schemes are complex, and auxiliary objective functions are required. In short: cumbersome.
This paper explores a semi-supervised approach for language understanding tasks: unsupervised pre-training + supervised fine-tuning.
The goal is to learn a general representation that can be used for a wide range of tasks with only minor changes.
The model is a Transformer. Compared with RNNs, the Transformer can capture long-range dependencies in text and transfers more robustly across different tasks.
The evaluation uses four kinds of tasks: natural language inference, question answering, semantic similarity, and text classification.

Unsupervised pre-training

The training data is an unlabeled corpus of tokens $U = \{u_1, \dots, u_n\}$.

The training objective is to maximize the following log-likelihood:

$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \theta)$

The goal can be understood as: predict the next token given the preceding $k$ tokens, where $k$ is the size of the context window.
For example, given U = "The weather is so good today", the model needs the following prediction ability:
given "The", predict the next token "weather"
given "The weather", predict the next token "is"
given "The weather is", predict the next token "so"
given "The weather is so", predict the next token "good"
given "The weather is so good", predict the next token "today"

$\theta$ in the objective function denotes the model parameters, so the formula asks for the $\theta$ that maximizes $L_1(U)$. Training uses stochastic gradient descent.
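
A minimal sketch of this next-token objective in PyTorch, assuming a language model `lm_model` that applies causal masking internally and returns logits over the vocabulary (the function and argument names are illustrative, not from the paper):

```python
import torch.nn.functional as F

def lm_loss(lm_model, tokens):
    """Average negative log-likelihood of u_i given its preceding tokens.

    Maximizing L1(U) is equivalent to minimizing this loss; the context
    window k is bounded by the model's maximum sequence length.
    tokens: LongTensor of shape (batch, seq_len) holding token ids.
    """
    inputs = tokens[:, :-1]      # contexts u_1 .. u_{n-1}
    targets = tokens[:, 1:]      # next tokens u_2 .. u_n
    logits = lm_model(inputs)    # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```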
The model is a multi-layer Transformer decoder. The overall calculation process is:

$h_0 = U W_e + W_p$
$h_l = \text{transformer\_block}(h_{l-1}), \quad l \in [1, n]$
$P(u) = \text{softmax}(h_n W_e^{T})$

$W_e$ is the token embedding matrix, which maps tokens to vectors.
$W_p$ is the position embedding matrix, which maps positions to vectors.
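
A minimal sketch of this computation as a PyTorch module; `block_cls` stands in for a masked self-attention decoder block, and all sizes are placeholders rather than the paper's actual configuration:

```python
import torch
import torch.nn as nn

class GPTDecoder(nn.Module):
    """Token embedding + position embedding -> n transformer blocks -> logits over the vocabulary."""

    def __init__(self, vocab_size, max_len, d_model, n_layers, block_cls):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # W_e
        self.pos_emb = nn.Embedding(max_len, d_model)     # W_p
        self.blocks = nn.ModuleList(block_cls() for _ in range(n_layers))

    def forward(self, idx):
        # h_0 = U W_e + W_p
        pos = torch.arange(idx.size(1), device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)
        # h_l = transformer_block(h_{l-1}) for l = 1..n
        for block in self.blocks:
            h = block(h)
        # P(u) = softmax(h_n W_e^T); the softmax itself is folded into the loss
        return h @ self.tok_emb.weight.T
```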

Supervised fine-tuning

Use a labeled dataset $C$ in which each instance is a sequence of input tokens $x^1, \dots, x^m$ with a corresponding label $y$. $h_l^m$ is the last-layer transformer_block output for the final token; a new linear layer $W_y$ is added on top of it, followed by a softmax to produce the prediction:

$P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)$

The goal of the fine-tuning phase is to maximize:

$L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$
In practice, a combined objective works better: keeping the language-modeling loss as an auxiliary objective (1) lets the model retain its unsupervised prediction ability and (2) helps convergence:

$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$
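
A minimal sketch of this fine-tuning objective in PyTorch, assuming the final hidden states and language-model logits of a decoder like the one above are available; the helper names and the value of $\lambda$ are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    """Linear layer W_y on the last-layer hidden state of the final token."""

    def __init__(self, d_model, n_classes):
        super().__init__()
        self.w_y = nn.Linear(d_model, n_classes, bias=False)

    def forward(self, h_last):       # h_last: (batch, d_model), i.e. h_l^m
        return self.w_y(h_last)      # class logits; softmax is folded into the loss

def finetune_loss(hidden, lm_logits, tokens, labels, head, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C), written as a loss to minimize."""
    h_last = hidden[:, -1, :]                                   # hidden state of the final token
    l2 = F.cross_entropy(head(h_last), labels)                  # supervised classification objective
    l1 = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                         tokens[:, 1:].reshape(-1))             # auxiliary language-modeling objective
    return l2 + lam * l1
```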
The token design for each task is shown in Figure 1 of the paper: inputs are converted into a single token sequence using learned start, delimiter, and extract tokens, depending on the task (classification, entailment, similarity, multiple choice).
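
A minimal sketch of these input transformations, with placeholder ids for the start, delimiter, and extract tokens (in practice they are added to the vocabulary and their embeddings are learned during fine-tuning):

```python
# Placeholder ids for illustration only.
START, DELIM, EXTRACT = 1, 2, 3

def classification_input(text_ids):
    # Single text: <start> text <extract>
    return [START] + text_ids + [EXTRACT]

def entailment_input(premise_ids, hypothesis_ids):
    # Premise and hypothesis joined by a delimiter: <start> premise <delim> hypothesis <extract>
    return [START] + premise_ids + [DELIM] + hypothesis_ids + [EXTRACT]

def similarity_inputs(a_ids, b_ids):
    # Similarity has no natural ordering, so both orderings are encoded
    # and their final hidden states are combined before the linear layer.
    return entailment_input(a_ids, b_ids), entailment_input(b_ids, a_ids)
```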

Source: blog.csdn.net/artistkeepmonkey/article/details/129458712