GPT series models

GPT series models along the Transformer development route

I am following the Transformer development route for my introductory study:
Transformer - BERT - GPT - DETR - ViT - Swin Transformer / DeiT

The last article was about the Transformer, which I understood right away after watching Li Hongyi's and Li Mu's lectures. Today I am studying GPT-1, GPT-2, and GPT-3.

Main ideas:

The difference between GPT and BERT lies in the choice of objective function: GPT predicts the future (the next token), while BERT predicts the middle (cloze-style masked tokens).
Core idea of GPT: pre-training on unlabeled text data + fine-tuning on labeled data.
Core idea of GPT-2: use only unlabeled data for pre-training, and let the model learn to solve multi-task problems on its own.
Core idea of GPT-3: no gradient updates or fine-tuning; interact with the model purely through text with a handful of examples, use a model with a far larger number of parameters, and let scale work wonders!


GPT

To briefly summarize:

Paper: Improving Language Understanding by Generative Pre-Training

Problem background:
There is a large amount of unlabeled text, but labeled data for specific tasks is scarce, which makes it very challenging to train accurate task-specific models.

Solution:
Train a language model on unlabeled data, then fine-tune it on downstream tasks using labeled data.

Core idea: pre-training + fine-tuning

Main steps: Training proceeds in two stages. The first stage pre-trains a language model on unlabeled data (unsupervised); the second stage solves downstream tasks by fine-tuning on labeled data (supervised), which is somewhat similar to transfer learning.

The main structure of the model:
1. Unsupervised pre-training:
        Given an unlabeled corpus U with tokens u1...un, GPT trains a language model to maximize a log-likelihood objective. The language model is a stack of 12 Transformer decoder blocks. The objective (the first formula) is the probability of the i-th token under the model: a sliding window of the k preceding tokens (k is the context window size) is used to predict the probability of the next token, and the token with the highest probability is the most likely next word. The other formulas in the paper are simply the mathematical expression of the Transformer decoder, which I will not expand on here (see the previous Transformer article).
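For reference, the pre-training objective from the paper can be written as follows, where Θ are the model parameters and k is the context window size:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$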

2. Fine-tuning:
        After pre-training, the author transfers the pre-trained parameters directly to downstream tasks. The downstream dataset is denoted C, where each example consists of a sequence of tokens x1...xm with a label y. These sequences are fed into the pre-trained Transformer decoder, and the output is passed through a softmax classifier to obtain the final prediction.
The author found that during fine-tuning it helps to combine two loss functions: the classification loss L2 and, as an auxiliary objective, the language-modeling loss L1 from pre-training, which improves generalization and speeds up convergence.
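Restated from the paper, the classification loss and the combined fine-tuning objective (with weighting coefficient λ) are:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$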

3. Specific tasks:
When constructing the input for downstream tasks, start tokens, end tokens, and delimiter tokens are added, and the tokens are arranged differently depending on the task (a rough sketch follows).
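As a rough illustration of these input transformations (the token strings below are hypothetical placeholders; the paper actually learns dedicated embeddings for the start, delimiter, and extract/end tokens):

```python
# Hypothetical token strings for illustration only.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text: str) -> str:
    # Single-text tasks: start + text + extract
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise: str, hypothesis: str) -> str:
    # Sentence-pair tasks: the two sentences are joined by a delimiter
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(text_a: str, text_b: str) -> list[str]:
    # Similarity has no natural order, so both orderings are scored
    return [
        f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
        f"{START} {text_b} {DELIM} {text_a} {EXTRACT}",
    ]

def multiple_choice_inputs(context: str, answers: list[str]) -> list[str]:
    # One sequence per candidate answer; the model scores each one
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]
```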

Model structure:
The model is a 12-layer Transformer decoder in which only masked self-attention is retained, so that when predicting the k-th token the model can only see the preceding k−1 tokens.
In addition, the author changed the position encoding: learned position embeddings are used instead of the original Transformer's sinusoidal (trigonometric) position encoding.
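A minimal PyTorch sketch of these two points (causal masking plus learned position embeddings); the class name, single-head setup, and dimensions are illustrative, not the original implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalSelfAttention(nn.Module):
    """Single-head masked self-attention: token i may only attend to tokens <= i."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions, not sinusoidal
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        x = x + self.pos_emb(positions)                   # add learned position embeddings
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (self.d_model ** 0.5)
        # Mask out future positions so predicting the k-th token only sees tokens 1..k-1
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

attn = TinyCausalSelfAttention(d_model=64)
out = attn(torch.randn(2, 10, 64))   # -> shape (2, 10, 64)
```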

GPT-2

To briefly summarize:

Paper: Language Models are Unsupervised Multitask Learners

Problem background:
Although pre-training + fine-tuning solves the problem of scarce labeled text, the model still has to be fine-tuned again for each specific task; its generalization is relatively poor and it cannot be applied broadly.

Solution:
Unlike GPT-1, GPT-2 completely abandons the fine-tuning stage and relies only on large-scale, multi-domain pre-training, letting the language model learn to solve multi-task problems on its own in a zero-shot learning setting.

Core idea: for downstream tasks, a setting called zero-shot is used. Zero-shot means that no annotated data for the downstream task is needed, so there is no need to retrain the pre-trained model. The advantage is that once a single model is trained, it can be used anywhere, achieving transfer from known domains to unknown domains.

Main structure of the model:

In terms of input:
To achieve zero-shot, the inputs for downstream tasks can no longer be constructed with special start, delimiter, and end tokens as in GPT, because the model never sees those tokens during pre-training; instead, the input should look like the natural language text the pre-trained model has already seen.
The input can then be structured as, for example, translation prompt + English text + French text, or Q&A prompt + document + question + answer, where the prompt at the front can be regarded as a special separator.
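A toy sketch of how such zero-shot inputs might be assembled as plain natural-language prompts (the exact prompt wording below is hypothetical, not taken from the paper):

```python
def make_translation_prompt(english_sentence: str) -> str:
    # "translation prompt + English text"; the model is expected to continue with French
    return f"Translate English to French:\n{english_sentence} ="

def make_qa_prompt(document: str, question: str) -> str:
    # "Q&A prompt + document + question"; the model is expected to continue with the answer
    return f"{document}\nQuestion: {question}\nAnswer:"

print(make_translation_prompt("The cat sat on the mat."))
print(make_qa_prompt("GPT-2 is a language model trained on WebText.",
                     "What was GPT-2 trained on?"))
```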

Model demo:
GPT-2's text generation capability is very powerful. If you are interested, you can try it out with the
AllenAI GPT-2 Explorer (https://gpt2.apps.allenai.org/?text=Joel%20is)


GPT-3

To briefly summarize:

Paper: Language Models are Few-Shot Learners

Problem background:
1. Each subtask requires fine-tuning on a relevant dataset, so those datasets need to be labeled, otherwise good results are hard to achieve, and labeling data is very expensive.
2. The author argues that although fine-tuning is very effective, it still requires a lot of labeled data and can cause the model to pick up spurious features, leading to overfitting and worse generalization.

Solution:
Unlike GPT-1 and GPT-2, GPT-3 performs no gradient updates or fine-tuning when applied to downstream tasks. It interacts purely through text, using only a task description and a handful of examples, and relies on a model with a far larger number of parameters to work wonders.

Main architecture of the model:
The settings used to evaluate GPT-3 are clearly defined: zero-shot, one-shot, and few-shot (described below).

GPT-3 proposes an in-context learning approach: given a task description and a few reference examples, the model can understand the current context from the description and the examples, and it can perform well even when the data distribution of the downstream task differs from that of pre-training. Note that GPT-3 does not use the examples for fine-tuning; the examples serve only as part of the input to guide the model toward completing the task.

The model is largely the same as GPT-2. The difference is that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the Transformer, similar to the Sparse Transformer.
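As a rough illustration of the "locally banded" idea (a simplification, not the actual GPT-3 or Sparse Transformer implementation; the function name is hypothetical), a local attention mask only lets each token attend to a fixed-size window of preceding tokens:

```python
import numpy as np

def local_banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """True where attention is allowed: token i attends only to tokens j
    with i - window < j <= i (causal and restricted to a local band)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each row i has at most `window` ones, ending at column i.
print(local_banded_causal_mask(6, window=3).astype(int))
```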

Few-shot
At inference time, the model is given an instruction and several examples of the specific task, but the weights are not updated.

One-shot
At inference time, the model is given a description and a single example of the specific task; the weights are not updated.

Zero-shot
At inference time, the model is given only an instruction for the specific task, with no examples and no weight updates.
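To make the three settings concrete, here is a hypothetical sketch of how the prompts could be assembled for an English-to-French translation task (the helper name and example pairs are illustrative only; no weights are updated in any setting):

```python
def build_prompt(task_description: str, examples, query: str) -> str:
    """Zero-shot: examples = [];  one-shot: one example;  few-shot: several examples.
    The examples only appear in the input text; no weights are updated."""
    lines = [task_description]
    for en, fr in examples:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")          # the model is asked to complete this line
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_prompt("Translate English to French:", [], "peppermint"))          # zero-shot
print(build_prompt("Translate English to French:", demos[:1], "peppermint"))   # one-shot
print(build_prompt("Translate English to French:", demos, "peppermint"))       # few-shot
```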

Finally, as mentioned above, a limitation of the GPT series is that the models can only look forward (left to right) and cannot learn from context in both directions like BERT.

Related:

1. Detailed explanation of the GPT series models
2. Notes on Li Mu's explanation of the GPT series
3. Mu Shen's study notes: GPT, GPT-2, GPT-3

Original post: blog.csdn.net/qq_42740834/article/details/125189405