GPT-1, GPT-2, GPT-3, and InstructGPT paper study notes

GPT-1
paper: "Improving Language Understanding by Generative Pre-Training"
GPT-1 network structure
Unsupervised pre-training uses a 12-layer transformer decoder; each layer has dimension 768 with 12 attention heads.
Input tokens pass through a token embedding matrix, are processed by the transformer decoder, and finally go through a linear layer and a softmax layer to obtain the predicted distribution over the next token.

The position-wise feed-forward layers use 3072-dimensional inner states (position embeddings are learned, at the same 768 dimensions as the token embeddings)
Adam optimizer, maximum learning rate 2.5e-4
Token sequence length is 512, trained for 100 epochs
The activation function is GELU
Regularization: residual connections plus dropout, with a drop ratio of 0.1
Supervised fine-tuning: dropout ratio 0.1, learning rate 6.25e-5, batch size 32, 3 training epochs
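To make the structure concrete, here is a minimal PyTorch sketch of a GPT-1-style decoder under these hyperparameters (module and variable names are illustrative, not from the original codebase):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One transformer decoder block: masked self-attention + feed-forward,
    each followed by a residual connection and layer norm (post-LN, as in GPT-1)."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model), nn.Dropout(dropout),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        a, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + a)           # residual + layer norm
        x = self.ln2(x + self.ff(x))  # residual + layer norm
        return x

class GPT1(nn.Module):
    def __init__(self, vocab_size=40000, max_len=512, d_model=768, n_layers=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embedding matrix
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned position embeddings
        self.drop = nn.Dropout(0.1)
        self.blocks = nn.ModuleList(DecoderBlock(d_model) for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # final linear layer

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        # boolean mask: True above the diagonal blocks attention to future tokens
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)
        for block in self.blocks:
            x = block(x, mask)
        return self.head(x)  # logits; softmax gives the next-token distribution
```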

Unsupervised pre-training
Given the token sequences of an unlabeled corpus, the goal of the language model is to maximize the following likelihood, i.e., to predict the next token from the previous tokens:

L1(U) = Σ_i log P(u_i | u_{i-k}, …, u_{i-1}; Θ), where k is the context window size and Θ are the model parameters
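In code, this objective is simply the cross-entropy between the model's predictions and the actual next tokens, summed over positions. A minimal sketch (reusing the illustrative GPT1 module above):

```python
import torch.nn.functional as F

def lm_loss(model, tokens):
    """Next-token prediction: maximize sum_i log P(u_i | u_{<i})."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift targets by one position
    logits = model(inputs)                           # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```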


Supervised fine-tuning: after unsupervised pre-training, the learned parameter values are applied directly to the supervised task. For a labeled dataset C, the input tokens are fed into the pre-trained model, and a final linear layer + softmax produces the prediction:

P(y | x^1, …, x^m) = softmax(h_l^m · W_y), maximizing L2(C) = Σ_{(x,y)} log P(y | x^1, …, x^m). The paper also keeps language modeling as an auxiliary objective during fine-tuning: L3(C) = L2(C) + λ · L1(C), with λ = 0.5.

For text classification tasks, the pre-trained model can be fine-tuned directly. Since the model is trained on contiguous text sequences, tasks with structured input require some modifications to the input:
a start symbol, a delimiter symbol between sentence pairs, and an end symbol are used to splice the sentences into a single sequence, as sketched below.
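A sketch of these input transformations (the token strings here are placeholders; the paper uses randomly initialized special tokens learned during fine-tuning):

```python
START, DELIM, END = "<s>", "<$>", "<e>"

def format_classification(text):
    return f"{START} {text} {END}"

def format_entailment(premise, hypothesis):
    # Two sentences joined by a delimiter form one contiguous sequence.
    return f"{START} {premise} {DELIM} {hypothesis} {END}"

def format_similarity(a, b):
    # Similarity has no natural sentence order, so both orderings are
    # scored and their representations combined.
    return [f"{START} {a} {DELIM} {b} {END}",
            f"{START} {b} {DELIM} {a} {END}"]
```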

GPT-2
Paper: "Language Models are Unsupervised Multitask Learners"

Designing prompts, prompt generation.
Zero-shot: downstream tasks require no training of the model.
The authors argue that as long as the language model's capacity is large enough and the training data are rich enough, the model alone can also complete other downstream supervised tasks.

Network structure
Not much different from GPT-1; the input dimension and the number of decoder layers are scaled up. For example, the input dimension grows from the original 768 to 1024 and the stack to 24 layers in the medium model (the largest GPT-2 has 48 layers at dimension 1600, about 1.5B parameters).

GPT-3
Paper: "Language Models are Few-Shot Learners"

Model structure:
GPT-3 follows the structure of GPT-2, and the training process is similar. The main difference is scaling up GPT-2's model size, dataset size, and dataset diversity. Some improvements were made in initialization, normalization, and tokenization, and optimizations from the Sparse Transformer were borrowed. The paper also mentions work on model parallelism. To verify the impact of model capacity, the authors trained 8 models of different sizes, with parameters ranging from 125 million to 175 billion, spanning three orders of magnitude; the largest is called GPT-3. The hyperparameters of each model are listed in the paper's Table 2.1. Notably, larger models generally use larger batch sizes but need smaller learning rates.
An autoregressive model with dense (non-sparse) learnable parameters: 96 stacked decoder layers, input dimension 12288, 96 attention heads (head dimension 128).
As the number of layers increases, the input dimension increases too. Each training batch is 3.2 million tokens; with larger batches, computing efficiency is better and the communication-to-computation ratio is smaller, so distributed training works better.
For small models the data are easier to fit; as the model gets bigger and bigger it becomes harder to overfit. A neural network designed at this scale does not overfit as easily as a simple MLP.
In short, brute-force scale works miracles.
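As a sanity check on the 175B figure, a standard back-of-the-envelope estimate counts roughly 12·d_model² parameters per transformer layer (about 4d² for the attention projections plus 8d² for the feed-forward block):

```python
d_model, n_layers, vocab = 12288, 96, 50257  # GPT-3 175B; 50257 is the GPT-2/3 BPE vocab size

per_layer = 12 * d_model ** 2        # ~4d^2 attention + ~8d^2 feed-forward
transformer = n_layers * per_layer   # ~174.0B
embeddings = vocab * d_model         # ~0.6B
print(f"{(transformer + embeddings) / 1e9:.1f}B")  # ~174.6B, i.e. ≈ 175B
```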

Few-shot.
In-context learning: few-shot learning conditioned on the context. In-context learning can pick up numerical addition, text error correction, and translation; examples sharing the same format are correlated, and the model learns the pattern from the context.
Model evaluation settings (a prompt-assembly sketch follows this list):
- Few-shot: before the task input (e.g., the word to translate), give the task description plus a few examples (about three), letting the model extract more information.
- One-shot: after the task description and before the task input, add a single example, hoping the model extracts the useful information from that one sentence to do the translation.
- Zero-shot: build only a prompt such as "translate the following sentence", then append the word to translate.
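A minimal sketch of assembling such evaluation prompts (the prompt wording is illustrative; the translation pairs echo the paper's English-to-French figure, and no weights are updated — the "learning" happens entirely in the context):

```python
def build_prompt(task_description, examples, query):
    """Zero-shot: examples = []; one-shot: one example; few-shot: several."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model continues from here
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"),
         ("peppermint", "menthe poivrée"),
         ("plush giraffe", "girafe peluche")]
print(build_prompt("Translate English to French:", demos, "cheese"))      # few-shot
print(build_prompt("Translate English to French:", demos[:1], "cheese"))  # one-shot
print(build_prompt("Translate English to French:", [], "cheese"))         # zero-shot
```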

Training data set:
After downloading Common Crawl, a quality classifier is trained (high-quality reference corpora serve as positive examples); documents the classifier marks positive are kept and the negatives removed. The LSH (locality-sensitive hashing) method is used for fuzzy deduplication: an article is treated as a set of tokens and compared for similarity against the rest of the large document collection. When sampling, Common Crawl is weighted to make up only 60% of each training batch, so the higher-quality corpora are over-represented.
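A toy sketch of the MinHash idea underlying LSH deduplication (simplified to comparing one pair of signatures; a real pipeline buckets signatures into LSH bands instead of comparing all pairs):

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """Summarize a document (as a set of tokens) by its minimum hash values."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in set(tokens))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minhashes estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox leaps over the lazy dog".split()
print(estimated_jaccard(minhash_signature(doc1), minhash_signature(doc2)))  # ≈ 7/9 ≈ 0.78
```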

Fine-tuning:
For each labeled sample, a loss can be computed and the weights updated. The initial fine-tuning learning rate should be smaller, and training starts from the pre-trained model.

Summary: over the two-plus years from GPT-1 to GPT-3, the underlying architecture has stayed the transformer decoder, with no great innovation or improvement in the network structure itself; the gains rely mainly on continually increasing model capacity and enormous computing power. GPT-2 mainly pushes zero-shot; GPT-3 mainly pushes few-shot learning.

Although GPT-3 performs much better than GPT-2, with obvious improvements as quality and parameter count grow, it still has some limitations:
1. GPT-3 still has significant weaknesses in text synthesis and several NLP tasks. In text synthesis, although the overall quality is very high, GPT-3 loses coherence over longer passages, contradicts itself, and occasionally generates illogical sentences and paragraphs. The GPT series are all based on the autoregressive (one-way) language model structure, whose sampling and likelihood computation are simple; the GPT-3 experiments include no bidirectional structures or other training objectives (such as denoising). This structure may limit GPT-3: the authors speculate that a large bidirectional model would be stronger than GPT-3 at fine-tuning, and propose that training a bidirectional model at GPT-3's scale, or trying to make few-shot/zero-shot learning work with bidirectional models, is a very promising future direction.
Writing long texts also remains a problem; one workaround is to write a heading for each paragraph.

2. Like other LM-style models, GPT-3 is constrained by its pre-training objective. In short, a large language model lacks interaction with the real world and therefore lacks grounded context about the world. Self-supervised pre-training may be reaching a limit, so different methods are needed for further gains. The authors propose learning objective functions from humans, fine-tuning through reinforcement learning, and adding other modalities; the first two became the InstructGPT line of work.
3. A common limitation of language models is low sample efficiency in the pre-training phase. Although at test time GPT-3 moves closer to how humans operate (one-shot or zero-shot), in the pre-training phase it still consumes far more text than a human ever sees.
4. GPT-3's scale is enormous: regardless of objective function or algorithm, training is expensive and inference is inconvenient. Large models like GPT-3 contain many skills, most of which are not needed for any specific task. The authors propose that one future direction for these problems may be distilling large models; industry has explored distillation extensively, but not at this scale.
5. GPT-3 shares some common limitations with most deep learning systems: it is uninterpretable, and it is hard to guarantee that its generated text avoids sensitive content such as religious, racial, and gender bias.

InstructGPT
Paper: "Training language models to follow instructions with human feedback"

Process: 1. Collect a labeled dataset and use it to fine-tune GPT-3 with supervised learning. 2. Have humans rank multiple outputs of the model from best to worst, and train a reward model on the ranked results. 3. Use the reward model to further fine-tune the supervised model with reinforcement learning.

Purpose: the main goal of this paper is to explore how to train language models (LMs) to follow user instructions, perform various tasks, and provide useful and reliable output aligned with user intent.

Implementation method: reinforcement learning from human feedback (RLHF)
1. SFT (supervised fine-tuning): collect question-and-answer data, then use supervised learning to fine-tune the pre-trained GPT-3 model on this dataset.
2. Reward model (RM) training: given some prompts, let the SFT model generate answers by sampling from its next-token distribution (sampling rather than beam search is typically used), usually 4 answers per prompt. Humans then rank these answers from best to worst. Given the model input and outputs, the RM is trained so that the scores it assigns to the answers respect the human ranking.
3. The model after SFT performs reinforcement learning (RL) against the RM: each generated answer is scored by the reward model, and the policy's parameters are optimized according to the score, so that GPT-3's final outputs align with human intent. The objective is sketched below.
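The RL objective from the InstructGPT paper (the PPO-ptx variant; the β term is a KL penalty keeping the policy near the SFT model, and the γ term mixes in pre-training gradients to avoid regressions on public NLP tasks):

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \left[ r_\theta(x,y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \right]
  + \gamma \, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
    \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```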
The final model is InstructGPT.

The loss function of the RM is a pairwise ranking loss over the human orderings, as sketched below.
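A minimal PyTorch sketch of this pairwise ranking loss, assuming a hypothetical `reward_model` that maps a (prompt, answer) pair of token ids to a scalar score (the paper averages over all pairs from each group of K ranked answers):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, prompt, ranked_answers):
    """ranked_answers is ordered best -> worst by human labelers.

    For every pair (winner w, loser l), maximize log sigmoid(r(x, y_w) - r(x, y_l)).
    """
    scores = [reward_model(prompt, answer) for answer in ranked_answers]
    losses = []
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            losses.append(-F.logsigmoid(scores[i] - scores[j]))
    return torch.stack(losses).mean()  # mean over the K*(K-1)/2 pairs
```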
