Large Language Models, Part 3: The ChatGPT Training Process

The development history of large language models was introduced in the earlier article on GPT history, which also briefly covered the training process. This article elaborates on the training details and the related algorithms.

After 2020, Internet companies and AI startups around the world developed many very large AI models (tens or even hundreds of billions of parameters). Typical representatives are GPT-3 and LLaMA in NLP, along with DALL·E 2, Stable Diffusion, and V-MoE. Most existing generative AI tools are built on pre-trained models developed by major vendors and are iterated quickly by fine-tuning them on small amounts of scenario-specific data.

DALL·E 2: DALL·E 2 is an image generation model released by OpenAI in 2022. Unlike the original DALL·E (2021), which was built on a GPT-3-style transformer, DALL·E 2 is diffusion-based. It can generate high-quality images and can generate images from text descriptions.

Stable Diffusion: Stable Diffusion is an image generation model released in 2022 by Stability AI together with the CompVis group (LMU Munich) and Runway. It is based on diffusion processes and stochastic differential equations, and uses attention mechanisms to condition generation. Stable Diffusion can generate high-quality images and allows control over the style of the generated images.

The training process of ChatGPT is divided into three stages (four steps if RLHF is split into reward-model training and RL fine-tuning):
[Figure: ChatGPT training pipeline]
1. Pre-training (pretrain): a model trained with self-supervised learning on a large amount of unlabeled data. Because the data crawled from the Internet has not been cleaned, the misinformation, conspiracy theories, prejudices, and common-sense errors found online are all learned by the model, so the model at this stage is usually called a pre-trained (base) model.
2. Supervised fine-tuning (SFT): the pre-trained model is then fine-tuned on higher-quality data, such as StackOverflow, Quora, Wikipedia, Baidu Encyclopedia, and human-written examples, so that the model outputs as little harmful or useless content as possible.
3. RLHF: reinforcement learning from human feedback further adjusts the model to make it better suited to specific application needs.

Of the three stages above, pre-training consumes most of the compute and data. According to OpenAI (https://openai.com/research/instruction-following), the pre-training stage of InstructGPT accounts for 98% of the total compute and data. SFT and RLHF can be thought of as unlocking capabilities that the pre-trained model already has but that are hard for users to reach through prompting alone. In essence, the SFT of the second stage and the RLHF of the third stage do not give the model new abilities; they suppress the undesirable ones and unlock the good and necessary ones.

Stage One: Pre-training

The product of the pre-training stage is a large language model, such as GPT-x (OpenAI), Gopher (DeepMind), or LLaMA (Meta).
A language model encodes statistical information about language: in a given context, different tokens/words occur with different probabilities. The task of the language model is to predict the next token, so it can be regarded as a fill-in-the-blank task: given a prompt, it produces a completion.

Prompt (user input): The weather is great today, I plan to go out shopping,
Completion (language model): and buy some clothes and bags.

This seems like a very simple task, but it is in fact a very powerful capability: it can accomplish translation, summarization, writing code, doing arithmetic, writing copy, and so on. For example, given the prompt "'How are you' in Chinese is ...", the language model's completion outputs the Chinese translation, which realizes a translation function.
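To make the "fill in the next token" view concrete, here is a minimal greedy-decoding sketch. It assumes a Hugging Face-style causal LM and tokenizer are already loaded; the `model` and `tokenizer` names and interface are assumptions for illustration, not something defined in this article.

```python
import torch

@torch.no_grad()
def complete(model, tokenizer, prompt, max_new_tokens=20):
    """Greedy completion: repeatedly append the most likely next token.

    Assumes a causal LM whose forward pass returns logits of shape
    (batch, seq_len, vocab_size), as Hugging Face causal LMs do.
    """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits           # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)  # most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

Real systems usually sample with temperature or top-p instead of always taking the argmax, which gives more varied completions.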

Mathematical representation of pre-training

Training data: unlabeled, low-quality data
Data scale: at the time of writing, the training data used by large language models is on the order of trillions of tokens. At the current growth rate of pre-training datasets, the publicly crawlable data on the Internet will most likely be exhausted by the leading companies within a few years; after that, the advantage along the data dimension will come from private data.

  • The GPT-3 (OpenAI) dataset is about 0.5 trillion tokens; GPT-4's is not public.
  • The LLaMA-1 (Meta) dataset is about 1.4 trillion tokens, and LLaMA-2 uses 2 trillion tokens for pre-training.

Formula description:

  • $LLM_{\phi}$: the large language model to be trained, where $\phi$ is the parameter set; the goal of training is to find the parameter set $\phi$ with the smallest cross-entropy loss.
  • $[T_1, T_2, \ldots, T_V]$: the vocabulary, i.e. the set of all distinct tokens in the training data. The vocabulary size of GPT-2 (OpenAI) is 50257; the vocabulary size of both LLaMA-1 and LLaMA-2 is 32000.
  • $V$: the size of the vocabulary (32000 for both LLaMA-1 and LLaMA-2).
  • $f(x)$: maps a token to its position in the vocabulary; if token $x$ is at position $T_k$ in the vocabulary, then $f(x) = k$.
  • A sentence of length $n$, $(x_1, x_2, \ldots, x_n)$, split into tokens, yields $n$ training samples:
    • input: $x = (x_1, x_2, \ldots, x_{i-1})$
    • predicted token: $x_i$
  • For each training sample $(x, x_i)$:
    • $k = f(x_i)$
    • model output: $LLM(x) = [y_1, y_2, \ldots, y_V]$, where $\sum_j y_j = 1$
    • loss: $CE(x, x_i; \phi) = -\log y_k$
  • Objective function: find the parameter set $\phi$ that minimizes the loss over all training samples: $CE(\phi) = -E_x \log y_k$ (a minimal code sketch is given below).
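To make this objective concrete, here is a minimal PyTorch sketch (not from the original article) of the next-token cross-entropy loss over a batch of token ids. The `model` call returning logits of shape (batch, seq_len, V) is an assumed interface; the shift-by-one implements "input is $x_1 \ldots x_{i-1}$, target is $x_i$" for every position at once.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Next-token cross-entropy: CE(phi) = -E_x log y_k.

    token_ids: LongTensor of shape (batch, seq_len).
    model(token_ids) is assumed to return logits of shape (batch, seq_len, V).
    """
    logits = model(token_ids)            # (batch, seq_len, V)
    preds = logits[:, :-1, :]            # predictions for positions 1..n-1
    targets = token_ids[:, 1:]           # the actual next tokens
    return F.cross_entropy(
        preds.reshape(-1, preds.size(-1)),   # (batch*(seq_len-1), V)
        targets.reshape(-1),                 # (batch*(seq_len-1),)
    )
```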

Stage Two: Supervised fine-tuning (SFT)

Why is SFT needed?

Because the task of the pre-trained model is to predict the next token, if we give it the input "how to make salt and pepper prawns", it may output any of the following continuations:
1. ? What ingredients are needed and what are the steps to make it?
2. Not greasy, crispy and delicious?
3. Detailed ingredients and production steps

In fact, the third continuation (detailed ingredients and production steps) is what we actually want. The purpose of SFT is to increase the probability of the desired kind of output.

How can we increase the probability of good outputs for the scenarios we care about (such as question answering, summarization, translation, etc.)? Since the model imitates its training data, the SFT stage provides, on top of the pre-trained model, a batch of examples of the desired type for it to imitate again. The SFT examples are given as (prompt, response) pairs; OpenAI calls this style of supervised fine-tuning "behavior cloning". The distribution of InstructGPT's SFT dataset is shown below:
[Figure: distribution of InstructGPT's SFT prompt use cases; the left column shows the proportion of each use case, and the right column shows example prompts for three of them.]

Example dataset

Example datasets can be written by people (as in OpenAI's practice for InstructGPT and ChatGPT), which is different from traditional data annotation: the example writers are well-educated labelers who have passed a screening test. About 90% of InstructGPT's labelers hold at least a college degree, and more than one-third hold a master's degree. Their education levels are shown below:
[Figure: education level of InstructGPT labelers]
The roughly 40 labelers hired by OpenAI created 13,000 (prompt, response) examples for InstructGPT; a few of them are shown below:

Prompt: Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.
Response: Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

Prompt: ELI5: What’s the cause of the “anxiety lump” in our chest during stressful or disheartening experiences?
Response: The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.

Prompt: Create a shopping list from this recipe: Trim the ends off zucchini. Cut zucchini in half lengthwise; scoop out pulp, leaving 1/2-in. shells. Finely chop pulp. In a skillet, cook beef, zucchini pulp, onion, mushrooms and peppers over medium heat until meat is no longer pink; drain. Remove from the heat. Add 1/2 cup cheese, ketchup, salt and pepper; mix well. Spoon into the zucchini shells. Place in a greased 13x9-in. baking dish. Sprinkle with remaining cheese.
Response: Zucchini, beef, onion, mushroom, peppers, cheese, ketchup, salt, pepper

OpenAI's approach yields a high-quality example dataset, but the cost in money and time is also high. An alternative is to obtain examples from the Internet through heuristic filtering of conversations; Gopher (DeepMind) takes this approach.
LLaMA-2 uses 27,540 (prompt, response) examples, which are high-quality examples that Meta screened out of millions of annotations in third-party datasets.

Mathematical representation of SFT

  • Dataset: annotated, high-quality (prompt, response) pairs.
  • Data scale: 10,000 to 100,000 (prompt, response) examples
    • InstructGPT: ~14,500 examples, of which 13,000 are from labelers and 1,500 are from users
    • LLaMA-2: 27,540 examples
  • Model input and output:
    • Input: the prompt
    • Output: the corresponding response
  • The training loss function is the cross-entropy criterion (a minimal sketch follows below).
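Here is a minimal sketch of an SFT training loss, assuming the common practice of concatenating prompt and response into one token sequence and computing cross-entropy only on the response tokens; the masking detail and the `model` interface are assumptions for illustration, not something specified above.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Cross-entropy on the response tokens of one (prompt, response) pair.

    prompt_ids:   LongTensor of shape (1, prompt_len)
    response_ids: LongTensor of shape (1, resp_len)
    model(ids) is assumed to return logits of shape (1, seq_len, V).
    """
    ids = torch.cat([prompt_ids, response_ids], dim=1)   # (1, seq_len)
    logits = model(ids)[:, :-1, :]                       # predict token t+1 from prefix
    targets = ids[:, 1:].clone()
    # Ignore loss where the target is still a prompt token: only the response is imitated.
    targets[:, : prompt_ids.size(1) - 1] = -100          # -100 is the ignore_index
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```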

Stage Three: RLHF

Empirically, RLHF significantly improves performance compared to SFT alone. Anthropic's explanation for this is: "We expect human feedback (HF) to have the greatest comparative advantage over other techniques when people have complex intuitions that are easy to elicit but difficult to formalize and automate." (Bai et al., 2022)
Dialogue is flexible. Given a prompt, there are many plausible responses, some better than others. The demonstration data tells the model which responses are plausible for a given context, but it does not tell the model how good or bad a response is.
The idea is: what if we had a scoring function that, given a prompt and a response, outputs a score for how good that response is? We could then use this scoring function to further train our LLM to produce high-scoring responses. This is exactly what RLHF does. RLHF consists of two parts:
1. Training the reward model as the scoring function.
2. Optimizing the LLM to generate responses for which the reward model will give high scores.

RM model

The task of the RM is to score a (prompt, response) pair. Training a model to output a score for a given input is a very common ML task; it can easily be framed as classification or regression. The challenge in training reward models lies in obtaining trustworthy data: it is quite difficult to get different people to give consistent scores to the same responses, whereas it is much easier to ask someone to compare two responses and decide which one is better.

Therefore, the labeling process produces (prompt, winning_response, losing_response) triples, called comparison data. The next question is: given only these comparison pairs, how do we train the model to output a concrete score?

For InstructGPT, the goal is to maximize the score difference between (prompt, winning_response) and (prompt, losing_response) (see the mathematical representation section for details). In this way, the reward model is trained to reward winning_response-style answers while suppressing losing_response-style answers.

The RM can be initialized in different ways, but initializing it from the SFT model seems to give the best results. The intuition is that the RM should be at least as powerful as the LLM it scores in order to evaluate the LLM's responses well.

Mathematical representation of the RM

  • Dataset: high quality (prompt, winning_response, losing_response)
  • Data scale: 100,000 to 1 million samples,
    • InstructGPT has 50,000 prompts, and each prompt has 4 to 9 responses, which form 6 to 36 (winning_response, losing_response) pairs. This puts its data size at roughly 300,000 to 1.8 million examples.
    • In addition to third-party data, LLaMA-2-chat also uses self-built comparison data. Unlike InstructGPT, its responses can be generated by different models as well as written by humans.
      [Figure: LLaMA-2 RM training dataset distribution]

Formula description:

  • $r_{\theta}$: the RM to be trained, with parameter set $\theta$; the goal of training is to find the parameter set $\theta$ that minimizes the overall loss on the RM dataset.
  • Training data format:
    • $x$: the input (prompt)
    • $y_w$: the better model output, where $w$ stands for winning response
    • $y_l$: the worse model output, where $l$ stands for losing response
  • For each training sample $(x, y_w, y_l)$:
    • $s_w = r_{\theta}(x, y_w)$: the RM's score for the winning response
    • $s_l = r_{\theta}(x, y_l)$: the RM's score for the losing response
    • loss: $-\log(\sigma(s_w - s_l))$
  • Objective function: find the parameter set $\theta$ that minimizes the loss over all training samples, i.e. minimizes $-E_x \log(\sigma(s_w - s_l))$.

To see how this loss function works, let $d = s_w - s_l$ and $f(d) = -\log(\sigma(d))$. The graph of $f(d)$ is shown below: for negative $d$, the loss is large, so this loss pushes the RM to score the winning response higher than the losing one.
[Figure: plot of $f(d) = -\log(\sigma(d))$]
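The same loss in code: a minimal sketch where `reward_model` is assumed to map a (prompt, response) pair to a scalar score; this interface is illustrative, not from the original article.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt, winning_response, losing_response):
    """-log(sigmoid(s_w - s_l)): pushes the winning score above the losing score.

    reward_model(prompt, response) is assumed to return a scalar tensor score.
    """
    s_w = reward_model(prompt, winning_response)
    s_l = reward_model(prompt, losing_response)
    # softplus(-d) = log(1 + exp(-d)) = -log(sigmoid(d)), but numerically stable.
    return F.softplus(-(s_w - s_l))
```

`F.softplus(-d)` grows roughly linearly for negative `d` and approaches zero for large positive `d`, matching the curve described above.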

Fine-tuning using the reward model

At this stage, we will further train the SFT model to generate output responses that maximize the RM's score. Today, most people use Proximal Policy Optimization (PPO), a reinforcement learning algorithm released by OpenAI in 2017.

Fine-tuning is performed based on the SFT model, and the steps given in the InstructGPT paper are as follows:
[Figure: RLHF fine-tuning steps from the InstructGPT paper]
During this process, prompts are randomly sampled from a distribution; for example, OpenAI might sample randomly among user prompts. Each prompt is fed into the LLM to obtain a response, and the RM scores that response.

The training process is as follows. The Initial Language Model in the figure below is the SFT model, while the Tuned Language Model is the RL policy model; the relevant loss and gradient updates are shown in the figure:
[Figure: RLHF training loop with the Initial Language Model (SFT) and the Tuned Language Model (RL policy)]

OpenAI also found it necessary to add a constraint: the model produced at this stage should not deviate too far from the model produced by the SFT stage (mathematically, the KL-divergence term in the objective function below) or from the original pre-trained model. The intuition is that for any given prompt there are many possible responses, the vast majority of which the RM has never seen. For many of these unseen (prompt, response) pairs, the RM may incorrectly give an extremely high or low score. Without the constraint, the model could be biased toward responses that happen to score extremely high, even though they may not be good responses.

Mathematical representation of reward model fine-tuning

  • This step employs a reinforcement learning approach
    • Action space: the tokens in the LLM's vocabulary; taking an action corresponds to choosing the next token to output.
    • Observation space: the distribution over all possible prompts.
    • Policy: the probability distribution over actions (i.e. tokens) to take given an observation (the prompt). The LLM constitutes the policy because it determines how likely each next token is to be generated.
    • Reward function: RM model above.
  • Training dataset: randomly selected prompts
  • Data scale: 10,000-100,000 prompts
    • InstructGPT: 40,000 prompts
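To connect this RL framing with code, the sketch below samples a response token by token from the policy (the LLM), treating each sampled token as one action and accumulating its log-probability, which PPO-style updates later need. The `policy` interface (ids in, logits out) is an assumption for illustration.

```python
import torch
from torch.distributions import Categorical

@torch.no_grad()
def rollout(policy, prompt_ids, max_new_tokens=64):
    """Sample a response; each sampled token is one RL action.

    policy(ids) is assumed to return logits of shape (1, seq_len, V).
    Returns the generated token ids and the sum of their log-probabilities.
    """
    ids = prompt_ids
    log_prob = torch.tensor(0.0)
    for _ in range(max_new_tokens):
        dist = Categorical(logits=policy(ids)[:, -1, :])  # policy over the next token
        action = dist.sample()                            # choose one token (the action)
        log_prob = log_prob + dist.log_prob(action)
        ids = torch.cat([ids, action.unsqueeze(-1)], dim=-1)
    return ids[:, prompt_ids.size(1):], log_prob
```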

Formula representation

  • RM: the reward model described in the RM model section.
  • $LLM^{SFT}$: the instruction-tuned model obtained in the SFT stage; given a prompt $x$, its output is a probability distribution over responses. In the InstructGPT paper, $LLM^{SFT}$ is written as $\pi^{SFT}$.
  • $LLM_{\phi}^{RL}$: the model trained with reinforcement learning, with parameter set $\phi$.
    • The goal of RL training is to find the parameter set $\phi$ that maximizes the RM score.
    • Given a prompt $x$, its output is a probability distribution over responses.
    • In the InstructGPT paper, $LLM_{\phi}^{RL}$ is written as $\pi_{\phi}^{RL}$.
  • $x$: prompt
  • $D_{RL}$: the distribution of prompts used for RL fine-tuning
  • $D_{pretrain}$: the distribution of the pre-training dataset

For each training step, sample a batch from $D_{RL}$, denoted $x_{RL}$, and a batch from $D_{pretrain}$, denoted $x_{pretrain}$. The two kinds of data come from different sample sets, so their objective functions differ.

1. For each prompt $x_{RL}$, use the RL model $LLM_{\phi}^{RL}$ to sample a response $y \sim LLM_{\phi}^{RL}(x_{RL})$; the objective is
$$O_1(x_{RL}, y; \phi) = RM(x_{RL}, y) - \beta \log \frac{LLM_{\phi}^{RL}(y \mid x)}{LLM^{SFT}(y \mid x)}$$
The second term is a KL-divergence penalty that keeps the RL model from drifting too far from the SFT model.

2. For each $x_{pretrain}$, the objective is that the RL model's language modeling on this data should not become worse than that of the pre-trained model:
$$O_2(x_{pretrain}; \phi) = \gamma \log LLM_{\phi}^{RL}(x_{pretrain})$$

The final objective function is the sum of the two; in the RL setting, it is maximized:
$$\mathrm{objective}(\phi) = E_{x \sim D_{RL}}\, E_{y \sim LLM_{\phi}^{RL}(x)} \Big[ RM(x, y) - \beta \log \frac{LLM_{\phi}^{RL}(y \mid x)}{LLM^{SFT}(y \mid x)} \Big] + \gamma\, E_{x \sim D_{pretrain}} \log LLM_{\phi}^{RL}(x)$$

The objective function as written in the InstructGPT paper, with the same meaning, is:
[Figure: objective function as written in the InstructGPT paper]
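A minimal sketch of how the per-sample objective above might be computed; `reward_model`, `rl_logprob`, `sft_logprob`, and `rl_lm_logprob` are assumed interfaces for illustration, and a real implementation would optimize this with PPO over mini-batches rather than one sample at a time.

```python
def rlhf_objective(reward_model, rl_logprob, sft_logprob, rl_lm_logprob,
                   prompt, response, pretrain_batch, beta, gamma):
    """Per-sample objective(phi) from the formula above, to be maximized.

    Assumed (illustrative) interfaces:
      reward_model(x, y) -> scalar RM score RM(x, y)
      rl_logprob(x, y)   -> log LLM_RL(y | x), the trainable policy
      sft_logprob(x, y)  -> log LLM_SFT(y | x), the frozen reference
      rl_lm_logprob(x)   -> log LLM_RL(x) on pre-training text
    beta, gamma: KL-penalty and pre-training-mix coefficients (hyperparameters).
    """
    score = reward_model(prompt, response)
    kl_penalty = rl_logprob(prompt, response) - sft_logprob(prompt, response)
    o1 = score - beta * kl_penalty               # O_1: RM score minus KL drift penalty
    o2 = gamma * rl_lm_logprob(pretrain_batch)   # O_2: keep language-modeling quality
    return o1 + o2                               # maximize (or minimize the negative)
```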

Why large language models generate fictitious content (hallucinations)

There are two hypotheses that explain why LLMs generate fictitious (hallucinated) content:

The first hypothesis, proposed at DeepMind by Pedro A. Ortega et al. in October 2021, is that LLMs hallucinate because they "lack the understanding of the cause and effect of their actions" (at the time, DeepMind used the word "delusions" for hallucinations). They show that this can be addressed by treating response generation as a causal intervention.

The second hypothesis is that hallucination is caused by a mismatch between the LLM's internal knowledge and the labeler's internal knowledge. In a talk at UC Berkeley (April 2023), OpenAI co-founder and PPO author John Schulman suggested that behavior cloning can lead to hallucination: during SFT, the LLM is trained to mimic human-written responses, and if a response uses knowledge that the human has but the LLM does not, we are teaching the LLM to make things up.

In December 2021, another OpenAI employee, Leo Gao, made the same point well: in theory, human annotators could include all the context they know with each prompt to teach the model to use only its existing knowledge, but this is impossible in practice.

Schulman argued that LLMs know whether they know something. This means that if we can find a way to force an LLM to give only answers containing information it actually knows, hallucination can be resolved. He then proposed several solutions:

  1. Verification: ask the LLM to explain (or retrieve) the source from which it got the answer.
  2. Reinforcement learning: RMs are trained using only comparisons (response A is better than response B) without any information about how much better or why. Schulman argued that hallucination can be addressed with a better reward function, for example one that punishes the model for making things up.

A screenshot from Schulman's April 2023 talk, showing his view that RL methods can address hallucination, is shown below:
[Figure: slide from Schulman's April 2023 talk on RL and hallucination]
However, the InstructGPT paper shows that RLHF actually made hallucination worse, as shown below:
[Figure: hallucination comparison from the InstructGPT paper]

Origin blog.csdn.net/shichaog/article/details/132198736