Task03: ChatGPT Technical Principles

ChatGPT is an iteration of GPT-3. The original GPT-3 struggles to learn some higher-level representations from text alone, so a language model like GPT-3 often fails to understand the user's true intention: its answers can be off-topic, and sometimes they are outright nonsense.

How has ChatGPT been improved? The core problem ChatGPT needs to solve is how to align the model with the user.

So what does aligning the model with the user mean? It means teaching the model to understand the intent behind human instructions (for example, generation tasks such as "write me a short essay", knowledge-based question answering, brainstorming, and other types of commands), and teaching it to judge, for a given prompt (user question), what kind of answer is high quality (informative, rich in content, helpful to the user, harmless, free of discriminatory content, and so on).

ChatGPT's improvement is to fine-tune the pre-trained language model by introducing Reinforcement Learning from Human Feedback (RLHF); the "human feedback" here is, in practice, human-labeled data.

Under this "human-labeled data + reinforcement learning" framework, training ChatGPT is divided into three main stages:

  • Stage 1: fine-tune the pre-trained model on labeled data (prompts and their corresponding answers), i.e. supervised fine-tuning (SFT).
  • Stage 2: train a reward model (RM). For a batch of prompts (roughly 30,000), the fine-tuned SFT model generates several answers per prompt; annotators rank these answers from best to worst, and the RM is trained with pair-wise learning to reproduce this human ranking.
  • Stage 3: fine-tune the pre-trained language model with reinforcement learning.

The purpose of using reinforcement learning is to bring the model's answers closer to human intent. At this stage no manually labeled data is needed; instead, the RM learned in the previous stage scores the model's outputs, and the pre-trained model's parameters are updated based on those scores. This raises a question: why not just use SFT directly? The main reason is that there is too little labeled data.

These three stages are introduced in detail below:

Stage 1: Supervised fine-tuning (SFT)

To give ChatGPT an initial understanding of the intention behind a command (prompt), a batch of prompts (commands or questions) submitted by test users is first randomly sampled, and professional annotators write high-quality answers for these prompts. The human-annotated <prompt, answer> pairs are then used to fine-tune the GPT-3 model. After this process, ChatGPT can be considered to have a preliminary ability to understand the intention contained in human prompts and to give reasonably high-quality answers based on that intention. However, because there are too few samples, this stage alone is not enough to reach the desired quality.

In short, the main work of the first stage is to fine-tune GPT-3 on human-labeled instruction-answer pairs; after this stage, the model has an initial ability to understand the human intention expressed in prompts.
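
As a concrete illustration, here is a minimal PyTorch-style sketch of what this supervised fine-tuning step could look like, assuming a generic causal language model that returns token logits and a simple tokenizer; `model`, `tokenizer`, and the masking scheme are illustrative placeholders, not the actual ChatGPT training code.

```python
# Sketch of supervised fine-tuning (SFT) on a single <prompt, answer> pair.
# Assumes `model(input_ids)` returns logits of shape [batch, seq_len, vocab_size]
# and `tokenizer.encode(text)` returns a list of token ids (both are placeholders).
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt, answer, device="cpu"):
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    input_ids = torch.tensor([prompt_ids + answer_ids], device=device)

    logits = model(input_ids)                 # [1, T, vocab_size]
    shift_logits = logits[:, :-1, :]          # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].clone()
    # Train only on the answer tokens: mask out the loss on the prompt portion.
    shift_labels[:, : len(prompt_ids) - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Typical use: loss = sft_loss(model, tokenizer, prompt, answer)
# followed by loss.backward() and optimizer.step()
```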

The annotated dataset consists mainly of various types of question-answering and generation tasks.


Stage 2: Training the reward model (RM)


The main purpose of this stage is to train the reward model on human-labeled comparison data. Specifically, a batch of user-submitted prompts is randomly sampled (most of them the same as in the first stage), and for each prompt the stage-1 SFT model generates K different answers, producing the data <prompt, answer1>, <prompt, answer2>, ..., <prompt, answerK> (K is between 4 and 9). The annotators then rank the K answers according to the many criteria mentioned above (relevance, informativeness, absence of harmful content, and so on); this ranking is the human-labeled data for this stage.

Next, this ranking data is used to train the reward model. The training mode is the commonly used pair-wise learning to rank: the K ranked answers are combined in pairs to form $\binom{K}{2}$ training pairs, and ChatGPT uses a pair-wise loss to train the reward model. The RM takes a <prompt, answer> pair as input and outputs a scalar score that evaluates the quality of the answer. For a training pair <answer1, answer2> in which answer1 is ranked above answer2 by the annotators, the loss function encourages the RM to score <prompt, answer1> higher than <prompt, answer2>.

Here is the loss function for the reward model:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x,\, y_w,\, y_l)\sim D}\Big[ \log\big( \sigma\big( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \big) \big) \Big]$$

Here $r_{\theta}(x, y)$ is the scalar output of the reward model for prompt $x$ and answer $y$, $\sigma$ is the sigmoid function, and $D$ is the dataset of human comparisons. $y_w$ denotes the answer that the annotators ranked above $y_l$, just as answer1 is ranked above answer2 in the example above.

To summarize this stage: the SFT model generates K answers for each prompt; annotators rank them from best to worst; and this ranking data is used to train the reward model in pair-wise learning-to-rank mode. The trained RM takes <prompt, answer> as input and outputs a quality score for the answer: the higher the score, the better the generated answer.
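
Below is a minimal sketch of how the ranked answers could be turned into $\binom{K}{2}$ pairs and scored with this pair-wise loss, assuming a `reward_model(prompt, answer)` callable that returns a scalar tensor $r_{\theta}(x, y)$; the names are illustrative, not the actual implementation.

```python
# Sketch of the pair-wise reward-model loss over C(K, 2) answer pairs.
# `reward_model(prompt, answer)` is a placeholder returning a scalar tensor.
from itertools import combinations
import torch
import torch.nn.functional as F

def rm_loss(reward_model, prompt, ranked_answers):
    """`ranked_answers` is ordered best-first, as produced by the annotators."""
    losses = []
    # All C(K, 2) pairs; in each pair the first element is the preferred answer y_w.
    for better, worse in combinations(ranked_answers, 2):
        r_w = reward_model(prompt, better)   # r_theta(x, y_w)
        r_l = reward_model(prompt, worse)    # r_theta(x, y_l)
        # -log(sigmoid(r_w - r_l)) == softplus(r_l - r_w), numerically stable.
        losses.append(F.softplus(r_l - r_w))
    return torch.stack(losses).mean()        # the 1 / C(K, 2) averaging in the loss

# Usage: loss = rm_loss(reward_model, prompt, ["best answer", "ok answer", "worst answer"])
```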

Stage 3: Fine-tuning the SFT model using reinforcement learning


At this stage no additional human labeling is needed. Instead, the RM learned in the previous stage scores the model's outputs, and the pre-trained model's parameters are updated based on those scores. Specifically, a batch of new prompts is first randomly sampled from user submissions (new in the sense that they differ from the prompts used in the first and second stages), and the PPO policy is initialized from the stage-1 SFT model. For each sampled prompt, the PPO policy generates an answer, and the RM trained in the previous stage gives a reward score evaluating the quality of that answer; this reward is the overall return the RM assigns to the entire answer.

The objective function of reinforcement learning is as follows:

$$\text{objective}(\phi) = E_{(x, y)\sim D_{\pi_{\phi}^{RL}}}\Big[ r_{\theta}(x, y) - \beta\, \log\big( \pi_{\phi}^{RL}(y \mid x) / \pi^{SFT}(y \mid x) \big) \Big] + \gamma\, E_{x\sim D_{\text{pretrain}}}\Big[ \log\big( \pi_{\phi}^{RL}(x) \big) \Big]$$

The first term maximizes the reward score; the second term penalizes the RL policy for deviating too far from the SFT model's output distribution; and the last term, computed on pretraining data, ensures that the model's original language-modeling ability does not degrade during fine-tuning.
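
As a simplified, per-sample view of how this objective could be assembled, here is a sketch assuming placeholder callables `reward_model`, `policy_logprob`, `sft_logprob`, and `pretrain_logprob` that each return a scalar tensor; a real PPO implementation would add the clipped surrogate loss, advantage estimation, and batching, all omitted here.

```python
# Simplified per-sample view of the RLHF objective (to be maximized).
# All callables and coefficient values below are illustrative placeholders.
import torch

def rlhf_objective(prompt, answer, pretrain_text,
                   reward_model, policy_logprob, sft_logprob, pretrain_logprob,
                   beta=0.02, gamma=0.1):
    # Reward from the stage-2 RM for the generated answer.
    reward = reward_model(prompt, answer)                       # r_theta(x, y)
    # KL-style penalty: keep the RL policy close to the SFT model.
    kl_penalty = policy_logprob(prompt, answer) - sft_logprob(prompt, answer)
    # Pretraining term: preserve the original language-modeling behavior.
    ptx_term = pretrain_logprob(pretrain_text)                  # log pi_RL(x_pretrain)
    return reward - beta * kl_penalty + gamma * ptx_term

# Gradient ascent on this objective (i.e. descent on its negative) updates the
# policy parameters phi while the SFT model and the RM stay frozen.
```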


Origin: blog.csdn.net/Runnymmede/article/details/132914294