Artificial intelligence LLM model: training of reward model, training of PPO reinforcement learning, RLHF

1. Training of Reward Model

1.1 The concept of reward model in large language model

After the large language model has been fine-tuned with supervised fine-tuning (SFT), the next stage is to build a reward model that scores question-answer pairs. The reward model is derived from the reward function in reinforcement learning, which assigns a score to the current state to describe how valuable that state is. In large language model fine-tuning, the reward model computes a score for an input question and answer: the better the answer matches the question, the higher the score the reward model outputs.

1.2 Model Architecture and Loss Function of Reward Model

1.2.1 Model Architecture

The reward model (RM model) starts from the SFT model and removes the softmax of the last layer; that is, the final softmax layer is replaced with a linear layer. The input of the RM model is a question together with an answer, and the output is a scalar, i.e., a score.
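A minimal sketch of this architecture in PyTorch, assuming a hypothetical `base_model` (the SFT transformer backbone with its vocabulary softmax head removed) that returns per-token hidden states; the added linear head maps the hidden state of the last token of the question-answer sequence to a single scalar score:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """SFT backbone with the softmax/vocabulary head replaced by a scalar value head."""
    def __init__(self, base_model: nn.Module, hidden_size: int):
        super().__init__()
        self.base_model = base_model                  # transformer body taken from the SFT model
        self.value_head = nn.Linear(hidden_size, 1)   # linear layer instead of the softmax head

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor = None) -> torch.Tensor:
        # Assumed to return hidden states of shape (batch, seq_len, hidden_size).
        hidden_states = self.base_model(input_ids, attention_mask=attention_mask)
        last_hidden = hidden_states[:, -1, :]                # state after reading question + answer
        reward = self.value_head(last_hidden).squeeze(-1)    # one scalar score per sequence
        return reward
```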

Because a very large model is unstable during this training, its loss is difficult to converge, and a smaller model is much cheaper to run, the RM model uses a 6B-parameter model rather than the 175B-parameter model.

1.2.2 Loss function

The training data for the reward model consists of several answers to each question that have been manually ranked, as shown in the following figure:

For each question, several answers are generated, and human annotators sort them from best to worst; the reward model is then trained by backpropagation on these ranking results. The loss function of the reward model is a Pairwise Ranking Loss, with the following formula:

$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x,\,y_w,\,y_l)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]$$

Among them:
D: the dataset of manually ranked answers;
x: a question from the dataset D;
K: the number of answers generated for each question;
y_w and y_l: two of the K answers to question x, with y_w ranked higher than y_l; because they form a pair, this loss is called pairwise;
r_θ(x, y): the RM model to be trained, which outputs a scalar score for the input pair of question x and answer y;
θ: the parameters of the RM model to be optimized.

How to understand the loss function of the RM model?

The goal of the RM model is to give the higher-ranked answer y_w a larger scalar score than the lower-ranked answer y_l, and the larger the margin the better; that is, the larger r_θ(x, y_w) − r_θ(x, y_l) is, the better. The score difference is passed through the sigmoid function σ, which maps it into the range (0, 1). Since sigmoid is monotonically increasing, the larger σ(r_θ(x, y_w) − r_θ(x, y_l)) is, the better: a value close to 1 means the model agrees that y_w is ranked above y_l, and a value close to 0 means the opposite, so the comparison can be viewed as a binary classification problem. Taking the logarithm then gives the familiar cross-entropy form. Each question has K answers, so the loss is divided by C(K, 2), the number of answer pairs, so that the loss value does not change too much as K changes. The final goal is to minimize loss(θ), which corresponds to maximizing r_θ(x, y_w) − r_θ(x, y_l).
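As a minimal sketch (assuming the reward model has already produced scalar scores for the higher-ranked answer y_w and the lower-ranked answer y_l), the pairwise ranking loss can be written directly from the formula above, using `logsigmoid` for numerical stability:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_w: torch.Tensor, score_l: torch.Tensor) -> torch.Tensor:
    """
    score_w: scalar scores r_theta(x, y_w) of the higher-ranked answers
    score_l: scalar scores r_theta(x, y_l) of the lower-ranked answers
    Minimizing -log(sigmoid(score_w - score_l)) pushes score_w above score_l.
    """
    return -F.logsigmoid(score_w - score_l).mean()
```

Averaging over all answer pairs of a question plays the role of the 1/C(K, 2) factor in the formula.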

K is the number of answers generated for each question in reward-model training. Why is K = 9 more appropriate than K = 4?

  • When labeling, annotators spend most of the time understanding the question, while comparing the answers is relatively quick. If sorting 4 answers takes 30 seconds, sorting 9 answers may take only about 40 seconds. Yet 9 answers yield C(9, 2) = 36 answer pairs versus C(4, 2) = 6 pairs for 4 answers, six times as many, which is very cost-effective;
  • When K = 9, each loss computation involves 36 pairs and therefore 36 r_θ(x, y) terms. The RM forward pass is expensive, but the previously computed scores can be reused: only 9 forward passes are needed, one per answer, and all 36 pairwise terms are formed from those cached scores (a minimal sketch follows this list).
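The reuse argument can be sketched as follows, with a hypothetical `reward_model(question, answer)` scoring function and answers assumed to be sorted from best to worst: the model is run once per answer, and all 36 pairwise loss terms are formed from the cached scores.

```python
from itertools import combinations
import torch
import torch.nn.functional as F

def rm_loss_for_one_question(reward_model, question, answers_best_first):
    # One forward pass per answer: K scores, computed once and cached.
    scores = [reward_model(question, answer) for answer in answers_best_first]
    # All C(K, 2) pairs; index i is ranked higher than index j whenever i < j.
    pair_losses = [-F.logsigmoid(scores[i] - scores[j])
                   for i, j in combinations(range(len(scores)), 2)]
    # Averaging divides by C(K, 2), so the loss scale does not depend on K.
    return torch.stack(pair_losses).mean()
```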

Why does the loss function of the reward model compare the ranking of answers instead of regressing on a specific score for each answer?

Different people would give different scores to the same answer, so it is impossible to score each answer with a single uniform value; training the model by regressing on absolute scores would therefore introduce a large error. However, people largely agree on which answers are better and which are worse, so training on rankings avoids this human inconsistency.

1.3 Summary

The reward model further improves the generation ability and naturalness of large language models by obtaining feedback signals from human experts on the quality of generated responses. Unlike the supervised (SFT) model, the reward model works through scoring, which makes the generated text more natural and realistic and further improves the generation ability of the large language model.

2. Training of PPO reinforcement learning

2.1 PPO Reinforcement Learning Concept

After the reward model has been trained, the next and final stage is to train the reinforcement learning model (RL model). The optimization algorithm used in this stage is PPO (Proximal Policy Optimization), which optimizes the chosen objective function via stochastic gradient descent. PPO is a deep reinforcement learning algorithm for training an agent to learn and perform tasks in complex environments: through training, the agent learns to maximize the cumulative reward it receives from interacting with the environment, thereby achieving the specified task goal. Here, the agent is the RL model in the large language model pipeline.

2.2 Principle of PPO reinforcement learning

The RL model is initialized from the SFT-fine-tuned large language pre-trained model. The training data for the RL model only requires collecting a set of questions (a prompt set); the questions do not need labeled answers. The RL model generates answer text for each prompt, and the question together with the generated answer is then fed into the RM model trained in the previous step, which scores the quality of the generated text. The goal of training the RL model is to make the generated text score as high as possible on the RM model.
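A minimal sketch of this data flow, with hypothetical `policy_model`, `reward_model`, and `prompts` placeholders; the PPO parameter update that consumes these scored rollouts is described below:

```python
def collect_scored_rollouts(policy_model, reward_model, prompts):
    """One scoring pass of the RL stage: only prompts are needed, no labeled answers."""
    rollouts = []
    for prompt in prompts:
        answer = policy_model.generate(prompt)     # RL policy produces the answer text
        score = reward_model(prompt, answer)       # RM from the previous stage scores (prompt, answer)
        rollouts.append((prompt, answer, score))   # later used to update the policy
    return rollouts
```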

Modeling the fine-tuning task of the initial language model as a reinforcement learning (RL) problem requires defining basic elements such as policy, action space, and reward function.

  • The policy is the language model itself: it receives a prompt as input and outputs a sequence of text (or a probability distribution over text);
  • The action space is the set of all permutations and combinations of vocabulary tokens at every output position;
  • The observation space is the set of possible input token sequences (prompts), i.e., all permutations and combinations of vocabulary tokens at every input position;
  • The reward function is the RM model trained in the previous stage, with some policy-level constraints added to the reward calculation.

The flow of this stage is shown in the figure below:

The objective function used to train the RL model is as follows:

$$\mathrm{objective}(\phi) = E_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Big[r_\theta(x,y) - \beta\,\log\big(\pi_\phi^{\mathrm{RL}}(y\mid x)\,/\,\pi^{\mathrm{SFT}}(y\mid x)\big)\Big] + \gamma\, E_{x\sim D_{\mathrm{pretrain}}}\Big[\log\big(\pi_\phi^{\mathrm{RL}}(x)\big)\Big]$$

Among them:
π^SFT: the SFT model;
π_φ^RL: in reinforcement learning the model is called the policy; π_φ^RL is the model being tuned, i.e., the final model, and it is initialized from π^SFT;
(x, y) ∼ D_{π_φ^RL}: x is a question from the RL dataset, and y is the answer generated for x by the π_φ^RL model;
r_θ(x, y): the RM model that scores the question-answer pair (x, y);
π_φ^RL(y|x): the probability of obtaining answer y from question x under the policy, i.e., the product of the softmax probabilities of each predicted token of y;
π^SFT(y|x): the probability of obtaining answer y from question x under the SFT model;
x ∼ D_pretrain: data drawn from the pre-training stage of the large language model;
β, γ: adjustment coefficients.

The optimization goal of the RL model is to make this objective function as large as possible. It can be divided into three parts: the scoring part, the KL-divergence part, and the pre-training part.

  • **Scoring part:** a question x from the RL dataset is fed to the π_φ^RL model to obtain an answer y, and the pair (x, y) is then scored by the RM model, which gives the r_θ(x, y) term in the objective. The higher this score, the better the answer generated by the model.
  • **KL-divergence part:** every parameter update changes π_φ^RL, so the answers y it generates for x also change, while the reward model r_θ(x, y) was trained on data produced by the π^SFT model. If π_φ^RL drifts too far from π^SFT, the score estimate r_θ(x, y) becomes inaccurate. Therefore a KL-divergence term measures the distance between the answer distribution generated by π_φ^RL and that generated by π^SFT, keeping the two models from diverging too much; this is the log(π_φ^RL(y|x) / π^SFT(y|x)) term in the objective. Since a smaller KL divergence is better while the training goal is to make the objective as large as possible, this term carries a negative sign.
  • **Pre-training part:** this corresponds to E_{x∼D_pretrain}[log(π_φ^RL(x))] in the objective. Without this term, the model might end up performing well only on this one task while degrading on other tasks. Adding the pre-training objective ensures that, while the first two parts are fitted on the new dataset, the abilities learned from the original pre-training data are not discarded. A minimal sketch combining the three parts follows this list.
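A minimal sketch of how the three parts combine into the objective, assuming per-sequence log-probabilities and reward scores have already been computed; the names `rm_score`, `logp_rl`, `logp_sft`, `logp_pretrain` and the coefficient values are illustrative placeholders, and the full PPO update adds machinery (e.g., ratio clipping) not shown here:

```python
import torch

def rlhf_objective(rm_score, logp_rl, logp_sft, logp_pretrain, beta=0.02, gamma=0.1):
    """
    rm_score:      r_theta(x, y) from the reward model for sampled (x, y)
    logp_rl:       log pi_RL(y|x) under the current policy
    logp_sft:      log pi_SFT(y|x) under the frozen SFT model
    logp_pretrain: log pi_RL(x) on pre-training data
    The objective is maximized, so the returned loss is its negative.
    """
    kl_penalty = beta * (logp_rl - logp_sft)              # beta * log(pi_RL / pi_SFT)
    objective = (rm_score - kl_penalty).mean() + gamma * logp_pretrain.mean()
    return -objective                                     # minimize the negative to maximize the objective
```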

The finally optimized π_φ^RL model is the final large language model.

2.3 Summary

Through this reinforcement learning training procedure, the reward model (RM model) and the policy model (RL model) are iteratively updated, so that the reward model describes the quality of the model's output more and more accurately, while the policy model's output gradually moves away from the initial model and the generated text becomes increasingly consistent with human cognition. This training method is called RLHF (Reinforcement Learning from Human Feedback).

Currently, RLHF has had a huge impact on training large language models and outperforms previous methods. However, large language models trained with RLHF may still output harmful or factually inaccurate text and require continuous improvement. In addition, the cost of manual annotation in the RLHF paradigm remains very high, and in the end the performance of RLHF can only reach the knowledge level of the annotators. The manual labeling here mainly means ranking the output texts for the RM model; if one instead wanted to train the model on manually written answers, the cost would be unimaginable.

3. Key knowledge points

  1. Reward model training in large language model fine-tuning: (1) the reward model takes question-answer pairs as input and outputs a score; (2) the loss function of the reward model aims to make the score of the higher-ranked answer exceed that of the lower-ranked answer by as large a margin as possible; (3) the reward model is a discriminative model.

  2. The reward model involves: supervised learning, reinforcement learning, and discriminative modeling.

  3. PPO reinforcement learning in large language model training: (1) the reinforcement learning model has the same architecture as the SFT supervised fine-tuned model; (2) in the RLHF stage that trains the reinforcement learning model, the answers to questions do not need to be labeled; (3) in RLHF, the initial policy is the SFT model.

  4. On the loss function of RL model training in the RLHF method: (1) the loss function of the RL model contains three parts; (2) it requires computing the KL divergence between the RL model's output after the policy update and the SFT model's output; (3) it includes the loss function of the large language model pre-training stage; (4) it should make the text generated by the RL model score as high as possible on the reward model.

  5. RLHF essentially optimizes the model through human feedback, and the generated text will be more natural.

