Optimizing Large Models Using RLHF: Improving Performance and Applicability

As data science continues to develop, large models are being used ever more widely in fields such as natural language processing, image recognition, and financial forecasting. However, training and optimizing large models also faces growing challenges, including enormous data volumes, limited computing resources, and difficult hyperparameter tuning. Traditional machine learning methods often struggle with these problems, so more efficient and intelligent approaches are needed. Reinforcement Learning from Human Feedback (RLHF) is a reinforcement learning technique that uses feedback provided by humans to train large models and improve their performance. This article introduces how to use RLHF to optimize large models and provide stronger support for their applications.

This article is organized as follows: pre-training a language model, supervised fine-tuning for dialogue, training a reward model, fine-tuning with reinforcement learning, and finally the advantages and limitations of RLHF.

Pre-train a language model (LM)

The goal of pre-training is to equip the language model with statistical knowledge of the language, so that it can predict the probability of words occurring given their context. A language model can be thought of as a "completion machine": given a prompt, it generates text that completes the prompt. Through pre-training, we obtain a large language model (LLM), also known as a pre-trained model.

Once you have a pre-trained language model, you can perform an additional, optional step of supervised fine-tuning (SFT). In supervised fine-tuning, we use human-annotated (input, output) text pairs to fine-tune the pre-trained model so that it performs better on a specific task. SFT is considered a high-quality initialization for RLHF and lays a good foundation for the subsequent RLHF process.

At the end of this step, we have a trained language model: our main model. This main model is the one we will train further with RLHF, through which it will continually improve its generative capabilities based on human feedback.

It is worth mentioning that different research institutions adopt different models and methods in the pre-training stage. For example, OpenAI used a smaller version of GPT-3 for its popular RLHF model InstructGPT, Anthropic used Transformer models with large numbers of parameters, and DeepMind used its own huge-parameter model Gopher. In addition, when fine-tuning the pre-trained model, some institutions use additional text or conditions. For example, OpenAI fine-tuned on human-generated text judged "preferable", while Anthropic distilled the original model on context clues for its "helpful, honest, and harmless" criteria. These fine-tuning steps may require expensive augmented data, but they are not necessary for RLHF. Because RLHF is still a largely unexplored field, there is no clear answer as to which model is the best starting point, so different institutions may experiment with different approaches.

Supervised fine-tuning (SFT) of dialogue

The goal of supervised fine-tuning (SFT) is to optimize the pre-trained model so that it generates the responses users expect. In the pre-training stage, the model learns to predict how a text continues by training on a large amount of linguistic data. This means that when we give the pre-trained model a prompt such as "how to learn to program", it can generate several plausible continuations, such as:

  1. Add context to the question: ", for beginners"
  2. Add follow-up questions: "What programming languages do I need to learn? How long does it take to learn programming?"
  3. Give the answer directly: "Learning programming requires mastering programming syntax and algorithms."

Of these options, the third is the most appropriate if we really want an answer. The goal of supervised fine-tuning is to optimize the pre-trained model to make it more inclined to generate the answers expected by users.

When implementing supervised fine-tuning, we show the language model examples of different use cases (e.g. question answering, summarization, translation) to teach it how to respond to such prompts appropriately. These examples take the form (prompt, response) and are often referred to as demonstration data. OpenAI calls this supervised fine-tuning approach "behavioral cloning": you show the model how it should behave, and the model clones that behavior.

As an example, suppose we have a pretrained language model and we want to optimize its performance in question answering scenarios. We can feed the model various question answering examples such as:

  • Prompt: "Please tell me how to make brownies."
  • Response: "First prepare the chocolate, flour, eggs and milk. Then follow the recipe and finally bake for about 30 minutes."
  • Prompt: "What is the largest planet in the solar system?"
  • Response: "The largest planet in the solar system is Jupiter."

With examples like this, we teach the model to correctly answer different types of questions. With supervised fine-tuning, the model gradually learns to generate user-desired responses from the prompts, thereby better adapting to specific tasks and use cases. Through this step, our main model will gradually become more intelligent and accurate, laying the foundation for the subsequent RLHF stage. The figure below is a simple illustration of supervised fine-tuning.

[Figure: a simple illustration of supervised fine-tuning]
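To make the mechanics concrete, here is a minimal sketch of supervised fine-tuning on (prompt, response) demonstration pairs, assuming a Hugging Face causal language model. The "gpt2" checkpoint, the tiny in-line dataset, and the hyperparameters are illustrative placeholders, not the setup of any particular lab.

```python
# A minimal sketch of supervised fine-tuning on (prompt, response) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever pre-trained LM is being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demo_data = [  # (prompt, response) demonstration pairs
    ("What is the largest planet in the solar system?",
     "The largest planet in the solar system is Jupiter."),
    ("Please tell me how to make brownies.",
     "First prepare the chocolate, flour, eggs and milk, then follow the recipe."),
]

model.train()
for prompt, response in demo_data:
    # Concatenate prompt and response; using the same tokens as labels trains the
    # model to continue the prompt with the demonstrated response.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the prompt tokens are usually masked out of the loss and training runs over many thousands of demonstrations; the loop above only shows the shape of the update.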

Train the Reward Model

[Figure: collecting (input text, output text, reward) triples to train the reward model]
In this step, our goal is to collect a dataset containing (input text, output text, reward) triples.

As shown in the figure above, we take input text (preferably production data), generate the corresponding output text with our model, and have humans assign a reward value to each generated output.

The reward value is generally between 0 and 5, and can also be represented by 0/1.

The task of the reward model (RM) is to learn from (prompt, response) pairs and their reward scores how to output a score for a given input. Outputting a score for a given input is a very common machine learning task and can be framed as classification or regression. The reward model scores each (text input, text output) pair to evaluate how good the output is.
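As one way to picture such a model, the sketch below places a scalar scoring head on top of a pretrained LM backbone; the "gpt2" backbone and the choice of the last token's hidden state as the sequence summary are illustrative assumptions rather than a prescribed design.

```python
# A minimal sketch of a reward model: LM backbone + scalar scoring head.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Use the last token's hidden state as a summary of the whole
        # (prompt, response) sequence (assumes no right-padding).
        last_token = hidden[:, -1, :]
        return self.score_head(last_token).squeeze(-1)  # one scalar per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()
batch = tokenizer("Prompt: What is the largest planet?\nResponse: Jupiter.",
                  return_tensors="pt")
score = rm(**batch)  # scalar reward score for this (prompt, response) pair
```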

To train the reward model, we use large models with different parameters (with different fine-tuning, or none at all) to produce different responses to the same prompt, and these responses may receive different rewards. The purpose of optimizing the reward model is to score these responses consistently: for the same prompt, the responses that humans prefer should end up with higher rewards than the ones they do not, no matter which large model produced them.

Notation: let $r_\theta$ be the reward model being trained, with parameters $\theta$.

$x$: the prompt
$y_w$: the winning response, i.e. the response with the highest reward among the large models' outputs
$y_l$: the losing response, i.e. the response with the lowest reward among the large models' outputs

For each training sample $(x, y_w, y_l)$:

$s_w = r_\theta(x, y_w)$: the reward model's score for the winning response
$s_l = r_\theta(x, y_l)$: the reward model's score for the losing response

Loss value: $-\log(\sigma(s_w - s_l))$

To better understand what this loss function does, let $d = s_w - s_l$ and look at $f(d) = -\log(\sigma(d))$. For negative $d$ the loss value is large, and it shrinks toward zero as $d$ becomes more positive, which pushes the reward model not to score the winning response lower than the losing response.
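Here is a minimal PyTorch sketch of this pairwise loss, assuming the winning and losing scores have already been produced by the reward model; the example score values are made up.

```python
# A minimal sketch of the pairwise reward-model loss -log(sigma(s_w - s_l)).
import torch
import torch.nn.functional as F

def reward_model_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    """Large when the losing response outscores the winning one, near zero otherwise."""
    # -log(sigmoid(s_w - s_l)) == softplus(-(s_w - s_l)), a numerically stable form
    return F.softplus(-(s_w - s_l)).mean()

# Scores for a batch of (winning, losing) response pairs
s_w = torch.tensor([1.2, 0.3, 2.0])   # scores of human-preferred responses
s_l = torch.tensor([0.5, 0.9, -1.0])  # scores of dispreferred responses
print(reward_model_loss(s_w, s_l).item())  # small when s_w > s_l, large otherwise
```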

To obtain these reward values for training, the answers generated by the LM need to be ranked manually.

For specific ranking methods, a successful approach is to compare the output of different LMs given the same prompt, and then use the Elo system to build a complete ranking. These different ranking results will be normalized to a scalar reward value for training.

Anyone who has played ranked games such as Honor of Kings or League of Legends will be familiar with the Elo mechanism.
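For reference, here is a minimal sketch of a single Elo update between two responses to the same prompt; the K-factor of 32 and the starting rating of 1000 are conventional illustrative values, not numbers taken from this article.

```python
# A minimal sketch of an Elo update after a human compares two responses.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update ratings after a human judges response A against response B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two responses start at 1000; the human prefers response A
print(elo_update(1000.0, 1000.0, a_wins=True))  # A's rating rises, B's falls
```

The ratings accumulated over many such comparisons can then be normalized into the scalar reward values used for training, as described above.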

By deriving a reward model, we provide a reliable measure for the subsequent RLHF process.

Fine-tuning with reinforcement learning

Reinforcement learning fine-tuning is one of the key steps in RLHF: it trains the language model to generate more appropriate responses to user prompts. However, since the reward itself is not differentiable with respect to the model's outputs, we need reinforcement learning (RL) to construct a loss function that can be backpropagated through the LM.

In practice, this reinforcement learning step is implemented using Kullback-Leibler (KL) divergence and Proximal Policy Optimization (PPO).

To better illustrate why reinforcement learning can be applied to LM, we first formulate the fine-tuning task as an RL problem.

The policy is an LM that takes a prompt and returns a sequence of text (or a probability distribution over text). The action space of this policy is the set of all tokens in the LM's vocabulary (generally on the order of 50k), and the observation space is the set of possible input token sequences, which is also very large (vocabulary size raised to the number of input tokens). The reward function is a combination of the preference model and a constraint on policy shift.

The comparison with general reinforcement learning in the figure below makes this setup easier to understand.
[Figure: RLHF fine-tuning compared with general reinforcement learning]
At the beginning of training, we create an exact copy of the LM and freeze its weights. This frozen model helps prevent the trainable LM from completely changing its weights and starting to output nonsense text just to satisfy the reward model.

We do this by using, as a loss term, the KL divergence between the output text distributions (probability distributions) of the frozen LM and the trainable LM.
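A minimal sketch of such a KL penalty is shown below, assuming we already have the per-token logits of the trainable ("active") LM and of the frozen reference copy; the random logits only stand in for real model outputs.

```python
# A minimal sketch of the KL penalty between the trainable LM and its frozen copy.
import torch
import torch.nn.functional as F

def kl_penalty(active_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """KL(active || ref), averaged over tokens; shapes are (batch, seq_len, vocab)."""
    active_logp = F.log_softmax(active_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), computed per token
    kl = (active_logp.exp() * (active_logp - ref_logp)).sum(dim=-1)
    return kl.mean()

# Example: random logits for a batch of 2 sequences, 5 tokens, a 50k vocabulary
active_logits = torch.randn(2, 5, 50000)
ref_logits = torch.randn(2, 5, 50000)
print(kl_penalty(active_logits, ref_logits).item())
```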

With the reward and KL loss in place, we can now apply reinforcement learning to make the reward loss differentiable.

To make the loss differentiable, we employ the Proximal Policy Optimization (PPO) algorithm! The following are the detailed steps of the whole fine-tuning:

  • Step 1: Leverage the Reward Model

First, user input or prompts are sent to the RL policy, which is effectively a tuned version of the LM. The RL policy generates a response, which is evaluated by the reward model together with the output of the initial LM. The reward model then produces a scalar reward value corresponding to the quality of the response.

  • Step 2: Introduce a Feedback Loop

This process iterates in a feedback loop, with the reward model assigning rewards to as many samples as resources allow. Over time, responses that receive higher rewards guide the RL policy, helping it generate responses that are more in line with human expectations.

  • Step 3: Measure the difference using KL divergence

Kullback-Leibler (KL) divergence, a statistical measure of the difference between two probability distributions, plays a crucial role here. In RLHF, the KL divergence is used to compare the difference between the probability distribution of the RL policy's current response and a reference distribution representing the ideal or best human-desired response.

  • Step 4: Fine-tuning using proximal policy optimization

An important part of fine-tuning is proximal policy optimization (PPO). PPO is a well-known reinforcement learning algorithm known for its effectiveness in optimizing policies in complex environments with high-dimensional state and action spaces. PPO is especially useful during RLHF fine-tuning because it effectively balances exploration and exploitation during training. For RLHF agents, this balance is critical for learning from human feedback and trial-and-error exploration. Therefore, integrating PPO enables faster and more robust learning.

  • Step 5: Avoid Inappropriate Responses

The fine-tuning process helps stop language models from producing inappropriate or nonsensical output. Since responses with low rewards are less likely to be repeated, the language model is driven to produce output that is more in line with human expectations.

The PPO loss calculation (designed to keep each update to the LM small) proceeds as follows; the diagram further below illustrates the loop:

  1. Initialize "New probs" to be equal to "Initial probs".
  2. Compute the ratio between the new output text probabilities and the initial output text probabilities.
  3. Compute the loss: $loss = r_\theta(y|x) - \lambda_{KL} D_{KL}(\pi_{ppo}(y|x)\,\|\,\pi_{base}(y|x))$
  4. The weights of the LM are updated through backpropagation.
  5. "New probs" (i.e. new output text probabilities) are calculated using the newly updated LM.
  6. Repeat steps 2 to 5 N times (usually, N=4).

The probs here are the LM's output probabilities for the text, $\pi(y|x)$.
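Below is a minimal sketch of one such update for a single (prompt, response) sample, combining the probability ratio from step 2 with the KL-penalized reward from step 3 in a PPO-style clipped objective. The function name, the values of λ_KL and the clipping range, and the omission of a value function and advantage estimation are simplifying assumptions, not the exact recipe of any particular implementation.

```python
# A minimal sketch of one PPO-style update step for a single sample.
import torch

def ppo_step(new_logprob, init_logprob, reward, kl, lambda_kl=0.1, clip_eps=0.2):
    """new_logprob:  log pi_ppo(y|x) under the current trainable LM (requires grad)
    init_logprob: log pi(y|x) under the LM at the start of this PPO round (detached)
    reward:       scalar score r_theta(y|x) from the reward model
    kl:           KL(pi_ppo || pi_base) against the frozen reference copy
    """
    # Step 2: ratio between new and initial output text probabilities
    ratio = torch.exp(new_logprob - init_logprob)
    # Step 3: KL-penalized reward is the signal we want to maximize
    objective = reward - lambda_kl * kl
    # PPO clipping keeps each update small; we minimize the negative objective
    unclipped = ratio * objective
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * objective
    return -torch.min(unclipped, clipped)
    # Steps 4-6: call loss.backward() and optimizer.step(), recompute new_logprob
    # with the updated LM, and repeat (typically N = 4 times).
```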

[Figure: PPO loss calculation and update loop]

Advantages and limitations of RLHF

Reinforcement learning from human feedback (RLHF) provides a powerful methodology for refining AI systems. However, like any other approach, it has both obvious advantages and potential challenges.

Advantages of RLHF:

  1. Adaptability : RLHF is a dynamic learning strategy that can adapt based on the feedback it receives. This adaptability makes it ideal for a variety of tasks and enables it to adapt its behavior based on real-time interaction and feedback.
  2. Reduced bias : In theory, RLHF helps reduce the bias of the model. With carefully selected and diverse human feedback, these models can learn from a broader, more representative perspective, reducing overgeneralization or bias inherent in the initial training data.
  3. Continuous Improvement : The RLHF model has the capability of continuous improvement. As these models interact with users and get more feedback, they can learn and adapt, improving performance and user experience.
  4. Security : RLHF can play a key role in enhancing the security of AI systems. Through human feedback, these systems can avoid potentially harmful or inappropriate behavior, making them safer for interaction and use.

Challenges and limitations of RLHF:

  1. Scalability : Scalability remains a major challenge for RLHF. Because these models rely on human feedback for learning, scaling them up to larger or more complex tasks can be resource- and time-intensive.
  2. Relying on human factors : RLHF models rely heavily on the quality of human feedback. Ineffective or insufficient feedback can lead to poor performance or even inadvertently foster harmful behavior in the model.
  3. Human bias : The bias that human feedback can introduce is a key concern in RLHF. Feedback provided by human raters can be inherently biased, leading to biased learning. These biases can take many forms, including selection bias, confirmation bias, inter-rater variability, and limited feedback.

It is worth noting, however, that effective strategies exist to mitigate these biases. Selection of diverse raters, consensus assessment, calibration of raters, regular assessment of the feedback process and agent performance, and methods of balancing feedback with other sources can all help reduce the effects of bias in RLHF. These strategies underscore RLHF's thoughtful and systematic approach, emphasizing the importance of continuous assessment and adjustment during the process.

As RLHF continues to evolve, there are a number of additional challenges and limitations that need to be addressed:

  1. Interpretation and transparency : Large language models tend to become more opaque as models grow in size and complexity. This makes it difficult to explain the decision-making process of the model, especially after fine-tuning with RLHF. For some application scenarios, especially in domains that require interpretability and transparency, this may become a limiting factor.
  2. Reward Design : Designing an efficient reward function is a key issue in RLHF. The reward function needs to accurately reflect the performance of the model and be discriminative enough to rank different responses. However, designing reward functions is not always intuitive and simple, especially with complex tasks and diverse responses.
  3. Adversarial examples : Reinforcement learning is often vulnerable to adversarial example attacks. In RLHF, if the model is targeted, it may lead to generating undesired responses. This requires robustness and security considerations of the model during RLHF training to prevent adversarial attacks.
  4. Training efficiency : Since the training of large models requires a lot of computing resources and time, the training cost of RLHF can be high. This can be a challenge for some resource-constrained environments and application scenarios.

Despite these challenges, RLHF continues to evolve and improve as a powerful learning method, and researchers and developers are working hard to solve these problems to further advance its application and development. At the same time, society's demands for the trustworthiness and controllability of artificial intelligence systems are rising, so requirements for interpretability and transparency will receive even more attention as RLHF develops.

In the future, we can expect to see more innovations and improvements to make RLHF a more general and reliable method, providing stronger support for applications in various fields and promoting the continuous advancement of artificial intelligence technology.

Conclusion

As data science and artificial intelligence continue to develop, large language models and RLHF are becoming important tools in many fields. Through pre-training and fine-tuning, large language models acquire rich language capabilities, while RLHF continuously improves model performance based on human feedback, making models more intelligent and adaptable to different tasks.

However, we must also recognize that RLHF still faces challenges such as scalability, human bias, and interpretability. Addressing these issues requires interdisciplinary research and collaboration to ensure that RLHF can be applied to real-world problems safely, reliably, and efficiently.

In the future, we have reason to believe that with the continuous advancement of technology and the in-depth understanding of artificial intelligence, RLHF will continue to grow and bring more benefits and innovations to human society. At the same time, we also need to pay close attention to the moral and social issues that may arise during its development, and continue to promote the balance between technological development and social value. Only in this way can RLHF truly become a booster for the development of artificial intelligence technology and create a better future for mankind.
