Deep Learning Made Easy: What is the difference between the RLHF process used by ChatGPT and fine-tuning?

Wondering what the difference is between RLHF and fine-tuning? Fine-tuning is actually the first step of the RLHF process. Continue reading.

Reinforcement learning from human feedback (RLHF) has proven to be an effective way to align an underlying model with human preferences. This technique, which builds on fine-tuning, has played a key role in recent advances in artificial intelligence, as shown by the success of OpenAI's ChatGPT and Anthropic's Claude.

RLHF brings subtle but important improvements to a model's usability and performance, including a better tone, reduced bias and harmful output, and the ability to generate domain-specific content. This article delves into how RLHF is applied when fine-tuning large language models (LLMs).

Understanding reinforcement learning from human feedback

RLHF arose from a fundamental challenge in reinforcement learning: for many tasks, the goals are complex, ambiguous, and difficult to specify. This leads to a misalignment between our values and the objectives that RL systems optimize, as highlighted in the paper Deep Reinforcement Learning from Human Preferences.

Many AI applications, especially in the enterprise, face goals that are hard to specify. In content moderation, for example, fine-grained policy context can conflict with algorithmic enforcement decisions. Likewise, in content generation, such as automated support agents, it is hard to pin down what the best-quality output is. Generative AI can make content creation cost-effective, but concerns about brand style and consistency of tone hold back widespread adoption: how can a team define a reward function that stays aligned with its brand guidelines? In situations where the risks of AI-generated content are high, opting for a deterministic, rule-based chatbot or a human support agent may be the sounder investment.

In traditional reinforcement learning, an explicit reward function can guide the algorithm. In more complex tasks, however, determining an appropriate reward function can be challenging. In such cases, human preferences can effectively guide the AI system toward the right decisions, because people, even without specialized knowledge, have the intuitive understanding needed to navigate complex, situational tasks. For example, given a sample of a brand's marketing copy, an individual can easily assess how well AI-generated copy matches the brand's intended tone. The main challenge, however, lies in the time and cost required to incorporate human preferences directly into the reinforcement learning training process. As stated in the Deep Reinforcement Learning from Human Preferences paper: "Direct use of human feedback as a reward function is prohibitively expensive for reinforcement learning systems that require hundreds or thousands of hours of experience".

To address this challenge, the researchers introduced reinforcement learning from human feedback (RLHF), which involves training a reward predictor or preference model to estimate human preferences. Utilizing a reward predictor significantly improves the cost-effectiveness and scalability of the process compared to providing human feedback directly to the RL algorithm.
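As a minimal sketch of the idea (the function and variable names here are illustrative, not from the paper's code), the reward predictor assigns a scalar score to each candidate behavior, and a Bradley-Terry style model turns a recorded human choice between two candidates into an ordinary classification loss:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_a, score_b, human_chose_a: bool):
    """Bradley-Terry style objective: model P(a preferred over b) as
    sigmoid(score_a - score_b) and fit it to the human's recorded choice."""
    logit = score_a - score_b                       # scalar tensor
    target = torch.tensor(1.0 if human_chose_a else 0.0)
    return F.binary_cross_entropy_with_logits(logit, target)
```

Training the reward predictor on many such judgments lets the RL algorithm query the predictor instead of a human, which is what makes the approach scale.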

RLHF Process: Insights from OpenAI

Improving Large Language Models with RLHF

RLHF is a powerful tool for improving the usefulness and accuracy of large language models and for reducing harmful bias. A comparison of GPT-3 and InstructGPT (a model fine-tuned with RLHF) by OpenAI researchers showed that annotators "significantly prefer" InstructGPT's output. InstructGPT also improved on GPT-3 in evaluations of truthfulness and toxicity. Anthropic documented similar benefits in a 2022 research paper, noting that RLHF dramatically improves both helpfulness and harmlessness compared with simply scaling up the models. Together, these results make a strong case for using RLHF to pursue a range of business goals with large language models.

Let's explore the RLHF workflow for fine-tuning.

Step 1: Collect demonstration data and train a supervised policy

To start fine-tuning a large language model (LLM), the first step is to collect a dataset of demonstration data. This dataset contains prompts and their corresponding completions, representing the desired behavior of the fine-tuned model. For example, in an email-summarization task, the prompt could be the full email and the completion a two-sentence summary. In a chat task, the prompt could be a question and the completion the ideal answer.
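As an illustration (the records and field names below are hypothetical, not a required schema), demonstration data is often just a list of prompt/completion pairs:

```python
# A minimal, hypothetical demonstration dataset: each record pairs a
# prompt with the completion we want the fine-tuned model to produce.
demo_data = [
    {
        "prompt": "Summarize this email in two sentences:\n<full email text>",
        "completion": "The client approved the Q3 budget. They expect a revised timeline by Friday.",
    },
    {
        "prompt": "Customer question: How do I reset my password?",
        "completion": "Go to Settings > Security, choose 'Reset password', and follow the emailed link.",
    },
]
```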

Demonstration data can be collected from various sources, such as existing data, annotation teams, or even the model itself, as in self-instruction approaches where a model generates its own instruction data. According to OpenAI's fine-tuning guidelines, a few hundred high-quality examples are usually needed for successful fine-tuning, and model performance tends to scale roughly linearly with dataset size. As researchers at OpenAI suggest, demonstration datasets should be manually reviewed to ensure accuracy, avoid harmful content, mitigate bias, and provide helpful information.

Platforms like OpenAI and Cohere provide detailed guides on fine-tuning large language models using supervised learning.
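As a rough sketch of this supervised step (not any platform's actual fine-tuning API; the base model, learning rate, and the hypothetical demo_data from above are placeholders), fine-tuning amounts to standard next-token prediction on the concatenated prompt and completion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Supervised fine-tuning sketch: train a causal LM on prompt+completion
# text with the usual next-token prediction loss. "gpt2" stands in for
# whatever base model is being fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for example in demo_data:  # demo_data as sketched above
    text = example["prompt"] + "\n" + example["completion"]
    inputs = tokenizer(text, return_tensors="pt")
    # The model shifts the labels internally, so labels = input_ids.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice this would use batching, multiple epochs, and loss masking on the prompt tokens, but the core objective is the same.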

Step 2: Collect comparison data and train the reward model

Once a large language model has been fine-tuned with supervised learning, it can generate task-specific completions on its own. The next stage of the RLHF process gathers human feedback in the form of comparisons between the completions the model generates. This comparison data is then used to train a reward model, which in turn is used to optimize the supervised fine-tuned model via reinforcement learning (as described in Step 3).

To generate comparison data, an annotation team ranks multiple completions generated by the model for the same prompt, ordering them from best to worst. The number of completions can range from a simple side-by-side pair to a set of three or more. During the fine-tuning of InstructGPT, OpenAI found it effective to show annotators between 4 and 9 completions to rank.
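As a minimal sketch (the helper below is illustrative, not part of any labeling tool), a ranking of K completions expands into K*(K-1)/2 preferred/rejected pairs, and each pair can then be scored with a pairwise preference loss like the one sketched earlier to train the reward model:

```python
from itertools import combinations

def ranking_to_pairs(ranked_completions):
    """Expand an annotator ranking (best first) into (preferred, rejected)
    pairs; a ranking of K completions yields K*(K-1)/2 pairs."""
    return list(combinations(ranked_completions, 2))

# Example: a ranking of 4 completions yields 6 training pairs.
pairs = ranking_to_pairs(["best", "good", "mediocre", "worst"])
```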

There are third-party vendors or tools that can help with the comparison task, either by directly uploading model completions, or through model endpoints for real-time generation.

Evaluating fine-tuned LLMs against benchmarks is crucial for assessing their truthfulness, helpfulness, bias, and toxicity. Standard LLM benchmarks can be used, such as TruthfulQA for truthfulness, the Bias Benchmark for QA (BBQ) for bias, and RealToxicityPrompts for toxicity.

Step 3: Optimizing the Supervised Policy Using Reinforcement Learning

In this step, the supervised fine-tuned LLM, which serves as the baseline policy, is further optimized using a reinforcement learning (RL) algorithm. A notable RL algorithm developed by OpenAI is Proximal Policy Optimization (PPO); details about PPO can be found on OpenAI's website.

The reinforcement learning process aligns the behavior of the supervised policy with the preferences expressed by the annotators. Through iterations of steps 2 and 3, the performance of the model can be continuously improved.
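As a rough sketch of the objective being maximized in this step (the coefficient and names are illustrative, following an InstructGPT-style setup rather than any specific library), the policy is rewarded by the learned reward model while a KL penalty keeps it close to the supervised baseline:

```python
def rlhf_reward(rm_score, policy_logprob, sft_logprob, beta=0.02):
    """Per-completion RL reward: the reward model's score minus a KL
    penalty that discourages drifting far from the supervised (SFT) policy.
    `policy_logprob` and `sft_logprob` are the summed token log-probs of
    the completion under the RL policy and the SFT baseline."""
    kl_estimate = policy_logprob - sft_logprob
    return rm_score - beta * kl_estimate
```

PPO then updates the policy to increase this reward while limiting how far each update can move the policy from its previous version.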

That is the workflow for fine-tuning a large language model with RLHF. By combining supervised learning and reinforcement learning, RLHF brings the model more in line with human preferences and intentions, improving its usability, performance, and quality. This approach has played a key role in the success of models such as ChatGPT and Claude, and has shown great potential for achieving a variety of commercial goals.

It should be noted that RLHF is not limited to fine-tuning large language models; it can also be applied to other fields and tasks, such as recommender systems and robot control. By combining human feedback with reinforcement learning, RLHF offers a powerful way to address the difficulty of defining reward functions for complex tasks, improving the performance and adaptability of AI systems.

Source: blog.csdn.net/robot_learner/article/details/131280499