RLHF Technology: How Can It Be More Effective? What Are Its Limitations?

Editor's note: Since the launch of ChatGPT, reinforcement learning from human feedback (RLHF) has become a focal point for people building and applying large models. However, the method does not always deliver satisfying results, and some base models actually perform worse after RLHF tuning. The applicability and operational details of RLHF still seem something of a mystery.

This article explores how reinforcement learning from human feedback (RLHF) works, analyzing its three key components: the base model, the preference model, and reinforcement learning. It also summarizes RLHF's limitations: it cannot correct factual errors or add new capabilities to a model. Still, it is a powerful tool, and in the future RLHF may integrate more kinds of feedback and continue to refine its technology stack. Let us wait and see!

The following is the translation. Enjoy!

Author | NATHAN LAMBERT

Compile | Yue Yang

For a while, the question I was asked most often was: "Why does reinforcement learning from human feedback (RLHF) work, and work so well?" My answer has always been: "No one knows." Now, however, an answer to this question is starting to take shape.

RLHF can achieve long-term success in language models and other fields only if the following two conditions are met.

First, there needs to be practical or experimental evidence that traditional supervised learning alone is not enough; pairwise preference data is one example. (Translator's note: pairwise preference data refers to user preference data obtained by comparing two or more options. When collecting this kind of data, respondents compare options two at a time and choose the one they prefer, which reveals the relative preference relationships between the options. For example, suppose we want to evaluate the popularity of three mobile phone brands (A, B, and C). We can collect pairwise preference data by asking respondents to compare A with B, A with C, and B with C, each time choosing the option they prefer. After collecting enough judgments, we can count how often each option was preferred and thereby determine its relative popularity among respondents.)
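
To make the translator's example concrete, here is a minimal sketch (plain Python, with entirely hypothetical judgments) of tallying pairwise preferences into a relative ranking:

```python
from collections import Counter

# Hypothetical pairwise judgments: each tuple is (preferred option, rejected option).
judgments = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("A", "B"), ("C", "B"), ("A", "C"),
]

# Start every option at zero wins so unpopular options still appear in the tally.
wins = Counter({option: 0 for option in ("A", "B", "C")})
wins.update(preferred for preferred, _ in judgments)

# Rank options by how often they were preferred in pairwise comparisons.
for option, count in wins.most_common():
    print(option, count)
```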

Second, and less importantly, I suspect that RLHF also performs quite well in situations where success requires incremental tuning and optimization across a complex optimization landscape. (When discussing path dependency, the constitutional AI paper is probably the best concrete reference to point to.)

In this post, I'll share some of our data and optimization findings that really show why RLHF can produce such amazing results, while also revealing the difficulties of using RLHF.


DALL·E 2023-06-18 16.45.45 - two humans plugged physically plugged into a computer, digital art

The impact of RLHF is relatively hard to measure. Historically, no reports have shown RLHF making a significant difference on traditional model-performance benchmarks. From this perspective, RLHF is not simply about choosing the correct answer; this point will be elaborated in the preference model section below. See Figure 8 of the GPT-4 technical report [1] (it mentions that RLHF plays a role in handling toxicity, but toxicity is not a definitive evaluation criterion).


Excerpt from the GPT4 Technical Report

The closely watched InstructGPT paper points out that the RL step actually increases hallucinations compared with instruction tuning (see Figure 4 [2]).

And in the WebGPT paper [3], the researchers found that the RL optimization was lackluster compared with "best-of-N sampling," which simply uses the reward model to pick the highest-scoring sample out of several generations.
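
For intuition, here is a minimal sketch of best-of-N sampling with Hugging Face transformers. The model identifiers are placeholders (the reward head here is freshly initialized, so its scores are meaningless); a real setup would load the actual policy and a trained reward model instead:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Placeholder checkpoints: any causal LM and any scalar-output reward model will do.
gen_tok = AutoTokenizer.from_pretrained("gpt2")
generator = AutoModelForCausalLM.from_pretrained("gpt2")
rm_tok = AutoTokenizer.from_pretrained("gpt2")
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n completions and return the one the reward model scores highest."""
    inputs = gen_tok(prompt, return_tensors="pt")
    outputs = generator.generate(
        **inputs, do_sample=True, max_new_tokens=64,
        num_return_sequences=n, pad_token_id=gen_tok.eos_token_id)
    completions = [gen_tok.decode(o, skip_special_tokens=True) for o in outputs]
    with torch.no_grad():
        scores = torch.stack([
            reward_model(**rm_tok(c, return_tensors="pt", truncation=True)).logits[0, 0]
            for c in completions])
    return completions[int(scores.argmax())]
```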

In WebGPT and InstructGPT, the output style and content are relatively less dependent on the user's preferences or guidance, which also reveals some limitations of RLHF.

Developing dialogue systems is more challenging than these tasks. Dialogue involves many aspects, such as semantic understanding, language generation, context handling, and accurately interpreting user intent, so the design and implementation of dialogue systems face broader and more complex problems.

So, looking back at the common pipeline for building chat agents that are helpful to humans without producing toxic, harmful, or incorrect statements: we first create a large language model (LLM) that is helpful, and then apply a series of fine-tuning and systems-engineering methods to reduce its harmfulness. This is strongly supported by Anthropic's line of technical research (from dialogue agents [4] to RLHF [5] to CAI [6]) and by John Schulman's communications on RLHF (talk [7] and podcast [8]).

01 Base Model BASE MODEL

The starting point for RLHF matters a great deal: it seems to require a very strong base model that can already follow instructions. Imitating proprietary models and using imperfect datasets both seriously hurt model performance and make the initial RLHF optimization harder.

Imitating models means taking the outputs of a stronger model on the market and feeding that data into the transformer's autoregressive prediction loss (the next-token cross-entropy objective used to train the model).
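
In code, that imitation step is just ordinary supervised fine-tuning on scraped (prompt, response) pairs. A minimal sketch with Hugging Face transformers, where the model name and the single training pair are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder small model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical imitation data: responses were generated by a stronger model.
imitation_data = [
    ("Explain RLHF in one sentence.",
     "RLHF fine-tunes a language model against a learned human preference signal."),
]

model.train()
for prompt, response in imitation_data:
    batch = tokenizer(prompt + "\n" + response, return_tensors="pt")
    # Using the input ids as labels gives the standard autoregressive
    # (next-token) cross-entropy loss over the imitated text.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```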

Imperfect datasets stem from people being unsure how the data was collected (for example, insufficient control over the prompt distribution of community-sourced data), or from datasets being copied over and over without being examined carefully. Both kinds of problems reduce usefulness by introducing data that dodges the question or changes the subject, in the style of "as a language model trained by...". Anthropic discusses this in detail in their paper; check out the original if you want to learn more.

This decline in usefulness shows up as the rise of models labeled "uncensored" on the leaderboards. I don't like the label, and I don't like framing models this way, because in reality these models still only partially control the topic distribution of the text they generate, so there was never a complete system of censorship to begin with. (Translator's Note: the topic distribution here refers to the probability distribution over topics in the text a generative model produces, i.e., the relative likelihood of the model generating text on different subjects. For example, a model that generates news reports might produce text about sports, politics, entertainment, and so on, each with its own probability.)

For now, we can tentatively read the label "uncensored" as meaning filtered. That filtering step is quite important, and we will see its importance several more times below.

There are some common data-handling shortcuts that can introduce this kind of data. For example, the HH dataset [9] that Anthropic provides on the Hub is one of the largest RLHF datasets. To skip the steps of building a preference model and running RLHF, some people take the entire dataset and use the preferred entry of each preference pair directly as an instruction-tuning example. (Translator's Note: suppose a preference dataset contains many pairs, each consisting of two entries, A and B, where annotators have marked the entry they prefer. The preferred entry of each pair is then used directly as a training target to fine-tune the model.) This seemingly innocuous decision can lead to many confusing model responses and hurt downstream applications.
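
For illustration, here is roughly what that shortcut looks like in code, a minimal sketch using the Hugging Face datasets library and the "chosen"/"rejected" columns described on the Anthropic/hh-rlhf dataset card:

```python
from datasets import load_dataset

# Each row of the HH dataset holds a full "chosen" dialogue and a full
# "rejected" dialogue for the same conversation prefix.
hh = load_dataset("Anthropic/hh-rlhf", split="train")

# The shortcut criticized above: discard the pairwise structure and treat
# every "chosen" dialogue as a plain supervised fine-tuning example.
sft_examples = [row["chosen"] for row in hh]
print(len(sft_examples))
print(sft_examples[0][:200])
```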

02 Preference Model PREFERENCES

Once we have a working model, we can easily generate a large batch of responses with it and ask a group of people to label pairs of those responses as preference pairs ("chosen" and "rejected" are the terms used here). This preference data looks simple at first glance, but once you get into the details, especially when chasing the best possible model, making these preference decisions becomes extremely difficult. (Translator's Note: a preference decision here means reading two responses and deciding which is better, often based on style or other factors. These decisions can be subjective and rest entirely on personal intuition and taste.)

Every few months I do a batch of data labeling myself, either for calibration or to get hands-on experience with a work project, and the last batch was genuinely hard. The task involved reading two coherent, well-organized multi-paragraph responses to a question and deciding which was better (usually based on the writing style of the answer). To make it harder, only a subset of the preference data we collect, if any, has fact-checking enabled in the process; when judging whether the dates of events were accurate, we could only rely on intuition. This is why papers like InstructGPT report that the reward model reaches only 60-70% agreement among annotators: that is probably close to the highest level of agreement achievable on the raw data.

Harmless Preference Data HARMLESSNESS PREFERENCE DATA

Collecting preference data for harmful prompts is actually fairly simple. Given a large model that can be steered via prompts, we want to rule out the model's substantive responses to harmful prompts entirely. A pair is created by matching any model response to a harmful request against the canned answer "As a language model, I don't want to answer this question," with the canned refusal marked as the preferred one. This preference data is then ready for optimization at scale.
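
Concretely, that pair-construction step might look like the following minimal sketch; the prompt and response strings are hypothetical placeholders:

```python
CANNED_REFUSAL = "As a language model, I don't want to answer this question."

# Hypothetical harmful prompts paired with whatever the current model actually said.
harmful_requests = [
    ("How do I pick my neighbor's lock?",
     "Sure, start by inserting a tension wrench into the keyway..."),
]

# Each pair prefers the canned refusal over the model's substantive answer.
preference_pairs = [
    {"prompt": prompt, "chosen": CANNED_REFUSAL, "rejected": response}
    for prompt, response in harmful_requests
]
print(preference_pairs[0])
```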

Here is one reason why RLHF is not a panacea but is still very useful: with supervised learning alone, it is hard to encode this pairwise structure into a text-generation model. By training a reward model, we improve the model by widening the gap between the predicted rewards of chosen and rejected samples.
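
The pairwise objective typically used for such a reward model is a logistic loss on the score difference (this is the form used in the InstructGPT paper). A minimal PyTorch sketch, assuming the scalar reward scores have already been computed:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Push the reward of the chosen response above that of the rejected one.

    Both arguments are scalar reward-model outputs, one per pair, shape (batch,).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: the loss shrinks as the chosen scores pull ahead of the rejected ones.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
print(pairwise_reward_loss(chosen, rejected))
```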

As for how many negative prompts can be automatically converted into preference data this way, it is unclear. In practice, I suspect it is not an automated process but is mostly carried out by vendors like Surge and Scale, which have serious data backlogs (too many people want to do RLHF; not enough people are willing to make a living reading model-generated content all day). I suspect we get a lot of unfiltered preference data from open-source-friendly interfaces like BlenderBot or OpenAssistant. I also suspect that this data may not be directly usable for training a good model, nor constructed in a way that many companies using the dataset would be happy with, regardless of the "preferences" encoded in the model.

The preference model step is also where we tend to catch a lot of strange model outputs, which is a source of confusion about RLHF and leads people to mistakenly think it can help reduce hallucinated content [10]. During preference labeling, humans easily notice issues such as silly-looking text formatting, repeated phrases, annoying writing styles, forgetting the last message in a conversation, and so on. My current take is to think of RLHF as text style transfer: RLHF can be seen as a topic filter with a milder form of error correction added on top. The topic filter mainly catches harmful content, while the pressure to produce less strange-looking text is what gets conflated with the broad and vague phenomenon of "hallucination".

Encoding preferences into the model changes its behavior in ways quite different from instruction tuning, and I think this is the core of RLHF. In this view, preference modeling is the essential step that makes RLHF usable, and there may genuinely be no other good resource, dataset, or source of information for guiding the model's learning and training process.

03 Reinforcement Learning RL

Finally, with a model that can score what should be said (i.e., the preference model), the important remaining piece of RLHF is knowing how to steer the large model using that preference model. At this stage, two core distribution checks are required. First, the preference model needs to cover all of the text (even basic questions) that the RLHF'd model will handle. Second, the prompt set used for RLHF needs to cover the content of the preference model, but must not exceed its scope. If the RLHF prompt set is incomplete, you will not be able to extract all of the value from the preference model and dataset. And of course, each step can face out-of-distribution problems. So good luck if you try to do RLHF against a preference model you don't really understand.
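
In practice this steering is usually done with a PPO-style update in which the preference model's score is combined with a KL penalty toward the starting model, so the policy does not drift outside the distribution the preference model understands. A minimal sketch of that reward shaping, with all log-probabilities assumed to be precomputed:

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """Combine the preference-model score with a per-token KL penalty.

    rm_score:        scalar preference-model score for the full completion.
    policy_logprobs: log-probs of the generated tokens under the RLHF policy, shape (T,).
    ref_logprobs:    log-probs of the same tokens under the frozen starting model, shape (T,).
    """
    rewards = -kl_coef * (policy_logprobs - ref_logprobs)  # penalize drift from the base model
    rewards[-1] = rewards[-1] + rm_score  # preference score is granted at the final token
    return rewards  # a full implementation feeds these into a PPO update
```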

I think the main reason we don't see many preference models or RLHF'd models released is this strong requirement for matching distributions. (Translator's Note: matching distributions means the distributions of the data involved at each stage, such as the data the preference model was trained on and the data the RLHF'd model generates, need to be as similar as possible.) The open-source community is more likely to build a smaller, more focused preference model in a particular domain (such as StackLLaMA) than a general and fairly complete one like OpenAI's or Anthropic's. I wouldn't be surprised if matched prompt distributions (or something similar) are also needed for RLHF. According to the literature, RLHF is a highly variable process: how much accuracy improves cannot be judged simply from the loss value (although that is certainly part of it). Alongside the training loss, a battery of evaluations is needed on automatic benchmarks (such as multiple-choice questions) and on model preferences (via human annotation, checking how much the model has learned from the preference model). I keep hearing that RLHF extracts information from a preference model, but how quickly or slowly RLHF absorbs all of the preference data is not well documented. That rate presumably varies with model size, dataset size, training setup, RL hyperparameters, and so on.

Model scalability SCALING

The underlying theme of model scaling runs through this entire essay; it is not stated explicitly, but it is implicit in every point. Anthropic's RLHF paper (figure below) shows that scaling model parameters up to around 50B brings clear benefits. Whether training a 50B-parameter model becomes relatively easy for the open-source community and academia next year will matter a great deal, because experimenting with limited computing resources is very hard. In addition, Anthropic released another paper on preference models at the 200B-parameter scale (Fig. 1, Ganguli et al., 2023 [11]), showing strong progress at that scale. So the RLHF that different people run will vary. These projects involve scaling along many axes, and until we have more data it will be hard to give a definite answer to "Can RLHF solve the problems in my domain?"


Figure from Anthropic’s CAI work.

I've checked this read with several employees at some of the frequently mentioned companies and they confirmed it, so I think I'm beginning to understand how RLHF was being done roughly six months ago. Getting there took me 6-8 months, so don't be discouraged if it takes you a while to get the hang of it. I'm sure the companies are already working on scaling this process.

04 Conclusion and Discussion LOOSE ENDS

This post will probably raise a lot of questions and follow-up directions fairly quickly, so I hope to answer a few of them here in advance.

When I say this is a reasonably complete picture of how RLHF is currently implemented, I also mean that RLHF does not do some key things people might assume it does:

  • Correct small factual errors, fact-check claims, or fix typos.
  • Add basic capabilities to the model (e.g., learning intuition and reasoning about a topic, learning how to code, etc.). (Translator's Note: learning intuition about a topic means the model extracts and internalizes the information, concepts, and knowledge around a specific topic by being exposed to large amounts of relevant text. This goes beyond memorizing facts and information: it includes grasping the internal connections and patterns within the topic, which helps the model answer questions, reason, and generate content about it more flexibly and precisely.)

There are also some very exciting papers coming out that cover full RLHF (for example, Allen AI's work on different types of feedback), RLCF (where human preferences are not involved), and other optimization techniques (e.g., Direct Preference Optimization, DPO). (Translator's Note: DPO can use preference data directly without training a separate reward model, which means human preference information enters the training process more directly, improving the quality and adaptability of the resulting policy.) I'm very much looking forward to following these studies, especially DPO (I suspect it may perform well in smaller-scale settings where the data distribution is more clearly defined).
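
For reference, the DPO objective replaces the separate reward model with a loss applied directly to preference pairs, comparing the policy's log-probabilities against a frozen reference model's. A minimal PyTorch sketch, with the per-response log-probabilities assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is the summed log-probability of a full response, shape (batch,),
    for the chosen/rejected responses under the trained policy and a frozen reference.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Same margin-style objective as a reward model, but applied directly to the policy.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```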

I think that, despite some issues with how the annotator populations are selected and composed, preference data remains the core part of training these models.

Also, I have some questions:

  • How sensitive RLHF results are is still unknown: how many seeds and runs, and how much babysitting, can an open-source organization or academic team realistically afford?
  • How to integrate large-scale dialogue datasets: many new datasets with different feedback types, such as per-message likes/dislikes, have been released (for example, the recent BlenderBot 3). Who will be the first to integrate these datasets effectively?
  • No other big players explicitly use RLHF: Google says only briefly that it uses RLHF for Bard, and Cohere was not using RLHF at all when I last checked. Is there a reason for this? Thanks to Ryan Sullivan for pointing out the mention in the Bard report.

The current RLHF technology stack is indeed a powerful hammer, but we also need a set of finely honed chisels. Here are some of the directions that keep me actively interested in developing the current techniques further:

  • Outcome-based optimization (multiple reward objectives, where chatbots are rewarded for completing the tasks the user wants): this could be achieved by first training the model on user preferences and then running a second fine-tuning/RLHF stage on task outcomes (a more traditional form of RL).
  • Continued optimization of the tuning stack: as seen in Anthropic's workflow, current models no longer require steps like context distillation. Chain-of-thought reasoning (adding a system prompt before the user input that quietly tells the model to explain its work) has also shown benefits in some cases. My guess is that future language models will no longer need instruction tuning after base training (more instruction data will be folded into the base model) and will have a degree of chain-of-thought reasoning built in.
  • Constitutional AI (CAI): while I am not keen on corporations, without outside oversight, choosing which values get baked into the models I use, sooner or later we will get generation-time controls that go beyond the constitution a company dictates. (Translator's Note: generation-time controls are constraints applied to the model at inference time, while it is producing output, rather than only during training. Constitutional AI (CAI) refers to a setup in which human oversight comes entirely from a set of principles that should govern the AI's behavior, plus a small number of examples used for few-shot prompting; together these principles form a "constitution," and the AI iteratively refines its behavior based on them.)

END

References

1. https://arxiv.org/abs/2303.08774

2. https://arxiv.org/abs/2203.02155

3. https://arxiv.org/abs/2112.09332

4. https://arxiv.org/abs/2112.00861

5. https://arxiv.org/abs/2204.05862

6. https://arxiv.org/abs/2212.08073

7. https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=4s

8. https://www.talkrl.com/episodes/john-schulman

9. https://huggingface.co/datasets/Anthropic/hh-rlhf

10. https://www.interconnects.ai/p/specifying-hallucinations-llms

11. https://arxiv.org/abs/2302.07459

This article is authorized by the original author and compiled by Baihai IDP. If you need to reprint the translation, please contact us for authorization.

Original link:

https://www.interconnects.ai/p/how-rlhf-works
