How to effectively label RLHF data?

Editor's note: With the widespread application of large language models in natural language processing, reinforcement learning from human feedback (RLHF) has become an important technical challenge. RLHF requires a large amount of high-quality manual data annotation, which is a very laborious process.

The author of this article has extensive experience in the field of data labeling. Here he discusses in depth the key issues related to data labeling in the RLHF process. He first introduces the basic elements of data annotation, such as task decomposition and quality control, and then compares the key differences between supervised fine-tuning and human preference feedback, including data volume and task design.

Drawing on his own experience, the author dissects the challenges of data labeling in the RLHF process and gives many concrete suggestions. Finally, he summarizes three key takeaways and looks ahead to future research directions. The article is accessible and easy to follow; it will help readers understand the data labeling problems in the RLHF process and should be a useful reference for AI researchers and practitioners.

The following is the translation, enjoy!

Author | Dr. Dmitry Ustalov

Compiled by | Yue Yang

Table of contents

01 Introduction

02 What to Annotate?

03 A Quick Tour of the Basics of Data Labeling

3.1 Data labeling using crowdsourcing

3.2 Core Elements of Data Labeling

04 Data Annotation for Supervised Fine-Tuning

4.1 How to get text data?

4.2 How to label text data?

05 Data Annotation for Human Preferences

5.1 Ranking schemes for annotated responses

5.2 How much data needs to be labeled?

5.3 Structure of the reward model

06 Conclusion

This article is based on a tutorial given by Dmitry Ustalov together with Hugging Face's Nathan Lambert [1] at ICML 2023 [2], focusing on data annotation for RLHF.

For the full content of the study, please visit: https://doi.org/10.5281/zenodo.8186168


01 Introduction

Who wants to do reinforcement learning from human feedback?

The entire room raises their hands.

Great! And who wants to annotate the data to obtain that human feedback?

Only five, maybe ten hands remain.

I think data labeling is very important, although perhaps not everyone wants to do it. We all know that RLHF stands for Reinforcement Learning from Human Feedback, and both parts are important. Distributed training alone is not enough for large language models (LLMs) if the underlying dataset is not of high quality and does not represent well what we want from the model.

We want LLMs to be helpful, harmless, and honest. But the computer does not yet know what these words mean, so human feedback is needed to steer the model in the desired direction, to evaluate its output, and to avoid the tedious reward engineering of reinforcement learning (RL) [3] (Translator's Note: similar to feature engineering).

02 What to Annotate?

Knowing deep learning, we can train models. But asking about the best way to train an LLM leads to a lot of interesting discussion: opinions vary on what LLMs are capable of and how much effort it takes to make them performant. Two classic cases are given below. Interestingly, both come from the same company, Meta:

  • In a recently published preprint, Zhou et al. [4] (2023) proposed the Superficial Alignment Hypothesis. They believe that the model already has all the required knowledge, and the user only needs to define and specify the format of the model's inputs and outputs. Therefore, complex data labeling schemes are not required, and a small number of instructions is enough to complete the task. If this hypothesis holds, it would be a major breakthrough; let us wait and see whether the paper is accepted.
  • In contrast, in the paper proposing Llama 2, Touvron et al. [5] (2023) claim that "the superior writing power of LLMs is fundamentally driven by RLHF", suggesting a considerable data labeling workload.

Now look at three more examples and see whether you can spot a pattern. First, the now-famous training overview diagram from OpenAI's InstructGPT (Ouyang et al. [6], 2022); then a similar diagram from Anthropic's Claude paper (Bai et al. [7], 2022); and last but not least, the diagram provided in Meta's Llama 2 paper (Touvron et al. [5], 2023).


Training overview of InstructGPT (Ouyang et al., 2022), Claude (Bai et al., 2022), and Llama 2 (Touvron et al., 2023).

There are three common steps in these three diagrams:

First, we pre-train the model on a large collection of documents called a text corpus, enabling it to predict the next word at inference time (next word prediction). This step does not require data labeling.

We then perform supervised fine-tuning (SFT) on a smaller set of well-written question instructions and responses.

Finally, human preferences are used to improve the behavior of the model (via reward modeling and reinforcement learning).

While each step is relatively intuitive, it's the little things that make the difference. How much data labeling is required? How many iterations are required? What type of text is required? What kind of expertise is required? It's all about design decisions.

We need fast, accurate, and large-scale human review and adjustment through text and scores. For supervised fine-tuning, we can use synthetic, crawled or labeled data. For the reward model, we need access to human preference data.

03 A Quick Tour of the Basics of Data Labeling

Before we move on to the importance of data in RLHF (a combination of reinforcement learning and supervised learning), a brief introduction to the basics of data labeling is needed.

Who will label the data? There are many options:

  • Trained Professional Annotators (Experts)
  • Crowdsourced non-professional annotators (crowdsourcing)
  • Pretrained Machine Learning Models
  • Any combination of the above three methods

No matter who labels the data, we must design corresponding labeling instructions and quality control methods.

3.1 Data annotation using crowdsourcing

I will focus on crowdsourcing as the data labeling method because it is the most scalable and generalizes well to other types of data labeling. I started researching crowdsourcing as a data annotation method in 2012 and worked for four years on one of the largest data annotation platforms. Based on my experience, I firmly believe that the core challenge of data annotation is making the task understood in the same way by the annotator and the requester.

Usually, data annotation is performed on a specific data annotation platform. There are on-premise platforms such as Label Studio [8], CVAT [9], and Prodigy [10], as well as hosted platforms that provide managed services, such as Mechanical Turk [11], Scale [12], and Toloka [13]. The advantage of on-premise platforms is that requesters can deploy them on their own infrastructure and adapt them to their needs. The benefit of managed platforms is that they provide convenient labeling tools and payment functionality out of the box.

3.2 Core Elements of Data Labeling

An excellent data labeling project has six core elements: 1) decomposition of the labeling task into smaller, more specific subtasks (decomposition); 2) clear and detailed instructions for labelers (instructions); 3) the interface annotators work in (task interface); 4) the methods and strategies used to monitor and evaluate annotation quality (quality control); 5) how consistently annotators label across time and tasks, and how well they understand the guidance (annotation reliability); 6) the trade-off between speed and cost. The most important are items 1) and 2).


Six Core Elements of a Successful Data Labeling Project

Task decomposition means breaking the original complex task into a series of smaller and simpler subtasks. These subtasks are distributed among many annotators so that each subtask is answered by two or more different annotators.

The main advantage of breaking tasks down is that it makes extremely difficult problems tractable. For example, what if we wanted to add bounding-box annotations to an image dataset? We could use a task sequence with two subtasks: first we ask an annotator to draw a bounding box, and then we ask other annotators to indicate whether the bounding box was drawn correctly.
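As a concrete illustration of this decomposition, here is a minimal sketch, using hypothetical data structures rather than any real platform's API, of fanning one drawn bounding box out to several verification subtasks.

```python
# A minimal sketch (hypothetical structures, not a real platform's API) of the
# two-subtask decomposition described above: stage one asks an annotator to
# draw a bounding box, stage two asks several other annotators to verify it.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DrawTask:
    image_url: str

@dataclass
class VerifyTask:
    image_url: str
    box: Tuple[int, int, int, int]  # (x, y, width, height) from stage one

def verification_tasks(drawn: List[Tuple[DrawTask, Tuple[int, int, int, int]]],
                       overlap: int = 2) -> List[VerifyTask]:
    """Fan each drawn box out to `overlap` annotators so every subtask is
    answered by two or more different people."""
    return [
        VerifyTask(task.image_url, box)
        for task, box in drawn
        for _ in range(overlap)
    ]

# Example: one image passed through stage one, now queued for verification.
stage_one_output = [(DrawTask("https://example.com/cat.jpg"), (10, 20, 100, 80))]
print(verification_tasks(stage_one_output, overlap=2))
```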

A good task instruction should include the goals of the labeling task, an introduction to the annotation interface, the required steps, examples of good and bad behavior, instructions for handling rare and non-obvious cases, and relevant references. It is very difficult to write good instructions on the first try; see "No Vehicles in the Park" [14], an excellent illustration of this problem.

In general, if the task decomposition is done well, the instructions and interface will be simpler, annotators will complete the task with higher quality, the methods for monitoring and evaluating annotation quality will work reliably, and labeling cost will be easy to control and optimize.

Now let's come to the main content of this article.

04 Data Annotation for Supervised Fine-Tuning

During initial model training and supervised fine-tuning, the model is fed text and learns to predict the next word given the context. Usually, these texts come from publicly available corpora such as Common Crawl [15], RefinedWeb [16] (Penedo et al. [17], 2023), and The Pile (Gao et al., 2020 [18]). So, how do we get good prompts and answers?

Some companies have identified the appropriate prompt types and their ratios. For example, OpenAI's InstructGPT (Ouyang et al., 2022 [6]): text generation (45.6%), open question answering (12.4%), brainstorming (11.2%), chat (8.4%), text rewriting (6.6%), summarization (4.2%), closed question answering (2.6%), text classification (3.5%), other types (3.5%), and keyword extraction (1.9%). These prompt types and their proportions were not chosen at random: the OpenAI team constructed the SFT dataset in proportion to actual usage by analyzing and labeling logs of GPT-3 user interactions.
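To make the idea of a fixed prompt mix concrete, here is a minimal sketch (illustrative only, not OpenAI's actual procedure; the category names and weights simply mirror the list above) of sampling prompt categories according to those proportions.

```python
# A minimal sketch of drawing an SFT prompt mix according to fixed category
# proportions; the weights mirror the InstructGPT percentages listed above.
import random

CATEGORY_WEIGHTS = {
    "generation": 45.6, "open_qa": 12.4, "brainstorming": 11.2,
    "chat": 8.4, "rewrite": 6.6, "summarization": 4.2,
    "closed_qa": 2.6, "classification": 3.5, "other": 3.5, "extract": 1.9,
}

def sample_categories(n, seed=0):
    """Draw n prompt categories with probability proportional to the weights."""
    rng = random.Random(seed)
    categories = list(CATEGORY_WEIGHTS)
    weights = list(CATEGORY_WEIGHTS.values())
    return rng.choices(categories, weights=weights, k=n)

print(sample_categories(10))
```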

Besides the prompt type, there is the major question of how much data we need. Recently published papers report the following numbers (instruction-following examples):

  • Llama 2 (Touvron et al., 2023[5]): 28K
  • InstructGPT (Ouyang et al., 2022[6]): 15K
  • Alpaca (Taori et al., 2023[19]): 52K
  • Vicuna (Chiang et al., 2023[20]): 70K
  • Dolly (Conover et al., 2023[21]): 15K
  • OpenAssistant (Köpf et al., 2023[22]): 10K+
  • Claude (Bai et al., 2022[7]): 137K + 369K
  • WizardLM (Xu et al., 2023[23]): 624K
  • LIMA (Zhou et al., 2023[4]): 1K

The size of the dataset is not the most important thing; the key is to have high-quality prompts and corresponding answers!

4.1 How to get text data?

How do we get the text data we need? There are three options:

1.  Model-derived datasets: datasets based on user interactions with existing models. While some companies prohibit others from using such data to train commercial competitors, some still do so for research purposes.

2.  Web-based datasets: public data from Reddit, Quora, Stack Exchange, and other online communities. However, it is often unclear whether such data may be used with permission, and additional data cleaning is required.

3.  Crowdsourced datasets: data written by experts or other annotators; the safest but also the most laborious option.

Here is a summary of five popular publicly available datasets along with their prompts and corresponding responses:

  • Dolly [24]: a dataset annotated by in-house experts from the Databricks team, containing 15,000 entries.
  • Alpaca: 175 seed tasks expanded into 52,000 entries via the self-instruct method. This is a model-derived dataset obtained using OpenAI's GPT-3.5 model.
  • WizardLM: the Alpaca dataset complicated and rewritten through a set of rules, resulting in a larger dataset of 624,000 entries.
  • ShareGPT [25]: a browser plugin that downloads ChatGPT conversation data and stores it on a centralized server. The license of this dataset is unclear, but the Vicuna model was trained on a 70K-entry subset of it and performs well.
  • OpenAssistant [26]: an open-source multilingual dataset of prompts and responses annotated by crowdsourcing, together with a model and a data annotation framework that make it reusable. The dataset comes from LAION (the organization whose dataset was used to train Stable Diffusion). But we need to ensure that the interactions of volunteer labelers resemble those used to train state-of-the-art LLMs.

Wherever a model-derived dataset approach is used, we can replace the upstream model with human annotation.

4.2 How to label text data?

Initial prompts should be written by experts or obtained from trusted web corpora, as they are an important part of the supervised fine-tuning (SFT) data. However, the responses to these prompts can safely be annotated by crowdsourcing.

One of my favorite examples of crowdsourcing is Soylent, published in a paper by Bernstein et al. (2010) [27]. The researchers created a plugin for Microsoft Word that, without using any machine learning, performs text shortening very effectively using crowdsourcing. They proposed a general annotation schema called Find-Fix-Verify, which decomposes the raw text shortening task into three subtasks:

  • Find: Given a text sample, find problematic fragments.
  • Fix: Given a text sample and a problematic snippet, write a better snippet.
  • Verify: Determine whether the rewritten snippet is better.

We can take a similar approach to writing responses. In the first step, annotators are asked to write a response to the given prompt. In the second step, verifiers are asked to judge whether the written response is good or bad. Both must receive the same instructions so that writers and verifiers apply the same criteria.

So, instead of solving the seemingly very difficult problem of writing responses, we solve a seemingly simpler one: aggregating binary (yes/no) labels. This problem has been more or less solved by the academic research community (Zheng et al., 2017). For smaller datasets, choose majority voting; for larger datasets containing more than 1,000 responses, use the Dawid-Skene (1979) [28] probabilistic aggregation model.
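Here is a minimal sketch of this aggregation step. It assumes the open-source Crowd-Kit library mentioned at the end of this article; the task/worker/label column convention follows its documentation and may differ between versions.

```python
# A minimal sketch of aggregating binary yes/no verdicts on written responses
# with Crowd-Kit (an assumption: column names may differ between versions).
import pandas as pd
from crowdkit.aggregation import DawidSkene, MajorityVote

# Each row: one annotator's verdict on whether one written response is good.
votes = pd.DataFrame(
    [
        ("response_1", "annotator_a", "yes"),
        ("response_1", "annotator_b", "yes"),
        ("response_1", "annotator_c", "no"),
        ("response_2", "annotator_a", "no"),
        ("response_2", "annotator_b", "no"),
        ("response_2", "annotator_c", "no"),
    ],
    columns=["task", "worker", "label"],
)

# Small datasets: a simple majority vote per response is enough.
mv_labels = MajorityVote().fit_predict(votes)

# Larger datasets (1,000+ responses): Dawid-Skene models per-annotator
# reliability and usually produces cleaner aggregated labels.
ds_labels = DawidSkene(n_iter=100).fit_predict(votes)

print(mv_labels)
print(ds_labels)
```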

Data labeling for supervised fine-tuning is important but difficult to get right. It is all about design decisions. Where do you get the initial prompts from? Are you planning to use synthetic data? Will you involve professional annotators? Which prompts do you need, and how will you aggregate the data? I strongly recommend using pre-annotated golden tasks to evaluate annotators in the verification step described above (golden tasks are still the most effective quality control method in crowdsourced annotation), and paying attention to data licensing issues.

05 Data Annotation for Human Preferences

After training and fine-tuning the initial language model, we polish it to be helpful, harmless, and honest. Unlike the supervised fine-tuning part, the task design for reinforcement learning from human feedback is simpler. Since RLHF requires a large number of labels and we cannot directly turn human opinions into a reward function, we use a reward model to approximate human scoring.

Given a prompt and a corresponding response, the reward model estimates how a human would rate it. We therefore need to design the human labeling task according to the training details of the LLM. If we generate only two responses per prompt, we can stick to a simple binary classification task design. But what if, like InstructGPT, there are more responses? The situation becomes more complicated, and we need to rank and aggregate them.


Pairwise comparison of responses to a given prompt

5.1 Ranking schemes for annotated responses

What ranking schemes are feasible for annotated responses?

  • Pointwise scheme. Given a prompt and responses, provide a single numeric score for each response. Unfortunately, different annotators have different subjective scoring criteria, which complicates further use of the data. Some communities standardize scores in post-processing (Adelani et al., 2022) [29], but this is still not the most reliable way to obtain a ranking.
  • Listwise scheme. Given a prompt and a set of responses, sort them from best to worst (or vice versa). While this approach is intuitive from an annotation perspective, it is unclear how to integrate it into the model training process.
  • Pairwise comparisons. Given a prompt and a set of responses, randomly selected pairs of responses are labeled, and an algorithm similar to Bradley-Terry (1952) [30] converts the labeled pairs into individual response scores (a minimal sketch follows after the next paragraph).

Most studies use a pairwise comparison scheme. You can read the publicly available annotator guidelines from OpenAI [31] and Hugging Face [32].
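Below is a minimal sketch, not taken from any of the papers above, of how labeled pairs can be turned into per-response scores with the Bradley-Terry model, fitted by the classic minorization-maximization (MM) updates; the response IDs and comparison data are hypothetical.

```python
# A minimal sketch of fitting Bradley-Terry strengths to pairwise comparisons
# with MM updates. Response IDs and comparisons below are hypothetical.
from collections import defaultdict

# Each tuple is (winner, loser) for one labeled comparison of two responses.
comparisons = [
    ("resp_a", "resp_b"),
    ("resp_a", "resp_c"),
    ("resp_b", "resp_c"),
    ("resp_c", "resp_b"),
]

items = {r for pair in comparisons for r in pair}
wins = defaultdict(int)          # number of comparisons each response won
pair_counts = defaultdict(int)   # how often each unordered pair was compared
for winner, loser in comparisons:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Start with equal strengths, then iterate the MM update:
#   p_i <- wins_i / sum_{j != i} n_ij / (p_i + p_j)
strength = {r: 1.0 for r in items}
for _ in range(100):
    updated = {}
    for i in items:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in items
            if j != i
        )
        updated[i] = wins[i] / denom if denom > 0 else strength[i]
    total = sum(updated.values())
    strength = {r: s / total for r, s in updated.items()}  # normalize

# Higher strength = better response according to the annotators.
print(sorted(strength.items(), key=lambda kv: -kv[1]))
```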

5.2 How much data needs to be labeled?

Touvron et al. (2023) [5] conducted a detailed analysis of existing human preference datasets; see Table 6 of their paper for details:

  • Anthropic Helpful[33]: 122K
  • Anthropic Harmless[33]: 44K
  • OpenAI Summarize [34]: 177K
  • OpenAI WebGPT[35]: 20K
  • Stack Exchange[36]: 1038K
  • Stanford SHP[37]: 75K
  • Synthetic GPT-J: 33K

Since human preferences are subjective, preparing golden tasks becomes more difficult. I recommend using synthetic data from other, weaker models, a previous snapshot of your own model, prompts that do not require deep reasoning or complex thinking, or other datasets with similar topics. We have successfully applied a smaller model for this in the past; see Pavlichenko and Ustalov (2023) [38] for details. A minimal sketch of constructing such golden tasks follows the note below.

(Translator's Note: "Golden tasks" refer to tasks that serve as references or standards in labeling tasks. These tasks are usually done by experts or high-quality labelers whose labeling results are considered accurate and reliable.)

5.3 Structure of the reward model

While a pairwise comparison scheme is not difficult to implement, we still have another design decision to make: what granularity of feedback do we want to incorporate into the reward model? If we are only comparing pairs of responses, what is the criterion for one being better than the other? Merely being factually correct is not enough, as responses may be offensive, rude, or malicious. We may therefore want to enforce multidimensional quality criteria such as helpfulness, harmlessness, and honesty (Bai et al., 2022) [7]. However, do you want to train one reward model for all criteria, or combine three models, each specializing in one criterion?
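For readers who want to see what incorporating pairwise feedback into a reward model looks like in code, here is a minimal sketch assuming PyTorch and a generic encoder that maps a prompt-response pair to a hidden vector; it shows the standard pairwise ranking objective, not the exact setup of any paper cited above.

```python
# A minimal sketch of the pairwise reward-model objective used in RLHF:
# the model should score the preferred response higher than the rejected one.
# Assumes PyTorch; the encoder producing hidden vectors is not shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scalar reward on top of encoder hidden states (encoder omitted)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.score(hidden).squeeze(-1)

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with random "encoder outputs" for a batch of comparison pairs.
hidden_size, batch = 16, 4
head = RewardHead(hidden_size)
h_chosen = torch.randn(batch, hidden_size)
h_rejected = torch.randn(batch, hidden_size)
loss = pairwise_loss(head(h_chosen), head(h_rejected))
loss.backward()
print(float(loss))
```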


Helpfulness labeling for the reward model

A reward model allows the LLM to reveal its personality and maintain a balance between helpfulness, harmlessness, and honesty. However, we have to make some tough design decisions. How many responses should each prompt have? What is our sampling method? How do we build the reward model and design the labeling tasks? If you decide to do the labeling, use synthetic data for quality control. For longer texts, we should incentivize annotators to use their expertise. Note that similar labeling methods can also be used in the post-training red-teaming step [39]. (Translator's Note: a red team is a group that plays the role of an attacker in order to assess the security and weaknesses of a system. Here, the red team further evaluates and tests the trained model to find possible vulnerabilities or other issues, and takes corresponding post-processing measures to improve the model's robustness, security, and reliability in practical applications.)

06 Conclusion

Here are three key takeaways readers should keep in mind:

  • For supervised fine-tuning (SFT), the labeled dataset should contain 10K+ prompts; for human preference feedback, 100K+ prompts.
  • Focus on small, high-quality datasets rather than larger uncontrolled ones, because data labeling is not easy.
  • The data labeler must understand the labeling task in the same way as the requester, so use synthetic data and cross-checks for quality control during labeling.

At present, I see three main themes for future data labeling work. First, current estimates of the required data size are based on experimental trial and error; what are the theoretical requirements? Second, which instruction types are more important, and how do we find them? Third, what is the best process and task design?

For crowdsourced annotation, I mentioned several data aggregation techniques and quality control methods. Our team created an open-source Python library called Crowd-Kit [40]. It efficiently implements all of the methods above, provides data quality and inter-annotator agreement measures (Translator's Note: these help researchers and developers understand the reliability of a dataset and take measures to improve data quality; they also measure the consistency between different annotators' labels), and offers dataset loaders for processing datasets efficiently in experiments. I highly recommend it when working with crowdsourced datasets.

References

1.https://open.substack.com/users/10472909-nathan-lambert?utm_source=mentions

2.https://icml.cc/Conferences/2023

3.https://medium.com/toloka/reinforcement-learning-without-reward-engineering-60c63402c59f

4.https://arxiv.org/abs/2305.11206

5.https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

6.https://arxiv.org/abs/2203.02155

7.https://arxiv.org/abs/2204.05862

8.https://labelstud.io/

9.https://www.cvat.ai/

10.https://prodi.gy/

11.https://www.mturk.com/

12.https://scale.com/

13.https://toloka.ai/

14.https://novehiclesinthepark.com/

15.https://commoncrawl.org/

16.https://huggingface.co/datasets/tiiuae/falcon-refinedweb

17.https://arxiv.org/abs/2306.01116

18.https://arxiv.org/abs/2101.00027

19.https://crfm.stanford.edu/2023/03/13/alpaca.html

20.https://lmsys.org/blog/2023-03-30-vicuna/

21.https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

22.https://arxiv.org/abs/2304.07327

23.https://arxiv.org/abs/2304.12244

24.https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

25.https://sharegpt.com/

26.https://open-assistant.io/

27.https://doi.org/10.1145/1866029.1866078

28.https://doi.org/10.2307/2346806

29.https://aclanthology.org/2022.wmt-1.72

30.https://doi.org/10.2307/2334029

31.https://docs.google.com/document/d/1MJCqDNjzD04UbcnVZ-LmeXJ04-TKEICDAepXyMCBUb8/edit?usp=sharing

32.https://docs.google.com/document/d/1c5-96Lj-UH4lzKjLvJ_MRQaVMjtoEXTYA4dvoAYVCHc/edit?usp=sharing

33.https://huggingface.co/datasets/Anthropic/hh-rlhf

34.https://huggingface.co/datasets/openai/summarize_from_feedback

35.https://huggingface.co/datasets/openai/webgpt_comparisons

36.https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences

37.https://huggingface.co/datasets/stanfordnlp/SHP

38.https://doi.org/10.1145/3539618.3592000

39.https://huggingface.co/blog/red-teaming

40.https://github.com/Toloka/crowd-kit

This article is authorized by the original author and compiled by Baihai IDP. If you need to reprint the translation, please contact us for authorization.

Original link:

https://evalovernite.substack.com/p/rlhf-math-aint-enough
