How InstructGPT Prepares and Labels Datasets

Table of contents

1. Labeler selection

2. The source of the dataset

3. Data preprocessing

4. Methods for Labeling Datasets

4.1 Prompts written by labelers

4.2 Prompts submitted through the API playground

4.3 Datasets generated from the prompts

5. Data Diversity

6. What did the InstructGPT model learn using this labeled dataset?

7. Inadequacies of such labeled data

8. Labeling Instructions

 Glossary


1. Labeler selection

        OpenAI hired about 40 contractors through the Upwork platform and Scale AI to label data. Candidates were screened with tests that measured their ability to identify and respond to sensitive prompts, and their agreement rates with researchers on a detailed labeling task. The contractor team was kept small, which made it easier to communicate with the full-time contractors performing the tasks. The screening process primarily selects labelers who show a high propensity to detect and respond to sensitive content. More specifically, the training labelers were selected from an initial pool of candidates according to the following criteria:

  • Agreement on sensitive speech flagging. The researchers created a dataset of prompts and completions, some of which were sensitive (i.e. anything that could elicit strong negative feelings, whether by being harmful, sexual, violent, judgmental, political, etc.). They labeled the sensitivity of this data themselves and measured agreement between themselves and the labelers.
  • Agreement on rankings. The researchers submitted prompts to the API, collected completions from several models, and had labelers rank the completions by overall quality. They then measured each labeler's agreement with the researchers' labels (a sketch of such an agreement computation appears after this list).
  • Sensitive demonstration writing. The researchers created a small set of sensitive prompts where responding appropriately requires nuance, had candidates write demonstrations for them, scored each demonstration on a 1-7 Likert scale, and computed an average demonstration score for each labeler.
  • Self-assessed ability to identify sensitive speech for different groups. The goal was to select a broad pool of labelers who could collectively identify sensitive content. For legal reasons, contractors could not be hired based on demographic criteria, so labelers were instead asked, "For what topics or cultural groups are you comfortable identifying sensitive speech?", and this was used as part of the selection process.
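The ranking-agreement criterion above amounts to comparing a labeler's pairwise preferences against the researchers'. The following is a minimal sketch, not OpenAI's actual screening code, of how such an agreement rate might be computed; the data structures and example values are illustrative assumptions.

```python
from itertools import combinations

def pairwise_agreement(labeler_ranking, researcher_ranking):
    """Fraction of completion pairs that labeler and researchers order the same way.

    Both arguments map a completion id to its rank (1 = best).
    """
    agree, total = 0, 0
    for a, b in combinations(list(labeler_ranking), 2):
        lab = labeler_ranking[a] - labeler_ranking[b]
        res = researcher_ranking[a] - researcher_ranking[b]
        if lab == 0 or res == 0:          # skip ties
            continue
        total += 1
        if (lab > 0) == (res > 0):        # same ordering of this pair
            agree += 1
    return agree / total if total else 0.0

# Illustrative example: one prompt with four model completions ranked by both parties.
labeler    = {"c1": 1, "c2": 3, "c3": 2, "c4": 4}
researcher = {"c1": 1, "c2": 2, "c3": 3, "c4": 4}
print(pairwise_agreement(labeler, researcher))  # 0.833... (5 of 6 pairs agree)
```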
     

2. The source of the dataset

        The prompt dataset for the InstructGPT models consists mainly of text prompts submitted to the OpenAI API, in particular prompts sent through the Playground interface to earlier versions of the InstructGPT models (trained with supervised learning on a subset of the demonstration data). Customers using the Playground were informed through recurring notifications that their data could be used to train further InstructGPT models. No data from customers using the API in production was used.

3. Data preprocessing

        Duplicate prompts were removed heuristically by checking for prompts that share a long common prefix, and the number of prompts was capped at 200 per user ID. So that the validation and test sets contain no data from users whose prompts appear in the training set, the dataset was split into training, validation, and test sets by user ID. To avoid the models learning potentially sensitive customer details, prompts containing personally identifiable information (PII) were filtered out of the training set.
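A minimal sketch of what this preprocessing could look like, assuming each record carries a `user_id` and a `prompt` field and that a separate `contains_pii` check is available; this is an illustration, not OpenAI's actual pipeline.

```python
import random
from collections import defaultdict

MAX_PROMPTS_PER_USER = 200
PREFIX_LEN = 64  # "long common prefix" length for de-duplication (assumed value)

def preprocess(records, contains_pii):
    """records: iterable of dicts with 'user_id' and 'prompt' keys."""
    # 1. Heuristic de-duplication: keep one prompt per long shared prefix, per user.
    seen_prefixes, per_user = set(), defaultdict(list)
    for r in records:
        key = (r["user_id"], r["prompt"][:PREFIX_LEN])
        if key in seen_prefixes:
            continue
        seen_prefixes.add(key)
        # 2. Cap the number of prompts contributed by any single user ID.
        if len(per_user[r["user_id"]]) < MAX_PROMPTS_PER_USER:
            per_user[r["user_id"]].append(r)

    # 3. Split by user ID so no user appears in more than one split.
    users = list(per_user)
    random.shuffle(users)
    n = len(users)
    split_users = {"train": users[: int(0.8 * n)],
                   "valid": users[int(0.8 * n): int(0.9 * n)],
                   "test":  users[int(0.9 * n):]}
    splits = {name: [r for u in ids for r in per_user[u]]
              for name, ids in split_users.items()}

    # 4. Filter prompts containing PII from the training split, as described above.
    splits["train"] = [r for r in splits["train"] if not contains_pii(r["prompt"])]
    return splits
```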

4. Methods for Labeling Datasets

4.1 Prompts written by labelers

        To train the very first InstructGPT models (the beta InstructGPT), labelers were asked to write prompts themselves. This was because an initial source of instruction-like prompts was needed to bootstrap the process, and these kinds of prompts were rarely submitted to the regular GPT-3 models on the OpenAI API. Labelers were asked to write three kinds of prompts:

  • Plain: The labelers come up with arbitrary tasks, while ensuring the tasks have sufficient diversity.
  • Few-shot: The labelers come up with an instruction, plus multiple query/response pairs for that instruction.
  • User-based: A number of use cases were stated in waitlist applications to the OpenAI API; labelers come up with prompts corresponding to these use cases (an illustration of what such entries might look like follows this list).
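The field names and contents below are purely illustrative assumptions intended to show how the three prompt types differ in structure; the paper does not publish its internal schema.

```python
# Hypothetical examples of the three labeler-written prompt types (not real OpenAI data).
plain_prompt = {
    "type": "plain",
    "instruction": "List five ways to reduce energy use in an apartment.",
}

few_shot_prompt = {
    "type": "few_shot",
    "instruction": "Classify the sentiment of each movie review as positive or negative.",
    "examples": [
        {"query": "A moving, beautifully shot film.", "response": "positive"},
        {"query": "Two hours I will never get back.", "response": "negative"},
    ],
}

user_based_prompt = {
    "type": "user_based",
    "use_case": "customer-support email drafting",   # taken from a waitlist application
    "instruction": "Draft a polite reply to a customer asking about a delayed order.",
}
```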

4.2 Prompts submitted through the API playground

        For the API prompts, prompts submitted by users to earlier versions of the InstructGPT models on the aforementioned OpenAI API Playground were used. Throughout the process only Playground data was used, rather than data from customers using the models in production, because it was easier to obtain informed consent: every time a user switched to an InstructGPT model, an alert message popped up stating that prompts submitted to these models could be used to train future versions of the models. This was also communicated in a message on the developer Slack channel when the beta of the InstructGPT models launched. Prompts containing personally identifiable information (PII) were filtered out of the training split.

        To ensure diversity of use cases, duplicate prompts were removed heuristically by checking for prompts that share a long common prefix, and the number of prompts was limited to about 200 per organization. The training, validation, and test splits were also created based on organization ID, so that, for example, the validation set contains different use cases than the training set.

4.3 Datasets generated from the prompts

        From the prompts described above, three different datasets (the SFT dataset, the RM dataset, and the PPO dataset) are produced for the fine-tuning procedure:

Table 6: Dataset sizes, in number of prompts

        The table above shows the sizes of the datasets used to train and validate the SFT, RM, and RL models, and whether the prompts were written by labeling contractors or came from the OpenAI API.

        (1) SFT dataset: labeler demonstration data used to train the SFT model. The SFT dataset contains about 13k training prompts (from the API and labeler-written). Note that for SFT there are far more labeler-written prompts than customer prompts, because at the start of the project labelers wrote instructions through a user interface that asked them to provide an overarching template instruction along with a few examples of that instruction. Multiple SFT data points were then synthetically constructed from the same instruction by sampling different sets of few-shot examples.

        (2) RM dataset: used to train the RM model. The RM dataset has 33k training prompts (from the API and labeler-written). For each prompt, labelers ranked K outputs (with K between 4 and 9), which yields $\binom{K}{2}$ comparison pairs; the model is trained on all of them, so the number of ranked pairs it is trained on is an order of magnitude larger than the number of prompts (a worked example follows this list).

        (3) PPO dataset: carries no human labels and is used as input for RLHF fine-tuning. The PPO dataset has 31k training prompts (from the API only). More details on dataset sizes are given in the table above.
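As a quick check of the claim about pair counts, the snippet below (a toy illustration, not OpenAI's pipeline) enumerates the comparison pairs produced when a labeler ranks K completions for a single prompt.

```python
from itertools import combinations
from math import comb

def comparison_pairs(ranked_completions):
    """Given completions ordered best-to-worst for one prompt,
    return every (preferred, rejected) training pair."""
    return list(combinations(ranked_completions, 2))

# With K = 9 ranked outputs, one prompt yields C(9, 2) = 36 reward-model training pairs.
completions = [f"completion_{i}" for i in range(9)]
pairs = comparison_pairs(completions)
assert len(pairs) == comb(9, 2) == 36
```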
      

5. Data Diversity

        The collected data covers a wide range of categories and use cases: the categories labeled by contractors in the RM training and validation datasets are diverse, and the category distribution of the PPO dataset is similar. A subset of the labeled prompt metadata is shown in Table 7. Note that the annotation fields changed over the course of the project, so not every field is annotated for every prompt.

Table 7: Dataset annotations

Table 8: Average prompts per customer

Table 9: Prompt lengths by dataset

Here 25%, 50%, and 75% denote the first quartile, the median, and the third quartile, respectively.

Table 10: Prompt length by category

Table 11: Prompt and demonstration lengths

         A lightweight classifier (langid.py) was used to classify the language of all instructions in the dataset. Roughly 96% of the dataset (110k data points) was classified as English, but because of inaccuracies in the classifier, the true fraction is estimated to be 99% or higher.
        In addition to English, a small set of prompts was found in at least 20 other languages: Spanish, French, German, Portuguese, Italian, Dutch, Romanian, Catalan, Chinese, Japanese, Swedish, Polish, Danish, Turkish, Indonesian, Czech, Norwegian, Korean, Finnish, Hungarian, Hebrew, Russian, Lithuanian, Esperanto, Slovak, Croatian, Swahili, Estonian, Slovenian, Arabic, Thai, Vietnamese, Malayalam, Greek, Albanian and Tibetan.
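For reference, langid.py exposes a one-call classification API; the snippet below is a small illustration of how prompt languages could be tallied this way (the prompt list is made up).

```python
import langid  # pip install langid
from collections import Counter

prompts = [
    "Write a short poem about the ocean.",
    "Escribe una receta sencilla de tortilla.",
    "Fasse diesen Absatz in zwei Saetzen zusammen.",
]

# langid.classify returns a (language_code, score) tuple for each string.
counts = Counter(langid.classify(p)[0] for p in prompts)
print(counts)  # e.g. Counter({'en': 1, 'es': 1, 'de': 1})
print(f"English share: {counts['en'] / sum(counts.values()):.0%}")
```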
        Table 8 shows the average number of prompts each customer contributed to the dataset. Table 9 reports descriptive statistics on the prompt lengths (in tokens) used to train the various models, and Table 10 breaks token lengths down by use case. Finally, Table 11 reports the lengths of the contractor-written demonstrations used for the SFT model, for both contractor-written and labeler-written prompts.

6. What did the InstructGPT model learn using this labeled dataset?

        When aligning a language model with human intent, its final behavior is a function of the base model (and its training data), the fine-tuning data, and the alignment method used. Some of the factors that influence the fine-tuning data are described below, since they ultimately determine what, and to whom, the model is being aligned.
        The literature often frames alignment in terms such as "human preferences" or "human values". In this work, the model is aligned to the preferences of a group of labelers, which were influenced by the instructions they were given, the context in which they received those instructions (as paid work), and whom they received them from. Some important caveats apply:
        First, the model is aligned to the demonstration data and preferences provided by the labelers, who directly produce the data used to fine-tune the model. The labelers are mostly English-speaking people living in the US or Southeast Asia, hired through Upwork or Scale AI. They disagreed with each other on many examples; inter-labeler agreement was found to be about 73%.
        Second, the model is aligned to the preferences of the researchers who designed this study (and thus, by proxy, to the wider research organization, OpenAI): they wrote the labeling instructions that labelers used as guides when writing demonstrations and choosing their preferred outputs, and they answered labelers' questions about edge cases in a shared chat room. More research is needed on the exact effect that different instruction sets and interface designs have on the data collected from labelers and, ultimately, on model behavior.
        Third, the training data is determined by the prompts that OpenAI customers sent to models on the OpenAI API Playground, so the model is implicitly aligned to what customers find valuable and, in some cases, to what their end users find valuable in their current use of the API. Customers and their end users may disagree, or customers may not be optimizing for end-user well-being; for example, a customer may want a model that maximizes the time users spend on their platform, which is not necessarily what end users want. In practice, labelers cannot see the context in which a given prompt or completion appears.
        Fourth, OpenAI's customers are not representative of all potential or current language-model users, let alone of all the individuals and groups affected by language-model use. For most of this project, users of the OpenAI API were selected from a waitlist, and that waitlist was initially seeded with OpenAI employees, biasing the final group towards OpenAI's own networks.
        Stepping back, there are many difficulties in designing an alignment process that is fair, transparent, and has appropriate accountability mechanisms. The purpose of the paper is to demonstrate that this alignment technique can align a model to a specific human reference group for a specific application, not to claim that the researchers, the hired labelers, or the API customers are the right source of preferences. There are many stakeholders to consider: the organization training the model, the customers using the model to develop products, the end users of those products, and the broader population that may be directly or indirectly affected. Nor is it simply a matter of making the alignment process more participatory; it is impossible to train a single system whose behavior fits everyone's preferences at once, or on whose trade-offs everyone agrees.
       Training models that can be conditioned on the preferences of particular groups, or that can easily be fine-tuned or prompted to represent different groups, may be a better path. Groups that endorse different values could then deploy and use different models. However, these models may still end up affecting wider society, and many difficult decisions remain about whose preferences to condition on and how to ensure that all groups are represented and can opt out of potentially harmful processes.

7. Inadequacies of such labeled data

        The behavior of the InstructGPT models depends in part on the human feedback obtained from the contractors. Some labeling tasks rely on value judgments that may be influenced by who the contractors are, their beliefs, their cultural backgrounds, and their personal histories. This group of labelers is clearly not representative of the full population that will use, and be affected by, the deployed models. As a simple example, the labelers are predominantly English-speaking and the data consists almost entirely of English instructions. There are also many ways the data-collection setup could be improved. For instance, most comparisons in the RM dataset are labeled by only one contractor for cost reasons; labeling examples multiple times would help identify areas where the contractors disagree, and where a single model is therefore unlikely to align with all of them. In cases of disagreement, aligning to the average labeler preference may not even be desirable; for example, when generating text that disproportionately affects a minority group, the preferences of labelers belonging to that group should perhaps be weighted more heavily.

8. Labeling Instructions

        Throughout the project, labelers were provided with instructions on how to label data, and these instructions were revised over time to clear up anything confusing or inconsistent.

        Notably, during training-data labeling the labelers were asked to prioritize helpfulness to the user as the most important criterion (above truthfulness and harmlessness), whereas in the final evaluations labelers were asked to prioritize truthfulness and harmlessness. Research directions are being explored that would let the model sometimes prioritize truthfulness and harmlessness over helpfulness during training, notably through the use of refusals: having the model decline to answer certain instructions. This brings new challenges: different applications carry different levels of risk, so what the model refuses may need to be configurable at inference time. There is also the risk that the model over-generalizes and refuses innocuous instructions, which would be undesirable for most applications.

        Excerpts from the instructions given to labelers for the final evaluation on the prompt distribution are shown in Figure 10, and the instructions for the RealToxicityPrompts distribution are shown in Figure 11.

Figure 10: Excerpt from the instructions given to labelers for the final evaluation of model outputs on the prompt distribution. The full instructions are provided in the paper.

Figure 11: Complete instructions given to labelers for evaluating model outputs on the RealToxicityPrompts distribution.

Finally, a screenshot of the OpenAI API Playground interface is shown.

 

 Glossary

  • RLHF: Reinforcement Learning from Human Feedback
  • GPT: Generative Pre-trained Transformer
  • LMs: Language Models
  • SFT: Supervised Fine-Tuning on human demonstrations
  • PPO: Proximal Policy Optimization, a widely used reinforcement learning algorithm introduced by OpenAI in 2017
  • FLAN: Finetuned Language Net, from "Finetuned Language Models Are Zero-Shot Learners" (https://arxiv.org/pdf/2109.01652.pdf)
  • T0: an encoder-decoder model that consumes textual inputs and produces target responses
  • RM: Reward Model
