InstructGPT paper: an intensive reading

Background
Recently, large models such as ChatGPT have become extremely popular. Here I take a look at the InstructGPT paper (the sister model of ChatGPT) released by OpenAI and record the reading process. (This is not a direct translation; I have added some of my own understanding, so it is for reference only. Please see the original paper: InstructGPT: Training language models to follow instructions with human feedback.)

Paper title: InstructGPT: Training language models to follow instructions with human feedback

Summary

Making a model larger does not by itself make it better at following the user's instructions or understanding the user's actual intent. Large models can still produce content that is harmful, untruthful, or simply unhelpful for the user's problem. This can be described as the model not being aligned with the user. In this paper, we use a fine-tuning method based on human feedback to align the model with human intent across a variety of tasks. First, a batch of prompts submitted to the OpenAI API is screened, and labelers write the corresponding responses, creating a supervised dataset used for supervised fine-tuning. Then, a dataset is collected in which labelers score and rank model outputs, and the model is further fine-tuned on this data; the final model is InstructGPT. In human evaluations on a set of prompts, the outputs of the 1.3B-parameter InstructGPT are preferred over those of the 175B GPT-3, even though GPT-3 has more than 100 times as many parameters. In addition, although there is some performance degradation on public NLP dataset tasks, the model improves in the truthfulness of its output and in reducing harmful content. And although InstructGPT still makes some simple mistakes, our results demonstrate that fine-tuning based on human feedback is a promising direction.

1. Introduction

Large language models can perform many natural language processing tasks when given a few examples or instructions as part of the input. However, these models often behave in unexpected ways, such as fabricating facts, generating biased or toxic content, or simply ignoring the user's instructions. This is because the training objective of many large language models is to predict the next token, which is not the same as understanding user instructions and producing safe and correct content. The language modeling objective is therefore misaligned, and from an application point of view it is crucial to correct this unexpected behavior.

Alignment research has made some progress in this area by training models to act according to the user's intentions, which include explicit instructions as well as relatively implicit ones, such as staying truthful and not being biased, toxic, or otherwise harmful. In Askell's terms, we want language models to be helpful (they help the user complete the task), honest (they should not fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).

We focus on fine-tuning approaches for aligning language models, specifically using reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 so that it can follow a broad class of instructions. This approach uses human preferences as a reward signal to fine-tune the model. At the start, we hired a team of 40 labelers, selected according to their performance on a specially designed screening test. We then collected a batch of prompts submitted through the OpenAI API together with some prompts written by the labelers; the labelers wrote the desired model output for these prompts, and this dataset was used to train a supervised baseline. Next, we collected model outputs on a larger set of API prompts, and the labelers compared these outputs pairwise. A reward model (RM) was trained on this comparison data to predict which output is more in line with human preferences. Finally, we used the RM as the reward function and fine-tuned the supervised baseline with the PPO algorithm to maximize this reward (Figure 2). This procedure aligns the model to the preferences of a small group of people (mainly our labelers and researchers), rather than to broader human values. The final trained model is called InstructGPT.
Figure 2: the three steps of the method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via PPO on the reward model.

We mainly rely on labelers to evaluate the quality of model outputs on a test set whose prompts come from users who do not appear in the training data. We also evaluate on a number of public NLP datasets. We train three model sizes (1.3B, 6B, and 175B parameters), all with the same architecture as GPT-3. The main findings are as follows:

  1. Labelers clearly prefer InstructGPT outputs over those of GPT-3. On our test set, outputs from the 1.3B InstructGPT are preferred over outputs from the 175B GPT-3, despite the 100-fold difference in parameter count. The two models have the same architecture; the only difference is that InstructGPT is additionally fine-tuned on human-annotated data. The conclusion still holds when GPT-3 is given a few-shot prompt. Outputs from the 175B InstructGPT are preferred over the 175B GPT-3 85 ± 3% of the time, and over the few-shot 175B GPT-3 71 ± 4% of the time. According to labeler feedback, InstructGPT outputs are also better at following the explicit constraints in the instruction.
  2. InstructGPT improves over GPT-3 in the truthfulness of its output. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. On closed-domain tasks from the API prompt distribution, where the output should only contain information present in the input (such as summarization or closed-domain QA), InstructGPT fabricates information not present in the input about half as often as GPT-3 (21% vs. 41%).
  3. InstructGPT shows small improvements in toxicity over GPT-3, but not in bias. To measure toxicity, we perform automatic and human evaluations on the RealToxicityPrompts dataset. When prompted to be respectful, InstructGPT generates about 25% less toxic content than GPT-3. On the Winogender and CrowS-Pairs datasets, InstructGPT shows no significant improvement over GPT-3.
  4. **We can minimize performance regressions on public NLP datasets by adjusting the RLHF fine-tuning procedure.** During RLHF fine-tuning, we observe performance drops on certain public NLP datasets, notably SQuAD, DROP, HellaSwag, and WMT 2015 French-to-English translation. This is an example of an "alignment tax": the alignment procedure comes at the cost of lower performance on some tasks we may care about. Mixing pretraining updates into the PPO update process greatly reduces these performance regressions without compromising labeler preference scores.
  5. **The model generalizes to the preferences of held-out labelers who did not produce any training data.** To test the generalization of the model, we ran a preliminary experiment with labelers who did not contribute training data, and found that they also prefer InstructGPT outputs over GPT-3, at roughly the same rate as the labelers who did produce training data. That said, more work is needed to study how these models perform for broader groups of users, and how they behave on inputs where humans disagree about the desired output.
  6. **Public NLP datasets do not reflect how the models are actually used.** We compare GPT-3 fine-tuned on our human preference data (InstructGPT) with GPT-3 fine-tuned on two public NLP datasets (FLAN and T0++), which contain a wide variety of NLP tasks paired with natural language instructions. On prompts from the API distribution, the models trained on these public datasets perform worse than our SFT baseline, and labelers clearly prefer InstructGPT (InstructGPT wins against the baseline 73.4 ± 2% of the time, while the T0 and FLAN versions win only 26.8 ± 2% and 29.8 ± 2% of the time, respectively).
  7. **InstructGPT shows promising generalization to instructions outside of the RLHF fine-tuning distribution.** We qualitatively probe InstructGPT's capabilities and find that it can follow instructions such as summarizing code, answering questions about code, and sometimes even instructions in other languages, although such instructions are very rare in the fine-tuning data. GPT-3 can also perform these tasks, but it requires much more careful prompting and usually does not follow the instructions correctly in these domains. This result is exciting because it suggests that the model generalizes the notion of "following instructions", retaining some degree of alignment even with very little direct supervision.
  8. **InstructGPT still makes simple mistakes.** For example, it does not always understand the instruction correctly, it can fabricate facts, it may fail to give a direct answer to a simple question, and it may not recognize instructions with false premises.

2. Related Work

Research on alignment and learning from human feedback. We build on previous techniques for aligning models with human intent, in particular reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and for games, RLHF has recently been applied to fine-tuning language models to summarize text. This work was in turn influenced by similar work using human feedback as a reward in domains such as dialogue, translation, semantic parsing, story generation, review generation, and evidence extraction. Madaan uses written human feedback to augment prompts and improve the performance of GPT-3. There is also work on aligning agents in text-based environments using RL with a normative prior. Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of tasks.
The question of what it means for a language model to be aligned has also received attention recently. Kenton catalogs behavioral problems in language models that result from misalignment, including the generation of harmful content and the gaming of mis-specified objectives. In concurrent work, Askell proposes language assistants as a testbed for alignment research, studies some simple baselines, and investigates their scaling properties.
Training language models to follow instructions. Our work is also related to research on cross-task generalization in language models, where a language model is fine-tuned on a broad range of public NLP datasets (usually with an appropriate instruction prepended) and then evaluated on a different set of NLP tasks. There has been a range of work in this area, differing in training and evaluation data, instruction format, size of the pretrained models, and other experimental details. A consistent finding across these studies is that fine-tuning language models on NLP tasks with instructions improves their downstream performance on held-out tasks, both in zero-shot and in few-shot settings.

Evaluating the harms of language models. One goal of modifying the behavior of language models is to mitigate the harms they can cause when deployed in the real world. These harms have been documented extensively: language models can produce biased content, leak private data, generate misinformation, and be used maliciously; for a thorough review, see Weidinger. Deploying language models in specific domains, such as dialogue systems, also brings its own risks and challenges. There is an emerging but growing field that aims to build benchmarks to evaluate these harms on an ongoing basis, particularly around toxicity, stereotypes, and social bias. Making significant progress on these problems is difficult, because well-intentioned interventions on language model behavior can have side effects; for example, efforts to reduce the toxicity of a model can also reduce its ability to model text from under-represented groups, due to biased correlations in the training data.
Modifying the behavior of language models to mitigate harms. There are many ways to change the generation behavior of language models. Solaiman and Dennison fine-tune models on small, value-targeted datasets, which improves the models' ability to adhere to these values on QA tasks. Ngo filters the pretraining data by removing documents on which a language model has a high probability of generating a set of researcher-written trigger phrases; models trained on the filtered dataset generate less harmful text, but at some cost to language modeling performance. Xu uses a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or phrases during generation, safety-specific control tokens, and human-in-the-loop data collection. Other approaches to mitigating the harms generated by language models include word-embedding regularization, data augmentation, null-space projection to make the distribution over sensitive tokens more uniform, different objective functions, and causal mediation analysis. There is also work on using a second (usually smaller) language model to steer the output of the main model, and variants of this idea have been applied to reducing language model toxicity.

3. Methods and Experimental Details

3.1 High-level methodology
We follow the same methodology that Ziegler and Stiennon applied to stylistic continuation and summarization. We start with a pretrained language model, a distribution of prompts on which we want the model to produce aligned outputs, and a team of trained human labelers. We then apply the following three steps:
Step 1: Collect demonstration data and train a supervised policy. For each input prompt, a labeler provides a demonstration of the desired model output. A pretrained GPT-3 model is then fine-tuned on this data with supervised learning.
Step 2: Collect comparison data and train a reward model. Given a prompt, several model outputs are sampled and the labelers indicate which output they prefer. A reward model is trained on this comparison data to predict which output humans would prefer.
Step 3: Optimize the policy against the reward model using PPO. The scalar output of the RM is used as the reward, and the supervised policy is fine-tuned with the PPO algorithm to maximize this reward.
Steps 2 and 3 can be iterated: more comparison data is collected on the current best policy, used to train a new RM, and the policy is then fine-tuned again. In practice, most of the comparison data comes from the supervised policies, with some coming from the PPO policies.
(In plain terms: take a batch of prompts and have humans write the desired outputs, then use this data to train an initial model with supervised learning. Next, take another batch of prompts and have the initial model generate several different outputs for each one; humans rank these outputs, and the ranking data is used to train a reward model whose job is to judge which output humans like best. Finally, set up an iterative optimization loop with the PPO algorithm: given a prompt, the current model generates an output, the reward model scores that output, and the score is used as the reward; the goal is to maximize the reward, that is, to make the model's outputs as preferred by humans as possible. After each round, the rankings collected on the same inputs with different outputs can be used to retrain the reward model, and the new reward model enters the next round of iterative optimization.)

3.2 Dataset

The prompt dataset consists mainly of prompts submitted to the OpenAI API. It is worth noting that the API models at that time were already an early version of InstructGPT, trained only with supervised learning on demonstration data. Users are informed every time they use the API that their data may be used to train further models; data from the production API is not used in this paper. Prompts are deduplicated by checking for long common prefixes, and the number of prompts per user ID is limited to 200. The training, validation, and test splits are created based on user ID, so data from training-set users does not appear in the validation or test sets. To prevent the model from learning potentially sensitive user information, prompts in the training set are also filtered for personally identifiable information (PII).
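The prompt filtering just described can be illustrated with a short Python sketch. This is a minimal toy version under my own assumptions: the fixed-length common-prefix heuristic, the record schema, the 80/10/10 user split, and the name `prepare_prompts` are illustrative choices, and the PII filtering step is not shown.

```python
from collections import defaultdict

def prepare_prompts(records, max_per_user=200, prefix_len=40):
    """Toy prompt filtering: records are dicts with 'user_id' and 'prompt' keys."""
    seen_prefixes = set()
    per_user = defaultdict(list)
    for r in records:
        prefix = r["prompt"][:prefix_len]
        if prefix in seen_prefixes:          # deduplicate prompts sharing a long common prefix
            continue
        seen_prefixes.add(prefix)
        if len(per_user[r["user_id"]]) < max_per_user:   # cap prompts per user ID
            per_user[r["user_id"]].append(r["prompt"])

    # Split by user ID so that no user's prompts appear in more than one split.
    users = sorted(per_user)
    n = len(users)
    train_users = set(users[: int(0.8 * n)])
    valid_users = set(users[int(0.8 * n): int(0.9 * n)])
    splits = {"train": [], "valid": [], "test": []}
    for user, prompts in per_user.items():
        key = ("train" if user in train_users
               else "valid" if user in valid_users
               else "test")
        splits[key].extend(prompts)
    return splits

# Toy usage:
demo_records = [{"user_id": f"u{i % 5}", "prompt": f"Write a poem about topic {i}"} for i in range(20)]
print({k: len(v) for k, v in prepare_prompts(demo_records).items()})
```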
To train the very first InstructGPT models, we asked the labelers to write prompts themselves, because an initial source of instruction-like prompts was needed to bootstrap the process, and prompts of this kind were rarely submitted to the regular GPT-3 API. The labeler-written prompts fall into three categories:

  1. Plain: the labelers write arbitrary prompts, with the only requirement that the tasks cover a sufficiently wide range.
  2. Few-shot: the labelers write an instruction together with multiple query/response pairs for that instruction.
  3. User-based: the labelers write prompts corresponding to use cases stated in waitlist applications to the OpenAI API.

From these prompts, three datasets are produced for the fine-tuning procedure: (1) the SFT dataset, containing labeler demonstrations, used to train the SFT model; (2) the RM dataset, in which labelers score and rank model outputs, used to train the RM model; and (3) the PPO dataset, with no labeler involvement, used as the input prompts for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset contains 33k training prompts (also from the API and labeler-written), and the PPO dataset contains 31k training prompts (from the API only). See Table 6 for more details about the datasets.

Table 1 shows the distribution of use-case categories for the prompts submitted via the API; most are generative tasks rather than QA or classification. Table 2 shows some illustrative prompts, written by the researchers to mimic the kinds of prompts submitted to the API. See Appendix A for more prompt details.

3.3 Tasks

Our training tasks come from two sources: (1) prompts written by the labelers, and (2) prompts submitted through the API. These prompts are extremely varied and include generation, QA, dialogue, summarization, extraction, and other natural language tasks (Table 1). The dataset is about 96% English, but in Section 4.3 we also explore the model's performance on other languages and on coding tasks.
Each natural language prompt usually specifies its task through a direct natural language instruction (e.g., "write a story about a smart frog"), but it can also do so indirectly, either through few-shot examples (e.g., providing two example frog stories and prompting the model to generate a new one) or through an implicit continuation (e.g., providing the beginning of a frog story). In each case, we ask the labelers to do their best to infer the intent of the user who wrote the prompt, and to skip prompts whose task is very unclear. Following the guidelines we provide, the labelers also take potentially harmful responses into account, such as biased or toxic output.

3.4 Human Data Collection
To collect the demonstration and comparison data and to carry out the evaluations, we hired a team of about 40 contractors through Upwork and ScaleAI. Compared with earlier work that collects human preference data on summarization tasks, our inputs span a much wider range of tasks and occasionally include controversial or sensitive topics. Our aim was to select a group of labelers who are sensitive to the preferences of different demographic groups and who are good at identifying potentially harmful content. We therefore ran a screening test designed to measure labeler performance along these dimensions, and only labelers who performed well were hired. See Appendix B.1 for more details on the selection procedure.
During training and evaluation, our alignment criteria can come into conflict, for example when a user requests a potentially harmful response. During training we rank helpfulness to the user first, but in the final evaluations we ask the labelers to prioritize truthfulness and harmlessness.
As in Stiennon's work, we collaborate closely with the labelers over the course of the project. Labelers go through an onboarding process when they start, which includes written instructions for each task and answering their questions in a shared chat room.
An initial question of interest is how well the model generalizes to the preferences of other labelers. To study this, we hired a separate group of labelers who did not produce any training data; they were recruited from the same vendors but did not go through the screening test.
Despite the complexity of the task, we find that inter-annotator agreement is quite high: training labelers agree with each other 72.6 ± 1.5% of the time, while for the held-out labelers the figure is 77.3 ± 1.3%. For comparison, in Stiennon's summarization work, researcher-researcher agreement was 73 ± 4%.

3.5 Models
We start from the GPT-3 pretrained language models, which are trained on a broad distribution of Internet data and can be adapted to a wide range of downstream tasks, but whose behavior is poorly characterized out of the box. Starting from these models, we train models with the following three techniques:

  1. Supervised fine-tuning (SFT). We fine-tune GPT-3 with supervised learning on the demonstration data provided by the labelers, training for 16 epochs with cosine learning-rate decay and dropout of 0.2. The final SFT model is selected according to its RM score on the validation set. Similar to Wu's findings, we observe that the SFT model overfits the validation loss after 1 epoch, but training for more epochs nevertheless continues to improve both the RM score and human preference ratings. (A minimal toy sketch of this training setup appears after this list.)

  2. Reward model (RM). Starting from the SFT model with the final unembedding layer removed, we train a model that takes a prompt and a response as input and outputs a scalar reward. In this work we only use a 6B reward model, which saves a large amount of compute; we also found that 175B RM training could be unstable and thus less suitable for use as the value function during RL.
    In Stiennon's work, the RM is trained on a dataset of comparisons between two model outputs for the same input, using a cross-entropy loss with the comparison result as the label: the difference in rewards represents the log odds that one response will be preferred over the other by a labeler.
    To speed up comparison collection, we present the labelers with between k = 4 and k = 9 responses to rank, so each prompt yields $\binom{k}{2}$ comparisons. Because the comparisons within each labeling task are strongly correlated, simply shuffling all the comparisons into one dataset causes the reward model to overfit within a single epoch. Instead, we train on all $\binom{k}{2}$ comparisons from each prompt as a single batch element. This requires only one forward pass per response rather than one per comparison, so training is much more efficient, and the model no longer overfits, achieving better accuracy and log loss on the validation set. (A minimal PyTorch sketch of this loss appears after this list.)
    Specifically, the loss function for the reward model is:

    $$\text{loss}(\theta) = -\frac{1}{\binom{k}{2}} \, E_{(x, y_w, y_l) \sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]$$

    where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and response $y$, $y_w$ is the preferred response in the pair $(y_w, y_l)$, and $D$ is the dataset of human comparisons.
    Finally, since the RM loss is invariant to shifts in the reward, the reward model is normalized with a bias so that the labeler demonstrations achieve a mean score of 0 before RL.

  3. Reinforcement learning (RL). We fine-tune the SFT model with the PPO algorithm: a randomly sampled prompt is presented, the policy generates a response, and the reward model produces a reward score for the (prompt, response) pair. In addition, a per-token KL penalty against the SFT model is added to mitigate over-optimization of the reward model, and the value function is initialized from the RM. We call these models "PPO". (A minimal sketch of the per-token KL-penalized reward appears after this list.)
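As a toy illustration of the SFT setup in item 1, here is a minimal runnable sketch: a tiny next-token model trained for 16 epochs with cosine learning-rate decay and dropout 0.2. The model size, learning rate, batch shape, and random data are all my own illustrative stand-ins; only the epoch count, the schedule type, and the dropout value come from the description above.

```python
import torch
import torch.nn.functional as F

vocab, d_model = 100, 32
epochs, steps_per_epoch = 16, 50          # 16 epochs, as in the SFT setup above

class TinyLM(torch.nn.Module):
    """Toy stand-in for the pretrained model: embedding -> dropout -> next-token head."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d_model)
        self.drop = torch.nn.Dropout(0.2)  # dropout = 0.2
        self.head = torch.nn.Linear(d_model, vocab)

    def forward(self, tokens):
        return self.head(self.drop(self.emb(tokens)))

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate is illustrative
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * steps_per_epoch)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        demo = torch.randint(0, vocab, (8, 16))       # stand-in for tokenized (prompt + demonstration)
        logits = model(demo[:, :-1])
        # Standard next-token cross-entropy on labeler-written demonstrations.
        loss = F.cross_entropy(logits.reshape(-1, vocab), demo[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```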
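To make the pairwise loss in item 2 concrete, here is a minimal PyTorch sketch that treats all $\binom{k}{2}$ comparisons from one prompt as a single batch. The function name and the random toy scores are my own; in the real setup the scores would come from the 6B reward model rather than a random tensor.

```python
import itertools
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards_ranked):
    """Pairwise ranking loss for one prompt.

    rewards_ranked: tensor of shape (k,) with the reward-model scores for the k
    responses to a single prompt, ordered from most to least preferred by the
    labeler. All C(k, 2) pairs are treated as one batch.
    """
    pairs = list(itertools.combinations(range(rewards_ranked.shape[0]), 2))
    r_w = rewards_ranked[[w for w, _ in pairs]]   # preferred response in each pair
    r_l = rewards_ranked[[l for _, l in pairs]]   # dispreferred response in each pair
    # -E[log sigmoid(r_w - r_l)], averaged over the C(k, 2) comparisons
    return -F.logsigmoid(r_w - r_l).mean()

# Toy usage: fake scores for k = 4 responses to one prompt, ordered best to worst.
torch.manual_seed(0)
scores = torch.randn(4, requires_grad=True)       # stand-in for r_theta(x, y_1..y_4)
loss = rm_pairwise_loss(scores)
loss.backward()
print(float(loss))
```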
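The per-token KL penalty in item 3 can be sketched in a similarly minimal way. The helper name, the toy log-probabilities, the coefficient value, and the convention of adding the RM score at the final token are illustrative assumptions, not details quoted from the paper.

```python
import torch

def ppo_token_rewards(rm_score, logp_policy, logp_ref, kl_coef=0.02):
    """Per-token rewards for one sampled response during PPO fine-tuning.

    rm_score:    scalar RM score for the full (prompt, response) pair.
    logp_policy: (T,) log-probs of the sampled tokens under the current policy.
    logp_ref:    (T,) log-probs of the same tokens under the frozen SFT model.
    kl_coef:     weight of the per-token KL penalty (illustrative value).
    """
    # The KL penalty discourages the policy from drifting too far from the SFT model,
    # which mitigates over-optimization of the reward model.
    rewards = -kl_coef * (logp_policy - logp_ref)
    rewards[-1] = rewards[-1] + rm_score   # credit the RM score at the last token
    return rewards

# Toy usage with stand-in log-probabilities for a 5-token response.
torch.manual_seed(0)
logp_pi = -torch.rand(5)
logp_sft = -torch.rand(5)
print(ppo_token_rewards(torch.tensor(1.3), logp_pi, logp_sft))
```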

(to be continued)

Source: blog.csdn.net/yaogepila/article/details/131333133