ChatGPT/InstructGPT paper (2)

1. Introduction

For part 1 of this series, see: ChatGPT/InstructGPT paper (1)
Since ChatGPT took off, more and more people have wanted to understand the technology behind it. Although the OpenAI website does not give much detail about ChatGPT itself, it points to a recommended paper, InstructGPT. Comparing the two shows that the techniques are largely the same, so ChatGPT can be understood through InstructGPT. The following is a detailed walkthrough of InstructGPT; it is best to understand GPT-1 through GPT-3 before reading this article.

2. Summary and overview

The problem this paper tries to solve: existing large language models (such as GPT-3.5) do not simply get better as they get bigger. Larger models still generate untruthful, harmful, or simply unhelpful content. In short, a bigger model does not necessarily give users what they want.
Solutions: 1. Human demonstrations, used for supervised learning. 2. Reinforcement learning with human feedback: train a reward model that further guides the GPT model toward high-quality outputs.
Experimental results: the 1.3B-parameter InstructGPT model can outperform the 175B-parameter GPT-3 model (that is, it generates more useful, more truthful, and less harmful content; anyone who has played around with ChatGPT will have a feel for this).

3. Introduction

Existing large language models often generate content that users do not want (this is understandable, since the objective is only to predict the next word and the training data is noisy).

Therefore, this paper improves a large language model (LM) by fine-tuning it. The general idea is:

  • First, collect a large number of instruction prompts (these can be understood as questions, although an instruction is broader than a question: "write me a poem" is an instruction, and there are many different ways to phrase a question that maps onto it, so one instruction can correspond to many questions). Have humans write answers to these prompts, building a dataset that drives the first fine-tuning step.
  • Second, have humans rank the answers produced by the LM and use these rankings as labels. Train a reward model (RM) on (prompt, ranking) data; once trained, the RM can score answers generated by the LM and tell it which answers are good (how rankings become training pairs is sketched after this list).
  • Finally, use the RM as a reward function and fine-tune the GPT model with the PPO algorithm to maximize that reward.
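
To make the second step concrete, here is a minimal sketch (not OpenAI's actual code; all names are illustrative) of how one human ranking over K answers can be expanded into the pairwise comparisons a reward model is trained on:

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_answers):
    """Expand one human ranking (best answer first) into pairwise comparisons.

    A ranking over K answers yields K*(K-1)/2 (chosen, rejected) pairs,
    which is the form of supervision a pairwise reward model learns from.
    """
    pairs = []
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# Toy usage: 3 answers ranked best-to-worst give 3 comparison pairs.
example = ranking_to_pairs(
    "Write me a short poem about the sea.",
    ["answer ranked 1st", "answer ranked 2nd", "answer ranked 3rd"],
)
print(len(example))  # 3
```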

Evaluation: use prompts that do not appear in the training data and have humans score the results generated by the model.

Model: InstructGPT considers 3 model sizes: 1.3B, 6B, and 175B parameters. The model architecture is the same as GPT-3's.

Experimental results:

  1. In human evaluations, InstructGPT's outputs are clearly preferred over GPT-3's.
  2. InstructGPT's outputs are more truthful than GPT-3's, i.e., it makes things up less often (though judging from ChatGPT's behavior, there is still room for improvement).
  3. InstructGPT's outputs are less harmful than GPT-3's, e.g., less likely to contain racial discrimination and the like.
  4. It can also show improvements on specific NLP datasets.
  5. On prompts from held-out labelers who did not contribute any training data, InstructGPT also does better, i.e., the model generalizes.
  6. InstructGPT is also better than GPT models fine-tuned on public NLP datasets.
  7. InstructGPT still makes simple mistakes (and judging from ChatGPT's results, it does).

4. InstructGPT process

The InstructGPT pipeline is shown in the figure below. It can be divided into 3 stages:

  1. Stage 1: Before this stage, we already have a large pretrained language model, GPT-3.5. We randomly sample some instruction prompts from the instruction library and have humans write answers for them. These (prompt, answer) pairs form a new dataset used to further train the GPT-3.5 model. Because the human-written answers serve as labels, this is supervised learning. The model obtained after this stage is called SFT (supervised fine-tuning).
  2. Stage 2: Use SFT to generate multiple answers to prompts from the instruction library. Humans then rank these answers, and the rankings are used to train a reward model (RM) that can assign a reward value to each answer to a given prompt.
  3. Stage 3: Sample a batch of prompts from the instruction library, have SFT generate answers, and use the RM to predict a reward for each answer. Finally, the model is updated with the PPO algorithm using these rewards (a sketch of the KL-penalized reward used here follows this list).
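
In stage 3 the reward that is actually optimized is not the raw RM score: the paper also penalizes the policy for drifting too far from the SFT model via a KL term. A minimal sketch of that shaped reward (the coefficient and the per-sequence simplification are illustrative assumptions):

```python
import torch

def shaped_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """RM score minus a KL-style penalty against the SFT model.

    `logprob_policy` and `logprob_sft` are the log-probabilities the RL policy
    and the frozen SFT model assign to the sampled answer; `beta` is an
    illustrative coefficient. The paper applies the penalty per token; this
    sketch works per sequence for brevity.
    """
    kl_penalty = logprob_policy - logprob_sft
    return rm_score - beta * kl_penalty

# Toy usage
print(shaped_reward(torch.tensor(1.5), torch.tensor(-12.0), torch.tensor(-10.0)))
```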

(Figure: the three-stage training process of InstructGPT)

5. Dataset

I think it mainly involves 4 datasets, and these 4 datasets correspond exactly to the different stages. Datasets 2, 3, and 4 correspond to stages 1-3 of the InstructGPT process, while the first dataset corresponds to training the GPT-3.5 model before stage 1.

1. Before looking at the 4 datasets, you first need to understand how OpenAI built its instruction library dataset

Instruction library dataset: the instruction library has been mentioned several times above. It comes from user prompts collected through the API of OpenAI's earlier GPT models. These prompts were cleaned to some extent: duplicates were removed, each user was limited to at most 200 prompts, and personally identifiable information was stripped (a sketch of this filtering appears after Table 1). Table 1 below shows the distribution of prompt use cases from the API. Because the processed instruction library alone is not comprehensive enough, InstructGPT also asked hired labelers to write 3 types of prompts to expand it: (1) Plain: write arbitrary prompts, as diverse as possible. (2) Few-shot: write prompts together with multiple (question, answer) pairs. (3) User-based: write prompts corresponding to the use cases collected from waitlist applications for OpenAI's API.

(Table 1: distribution of prompt use cases collected from the API)
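
As a rough illustration of the cleaning described above (deduplication, a cap of 200 prompts per user, removal of personally identifiable information), here is a minimal sketch; the function name and the simplistic PII check are purely illustrative, not OpenAI's actual pipeline:

```python
from collections import defaultdict

MAX_PROMPTS_PER_USER = 200

def clean_prompt_library(records, looks_like_pii=lambda text: "@" in text):
    """Deduplicate prompts, cap prompts per user, and drop obvious PII.

    `records` is an iterable of (user_id, prompt) pairs; the PII check is a
    placeholder (here it just drops anything containing an email-like "@").
    """
    seen = set()
    per_user = defaultdict(int)
    cleaned = []
    for user_id, prompt in records:
        if prompt in seen:                              # remove duplicates
            continue
        if per_user[user_id] >= MAX_PROMPTS_PER_USER:   # cap prompts per user
            continue
        if looks_like_pii(prompt):                      # strip PII-looking prompts
            continue
        seen.add(prompt)
        per_user[user_id] += 1
        cleaned.append(prompt)
    return cleaned

# Toy usage
print(clean_prompt_library([("u1", "write me a poem"),
                            ("u1", "write me a poem"),
                            ("u2", "email me at a@b.com")]))  # ['write me a poem']
```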

The following are the 4 datasets used by InstructGPT (their record formats are sketched after the list):

  1. Existing data for training GPT-3.5: this mainly comes from existing NLP datasets, web crawls, and other corpora; refer to the GPT-3 work (Language Models are Few-Shot Learners).
  2. SFT dataset (about 13k prompts): (prompt, answer) pairs, where the prompts come from the instruction library (API and labeler-written) and the answers are written by labelers; used for the first-stage supervised fine-tuning of InstructGPT.
  3. Ranking dataset for training the RM (about 33k prompts): prompts again come from the API and from labelers; the SFT model generates several answers per prompt, and humans rank them, giving (prompt, ranking) data used to train the RM.
  4. PPO dataset (about 31k prompts): prompts come only from the API; the SFT model generates answers and the RM scores them, yielding (prompt, answer, RM score) triples used to further fine-tune the model with the PPO algorithm.
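
A minimal sketch of what a single record in datasets 2-4 might look like (the field names are illustrative, not the paper's actual schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SFTExample:              # dataset 2: supervised fine-tuning
    prompt: str
    answer: str                # written by a human labeler

@dataclass
class RMExample:               # dataset 3: reward-model training
    prompt: str
    ranked_answers: List[str]  # model answers, ordered best to worst by labelers

@dataclass
class PPOExample:              # dataset 4: RL fine-tuning
    prompt: str                # answers and RM scores are produced during training

# Toy usage
ex = RMExample("write me a poem", ["good answer", "worse answer"])
print(ex.ranked_answers[0])
```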

Summary: the datasets unique to InstructGPT are not large; 13k-33k prompts is small for a company of this scale. Even so, they are enough to significantly improve the model's performance.

6. Model training

SFT (supervised fine-tuning): the model trained in the first stage. Specifically, it is trained for 16 epochs with cosine learning-rate decay and a residual dropout rate of 0.2.
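
As a rough illustration of these training settings (16 epochs, cosine learning-rate decay, dropout of 0.2), here is a minimal PyTorch sketch with a tiny stand-in model; the optimizer, learning rate, and dummy loss are assumptions for illustration, not the paper's actual configuration:

```python
import torch
from torch import nn

EPOCHS = 16  # number of SFT epochs reported in the paper

# Tiny stand-in for the real GPT-style model; p=0.2 mirrors the residual dropout setting.
model = nn.Sequential(nn.Linear(128, 128), nn.Dropout(p=0.2), nn.Linear(128, 128))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # lr is illustrative
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # In real SFT, each step computes the next-token cross-entropy loss on
    # (prompt, human answer) sequences; a dummy regression loss stands in here.
    x = torch.randn(32, 128)
    loss = ((model(x) - x) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine decay of the learning rate, stepped once per epoch
```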

RM: the reward model trained in the second stage. Its structure is the SFT model with the final unembedding layer removed, so that it outputs a scalar reward for a (prompt, answer) pair, and it is trained on the human comparison data (prompt, ranked answers). A 6B model is used for the RM.
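
The RM is trained with a pairwise ranking loss: for every pair of answers to the same prompt, the reward of the labeler-preferred answer should exceed the reward of the other one. A minimal PyTorch sketch of that loss, with the reward model abstracted away as precomputed scores:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards, rejected_rewards):
    """-log sigmoid(r_chosen - r_rejected), averaged over comparison pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: RM scores for 3 (chosen, rejected) answer pairs.
chosen = torch.tensor([1.2, 0.8, 0.5])
rejected = torch.tensor([0.3, 0.9, -0.1])
print(pairwise_rm_loss(chosen, rejected))  # loss shrinks as chosen scores exceed rejected ones
```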

7. Related Links

  1. Detailed explanation of the InstructGPT paper (a must-read paper for learning about ChatGPT)
