Detailed explanation of the ChatGPT principle

InstructGPT original text: https://arxiv.org/pdf/2203.02155.pdf

ChatGPT trial link: https://chat.openai.com/auth/login

        Since ChatGPT was released, its popularity has exploded, and it currently has 100 million registered users. Its appearance has pushed major companies to deploy AIGC one after another, and many people predict that the changes brought about by ChatGPT will upend Google's existing search products and business model. Just an hour before this post was written, Google announced Bard to counter ChatGPT, opening a defensive battle; Bard will be made available to the public in a few weeks. How does ChatGPT, which has caused such a large response in the field of artificial intelligence, actually work? Below I summarize the material available online on the principle of ChatGPT.

        In terms of the overall technical route, ChatGPT is based on the GPT-3.5 large language model (LLM, Large Language Model) and introduces reinforcement learning to fine-tune the pre-trained model. The reinforcement learning used is RLHF (Reinforcement Learning from Human Feedback), i.e. learning from human annotations. The goal is to let the LLM learn to understand various NLP tasks and, through the reward mechanism, learn to judge which kinds of answers are high quality (along the three dimensions of helpfulness, honesty, and harmlessness). The following sections explain the relevant background knowledge and the principle of ChatGPT.

1. GPT

        The full name of GPT is Generative Pre-trained Transformer. As the name suggests, GPT uses the Transformer as its basic model and relies on pre-training to obtain a general-purpose text model.

        (1) GPT-1 was born a few months earlier than BERT. Both use the Transformer as their core structure. The difference is that GPT-1 builds its pre-training task from left to right (autoregressive language modeling) and then obtains a general pre-trained model that, like BERT, can be fine-tuned for downstream tasks. GPT-1 achieved SOTA results on 9 NLP tasks at the time.

        (2) Compared with GPT-1, GPT-2 did not change the model structure; it simply used a model with more parameters and more training data (see Table 1). The most important idea of GPT-2 is that "all supervised learning is a subset of the unsupervised language model", which is also the forerunner of prompt learning. GPT-2 caused quite a sensation when it was released: the news articles it generated were good enough to deceive most humans and pass for the real thing. It was even called "the most dangerous weapon in the AI world" at the time, and many news portals banned the use of articles generated by GPT-2.

        (3) When GPT-3 was proposed, beyond its performance being far superior to GPT-2, what caused even more discussion was its 175 billion parameters. Besides completing common NLP tasks, researchers unexpectedly found that GPT-3 also performs well at writing code in SQL, JavaScript and other languages and at simple mathematical operations. GPT-3 relies on in-context learning, which is a form of meta-learning. The core idea of meta-learning is to use a small amount of data to find a suitable initialization so that the model can fit quickly on a limited dataset and still achieve good results.
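        As a small illustration (my own example, in the style of the few-shot translation demo from the GPT-3 paper, not taken from this article's source), in-context learning means putting a few demonstrations directly into the prompt and letting the model continue the pattern, with no gradient updates:

```python
# A minimal sketch of few-shot in-context learning: the "training examples" live
# in the prompt text itself, and the model's weights are never updated.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe =>"   # the model is expected to complete this last line
)
# completion = language_model.generate(few_shot_prompt)   # hypothetical API call
print(few_shot_prompt)
```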

Table 1: The GPT series

2. Instruction Learning and Prompt Learning

       Instruction learning is an idea proposed in 2021 by Quoc V. Le's team at Google in the paper "Finetuned Language Models Are Zero-Shot Learners". The paper proposes an instruction-tuning-based model called FLAN (Finetuned LAnguage Net), where instruction tuning means fine-tuning the language model on a collection of tasks (more than 60 NLP tasks) described via instructions.

         Prompt learning refers to processing the input text according to a specific template, reconstructing the task into a form that can make full use of the pre-trained language model. The figure below shows the difference between fine-tuning and prompt learning: in fine-tuning, the pre-trained language model "accommodates" the various downstream tasks; in prompting, the various downstream tasks "accommodate" the pre-trained language model.

        Instruction learning stimulates the language model's comprehension ability: by giving more explicit instructions, it lets the model take the correct action. Prompt learning stimulates the language model's completion ability, for example generating the second half of a sentence from the first half, or filling in a cloze. Examples of the two are as follows:

(1) Prompt learning: I bought this necklace for my girlfriend, and she likes it very much. This necklace is so ____.

(2) Instruction learning: Judge the sentiment of this sentence: "I bought this necklace for my girlfriend, and she likes it very much." Options: A = good; B = fair; C = bad.
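        A minimal sketch of how these two input formats could be built in code (the templates are purely illustrative):

```python
review = "I bought this necklace for my girlfriend, and she likes it very much."

# Prompt learning: recast the task as text completion / cloze
prompt_style = review + " This necklace is so ___"

# Instruction learning: state the task explicitly as an instruction with options
instruct_style = (
    "Judge the sentiment of this sentence: " + review
    + " Options: A = good; B = fair; C = bad."
)
print(prompt_style)
print(instruct_style)
```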

        The advantage of instruction learning is that, after multi-task fine-tuning, the model can also perform zero-shot on other tasks, whereas prompt learning is tied to a single task and generalizes less well. Fine-tuning, prompt learning and instruction learning can be understood through the following diagram:

[Figure: comparison of fine-tuning, prompt learning, and instruction learning]

3. Reinforcement learning

        The two most basic elements of reinforcement learning are observations (states) and a reward function; the two most basic elements of supervised learning are training data and labels. The two frameworks connect almost seamlessly: observations can serve as training data and the reward function plays the role of the loss function, so training data and "labels" are produced continuously while the agent runs in the environment. In essence, the agent's behavior is gradually fitted to the reward function. Reinforcement learning does not require manually labeled data; the machine learns automatically from the feedback it receives. Supervised learning usually needs a large amount of manually labeled data, whereas reinforcement learning requires people to construct an environment and a goal, and an agent that interacts with that environment then learns on its own how to reach the goal.

        Reinforcement learning guides model training through the reward mechanism, which can be regarded as the counterpart of the loss function in traditional training. Computing the reward is more flexible and varied than computing a loss (AlphaGo's reward is simply the outcome of the game); the price of this flexibility is that the reward is not differentiable, so it cannot be used directly for backpropagation. The idea of reinforcement learning is to approximate the loss by sampling a large number of rewards and use that to train the model. Human feedback is likewise not differentiable, so it too can be used as a reward for reinforcement learning, and reinforcement learning from human feedback came into being.
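        A minimal REINFORCE-style sketch of this idea (a toy policy and a made-up reward function, not the InstructGPT code): the reward itself is never differentiated; it only weights the log-probability of the sampled action, which gives a differentiable surrogate loss.

```python
import torch

policy = torch.nn.Linear(4, 2)          # toy policy: 4-dim state -> 2 action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def environment_reward(state, action):
    # stand-in reward function; in RLHF this would come from human feedback / a reward model
    return 1.0 if action.item() == 0 else -1.0

for step in range(100):
    state = torch.randn(4)
    logits = policy(state)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                     # sampling, not differentiable
    reward = environment_reward(state, action) # scalar feedback, not differentiable
    loss = -dist.log_prob(action) * reward     # surrogate loss built from the sampled reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```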

        The RLHF used by InstructGPT/ChatGPT can be traced back to the 2017 paper "Deep Reinforcement Learning from Human Preferences", which used human annotations as feedback to improve the performance of reinforcement learning on simulated robotics and Atari games.

Figure: Basic principles of reinforcement learning with human feedback

        InstructGPT/ChatGPT also uses a particular reinforcement learning algorithm: Proximal Policy Optimization (PPO). PPO is a policy gradient algorithm. Policy gradient methods are very sensitive to the step size, yet a suitable step size is hard to choose; if the new and old policies differ too much during training, learning suffers. PPO introduces a new objective function that allows small-batch updates over multiple training steps, which addresses the difficulty of choosing the step size in policy gradient algorithms.
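        For reference, the clipped surrogate objective from the original PPO paper (Schulman et al., 2017), which keeps the new policy close to the old one, is:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)},$$

where $\hat{A}_t$ is the estimated advantage and $\epsilon$ is the clipping range.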

4. Interpretation of InstructGPT/ChatGPT principle

(1) InstructGPT process

        InstructGPT/ChatGPT both adopt the network structure of GPT-3. Training samples are constructed through instruction learning to train a reward model (RM) that predicts how good the generated content is, and the scores given by this reward model then guide the training of the reinforcement learning model. The process is as follows:

The training process of InstructGPT: (1) supervised fine-tuning (SFT); (2) reward model (RM) training; (3) reinforcement learning against the reward model via PPO.

        The specific steps are explained as follows:

        Step 1) Sample a portion of inputs from the GPT-3 prompt dataset. For these inputs, human annotators write the expected outputs and behaviors, and the annotated data are then used for supervised training of GPT-3. This model serves as the cold-start model of InstructGPT.

        Step 2) For the sampled input sentences, run forward inference to obtain multiple model outputs, and have human annotators rank these outputs. The ranking labels are then used to train the reward model.

        Step 3) Sample new input sentences; the policy network generates outputs, the reward model computes the feedback, and the feedback in turn updates the policy network. Repeating this loop is a standard reinforcement learning training setup.

        To sum up, ChatGPT (a dialogue-oriented GPT) is essentially a sibling model of InstructGPT (an instruction-following GPT), and InstructGPT is built on GPT-3: first, the reinforcement learning cold-start model and the reward model are trained with human annotations, and then a dialogue-friendly ChatGPT model is obtained through reinforcement learning.

(2) Datasets used by InstructGPT

       InstructGPT uses three datasets, namely: SFT dataset, RM dataset, and PPO dataset.

        The SFT dataset is used to train the supervised model in step 1, that is, to fine-tune GPT-3 on the newly collected data using GPT-3's own training procedure. Because GPT-3 is a prompt-based generative model, the SFT dataset consists of prompt-answer pairs. The RM dataset is used to train the reward model in step 2.

Table 2: Data distribution of InstructGPT

(3) Detailed explanation of the three stages of InstructGPT

        InstructGPT in summary: RLHF is used in the first and second stages. First, the SFT model is obtained by fine-tuning the GPT model on human-written answers; the SFT model then generates K answers per prompt, humans rank them, and the RM model is trained on these rankings. In the third stage, the parameters of the SFT model are taken as the starting point and training proceeds using the reward given by the RM model, yielding the PPO and PPO-ptx models.

  • Supervised fine-tuning (SFT)

        The input of this stage is a batch of data (the SFT dataset) randomly sampled from the prompts (instructions or questions) submitted by test users. It mainly consists of two steps:

  1. Human annotators write high-quality answers for the sampled prompts, yielding <prompt, answer> data pairs
  2. Fine-tune the GPT-3.5 (InstructGPT) model on these high-quality answers so that, in this first stage, the model learns to better understand input instructions.

        In this way, a basic GPT-3.5 language model is obtained, referred to here as the SFT model.
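        A minimal SFT sketch (my own illustration, using GPT-2 from Hugging Face as a stand-in for GPT-3.5 and a single demonstration pair; the real InstructGPT training code and data are not public, and details such as masking the prompt tokens in the loss are omitted):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One <prompt, answer> demonstration pair written by a human labeler (illustrative)
prompt = "Explain the moon landing to a 6 year old."
answer = " Some people went to the moon in a big rocket and walked on it."

# Standard causal-LM fine-tuning: maximize the likelihood of prompt + answer
inputs = tokenizer(prompt + answer, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```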

  • Reward Model (RM)

        The RM is the trained SFT model with its final unembedding layer removed, so that it outputs a scalar. Its input is a prompt and a response, and its output is a reward value. It is roughly divided into two steps:

  1. For each prompt, InstructGPT/ChatGPT randomly generates K outputs (4 ≤ K ≤ 9) and shows the outputs to the labelers in pairs, i.e. $C_K^2$ pairs of results per prompt. The labeler picks the better output of each pair, which amounts to manually annotating a ranking order.
  2. The ranking results are used for training on the <prompt, answer> data pairs. During training, InstructGPT/ChatGPT treats all the response pairs of one prompt as a single batch. This prompt-based batching is less prone to overfitting than the traditional sample-based batching, because each prompt is fed into the model only once.

        The loss function of the reward model maximizes the gap between the responses the labeler prefers and the responses the labeler rejects. The formula is as follows:
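        As given in the InstructGPT paper:

$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log\Big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big)\Big]$$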

        where $r_\theta(x, y)$ is the reward assigned to prompt $x$ and response $y$ by the reward model with parameters $\theta$, $y_w$ is the response the labeler prefers, $y_l$ is the response the labeler rejects, and $D$ is the whole training dataset.
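        A minimal sketch of this pairwise loss in PyTorch (names and batch shapes are illustrative; it assumes the reward model has already mapped each (prompt, response) pair to a scalar):

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push r(x, y_w) above r(x, y_l)."""
    # -log(sigmoid(r_w - r_l)) == softplus(r_l - r_w), averaged over the batch
    return F.softplus(reward_rejected - reward_chosen).mean()

# Dummy rewards for a batch of 8 comparison pairs
r_w = torch.randn(8, requires_grad=True)   # rewards of the preferred responses
r_l = torch.randn(8, requires_grad=True)   # rewards of the rejected responses
loss = rm_pairwise_loss(r_w, r_l)
loss.backward()
```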

  • Reinforcement Learning Model (PPO)

        The dataset used in this step is larger than, and different from, those of the first two stages, and no human annotation is required. The method can be summarized in the following four parts:

  1. The parameters of the PPO model are initialized from the supervised (SFT) model of the first stage
  2. The PPO model generates answers
  3. The answers are evaluated and scored by the second-stage RM model
  4. The PPO model's parameters are updated according to these scores

        InstructGPT/ChatGPT encountered two problems during the training process:

        Problem 1: As the model is updated, the data generated by the reinforcement learning model drifts further and further away from the data used to train the reward model. The authors' solution is to add a KL penalty term $\beta \log\big(\pi_\phi^{RL}(y \mid x)\,/\,\pi^{SFT}(y \mid x)\big)$ to the objective, which ensures that the output of the PPO model does not deviate too much from the output of the SFT model.
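        A small sketch of how this per-token penalty can be computed from the two policies' log-probabilities (the shapes and the value of β are assumptions, not taken from the paper):

```python
import torch

beta = 0.02                        # assumed KL coefficient
# log-probabilities of the sampled response tokens under the RL policy and the frozen SFT policy
logprobs_rl = torch.randn(4, 32)   # [batch, response length], dummy values
logprobs_sft = torch.randn(4, 32)

# β · log(π_RL(y|x) / π_SFT(y|x)), summed over the response tokens of each sample
kl_penalty = beta * (logprobs_rl - logprobs_sft).sum(dim=-1)
# this penalty is subtracted from the reward model's score for each sample
```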

        Problem 2: If only the PPO objective is used for training, the model's performance on general NLP tasks drops sharply. The authors' solution is to add a general language modeling term $\gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\big[\log\big(\pi_\phi^{RL}(x)\big)\big]$ to the objective; this variant is called PPO-ptx in the paper.

       In summary, the overall training objective of PPO(-ptx) is:
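        As given in the InstructGPT paper:

$$\operatorname{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{RL}}}\Big[r_\theta(x,y) - \beta\,\log\big(\pi_\phi^{RL}(y\mid x)\,/\,\pi^{SFT}(y\mid x)\big)\Big] + \gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\Big[\log\big(\pi_\phi^{RL}(x)\big)\Big]$$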

5. Experiments and model evaluation

  • Helpfulness: whether the model can infer the user's intent

        The evaluation method is to have annotators choose the better output between the model under test and the SFT model. A score of 0.5 means the model performs about as well as SFT. In the figure below, the left side shows results on prompts submitted to GPT, and the right side shows results on prompts submitted to InstructGPT. It can be seen that the PPO and PPO-ptx models achieve better results than the other models.

  • Honesty

        The method is to prepend an instruction prompt like the following to the model input, reminding the model which questions it should be careful with and how to answer them.

        The results are shown below; the right-hand figure is the effect after adding the instruction. Gray represents truthfulness, and the colored portions represent the proportion of answers that are both truthful and informative:

  • Harmlessness

        The method is likewise to add an instruction so as to reduce the model's tendency to produce harmful, impolite or biased answers; the effect is as follows:

Reference link:

1. ChatGPT/InstructGPT Detailed Explanation - Zhihu

2. Analysis of the basic principles of ChatGPT - Zhihu
