InstructGPT


Abstract

Human-written prompts are paired with manually annotated desired outputs to form a dataset, and GPT-3 is fine-tuned on it with supervised learning.

The model's outputs are then ranked by human labelers to form a second dataset, which is used to further fine-tune the supervised model with reinforcement learning.

We call the resulting model InstructGPT.

From the paper's abstract: "Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT."
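
The two fine-tuning stages described above correspond, roughly, to the following objectives (the notation is mine, not this post's: $r_\theta$ is the reward model, $y_w$ and $y_l$ are the preferred and less-preferred answers to a prompt $x$, $\sigma$ is the sigmoid, $\pi^{\mathrm{SFT}}$ is the supervised model, $\pi_\phi^{\mathrm{RL}}$ is the policy being tuned, and $\beta$, $\gamma$ are weighting coefficients):

$$
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]
$$

$$
\operatorname{objective}(\phi) = \mathbb{E}_{(x,\,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Big[r_\theta(x, y) - \beta\,\log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\Big] + \gamma\,\mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\Big[\log \pi_\phi^{\mathrm{RL}}(x)\Big]
$$

The first loss trains the reward model on ranked pairs drawn from $K$ sampled answers per prompt. The second is the reinforcement-learning objective: the KL term keeps the tuned policy close to the SFT model, and the optional $\gamma$ term mixes in pretraining data (the paper's PPO-ptx variant).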

The three implementation steps

  1. Collect a set of prompts and have human labelers write the desired answers; fine-tune GPT-3 on this dataset with supervised learning to obtain the SFT model.
  2. Have the trained SFT model generate several answers to prompts, have labelers rank these answers, and use this ranking dataset to train the reward model (RM).
  3. Use the RM's scores as the reward signal to further optimize the SFT model with reinforcement learning (PPO in the paper); see the sketch after this list.
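
To make steps 2 and 3 concrete, here is a minimal PyTorch sketch of the two quantities they rely on: the pairwise ranking loss used to train the reward model, and the KL-penalized reward that the reinforcement-learning step then maximizes. The function names and the beta value are my own illustration, not from this post or from OpenAI's code.

```python
# Minimal sketch of the losses behind steps 2 and 3 (illustrative names only).
import torch
import torch.nn.functional as F


def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for step 2.

    chosen_rewards / rejected_rewards: scalar reward-model scores for the
    human-preferred and the less-preferred answer to the same prompt,
    shape (batch,). The loss pushes the preferred answer's score higher.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def kl_shaped_reward(rm_score: torch.Tensor,
                     logprob_rl: torch.Tensor,
                     logprob_sft: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Reward signal for step 3 (used inside a PPO-style update).

    rm_score:    reward-model score for the sampled answer, shape (batch,)
    logprob_rl:  log-probability of that answer under the policy being trained
    logprob_sft: log-probability under the frozen SFT model
    beta:        strength of the KL penalty keeping the policy close to SFT
    """
    return rm_score - beta * (logprob_rl - logprob_sft)


if __name__ == "__main__":
    # Tiny smoke test with random numbers standing in for real model outputs.
    chosen, rejected = torch.randn(4), torch.randn(4)
    print("RM ranking loss:", reward_model_loss(chosen, rejected).item())
    print("shaped reward:", kl_shaped_reward(torch.randn(4),
                                             torch.randn(4),
                                             torch.randn(4)))
```

In practice the reward scores would come from a scalar head on top of the SFT transformer, and the shaped reward would feed a PPO update of the policy.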

Results

InstructGPT (1.3B parameters) has roughly 1/100 the parameters of the 175B GPT-3, yet human evaluators prefer its outputs. It is also more truthful and produces less harmful output.

From the abstract: "In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets."

Paper link: https://arxiv.org/pdf/2203.02155.pdf

