Tracing ChatGPT's ancestors: interpreting the key points of the GPT-3 technical report

Paper address: Language Models are Few-Shot Learners


The reason this post's title says "technical report" rather than "paper" is that the 63-page GPT-3 document was never formally published as a paper; it is a report. It also does not give a detailed account of the model architecture or the full training process; most of it is discussion. So this post only picks out the points I consider worth attention.

Abstract

Looking back first: GPT-1 mainly changed the input format so that one model could learn to perform different tasks, while GPT-2 emphasized zero-shot learning throughout, dropping the special symbols used to distinguish tasks in favor of pure natural-language input. In GPT-3 the authors no longer emphasize zero-shot. They argue that relying on a large amount of annotated data for task-specific fine-tuning is not a good idea, because when humans learn a new task they usually only need a small number of examples. My guess is that the authors felt giving the model no example at all was too extreme, so they landed on the few-shot approach; the one-shot setting is also covered and will be discussed later. In the abstract itself, the authors mainly explain that they built GPT-3 with 175 billion parameters, about 10x larger than any previous non-sparse language model. Why "non-sparse"? Because a sparse model's weights contain many zeros, which inflates its parameter count, so the comparison would be meaningless. They also found that news articles generated by GPT-3 are hard even for human readers to distinguish from articles actually written by humans.

1. Introduction

Next, the author discusses the prevailing paradigm for language model training: pre-train on a task-agnostic corpus, then fine-tune on a task-specific data set. The big problem with this paradigm is that a large amount of annotated data is still needed for fine-tuning. Specifically, the author lists three issues:

  • Dependence on annotated data sets: training the model for each new task requires a large amount of labeled data, which is hard to obtain;
  • The good performance of a fine-tuned model is not necessarily evidence that the pre-trained model generalizes well. It may simply be that the enormous pre-training corpus already covers information close to the fine-tuning data. If the fine-tuning data has no corresponding distribution in pre-training, the model's performance may degrade.
  • Humans usually do not need a large number of examples to learn a new task. For example, once you have seen what a few cats look like, you can most likely identify a cat in the future regardless of its color or breed. GPT-3 wants to mimic this human learning process: the model should not need a large number of task-specific examples to learn. That is few-shot learning.

To address the above problems, the authors put forward their own ideas. In this passage they bring up a relatively new term, meta-learning, along with "in-context learning", that is, learning from the context. Meta-learning is actually not that exotic: bluntly, it means feeding the model a large number of samples from many different tasks during pre-training, similar to the multi-task learning in GPT-2. In-context learning is then divided into zero-shot, one-shot, and few-shot according to the number of demonstration examples. So does the in-context learning process perform gradient updates? It does not. The author explains meta-learning and in-context learning at the end of this page:

The author notes that the "zero-shot learning" mentioned earlier does not truly learn from zero samples. To avoid this ambiguity, they use meta-learning to refer to the pre-training process and in-context learning to refer to the forward pass (note that this can be regarded as inference, since it involves no gradient updates). Depending on the number of examples supplied at inference time, it is divided into zero-shot, one-shot, and few-shot. To be honest, it is a bit convoluted; if you do not dig into the author's intent it gets confusing. The author attaches a figure to illustrate the process:

The outer loop is the unsupervised pre-training process, and the blue part at the bottom is in-context learning. Judging from the figure alone, the entire outer loop is unsupervised pre-training, with different in-context learning episodes embedded inside it. But according to the text, the pre-training process must perform gradient updates, while in-context learning does not, so reading the figure literally is misleading. I therefore think the figure is not showing how GPT-3 is actually trained, but rather which component stages exist: there is an unsupervised pre-training stage, and in-context learning happens only in the forward pass. What the author wants to convey is that when the pre-training corpus is large enough, it very likely already contains examples related to the tasks encountered during in-context learning. Take the figure with a grain of salt; and since GPT-3 is not open source, this is all guesswork.
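To make the distinction concrete, here is a minimal self-contained sketch (my own illustration, not code from the paper; DummyLM is a made-up stand-in) contrasting fine-tuning, which updates weights, with in-context learning, which is a pure forward pass:

```python
class DummyLM:
    """Made-up stand-in for a language model, for illustration only."""

    def __init__(self):
        self.weights = [0.0]               # pretend parameters

    def train_step(self, example):
        # Fine-tuning: each labeled example changes the weights
        # (stand-in for a real gradient update).
        self.weights[0] += 0.01

    def generate(self, context: str) -> str:
        # Inference only: the context, including any demonstrations,
        # is consumed in a single forward pass; self.weights untouched.
        return f"<completion conditioned on {len(context)} chars>"

lm = DummyLM()

# Fine-tuning regime: weights are updated per example.
lm.train_step(("sea otter", "loutre de mer"))

# In-context learning regime: demonstrations are just concatenated
# into the prompt; no backward pass, no optimizer step.
demos = ["sea otter => loutre de mer", "peppermint => menthe poivrée"]
print(lm.generate("\n".join(demos + ["cheese =>"])))
```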

In this paragraph the author argues that as the number of model parameters keeps increasing, model performance does keep improving. In short, parameter count has a direct impact on performance.

The author compares experiments across different numbers of shots. Judging from the graph, few-shot performs best; note that the three solid lines are averages over n experiments. In other words, as the number of demonstration examples grows, model performance generally improves. It is worth stressing that these demonstration examples are not used to fine-tune the model: the model performs the task conditioned on them without any gradient update, which really is somewhat like how humans approach a new task.

The author also mentions that GPT-3 still has a lot of room for improvement on natural language inference tasks, and that they will keep investing in research on few-shot learning.

These paragraphs mainly preview what the article will discuss next: the data sets, model sizes, different training settings, model limitations, social impact, and so on. There is not much to say here.

2. Approach 

Next, the author explains what fine-tuning, few-shot, one-shot, and zero-shot learning are and how each is done. Worth noting: for few-shot learning the number of demonstrations in their experiments ranges from 10 to 100, and a major shortcoming of few-shot learning is that it still performs worse than the current best fine-tuned models. Zero-shot is, in some cases, the situation closest to how humans perform a new task. Below is the paper's illustration of fine-tuning, zero-shot, one-shot, and few-shot learning:

From the figure you can clearly see what the four settings are: fine-tuning performs incremental training (with gradient updates) on top of the pre-trained model using a certain amount of labeled data; zero-shot provides only a prompt and no demonstration; one-shot provides a prompt plus one demonstration; few-shot provides a prompt plus a small number of demonstrations. The purpose of the demonstrations is to give the model something to look at during inference so that it knows what you want it to do. Take the prompt "Translate English to French:" in the figure: below the prompt come the demonstrations and the sample to be predicted. The sequence to the left of "=>" is the source text (English) and the sequence to the right is the target text (French); the final sequence, which has no target text, is the one the model must predict.
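As a rough sketch of how such prompts can be assembled (the "Translate English to French:" and "cheese" strings come from the paper's figure; the build_prompt helper itself is my own illustration):

```python
def build_prompt(task_description: str,
                 demonstrations: list[tuple[str, str]],
                 query: str) -> str:
    """Assemble a GPT-3-style prompt.

    len(demonstrations) == 0 -> zero-shot
    len(demonstrations) == 1 -> one-shot
    len(demonstrations) >  1 -> few-shot (the paper uses roughly 10-100)
    """
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")   # no target: the model must complete it
    return "\n".join(lines)

# Zero-shot: task description only.
print(build_prompt("Translate English to French:", [], "cheese"))

# One-shot: one worked example before the query.
print(build_prompt("Translate English to French:",
                   [("sea otter", "loutre de mer")], "cheese"))
```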

Next, the author lists the parameters of the models of different sizes they designed in the article:

The table makes this clear. GPT-3 Small has roughly the same parameter count as BERT-base, and GPT-3 Medium roughly matches BERT-Large. GPT-3 XL is comparable in parameter count to GPT-2, but it is wider (a larger vector dimension) and shallower: GPT-2 has 48 layers versus XL's 24. As for why the hyperparameters were set this way, I suspect a fair amount of the authors' alchemy was involved. What is more curious is that as the number of layers grows, the vector dimension does not grow proportionally, and as the batch size grows the learning rate shrinks. This feels contrary to the usual intuition that depth and width should scale together (more layers should come with wider vectors to hold more information), and that a larger batch should allow a larger learning rate to move quickly toward the optimum. In short, rather metaphysical.
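As a sanity check on those parameter counts, here is a back-of-the-envelope estimate using the standard approximation that a GPT-style layer holds about 12 * d_model^2 weights (4*d^2 for attention, 8*d^2 for the 4x MLP), plus the token embedding matrix. The (layers, dimension) configs below are the ones listed in the paper's model table, quoted from memory:

```python
# Back-of-the-envelope parameter count for a GPT-style decoder.
# Vocabulary size ~50257 (the GPT-2 BPE vocabulary).

def approx_params(n_layers: int, d_model: int, vocab: int = 50257) -> float:
    return n_layers * 12 * d_model**2 + vocab * d_model

for name, (n_layers, d_model) in {
    "GPT-3 Small": (12, 768),     # ~0.125B in the paper
    "GPT-3 XL":    (24, 2048),    # ~1.3B
    "GPT-3 175B":  (96, 12288),   # ~175B
}.items():
    print(f"{name}: ~{approx_params(n_layers, d_model) / 1e9:.2f}B")
```

The estimates land within a few percent of the table's numbers, which is a nice confirmation that the configs are internally consistent.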

2.1 Model and Architectures

Now for the part everyone cares most about, the model structure; unfortunately this most important part gets very little space. The author states up front that there is no essential difference between GPT-3's architecture and GPT-2's. The main difference is that GPT-3 uses attention patterns similar to the Sparse Transformer, alternating dense and locally banded sparse attention across layers. Apologies: I have not studied that model in detail yet, so I will not discuss it here. All eight models use a context window of 2048 tokens, meaning GPT-3 accepts at most 2048 tokens of input. The author also briefly mentions the engineering work of partitioning the model's depth and width across multiple machines in parallel, which spreads out the compute burden.
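Although the Sparse Transformer details are out of scope here, the flavor of a "locally banded" attention pattern is easy to sketch. The following mask is purely illustrative, not OpenAI's implementation:

```python
import numpy as np

def banded_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Illustrative locally banded causal attention mask.

    mask[i, j] is True when position i may attend to position j:
    only previous positions (causal) and only within a local window.
    GPT-3 reportedly alternates dense and locally banded sparse
    patterns across layers; this shows just the banded variant.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Each position attends to at most `window` previous tokens,
# so attention cost drops from O(n^2) toward O(n * window).
print(banded_causal_mask(6, 3).astype(int))
```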

2.2 Training Dataset

A model as large as GPT-3 forces you to think about big data. In this section the authors explain that their main data source is a filtered sample of Common Crawl.

In the GPT-2 paper the authors said this data set was too low-quality to use and built a new data set by other means. For GPT-3, however, training such a large model left them no choice but to use it. To raise its average quality, they took three steps: (1) download Common Crawl and filter it for similarity to a collection of high-quality reference corpora. Simply put, the high-quality reference corpora serve as a yardstick for selecting the good parts of Common Crawl. One approach I have seen described online is binary classification: treat raw Common Crawl data as negative samples and the high-quality corpora as positive samples, train a binary classifier, then use it to filter Common Crawl. Any method that does the job is fine; this is essentially a data feature-engineering task. (2) Fuzzy deduplication at the document level, to prevent redundancy and to preserve the integrity of the held-out validation set as an accurate measure of overfitting. Deduplication at this scale almost certainly uses some hashing method, something like LSH. (3) Add known high-quality reference corpora to the training mix to increase the diversity and richness of Common Crawl, for example the corpora used by GPT-2, BERT, and so on.
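The paper does not disclose how the fuzzy deduplication was implemented, so treat the following as an assumption: a toy, self-contained MinHash sketch, which is the usual ingredient of LSH-style document deduplication:

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """Toy MinHash signature over word 3-shingles (illustrative only;
    the paper does not disclose its exact fuzzy-dedup method)."""
    words = text.lower().split()
    shingles = {" ".join(words[k:k + 3])
                for k in range(max(1, len(words) - 2))}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river"
print(est_jaccard(minhash_signature(a), minhash_signature(b)))  # near 1.0
```

Two documents whose estimated similarity exceeds some threshold would be treated as near-duplicates, and one of them dropped.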

As the table above shows, although Common Crawl has by far the largest volume, it is sampled at a lower rate than the smaller data sets below it. Presumably the authors still consider Common Crawl lower quality, and therefore down-weighted its sampling ratio.

The author notes a major methodological problem for language models pre-trained on large amounts of Internet data, especially large models capable of memorizing a great deal of content: the test or development sets may be inadvertently seen during pre-training, contaminating downstream evaluation. In effect the model can "cheat", which easily leads to inflated performance numbers. Although the authors did some deduplication, a bug in the filter left it incomplete, so there remain many overlaps between training data and test data in the GPT-3 corpus. Retraining was not an option because of the cost, so the authors leave this problem to future work.

2.3 Training Process

This section briefly introduces the training process (without real detail); there is not much worth dwelling on.

This paragraph mainly tells us that the authors used V100 GPUs for distributed training. There is not much else; the details are in Appendix B:

The following points can be extracted from Appendix B:

  • The Adam optimizer is used with β1 = 0.9, β2 = 0.95, ε = 10^-8;
  • The learning rate follows a cosine decay from its initial value down to 10% over the first 260 billion tokens and then stays constant, with a linear warmup over the first 375 million tokens (see the sketch after this list);
  • Depending on the model size, the batch size is increased linearly from a small value (32k tokens) to its full value over the first 4 to 12 billion tokens of training;
  • Data sampling in the training phase is done without replacement;
  • All models use a weight decay of 0.1 for regularization;
  • The input length is capped at 2048 tokens. When a document has fewer than 2048 tokens, multiple documents are packed into one sequence so that every input sequence is a full 2048 tokens; when a sequence contains multiple documents, a special end-of-text token separates them.
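Here is how I read that schedule, as a small sketch (the 6.0e-4 peak is GPT-3 Small's value from the model table; every size has its own peak, and the exact training code is not public):

```python
import math

def gpt3_lr(tokens_seen: float, peak_lr: float = 6.0e-4,
            warmup: float = 375e6, decay_end: float = 260e9) -> float:
    """Sketch of the Appendix-B schedule: linear warmup over the first
    375M tokens, cosine decay to 10% of the peak over the first 260B
    tokens, then flat at 10% for the remainder of training."""
    if tokens_seen < warmup:                  # linear warmup
        return peak_lr * tokens_seen / warmup
    if tokens_seen < decay_end:               # cosine decay to 10%
        progress = (tokens_seen - warmup) / (decay_end - warmup)
        return peak_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))
    return 0.1 * peak_lr                      # flat tail

for t in (1e6, 375e6, 130e9, 260e9, 300e9):
    print(f"{t:.0e} tokens -> lr {gpt3_lr(t):.2e}")
```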

2.4 Evaluation

This section introduces the experimental evaluation. The distinctive point is the use of in-context learning: since no fine-tuning is involved, GPT-3 is evaluated by taking the pre-trained model directly and conditioning it on different numbers of examples drawn for each task.
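As I understand the setup, the evaluation loop might look roughly like this (an assumption on my part, reusing the illustrative build_prompt helper from the sketch in Section 2; model_generate stands in for a real model call):

```python
import random

def evaluate_few_shot(model_generate, train_set, test_set, k: int,
                      task_description: str) -> float:
    """For each test instance, draw K conditioning examples from the
    task's training set, pack them into the prompt, and score the
    completion. The model itself is never updated."""
    correct = 0
    for source, target in test_set:
        demos = random.sample(train_set, k)    # K demonstrations
        prompt = build_prompt(task_description, demos, source)
        prediction = model_generate(prompt)    # forward pass only
        correct += (prediction.strip() == target)
    return correct / len(test_set)
```

Setting k = 0 or k = 1 here would give the zero-shot and one-shot evaluations.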

At this point the most important parts of GPT-3 have been covered. What follows in the paper is the introduction of the various experimental tasks, plus a pile of appendices; interested readers can pick and choose according to their own needs.

Summary

To summarize: in GPT-1 the authors proposed the paradigm of unsupervised pre-training plus task-specific fine-tuning, reshaping the input into a task-specific format; this became the training recipe for later models such as BERT and T5. In GPT-2 they described all task-specific inputs in natural language instead of special symbols, a forerunner of today's instruction learning, and pushed the zero-shot application of language models. GPT-3 changed the paradigm again: train directly on a massive corpus with a huge parameter count, skip task-specific fine-tuning entirely, and demonstrate the few-shot gains from scaling both data and parameters. This is the basic approach used by today's major language models. To sum up in one sentence: Awesome!

Origin blog.csdn.net/qq_36583400/article/details/132889472