GPT-3 Summary

GPT-3 has been getting a lot of attention recently. Based on the GPT-3 paper, this article summarizes GPT-3 together with the other few-shot learning methods the paper mentions.

GPT-3:

Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. We find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.

Task-Agnostic Meta-Learning (the following is excerpted from an article by the paper's author, Guojun Qi (齐国君))

Gradient-descent-based training algorithms have two hyperparameters that are not learnable within the traditional machine-learning framework: (1) the initial model parameters, and (2) the step size of each update. The initial parameters are usually obtained by random initialization, but because most deep learning models are non-convex, the outcome of training depends heavily on those random initial conditions: a good initialization has a very large influence on how well the model learns. One important use of meta-learning is therefore to learn an initialization that suits multiple tasks, so that for these training tasks, and for the larger family of future tasks they represent, updating the model from this initialization reaches a good new model faster and more reliably. "Faster" here means that only a small number of training samples and a handful of gradient-descent steps are needed to obtain a suitable model for a new task (i.e., few-shot learning).
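The canonical version of this idea is MAML-style learning of the initialization. Below is a minimal sketch on a made-up family of 1-D quadratic tasks; the task family, learning rates, and analytic gradients are illustrative assumptions, not the setup of any particular paper:

```python
import numpy as np

# Minimal MAML-style sketch: learn an initialization theta that adapts
# quickly to a family of 1-D quadratic tasks L_t(w) = (w - c_t)^2.
# Task distribution, learning rates, and loop sizes are illustrative.

rng = np.random.default_rng(0)
theta = 5.0                      # shared initialization (the meta-learned parameter)
inner_lr, outer_lr = 0.1, 0.01

for step in range(1000):
    c = rng.uniform(-1.0, 1.0)           # sample a task (its optimum c_t)
    # Inner loop: one gradient step from the shared initialization.
    grad_inner = 2.0 * (theta - c)       # dL/dw evaluated at w = theta
    w_adapted = theta - inner_lr * grad_inner
    # Outer loop: update theta to reduce the post-adaptation loss.
    # dL(w_adapted)/dtheta = 2*(w_adapted - c) * (1 - 2*inner_lr)
    grad_outer = 2.0 * (w_adapted - c) * (1.0 - 2.0 * inner_lr)
    theta -= outer_lr * grad_outer

print(f"meta-learned initialization: {theta:.3f}")  # drifts toward the task mean, 0
```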

Classical meta-learning methods overlook an important problem when learning an optimal initial model across multiple tasks: how to guarantee that the learned initialization is unbiased with respect to all tasks. A very likely scenario is that the initialization is more effective for some tasks and much less so for others; in that case, the meta-learner is biased across tasks.

To address this, the authors propose a task-agnostic, unbiased meta-learning method. They add a regularization term to the initial model so that it treats different tasks "equally". Concretely, for a classification task, task-agnosticism can be enforced directly by maximizing the entropy of the initial model's predictions over the classes (entropy maximization). For general tasks such as regression or reinforcement learning, on the other hand, the task is typically defined and optimized through a loss function or a reward function. If we regard the negative loss or the reward as the "income" assigned to each task, we can borrow income-inequality measures from economics to characterize the meta-learner's bias across tasks. For example, the widely used Gini coefficient can measure the meta-learner's bias across tasks; alternatives include the generalized entropy (GE) index and the Theil index. These inequality measures have different characteristics and can focus on tasks within particular loss or reward ("income") ranges. They also satisfy several properties that make them well suited as inequality measures, such as symmetry, scale invariance, non-negativity, and the transfer principle. By minimizing an inequality measure, we obtain a meta-learner that is unbiased across tasks.
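To make the inequality idea concrete, here is a small sketch of the Gini coefficient computed over per-task losses treated as "incomes"; the loss values and the way it is folded into the meta-objective are made-up illustrations, and TAML could equally use the GE or Theil index here:

```python
import numpy as np

def gini(values: np.ndarray) -> float:
    """Gini coefficient of non-negative 'incomes' (here: per-task losses).
    0 means perfect equality across tasks; values near 1 mean strong bias."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    # Standard formula over sorted values: G = 2*sum(i*v_i)/(n*sum(v)) - (n+1)/n
    index = np.arange(1, n + 1)
    return 2.0 * np.sum(index * v) / (n * np.sum(v)) - (n + 1.0) / n

# Made-up post-adaptation losses of a meta-learner on four tasks.
print(gini(np.array([0.9, 1.1, 1.0, 1.05])))   # small Gini -> low task bias
print(gini(np.array([0.1, 0.1, 0.1, 3.0])))    # one task left behind -> large Gini

# In this scheme the meta-objective becomes, schematically:
#   mean(task_losses) + lam * gini(task_losses)
```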

The problem with this approach, in the words of the GPT-3 paper, is that "this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples." This leads to the following problems:

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. (That is, fine-tuning still needs a sizable dataset, and many tasks simply cannot provide fine-tuning data.)

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions.

For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model.
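To make the three settings concrete, here is a sketch that assembles the corresponding prompts as plain text, following the "task description + demonstrations + prompt" format of the paper's translation examples; the helper function name and the exact demonstration pairs are illustrative:

```python
# Sketch: assembling zero-, one-, and few-shot prompts as plain text,
# in the paper's "task description + demonstrations + prompt" format.

task_description = "Translate English to French:"
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]
query = "plush giraffe =>"

def build_prompt(n_shots: int) -> str:
    """n_shots = 0 (zero-shot), 1 (one-shot), or more (few-shot)."""
    lines = [task_description]
    lines += [f"{src} => {tgt}" for src, tgt in demonstrations[:n_shots]]
    lines.append(query)
    return "\n".join(lines)

print(build_prompt(0))  # instruction only: zero-shot
print(build_prompt(1))  # instruction + one demonstration: one-shot
print(build_prompt(3))  # instruction + several demonstrations: few-shot
```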

As the figure in the paper shows: Model performance improves with the addition of a natural language task description, and with the number of examples in the model's context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these "learning" curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.

We also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero-, one- and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

Advantages and disadvantages of fine-tuning:

The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task and the potential for poor generalization out-of-distribution. The paper specifically notes that GPT-3 itself can in principle be fine-tuned, and that this is one of its future research directions.

Why single out one-shot from few-shot and zero-shot? Because one-shot most closely matches how tasks are typically communicated to humans.

GPT-3's architecture:

We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
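The paper gives no code for this, but the alternation of dense and locally banded (sliding-window) causal attention can be sketched as boolean masks; the window size, sequence length, and even/odd layer assignment below are assumptions for illustration only:

```python
import numpy as np

def attention_mask(seq_len: int, layer_idx: int, window: int = 4) -> np.ndarray:
    """Sketch of alternating attention patterns (sizes are illustrative).
    Even layers: dense causal attention (attend to all previous tokens).
    Odd layers: locally banded causal attention (only the last `window` tokens)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to future tokens
    if layer_idx % 2 == 0:
        return causal                        # dense layer
    return causal & (i - j < window)         # locally banded layer

print(attention_mask(8, 0).astype(int))  # dense causal pattern
print(attention_mask(8, 1).astype(int))  # banded causal pattern
```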

Below is Jay Alammar's introduction to GPT-3, starting with how the model is trained:

The model is presented with an example. We only show it the features and ask it to predict the next word.

The model’s prediction will be wrong. We calculate the error in its prediction and update the model so next time it makes a better prediction.

Repeat millions of times:
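In code, this loop is plain next-word prediction. A minimal PyTorch-style sketch, where the tiny model, the random "dataset", and all sizes are placeholders standing in for the real transformer and corpus:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 32      # toy sizes (GPT-3: 50257 BPE tokens, 2048 context)
model = torch.nn.Sequential(        # stand-in for the 96-layer transformer
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Flatten(),
    torch.nn.Linear(64 * seq_len, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(1_000_000):                             # "repeat millions of times"
    tokens = torch.randint(vocab_size, (8, seq_len + 1))  # placeholder batch
    features, target = tokens[:, :-1], tokens[:, -1]      # context -> next word
    logits = model(features)                              # the model's prediction
    loss = F.cross_entropy(logits, target)                # how wrong was it?
    optimizer.zero_grad()
    loss.backward()                                       # compute the error signal
    optimizer.step()                                      # update the model
```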

How does a system process the word “robotics” and produce “A”?

High-level steps:

  1. Convert the word to a vector (list of numbers) representing the word
  2. Compute prediction
  3. Convert resulting vector to word
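These three steps correspond to the embedding lookup, the transformer stack, and the unembedding projection. A toy numpy sketch, where the vocabulary, sizes, and the tanh "layer" are all made-up stand-ins for the real components:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["robotics", "A", "the", "robot"]           # toy vocabulary
d_model = 8
embedding = rng.normal(size=(len(vocab), d_model))  # step-1 lookup table

def predict_next(word: str) -> str:
    # 1. Convert the word to a vector (list of numbers) representing it.
    x = embedding[vocab.index(word)]
    # 2. Compute prediction (stand-in for the stack of transformer layers).
    h = np.tanh(x)
    # 3. Convert the resulting vector back to a word: score it against all
    #    token embeddings and pick the highest-scoring one.
    logits = embedding @ h
    return vocab[int(np.argmax(logits))]

print(predict_next("robotics"))  # untrained weights, so the output is arbitrary
```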

See all these layers? This is the “depth” in “deep learning”.

Each of these layers has its own 1.8B parameters to make its calculations. That is where the "magic" happens. This is a high-level view of that process:
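As a sanity check on that 1.8B figure: a transformer layer has roughly 12·d_model² weight parameters (4·d² in attention, 8·d² in the 4×-wide MLP, ignoring biases and layer norms), and with GPT-3's published d_model = 12288 and 96 layers this reproduces both the per-layer and total counts:

```python
# Back-of-the-envelope parameter count per GPT-3 transformer layer.
d_model, n_layers = 12288, 96
per_layer = 12 * d_model ** 2                   # 4*d^2 attention + 8*d^2 MLP
print(f"{per_layer / 1e9:.2f}B per layer")      # ~1.81B
print(f"{per_layer * n_layers / 1e9:.0f}B total")  # ~174B, close to 175B
```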


Reposted from blog.csdn.net/weixin_42137700/article/details/107860376