Language Models are Unsupervised Multitask Learners: Paper Reading Notes

These notes record the points I found important after reading the full GPT-2 paper. Most of the content is quoted directly from the paper; the rest records my own thoughts and the parts I did not understand while reading. Corrections and pointers are welcome!

Before reading the GPT-2 paper, Zhang Junlin's informal overview 《效果惊人的GPT 2.0模型:它告诉了我们什么》 (https://zhuanlan.zhihu.com/p/56865533) makes a good warm-up; it is really well written.

Abstract

The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks.

Introduction

  • Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.
  • Multi-task learning
    • The two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively.
    • From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives.
    • Cons: Multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques.

Approach

  • Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution P(output | input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model P(output | input, task).
  • Task conditioning is often implemented at an architecture level, such as task specific encoders and decoders. However, language provides a flexible way to specify tasks, inputs and outputs all as a sequence of symbols. (For example, a translation training example can be written as the sequence (translate to french, english text, french text). Since training this model means making it learn language itself, it can naturally also understand a task instruction expressed in language.)
  • Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. The problem instead becomes whether we are able to optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toyish setup, but learning is much slower than in explicitly supervised approaches.
  • Training dataset
    • Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.
    • Web scrapes: while these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues (much of the content is mostly unintelligible). We also do not want to select a subsample of datasets that are most similar to target datasets for some tasks, since we want to avoid making assumptions about the tasks to be performed ahead of time.
    • Instead, we only scraped web pages which have been curated/filtered by humans. As a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
  • Input representation
    • Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.
    • Our input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
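
To make the BPE bullet above concrete, here is a toy sketch of the merge-learning loop. It works on characters rather than raw bytes, omits GPT-2's extra rule that prevents merges across character categories, and the corpus and merge count are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(vocab: Counter) -> tuple:
    """Count adjacent symbol pairs across the (word -> frequency) vocabulary; return the most frequent pair."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def apply_merge(vocab: Counter, pair: tuple) -> Counter:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = Counter()
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

# Toy corpus: each word is split into characters and carries a frequency.
vocab = Counter({tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3})
merges = []
for _ in range(8):                       # learn 8 merges for illustration
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = apply_merge(vocab, pair)
print(merges)   # frequent symbol sequences end up as single units, rare ones stay split into pieces
```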

Model

  • Changes inside the Transformer blocks
    • Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network.
    • An additional layer normalization was added after the final self-attention block.
  • Modified initialization
    A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N, where N is the number of residual layers. (A sketch of this and the LayerNorm change follows at the end of this section.)
  • Vocabulary, context and batch size
    The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens, and a larger batch size of 512 is used.
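
A minimal PyTorch sketch of the two changes above: LayerNorm moved to the input of each sub-block, and residual-path weights scaled by 1/√N at initialization. This is my own illustrative reconstruction rather than the released GPT-2 code; it uses nn.MultiheadAttention as a stand-in for GPT-2's attention and omits causal masking, position embeddings, and the extra LayerNorm after the final block.

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with layer normalization at the *input* of each sub-block."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)                                    # pre-norm, like a pre-activation ResNet
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual connection around attention
        x = x + self.mlp(self.ln2(x))                      # residual connection around the MLP
        return x

def scale_residual_weights(blocks) -> None:
    """Scale the projections that write into the residual path by 1/sqrt(N), N = number of residual layers."""
    n = len(blocks)
    for block in blocks:
        for proj in (block.attn.out_proj, block.mlp[-1]):
            proj.weight.data.mul_(1.0 / math.sqrt(n))

blocks = nn.ModuleList([PreLNBlock(768, 12) for _ in range(12)])
scale_residual_weights(blocks)

x = torch.randn(2, 64, 768)      # (batch, tokens, d_model); the real model uses a 1024-token context
for block in blocks:
    x = block(x)
```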

Experiments

  • All models still underfit WebText and held-out perplexity has as of yet improved given more training time.

  • LM

    • Understand how WebText LMs perform at zero-shot domain transfer on the primary task they are trained for, language modeling. Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark. (A de-tokenizer is still needed, though: many benchmarks were heavily pre-processed during dataset construction and contain artificial symbols such as <UNK>, which are extremely rare in WebText. A toy de-tokenizer sketch follows below.)
    • Results: four model sizes are evaluated (117M, 345M, 762M and 1542M parameters). The largest improved the state of the art on 7 out of 8 datasets; the smallest improved it on only 4.
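
On the de-tokenizer point above, the sketch below shows the kind of heuristic, reversible clean-up that can be applied to a heavily pre-processed benchmark before scoring it with a byte-level LM. These specific rules are my own illustration, not the paper's actual invertible de-tokenizers.

```python
import re

def detokenize(text: str) -> str:
    """Heuristically undo common PTB-style tokenization artifacts (illustrative rules only)."""
    text = text.replace("<unk>", "")                    # artificial tokens that are extremely rare in WebText
    text = text.replace(" n't", "n't").replace(" 's", "'s").replace(" 're", "'re")
    text = re.sub(r"\s+([.,;:!?%)])", r"\1", text)      # no space before closing punctuation
    text = re.sub(r"([($])\s+", r"\1", text)            # no space after opening bracket / dollar sign
    return re.sub(r"\s{2,}", " ", text).strip()

print(detokenize("the company 's profit was n't low ( about 5 % )"))
# -> "the company's profit wasn't low (about 5%)"
```
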
  • Children’s book test

    • Rather than reporting perplexity as an evaluation metric, CBT reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct. (A scoring sketch follows below.)
    • Data overlap analysis showed one of the CBT test set books, The Jungle Book by Rudyard Kipling, is in WebText, so we report results on the validation set which has no significant overlap. (This also helps me understand why Wikipedia content was said to be excluded from WebText: other models had already been trained on Wikipedia, so by not training on it GPT-2 can stand up straight and say: I never saw your domain-specific data during training, yet I still beat you on the same tasks. What does that show? That I am more general!)
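
The cloze evaluation above can be scored by substituting each of the 10 candidates into the blank and keeping the one whose completed passage gets the highest probability from the LM. The sketch below uses the Hugging Face transformers GPT-2 checkpoint purely as a convenient stand-in (it is not the paper's evaluation code) and assumes the blank is marked with CBT's XXXXX placeholder; the passage and candidates are invented.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def total_logprob(text: str) -> float:
    """Sum of next-token log-probabilities of `text` under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss          # mean per-token negative log-likelihood
    return -loss.item() * (ids.shape[1] - 1)

def cbt_predict(context: str, query: str, candidates: list) -> str:
    """Fill the XXXXX blank with each candidate and return the highest-scoring one."""
    return max(candidates,
               key=lambda w: total_logprob(context + " " + query.replace("XXXXX", w)))

context = "The fox was hungry. It crept toward the quiet farm at night."
query = "At dawn the farmer found the XXXXX near the empty henhouse."
print(cbt_predict(context, query, ["fox", "piano", "ocean", "theory"]))
```
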
  • LAMBADA

    • The LAMBADA dataset (Paperno et al., 2016) tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict.
    • Investigating GPT-2's errors showed most predictions are valid continuations of the sentence, but are not valid final words. This suggests that the LM is not using the additional useful constraint that the word must be the final word of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy. (A sketch of this filter follows below.)
    • Results: perplexity improves from 99.8 to 8.6, and accuracy from 19% to 52.66%.
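
A sketch of the stop-word filter idea above: take the model's distribution over the next token and ban words that are plausible continuations but unlikely sentence-final words before picking the prediction. The Hugging Face GPT-2 interface and the tiny stop-word list are stand-ins of my own, not the paper's setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Toy stop-word list: valid continuations, but rarely valid *final* words of a sentence.
STOPWORDS = ["the", "a", "an", "and", "of", "to", "in", "that", "his", "her"]
BANNED_IDS = [tok.encode(" " + w)[0] for w in STOPWORDS]   # leading space: GPT-2's BPE folds it into the token

@torch.no_grad()
def predict_final_word(context: str) -> str:
    ids = tok(context, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]          # next-token distribution after the context
    logits[BANNED_IDS] = float("-inf")         # approximate the "must be sentence-final" constraint
    return tok.decode([int(logits.argmax())]).strip()

print(predict_final_word("He put the key in the lock, turned it, and slowly opened the"))
```
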
  • Winograd Schema Challenge

    • This challenge was constructed to measure the capability of a system to perform commonsense reasoning by measuring its ability to resolve ambiguities in text.
    • Results: only the two larger model sizes exceed the previous state of the art.
  • Reading comprehension

    • The Conversational Question Answering dataset (CoQA) Reddy et al. (2018) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document.
    • Results: the performance looks underwhelming. GPT-2 reaches about 55 F1, matching or exceeding 3 of 4 baselines, but remains well below the supervised BERT-based state of the art, which is nearing the 89 F1 of humans.
  • Summarization

    • To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-k random sampling (Fan et al., 2018) with k = 2 which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary.
    • On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article.
    • GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.
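
A sketch of the TL;DR: prompt and the top-k (k = 2) sampling loop described above, again written against the Hugging Face GPT-2 interface for illustration rather than the paper's code; the article string and the crude sentence split are placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def generate_topk(ids: torch.Tensor, k: int = 2, n_tokens: int = 100) -> str:
    for _ in range(n_tokens):
        logits = model(ids).logits[0, -1]
        top_vals, top_idx = logits.topk(k)                         # keep only the k most likely tokens
        choice = top_idx[torch.multinomial(top_vals.softmax(-1), 1)]
        ids = torch.cat([ids, choice.view(1, 1)], dim=1)
    return tok.decode(ids[0, -n_tokens:])

article = "..."                                                    # the article to summarize (placeholder)
ids = tok(article + "\nTL;DR:", return_tensors="pt").input_ids     # task hint appended after the article
generated = generate_topk(ids)
summary = ". ".join(generated.split(". ")[:3])                     # keep the first 3 generated sentences
print(summary)
```
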
  • Translation

    • In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format English sentence = french sentence and then after a final prompt of English sentence = we sample from the model with greedy decoding and use the first generated sentence as the translation.
    • Results: GPT-2 achieves 11.5 BLEU on the WMT-14 French-English test set. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019).
    • Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language, which is approximately 500x smaller than the monolingual French corpus common in prior unsupervised machine translation research. (My question: if the model essentially never saw French during training, how can it translate at all? I do not really get it. Or is the point that, even though the French data is only the tiny fraction that slipped through filtering and is dwarfed by the English corpus, the model can still manage some translation?)
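
A sketch of the few-shot conditioning format for translation described above. The example pairs and the final prompt are made up (the paper conditions on pairs from the WMT data), and the Hugging Face interface is again just a convenient stand-in.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Conditioning context in the "english sentence = french sentence" format, then a final prompt.
pairs = [
    ("The house is blue.", "La maison est bleue."),
    ("I like to read books.", "J'aime lire des livres."),
]
prompt = "\n".join(f"{en} = {fr}" for en, fr in pairs) + "\nWhere is the train station? ="

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=False,   # greedy decoding
                     pad_token_id=tok.eos_token_id)
generated = tok.decode(out[0, ids.shape[1]:])
translation = generated.strip().split("\n")[0]                  # use the first generated sentence
print(translation)
```
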
  • Question answering

    • Similar to translation, the context of the language model is seeded with example question answer pairs which helps the model infer the short answer style of the dataset. (I have not fully understood how this seeding works. Is it just showing the model what the answer format for this kind of question looks like? What is the mechanism behind the seed? See the sketch below.)
    • GPT-2 answers 5.3 times more questions correctly, suggesting that model capacity has been a major factor in the poor performance of neural systems on this kind of task as of yet. (But doesn't this just show that GPT-2 has a huge capacity and a strong memory? If it can "remember" the sentences it has seen across such a massive corpus, much like memorizing all the answers in Wikipedia, then of course it can answer a lot of questions.)
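
On the seeding question above: the "seed" is nothing more than a text prefix of worked question-answer pairs, so the model infers both the task and the short-answer style and simply continues the pattern; its weights are never updated. A minimal illustration with questions I made up (the paper evaluates on the Natural Questions dataset):

```python
# Build the conditioning context: a few worked Q/A pairs followed by the real question.
examples = [
    ("Q: Who wrote the play Romeo and Juliet?", "A: William Shakespeare"),
    ("Q: What is the capital of France?", "A: Paris"),
]
question = "Q: Who developed the theory of general relativity?"
prompt = "\n".join(q + "\n" + a for q, a in examples) + "\n" + question + "\nA:"
print(prompt)
# Feeding this prompt to the LM and greedily decoding a short continuation yields the answer;
# the example pairs only set up the pattern to imitate.
```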

Generalization vs. Memorization

  • Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems.
  • To study this we created Bloom filters containing 8-grams of WebText training set tokens. To improve recall, strings were normalized to contain only lower-cased alphanumeric words with a single space as a delimiter. These Bloom filters let us calculate, given a dataset, the percentage of 8-grams from that dataset that are also found in the WebText training set.
  • Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. Performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways.
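
A sketch of the 8-gram overlap check above. The normalization follows the paper's description, while the Bloom filter itself is a simple hand-rolled one (the paper does not specify its implementation) and the file names are placeholders.

```python
import hashlib
import re

class BloomFilter:
    """A tiny Bloom filter: each item sets/checks n_hashes positions in a fixed bit array."""
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 4):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams_8(text: str) -> set:
    # Normalization from the paper: lower-cased alphanumeric words with a single-space delimiter.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + 8]) for i in range(len(words) - 7)}

train_filter = BloomFilter()
for gram in ngrams_8(open("webtext_train.txt").read()):      # placeholder file names
    train_filter.add(gram)

eval_grams = ngrams_8(open("eval_dataset.txt").read())
overlap = sum(g in train_filter for g in eval_grams) / max(len(eval_grams), 1)
print(f"{overlap:.1%} of evaluation 8-grams also occur in the WebText training set")
```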

Discussions

  • Direction 1: keep exploring the two-stage pre-training + fine-tuning paradigm. While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning.
  • Direction 2: stubbornly stick with a unidirectional language model and still beat BERT. Given the prior success of fine-tuning GPT, we plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT (Devlin et al., 2018). (Is this an open challenge to BERT? "I will keep using a unidirectional LM, and I will surpass you with an even more gigantic Transformer and a boundlessly large training corpus.")

Original paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Reposted from blog.csdn.net/weixin_43928665/article/details/90168800