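Google AI Paper: BERT, a Bidirectional Encoder Representation Model That Sets New State-of-the-Art Results on 11 NLP Benchmarks, Including Machine Reading Comprehension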

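DataSimp guide: The Google AI Language paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" introduces a new language representation model, BERT (Bidirectional Encoder Representations from Transformers). Unlike recent language representation models, BERT pre-trains deep bidirectional representations conditioned on both left and right context in all layers. BERT is the first representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks. It outperforms many systems with task-specific architectures and sets new state-of-the-art results on 11 NLP tasks, which arguably makes it the strongest NLP pre-training model so far and a possible foundation for the field. This article presents the full text of the BERT paper, prepared from a Chinese-English parallel translation. A simplified version of the BERT source code was released on October 30; we will analyze it later when time permits. Happy studying! To advance human civilization, one cannot stop at knocking on doors and shouting; designs that are all fantasy and cannot be realized waste a lifetime; engineering ability is crucial. Qin Longji (秦陇纪) offers this as shared encouragement.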


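A. Google AI Paper: the BERT Bidirectional Encoder Representation Model

1. Introduction

2. Related Work

3. BERT: Bidirectional Encoder Representations from Transformers

4. Experiments

5. Ablation Studies

6. Conclusion

References

B. BERT Surpasses Human Performance on Machine Reading Comprehension and Leads 11 NLP Tasks

1. Main Contributions of the BERT Model

2. How BERT Differs from the Other Two Models

References; Appendix: About the DataSimp Community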

A. Google AI Paper: the BERT Bidirectional Encoder Representation Model

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Text: the BERT authors, Google AI Language. Translation: Qin Longji (秦陇纪), DataSimp, 2018-10-13 (Sat) to 2018-11-03 (Sat)

Paper: https://arxiv.org/pdf/1810.04805.pdf

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Affiliation: Google AI Language, {jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0.

Besides the abstract above, the paper has six sections (Introduction, Related Work, BERT, Experiments, Ablation Studies, Conclusion), followed by 42 references.

1. Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token level. (Translator's note 1: in everyday usage "token" means a symbol, sign, keepsake, or voucher, and is more formal than "sign". It also has a linguistic sense, a linguistic symbol, and a computing sense, usually rendered in Chinese as 词块. The translator finds 符标 a better fit, but follows the common NLP rendering 词块.)

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be devastating when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where it is crucial to incorporate context from both directions.

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT addresses the previously mentioned unidirectional constraints by proposing a new pre-training objective: the "masked language model" (MLM), inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a "next sentence prediction" task that jointly pre-trains text-pair representations.

The contributions of our paper are as follows:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• We show that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

• BERT advances the state of the art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be available at goo.gl/language/bert.1

(Note 1: Will be released before the end of October 2018.)

2. Related Work

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 Feature-based Approaches

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension. They propose to extract context-sensitive features from a language model. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018) including question answering (Rajpurkar et al., 2016) on SQuAD, sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 Fine-tuning Approaches

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on an LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).

2.3 Transfer Learning from Supervised Data

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).

3. BERT: Bidirectional Encoder Representations from Transformers

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training procedures and fine-tuning procedures are detailed in Sections 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.

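3.1 Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.2 (Note 2: https://github.com/tensorflow/tensor2tensor) Because the use of Transformers has become ubiquitous recently and our implementation is effectively identical to the original, we omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as "The Annotated Transformer".3 (Note 3: http://nlp.seas.harvard.edu/2018/04/03/attention.html)

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to 4H, i.e., 3072 for H=768 and 4096 for H=1024. We primarily report results on two model sizes:

• BERT_BASE: L=12, H=768, A=12, total parameters=110M

• BERT_LARGE: L=24, H=1024, A=16, total parameters=340M

BERT_BASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left. We note that in the literature the bidirectional Transformer is often referred to as a "Transformer encoder", while the left-context-only version is referred to as a "Transformer decoder" since it can be used for text generation. The comparisons between BERT, OpenAI GPT and ELMo are shown visually in Figure 1.

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers.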
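The parameter totals quoted for the two model sizes in Section 3.1 can be roughly reproduced from L, H and the 4H feed-forward size. The sketch below is only an estimate under stated assumptions: the 30,000-entry vocabulary and 512 positions come from Section 3.2, while the exact layout (biases on every projection, two LayerNorms per layer, one LayerNorm after the embeddings, and a dense "pooler" over the [CLS] vector) is assumed here rather than taken from the text. The function name `encoder_params` is hypothetical.

```python
def encoder_params(num_layers, hidden, vocab=30000, max_pos=512, segments=2):
    """Rough parameter count for a BERT-style Transformer encoder (assumed layout)."""
    ffn = 4 * hidden  # feed-forward/filter size is 4H in the text
    # Token, position and segment embedding tables, plus one LayerNorm (gain + bias).
    embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden
    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: H -> 4H -> H, each projection with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with gain and bias (assumption).
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    # Dense layer applied to the [CLS] vector (assumption).
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = encoder_params(num_layers=12, hidden=768)
large = encoder_params(num_layers=24, hidden=1024)
print(f"BERT_BASE  ~ {base / 1e6:.1f}M parameters")
print(f"BERT_LARGE ~ {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimates come out at about 109M and 335M, close to the rounded 110M and 340M figures. The number of attention heads A does not appear in the count because splitting H across heads does not change the size of the projection matrices.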

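3.2 Input Representation

Our input representation is able to unambiguously represent either a single text sentence or a pair of text sentences (e.g., [Question, Answer]) in one token sequence.4 (Note 4: Throughout this work, a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.) For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings. A visual representation of our input representation is given in Figure 2.

Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

The specifics are:

• We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.

• We use learned positional embeddings with supported sequence lengths of up to 512 tokens.

• The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state (i.e., the output of the Transformer) corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.

• Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence.

• For single-sentence inputs we only use the sentence A embeddings.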
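The packing rules in Section 3.2 can be sketched in a few lines. This is a toy illustration, not the released implementation: the hidden size of 4, the randomly initialized tables, the example sentences and the helper names (`pack_pair`, `make_table`, `input_embeddings`) are all assumptions made for the example. It also assumes the first [SEP] is assigned to sentence A and the second to sentence B.

```python
import random

HIDDEN = 4            # toy hidden size; the text uses H=768 or H=1024
MAX_POSITIONS = 512   # supported sequence length from the text


def pack_pair(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into one sequence with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                  # sentence A embedding index
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)     # sentence B embedding index
    if len(tokens) > MAX_POSITIONS:
        raise ValueError("sequence is longer than the supported 512 positions")
    return tokens, segment_ids


def make_table(rows, rng):
    """Stand-in for a learned embedding table: rows x HIDDEN random values."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]


def input_embeddings(tokens, segment_ids, vocab, token_table, segment_table, position_table):
    """Input representation = token embedding + segment embedding + position embedding."""
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t, s, p = token_table[vocab[tok]], segment_table[seg], position_table[pos]
        out.append([a + b + c for a, b, c in zip(t, s, p)])
    return out


tokens, segment_ids = pack_pair(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}  # toy vocabulary
rng = random.Random(0)
token_table = make_table(len(vocab), rng)
segment_table = make_table(2, rng)
position_table = make_table(MAX_POSITIONS, rng)
embedded = input_embeddings(tokens, segment_ids, vocab, token_table, segment_table, position_table)
print(tokens)
print(segment_ids)
```

For a single-sentence input, calling `pack_pair(tokens_a)` yields segment ids that are all 0, matching the rule that only the sentence A embedding is used.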

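3.3 Pre-training Tasks

Unlike Peters et al. (2018) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two novel unsupervised prediction tasks, described in this section.

3.3.1 Task #1: Masked LM

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself" in a multi-layered context.

In order to train a deep bidirectional representation, we take a straightforward approach of masking some percentage of the input tokens at random, and then predicting only those masked tokens. We refer to this procedure as a "masked LM" (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.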

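Although this does allow us to obtain a bidirectional pre-trained model, there are two downsides to such an approach. The first is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence "my dog is hairy" it chooses "hairy". It then performs the following procedure:

• Rather than always replacing the chosen words with [MASK], the data generator will do the following:

• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]

• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple

• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.

The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model's language understanding capability.

The second downside of using an MLM is that only 15% of tokens are predicted in each batch, which suggests that more pre-training steps may be required for the model to converge. In Section 5.3 we demonstrate that MLM does converge marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.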
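The 80%/10%/10% replacement procedure described above can be sketched as a small data-generation routine. This is a minimal illustration under assumptions, not the authors' generator: the function name `mask_tokens`, the toy vocabulary, the choice to skip [CLS] and [SEP], and the rule of always selecting at least one token are all introduced here for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # assumption: special tokens are never selected


def mask_tokens(tokens, vocab_words, rng, mask_rate=0.15):
    """Return (corrupted tokens, {position: original token}) for the MLM objective."""
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    # Assumption: select at least one position even for very short inputs.
    num_to_predict = max(1, round(len(candidates) * mask_rate))
    chosen = sorted(rng.sample(candidates, num_to_predict))
    corrupted = list(tokens)
    targets = {}
    for i in chosen:
        targets[i] = tokens[i]            # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK           # 80% of the time: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab_words)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return corrupted, targets


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
vocab_words = ["apple", "my", "dog", "is", "hairy", "store"]  # toy vocabulary
corrupted, targets = mask_tokens(tokens, vocab_words, random.Random(0))
print(corrupted, targets)
```

Only the positions recorded in `targets` would contribute to the loss, which is why just 15% of tokens are predicted per batch.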

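3.3.2 Task #2: Next Sentence Prediction

Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus. For example:

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]

Label = NotNext

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.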
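A next-sentence-prediction example of the kind shown above can be generated from any corpus that preserves sentence order. The sketch below is one possible way to do it and rests on assumptions: the helper name `make_nsp_example` and the toy corpus are hypothetical, and drawing the NotNext sentence from a different document is a simplification chosen here so that the random sentence is never the true continuation. The text itself only says the sentence is chosen at random from the corpus.

```python
import random


def make_nsp_example(documents, rng):
    """documents: list of documents, each a list of at least two sentences."""
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)       # leave room for a true next sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"
    # Assumption: sample the random sentence from another document.
    other_docs = [d for j, d in enumerate(documents) if j != doc_index]
    sentence_b = rng.choice(rng.choice(other_docs))
    return sentence_a, sentence_b, "NotNext"


# Toy corpus for illustration only.
documents = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
rng = random.Random(0)
for _ in range(3):
    print(make_nsp_example(documents, rng))
```

In a full pipeline the two sentences would then be tokenized, packed as `[CLS] A [SEP] B [SEP]`, and masked for the MLM task before being fed to the model.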

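3.4 Pre-training Procedure

The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can also be shorter). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the "next sentence prediction" task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration is given to partial word pieces.

We train with a batch size of 256 sequences (256 sequences * 512 tokens, roughly 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with a learning rate of 1e-4, β1=0.9, β2=0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).5 (Note 5: https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-globalavailability.html) Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training run took 4 days to complete.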
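The warmup-then-linear-decay schedule and the "about 40 epochs" figure from Section 3.4 can be checked with a short calculation. The exact shape of the schedule is an assumption here (linear ramp from 0 to the peak over the warmup steps, then a straight line down to 0 at the final step); the text only states the peak rate, the warmup length and that the decay is linear. The epoch estimate also follows the text in treating tokens and words as roughly interchangeable.

```python
def learning_rate(step, peak=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to `peak`, then linear decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak * remaining / (total_steps - warmup_steps)


tokens_per_batch = 256 * 512                 # sequences per batch * tokens per sequence
tokens_seen = tokens_per_batch * 1_000_000   # over one million training steps
approx_epochs = tokens_seen / 3.3e9          # corpus size: 800M + 2,500M words

for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step))
print(f"{tokens_per_batch} tokens per batch, about {approx_epochs:.1f} epochs")
```

The product 256 * 512 is 131,072, which the text rounds to 128,000 tokens per batch, and the resulting estimate of just under 40 passes over the corpus agrees with the stated figure.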

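3.5 Fine-tuning Procedure

For sequence-level classification tasks, BERT fine-tuning is straightforward. In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding. We denote this vector as $C \in \mathbb{R}^{H}$. The only new parameters added during fine-tuning are for a classification layer $W \in \mathbb{R}^{K \times H}$, where K is the number of classifier labels. The label probabilities $P \in \mathbb{R}^{K}$ are computed with a standard softmax, $P = \mathrm{softmax}(CW^{T})$. All of the parameters of BERT and W are fine-tuned jointly to maximize the log-probability of the correct label. For span-level and token-level prediction tasks, the above procedure must be modified slightly in a task-specific manner. Details are given in the corresponding subsections of Section 4.

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

• Batch size: 16, 32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5

• Number of epochs: 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.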
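The classification head and the exhaustive search described in Section 3.5 are small enough to write out. The sketch below is illustrative only: the toy values of C and W (H=3, K=2), the assumption that label 1 is the correct one, and the helper names are not from the paper. It computes P = softmax(C W^T), the corresponding negative log-probability, and enumerates the hyperparameter grid listed above.

```python
import itertools
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(c, w):
    """c: pooled [CLS] vector of length H; w: K rows of length H. Returns softmax(C W^T)."""
    logits = [sum(ci * wi for ci, wi in zip(c, row)) for row in w]
    return softmax(logits)


# Toy values with H=3 and K=2, for illustration only.
c = [0.5, -1.0, 2.0]
w = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
probs = classify(c, w)
loss = -math.log(probs[1])  # assuming label 1 is the correct class

# Every combination of the suggested fine-tuning values.
grid = list(itertools.product([16, 32], [5e-5, 3e-5, 2e-5], [3, 4]))
print(probs, loss)
print(len(grid), "configurations to try")
```

With only 2 * 3 * 2 = 12 configurations, running each one and keeping the model with the best development-set score is cheap, which is the point the text makes about exhaustive search.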

3.6 Comparison of BERT and OpenAI GPT

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally chosen to be as close to GPT as possible so that the two methods could be compared with as few confounding differences as possible. The core argument of this work is that the two novel pre-training tasks presented in Section 3.3 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the new pre-training tasks.

4. Experiments

In this section, we present BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE Datasets

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to (1) distribute these datasets with canonical Train, Dev, and Test splits, and (2) set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting. GLUE does not distribute labels for the Test set, and users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018):

MNLI: Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

QQP: Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).

QNLI: Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.

SST-2: The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

CoLA: The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically "acceptable" or not (Warstadt et al., 2018).

STS-B: The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

MRPC: Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

RTE: Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).6 (Note 6: We only report single-task fine-tuning results in this paper. A multi-task fine-tuning approach could potentially push the results even further. For example, we did observe substantial improvements on RTE from multi-task training with MNLI.)

WNLI: Winograd NLI is a small natural language inference dataset derived from Levesque et al. (2011). The GLUE webpage notes that there are issues with the construction of this dataset,7 (Note 7: https://gluebenchmark.com/faq) and every trained system that has been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set out of fairness to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

4.1.1 GLUE Results

Figure 3: Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, T_i represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

To fine-tune on GLUE, we represent the input sequence or sequence pair as described in Section 3, and use the final hidden vector $C \in \mathbb{R}^{H}$ corresponding to the first input token ([CLS]) as the aggregate representation. This is demonstrated visually in Figure 3 (a) and (b). The only new parameters introduced during fine-tuning are those of a classification layer $W \in \mathbb{R}^{K \times H}$, where K is the number of labels. We compute a standard classification loss with C and W, i.e., $\log(\mathrm{softmax}(CW^{T}))$.

We use a batch size of 32 and 3 epochs over the data for all GLUE tasks. For each task, we ran fine-tunings with learning rates of 5e-5, 4e-5, 3e-5, and 2e-5 and selected the one that performed best on the Dev set. Additionally, for BERT_LARGE we found that fine-tuning was sometimes unstable on small data sets (i.e., some runs would produce degenerate results), so we ran several random restarts and selected the model that performed best on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization. We note that the GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each of BERT_BASE and BERT_LARGE.

Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The "Average" column is slightly different from the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERT_BASE = (L=12, H=768, A=12); BERT_LARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single-task. All results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.

Results are presented in Table 1. Both BERT_BASE and BERT_LARGE outperform all existing systems on all tasks by a substantial margin, obtaining 4.4% and 6.7% respective average accuracy improvement over the state of the art. Note that BERT_BASE and OpenAI GPT are nearly identical in terms of model architecture outside of the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.7% absolute accuracy improvement over the state of the art. On the official GLUE leaderboard,8 (Note 8: https://gluebenchmark.com/leaderboard) BERT_LARGE obtains a score of 80.4, compared to the top leaderboard system, OpenAI GPT, which obtains 72.8 as of the date of writing.

It is interesting to observe that BERT_LARGE significantly outperforms BERT_BASE across all tasks, even those with very little training data. The effect of BERT model size is explored more thoroughly in Section 5.2.

4.2 斯坦福问答数据集SQuAD v1.1

Standford问题回答数据集(SQuAD)是一种100k众包问答对的集合(Rajpurkar等，2016)。给出一个问题和包含答案的来自维基百科的一个段落，任务是预测该段落中的其答案文本的跨度。例如：

•输入问题：

水滴在哪里与冰晶碰撞形成沉淀？

•输入段落：

...沉淀形成为较小的液滴通过与云中的其他雨滴或冰晶碰撞而聚结。...

•输出答案：

在云中

这种类型的跨度预测任务与GLUE的序列分类任务完全不同，但我们能以简单的方式调整BERT以在SQuAD上运行。与GLUE一样，我们将输入问题和段落表示为单个打包序列，问题使用A嵌入和使用B嵌入的段落。在微调期间学习的唯一新参数是起始矢量S∈RH和结束矢量E∈RH。让来自BERT的第i个输入词块的最终隐藏向量表示为Ti∈RH。请参见可视化图3(c)。然后，单词i作为答案跨度开始的概率被计算为Ti和S之间的点积(dot product)，跟随着段落中所有单词的softmax：

Pi = e(S×Ti)/ Σj(e(S×Tj))

相同公式用于其答案跨度的末端，最大评分范围用作其预测。训练目标是正确的开始和结束位置的log似然(log-likelihood)。

我们以学习率5e-5批量大小32来训练3个周期。推理时，由于结束预测不以开始为条件，我们添加了在开始后必须结束的约束，但是没有使用其他启发式方法。词块化标记跨度与原始非词块化输入对齐，以做评估。

结果呈现在表2。SQuAD用很严格的测试过程，其提交者必须人工联系SQuAD组织者以在一个隐藏测试集上运行他们的系统，因此我们只提交了我们最好的系统进行测试。该表显示的结果是我们向SQuAD提交的第一个也是唯一的测试。我们注意到SQuAD排行榜最好高结果没有最新的可用公共系统描述，并且在训练他们的系统时可以使用任何公共数据。因此，我们通过我们提交的系统中使用非常适度的数据增强，在SQuAD和TriviaQA(Joshi等，2017)上联合训练。

表2：SQuAD结果。本BERT集成是使用不同预训练检查点和微调种子(fine-tuning seed)的7x系统。

Table 2: SQuADresults. The BERT ensemble is 7x systems which use different pre-trainingcheckpoints and fine-tuning seeds.

我们性能最佳的系统在整体排名中优于顶级排行榜系统+1.5 F1项，在单一系统中优于+1.3 F1项。事实上，我们的单一BERT模型在F1得分方面优于顶级全体系统。如果我们只微调SQuAD(没有TriviaQA)，我们将失去0.1-0.4的F1得分，但仍然大幅超越所有现有系统。

The Standford Question Answering Dataset (SQuAD) is acollection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and aparagraph from Wikipedia containing the answer, the task is to predict theanswer text span in the paragraph. For example:

• Input Question:

Where do water droplets collide with ice crystals to formprecipitation?

• Input Paragraph:

... Precipitation forms as smaller droplets coalesce viacollision with other rain drops or ice crystals within a cloud. ...

• Output Answer:

within a cloud

This type of spanprediction task is quite different from the sequence classification tasks ofGLUE, but we are able to adapt BERT to run on SQuAD in a straightforwardmanner. Just as with GLUE, we represent the input question and paragraph as asingle packed sequence, with the question using the A embedding and theparagraph using the B embedding. The only new parameters learned duringfine-tuning are a start vector S∈RHand an end vector E∈RH. Let the final hiddenvector from BERT for the ith input token be denoted as Ti∈RH. See Figure 3 (c) for a visualization. Then,the probability of word i being the start of theanswer span is computed as a dot product between Ti and S followed by a softmax overall of the words in the paragraph:

Pi = eS×Ti / ∑j(eS×Tj)

The same formula is usedfor the end of the answer span, and the maximum scoring span is used as theprediction. The training objective is the log-likelihood of the correct startand end positions.

We train for 3 epochs witha learning rate of 5e- 5 and a batch size of 32. At inference time, since theend prediction is not conditioned on the start, we add the constraint that theend must come after the start, but no other heuristics are used. The tokenizedlabeled span is aligned back to the original untokenized input for evaluation.

Results are presented inTable 2. SQuAD uses a highly rigorous testing procedure wherethe submitter must manually contact the SQuAD organizers to run their system ona hidden test set, so we only submitted our best system for testing. The resultshown in the table is our first and only Test submission to SQuAD.We note thatthe top results fromthe SQuAD leaderboard do not have up-to-date public system descriptionsavailable, and are allowed to use any public data when training their systems.We therefore use very modest data augmentation in our submitted system byjointly training on SQuAD and TriviaQA (Joshi et al., 2017).

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. If we fine-tune on only SQuAD (without TriviaQA) we lose 0.1-0.4 F1 and still outperform all existing systems by a wide margin.

4.3 命名实体识别Named Entity Recognition

为了评估词块标记任务的性能，我们在CoNLL 2003命名实体识别(NER)数据集上微调BERT。该数据集由200k个训练单词组成，这些单词已注释为人员、组织、位置、杂项或其他(非命名实体)。

为做微调，我们将最终隐藏表征Ti∈RH提供给每个词块i到NER标签集上的分类层。此预测不以周围预测为条件(即，非自回归和无CRF)。为了使其与WordPiece词块化相兼容，我们将每个CoNLL词块化输入单词提供给我们的WordPiece词块化器，并使用与第一个子标记相对应的隐藏状态作为分类器的输入。例如：

Jim Hen ##son was a puppet ##eer

I-PER I-PER X O O O X

在没有对X做预测的情况下。由于WordPiece词块化边界是一个该输入的已知部分，因此对训练和测试都做了预测。图3(d)中还给出了可视化呈现。一种事例WordPiece模型用于NER，而非事例模型用于所有其他任务。

结果呈现在表3中。BERTLARGE优于现有SOTA——具有多任务学习(Clark等，2018)的跨视图训练，在CoNLL-2003NER测试中达+0.2。

表3：CoNLL-2003命名实体识别结果。超参数通过开发集来选择，得出的开发和测试分数是使用这些超参数进行五次随机重启的平均值。

Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

For fine-tuning, we feed the final hidden representation $T_i \in \mathbb{R}^H$ for each token $i$ into a classification layer over the NER label set. The predictions are not conditioned on the surrounding predictions (i.e., non-autoregressive and no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. For example:

Jim Hen ##son was a puppet ##eer

I-PER I-PER X O O O X

Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test. A visual representation is also given in Figure 3 (d). A cased WordPiece model is used for NER, whereas an uncased model is used for all other tasks.
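The first-sub-token rule can be sketched as a small label alignment step. This is an illustration under stated assumptions: `align_labels` is a hypothetical helper, and `toy_wordpiece` is a hard-coded stand-in for a real WordPiece tokenizer that only knows how to split the two words in the example.

```python
def align_labels(words, word_labels, wordpiece):
    """Expand word-level NER labels to sub-token level.

    The first sub-token of each word keeps the word's label; every
    continuation piece is marked "X", meaning no prediction is made for it.
    """
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = wordpiece(word)
        tokens.extend(pieces)
        labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, labels

# Hypothetical stand-in for a real WordPiece tokenizer.
TOY_SPLITS = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}

def toy_wordpiece(word):
    return TOY_SPLITS.get(word, [word])

words = ["Jim", "Henson", "was", "a", "puppeteer"]
word_labels = ["I-PER", "I-PER", "O", "O", "O"]
tokens, labels = align_labels(words, word_labels, toy_wordpiece)
```

With these inputs the sketch reproduces the token and label rows of the example: seven sub-tokens, with X on `##son` and `##eer`.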

Results are presented in Table 3. BERT_LARGE outperforms the existing SOTA, Cross-View Training with multi-task learning (Clark et al., 2018), by +0.2 on CoNLL-2003 NER Test.

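4.4 SWAG (Situations With Adversarial Generations)

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018).

Given a sentence from a video captioning dataset, the task is to decide among four choices the most plausible continuation. For example:

A girl is going across a set of monkey bars. She

(i) jumps up across the monkey bars.

(ii) struggles onto the bars to grab her head.

(iii) gets to the end and stands on a wooden plank.

(iv) jumps up and does a back flip.

(Translator's note 2: monkey bars are a climbing frame for children to play on.)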

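Adapting BERT to the SWAG dataset is similar to the adaptation for GLUE. For each example, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter we introduce is a vector $V \in \mathbb{R}^H$, whose dot product with the final aggregate representation $C_i \in \mathbb{R}^H$ denotes a score for each choice $i$. The probability distribution is the softmax over the four choices:

$$P_i = \frac{e^{V \cdot C_i}}{\sum_{j=1}^{4} e^{V \cdot C_j}}$$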

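We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERT_LARGE outperforms the authors' baseline ESIM+ELMo system by +27.1%.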
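The SWAG scoring rule above is a four-way softmax over dot products, which the following sketch makes concrete. The function name `choice_probabilities` and the toy vectors are assumptions chosen for illustration (hidden size H = 3); in the real model each C_i would be the aggregate representation BERT produces for one of the four packed sequences.

```python
import math

def choice_probabilities(V, C):
    """Softmax over the scores V . C_i, one score per candidate continuation."""
    scores = [sum(v * c for v, c in zip(V, c_i)) for c_i in C]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (hypothetical): one aggregate vector per choice (i) to (iv).
V = [0.5, -1.0, 2.0]
C = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
probs = choice_probabilities(V, C)
predicted = probs.index(max(probs))
```

Here the scores are 0.5, -1.0, 2.0, and 1.5, so the third choice (index 2) receives the highest probability.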

表4：SWAG开发和测试精度。测试结果由SWAG作者们对其隐藏标签进行评分。如SWAG论文所述，人类性能是用100个样本测量的。

Table 4: SWAG Dev and Test accuracies. Test results were scored against the hidden labels by the SWAG authors. Human performance is measured with 100 samples, as reported in the SWAG paper.

五、消模实验Ablation Studies

Although we have demonstrated extremely strong empirical results, the results presented so far have not isolated the specific contributions from each aspect of the BERT framework. In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance.

(Translator's note 3: an explanation of "ablation study" from Quora: "An ablation study typically refers to removing some 'feature' of the model or algorithm, and seeing how that affects performance." An ablation study is an experiment designed to check whether a structure proposed in a model is actually effective: to determine whether a proposed structure benefits the final result, the model without that structure is compared against the model with it. In the Chinese version of this translation the term is rendered literally as 消融研究 and more freely as 模型简化测试 (model simplification test) or 消模实验.)

5.1 预训练任务的影响Effect of Pre-training Tasks

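One of our core claims is that the deep bidirectionality of BERT, which is enabled by masked LM pre-training, is the single most important improvement of BERT compared to previous work. To give evidence for this claim, we evaluate two new models which use the exact same pre-training data, fine-tuning scheme, and Transformer hyperparameters as BERT_BASE:

1. No NSP: A model which is trained using the "masked LM" (MLM) but without the "next sentence prediction" (NSP) task.

2. LTR & No NSP: A model which is trained using a Left-to-Right (LTR) LM, rather than an MLM. In this case, we predict every input word and do not apply any masking. The left-only constraint was also applied at fine-tuning, because we found that it is always worse to pre-train with left-only context and fine-tune with bidirectional context. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.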
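The contrast between the two ablation models can be pictured as an attention visibility pattern. The sketch below assumes the LTR model is realized with a standard causal (lower-triangular) mask, which the text does not spell out; `attention_mask` is an illustrative helper and not a function from the BERT codebase.

```python
def attention_mask(n, bidirectional):
    """Return an n x n matrix where mask[i][j] is True if position i may attend to j.

    bidirectional=True  -> every position sees left and right context (MLM setup).
    bidirectional=False -> position i sees only positions j <= i (LTR setup).
    """
    return [[bidirectional or j <= i for j in range(n)] for i in range(n)]

ltr = attention_mask(4, bidirectional=False)
mlm = attention_mask(4, bidirectional=True)

# Print the two patterns side by side: 1 = visible, 0 = hidden.
for ltr_row, mlm_row in zip(ltr, mlm):
    print("".join("1" if x else "0" for x in ltr_row),
          "".join("1" if x else "0" for x in mlm_row))
```

Under the left-to-right pattern the hidden state of the first token never sees anything to its right, which is the intuition the ablation uses to explain the weak span prediction results of the LTR model.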

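Results are presented in Table 5. We first examine the impact brought by the NSP task. We can see that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD. These results demonstrate that our pre-training method is critical in obtaining the strong empirical results presented previously.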

表5：用BERTBASE架构做的预训练任务消融。“无NSP”是无下一句话预测任务的训练。“LTR＆无NSP”用作从左到右的LM，没有下一个句子预测，如OpenAI GPT的训练。“+ BiLSTM”在微调期间在“LTR +无NSP”模型上添加随机初始化BiLSTM。

Table 5: Ablation over the pre-training tasks using the BERT_BASE architecture. "No NSP" is trained without the next sentence prediction task. "LTR & No NSP" is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. "+ BiLSTM" adds a randomly initialized BiLSTM on top of the "LTR + No NSP" model during fine-tuning.

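Next, we evaluate the impact of training bidirectional representations by comparing "No NSP" to "LTR & No NSP". The LTR model performs worse than the MLM model on all tasks, with extremely large drops on MRPC and SQuAD. For SQuAD, it is intuitively clear that an LTR model will perform very poorly at span and token prediction, since the token-level hidden states have no right-side context. For MRPC, it is unclear whether the poor performance is due to the small data size or the nature of the task, but we found this poor performance to be consistent across a full hyperparameter sweep with many random restarts.

In order to make a good faith attempt at strengthening the LTR system, we tried adding a randomly initialized BiLSTM on top of it for fine-tuning. This does significantly improve results on SQuAD, but the results are still far worse than the pre-trained bidirectional models. It also hurts performance on all four GLUE tasks.

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) it is strictly less powerful than a deep bidirectional model, since a deep bidirectional model could choose to use either left or right context.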

5.2 模型大小的影响Effect of Model Size

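In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC, which only has 3,600 labeled training examples and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERT_BASE contains 110M parameters and BERT_LARGE contains 340M parameters.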
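The parameter counts quoted above can be roughly reproduced from the layer count and hidden size alone. The sketch below is a back-of-the-envelope estimate, not an official accounting. It assumes the configuration of the publicly released checkpoints (a 30,522-entry WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward width of 4H, and a pooler layer), none of which is restated in this section, and it uses L=12, H=768 for BERT_BASE and L=24, H=1024 for BERT_LARGE as defined in the paper. The number of attention heads A does not change the count, because the heads split H rather than add to it.

```python
def transformer_encoder_params(L, H, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (assumed layout)."""
    ffn = 4 * H                                       # feed-forward inner width
    embeddings = (vocab + max_pos + segments) * H     # token, position, segment tables
    embeddings += 2 * H                               # embedding LayerNorm (scale, bias)
    attention = 4 * (H * H + H)                       # Q, K, V and output projections
    feed_forward = (H * ffn + ffn) + (ffn * H + H)    # two dense layers with biases
    layer_norms = 2 * (2 * H)                         # two LayerNorms per layer
    per_layer = attention + feed_forward + layer_norms
    pooler = H * H + H                                # dense layer over the first token
    return embeddings + L * per_layer + pooler

base = transformer_encoder_params(L=12, H=768)
large = transformer_encoder_params(L=24, H=1024)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M parameters")
print(f"BERT_LARGE ~ {large / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands at roughly 109M and 335M, close to the 110M and 340M reported in the text; the remaining gap may come from rounding or from components not counted here.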

表6：BERT模型大小的消融。#L=层数;#H=隐藏的大小;#A=关注头数。“LM(ppl)”是保持训练数据的遮蔽LM混乱。

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. "LM (ppl)" is the masked LM perplexity of held-out training data.

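It has been known for many years that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.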

5.3 训练步数的影响Effect of Number of Training Steps

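Figure 4 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for k steps. This allows us to answer the following questions:

1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy?

Answer: Yes, BERT_BASE achieves almost 1.0% additional accuracy on MNLI when trained on 1M steps compared to 500k steps.

2. Question: Does MLM pre-training converge slower than LTR pre-training, since only 15% of words are predicted in each batch rather than every word?

Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy, the MLM model begins to outperform the LTR model almost immediately.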
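The scale behind the two questions is easy to make concrete with a little arithmetic. The sketch below uses only figures that appear in this document: 128,000 words per batch, 1,000,000 steps, the 15% prediction rate, and the 3.3 billion word corpus quoted in the Reddit discussion in part B. The variable names are illustrative.

```python
words_per_batch = 128_000
steps = 1_000_000
corpus_words = 3_300_000_000     # corpus size quoted in part B of this document

# Total words processed during pre-training.
total_words = words_per_batch * steps

# Approximate number of passes over the corpus.
epochs = total_words / corpus_words

# Words that receive a prediction loss per batch.
mlm_predicted_per_batch = words_per_batch * 15 // 100   # MLM: about 15% of words
ltr_predicted_per_batch = words_per_batch               # LTR: every word

print(f"total words seen: {total_words:.3e}")
print(f"approx. epochs:   {epochs:.1f}")
print(f"MLM targets/batch: {mlm_predicted_per_batch}, LTR targets/batch: {ltr_predicted_per_batch}")
```

The total comes to 1.28e11 words, or roughly 39 passes over the corpus, which is consistent with the "40 epochs" figure quoted later. Each MLM batch supplies 19,200 prediction targets against 128,000 for LTR, which is the reason the second question asks about slower convergence.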

图4：多次训练步骤的消融。这显示了微调后的MNLI精度，从已经预训练了k步的模型参数开始。x轴是k的值。

Figure 4: Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for k steps. The x-axis is the value of k.


5.4 基于特征的BERT方法Feature-based Approach with BERT

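All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages. First, not all NLP tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to being able to pre-compute an expensive representation of the training data once and then run many experiments with less expensive models on top of this representation.

In this section, we evaluate how well BERT performs in the feature-based approach by generating ELMo-like pre-trained contextual representations on the CoNLL-2003 NER task. To do this, we use the same input representation as in Section 4.3, but use the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

Results are shown in Table 7. The best performing method is to concatenate the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both the fine-tuning and feature-based approaches.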
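The best performing feature-based variant, concatenating the top four hidden layers per token, can be sketched without any model at all. In the illustration below, `concat_top_layers` is a hypothetical helper and `hidden_states` is a toy stand-in for frozen BERT activations (6 layers, 3 tokens, hidden size H = 2); in the setup described above the resulting vectors would be fed to the BiLSTM, with no gradient flowing back into BERT.

```python
def concat_top_layers(hidden_states, k=4):
    """Concatenate each token's vectors from the top k layers.

    hidden_states[l][t] is the vector of token t at layer l, with the last
    list entry being the top layer. Returns one vector of size k * H per token.
    """
    top = hidden_states[-k:]
    num_tokens = len(top[0])
    return [
        [value for layer in top for value in layer[t]]
        for t in range(num_tokens)
    ]

# Toy activations (hypothetical): vector [layer index, token index] everywhere.
hidden_states = [
    [[float(layer), float(token)] for token in range(3)]
    for layer in range(6)
]
features = concat_top_layers(hidden_states)
```

Each token ends up with a 4H-dimensional feature vector (8 values in this toy case), ordered from the fourth-from-top layer to the top layer.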

表7：用BERT和CoNLL-2003 NER基于特征的方法消模。将来自此指定层的激活做组合，并馈送到双层BiLSTM中，而不向BERT反向传播。

Table 7: Ablation using BERT with a feature-based approach on CoNLL-2003 NER. The activations from the specified layers are combined and fed into a two-layer BiLSTM, without backpropagation to BERT.

六、结论Conclusion

近期实验改进表明，使用迁移学习语言模型展示出的丰富、无监督预训练，是许多语言理解系统的集成部分。特别是，这些结果使得即使低资源任务，也能从很深的单向架构中受益。我们的主要贡献是将这些发现进一步推广到深度双向架构，允许其相同的预训练模型去成功解决一系列广泛的NLP任务。

虽然实验结果很强，在某些情况下超过人类性能，但重要的未来工作是研究BERT能不能捕获其语言现象。

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from very deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

While the empirical results are strong, in some cases surpassing human performance, important future work is to investigate the linguistic phenomena that may or may not be captured by BERT.

参考文献References

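Ram et al.; the 42 cited papers are omitted. Owing to the length limit, part of the original English text is omitted as well.

Please download the PDF for the complete bilingual (Chinese-English) translation.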

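B. Machine Reading Comprehension and 11 NLP Tasks: BERT Surpasses Humans (2,978 characters)

Machine Reading Comprehension and 11 NLP Tasks: BERT Surpasses Humans

By 秦陇纪 (Qin Longji), 数据简化DataSimp, 20181013Sat-1103Sat

After reading online articles such as 新智元's copy of 机器之心's abridged BERT translation, I found that the translators had folded the tensor2tensor library into BERT itself. To keep poor popular-science writing from misleading readers, to correct word-for-word translations made without an NLP background, and to make it easier for everyone to study NLP together, 秦陇纪 spent half a month producing a bilingual Chinese-English translation of the paper, including 3 translator's notes. My ability is limited and mistakes are unavoidable; please point them out in a comment or by email to [email protected].

BERT set new records on all 11 NLP tasks on which it was evaluated, including the tasks of the widely used GLUE benchmark. Figure 3 of the paper groups the 11 datasets as follows: (a) sentence-pair classification tasks: Multi-Genre Natural Language Inference (MNLI), Quora Question Pairs (QQP), Question Natural Language Inference (QNLI), the Semantic Textual Similarity Benchmark (STS-B), the Microsoft Research Paraphrase Corpus (MRPC), Recognizing Textual Entailment (RTE), and Situations With Adversarial Generations (SWAG); (b) single-sentence classification tasks: the Stanford Sentiment Treebank (SST-2) and the Corpus of Linguistic Acceptability (CoLA); (c) question answering: the Stanford Question Answering Dataset (SQuAD v1.1); (d) single-sentence tagging: CoNLL 2003 Named Entity Recognition (NER). BERT pushes the GLUE benchmark to 80.4% (a 7.6% absolute improvement), raises MultiNLI accuracy to 86.7% (a 5.6% absolute improvement), and lifts the SQuAD v1.1 question answering Test F1 to 93.2 (a 1.5 point absolute improvement), 2.0 points above human performance.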

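Without a doubt, the BERT model newly released by the Google AI Language team has opened a new era in NLP!

BERT posts a striking result on SQuAD 1.1, a top-tier machine reading comprehension test: it surpasses humans on both evaluation metrics and sets a new best score.

Thang Luong, a research scientist at Google Brain, also said that the BERT model opens a new era of NLP.

Many researchers on Twitter also discussed and retweeted the paper.

This article summarizes the contributions of the BERT paper and the reactions to it in the field.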

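I. Main Contributions of the BERT Model

1. It demonstrates the importance of bidirectional pre-training for language representations. Unlike earlier work that pre-trained with unidirectional language models, BERT uses a masked language model to enable pre-trained deep bidirectional representations.

2. Pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

3. BERT advances the state of the art for 11 NLP tasks. The paper also reports ablation studies of BERT, which show that the bidirectional nature of the model is its single most important new contribution.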

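At the time of writing, BERT ranks first on the GLUE leaderboard, https://gluebenchmark.com/leaderboard. The code and pre-trained models are announced at goo.gl/language/bert, and a simplified version of the BERT source code was released on October 30 at https://github.com/google-research/bert. We will analyze it later when time permits; follow the "数据简化DataSimp" public account for updates.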

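The significance of BERT: it announces a change of paradigm in NLP. 吴俣 (Wu Yu), a computer science PhD at Beihang University, wrote on Zhihu: BERT's standing is similar to that of ResNet in computer vision. It is milestone work that announces a change of paradigm in NLP. Much future research will probably use it for initialization, as naturally as people used word2vec before.

Judging from the current trend, pre-training a language model with some architecture looks like a fairly reliable approach. From AI2's ELMo, to OpenAI's fine-tuned Transformer, to Google's BERT, all are applications of pre-trained language models. As for the BERT model itself, I personally think it confirms once more that pre-training is very useful in NLP, and it further confirms that the Transformer's fitting capacity really is strong.

Once BERT came out, every dataset used in the paper's experiments was flattened, and everyone else may as well call it a day. Spare a second of sympathy for SWAG: three months after its release, the first method evaluated on the dataset beat the baseline by more than 20 points and surpassed humans at the same time. Hats off to Jacob, who built NMT single-handedly at Microsoft. And another second of sympathy for Microsoft, which lost such an outstanding talent.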

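II. How BERT Differs from the Other Two Models

When training the bidirectional language model, it replaces a small fraction of the words, with a small probability, by Mask or by another random word. The aim seems to be to force the model to rely more on its memory of the context. As for the exact probability, my guess is that Jacob picked it off the top of his head.

It adds a loss for predicting the next sentence. This one looks fairly novel.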
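The masking step mentioned above can be sketched as follows. The probabilities are the ones stated in the BERT paper rather than in this commentary: about 15% of tokens are selected for prediction, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The helper `mask_tokens`, the toy sentence, and the per-token coin flip (instead of sampling an exact 15% of positions) are simplifying assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, targets) for masked LM training.

    targets[i] holds the original token where a prediction is required and
    None elsewhere; inputs is the corrupted sequence fed to the model.
    """
    inputs = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # token not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK              # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab) # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, targets

tokens = "my dog is hairy and my cat is not".split() * 20
vocab = sorted(set(tokens))
inputs, targets = mask_tokens(tokens, vocab, random.Random(0))
```

Positions with a `None` target are never altered, so the model sees mostly clean context and is only trained to recover the selected tokens.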

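Doing the math: the large model in the paper uses 16 TPUs, so with US Cloud TPUs one training run would cost roughly 50,000 RMB. BERT feels like a nuclear-grade weapon: big companies can have it, but ordinary people cannot afford it for now. At present the best value for money is ELMo: simple, easy to use, actually runnable, and effective.

Looking ahead, pre-training is very useful and has now produced major breakthroughs on many NLP tasks. What remains is applying pre-training to language generation, for example machine translation. To borrow my boss's words, machine translation is the jewel in the crown of natural language processing. It would be remarkable if pre-trained language models could help machine translation, but so far nobody seems to have figured out how.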

———————分割线—————————

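From the BERT model, 吴俣 draws three observations:

1. Jacob is a first-rate master of detail

The bidirectionality of this model differs from ELMo's. Most people misjudge how large a contribution, in terms of novelty, the bidirectionality introduced by Jacob (one of the paper's authors) really is; I think this detail may be the reason for the significant improvement over ELMo. ELMo concatenates one left-to-right model and one right-to-left model, whereas his approach opens a window directly during training and uses what is effectively an order-aware CBOW.

2. The Reddit discussion of what one BERT training run costs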

For TPU pods:

4 TPUs * ~$2/h (preemptible) * 24 h/day * 4 days = $768 (base model)

16 TPUs = ~$3k (large model)

For TPU:

16 TPUs * $8/h * 24 h/day * 4 days = 12k

64 TPUs * $8/h * 24 h/day * 4 days = 50k
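The estimates above follow one formula: devices times hourly price times hours. The short sketch below recomputes them; `training_cost` is an illustrative helper, and the hourly prices are the approximate figures from the Reddit thread, not current pricing.

```python
def training_cost(num_tpus, usd_per_tpu_hour, days, hours_per_day=24):
    """Cost in USD of running num_tpus devices continuously for the given days."""
    return num_tpus * usd_per_tpu_hour * hours_per_day * days

# Preemptible pricing (~$2 per TPU-hour), 4 days of training.
base_preemptible = training_cost(4, 2, 4)
large_preemptible = training_cost(16, 2, 4)

# On-demand pricing ($8 per TPU-hour), 4 days of training.
on_demand_16 = training_cost(16, 8, 4)
on_demand_64 = training_cost(64, 8, 4)

for label, cost in [
    ("4 TPUs, preemptible", base_preemptible),
    ("16 TPUs, preemptible", large_preemptible),
    ("16 TPUs, on-demand", on_demand_16),
    ("64 TPUs, on-demand", on_demand_64),
]:
    print(f"{label}: ${cost:,}")
```

The exact values are $768, $3,072, $12,288, and $49,152, which match the rounded figures of $768, ~$3k, 12k, and 50k in the thread.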

For GPU:

"BERT-Large is 24-layer, 1024-hidden and was trained for 40 epochs over a 3.3 billion word corpus. So maybe 1 year to train on 8 P100s?"

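3. Unfortunately, it is essentially impossible to reproduce, so it is hard to say whether the model or the data matters more.

BERT's success also illustrates the three conditions for good deep learning research: data, compute, and researchers with very strong engineering skills (at Microsoft, Jacob was well known in China and abroad for building large systems single-handedly).

A research member of the NYU CILVR Lab: If this is milestone work, then I truly witnessed history during my internship at Google. I met with Jacob every week. It took him only one week to reproduce OpenAI's GPT, the language model with pre-training, and he found that the results fell short of expectations. After getting hold of large-scale data, retraining and pinning down the problem took just two days. By the next meeting, his new idea had already surpassed the OpenAI model. By the meeting after that, he had the single-model results now reported on several tasks.

Please take a moment to appreciate this speed. When OpenAI did their work, pre-training their language model took a month, while Jacob needed only one day on TPUs. OpenAI trained their language model essentially with the original Transformer configuration, adjusting a few parameters, while Jacob could try out his own new ideas at will. This is the energy released when enormous compute meets exceptional engineering ability. The future truly belongs to compute.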

-End-

参考文献(1214字)

1.Jacob Devlin，Ming-Wei Chang，Kenton Lee，Kristina Toutanova，Google．BERT: Pre-training of Deep Bidirectional Transformers for LanguageUnderstanding．[EB/OL]；arXiv，https://arxiv.org/pdf/1810.04805.pdf，2018-10-13．

2.Jacob Devlin，Ming-Wei Chang，Kenton Lee，Kristina Toutanova，Google．BERT: Pre-training of Deep Bidirectional Transformers for LanguageUnderstanding．[EB/OL]；goo.gl，https://goo.gl/language/bert，2018-10-13．

3.GLUE．BERT: GLUE Benchmark No.1．[EB/OL]；Gluebenchmark，https://gluebenchmark.com/leaderboard，2018-10-13．

4.知乎用户．如何评价BERT模型？．[EB/OL]；知乎，https://www.zhihu.com/question/298203515/answer/509562280，2018-10-13．

5.作者：JacobDevlin、Ming-WeiChang、Kenton Lee、Kristina Toutanova，机器之心编译，参与：路、王淑婷、张倩，选自arXiv．最强NLP预训练模型！谷歌BERT横扫11项NLP任务记录．[EB/OL]；机器之心，https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650749886&idx=1&sn=87080bc474d144b286d4673383f4b6d6，2018-10-13．

6.来源：arXiv、知乎．NLP历史突破！谷歌BERT模型狂破11项纪录，全面超越人类！．[EB/OL]；新智元，https://mp.weixin.qq.com/s?__biz=MzI3MTA0MTk1MA==&mid=2652028621&idx=1&sn=5366f2a95bc19862af2c4bbd468ccc19，2018-10-13．

x.秦陇纪．数据简化社区Python官网Web框架概述；数据简化社区2018年全球数据库总结及18种主流数据库介绍；数据科学与大数据技术专业概论；人工智能研究现状及教育应用；信息社会的数据资源概论；纯文本数据溯源与简化之神经网络训练；大数据简化之技术体系．[EB/OL]；数据简化DataSimp(微信公众号)，http://www.datasimp.org，2017-06-06．


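秦陇纪 (Qin Longji)

About: Google AI paper on BERT, the bidirectional encoder representation model, which reaches the state of the art on machine reading comprehension and 11 NLP benchmarks. (Reply "谷歌BERT论文" to the public account, or use "Read the original" at the end of the article, to download the 25-page PDF with 16 figures and 65k characters.) Follow the blue "数据简化DataSimp" link; the menu at the bottom then lists the article category pages. Authors: the Google BERT team. Sources: the arXiv preprint by Google AI Language, 机器之心, Zhihu, and others, plus the 数据简化 community's WeChat groups and public accounts run by 秦陇纪; cited sources are listed in the references. Lead translator and editor: 秦陇纪, founder of the 数据简化DataSimp, 科学Sciences, and 知识简化 new-media accounts and of the 数据简化 community, OS architect, C/Java/Python/Prolog programmer, and IT teacher. With heavy daily reading in Chinese and English, design, development, debugging, and article compilation, translation, and simplification, time and staffing are limited; forwarding, tips, and joining the community are welcome. Copyright notice: popular-science articles are for study and research only; copyright © of public materials belongs to the original authors; do not use them for commercial or illegal purposes. Compiled, translated, and edited by 秦陇纪, 数据简化DataSimp, 2018. For submissions, cooperation, reprint permission, infringement, or errors (including errors in the original text), contact [email protected]. Forwarding is welcome: the "数据简化DataSimp, 科学Sciences, 知识简化" accounts bring together front-line researchers in specialist fields, who spread knowledge while doing research, explain scientific phenomena and principles from a professional perspective, and show the scientific side of nature, society, and everyday life. 秦陇纪 invites you to take part in every field. We strongly condemn supermarkets, banks, schools, hospitals, governments, and companies that wantonly collect, abuse, and resell citizens' private data such as names, ID numbers, phone numbers, work and home addresses, and biometric information!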

