BERT Paper Reading Notes

This article tries to stay faithful to the original BERT paper, but for ease of understanding it is not a sentence-by-sentence translation; it is translated according to the author's personal understanding. Where the paper is unclear or the author's understanding falls short, the original text is quoted. If anything is inappropriate, please bear with it; guidance and corrections are welcome.

Paper

  • BERT: Bidirectional Encoder Representations from Transformers, i.e., a bidirectional encoder representation model built from Transformers.

Paper link

Abstract

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

The pre-trained BERT model can be fine-tuned to set new state-of-the-art records on a wide range of tasks, such as question answering and language inference, without substantial changes to the BERT architecture itself.

1 Introduction

BERT is conceptually simple and empirically powerful. It set new state-of-the-art results on 11 natural language processing tasks.

ELMo is a feature-based approach [Note 2] to applying pre-trained language representations.

OpenAI GPT is a fine-tuning approach [Note 3] to applying pre-trained language representations.

The two approaches share the same objective function during pre-training: both use unidirectional language models to learn general language representations.

The authors argue that this unidirectionality is suboptimal for sentence-level tasks and can be harmful for token-level tasks such as question answering, where incorporating context from both directions is crucial.

In this paper, the authors propose BERT to improve upon the fine-tuning based approaches.
BERT: Bidirectional Encoder Representations from Transformers.
Inspired by the cloze task, BERT alleviates the unidirectionality constraint mentioned above by using a "masked language model" (MLM) pre-training objective.
The MLM randomly masks some of the input tokens, and the objective is to predict the original vocabulary id of the masked tokens from their context. Unlike left-to-right language model pre-training, the MLM objective lets the representation fuse left and right context, which allows the authors to pre-train a deep bidirectional Transformer. In addition to the MLM, the authors also use a "next sentence prediction" task that jointly pre-trains text-pair representations. The contributions of this paper are as follows:

  • Demonstrating the importance of bidirectional pre-training for language representations. BERT uses masked language models to enable pre-trained deep bidirectional representations; GPT pre-trains with a unidirectional language model; ELMo trains left-to-right and right-to-left representations independently and then simply concatenates them.

  • Showing that pre-trained representations reduce the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks.
  • BERT breaks the state-of-the-art record on 11 NLP tasks. The code and pre-trained models can be obtained here.

2 Related Work

Pre-training general language representations has a long history. This section briefly reviews the most widely used approaches.

2.1 Unsupervised Feature-based Approaches

For decades, learning widely applicable word representations has been an active area of research, covering both neural and non-neural methods. Pre-trained word embeddings are an integral part of modern NLP systems and offer significant improvements over embeddings learned from scratch. To pre-train word embedding vectors, people have used left-to-right language modeling objectives, as well as objectives that discriminate correct from incorrect words in left and right context.
These approaches have been generalized to coarser granularities, such as sentence embeddings or paragraph embeddings. To train sentence representations, prior work has used these objectives: ranking candidate next sentences; left-to-right generation of the next sentence given a representation of the previous one; and denoising auto-encoder objectives.

ELMo and its predecessor generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token (a word, symbol, etc.) is the concatenation of its left-to-right and right-to-left representations. When contextual word embeddings are integrated with existing task-specific architectures, ELMo achieves state-of-the-art results on several major NLP benchmarks, including question answering, sentiment analysis, and named entity recognition. Melamud et al. (2016) proposed learning contextual representations through a task of predicting a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional (Note 1). Fedus et al. (2018) showed that the cloze task can be used to improve the robustness of text generation models.

2.2 Unsupervised Fine-tuning Approaches

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (unsupervised learning).
More recently, sentence or document encoders that produce contextual token representations have been pre-trained on unlabeled text and fine-tuned for downstream tasks. The advantage of these approaches is that few parameters need to be learned from scratch. At least partly because of this advantage, OpenAI GPT previously achieved state-of-the-art results on many sentence-level tasks from the GLUE benchmark. Left-to-right language modeling and auto-encoder objectives have been used to pre-train such models.

2.3 Transfer Learning from Supervised Data

There is also work showing effective transfer from supervised tasks with large datasets, such as natural language inference (NLI) and machine translation. Computer vision research has likewise demonstrated the importance of transfer learning, where an effective recipe is to fine-tune models pre-trained on ImageNet.

3 BERT

This section describes the detailed implementation of BERT. Using BERT involves two steps: pre-training and fine-tuning. During pre-training, the BERT model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and the parameters are then trained on labeled data from the downstream task. Each downstream task has its own fine-tuned model, even though they are all initialized with the same pre-trained BERT parameters. Figure 1 uses a question-answering example to illustrate this.

Figure 1: the pre-training and fine-tuning procedures for BERT. Apart from the output layers, the same architecture is used in both stages. The pre-trained model parameters are used to initialize the models for the different downstream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, marking the start of the sequence, and [SEP] is a special separator token, e.g., separating questions and answers.

A distinctive feature of BERT is its unified architecture across different tasks: there is minimal difference between the pre-trained architecture and the final downstream architecture.

Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder (for more about the Transformer, see this article). Because the use of Transformers has become widespread and BERT's implementation is almost identical to the original Transformer, this paper does not elaborate; readers are referred to the original Transformer paper as well as "The Annotated Transformer", an excellent walkthrough of that paper.

Here, L denotes the number of layers (Transformer blocks), H the hidden size, and A the number of self-attention heads. BERT comes in two model sizes: BERT(base) (L=12, H=768, A=12, total parameters=110M) and BERT(large) (L=24, H=1024, A=16, total parameters=340M).

BERT(base) was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which every token can only attend to the context to its left.

Input/Output Representations

To let BERT handle a variety of downstream tasks, the input representation can unambiguously represent either a single sentence or a pair of sentences (such as <Question, Answer>) in one token sequence. Here, a "sentence" need not be a linguistic sentence; it can be an arbitrary span of contiguous text. A "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

The authors use WordPiece embeddings with a 30,000-token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence, and two methods are used to distinguish the sentences: first, the separator token [SEP]; second, a learned embedding added to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, E denotes the input embedding, C denotes the final hidden vector of the [CLS] token, and Ti denotes the final hidden vector of the i-th input token.

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings, as illustrated in Figure 2.
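To make the input construction concrete, here is a minimal PyTorch sketch (illustrative only, not the official implementation; the class and variable names are made up, and the sizes are the values quoted elsewhere in these notes):

    import torch
    import torch.nn as nn

    class BertInputEmbeddings(nn.Module):
        """Token + segment + position embeddings, summed element-wise (illustrative sketch)."""
        def __init__(self, vocab_size=30000, hidden_size=768, max_len=512, num_segments=2):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, hidden_size)
            self.segment_emb = nn.Embedding(num_segments, hidden_size)  # sentence A / sentence B
            self.position_emb = nn.Embedding(max_len, hidden_size)      # learned positions

        def forward(self, token_ids, segment_ids):
            # token_ids, segment_ids: (batch, seq_len) integer tensors
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            return (self.token_emb(token_ids)
                    + self.segment_emb(segment_ids)
                    + self.position_emb(positions))  # position term broadcasts over the batch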

3.1 Pre-training BERT

Task 1: Masked LM

Intuitively, the authors have reason to believe that a deep bidirectional model is strictly more powerful than either a unidirectional model or a shallow concatenation of a left-to-right and a right-to-left model.
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.

To train a deep bidirectional representation, the authors simply mask some percentage of the input tokens at random and then predict those masked tokens. This step is referred to as a "masked LM" (MLM), although it is often called a cloze task in the literature.
The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM. In the experiments, 15% of the WordPiece tokens in each sequence are masked at random. In contrast with denoising auto-encoders, BERT only predicts the masked tokens rather than reconstructing the entire input.
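A rough sketch of how such an MLM loss could be computed (an illustration, not the paper's code: vocab_proj stands for a linear projection onto the vocabulary, and only the masked positions contribute to the cross-entropy):

    import torch
    import torch.nn as nn

    def masked_lm_loss(final_hidden, masked_positions, masked_label_ids, vocab_proj):
        """final_hidden: (batch, seq_len, hidden); masked_positions, masked_label_ids: (batch, n_masked);
        vocab_proj: nn.Linear(hidden, vocab_size). Only the masked positions are scored."""
        batch_idx = torch.arange(final_hidden.size(0)).unsqueeze(1)   # (batch, 1)
        hidden_at_mask = final_hidden[batch_idx, masked_positions]    # (batch, n_masked, hidden)
        logits = vocab_proj(hidden_at_mask)                           # (batch, n_masked, vocab)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), masked_label_ids.reshape(-1))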

Although this allows the authors to obtain a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. The model therefore also needs to learn the original representations of the masked words, so the authors adopt several strategies for this; see Appendix A.1 for details.

Task 2: Next Sentence Prediction (NSP)

Many downstream tasks, such as question answering and natural language inference, are based on understanding the relationship between two sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, the authors pre-train for a binarized next sentence prediction task whose sentence pairs can be generated from any monolingual corpus. Specifically, when choosing sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext). In Figure 1, C is used for next sentence prediction (NSP). Despite its simplicity, this task is very helpful for both QA and NLI, as shown in Section 5.1.
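A minimal sketch of how these NSP training pairs could be sampled (the 50/50 split and IsNext/NotNext labels follow the paper; the helper function and its arguments are hypothetical):

    import random

    def make_nsp_example(doc_sentences, all_docs):
        """doc_sentences: sentences of one document (at least two); all_docs: list of such lists.
        Returns (sentence A, sentence B, label) with a 50/50 IsNext / NotNext split."""
        i = random.randrange(len(doc_sentences) - 1)
        sent_a = doc_sentences[i]
        if random.random() < 0.5:
            return sent_a, doc_sentences[i + 1], "IsNext"      # B really follows A
        random_doc = random.choice(all_docs)
        return sent_a, random.choice(random_doc), "NotNext"    # B drawn at random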

The NSP task is closely related to the representation-learning objectives of Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to downstream tasks, whereas BERT transfers all parameters to initialize the end-task model.

Pre-training data. The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus, the authors use BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia, only the text passages are extracted, ignoring lists, tables, and headers. To extract long contiguous sequences, it is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013).

3.2 Fine-tuning BERT

Fine-tuning is straightforward because the self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve a single text or a text pair, by swapping in the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to encode each text in the pair independently and then apply bidirectional cross attention. BERT instead unifies these two steps with self-attention: encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.
On the input side, sentence A and sentence B can be: (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, or (4) a degenerate text-∅ pair in text classification or sequence tagging.
On the output side, the token representations are fed into an output layer for token-level tasks such as sequence tagging and question answering, while the [CLS] representation is fed into a classifier output layer for tasks such as sentiment analysis.

Fine-tuning is much cheaper than pre-training. Many of the results in the paper can be reproduced starting from the exact same pre-trained model in at most one hour on a TPU, or a few hours on a GPU. See Appendix A.5 for more details.

4 Experiments

This section presents BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE (General Language Understanding Evaluation)

The GLUE benchmark is a collection of diverse natural language understanding tasks. The GLUE datasets are described in detail in Appendix B.1.

To fine-tune on GLUE, the authors represent the input sentence or sentence pair as described in Section 3 and use the final hidden vector C corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters are the classification layer weights W (shape K×H), where K is the number of labels. The standard classification loss is computed with C and W, i.e., log(softmax(C·W)).
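Concretely, this classification head on [CLS] could be sketched as follows (a sketch using a PyTorch linear layer; H and K are placeholders, and the cross-entropy below is the usual way to realize the log(softmax(C·W)) loss):

    import torch
    import torch.nn as nn

    H, K = 768, 3                   # hidden size and number of labels (illustrative)
    classifier = nn.Linear(H, K)    # weight W has shape (K, H), plus a bias term

    def glue_loss(cls_hidden, labels):
        """cls_hidden: (batch, H) final hidden vector C of [CLS]; labels: (batch,) class ids."""
        logits = classifier(cls_hidden)                     # C·W^T + b
        return nn.functional.cross_entropy(logits, labels)  # equivalent to -log softmax(C·W)[label]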

For all GLUE tasks, the authors use a batch size of 32 and fine-tune for 3 epochs. For each task, the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5 and 2e-5) was selected on the dev set. Additionally, for the BERT large model, the authors found fine-tuning to sometimes be unstable on small datasets, so they ran several random restarts and selected the best model on the dev set. With random restarts, the same pre-trained checkpoint is used, but the fine-tuning data shuffling and classifier layer initialization differ.

The BERT base model architecture is almost identical to OpenAI GPT apart from the attention masking.
BERT large clearly outperforms the base version. The effect of model size is explored more thoroughly in Section 5.2.

4.2 SQuAD v1.1 (Stanford Question Answering Dataset)

This is a collection of 100k question/answer pairs. Given a question and a passage containing the answer, the task is to predict the answer text span in the passage.
As shown in Figure 1, in the question answering task the authors represent the input question and passage as a single sequence, with the question using the A embedding and the passage using the B embedding. During fine-tuning, the authors introduce a start vector S and an end vector E, both of dimension H. The probability of word i being the start of the answer span is a softmax over the passage of the dot products with S:

Pi = e^(S·Ti) / Σj e^(S·Tj)
The probability of the end of the answer span is computed analogously with the end vector E.
The score of a candidate span from position i to position j is defined as:

S·Ti + E·Tj

The maximum-scoring span with j ≥ i is used as the prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions.
The authors fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.
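To make the span selection concrete, the following is a minimal sketch of the prediction rule described above (the max_answer_len cap is an added assumption, not something stated in the paper):

    def best_span(T, S, E, max_answer_len=30):
        """T: (seq_len, H) final hidden states of the passage tokens; S, E: (H,) start/end vectors.
        Returns the (i, j) maximizing S·T_i + E·T_j subject to j >= i."""
        start_scores = (T @ S).tolist()   # S·T_i for every position i
        end_scores = (T @ E).tolist()     # E·T_j for every position j
        best, best_ij = float("-inf"), (0, 0)
        for i, s in enumerate(start_scores):
            for j in range(i, min(i + max_answer_len, len(end_scores))):
                if s + end_scores[j] > best:
                    best, best_ij = s + end_scores[j], (i, j)
        return best_ij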

Table 2 shows the top leaderboard entries and results. The top systems on the SQuAD leaderboard do not have up-to-date public descriptions, and they are allowed to train on any public data.
Therefore, the authors use modest data augmentation in their system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

4.3 SQuAD v2.0

We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span, s_null = S·C + E·C, to the score of the best non-null span, ŝ(i,j) = max_{j≥i} (S·Ti + E·Tj). We predict a non-null answer when ŝ(i,j) > s_null + τ, where the threshold τ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.
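A minimal sketch of the no-answer decision rule quoted above, reusing the best_span helper from the Section 4.2 sketch (the threshold τ is tuned on the dev set, not a fixed constant):

    def predict_squad_v2(T, S, E, C, tau):
        """C: (H,) final hidden vector of [CLS]; tau: threshold tuned on the dev set for F1.
        Returns the best non-null span, or None when the no-answer score wins."""
        s_null = float(S @ C + E @ C)           # score of the null (no-answer) span
        i, j = best_span(T, S, E)               # best non-null span, see the Section 4.2 sketch
        s_best = float(S @ T[i] + E @ T[j])
        return (i, j) if s_best > s_null + tau else None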

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.

When fine-tuning on SWAG, the authors construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation gives a score for each choice; the scores are normalized with a softmax layer.
The model is fine-tuned for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Table 4 shows the results: BERT performs close to human level on this task.
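A sketch of the SWAG scoring head described above (illustrative; each of the four sequences is encoded by BERT first, and only the resulting [CLS] vectors are shown here):

    import torch
    import torch.nn as nn

    H = 768
    choice_vector = nn.Parameter(torch.randn(H))   # the only task-specific parameter introduced

    def swag_choice_probs(cls_vectors):
        """cls_vectors: (batch, 4, H), the [CLS] hidden state for each of the 4 candidate endings."""
        scores = cls_vectors @ choice_vector       # dot product per choice -> (batch, 4)
        return torch.softmax(scores, dim=-1)       # normalized probabilities over the 4 choices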

5 Ablation Studies

This section performs ablation experiments over various facets of BERT in order to understand their relative importance.

5.1 Effect of Pre-training Tasks

By removing NSP and comparing BERT's bidirectional representations against left-to-right representations, the authors show that NSP helps and that bidirectional representations are more effective.
By adding a bidirectional LSTM, the authors show that a BiLSTM improves on the left-to-right results, but it still falls short of BERT(base).
The detailed comparison is shown below:

In addition, regarding training separate LTR and RTL models as ELMo does, the authors point out why this is inferior to BERT:

  • this is twice as expensive as a single bidirectional model;
  • this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question;
  • this is strictly less powerful than a deep bidirectional model, since a deep bidirectional model can use both left and right context at every layer.

5.2 Effect of Model Size

This section discusses the effect of model size on task performance. The authors trained a number of BERT models with different numbers of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure.
Table 6 shows the comparison: larger models lead to better performance.

For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERT(base) contains 110M parameters and BERT(large) contains 340M parameters.

The authors' conclusion for this section is as follows:
"we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the task-specific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small."
In other words, with fine-tuning, a downstream task can still benefit from the larger, more expressive pre-trained representations even when only a very small amount of task data is available.

5.3 Feature-based Approach with BERT

Compared with the fine-tuning approach discussed so far, the feature-based approach has its own key advantages.
First, not every task can easily be represented by a Transformer encoder architecture, so some tasks require adding a task-specific model architecture.
Second, there is a major computational benefit to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of that representation.

In this section, the authors compare the fine-tuning and feature-based approaches by applying BERT to named entity recognition (NER).
For BERT's input, a case-preserving WordPiece model is used, including the maximal document context provided by the data. Following standard practice, the task is formulated as tagging, but no CRF layer is used in the output. The representation of the first sub-token is used as the input to the token-level NER classifier.

To ablate against the fine-tuning approach, the feature-based approach is applied by extracting the activations from one or more layers without fine-tuning any BERT parameters. These contextual embeddings are used as the input to a randomly initialized two-layer 768-dimensional BiLSTM, followed by the classification layer.
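A rough sketch of this feature-based setup (the layer count and the 2-layer 768-dim BiLSTM follow the text; the label count and everything else are illustrative assumptions):

    import torch
    import torch.nn as nn

    class FeatureBasedNER(nn.Module):
        """Frozen BERT activations -> 2-layer 768-dim BiLSTM -> per-token classifier (sketch)."""
        def __init__(self, layers_concat=4, hidden=768, num_labels=9):
            super().__init__()
            self.lstm = nn.LSTM(input_size=layers_concat * hidden, hidden_size=hidden,
                                num_layers=2, bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden, num_labels)

        def forward(self, bert_layer_outputs):
            # bert_layer_outputs: list of `layers_concat` tensors, each (batch, seq_len, hidden),
            # extracted from a frozen BERT (no gradients flow back into BERT)
            features = torch.cat(bert_layer_outputs, dim=-1)
            lstm_out, _ = self.lstm(features)
            return self.classifier(lstm_out)       # (batch, seq_len, num_labels)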

Table 7 shows the experimental results:

In the feature-based approach, concatenating the last four hidden layers achieves an F1 score of 96.1, only 0.3 behind fine-tuning BERT(base).
The results show that both ways of applying BERT are effective.

6 Conclusion

Recent improvements from transfer learning with language models show that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures.
BERT's major contribution is to further generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

Appendix A Additional Details for BERT

A.1 Illustration of the Pre-training Tasks

The authors provide examples of the pre-training tasks here.

Masked LM and the Masking Procedure. Suppose the original sentence is "my dog is hairy". As described in Task 1 of Section 3.1, 15% of the token positions in the sentence are randomly selected for masking. Suppose the fourth token position is selected here, i.e., "hairy" is to be masked. The masking procedure is then as follows (a code sketch follows the list):

  • 80% of the time: replace the target word with [MASK], e.g., my dog is hairy --> my dog is [MASK].
  • 10% of the time: replace the target word with a random word, e.g., my dog is hairy --> my dog is apple.
  • 10% of the time: leave the target word unchanged, e.g., my dog is hairy --> my dog is hairy. (The purpose of this is to bias the representation towards the actually observed word.)
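A toy sketch of the per-token masking rule above (assuming a plain list of word tokens rather than WordPiece pieces, and a tiny stand-in vocabulary for the random replacement):

    import random

    MASK = "[MASK]"
    TOY_VOCAB = ["apple", "the", "dog", "run"]     # stand-in vocabulary for random replacement

    def corrupt_token(token):
        """Apply the 80/10/10 rule to one token that has been chosen for prediction."""
        r = random.random()
        if r < 0.8:
            return MASK                            # 80%: replace with [MASK]
        elif r < 0.9:
            return random.choice(TOY_VOCAB)        # 10%: replace with a random word
        return token                               # 10%: keep the original word

    def mask_sequence(tokens, mask_rate=0.15):
        """Pick ~15% of positions to predict, corrupt them, and return (inputs, labels)."""
        n = max(1, round(len(tokens) * mask_rate))
        positions = random.sample(range(len(tokens)), n)
        inputs = list(tokens)
        for p in positions:
            inputs[p] = corrupt_token(tokens[p])
        return inputs, {p: tokens[p] for p in positions}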

The procedure above should be understood in combination with training over multiple epochs. Each epoch is one pass over all the samples, so a given sample is fed to the model repeatedly across epochs. With that in mind, the 80%/10%/10% split means that each time a sample is fed to the model, the selected target word is replaced with [MASK] with probability 80%, replaced with a random word with probability 10%, and left unchanged with probability 10%.

Some articles that introduce BERT, when explaining the MLM procedure, interpret the 80%/10%/10% as: of the 15% of tokens randomly selected from the original sentence, 80% have the target word replaced with [MASK], 10% have it replaced with a random word, and 10% are left unchanged. In the view of the author of these notes, that understanding is not correct.

The paper then discusses the benefit of this masking strategy. Roughly, with the strategy above the Transformer encoder does not know which word it will be asked to predict, or which word has been replaced by a random word, so it is forced to maintain a distributional contextual representation of every input token. In other words, if the model could learn exactly which word it has to predict, it could neglect the context; since it cannot, it must learn to use contextual information to infer the word to be predicted, and only such a model has real sentence representation ability. Moreover, because random replacement only happens to 1.5% of all tokens (i.e., 10% of 15%), it does not appear to harm the model's language understanding capability. Section C.2 of the paper evaluates the impact of this procedure.

Compared with standard language model training, the masked LM only makes predictions for 15% of the tokens in each batch, so more pre-training steps are needed for the model to converge. Section C.1 shows that MLM converges marginally slower than a left-to-right model (which predicts every token), but the improvement in quality far outweighs the extra training cost.

Next Sentence Prediction
Examples of the "next sentence prediction" task:

Input = [CLS] the man went to [MASK] store [SEP]
            he bought a gallon [MASK] milk [SEP]
            
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]
            penguin [MASK] are flight ##less birds [SEP]

Label = NotNext

A.2 Pre-training Procedure

This section first describes how samples for the next sentence prediction task are generated: two spans are sampled from the corpus, where a span can be understood as a complete stretch of text. The two spans serve as sentence A and sentence B. 50% of the time, sentence B is the span that actually follows A, and 50% of the time it is not. The combined length of A and B must be at most 512 tokens.
It then describes tokenization for the LM:
"The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces."
Pre-training uses a batch size of 256 sequences (256 * 512 = 128,000 tokens per batch) for 1,000,000 steps, which is roughly 40 epochs over the corpus of more than 3.3 billion words. The optimizer is Adam with learning rate 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup [Note 4] over the first 10,000 steps, and linear decay of the learning rate afterwards. Dropout with probability 0.1 is used on all layers. Following OpenAI GPT, the gelu activation is used rather than the standard relu. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.
BERT base was trained on 4 Cloud TPUs (16 TPU chips in total), and BERT large on 16 Cloud TPUs (64 TPU chips in total). Each pre-training run took 4 days to complete.
Because the computational complexity of attention is quadratic in the sequence length, longer sequences are disproportionately expensive. To speed up pre-training in the experiments, sequences of length 128 are used for 90% of the steps, and sequences of length 512 for the remaining 10% of the steps so that the positional embeddings are learned.
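A minimal sketch of the warmup-then-linear-decay learning-rate schedule described above (peak 1e-4, 10,000 warmup steps, 1,000,000 total steps; an illustration, not the authors' training code):

    def bert_lr(step, peak_lr=1e-4, warmup_steps=10000, total_steps=1000000):
        """Linear warmup for the first warmup_steps, then linear decay to zero."""
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))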

A.3 Fine-tuning Procedure

During fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability is always kept at 0.1. The optimal hyperparameter values are task-specific, but the authors give the following ranges of possible values that work well across tasks:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4

The authors also observed that with large datasets (100k+ labeled training examples), hyperparameter choice is far less sensitive than with small datasets. Fine-tuning is still very fast, so it is acceptable to simply run an exhaustive search over the parameters above and pick the model that performs best on the dev set.
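Such an exhaustive search could look like the following sketch (train_and_eval is a hypothetical placeholder for one fine-tuning run plus dev-set evaluation):

    import itertools

    BATCH_SIZES = [16, 32]
    LEARNING_RATES = [5e-5, 3e-5, 2e-5]
    EPOCHS = [2, 3, 4]

    def grid_search(train_and_eval):
        """Try every combination and keep the one with the best dev-set score."""
        best_score, best_cfg = float("-inf"), None
        for bs, lr, ep in itertools.product(BATCH_SIZES, LEARNING_RATES, EPOCHS):
            score = train_and_eval(batch_size=bs, learning_rate=lr, num_epochs=ep)
            if score > best_score:
                best_score, best_cfg = score, (bs, lr, ep)
        return best_cfg, best_score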

A.4 Comparison of BERT, ELMo, and OpenAI GPT

Figure 3 compares the three model architectures:

  • BERT uses a bidirectional Transformer architecture.
  • OpenAI GPT uses a left-to-right Transformer.
  • ELMo trains left-to-right and right-to-left models independently and concatenates their outputs to provide features for downstream tasks.
    Of the three architectures, only BERT's representations are jointly conditioned on both left and right context in every layer.
    Besides the architectural differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

Besides MLM and NSP, BERT and GPT also differ in several training details:

  • GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
  • GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.
  • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.
  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

To show that BERT's improvements really come from the two pre-training tasks and the bidirectional Transformer rather than from these differences, the authors describe their ablation experiments and results in Section 5.1.

A.5 Fine-tuning Illustrations for Different Tasks

As shown in Figure 4:

(a) and (b) are sequence-level tasks; (c) and (d) are token-level tasks.
In the figure, E denotes the input embedding, Ti denotes the contextual representation of the i-th token, [CLS] is the special symbol for classification output, and [SEP] is the special symbol separating non-consecutive token sequences.

B Detailed Experimental Setup

B.1 Detailed Descriptions of the GLUE Benchmark Experiments

The following downstream-task datasets are used for training and evaluation:

  • MNLI: predict whether the second sentence is entailed by, contradicts, or is neutral with respect to the first sentence.
  • QQP: determine whether two questions are semantically equivalent.
  • QNLI: a standard question answering dataset converted into a binary classification task; pairs containing the correct answer sentence are positive examples, and the rest are negative.
  • SST-2: sentiment classification of movie reviews.
  • CoLA: predict whether a sentence is linguistically acceptable.
  • STS-B: rate the semantic similarity of two sentences on a scale from 1 to 5.
  • MRPC: determine whether two sentences are semantically equivalent.
  • RTE: similar to MNLI, but with a much smaller dataset.
  • WNLI: a small natural language inference dataset. This dataset has known issues and is excluded from the evaluation.

C Additional Ablation Studies

C.1 Effect of Number of Training Steps

Figure 5 shows the MNLI dev accuracy obtained by fine-tuning from a checkpoint pre-trained for k steps.

This figure answers the following questions:

  • Does BERT really need such a huge amount of pre-training (128,000 words/batch * 1,000,000 steps)?
    Yes. Compared with 500k steps, accuracy improves by 1.0%.
  • Does MLM pre-training converge more slowly than LTR, given that only 15% of the words in each batch are predicted rather than all of them?
    It does converge slightly more slowly, but its accuracy surpasses the LTR model almost immediately, so it is worth it.

C.2 Ablation for Different Masking Procedures

As mentioned earlier, the goal of the masking strategy is to reduce the mismatch between pre-training and fine-tuning, since the [MASK] symbol essentially never appears during fine-tuning. Table 8 shows how different masking strategies affect results under both the fine-tuning and the feature-based approach:

Under the feature-based approach, the mismatch caused by [MASK] has a larger impact, because the model has no chance to adjust the feature representations during training (the feature extraction layers are frozen).

In the feature-based approach, the authors concatenate the outputs of BERT's last 4 layers as features, because this works best; see Section 5.3.

We can also see that fine-tuning is surprisingly robust to different masking strategies. However, as the authors expected, using only [MASK] is problematic when applying the feature-based approach to NER. Interestingly, using only random replacement is also much worse than the strategy in the first row.

Notes

  1. Deeply bidirectional: the difference between deep and shallow bidirectionality is that the latter simply concatenates separately trained left-to-right and right-to-left representations, while the former is trained jointly.
  2. Feature-based: also called feature extraction. A pre-trained network is used to extract features from new samples, and those features are fed into a new classifier which is trained from scratch. In other words, during training the feature extraction layers of the network are frozen, and only the classifier on top is trained.
  3. Fine-tuning: differs from the feature-based approach in that, after the new classifier is trained, the top few layers of the feature extractor are unfrozen and trained jointly with the classifier. It is called fine-tuning because the parameters, being updated from the pre-trained values, change relatively little compared with initializing the downstream model without pre-trained parameters. If a large amount of training data is available, all feature extraction layers can be unfrozen and every parameter trained; since training starts from the pre-trained parameters, this is still much faster than training all parameters from random initialization. When the BERT authors fine-tune on downstream tasks, they unfreeze all layers and fine-tune all parameters.
  4. Warmup: learning rate warmup. During a specified number of warmup steps, the learning rate is gradually increased; after the warmup steps, a decay schedule is applied. This avoids instability early in training and lets the loss drop lower later on.

OK, that's all for this post. Thanks for reading O(∩_∩)O.

Origin www.cnblogs.com/anai/p/11645953.html