BERT's new rules for pre-training!

Paper introduction: Should a 15% masking probability still be used in masked language models?
Paper title: Should You Mask 15% in Masked Language Modeling?
Paper link: https://arxiv.org/pdf/2202.08005.pdf
Paper authors: Alexander Wettig*, Tianyu Gao*, Zexuan Zhong, Danqi Chen

Introduction to the paper

Previous masked language models (MLMs) have typically used a masking rate of 15% during pre-training. The conventional wisdom, as the authors note, is that masking more would leave too little context to learn good representations, while masking less would make training too expensive. Surprisingly, the authors found that masking 40% of the input tokens can outperform the 15% baseline, as measured by fine-tuning on downstream tasks, and that even masking 80% of the tokens preserves most of the performance.

Raising the masking rate has two further notable effects, which the authors found through careful ablation experiments:
  • There is no need for the 80-10-10 recipe (80% [MASK], 10% keep the original token, 10% replace with a random token); using only [MASK] works as well or better.
  • As the masking rate increases, simple random uniform masking (Uniform) performs as well as or better than Span Masking and PMI-Masking (masking spans of correlated tokens chosen by pointwise mutual information).

Overall, the findings contribute to a better understanding of masked language models and point to new avenues for effective pre-training. Next, we look at the detailed experimental results.

The "15% masking rate" convention in pre-training can be broken

"15% occlusion rate" means that in a pre-training task, 15% of the words are randomly occluded, and the AI ​​is trained to learn to predict the occluded words.

In this work, the authors found that under an efficient pre-training scheme that masks 40-50% of the input text, the model achieves better downstream performance than with the default 15%. Table 1 shows examples of masking 15%, 40%, and 80%, along with the corresponding downstream performance. We can see that with 80% masking, even though most of the context is destroyed, the model still learns good pre-trained representations and retains over 95% of downstream task performance compared to 15% masking. This breaks the previous convention of choosing a 15% masking rate and raises the question of how the model can benefit from such high masking rates, which may become a hot spot for future research on masked language models.

Pre-training can use masking rates well above 15%

To understand how many tokens can be masked in MLM and how the masking rate affects the performance of the pre-trained model, the paper pre-trains a series of models with masking rates ranging from 15% to 80%. Figure 1 shows how downstream task performance varies with the masking rate.

We can see that masking up to 50% achieves results comparable to or even better than the default 15% masking model. Masking 40% achieves the best overall downstream performance (although the optimal masking rate varies across downstream tasks). The results show that language model pre-training does not have to use a masking rate of 15% or less; under an efficient pre-training setting, large models reach their optimal masking rate at values as high as 40%.
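
If you want to try a higher masking rate yourself, one convenient entry point is the `mlm_probability` argument of `DataCollatorForLanguageModeling` in the HuggingFace transformers library; a small sketch, with the actual pre-training loop omitted:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Raise the masking rate from the default 0.15 to 0.40, following the paper's finding.
# Note: by default this collator still applies BERT's 80-10-10 replacement rule,
# which the paper later argues is unnecessary.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.40
)

batch = collator([tokenizer("the quick brown fox jumps over the lazy dog")])
print(batch["input_ids"])  # some positions replaced by [MASK] (or kept / randomized)
print(batch["labels"])     # -100 everywhere except the positions to be predicted
```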

To further compare the 15% and 40% masking rates, the GLUE test results for both are shown in Table 2:

Figure 2 plots downstream task performance as a function of training steps:

Table 2 further confirms that masking 40% performs significantly better than masking 15%, with SQuAD improving by nearly 2%. Figure 2 also shows that 40% masking holds a consistent advantage over 15% throughout the entire training process.

"Re"understand Mask Rate

In this section, the authors analyze how the masking rate affects MLM pre-training from two perspectives: task difficulty and optimization. Under the masking mechanism, they further discuss the relationship between the masking rate, model size, and different corruption strategies, and their impact on downstream task performance.

The relationship between the masking rate, the corruption rate, and the prediction rate

Specifically, the masking rate is split into two metrics: the corruption rate and the prediction rate. The corruption rate is the proportion of the sentence that is corrupted, and the prediction rate is the proportion of tokens the model is asked to predict. The paper further studies the corruption rate (m_corr) and the prediction rate (m_pred) and finds a new pattern:
A higher prediction rate makes the model better, while a higher corruption rate makes it worse:

Table 3 shows the ablation results for the corruption rate m_corr and the prediction rate m_pred. We can see that (1) fixing m_corr at 40% and lowering m_pred from 40% to 20% leads to consistent drops on downstream tasks, indicating that more predictions lead to better performance; (2) fixing m_pred at 40% and lowering m_corr leads to consistently better performance, indicating that a lower corruption rate makes the pre-training task easier to learn; and (3) the gains from a high prediction rate can outweigh the drawbacks of a high corruption rate, yielding better overall performance.
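
To make the decoupling concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' implementation) of the m_pred ≤ m_corr side of the ablation: an m_corr fraction of positions is corrupted with [MASK], but the loss is only kept on an m_pred fraction of them. The m_pred > m_corr direction requires a different construction and is omitted here.

```python
import torch

def corrupt_and_predict(input_ids, mask_token_id, m_corr=0.40, m_pred=0.20):
    """Decouple corruption from prediction (sketch, m_pred <= m_corr):
    corrupt an m_corr fraction of positions, compute the loss only on an
    m_pred fraction of them (labels of -100 are ignored by cross-entropy)."""
    scores = torch.rand(input_ids.shape)

    corrupt_positions = scores < m_corr   # ~40% of tokens are corrupted
    predict_positions = scores < m_pred   # ~20% carry a loss (subset of corrupted)

    corrupted = input_ids.clone()
    corrupted[corrupt_positions] = mask_token_id

    labels = input_ids.clone()
    labels[~predict_positions] = -100     # positions without a prediction loss
    return corrupted, labels

input_ids = torch.randint(1000, 30000, (16,))          # toy token ids
corrupted, labels = corrupt_and_predict(input_ids, mask_token_id=103)
```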

Higher masking rates are better suited to larger models

From the figure above, we can see that under an efficient pre-training setting, large models have an optimal masking rate of around 40% on average, while base and medium models have an optimal masking rate of around 20%. This clearly shows that models with more parameters benefit more from higher masking rates.

Demystifying the "80-10-10" rule

Since 2019, it has been widely believed that replacing 10% of the masked positions with the original token (keeping the word unchanged) and another 10% with random tokens is beneficial. Since then, the 80-10-10 rule has been adopted in almost all MLM pre-training work. Its motivation is that [MASK] tokens create a mismatch between pre-training and downstream fine-tuning, and using original or random tokens as alternatives to [MASK] can mitigate this gap. Following this reasoning, masking more of the context should further increase the mismatch, yet the authors observe stronger downstream performance. This raises the question of whether the 80-10-10 rule is needed at all. The authors therefore revisit the 80-10-10 rule and relate it to the two metrics of corruption rate and prediction rate, reasoning as follows:

Same-token prediction: predicting the identical token is a trivial task, since the model can simply copy the input to the output. The loss from same-token predictions is very small, and this objective should be viewed as an auxiliary regularizer that ensures textual information propagates from the embeddings to the last layer. Therefore, same-token predictions should count toward neither the corruption rate nor the prediction rate: they do not corrupt the input, and they contribute little to learning.

Random-token corruption: replacing tokens with random tokens counts toward both the corruption rate and the prediction rate, since the input is corrupted and the prediction task is non-trivial. In fact, the authors find that the loss on random tokens is slightly higher than on [MASK], for two reasons: (1) the model needs to decide, for every token, whether its input comes from a random replacement, and (2) the predictions must remain consistent despite drastic changes in the input embeddings.

To verify these conclusions, the authors take an m=40% model that uses only [MASK] replacement as the baseline and add three models on top of it:

1. "+5% same": mask 40% of the tokens and predict 45% of the tokens (the extra 5% are predictions of unchanged tokens).

2. "w/ 5% random": mask 35% of the tokens and replace another 5% with random tokens, giving a prediction rate of 40%.

3. "80-10-10": the BERT configuration, where among all masked positions, 80% are replaced with [MASK], 10% with the original token, and 10% with a random token.

The results are shown in Table 4. We observe that same-token prediction and random-token corruption degrade performance on most downstream tasks. The "80-10-10" rule performs worse than simply using [MASK] on all tasks. This suggests that under the fine-tuning paradigm, models trained only with [MASK] can quickly adapt to full, uncorrupted sentences without needing random replacements. Given these results, the authors recommend using only [MASK] for pre-training.
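
For reference, here is a simplified sketch (hypothetical helper names; special tokens are not handled) contrasting BERT's 80-10-10 corruption with the [MASK]-only corruption the authors recommend:

```python
import torch

def bert_80_10_10(input_ids, mask_token_id, vocab_size, mask_rate=0.15):
    """BERT-style corruption: of the selected positions, 80% -> [MASK],
    10% -> a random token, 10% -> kept as the original token."""
    selected = torch.rand(input_ids.shape) < mask_rate
    labels = input_ids.clone()
    labels[~selected] = -100            # only selected positions are predicted

    corrupted = input_ids.clone()
    roll = torch.rand(input_ids.shape)
    corrupted[selected & (roll < 0.8)] = mask_token_id                 # 80% [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)                 # 10% random
    corrupted[rand_pos] = torch.randint(vocab_size, input_ids.shape)[rand_pos]
    # the remaining 10% of selected positions keep the original token
    return corrupted, labels

def mask_only(input_ids, mask_token_id, mask_rate=0.40):
    """The paper's recommendation: replace every selected position with [MASK]."""
    selected = torch.rand(input_ids.shape) < mask_rate
    labels = input_ids.clone()
    labels[~selected] = -100
    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id
    return corrupted, labels
```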

At high masking rates, uniform masking works better

To understand the interaction between the masking rate and the masking strategy, the authors experiment with multiple masking strategies at different masking rates and find that random uniform masking (Uniform), at its optimal masking rate, performs better than more sophisticated masking strategies.

Figure 5 shows the results of uniform masking, T5-style span masking, and PMI masking at masking rates from 15% to 40%. We find that (1) for all masking strategies, the optimal masking rate is higher than 15%; (2) the optimal masking rates of span masking and PMI masking are lower than that of uniform masking; and (3) when every strategy uses its own optimal masking rate, uniform masking achieves results comparable to or even better than the advanced strategies.

To understand the relationship between higher masking rates and advanced masking strategies: as the figure below shows, masking more tokens uniformly increases the chance of masking highly correlated tokens together, which reduces trivial predictions and forces the model to learn more robustly. Note that even with uniform masking, a higher masking rate increases the chance of "accidentally" covering an entire PMI span. By sampling masks over the corpus, the authors compute this probability in Figure 6 and find that it increases eightfold when the masking rate rises from 15% to 40%. Likewise, a higher masking rate causes the masked tokens to form longer spans, showing that an increased masking rate can produce effects similar to advanced masking strategies while learning better representations.
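
As a rough illustration (our own simplification, not the paper's code), the sketch below contrasts uniform masking with a span-style strategy and checks how often uniform masking at a 40% rate happens to mask adjacent positions anyway:

```python
import random
import numpy as np

def uniform_mask_positions(seq_len, mask_rate):
    """Uniform masking: positions are selected independently at random."""
    return set(random.sample(range(seq_len), round(seq_len * mask_rate)))

def span_mask_positions(seq_len, mask_rate, mean_span=3, max_span=10):
    """Simplified span masking: draw roughly geometric span lengths at random
    start positions until the masking budget is spent (a stand-in for
    SpanBERT/T5-style masking, without their exact details)."""
    budget = round(seq_len * mask_rate)
    positions = set()
    while len(positions) < budget:
        span = min(int(np.random.geometric(1.0 / mean_span)), max_span,
                   budget - len(positions))
        start = random.randrange(seq_len)
        positions.update(range(start, min(start + span, seq_len)))
    return positions

# Even uniform masking covers adjacent positions far more often at 40% than at 15%.
for rate in (0.15, 0.40):
    pos = uniform_mask_positions(seq_len=128, mask_rate=rate)
    adjacent = sum(1 for p in pos if p + 1 in pos)
    print(rate, adjacent)
```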

Conclusion

In this paper, the authors conduct a comprehensive study of the masking rate in masked language models and find that a 40% masking rate consistently outperforms the conventional 15% on downstream tasks. By disentangling the corruption rate and the prediction rate, the masking rate can be better understood, and the results show that larger models benefit more from higher masking rates. The paper also shows that the 80-10-10 rule is largely unnecessary, and that simple uniform masking at higher masking rates is comparable to more sophisticated masking schemes.

References


Original post: blog.csdn.net/yanqianglifei/article/details/123016160