[NLP classic paper intensive reading] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Foreword

BERT plays a pivotal role in advancing NLP. Its simple design, easy deployment, and the successful pre-training + fine-tuning paradigm have inspired many subsequent models. This article is the second in the NLP classic paper series and aims to help readers understand BERT more deeply.


Paper: https://arxiv.org/pdf/1810.04805.pdf
Code: https://github.com/google-research/bert

Abstract

This paper proposes BERT, Bidirectional Encoder Representations from Transformers. BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, so that only one additional output layer needs to be fine-tuned for a specific downstream task. The idea behind BERT is simple and the results are remarkable: it achieves SOTA on eleven NLP tasks.

1. Introduction

Pre-trained language models have been shown to improve many NLP tasks, including both sentence-level and token-level tasks. There are two existing pre-training strategies:

  • Feature-based, such as ELMo, which uses task-specific architectures (e.g., RNNs) and includes the pre-trained representations as additional features.
  • Fine-tuning-based, such as GPT, which trains on downstream tasks by simply fine-tuning all of the pre-trained parameters.

Both use unidirectional language models to learn general language representations. However, unidirectionality restricts the choice of architectures during pre-training and prevents the model from learning optimal representations for sentence-level and token-level tasks.
BERT is based on the Transformer's bidirectional encoder. Inspired by the cloze task, BERT alleviates the unidirectionality problem with a masked language model (MLM) pre-training objective: MLM randomly masks some tokens in the input and predicts the original tokens from their context. Unlike GPT, the MLM objective fuses left and right context and therefore allows pre-training a deep bidirectional Transformer. In addition, the authors add a next sentence prediction (NSP) task, which predicts the relationship between the current sentence and the next one. The contributions of this paper are as follows:

  • This paper shows the importance of bidirectional pre-training for language representation.
  • Pretrained representations reduce the complexity of task-specific architectures.
  • BERT achieves SOTA in eleven task scenarios.

2. Related Work

2.1 Unsupervised Feature-based Approaches

Pre-trained word-level representations provide a significant improvement over embeddings trained from scratch. To pre-train word-level representations, prior work has used left-to-right language modeling objectives, as well as objectives that discriminate correct from incorrect words in left and right context, among others.
ELMo generalizes traditional word embedding research along a different dimension: it extracts context-sensitive features from left-to-right and right-to-left language models and has advanced several NLP tasks. Related work also shows that the cloze task can improve the robustness of text generation models.

2.2 Unsupervised Fine-tuning Approaches

Recently, a new approach has emerged: pre-train sentence representations on unlabeled text and then fine-tune them for downstream tasks. The advantage of this approach is that few parameters need to be learned from scratch. The representative work is GPT, which achieved SOTA on multiple sentence-level tasks.

2.3 Transfer Learning from Supervised Data

Many works have shown effective transfer from supervised tasks with large datasets, such as natural language inference and machine translation.

3. BERT

The BERT framework has two steps: pre-training and fine-tuning. Pre-training is performed on unlabeled data, while fine-tuning initializes the model with the pre-trained parameters and then tunes all of them on labeled data from the downstream task.
[Figure: overall pre-training and fine-tuning procedures for BERT]
The distinguishing feature of BERT is its unified architecture across tasks. The architecture is essentially a multi-layer bidirectional Transformer encoder. In this paper, L denotes the number of Transformer blocks, H the hidden vector size, and A the number of self-attention heads. Two model sizes are provided: $\mathbf{BERT_{base}}$ (L=12, H=768, A=12, 110M parameters) and $\mathbf{BERT_{large}}$ (L=24, H=1024, A=16, 340M parameters).

Here it is worth explaining how H and A scale with L, and how the parameter count is computed. As the number of layers doubles, the hope is that the number of parameters roughly doubles too; since the parameter count grows with the square of H, setting H to 1024 is an appropriate choice. A scales with H because the per-head dimension H/A is kept constant (64 here), so A is set to 16 in the large model. The parameter calculation is shown in the figure below, where 30K is the vocabulary size; a detailed walkthrough can be found in Mushen's video.

[Figure: parameter count calculation for BERT_base and BERT_large]
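As a rough cross-check of the 110M and 340M figures, here is a back-of-the-envelope count (a sketch only: it ignores biases, LayerNorm, and the position/segment embeddings, and assumes a 30K vocabulary):

```python
def bert_param_count(L, H, vocab_size=30_000):
    """Rough BERT parameter count: token embeddings plus L Transformer blocks."""
    embedding = vocab_size * H                  # token embedding matrix
    attention = 4 * H * H                       # Q, K, V and output projections
    ffn = 2 * H * (4 * H)                       # feed-forward: H -> 4H -> H
    return embedding + L * (attention + ffn)    # per block: 12 * H^2

print(f"BERT-base : ~{bert_param_count(12, 768) / 1e6:.0f}M params")    # ~108M
print(f"BERT-large: ~{bert_param_count(24, 1024) / 1e6:.0f}M params")   # ~333M
```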
In order for BERT to handle a wide range of downstream tasks, the input representation must be able to unambiguously represent either a single sentence or a pair of sentences as one token sequence, where a "sentence" here means an arbitrary span of contiguous text rather than an actual linguistic sentence.
The authors use WordPiece embeddings with a 30K-token vocabulary. The first token of every sequence is [CLS], and the final hidden state of this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are distinguished in two ways: first, the two sentences are separated by a special token [SEP]; second, a learned embedding is added to every token indicating whether it belongs to the first or the second sentence.
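As a concrete illustration (a sketch using the Hugging Face transformers tokenizer, which is not part of the original paper), a sentence pair is packed into a single sequence with [CLS]/[SEP] markers and segment ids:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("my dog is cute", "he likes snorkeling", return_tensors="pt")

# [CLS] sentence A tokens [SEP] sentence B tokens [SEP];
# rare words are split into WordPiece subwords prefixed with "##".
print(tok.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
# Segment ids: 0 for sentence A (including [CLS] and the first [SEP]), 1 for sentence B.
print(enc["token_type_ids"][0])
```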
The embedding of each token is the sum of its token embedding, segment embedding and position embedding. The visualization process is as follows:
[Figure: BERT input representation as the sum of token, segment, and position embeddings]
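A minimal sketch of this sum (the released model additionally applies LayerNorm and dropout to the result; the sizes here match BERT-base):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Simplified input embedding: token + segment + position (all learned)."""
    def __init__(self, vocab_size=30_522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)   # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden)     # learned, not sinusoidal

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions).unsqueeze(0))

emb = BertEmbeddings()
token_ids = torch.tensor([[101, 2023, 2003, 102, 2009, 102]])   # toy WordPiece ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(token_ids, segment_ids).shape)   # torch.Size([1, 6, 768])
```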

3.1 Pre-training BERT

Pre-training employs two unsupervised learning tasks.

Masked LM

A bidirectional model naturally captures more information than a unidirectional one. To train a bidirectional representation, the authors randomly mask a certain proportion of the input tokens and then predict the original tokens from the final-layer hidden states at the masked positions via a softmax over the vocabulary. This task is called the masked LM (MLM), essentially a cloze task. In the experiments, 15% of the tokens are masked.
However, naive masking creates a mismatch between pre-training and fine-tuning, because [MASK] never appears during fine-tuning. Therefore, of the selected 15% of tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
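A toy sketch of the 80/10/10 rule, operating on word strings rather than WordPiece ids for readability:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: of the chosen positions, 80% -> [MASK],
    10% -> random token, 10% -> unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)   # None = not predicted
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                    # the model must recover the original token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = random.choice(vocab)
        # else: keep the original token (but still predict it)
    return masked, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens(["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"], vocab))
```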

Next Sentence Prediction (NSP)

Many downstream tasks such as question answering and natural language inference rely on understanding the relationship between two sentences, which a language model does not directly capture. To learn sentence-level features, the NSP task is constructed: when building a sentence pair, 50% of the time the second sentence is the actual next sentence (IsNext) and 50% of the time it is not (NotNext).
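A toy sketch of how such pairs could be sampled (the real pipeline works on WordPiece sequences and packs pairs up to 512 tokens):

```python
import random

def make_nsp_pair(doc_sentences, corpus_sentences):
    """Build one NSP example: 50% actual next sentence (IsNext),
    50% a random sentence from the corpus (NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        return sent_a, doc_sentences[i + 1], "IsNext"
    return sent_a, random.choice(corpus_sentences), "NotNext"

doc = ["the man went to the store.", "he bought a gallon of milk.", "then he went home."]
corpus = ["penguins are flightless birds.", "the stock market fell today."]
print(make_nsp_pair(doc, corpus))
```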

Pre-training data

The pre-training corpus consists of BooksCorpus (800M words) and English Wikipedia (2,500M words).

3.2 Fine-tuning BERT

For each task, one simply plugs the task-specific inputs and outputs into BERT and fine-tunes all parameters end to end. For token-level tasks, the token representations are fed into an output layer to obtain the predictions; for sentence-level tasks, the [CLS] representation is fed into an output layer.
The fine-tuning process is less time-consuming than the pre-training process.
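A minimal sketch of the sentence-level case using the Hugging Face transformers library (not used in the original paper, which was released in TensorFlow; the class name here is illustrative):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSentenceClassifier(nn.Module):
    """Sentence-level fine-tuning: a single linear layer on top of [CLS]."""
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # hidden state of the [CLS] token
        return self.classifier(cls)         # logits; train with cross-entropy

tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["a delightful film", "a tedious mess"], padding=True, return_tensors="pt")
model = BertSentenceClassifier(num_labels=2)
print(model(batch["input_ids"], batch["attention_mask"]).shape)   # torch.Size([2, 2])
```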

4. Experiments

4.1 GLUE

The GLUE benchmark is a collection of diverse natural language understanding tasks. To fine-tune on GLUE, the authors feed the data into BERT and use the final hidden vector of the [CLS] token, $C \in \mathbb{R}^H$, as the aggregate representation. The only new parameters introduced during fine-tuning are the classification weights $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels. The standard classification loss is computed from $C$ and $W$, i.e. $\log(\mathrm{softmax}(C W^T))$.
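Spelling the loss out in code (random tensors stand in for actual BERT outputs; K=3 as in MNLI):

```python
import torch
import torch.nn.functional as F

H, K = 768, 3                                 # hidden size, number of labels
C = torch.randn(8, H)                         # a batch of [CLS] vectors from BERT
W = torch.randn(K, H, requires_grad=True)     # the only new fine-tuning parameters
labels = torch.randint(0, K, (8,))

logits = C @ W.T                              # C W^T, shape (batch, K)
log_probs = F.log_softmax(logits, dim=-1)     # log(softmax(C W^T))
loss = F.nll_loss(log_probs, labels)          # standard classification loss
print(loss.item())
```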
[Table: GLUE test results]
Fine-tuning on some of the small datasets is unstable, so the authors report the best run on those datasets. In fact, a likely reason for the instability is that only 3 epochs are used, i.e. just three full passes over the dataset, which is arguably not enough; increasing the number of epochs yields more stable results.
The experimental results are shown in the table above: both $\mathbf{BERT_{base}}$ and $\mathbf{BERT_{large}}$ improve on the previous SOTA and outperform GPT, which has a comparable architecture, and performance improves further as model size grows.

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) contains 100K crowdsourced question-answer pairs. The task is to predict the span of the answer text within the passage, which essentially means predicting a start position S and an end position E. The results are shown in the table below:
[Table: SQuAD v1.1 results]
The best results are obtained by first fine-tuning on the TriviaQA dataset and then fine-tuning on SQuAD.
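A sketch of the span scoring: learned start and end vectors are dotted with each token representation, and the highest-scoring valid span (start before end) is selected; random tensors stand in for real BERT outputs:

```python
import torch

H, seq_len = 768, 32
T = torch.randn(seq_len, H)        # final-layer token representations from BERT
S = torch.randn(H)                 # learned start vector
E = torch.randn(H)                 # learned end vector

start_scores = T @ S               # S . T_i for every position i
end_scores = T @ E                 # E . T_j for every position j

# Score every span (i, j) as start_scores[i] + end_scores[j], keep only i <= j.
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(seq_len, seq_len)).bool()
span_scores = span_scores.masked_fill(~valid, float("-inf"))

best = torch.argmax(span_scores).item()
i, j = divmod(best, seq_len)
print(f"predicted answer span: tokens {i}..{j}")
```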

4.3 SQuAD v2.0

This dataset extends v1.1 by allowing questions that have no answer in the given passage, making the task more realistic.
[Table: SQuAD v2.0 results]
The table above shows the results; BERT improves over the previous SOTA by 5.1 points.

4.4 SWAG

The SWAG dataset contains 113K sentence pairs for evaluating commonsense reasoning. Given a sentence, the task is to choose the most plausible continuation among four options. When fine-tuning, the authors construct four input sequences for each example, each concatenating the given sentence A with a candidate continuation B; a score is computed from the final-layer [CLS] representation of each sequence, and a softmax over the four scores gives the prediction. The results are shown in the table below:
[Table: SWAG results]

BERT outperforms the baseline by 27.1% and GPT by 8.3%.
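A sketch of the multiple-choice scoring: a single learned vector scores the [CLS] state of each of the four candidate sequences (random tensors stand in for BERT outputs):

```python
import torch
import torch.nn.functional as F

H, num_choices = 768, 4
cls_states = torch.randn(2, num_choices, H)   # [CLS] vectors for 4 sequences, batch of 2
V = torch.randn(H, requires_grad=True)        # the only new parameter: a scoring vector

scores = cls_states @ V                       # shape (batch, 4)
probs = F.softmax(scores, dim=-1)             # softmax over the four candidates
labels = torch.tensor([0, 2])                 # index of the correct continuation
loss = F.cross_entropy(scores, labels)
print(probs.shape, loss.item())
```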

5. Ablation Studies

5.1 Effect of Pre-training Tasks

This section evaluates the importance of the two pre-training tasks while keeping the pre-training data, fine-tuning scheme, and hyperparameters fixed.
No NSP: only the MLM pre-training task is used, without NSP.
LTR & No NSP: the bidirectional model is replaced with a left-to-right unidirectional model trained with a standard left-to-right LM objective. This is essentially GPT, but pre-trained on different data.
[Table: ablation over pre-training tasks]
The table above shows the performance of the different variants. Removing NSP hurts performance, and additionally giving up bidirectionality degrades it further. The authors also tried strengthening the LTR & No NSP model by adding a BiLSTM on top; this does help, but a clear gap with BERT remains.

5.2 Effect of Model Size

[Table: ablation over model size]
The authors tried models of different sizes, with results shown in the table above. The larger models bring substantial improvements on all four datasets. The authors argue that, given sufficient pre-training, increasing model size leads to large improvements during fine-tuning even on small downstream tasks.

5.3 Feature-based Approach with BERT

As a fine-tuning-based method, BERT needs only a simple classification layer to achieve strong results, but the feature-based approach also has advantages. First, not every task is easily handled by a Transformer encoder (generation tasks, for example), so some require a task-specific model; second, running the large pre-trained model for every task is computationally expensive, whereas its representations can be pre-computed once and reused by cheaper models.
The two approaches are compared here.
[Table: comparison of fine-tuning and feature-based approaches]
Here, therefore, BERT is not fine-tuned; its hidden states are instead fed in as fixed features, and the results fall slightly short of fine-tuning.
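A sketch of extracting such fixed features with the Hugging Face transformers library (the paper's feature-based experiments feed features like these into a separate BiLSTM tagger; this snippet only shows the extraction step):

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()                                    # BERT parameters stay frozen

with torch.no_grad():                          # no gradients flow into BERT
    batch = tok("BERT as a feature extractor", return_tensors="pt")
    out = bert(**batch)

# One common feature-based choice: concatenate the last four hidden layers
# as per-token contextual features.
hidden = out.hidden_states                     # tuple: embeddings + one entry per layer
features = torch.cat(hidden[-4:], dim=-1)      # shape (1, seq_len, 4 * 768)
print(features.shape)
```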

6. Conclusion

Recent results from language-model-based transfer learning show that rich, unsupervised pre-training is an integral part of language understanding systems. The main contribution of this paper is to generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to handle a broad range of NLP tasks.

Reading summary

Compared with GPT and the original Transformer, BERT was probably the better-known model during its first couple of years, even though it only takes the encoder part of the Transformer and makes a few modest changes: two pre-training tasks are added, and a segment embedding is added to the input, and nothing more. Simple as the implementation is, it established a powerful paradigm for the NLP field, pre-training + fine-tuning, which had already been used in CV and which BERT successfully brought to NLP. The paper also makes the point that larger models perform better, which laid a foundation for today's competition over ever-larger language models.
The article's overall idea is clear, its logic is rigorous, and the design of each component is explained very well; this style of writing is well worth learning. Finally, I strongly recommend Mushen's explanation video, which will give you a deeper understanding of the paper.
