In-depth understanding of deep learning - BERT (Bidirectional Encoder Representations from Transformers): MLM (Masked Language Model)

Category: General Catalog of "In-depth Understanding of Deep Learning"
Related Articles:
BERT (Bidirectional Encoder Representations from Transformers): Basic Knowledge
BERT (Bidirectional Encoder Representations from Transformers): BERT Structure
BERT (Bidirectional Encoder Representations from Transformers): MLM (Masked Language Model)
BERT (Bidirectional Encoder Representations from Transformers): NSP (Next Sentence Prediction) task
BERT (Bidirectional Encoder Representations from Transformers): input representation
BERT (Bidirectional Encoder Representations from Transformers): fine-tuning training - [sentence pair classification]
BERT (Bidirectional Encoder Representations from Transformers): fine-tuning training - [single sentence classification]
BERT (Bidirectional Encoder Representations from Transformers): fine-tuning training - [text Q&A]
BERT (Bidirectional Encoder Representations from Transformers): fine-tuning training - [single sentence annotation]
BERT (Bidirectional Encoder Representations from Transformers): Model summary and precautions


The authors of BERT argue that a bidirectional encoder built by concatenating a left-to-right unidirectional encoder with a right-to-left unidirectional encoder is inferior to a deep bidirectional encoder in terms of performance, parameter scale, and efficiency. This is why BERT uses the Transformer Encoder as its feature extractor, rather than two Transformer Decoders encoding left-to-right and right-to-left. Because the standard language-model training objective cannot be applied to a bidirectional encoder, BERT is trained with MLM, which draws on the ideas of the cloze task and CBOW. Concretely, some words are randomly selected and masked (replaced with the special token [MASK]), and BERT is asked to predict these masked words; the model parameters are optimized with $P(w_i \mid w_1, w_2, \cdots, w_n)$ as the objective, where only the cross-entropy losses of the masked words are summed and used as the loss function. By predicting masked words from contextual information, BERT gains the ability to extract more accurate semantic information for different contexts.
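As a concrete illustration of "only the masked words contribute to the loss", below is a minimal PyTorch-style sketch. The function name, tensor shapes, and the use of -100 as the ignore label are illustrative assumptions, not BERT's original implementation.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Cross-entropy computed only over masked positions.

    logits: (batch, seq_len, vocab_size) token predictions from the encoder.
    labels: (batch, seq_len) original token ids at masked positions,
            and -100 everywhere else (the ignore_index convention).
    """
    vocab_size = logits.size(-1)
    # Flatten so every position becomes one classification example;
    # ignore_index=-100 drops all non-masked positions from the loss.
    return F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )

# Toy usage: a batch of 2 sequences of length 8 over a 100-token vocabulary.
logits = torch.randn(2, 8, 100)
labels = torch.full((2, 8), -100)   # nothing masked by default
labels[0, 3] = 17                   # pretend position 3 of sequence 0 was masked
labels[1, 5] = 42                   # and position 5 of sequence 1
print(mlm_loss(logits, labels).item())
```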

During training, each word is selected as a mask word with a probability of 15%, so a sentence may contain several mask words. Suppose word A and word B are both mask words: when predicting the mask word B, the contextual information contributed by word A is missing, because A has been replaced by [MASK] and its original semantic information is lost. Designing the MLM training method this way introduces a drawback: in the fine-tuning and inference stages the input text contains no [MASK] tokens, so the input distribution is shifted, and the mismatch between training data and prediction data causes a performance loss. To mitigate this, BERT does not always replace mask words with [MASK]; it chooses the replacement according to fixed proportions. After 15% of the words have been selected as mask words, each of them is handled in one of three ways (a code sketch of this rule follows the list). Suppose the training text is "Earth is one of the eight planets in the solar system" and "the solar system" is selected as the mask word; the replacement rules are as follows:

  • In 80% of the training samples, the mask word is replaced with [MASK]: "Earth is one of the eight planets in [MASK]"
  • In 10% of the training samples, the mask word is kept unchanged, for example: "Earth is one of the eight planets in the solar system"
  • In 10% of the training samples, the mask word is replaced with a word randomly selected from the model vocabulary, for example: "Earth is one of the eight planets in Apple"
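A minimal sketch of this 80%/10%/10% rule at the token level is given below. The function name, the toy vocabulary, and the use of Python's random module are illustrative assumptions rather than the original BERT preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"

def apply_mlm_masking(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels) following the 80/10/10 replacement rule.

    labels[i] is the original token if position i was selected as a mask word,
    otherwise None (the position does not contribute to the loss).
    """
    masked_tokens = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:      # ~85% of words: not selected
            continue
        labels[i] = token                     # the model must predict the original word
        r = random.random()
        if r < 0.8:                           # 80%: replace with [MASK]
            masked_tokens[i] = MASK_TOKEN
        elif r < 0.9:                         # 10%: keep the original word
            pass
        else:                                 # 10%: replace with a random vocabulary word
            masked_tokens[i] = random.choice(vocab)
    return masked_tokens, labels

# Toy usage
vocab = ["earth", "is", "one", "of", "the", "eight", "planets", "in", "solar", "system", "apple"]
tokens = "earth is one of the eight planets in the solar system".split()
print(apply_mlm_masking(tokens, vocab))
```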

Keeping a small fraction of the mask words unchanged alleviates the performance loss caused by the mismatch between training text and prediction text, while replacing another small fraction with random words teaches BERT to correct errors automatically from contextual information. If there were no random-replacement option, then whenever BERT encountered a non-[MASK] position it could simply output the input word and still achieve the optimal cross-entropy. By randomly replacing some mask words, BERT is forced to predict every selected word from the full context; mathematically, this removes the risk of BERT reaching the optimal objective through such a "lazy" shortcut. In short, the MLM training method of choosing replacement words probabilistically increases BERT's robustness and its ability to extract contextual information.

The replacement ratio was not chosen arbitrarily: it is the best configuration the BERT authors found after trying several ratios during pre-training and comparing them in tests. In the replacement-ratio experiments, two downstream tasks were used as the evaluation standard; the results, shown in the figure below, indicate that a ratio of 8:1:1 yields the best-performing pre-trained language model. In addition, with the MLM training method only 15% of the words in the input text are trained on at each step, whereas GPT computes cross-entropy on every word of the input text. Although BERT's training efficiency is therefore much lower than GPT's, the results below show that the MLM training method gives BERT semantic-understanding ability exceeding all pre-trained language models of the same period, which makes the sacrifice in training efficiency worthwhile.
Figure: Mask word replacement strategy and test results
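To make the efficiency comparison concrete, the two objectives can be written side by side. This is a standard way of formulating them, not a formula quoted from the BERT paper:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(w_i \mid \tilde{w}_1, \tilde{w}_2, \cdots, \tilde{w}_n\right), \qquad \mathcal{L}_{\mathrm{LM}} = -\sum_{i=1}^{n} \log P\left(w_i \mid w_1, w_2, \cdots, w_{i-1}\right)$$

Here $\mathcal{M}$ is the set of masked positions (about 15% of the $n$ tokens) and $\tilde{w}$ denotes the input sequence after the replacement rules above. Per training step, MLM therefore receives a prediction signal from roughly $0.15n$ tokens, while a standard left-to-right language model such as GPT receives one from all $n$ tokens.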

