BERT (Part 2): Shortcomings of BERT

What are the limitations of BERT?

The XLNet paper points out two main shortcomings of BERT; together with a couple of other commonly noted issues, they are as follows:

  • In BERT's pre-training stage (masked language modeling), several words in a sentence are masked at once, and the model predicts each of them as if they were conditionally independent of one another given the unmasked context. In reality these words are often related. Take "New York is a city" and suppose we mask the two words "New" and "York": given only "is a city", "New" and "York" are not independent, because "New York" is an entity, and the probability of "York" appearing after "New" is far higher than the probability of "York" appearing after "Old". (A small demonstration of this independence assumption follows the list.)
    • That said, this is not a serious problem, and arguably has little effect on the final result, because BERT's pre-training corpus is massive (often tens of gigabytes). If the training data is large enough, the dependency lost in the current example can be made up for by other examples in which those words are not masked together, so the model can still learn how they depend on each other.
  • BERT uses a special [MASK] token during pre-training, but that token never appears in downstream fine-tuning, which creates a mismatch between pre-training and fine-tuning: at fine-tuning time there is no [MASK] for the model to focus on. To ease this, only 80% of the selected tokens are actually replaced with [MASK]; this alleviates the discrepancy but does not remove it.
  • Compared with a traditional (autoregressive) language model, only 15% of the tokens in each training batch serve as prediction targets, which makes the model need more training steps to converge. (A rough sketch of this 15% / 80-10-10 masking scheme also follows the list.)

  • Another shortcoming is that BERT applies [MASK] after subword segmentation: to handle the OOV problem, a word is usually split into finer-grained WordPieces, and BERT masks these WordPieces at random during pre-training, so sometimes only part of a word gets masked.
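To make the first bullet concrete, here is a minimal demonstration (added as an illustration, not part of the original post) using the Hugging Face transformers library; the choice of bert-base-cased is just a convenient assumption. Both masked positions are filled from the same corrupted context, and the MLM loss simply sums the per-position cross-entropy terms, so nothing ties the prediction for "York" to the prediction made for "New":

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

inputs = tokenizer("New York is a city", return_tensors="pt")
input_ids = inputs["input_ids"].clone()

# Token layout is [CLS] New York is a city [SEP], so "New" and "York"
# sit at positions 1 and 2 (assuming both are single WordPieces here).
input_ids[0, 1] = tokenizer.mask_token_id
input_ids[0, 2] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=input_ids,
                   attention_mask=inputs["attention_mask"]).logits

# Each [MASK] is predicted from the same corrupted context, independently
# of whatever the model would put at the other masked position.
for pos in (1, 2):
    top = logits[0, pos].softmax(dim=-1).topk(5)
    print(pos, tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```

An autoregressive model such as XLNet, by contrast, predicts the second token conditioned on the first, so the two predictions stay consistent with each other.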
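For the second and third bullets, the 15% target selection and the 80/10/10 replacement rule can be sketched roughly as follows (a simplified toy version, not BERT's actual pre-training code; the tiny vocabulary is invented for the example):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["new", "york", "old", "is", "a", "city", "pro", "##babi", "##lity"]

def bert_style_mask(tokens, mask_prob=0.15, seed=0):
    """Pick ~15% of positions as prediction targets; of those, replace
    80% with [MASK], 10% with a random token, and leave 10% unchanged."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            # ~85% of tokens are never used as prediction targets,
            # which is why convergence needs more training steps.
            continue
        labels[i] = tok                       # the original token is the label
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: the special [MASK] token
        elif r < 0.9:
            corrupted[i] = rng.choice(TOY_VOCAB)  # 10%: a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

print(bert_style_mask(["new", "york", "is", "a", "city"], mask_prob=0.5))
```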

For example, returning to the partial-masking issue from the last bullet above:


The word "probability" is split into three WordPieces: "pro", "##babi" and "##lity". A random mask might cover only "##babi", leaving "pro" and "##lity" unmasked. That makes the prediction task much easier, because essentially only "##babi" can appear between "pro" and "##lity"; the model just has to memorize certain WordPiece sequences to solve it, rather than predicting from the semantic relationship with the context. Something similar happens in Chinese: in a word like "模型" (model) only one character might be masked (the example of "琵琶" (pipa) is perhaps even better, since those two characters essentially only occur together, never alone), which again makes the prediction too easy.

To solve this problem, the natural idea is to treat the word as a whole: either all of its WordPieces are masked, or none of them are. This is the so-called Whole Word Masking. The idea is very simple and requires only a small change to BERT's code, namely in the part that generates the masks.
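A minimal sketch of that change, assuming the usual "##" prefix marks continuation WordPieces (only the grouping and the per-word masking decision are shown; real implementations such as BERT-wwm still apply the 80/10/10 replacement rule to the selected words):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Group each head WordPiece with its '##' continuations, then decide
    masking per whole word, so a word is either fully masked or untouched."""
    rng = random.Random(seed)
    words = []                            # each entry: positions of one whole word
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)           # continuation piece joins the current word
        else:
            words.append([i])             # a new word starts here
    corrupted = list(tokens)
    for positions in words:
        if rng.random() < mask_prob:
            for i in positions:           # mask every piece of the chosen word
                corrupted[i] = "[MASK]"
    return corrupted

print(whole_word_mask(["the", "pro", "##babi", "##lity", "is", "low"], mask_prob=0.5))
```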


Reference links:

Notes & thoughts on some problems of the BERT model

BERT: details that are easily overlooked


Originally published at blog.csdn.net/katrina1rani/article/details/111699033