Those Things About Masks in Transformers

Masking is inspired by the cloze task. The Transformer architecture consists of an encoder and a decoder. During encoding, the goal is to let the model see information both before and after the current position, so no attention mask is required. During decoding, however, to simulate the real inference scenario, the current position must not see future positions while still using the information from previous positions, so a causal attention mask is added during training. This keeps training consistent with inference and effectively improves generalization.
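A minimal sketch of such a causal (decoder-side) mask, assuming PyTorch (the original post names no framework): future positions are set to negative infinity before the softmax so they receive zero attention weight.

```python
import torch

def causal_attention_scores(q, k):
    """Scaled dot-product attention weights with a causal mask.

    q, k: tensors of shape (seq_len, d_k). Each position may attend only
    to itself and to earlier positions, mimicking autoregressive decoding.
    """
    seq_len, d_k = q.shape
    scores = q @ k.transpose(0, 1) / d_k ** 0.5              # (seq_len, seq_len)

    # Lower-triangular matrix: entry (i, j) is True only when j <= i.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

    # Future positions get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

weights = causal_attention_scores(torch.randn(5, 64), torch.randn(5, 64))
print(weights[0])  # the first position can only attend to itself
```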

For the original BERT, during training the smallest input units (tokens) of a sentence are randomly selected for masking. Because subword tokenization is used (WordPiece, which is closely related to Byte Pair Encoding, BPE), these smallest units can also be regarded as subwords; for example, "superman" is split into two subwords, "super" + "man".

The BERT model uses two pre-training tasks: Masked LM and Next Sentence Prediction. During masking, 15% of the tokens are masked and the model is trained to predict them. BERT treats each token independently, that is, phrase information is not considered when masking. For example, for the sentence "The author of Jingyesi is Li Bai", BERT may produce the masked input "The author of Jingyesi is [MASK] Bai".
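A minimal sketch of this token-level masking (the 15% rate comes from the post; the implementation details below are illustrative assumptions, not BERT's actual code):

```python
import random

MASK_TOKEN = "[MASK]"

def random_token_mask(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK], ignoring phrase boundaries.

    Returns the corrupted token list and the indices to be predicted.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            corrupted[i] = MASK_TOKEN
            targets.append(i)
    return corrupted, targets

tokens = ["the", "author", "of", "jing", "##ye", "##si", "is", "li", "bai"]
# A higher rate is used here only so the effect is visible on a short sentence.
print(random_token_mask(tokens, mask_prob=0.3, seed=0))
```

Because each token is treated independently, a single subword (e.g. "li" in "li bai") can be masked while its neighbours stay intact, which is exactly the situation described above.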

This approach is somewhat problematic: the [MASK] token never appears in the fine-tuning stage, which causes an inconsistency between the pre-training task and the downstream fine-tuning task.

Later, the BERT authors proposed a technique called whole word masking (wwm) to improve the original masking in the MLM task. In this setting, the WordPiece tokens to be masked are no longer selected independently at random (Wu et al., 2016); instead, all tokens belonging to the same full word are always masked together. This explicitly forces the model to recover full words in the MLM pre-training task rather than just recovering individual WordPiece tokens (Cui et al., 2019a).
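A sketch of whole word masking on WordPiece output (the "##" continuation prefix is WordPiece's convention; the grouping logic is an illustrative assumption):

```python
import random

MASK_TOKEN = "[MASK]"

def whole_word_mask(tokens, mask_prob=0.15, seed=None):
    """Group WordPiece tokens into words, then mask whole words at once."""
    rng = random.Random(seed)

    # Group token indices: a token starting with "##" continues the previous word.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)

    corrupted = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:  # every subword of the chosen word is masked together
                corrupted[i] = MASK_TOKEN
    return corrupted

print(whole_word_mask(["super", "##man", "saves", "the", "city"], mask_prob=0.5, seed=1))
```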

In the paper "ERNIE: Enhanced Representation through Knowledge Integration", the authors argue that BERT's practice of masking individual tokens ignores prior knowledge in the sentence. For example, for the sentence "The author of Harry Potter is JK Rowling", if the model masks a random token inside "Harry Potter", it can easily fill it in from the rest of the entity without using any knowledge from the sentence. However, if the entire entity "Harry Potter" is masked, BERT cannot predict it correctly, indicating that BERT does not make good use of the knowledge in the whole sentence.

ERNIE therefore proposed a new strategy called knowledge masking. It mainly includes phrase masking and entity masking, where phrases and entities may consist of multiple words. By masking entire phrases or entities in a sentence and predicting them as a whole, ERNIE can better capture the relationships between phrases and entities. A figure in the ERNIE paper illustrates the difference between the BERT and ERNIE masking strategies.
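As an illustrative sketch (the span list and helper below are assumptions, not ERNIE's actual implementation), knowledge masking can be viewed as masking pre-identified entity or phrase spans as whole units:

```python
import random

MASK_TOKEN = "[MASK]"

def knowledge_mask(tokens, spans, mask_prob=0.15, seed=None):
    """Mask whole entity/phrase spans instead of individual tokens.

    `spans` is a list of (start, end) index pairs produced by an external
    phrase chunker or entity recognizer (not shown here).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    for start, end in spans:
        if rng.random() < mask_prob:
            for i in range(start, end):
                corrupted[i] = MASK_TOKEN
    return corrupted

tokens = ["the", "author", "of", "harry", "potter", "is", "jk", "rowling"]
entity_spans = [(3, 5), (6, 8)]  # "harry potter", "jk rowling"
print(knowledge_mask(tokens, entity_spans, mask_prob=0.6, seed=3))
```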

RoBERTa keeps the original BERT architecture but makes more careful modifications, arguing that BERT was undertrained and its capabilities underestimated. The authors carefully compared the various components of BERT, including the masking strategy, the number of training steps, and so on. After a thorough evaluation, they drew several useful conclusions that make BERT more powerful, one of which is dynamic masking.

In RoBERTa, the original data is duplicated n times and a different random static mask is applied to each copy, so the masking differs across copies. The data collator in Hugging Face uses dynamic masking instead: the data is not duplicated, but the masking is redone for every epoch, so the mask seen in each epoch is different and the effect of dynamic masking is achieved.
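For example, with the Hugging Face transformers library, dynamic masking is provided by DataCollatorForLanguageModeling, which re-masks every batch as it is assembled (a minimal sketch; dataset and training setup are omitted):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# mlm_probability controls the masking rate; masking is applied afresh each
# time a batch is built, so the same example gets a different mask per epoch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = collator([tokenizer("The author of Jingyesi is Li Bai")])
print(batch["input_ids"])  # corrupted positions follow the default 80-10-10 rule
print(batch["labels"])     # -100 everywhere except the positions to predict
```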

MacBERT proposed an interesting masking scheme:

  • Whole word masking and N-gram masking strategies are used to select candidate tokens for masking, with word-level unigram to 4-gram ratios of 40%, 30%, 20%, and 10%.
  • MacBERT proposes not to use the [MASK] token for masking, because [MASK] never appears in the fine-tuning stage; instead, similar words are used for masking. Similar words are obtained with the Synonyms toolkit (Wang and Hu, 2017), which computes similarity based on word2vec (Mikolov et al., 2013). If an N-gram is selected for masking, similar words are found for each of its words. In the rare case that no similar word exists, it falls back to random word replacement.
  • 15% of the input words are masked, of which 80% are replaced with similar words, 10% are replaced with random words, and the remaining 10% keep the original words (see the sketch after this list).
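An illustrative sketch of this selection and replacement scheme (the similar_word helper stands in for the Synonyms toolkit lookup, and the span-selection details are assumptions, not MacBERT's actual code):

```python
import random

def macbert_style_mask(words, similar_word, mask_prob=0.15, seed=None):
    """Select N-gram spans (1-4 words, ratio 40/30/20/10) and corrupt them:
    80% similar-word replacement, 10% random word, 10% kept unchanged."""
    rng = random.Random(seed)
    corrupted = list(words)
    budget = max(1, round(len(words) * mask_prob))  # how many words to corrupt
    i = 0
    while budget > 0 and i < len(words):
        if rng.random() < mask_prob:
            # Choose the span length with the 40/30/20/10 ratio.
            n = rng.choices([1, 2, 3, 4], weights=[40, 30, 20, 10])[0]
            n = min(n, budget, len(words) - i)
            for j in range(i, i + n):
                r = rng.random()
                if r < 0.8:
                    # Fall back to a random word when no similar word exists.
                    corrupted[j] = similar_word(words[j]) or rng.choice(words)
                elif r < 0.9:
                    corrupted[j] = rng.choice(words)
                # else: keep the original word
            budget -= n
            i += n
        else:
            i += 1
    return corrupted

# Toy similarity lookup standing in for the Synonyms toolkit.
toy_synonyms = {"author": "writer", "poem": "verse"}
print(macbert_style_mask(
    ["the", "author", "of", "the", "poem", "is", "li", "bai"],
    similar_word=toy_synonyms.get, mask_prob=0.4, seed=7))
```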

The ablation experiments confirm the effectiveness of these choices. The overall average score is obtained by averaging the test scores of each task (the EM and F1 metrics are averaged before the overall average). Overall, removing any component of MacBERT results in a decrease in average performance, suggesting that all modifications contribute to the overall improvement. The most effective modifications are the N-gram masking and the similar-word replacement, which change the masked language model task itself. Comparing the two reveals clear trade-offs: N-gram masking seems to be more effective on text classification tasks, while reading comprehension tasks seem to benefit more from similar-word replacement. Combining the two lets them compensate for each other and achieve better performance on both types of tasks.

A new point of view was put forward in "Should You Mask 15% in Masked Language Modeling?". Previous masked language models usually use a masking rate of 15%, based on the belief that masking more would not leave enough context to learn good representations, while masking less would make training too expensive. Surprisingly, the authors find that masking 40% of the input can outperform the 15% baseline, as measured by fine-tuning on downstream tasks, and that even masking 80% of the tokens preserves most of the performance.

The results show that masking up to 50% achieves comparable or even better results than the default 15% masking, and that masking 40% achieves the best overall downstream performance (although the optimal masking rate varies across downstream tasks). In other words, language model pre-training does not have to use a masking rate of 15% or less; for large models trained with efficient pre-training recipes, the optimal masking rate can be as high as 40%.

The paper also pushes back on the 80-10-10 corruption rule that MacBERT (like BERT) builds on.

Since 2019, most work has assumed that it is beneficial to keep 10% of the selected tokens unchanged and to replace another 10% with random tokens. The 80-10-10 rule has since been widely adopted in almost all MLM pre-training work. The motivation is that masked tokens create a mismatch between pre-training and downstream fine-tuning, and that using original or random tokens as alternatives to [MASK] can alleviate this gap. Under this reasoning, masking more of the context should widen the mismatch even further, yet the authors observe stronger downstream performance with higher masking rates. This raises the question of whether the 80-10-10 rule is needed at all.

Based on the experimental results, the authors observe that both same-token prediction (keeping the original token) and random-token corruption degrade performance on most downstream tasks, and the 80-10-10 rule is less effective than simply using [MASK] everywhere. This suggests that in the fine-tuning paradigm, a model pre-trained only with [MASK] can quickly adapt to complete, uncorrupted sentences without needing original or random replacements. In view of these results, the authors recommend using only [MASK] for pre-training.
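A minimal sketch contrasting the two corruption schemes (the function and its arguments are illustrative assumptions, not the paper's code):

```python
import random

MASK_TOKEN = "[MASK]"

def corrupt(tokens, vocab, mask_prob=0.15, use_80_10_10=True, seed=None):
    """Corrupt selected positions with either the 80-10-10 rule or [MASK] only."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for i in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue                            # position not selected
        if not use_80_10_10:
            corrupted[i] = MASK_TOKEN           # [MASK]-only, as recommended
            continue
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN           # 80%: [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)    # 10%: random token
        # else: 10% keep the original token
    return corrupted

vocab = ["cat", "dog", "poem", "river", "moon"]
tokens = ["the", "author", "of", "the", "poem", "is", "li", "bai"]
# A higher rate is used here only so the effect is visible on a short sentence.
print(corrupt(tokens, vocab, mask_prob=0.3, use_80_10_10=False, seed=0))
```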

Origin blog.csdn.net/bruce__ray/article/details/131144450