
Introduction
We talked about BERT in the previous post. If one popular phrase captures what BERT set in motion, it is probably this: what's past is prologue. After BERT appeared, researchers kept exploring it and proposed a stream of improved versions that surpassed BERT on one task after another. The improvements mainly fall into a few categories: more training data, additional pre-training tasks, better masking strategies, changes to the model structure, hyperparameter tuning, and model distillation. The following summarizes the key points of the major BERT successors of recent years.


This post belongs to a series on language models: simple language models (1); language models and their interesting applications (3); contextualized word representations and deep language models (CoVe, ELMo, ULMFiT, GPT, BERT) (4); and BERT's successors (RoBERTa, MASS, XLNet, UniLM, ALBERT, TinyBERT, Electra), covered here.

RoBERTa
Facebook proposed RoBERTa (a Robustly Optimized BERT Pretraining Approach), which again reached state of the art on multiple tasks. So what exactly does it improve? It does not change Google's BERT at the model level; it only changes the pre-training procedure:

Longer training, larger batch size, more training data. Following XLNet, which used roughly 10 times as much data as BERT, RoBERTa also trains on far more data, and performance indeed improves again; of course, this also requires longer training.
Removal of the next sentence prediction (NSP) loss. The authors found that the NSP task is not very helpful for training the language model.
Longer training sequences. Instead of sentence pairs, RoBERTa removes NSP and feeds in multiple consecutive sentences each time, up to the maximum length of 512 tokens (the input can even cross document boundaries).
Dynamic masking. In BERT, each training sample is masked only once when the training data is prepared (and then repeated every epoch), so every subsequent training step sees the same mask; this is the original static masking used by BERT. The authors first tried a modified static mask, copying the dataset 10 times during preprocessing with a different mask for each copy. They then proposed dynamic masking, which performs no masking during preprocessing and instead generates a new mask each time a sequence is fed to the model, so the mask keeps changing. (A minimal sketch of a dynamic-masking collator is given after this list.)
Byte-level BPE for text encoding. The WordPiece tokenizer of the original BERT is character-level and generates new subwords based on likelihood. RoBERTa instead follows GPT-2 and merges the most frequent byte pairs, using bytes rather than Unicode characters as the basic unit of subwords. For background on BPE, some recommended materials: NLP subword algorithms, and the BPE description in the paper.
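As a rough illustration of the difference between static and dynamic masking (my own sketch, not RoBERTa's actual implementation; the special-token ids, vocabulary size, and mask probability are assumptions), the collator below samples a fresh random mask every time a batch is assembled:

```python
import torch

MASK_ID, PAD_ID = 103, 0          # assumed BERT-style special-token ids
VOCAB_SIZE, MASK_PROB = 30522, 0.15

def dynamic_mask(batch_ids: torch.Tensor):
    """Apply a fresh random mask to a batch of token ids (BERT-style 80/10/10 split)."""
    inputs = batch_ids.clone()
    labels = batch_ids.clone()
    # choose positions to mask (never mask padding)
    probs = torch.full(inputs.shape, MASK_PROB)
    probs[inputs == PAD_ID] = 0.0
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                      # loss is computed only on masked positions
    # 80% of masked positions: replace with [MASK]
    replace = torch.bernoulli(torch.full(inputs.shape, 0.8)).bool() & masked
    inputs[replace] = MASK_ID
    # 10%: replace with a random token; remaining 10%: keep the original token
    random_tok = torch.bernoulli(torch.full(inputs.shape, 0.5)).bool() & masked & ~replace
    inputs[random_tok] = torch.randint(VOCAB_SIZE, inputs.shape)[random_tok]
    return inputs, labels

# Because the mask is sampled inside the collator, every epoch sees a different mask.
```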
MASS
MASS (Masked Sequence to Sequence Pre-training), proposed by researchers at Microsoft Research Asia at ICML 2019, is a new general-purpose pre-training method that surpasses BERT and GPT on sequence-to-sequence natural language generation tasks. MASS randomly masks a continuous segment of length k in a sentence and then predicts and generates that segment with an encoder-attention-decoder model. As shown in the figure below, the 3rd to 6th tokens are masked on the encoder side, and the decoder predicts only these consecutive tokens while its other inputs are masked; "_" in the figure denotes a masked token. The setup is somewhat like a Transformer version of CoVe.


MASS pre-training has the following advantages:

Masking out the other tokens on the decoder side (the tokens that are not masked on the encoder side) encourages the decoder to extract information from the encoder side to help predict the consecutive segment, which facilitates joint training of the encoder-attention-decoder structure.
To provide the decoder with more useful information, the encoder is forced to extract the semantics of the unmasked tokens, improving the encoder's ability to understand the source text.
Letting the decoder predict consecutive segments improves the decoder's language modeling ability.
In essence, MASS can be seen as a fusion of BERT and GPT, controlled by the length k of the masked segment. When k = 1, the encoder masks a single token and the decoder predicts that one token; in this setting MASS is equivalent to the masked language model pre-training used in BERT. When k = m (where m is the sequence length), the encoder masks every token and the decoder predicts all of them; since every token on the encoder side is masked, the decoder's attention over the encoder obtains no information, and in this case MASS is equivalent to the standard language model in GPT. (A rough sketch of building a MASS-style training example follows.)
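A minimal sketch of constructing one MASS-style training example under the description above (the token strings and the mask symbol are placeholders; the real implementation differs in detail):

```python
import random

MASK = "[MASK]"

def mass_example(tokens, k):
    """Mask a contiguous span of length k on the encoder side;
    the decoder predicts only that span (its other inputs stay masked)."""
    start = random.randint(0, len(tokens) - k)
    span = tokens[start:start + k]
    enc_input = tokens[:start] + [MASK] * k + tokens[start + k:]
    dec_input = [MASK] + span[:-1]      # the span shifted right by one step
    dec_target = span
    return enc_input, dec_input, dec_target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
print(mass_example(tokens, k=4))
```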

XLNet


The previous post noted that both the AR (autoregressive) and AE (autoencoding) approaches have shortcomings. Briefly: the autoregressive nature of the AR approach lets the model learn the dependencies among the predicted tokens, which BERT's AE objective lacks; conversely, BERT's AE objective learns deep bidirectional context, which unidirectional AR language models such as ELMo and GPT cannot. Generally speaking, AE models like BERT are weaker than AR models on generation tasks. So the natural question is: how can the advantages of the autoregressive and autoencoding models be unified? This is where XLNet makes its entrance.

XLNet points out two problems with BERT. First, pre-training introduces the artificial [MASK] token to mask some words, but this forcibly added token never appears during fine-tuning, so the two stages see inconsistent inputs, which may cost some performance. Second, when several tokens in a sentence are masked during pre-training, BERT assumes they are conditionally independent of each other, yet these tokens are often related; XLNet takes this dependence into account.

The core idea of XLNet is the Permutation Language Model. Concretely, a permutation of the sentence's positions is sampled, a certain number of tokens at the end of that permutation are "covered" (somewhat differently from BERT's direct replacement with [MASK]), and those covered tokens are then predicted autoregressively according to the sampled order. In the paper, the permutation is implemented not by reordering the input but by operating directly on the Transformer's attention mask. (A toy sketch of such a permutation mask is given below.)
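A toy illustration (my own construction, not the paper's code) of how a sampled factorization order can be turned into an attention mask, where position j may attend to position i only if i comes before j in the permutation:

```python
import numpy as np

def permutation_mask(seq_len, rng=np.random.default_rng(0)):
    """Return a sampled factorization order and a mask where mask[j, i] = 1
    means position j may attend to position i."""
    order = rng.permutation(seq_len)           # sampled factorization order
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)           # rank[pos] = index of pos in the order
    mask = (rank[None, :] < rank[:, None]).astype(int)
    return order, mask

order, mask = permutation_mask(6)
print("factorization order:", order)
print(mask)
```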

Two-Stream Self-Attention


First, once the order is permuted, position information becomes crucial. At the same time, what each position must predict is its content (the token at that position), so the input used for prediction must not contain that content, otherwise the model learns nothing and simply copies the input to the output. Position information and content information therefore have to be separated. Besides the usual BERT-style stream whose self-attention input carries position + content, the authors add a second stream that carries position information only and serves as the query in self-attention; the paper calls the former the Content Stream and the latter the Query Stream. The Query Stream can then be used to predict the target positions without leaking the content of the current position. Concretely, two sets of hidden states are maintained, g and h: g carries only position information and serves as Q in self-attention, while h carries content and serves as K and V; for the Content Stream, h serves as Q, K and V. (A schematic of the two streams in code follows the two points below.)


The content stream encodes content information just like ordinary self-attention, but under a look-ahead-style mask: each position can see the content of itself and of the positions before it in the factorization order.
The query stream encodes position information: each position can see its own position information and the content of earlier positions, but not its own content.
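A schematic sketch of the two streams (a single head, no projections, and a causal mask standing in for a real permutation mask; these simplifications are mine, not XLNet's actual code):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, mask):
    """Plain scaled dot-product attention; mask[j, i] = 1 lets position j attend to i."""
    scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(mask == 0, -1e9)   # large negative instead of -inf so fully
    return F.softmax(scores, dim=-1) @ v           # masked rows degrade to an average, not NaN

seq_len, d = 6, 16
h = torch.randn(seq_len, d)            # content stream: position + content
g = torch.randn(seq_len, d)            # query stream: position only
perm_mask = torch.tril(torch.ones(seq_len, seq_len))   # stand-in for a permutation mask

# Content stream: may also see its own content -> keep the diagonal.
h_next = attend(h, h, h, perm_mask)

# Query stream: must NOT see its own content -> drop the diagonal; K and V still come from h.
g_next = attend(g, h, h, perm_mask - torch.eye(seq_len))
```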
Transformer-XL
Transformer-XL and the vanilla Transformer both try to solve the problem that the Transformer cannot model dependencies beyond a fixed length and encodes long texts poorly, i.e. to model long-range dependencies better. The paper's two main ideas for Transformer-XL are relative positional encoding and the segment-level recurrence mechanism, and practice shows that both help considerably on long-document tasks.

Segment Recurrence Mechanism
The idea of the segment recurrence mechanism is to cache all of the hidden states computed for the previous segment in a memory, and, when processing the current segment, concatenate the cached hidden states with the current segment's hidden states to serve as the K and V of the attention computation, so that longer context becomes available. The role of "mem" in the figure below is therefore obvious: it is the memory used by the segment recurrence mechanism, storing the hidden states of the previous segment. (A minimal sketch of this caching is given below.)
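A minimal sketch of the caching idea, assuming a bare single-head attention step; the real Transformer-XL additionally applies causal masking, relative positions, and multi-layer memories:

```python
import torch

def segment_step(segment_h, mem_h):
    """Attend over [memory ; current segment]; return output and updated memory."""
    context = torch.cat([mem_h, segment_h], dim=0)   # K and V come from memory + current segment
    q, k, v = segment_h, context, context
    out = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1) @ v
    new_mem = segment_h.detach()                     # cache current hidden states; no gradient flows back
    return out, new_mem

d, seg_len = 16, 8
mem = torch.zeros(0, d)                              # empty memory before the first segment
for segment in torch.randn(3, seg_len, d):           # three consecutive segments of one document
    out, mem = segment_step(segment, mem)
```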


Relative Positional Encoding
Relative positional encoding no longer cares about a token's absolute position in the sequence, only about relative distances, for example how many tokens apart two tokens are. The improvement scheme given by Transformer-XL is as follows:


Relative Segment Encodings


But when segment encodings meet the segment recurrence mechanism, they run into the same problem as absolute position vectors: two tokens that are clearly not in the same sentence can still end up with the same encoding. So one last patch is needed: two trainable vectors, s+ and s−, representing "in the same segment" and "not in the same segment" respectively. Concretely, an extra term is added when computing attention:


s_ij is taken from the two trainable vectors s+ and s−: when positions i and j come from the same segment, s_ij = s+; otherwise s_ij = s−.

UniLM


The next model to discuss is another BERT-based generation model, UniLM, also from Microsoft; like MASS and XLNet, it targets generation tasks and proposes a new pre-training scheme that builds directly on BERT. The authors exploit the masking idea very cleverly: whatever kind of LM is being trained, what matters in essence is which information each token can access during training, and at the implementation level that is simply a question of how the input is masked. So a Seq2Seq LM can be folded into BERT entirely. With an input of the form S1 [SEP] S2 [SEP], S1 is encoded bidirectionally, while each token in S2 can only access the tokens of S1 and the tokens up to and including itself, as shown in the bottom mask matrix.


Input representation: like BERT, three embeddings are used, but following WordPiece all tokens are split into subwords, which improves the generation model. The authors also emphasize that segment embeddings help distinguish the different LMs.
Bidirectional LM: only padding is masked in the attention matrix.
Unidirectional LM: when predicting a masked token at position t, only the tokens before t and the position itself may be used, which corresponds to a triangular mask matrix; the right-to-left LM is analogous.
Seq2Seq LM: two sentences are fed in; the first uses the bidirectional LM mask and the second uses the unidirectional LM mask, so the encoder (BiLM) and decoder (Uni-LM) are trained at the same time. Some input tokens are also randomly masked. (A toy sketch of these mask matrices follows this list.)
Training: within a batch, the optimization objectives are allocated as 1/3 of the time bidirectional LM plus next sentence prediction, 1/3 of the time Seq2Seq LM, and 1/6 of the time each for the left-to-right and right-to-left LMs.
Masking: 80% of the time a single token is masked at random; 20% of the time a bigram or trigram is masked, to strengthen the model's predictive ability.
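A toy sketch of the three kinds of attention masks described above (lengths are arbitrary; real UniLM also accounts for [CLS]/[SEP] tokens and padding):

```python
import numpy as np

def unilm_masks(len_s1, len_s2):
    """Toy attention masks (1 = may attend) for bidirectional, unidirectional and seq2seq LMs."""
    n = len_s1 + len_s2
    bidirectional = np.ones((n, n), dtype=int)                # every token sees every token
    unidirectional = np.tril(np.ones((n, n), dtype=int))      # left-to-right LM: triangular mask
    seq2seq = np.zeros((n, n), dtype=int)                     # S1 rows never see S2 columns
    seq2seq[:, :len_s1] = 1                                   # everyone sees all of S1
    seq2seq[len_s1:, len_s1:] = np.tril(np.ones((len_s2, len_s2), dtype=int))  # S2 sees its left context
    return bidirectional, unidirectional, seq2seq

bi, uni, s2s = unilm_masks(3, 4)
print(s2s)
```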
UniLM vs MASS
UniLM and MASS share the same goal of unifying BERT with generative models, but personally I find UniLM more elegant. Its unification is simpler, achieved purely from the perspective of the mask matrix, whereas MASS still changes BERT's structure into a Seq2Seq model and uses only the encoder for other tasks, unlike UniLM, which handles everything with a single structure.

ALBERT


This paper starts from a different place than the previous ones: model compression. Many pre-training models that appeared after BERT took BERT as the baseline and claimed to surpass it, but note that most of them are also larger than BERT. This arms-race style of research is of limited value in industry, because however good a model is, if it cannot be deployed online it can only run offline tasks. So model compression is very important.
So how does ALBERT modify BERT?

Factorized Embedding
In BERT, the word embedding dimension E equals the dimension H output by the encoder, both 768. ALBERT argues, however, that word-level embeddings carry no context-dependent information, while the hidden-layer outputs contain both the meaning of the word itself and contextual information, so in theory the hidden representation should carry more information; it is therefore more reasonable to have H >> E.

Moreover, in NLP tasks the vocabulary is usually very large, and the embedding matrix has size V × E. If H = E as in BERT, the embedding matrix holds a huge number of parameters, and during back-propagation its updates are quite sparse, which wastes model capacity.

To address both points, ALBERT uses a factorization to cut the parameter count. In short, a linear mapping is inserted between the input and the embedding output: the one-hot vector is first mapped into a low-dimensional space of size E and then projected up into a high-dimensional space. Put plainly, the input first passes through a very low-dimensional embedding matrix and then through a high-dimensional projection into the hidden layer's space, reducing the parameter count from O(V × H) to O(V × E + E × H); when E << H, the saving is significant.
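A minimal sketch of the factorization, with illustrative sizes (the module names are my own, not ALBERT's code):

```python
import torch.nn as nn

V, E, H = 30000, 128, 768

# BERT-style: one big embedding straight into the hidden size
bert_style = nn.Embedding(V, H)                      # V*H = 23.0M parameters

# ALBERT-style: low-dimensional embedding followed by a projection
albert_style = nn.Sequential(
    nn.Embedding(V, E),                              # V*E ~ 3.8M parameters
    nn.Linear(E, H, bias=False),                     # E*H ~ 0.1M parameters
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(bert_style), count(albert_style))        # roughly 23.0M vs 3.9M
```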

Cross-layer parameter sharing
Vanilla BERT has 12 layers, each containing an FFN sub-layer and an attention sub-layer, and after training their parameters all differ. The authors observed that many of the parameters are actually similar, as shown in the figure below. With cross-layer parameter sharing, the parameters of some layers (or sub-layers) can be shared, which saves the space needed to store them. Extensive experimentation yields the following lessons:


(1) Sharing the FFN parameters causes a drop in model quality.
(2) Sharing the attention parameters causes no drop (when the embedding dimension E=128) or only a slight drop (when E=768).
(3) The L layers can be divided into N groups of M layers each, with parameters shared within each group.
At most, this reduces the parameter count from O(12 × L × H × H) to O(12 × H × H). (A minimal sketch of layer sharing follows.)
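A minimal sketch of cross-layer sharing using PyTorch's built-in encoder layer as a stand-in for BERT's layer (not ALBERT's actual code):

```python
import torch
import torch.nn as nn

H, L = 768, 12
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

def albert_encoder(x, num_layers=L):
    """Apply the SAME layer L times instead of stacking L distinct layers."""
    for _ in range(num_layers):
        x = shared_layer(x)
    return x

x = torch.randn(2, 16, H)          # (batch, seq_len, hidden)
out = albert_encoder(x)            # parameters are stored once but used 12 times
```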

Remove the NSP task, use SOP instead
BERT's NSP task is a binary classification: positive samples are two consecutive sentences sampled from the same document, and negative samples are sentences taken from two different documents. The design of NSP in effect mixes two tasks, topic prediction and coherence prediction, into one. Topic prediction is much easier to learn than coherence prediction, and a lot of topic information is already picked up while training the masked language model.

To keep only the coherence task and remove the influence of topic identification, ALBERT proposes a new task, sentence-order prediction (SOP). SOP's positive samples are obtained the same way as NSP's, while its negative samples are the same sentence pair with the order reversed. By changing how positive and negative samples are constructed, the topic signal is removed and pre-training can focus on predicting sentence-level coherence.

Remove dropout
The ALBERT authors also noticed something interesting: after 1M training steps the model still had not overfit, so they simply removed dropout, and, unexpectedly, this even improved performance on downstream tasks somewhat.

TinyBERT
BERT works well, but the model is too large and too slow, so some form of model compression is needed. TinyBERT, proposed by researchers from Huazhong University of Science and Technology and Huawei, is a compressed BERT obtained mainly through model distillation. On GLUE it retains about 96% of BERT-base's performance while being 7x smaller and 9x faster.

Model distillation. Distillation is a common model compression method: first a large teacher model is trained, and then a small student model is trained on the predictions output by the teacher. By learning the teacher's predicted probabilities, the student absorbs the teacher's generalization ability. (A minimal sketch of the soft-label distillation loss is given below.)
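A minimal sketch of the classic soft-label distillation loss with a temperature (generic knowledge distillation, not TinyBERT's full objective):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T

student_logits = torch.randn(8, 2)     # e.g. a binary-classification batch
teacher_logits = torch.randn(8, 2)
loss = distillation_loss(student_logits, teacher_logits)
```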


Model innovations
(1) The TinyBERT model proposes a knowledge distillation method for the transformer model.
(2) The TinyBERT model proposes a two-stage learning framework: a general knowledge distillation stage, and a task-specific distillation stage.

The Transformer-based knowledge distillation method
(1) The model is simplified along two axes: the number of layers and the hidden (vector) dimension.
(2) Three kinds of loss are designed: on the output of the embedding layer; on the hidden states and attention matrices of each Transformer layer; and on the logits output by the final layer.
(3) The attention weights learned by BERT contain latent linguistic information, so in the transfer from the teacher network (BERT) to the student network (TinyBERT), this semantic information is carried over as well. (A rough sketch of the layer-wise losses follows.)
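A rough sketch of the layer-wise distillation losses described above (the projection that maps the student's hidden size to the teacher's, the shapes, and the equal head count are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_student, d_teacher = 312, 768
proj = nn.Linear(d_student, d_teacher, bias=False)   # maps student space into teacher space

def layer_losses(s_hidden, t_hidden, s_attn, t_attn, s_emb, t_emb):
    """MSE on embeddings, projected hidden states, and attention matrices."""
    emb_loss = F.mse_loss(proj(s_emb), t_emb)
    hidden_loss = F.mse_loss(proj(s_hidden), t_hidden)
    attn_loss = F.mse_loss(s_attn, t_attn)            # same (heads, seq, seq) shape assumed
    return emb_loss + hidden_loss + attn_loss

s_h, t_h = torch.randn(16, d_student), torch.randn(16, d_teacher)
s_a, t_a = torch.rand(12, 16, 16), torch.rand(12, 16, 16)
s_e, t_e = torch.randn(16, d_student), torch.randn(16, d_teacher)
loss = layer_losses(s_h, t_h, s_a, t_a, s_e, t_e)
```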

Two-stage distillation framework
(1) In the general knowledge distillation stage, reducing the number of layers and the hidden dimension inevitably costs some model quality.
(2) In the task-specific (fine-tuning) distillation stage, the model uses data augmentation.

Data augmentation
(1) Mask one word in the text and use the BERT language model to predict the M most likely words for that position as a candidate set.
(2) Use a threshold probability p to decide whether to replace the masked word with a randomly chosen word from the candidate set. If the word is split into multiple word pieces, replace it using GloVe's fixed word vectors instead of a candidate output by the BERT model.
(3) Repeat the above for every word in the text to obtain a new text. (A rough sketch of this loop follows.)
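A rough sketch of the augmentation loop under the steps above; `tokenizer_pieces`, `predict_topM_with_bert`, and `glove_nearest` are hypothetical helpers standing in for a subword tokenizer, a masked-LM call, and a GloVe nearest-neighbour lookup:

```python
import random

def augment(tokens, tokenizer_pieces, predict_topM_with_bert, glove_nearest, M=15, p=0.4):
    """Return one augmented copy of `tokens` following the three steps above."""
    new_tokens = list(tokens)
    for i, word in enumerate(tokens):
        if random.random() >= p:                 # keep the original word with probability 1 - p
            continue
        if len(tokenizer_pieces(word)) == 1:     # single word piece: use BERT's candidates
            candidates = predict_topM_with_bert(tokens, position=i, M=M)
        else:                                    # multi-piece word: fall back to GloVe neighbours
            candidates = glove_nearest(word, M=M)
        new_tokens[i] = random.choice(candidates)
    return new_tokens
```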

ELECTRA
The main improvement of ELECTRA over BERT is that it proposes a new pre-training task and framework, replacing the generative masked language model (MLM) pre-training task with a discriminative replaced token detection (RTD) task: judging whether each token has been replaced by a language model. The overall structure of the model is as follows: an MLM Generator-BERT (generator) alters the input sentence, which is then passed to the Discriminator-BERT (discriminator) to determine which tokens have been changed.


Generator: a small masked language model (typically about 1/4 the size of the discriminator) that performs the classic BERT MLM task. Discriminator: takes as input the sequence corrupted by the generator and must decide, for each input token, whether it is original or replaced. Note that if the token sampled by the generator happens to equal the original token, that token still counts as original. (A sketch of how the RTD labels are built is given below.)
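A minimal sketch of building the discriminator's inputs and labels from the generator's samples; `generator_sample` is a hypothetical stand-in for sampling token ids from the generator's MLM distribution:

```python
import torch

def build_rtd_batch(original_ids, masked_positions, generator_sample):
    """Corrupt the masked positions with generator samples and label replaced tokens."""
    corrupted = original_ids.clone()
    sampled = generator_sample(original_ids, masked_positions)   # token ids proposed by the generator
    corrupted[masked_positions] = sampled
    # A position counts as 'replaced' only if the sampled token differs from the original one.
    labels = (corrupted != original_ids).long()
    return corrupted, labels
```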

The model is trained by minimizing the combined loss:
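Written out (as in the ELECTRA paper, where λ weights the discriminator loss and is set to 50), the combined objective is roughly:

min over θ_G, θ_D of Σ_{x ∈ X} [ L_MLM(x, θ_G) + λ · L_Disc(x, θ_D) ]

where L_MLM is the generator's masked-language-model loss and L_Disc is the discriminator's replaced token detection loss.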


Why joint training?


Why does joint training give good results? Intuitively, think of the generator as the question setter and the discriminator as the answerer. As training proceeds, the questions posed by the generator become progressively harder, and the discriminator improves along with them. It is not the case that the generator poses very hard questions right from the start, from which the answerer could learn nothing at all.
 


Origin blog.csdn.net/qq_39970492/article/details/131227009