Learning Notes on Classic BERT Variants

ALBERT

ALBERT sets out to solve the problems of large parameter counts and long training time. The smallest ALBERT has only a dozen or so million parameters, with performance 1-2 points below BERT, while the largest (xxlarge) has more than 200 million. The reduction in parameter count is therefore quite significant, but the speedup is less obvious. The biggest issue is that the method does not actually reduce the amount of computation: inference time is not reduced, and whether training time really drops is also open to question.

The overall model still follows the BERT skeleton, using the Transformer and the GELU activation function.
There are three specific innovations:

  1. Factorized embedding parameterization;
  2. Cross-layer parameter sharing;
  3. Dropping the original NSP task and replacing it with the SOP task.

The first two of these three changes are mainly about reducing parameters. The third is not really new: plenty of earlier work has already shown that BERT's next-sentence prediction task does not contribute much. According to the paper's experiments, parameter sharing has the largest effect on parameter reduction, and it also affects the overall performance of the model. The three changes are described in detail below.

Factorized embedding parameterization

In the original BERT and the various Transformer-based pre-trained language models, the input satisfies E = H, where E is the embedding size and H is the hidden size, i.e. the input and output dimension of the Transformer.

  • This causes a problem: when the hidden size grows, the embedding size must grow with it, so the embedding matrix becomes larger.
  • The author therefore decouples E and H. Concretely, a projection matrix is added after the embedding to transform the dimension: E stays fixed, and after H is increased, the embedding output is projected up from E to H. This changes the embedding parameter count from O(V × H) to O(V × E + E × H), which is a clear saving when E is much smaller than H. A minimal sketch follows.
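
As an illustration only (not the official ALBERT code), a factorized embedding can be sketched in PyTorch as an embedding into a small dimension E followed by a linear projection up to H; the vocabulary size, E, and H below are assumed values:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Embed into a small dimension E, then project up to the hidden size H."""
    def __init__(self, vocab_size=30000, embedding_size=128, hidden_size=768):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_size)  # V x E
        self.projection = nn.Linear(embedding_size, hidden_size)         # E x H
    def forward(self, input_ids):
        return self.projection(self.word_embeddings(input_ids))          # (batch, seq, H)

# Parameter count: V*E + E*H = 30000*128 + 128*768 ≈ 3.9M,
# versus V*H = 30000*768 ≈ 23M for the untied version.
```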

Cross-layer parameter sharing

In the original Transformer, the parameters of each layer are independent, including the self-attention and the fully connected parts, so the parameter count grows significantly as the number of layers increases.
Previous work has tried sharing only the self-attention or only the fully connected parameters, with some success.

  • Here the author shares all the parameters across layers, which effectively makes the multi-layer model a repeated application of a single Transformer layer. Experiments also show that parameter sharing improves the stability of the model. A minimal sketch of weight sharing follows.
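
As a rough sketch (not ALBERT's actual implementation), cross-layer parameter sharing simply applies the same layer object repeatedly instead of stacking independent copies; the layer sizes below are assumptions:

```python
import torch.nn as nn

class SharedTransformerEncoder(nn.Module):
    """Apply one Transformer encoder layer num_layers times, reusing its weights."""
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # A single layer whose parameters are shared across all "layers".
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, activation="gelu", batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, hidden_states):
        for _ in range(self.num_layers):          # same weights every iteration
            hidden_states = self.shared_layer(hidden_states)
        return hidden_states
```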

Inter-sentence coherence loss

Here the author uses a new loss, which replaces one of the original BERT's auxiliary tasks, NSP. NSP predicts the next sentence, i.e. whether one segment is the continuation of another.

  • The problem with this task lies in the training data: positive examples are two consecutive segments from the same document, while negative examples are two segments from different documents. As a result, the task also includes topic prediction, and topic prediction is much easier than predicting the coherence of two sentences.

  • The new method is sentence-order prediction (SOP). Positive examples are constructed the same way as in NSP, but negative examples are the same two segments with their order swapped. Experiments show that this works much better. ALBERT is probably not the first to do this; Baidu's ERNIE appears to use something similar. A sketch of the data construction follows.
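
Purely as an illustration (not the paper's data pipeline), SOP training pairs can be built from consecutive segments of a document like this:

```python
import random

def make_sop_example(segment_a, segment_b):
    """Build one SOP training example from two consecutive segments of the same document.

    Positive (label 1): (A, B) in the original order.
    Negative (label 0): (B, A), i.e. the same segments with their order swapped.
    """
    if random.random() < 0.5:
        return (segment_a, segment_b), 1   # correct order
    else:
        return (segment_b, segment_a), 0   # swapped order

# Example usage with two consecutive sentences from one document:
pair, label = make_sop_example("He opened the door.", "The room was empty.")
```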


The impact of the embedding size:

For the version without parameter sharing, the effect keeps improving as E increases.

In the parameter-sharing version this is no longer the case: the best-performing configuration is not the one with the largest E. We can also see that parameter sharing costs roughly 1-2 points of performance.


RoBERTa

The main contributions of the model are:

  • (1) training the model longer, with bigger batches, over more data;
  • (2) removing the next-sentence prediction objective;
  • (3) training on longer sequences;
  • (4) dynamically changing the masking pattern applied to the training data. The authors also collect a large new dataset (CC-NEWS), comparable in size to other privately used datasets, to better control for training set size effects.

Dynamic Masking

Static masking: the original BERT uses static masking, i.e. the data is masked in advance when the pre-training data is created. To make full use of the data, a dupe_factor is defined so that the training data is duplicated dupe_factor times and the same text can receive different masks.

  • Note that these copies are not all fed into the same epoch; they are spread over different epochs. For example, with dupe_factor=10 and 40 epochs, each mask pattern is used 4 times during training.

The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.

  • Dynamic masking: every time a training example is fed to the model, a new random mask is generated. A minimal sketch follows.
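
As a simplified sketch (not RoBERTa's actual preprocessing code), dynamic masking just re-samples the masked positions every time a sequence is drawn; the 15% masking rate follows BERT, and the token IDs below are assumptions:

```python
import random

MASK_ID = 103              # assumed [MASK] token id
SPECIAL_IDS = {101, 102}   # assumed [CLS]/[SEP] ids that should never be masked

def dynamic_mask(token_ids, mask_prob=0.15):
    """Return a freshly masked copy of token_ids; called anew at every epoch/step."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 = ignore in the MLM loss
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS:
            continue
        if random.random() < mask_prob:
            labels[i] = tok
            masked[i] = MASK_ID        # (the 80/10/10 replacement rule is omitted here)
    return masked, labels
```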

No NSP and Input Format

NSP: with probability 0.5, two consecutive segments from the same document; with probability 0.5, segments from different documents.

  1. Segment-pair + NSP: the BERT-style input.
  2. Sentence-pair + NSP: two consecutive sentences + NSP, with a larger batch size.
  3. Full-sentences: if the maximum input length is 512, pack consecutive sentences until the length reaches 512; if the packing crosses a document boundary, add a special separator in between. No NSP. The experiments use this setting because the batch size can be kept fixed (see the packing sketch after this list).
  4. Doc-sentences: same as Full-sentences, but without crossing document boundaries. No NSP. Best results.
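
As an illustration only (not the paper's implementation), Full-sentences packing can be sketched as greedily concatenating sentences up to the 512-token limit, inserting an assumed separator token when a document boundary is crossed:

```python
SEP = "[SEP]"   # assumed document-boundary separator
MAX_LEN = 512

def pack_full_sentences(documents, tokenize):
    """documents: list of lists of sentences; tokenize: sentence -> list of tokens.

    Sentences longer than MAX_LEN are not split in this sketch.
    """
    packed, current = [], []
    for doc in documents:
        for sent in doc:
            tokens = tokenize(sent)
            if current and len(current) + len(tokens) > MAX_LEN:
                packed.append(current)     # emit a full training sequence
                current = []
            current.extend(tokens)
        current.append(SEP)                # mark the document boundary
    if current:
        packed.append(current)
    return packed
```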

Text Encoding

The original BERT uses a character-level BPE vocabulary of size 30K.
RoBERTa instead uses GPT-2's byte-level BPE implementation, with bytes rather than unicode characters as the basic unit of the subwords. A small usage example follows.
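
For illustration, the same byte-level BPE can be tried via the Hugging Face transformers library (assuming it is installed); this is just a demonstration of byte-level BPE, not RoBERTa's training code:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # GPT-2's byte-level BPE
tokens = tokenizer.tokenize("RoBERTa uses byte-level BPE.")
ids = tokenizer.encode("RoBERTa uses byte-level BPE.")
print(tokens)   # subword pieces built from bytes, so any input string can be encoded
print(ids)
```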



XLNet

  • XLNet sets out to combine the autoregressive language model and the autoencoding language model.

Autoregressive Language Model (Autoregressive LM)

Before ELMo/BERT came out, what people usually meant by a language model was predicting the next word from the preceding context, i.e. the familiar left-to-right language modeling task, or the reverse, predicting the previous word from the following context. This type of LM is called an autoregressive language model.

  • GPT is a typical autoregressive language model. Although ELMo appears to use both the left and the right context, it is still essentially an autoregressive LM, which has to do with how the model is implemented. ELMo trains two directions (a left-to-right and a right-to-left language model), each of which is an autoregressive LM, and then concatenates the hidden states of the two LSTM directions to reflect a bidirectional language model. So it is really a concatenation of two autoregressive language models and is still autoregressive in essence.

Autoregressive language models have advantages and disadvantages

The disadvantage is that it can only use the left context or the right context, not both at the same time. ELMo seems to address this by training both directions and concatenating them, but because the fusion is too simple, the effect is actually not very good.

The advantage is related to downstream NLP tasks, especially generation tasks such as text summarization and machine translation: actual generation proceeds from left to right, and an autoregressive language model naturally matches this process. BERT's DAE-style model, in contrast, suffers from a mismatch between the training process and the generation process, which is part of why it has not done well on generation tasks so far.

Autoencoder LM

An autoregressive language model can only predict the next word from the preceding context, or conversely the previous word from the following context.

In contrast, BERT randomly masks out some words in the input X, and one of the main pre-training tasks is to predict these masked words from their context. If you are familiar with the denoising autoencoder (DAE), you will recognize this as a typical DAE idea: the masked words are the noise added on the input side. Pre-trained models like BERT are therefore called DAE LMs.

The advantages and disadvantages of the DAE LM are exactly the opposite of those of the autoregressive LM. It naturally fits a bidirectional language model and can see both the left and the right context of the predicted word, which is its advantage.

  • What is the disadvantage? The [Mask] token introduced on the input side causes a mismatch between the pre-training stage and the fine-tuning stage, because [Mask] never appears during fine-tuning.
    • For a DAE, noise has to be introduced somehow, and the [Mask] token is simply the means of introducing that noise, so this in itself is normal.

XLNet's starting point is: can the advantages of the autoregressive LM and the DAE LM be combined? That is, from the autoregressive side, how to introduce something equivalent to a bidirectional language model; and from the DAE side, which is already bidirectional, how to get rid of the surface [Mask] token so that pre-training and fine-tuning stay consistent. XLNet also raises another problem, that BERT treats the masked words as independent of each other; I don't think this is very important, and the reason will be mentioned later. Of course, that is purely a personal opinion, and mistakes are inevitable, so don't take it too seriously.

  • How can the model keep the look of left-to-right input and prediction while internally introducing the full context of the current word?

This is actually where XLNet's main modeling contribution lies.

  • So how does XLNet do this?

The idea is actually relatively simple. XLNet still follows a two-stage process: the first stage is language-model pre-training, and the second stage is fine-tuning on task data.

What it mainly wants to change is the first stage. Instead of BERT's denoising-autoencoder mode with [Mask] symbols, it adopts the autoregressive LM mode. That is, the input sentence X still appears to be read from left to right, and the word Ti is predicted from its preceding context (Context_before). But we would also like the Context_before seen by the model to contain not only the words before Ti but also some words from Context_after, the words after Ti. In that case the [Mask] symbol BERT introduces in pre-training is no longer needed, so pre-training looks like a standard left-to-right process, and fine-tuning is of course the same process, so the two stages are unified. That is the goal; what remains is the question of how to do it.

So, how can the content of Context_after be folded into the Context_before of the word Ti?

  • This is what XLNet does: in the pre-training stage, it introduces the Permutation Language Model training objective. What does that mean?

Suppose the current input sentence X containing the word Ti consists of several words in order, say x1, x2, x3, x4. Assume the word to be predicted, Ti, is x3, sitting at Position 3. We would like it to see, in its Context_before (Position 1 or Position 2), the word x4 that sits at Position 4. We can do the following: keep x3 fixed at Position 3, then randomly permute the four words of the sentence and pick some of the resulting permutations as pre-training inputs X. For example, the permutation x4, x2, x3, x1 might be sampled as the model input X. Under this order, x3 can see both x2 from its left context and x4 from its right context.

This is the basic idea of XLNet. From the outside it still looks like an autoregressive left-to-right language model, but by permuting the words of the sentence, some of the words that follow Ti are moved into positions before Ti, so Ti effectively sees both its left and its right context, while in form the model still predicts the next word from left to right. A small sketch of this visibility follows.
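
As a toy illustration (not XLNet's implementation), sampling a factorization order and listing what each target position may attend to shows how a "future" word can end up in the visible context:

```python
words = ["x1", "x2", "x3", "x4"]

def visible_context(order):
    """For a sampled factorization order, each word may attend to the words before it in that order."""
    seen = []
    context = {}
    for idx in order:
        context[words[idx]] = list(seen)   # what this target is allowed to see
        seen.append(words[idx])
    return context

order = [3, 1, 2, 0]                        # the order x4 -> x2 -> x3 -> x1 from the text
print(visible_context(order))
# {'x4': [], 'x2': ['x4'], 'x3': ['x4', 'x2'], 'x1': ['x4', 'x2', 'x3']}
# x3 now sees x2 (its left context) and x4 (its right context), as described above.
```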

  • Permutations!!!

Of course, the above is still only the basic idea; the difficulty lies in how to implement it. First of all, it must be emphasized that although the description above permutes the words of sentence X and randomly samples permutations as input, in practice you cannot do that, because at the fine-tuning stage you cannot permute the original input as well.

  • Therefore the input in the pre-training stage must still look like the normal order x1, x2, x3, x4, and the desired effect has to be achieved inside the Transformer.

Specifically, XLNet uses an attention-mask mechanism. Think of it this way: the current input sentence is X, the word to predict, Ti, is the i-th word, and on the input side the first 1 to i-1 words stay exactly where they are. Inside the Transformer, however, the attention mask randomly selects i-1 words from the input words of X, i.e. from both the left and the right context of Ti, "places" them in Ti's preceding context, and hides the inputs of the other words, which achieves the desired goal. (Of course, "placing them before Ti" is only a figurative description: internally, the attention mask simply masks out the words that were not selected so that they play no role when predicting Ti, which looks as if the selected words had been put into the Context_before positions.) Concretely, XLNet implements this with the two-stream self-attention model; for details please refer to the paper, but the basic idea is as described above. Two-stream self-attention is just one concrete way to realize this idea; in principle you could devise other implementations of the same basic idea and still let Ti see the words that come after it.


Two-stream self-attention mechanism:

Here is a brief introduction to the two-stream self-attention mechanism. One stream is the content stream self-attention, which is simply the standard Transformer computation; the interesting part is the query stream self-attention. What is it for? It essentially replaces BERT's [Mask] token. XLNet wants to throw away the [Mask] symbol, but consider predicting x3 from the preceding words x1, x2: the top Transformer layer at x3's position has to predict that word, yet the word x3 itself must not be visible on the input side. BERT handles this by inserting the [Mask] token to cover x3's content; [Mask] is effectively a generic placeholder. Since XLNet discards [Mask] but still must not see x3's input, the query stream simply drops x3's content and keeps only its position information, using a parameter w to represent the embedding of that position. So XLNet only throws away the surface [Mask] placeholder; internally it introduces the query stream to ignore the word being predicted. Compared with BERT, only the implementation differs. A simplified sketch follows.
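
As a heavily simplified, single-head sketch (not the official XLNet implementation), the two streams can be written as two attention calls over the same content representations, differing only in where the query comes from and whether a position may see itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttention(nn.Module):
    """Simplified sketch of XLNet's content stream and query stream."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)
        self.scale = hidden_size ** -0.5

    def attend(self, query, content, mask):
        # mask[i, j] = True means position i may attend to position j
        scores = (self.q(query) @ self.k(content).transpose(-1, -2)) * self.scale
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(content)

    def forward(self, h, g, content_mask, query_mask):
        # Content stream: the query comes from the token's own content; it may see itself.
        new_h = self.attend(h, h, content_mask)
        # Query stream: the query comes from g (position info only, built from a learnable w);
        # keys and values still come from the content stream, but the token cannot see itself.
        new_g = self.attend(g, h, query_mask)
        return new_h, new_g

# Usage sketch: g is the learnable vector w expanded over positions (position-only information).
seq_len, hidden = 4, 64
h = torch.randn(1, seq_len, hidden)                        # content stream input
w = nn.Parameter(torch.randn(hidden))
g = w.expand(1, seq_len, hidden)                           # query stream input
eye = torch.eye(seq_len, dtype=torch.bool)
visible = torch.ones(seq_len, seq_len, dtype=torch.bool)   # stand-in; XLNet derives this from a permutation
layer = TwoStreamAttention(hidden)
new_h, new_g = layer(h, g, visible, visible & ~eye)        # query stream never sees its own content
```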

(Figure: the input in surface order x1, x2, x3, x4 and the attention mask matrices for the permuted order.)

Attention mask

You may still be unsure what the attention mask mentioned above means, so here is an example. The core of the attention-mask mechanism is this: although the current input still looks like x1 -> x2 -> x3 -> x4, we have actually changed it to another random order, say x3 -> x2 -> x4 -> x1. If this example is used to train a left-to-right LM, then when predicting x2 the model can only see x3; when predicting x4 it can only see x3 and x2; and so on. In this way x2, for example, gets to see x3, a word that follows it in the original sentence.

The surface order of the words of X on the input side is preserved, but inside the Transformer what is actually seen is the permuted order, and this is realized through the attention mask. As shown in the figure above, the input still looks like x1, x2, x3, x4; through different mask matrices, the current word xi can only see the words that precede it in the permuted sequence x3 -> x2 -> x4 -> x1. In this way the predicted word internally sees part of its right context, while the input side still keeps the original word order.

The key is to understand the mask matrix on the right side of the figure. Many people probably did not understand it at first (I did not either), because the word coordinates of the mask matrix are not labeled: its coordinates 1-2-3-4 are the surface word order of X. Through the mask matrix you can switch to whatever permutation you want and let the current word see the "preceding context" it is supposed to see, which in fact mixes words from both its left and its right context. This is how the attention mask realizes the permutation. A toy sketch of building such a mask matrix follows.
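
As a toy sketch (not the paper's code), the rows and columns below are indexed by the surface order x1..x4, and entry [i][j] says whether xi may attend to xj given the permuted order x3 -> x2 -> x4 -> x1:

```python
import numpy as np

def permutation_attention_mask(order, seq_len):
    """mask[i, j] = 1 if the word at surface position i may attend to surface position j,
    i.e. if j comes before i in the sampled factorization order."""
    rank = {pos: r for r, pos in enumerate(order)}   # surface position -> rank in the permuted order
    mask = np.zeros((seq_len, seq_len), dtype=int)
    for i in range(seq_len):
        for j in range(seq_len):
            if rank[j] < rank[i]:
                mask[i, j] = 1
    return mask

order = [2, 1, 3, 0]   # surface positions of x3 -> x2 -> x4 -> x1 (0-indexed)
print(permutation_attention_mask(order, 4))
# The row for x2 (index 1) has a 1 only in the x3 column: x2 may attend only to x3.
# The row for x4 (index 3) has 1s in the x3 and x2 columns, and so on.
```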


Why does XLNet work? Summarizing macroscopically, there are three factors:

  1. A new pre-training objective different from BERT's denoising-autoencoder approach: the Permutation Language Model (PLM). This can be understood as a concrete way to fold a bidirectional language model into the autoregressive LM mode. It is XLNet's main modeling contribution and has indeed opened a new direction for the two-stage paradigm in NLP.

  2. The main ideas of Transformer-XL are introduced: relative position encoding and the segment-level recurrence mechanism. Practice shows that both are very helpful for long-document tasks.

  3. The amount of data used in the pre-training stage is increased. BERT's pre-training data are BooksCorpus and English Wikipedia, about 13G in size. In addition to these, XLNet introduces Giga5, ClueWeb and Common Crawl data (16G, 19G and 78G respectively), filtering out some low-quality data. The data scale in the pre-training stage is thus greatly expanded and quality-filtered, which is clearly the GPT-2.0 route.
