Pre-trained Language Models in NLP (Part 2)

  This post covers two pre-trained models from Facebook: SpanBERT and RoBERTa.

One, SpanBERT

  Paper: SpanBERT: Improving Pre-training by Representing and Predicting Spans

  GitHub:https://github.com/facebookresearch/SpanBERT

  This paper proposes a new masking method and a new training objective, and also discusses whether BERT's NSP task is useful. Let's look at how SpanBERT is pre-trained, shown in the figure below:

    [Figure: SpanBERT pre-training, illustrating span masking and the span boundary objective (SBO)]

   As shown above, the first component is the span masking strategy. Concretely, a span length is first sampled from a geometric distribution, capped at a maximum of 10, and the starting position of the span is then sampled uniformly at random. The overall training task is still to predict the masked tokens, and the masking proportion is similar to BERT's. However, two loss terms are introduced: $L_{MLM}$ and $L_{SBO}$. $L_{MLM}$ is the same as in BERT, while $L_{SBO}$ (the span boundary objective) predicts each masked word inside the span using only the two tokens at the span's boundaries, formulated as follows:

    $$\mathbf{y}_i = f(\mathbf{x}_{s-1}, \mathbf{x}_{e+1}, \mathbf{p}_{i-s+1})$$
    $$L(x_i) = L_{MLM}(x_i) + L_{SBO}(x_i) = -\log P(x_i \mid \mathbf{x}_i) - \log P(x_i \mid \mathbf{y}_i)$$

   Here $(s, e)$ are the span's start and end positions and $\mathbf{p}_{i-s+1}$ is a relative position embedding. The function $f(\cdot)$ is defined as follows:

    $$\mathbf{h}_0 = [\mathbf{x}_{s-1}; \mathbf{x}_{e+1}; \mathbf{p}_{i-s+1}]$$
    $$\mathbf{h}_1 = \mathrm{LayerNorm}(\mathrm{GeLU}(W_1 \mathbf{h}_0))$$
    $$\mathbf{y}_i = \mathrm{LayerNorm}(\mathrm{GeLU}(W_2 \mathbf{h}_1))$$
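   To make this concrete, here is a minimal PyTorch sketch I wrote myself (not the official SpanBERT code; names like `sample_span` and `SBOHead` are my own): it samples one masked span with a geometric length distribution capped at 10 and a uniform start position, and implements the SBO head $f(\cdot)$ that predicts a masked token from the two boundary representations plus a relative position embedding.

```python
import torch
import torch.nn as nn

def sample_span(seq_len, p=0.2, max_len=10):
    """Sample one masked span: length ~ Geometric(p) clipped to max_len,
    start position sampled uniformly (p = 0.2 as in the paper)."""
    length = min(torch.distributions.Geometric(probs=p).sample().int().item() + 1, max_len)
    start = torch.randint(0, seq_len - length + 1, (1,)).item()
    return start, start + length  # tokens in [start, end) are masked

class SBOHead(nn.Module):
    """Span Boundary Objective head: predict a masked token x_i from the
    boundary representations x_{s-1}, x_{e+1} and a relative position
    embedding p_{i-s+1}, via a two-layer FFN with GeLU and LayerNorm."""
    def __init__(self, hidden, vocab_size, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len + 2, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
        )
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, left_boundary, right_boundary, rel_pos):
        # left_boundary, right_boundary: (num_masked, hidden); rel_pos: (num_masked,)
        h0 = torch.cat([left_boundary, right_boundary, self.pos_emb(rel_pos)], dim=-1)
        return self.decoder(self.mlp(h0))  # logits over the vocabulary
```

   During pre-training, $L_{SBO}$ would then be the cross-entropy between these logits and the true masked tokens, added to the usual MLM loss.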

  Besides span masking there are two further changes. First, dynamic masking: BERT masks each sequence in 10 different random ways during data preprocessing, whereas here a different mask is applied to each sequence in every epoch. Second, BERT's preprocessing makes 10% of the sequences shorter than 512; that is not done here — sequences of length 512 are taken continuously from a single document, and only the last sequence of a document may be shorter than 512. In addition, Adam's $\epsilon$ is set to 1e-8. Using these two strategies the authors trained a new BERT model, and also trained a BERT model with the NSP task removed that is trained on single sequences. The paper therefore compares four models:

  Google BERT: the BERT released by Google

  Our BERT: BERT retrained with the two strategies above

  Our BERT-1seq: BERT with the two strategies above, with the NSP task removed and trained on single sequences

  SpanBERT: the model proposed in this paper

  The first set of results is on the SQuAD datasets:

    [Table: results on SQuAD 1.1 and SQuAD 2.0 for the four models]

   SpanBERT brings a large improvement, and removing the NSP task also gives a boost; the authors argue that with NSP the effective single-sequence length is too short, so the model cannot capture long-range information well. SpanBERT also improves considerably on other extractive QA tasks:

    [Table: results on other extractive QA datasets]

   Personally, I think SpanBERT achieves such a large gain on extractive QA because its pre-training tasks, especially SBO, are in fact a good fit for extractive QA.

  SpanBERT also improves somewhat on other tasks, though not as much as on extractive QA. In addition, the authors run experiments showing that masking random spans works better than masking entities or phrases.

  In summary, SpanBERT performs outstandingly on extractive QA, and it is worth trying on extractive QA tasks.
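  If you want to try it, the checkpoints released in the repo above are distributed in a BERT-compatible format, so they can be loaded with standard BERT tooling. Below is a rough example using the HuggingFace `transformers` library; the model identifier is my assumption, so substitute whatever checkpoint you actually download from the GitHub repo:

```python
from transformers import AutoTokenizer, AutoModel

# Model identifier is an assumption; point it at your downloaded SpanBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("SpanBERT/spanbert-base-cased")
model = AutoModel.from_pretrained("SpanBERT/spanbert-base-cased")

# Encode a (question, passage) pair, as in extractive QA.
inputs = tokenizer("Who proposed SpanBERT?",
                   "SpanBERT was proposed by Facebook AI.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```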

 

Two, RoBERTa

  Paper: RoBERTa: A Robustly Optimized BERT Pretraining Approach

  GitHub:https://github.com/brightmart/roberta_zh

  This paper tunes the training recipe on top of BERT and can be seen as the ultimate exercise in hyperparameter tuning: the final model not only comprehensively beats BERT, but also surpasses XLNet on most tasks.

   In summary, the main changes are the following six (see the configuration sketch after the list):

  1) The Adam optimizer's parameters are adjusted: $\epsilon$ is changed from 1e-6 to 1e-8, and $\beta_2$ from 0.999 to 0.98.

  2) More training data is used, going from 16GB to 160GB of text.

  3) Static masking is replaced with dynamic masking.

  4) The NSP task is removed, and full-length (512-token) sequences are used.

  5) A larger batch size and more training steps are used.

  6) Character-level BPE is replaced with byte-level BPE.
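  Putting the six changes together, they roughly amount to a training configuration like the sketch below. This is my own illustrative summary (the variable names, learning rate, and stand-in model are assumptions, not values taken from the paper):

```python
from dataclasses import dataclass
import torch

@dataclass
class PretrainConfig:
    # 1) Adam hyperparameters (values as described in the list above)
    adam_eps: float = 1e-8
    adam_betas: tuple = (0.9, 0.98)
    # 2) more data: ~160GB of raw text instead of ~16GB
    data_size_gb: int = 160
    # 3) dynamic masking: re-generate the mask every epoch
    dynamic_masking: bool = True
    # 4) no NSP, pack full-length 512-token sequences
    use_nsp: bool = False
    max_seq_len: int = 512
    # 5) larger batches, more steps (e.g. 8K sequences per batch)
    batch_size: int = 8192
    # 6) byte-level BPE instead of character-level BPE
    tokenizer: str = "byte-level-bpe"

cfg = PretrainConfig()
model = torch.nn.Linear(8, 8)  # stand-in for the actual transformer encoder
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3,  # illustrative peak LR for large-batch training (an assumption)
                              betas=cfg.adam_betas, eps=cfg.adam_eps,
                              weight_decay=0.01)
```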

  Let's now walk through the authors' experiments. First, tuning the Adam parameters makes training more stable and yields somewhat better performance, although no experimental data are given for this. That more data improves performance is beyond doubt.

  Dynamic masking

  BERT masks the data in 10 different ways during preprocessing, so with 40 training epochs each mask is seen four times on average. Here dynamic masking is used instead, i.e. a different mask is generated in every epoch. The comparison is as follows:

    [Table: static vs. dynamic masking results]

   To be honest, I don't see much improvement; after all, during training the model sees each example many times anyway.
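  The difference between the two schemes is easy to show in code. Below is a toy sketch of my own (the `random_mask` helper is a simplification of BERT's real 80/10/10 masking rule): static masking fixes 10 masked copies at preprocessing time, while dynamic masking draws a fresh mask every time a sequence is used:

```python
import random

def random_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Toy masking helper (a simplification, not the real BERT masking logic):
    replace each token with [MASK] with probability mask_prob."""
    return [mask_token if random.random() < mask_prob else t for t in tokens]

def static_masking(dataset, num_copies=10):
    """Static masking (BERT): the masks are fixed once at preprocessing time."""
    return [random_mask(seq) for seq in dataset for _ in range(num_copies)]

def dynamic_batches(dataset, num_epochs):
    """Dynamic masking (RoBERTa): a new mask is drawn every time a sequence is seen."""
    for _ in range(num_epochs):
        for seq in dataset:
            yield random_mask(seq)

# Usage: the same sentence gets a (likely) different mask in each epoch.
data = [["the", "cat", "sat", "on", "the", "mat"]]
for masked in dynamic_batches(data, num_epochs=3):
    print(masked)
```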

  Model input

  To compare the effect of the NSP task and of different input sequence formats, the authors consider four input formats (a packing sketch for the last two follows the list):

  1) SEGMENT-PAIR + NSP: a pair of segments (each may contain multiple sentences), with the NSP task.

  2) SENTENCE-PAIR + NSP: a pair of single sentences, with the NSP task; the total length can be much smaller than 512.

  3) FULL-SENTENCES: multiple complete sentences that may cross document boundaries, with a separator token at each boundary; total length at most 512; no NSP task.

  4) DOC-SENTENCES: multiple complete sentences that never cross a document boundary; total length at most 512; no NSP task.
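  To make the last two formats concrete, here is a simplified packing sketch of my own (token counting and separator handling are assumptions): FULL-SENTENCES keeps packing across document boundaries with a separator, while DOC-SENTENCES starts a new sequence at every document boundary:

```python
MAX_LEN = 512
SEP = "[SEP]"  # assumed separator inserted at document boundaries

def pack_full_sentences(documents):
    """FULL-SENTENCES: pack sentences into <=512-token sequences, allowed to
    cross document boundaries, with a separator between documents.
    `documents` is a list of documents; each document is a list of sentences,
    each sentence a list of tokens."""
    sequences, current = [], []
    for doc in documents:
        for sent in doc:
            if current and len(current) + len(sent) > MAX_LEN - 1:  # reserve a slot for SEP
                sequences.append(current)
                current = []
            current.extend(sent)
        current.append(SEP)  # mark the document boundary and keep packing
    if current:
        sequences.append(current)
    return sequences

def pack_doc_sentences(documents):
    """DOC-SENTENCES: same packing, but a sequence never crosses a document
    boundary, so some sequences end up shorter than 512 tokens."""
    sequences = []
    for doc in documents:
        current = []
        for sent in doc:
            if current and len(current) + len(sent) > MAX_LEN:
                sequences.append(current)
                current = []
            current.extend(sent)
        if current:
            sequences.append(current)
    return sequences
```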

  The results are as follows:

    [Table: results for the four input formats]

   Clearly, using single sentences performs worst; the authors attribute this mainly to the sequences being too short, which prevents the model from capturing long-range information. Removing the NSP task also improves the results.

  Larger batch size, more training steps

  The authors argue that appropriately increasing the batch size both speeds up training and improves model performance.

    [Table: effect of batch size on perplexity and end-task performance]

   Then, with the batch size fixed at 8K, the authors further increase the amount of training:

    [Table: results with more data and longer training]

  The experiments show that more training brings a sizeable improvement, and that even with a comparable amount of training data, RoBERTa outperforms BERT.
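  As a practical aside, an effective batch size like 8K is usually reached with gradient accumulation on ordinary hardware; the following generic sketch (my own illustration, not RoBERTa's actual training loop) shows the idea:

```python
import torch

def train_with_accumulation(model, loader, optimizer, accum_steps):
    """Accumulate gradients over `accum_steps` micro-batches so the effective
    batch size is accum_steps * micro_batch_size."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader, start=1):
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        (loss / accum_steps).backward()  # scale so the accumulated gradient averages correctly
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```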

  In short, RoBERTa is a successful re-tuning of BERT's training recipe: it surpasses BERT on many tasks and surpasses XLNet on most.

    
