[NLP] A quick read of UNILM

Last week I wrote about MASS, a model I like quite a lot, which drew on BERT's pre-training method to propose a new Seq2Seq pre-training task. Today let's talk about another BERT-based generation model, UNILM, also from Microsoft.

Paper link


UNILM is short for Unified Language Model Pre-training for Natural Language Understanding and Generation. What it actually proposes is a pre-training method, and a very simple one: directly reuse BERT's structure and parameters. For NLU you just use BERT as-is; for NLG you treat S1 [SEP] S2 as an encoder-decoder. It doesn't have the encoder-decoder structure in form, but it has it in spirit.

1. Model structure

The paper very cleverly seizes on the MASK mechanism: no matter which kind of LM you train, the essence is what information each token is allowed to see during training, and at the implementation level that is simply a question of which inputs get masked. So a Seq2Seq LM can be folded right into BERT: given the input S1 [SEP] S2 [SEP], S1 is encoded bidirectionally as the encoder, while each token of S2 can see all tokens of S1 plus the tokens before itself within S2, as shown in the bottom mask matrix of the figure above (honestly, I think I explain it more clearly than the figure does).
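To make the mask-matrix idea concrete, here is a minimal sketch of the self-attention mask for the Seq2Seq LM case. This is my own illustration rather than the authors' code, and `seq2seq_attention_mask` is a hypothetical helper; the lengths are assumed to already include the special tokens.

```python
import torch

def seq2seq_attention_mask(len_s1: int, len_s2: int) -> torch.Tensor:
    """(L, L) self-attention mask, 1 = may attend, 0 = blocked, L = len_s1 + len_s2."""
    total = len_s1 + len_s2
    mask = torch.zeros(total, total, dtype=torch.long)
    mask[:len_s1, :len_s1] = 1                        # S1 tokens see all of S1 (bidirectional)
    mask[len_s1:, :len_s1] = 1                        # S2 tokens see all of S1
    mask[len_s1:, len_s1:] = torch.tril(              # S2 tokens see only themselves and earlier S2 tokens
        torch.ones(len_s2, len_s2, dtype=torch.long))
    return mask

# Example: S1 = [CLS] a b [SEP] (4 tokens), S2 = x y [SEP] (3 tokens)
print(seq2seq_attention_mask(4, 3))
```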

The basic principle is easy to understand; what mainly deserves a closer look is the training details in the paper.

2. Pre-training

  1. Input representation: the same three embeddings as BERT are used, but tokens are further split into subwords following WordPiece, which strengthens the model's ability on generation. The authors also emphasize that the segment embedding helps distinguish the different LM objectives.
  2. Transformer: unchanged, but the authors stress that the different LM objectives are controlled by different self-attention mask matrices.
  3. Unidirectional LM: the input is a single sentence. Other unidirectional LMs compute a prediction loss for every token, but the authors stick to the Masked LM idea and compute the loss only on masked tokens. I think this is because BERT's BiLM is implemented as a cloze task during pre-training (predicting the current [MASK] from the encoding), which clearly doesn't fit predicting x1, x2, ... one after another; a masked left-to-right LM of this kind is one way to unify the structures of the unidirectional LM and the BiLM.
  4. Seq2Seq LM: the input is two sentences. The first sentence is encoded the BiLM way, the second sentence the unidirectional-LM way, so the encoder (BiLM) and decoder (Uni-LM) are trained at the same time. Some tokens of the input are also randomly masked while it is processed.
  5. The Next Sentence Prediction task is also included.
  6. During training, within a batch the optimization objectives are allocated as follows: 1/3 of the time BiLM plus Next Sentence Prediction, 1/3 of the time Seq2Seq LM, and 1/6 each for left-to-right and right-to-left LM. I didn't figure out exactly how this is implemented; are different batches simply given different data? If you know, please share ~ (a rough sketch of one possible implementation follows this list)
  7. Parameters are initialized from BERT-large.
  8. Masks are added at the same rate as BERT, but when adding them, 80% of the time a single token is masked at random and 20% of the time a bigram or trigram is masked, which improves the model's predictive ability (also sketched below).
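As a rough sketch of how I imagine points 6 and 8 could be implemented (the paper does not give code; `sample_objective` and `sample_mask_positions` are my own hypothetical helpers), each batch samples one objective according to the stated proportions, and masking occasionally covers a short n-gram instead of a single token:

```python
import random

# (objective name, sampling proportion) as described in the paper
OBJECTIVES = [
    ("bidirectional+nsp", 1 / 3),   # BiLM plus Next Sentence Prediction
    ("seq2seq",           1 / 3),   # Seq2Seq LM
    ("left-to-right",     1 / 6),
    ("right-to-left",     1 / 6),
]

def sample_objective() -> str:
    """Pick the LM objective used for the current batch."""
    names, weights = zip(*OBJECTIVES)
    return random.choices(names, weights=weights, k=1)[0]

def sample_mask_positions(num_tokens: int, mask_rate: float = 0.15) -> set:
    """Choose positions to mask: 80% of the time a single token, 20% a bigram or trigram."""
    masked = set()
    budget = max(1, int(num_tokens * mask_rate))
    while len(masked) < budget:
        start = random.randrange(num_tokens)
        span = 1 if random.random() < 0.8 else random.choice([2, 3])
        masked.update(range(start, min(start + span, num_tokens)))
    return masked
```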

3. Fine-tuning

  1. NLU: same as BERT.
  2. NLG: during fine-tuning only tokens of sentence S2 are masked, and S2's [SEP] can also be randomly masked, so the model learns when to stop (a small sketch follows this list).
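Here is a minimal sketch of that NLG fine-tuning rule as I understand it. `nlg_finetune_mask_positions` is again a hypothetical helper, and the masking rate is left as a parameter because I am not quoting the paper's exact value: only S2 positions, including its trailing [SEP], are eligible to be masked.

```python
import random

def nlg_finetune_mask_positions(len_s1: int, len_s2: int, mask_rate: float) -> list:
    """Pick positions to mask during NLG fine-tuning.

    Only indices inside S2 (which follows S1 in the packed input) are eligible,
    including S2's final [SEP], so the model also learns to predict the
    end-of-generation token.
    """
    s2_positions = range(len_s1, len_s1 + len_s2)
    k = max(1, int(len_s2 * mask_rate))
    return sorted(random.sample(s2_positions, k))
```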

4. Experiments

  1. Abstractive summarization: extractive summarization is added as an auxiliary task; based on an input sentence's first token, the model predicts whether that sentence appears in the extractive summary data.
  2. QA (reading comprehension): on the extractive tasks SQuAD and CoQA it surpasses BERT, and it does especially well on generative QA, improving by a little under 40 points over the 2018 PGNet.
  3. Question generation: using the generated questions, the SQuAD result improves by 4 points.
  4. GLUE: >= BERT on most tasks, with an overall improvement of 0.3 points.

5. Summary

UNILM and MASS share the same goal of unifying BERT with generative models, but personally I find UNILM more elegant. First, UNILM's way of unifying is more concise: it improves things purely from the perspective of the mask matrix, whereas MASS still reshapes BERT into a Seq2Seq structure and uses only the encoder for other tasks, unlike UNILM, which handles everything with a single structure. Second, UNILM reports more results, with a huge improvement on generative QA in particular, while also keeping overall performance on par with BERT, whereas MASS is too focused on its own encoder.

That said, UNILM and MASS did not run the same experiments, so they can't be compared directly. Personally, I feel that for relatively simple generation tasks UNILM works well, but for harder tasks like translation, especially when training corpora are scarce, MASS should be more appropriate.


Happy Dragon Boat Festival!

Reproduced from: https://juejin.im/post/5cfbf2a0f265da1b6a348721
