Notes on BERT and ERNIE

These notes mainly cover two models:

BERT, proposed in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",

and ERNIE, proposed in "ERNIE: Enhanced Language Representation with Informative Entities".

BERT  

Before reading about BERT, first see https://www.cnblogs.com/dyl222/p/10888917.html, because the feature extractor BERT uses is the Transformer proposed in "Attention Is All You Need".

Translation of the paper:

https://zhuanlan.zhihu.com/p/52248160

For a detailed interpretation of BERT, see: https://blog.csdn.net/yangfengling1023/article/details/84025313

BERT is a two-stage model: the first stage is unsupervised pre-training (pre-train) on a large-scale corpus, and the second stage is fine-tuning on the specific downstream task. Its biggest advantage is its strong results: the linguistic features extracted from massive unsupervised corpora by the multi-layer Transformer generalize well and serve as an excellent feature supplement for downstream tasks.
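For illustration only (not from the original post), here is a minimal sketch of this two-stage usage, assuming PyTorch and a recent version of the HuggingFace `transformers` library; the model name and the binary-classification head are placeholder choices:

```python
# Stage 1 (pre-training) is reused by loading published weights;
# stage 2 fine-tunes the whole model plus a small task head.
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")   # pre-trained stage

class SentenceClassifier(nn.Module):
    """Task-specific head on top of the pre-trained encoder."""
    def __init__(self, encoder, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0]   # [CLS] vector as the sentence representation
        return self.head(cls_vec)

model = SentenceClassifier(encoder)
batch = tokenizer(["a toy example sentence"], return_tensors="pt")
logits = model(**batch)   # fine-tuning: train this end-to-end with a small learning rate
```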

Its biggest innovation is the two new pre-training tasks it proposes:

1. Masked LM

Masked LM randomly masks a number of the input words and predicts the masked words only from their context, so the full context on both sides can be used. Compared with the weak bidirectionality obtained by concatenating the two directions of a Bi-LSTM, this is truly bidirectional.

In all experiments, 15% of the WordPiece tokens in each sequence are masked at random.

While this lets us pre-train a bidirectional model, the approach still has two drawbacks. The first is that it creates a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning. To mitigate this, 15% of the token positions are chosen at random in the training data; for example, in the sentence "my dog is hairy" the word "hairy" might be selected, and the training data is then generated in the following way:
* The selected word is not always replaced with [MASK]; instead the data is generated as follows (see the sketch after this list):
* 80% of the time: replace the selected word with [MASK], e.g. "my dog is hairy" → "my dog is [MASK]"
* 10% of the time: replace the selected word with a random word, e.g. "my dog is hairy" → "my dog is apple"
* 10% of the time: keep the original word unchanged, e.g. "my dog is hairy" → "my dog is hairy".
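A minimal sketch of this 80% / 10% / 10% generation scheme (my own illustration on whitespace-split tokens and a toy vocabulary; the real implementation works on WordPiece tokens):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions, then for each selected position
    use [MASK] 80% of the time, a random word 10%, and the original word 10%."""
    masked = list(tokens)
    targets = {}                          # position -> original token to be predicted
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:  # position not selected
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return masked, targets

vocab = ["my", "dog", "is", "hairy", "apple"]
print(mask_tokens("my dog is hairy".split(), vocab))
```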

The second drawback of using an MLM is that only 15% of the tokens in each batch are predicted, so more training steps are needed for convergence. Section 5.3 of the paper shows that the MLM converges more slowly than a left-to-right model (which predicts every token), but the empirical gains of the MLM far outweigh the extra training cost.

2. Next Sentence Prediction

Many important downstream tasks, such as question answering and natural language inference (NLI), are based on understanding the relationship between two sentences, which a language model does not capture directly. To train a model that understands sentence relationships, BERT is pre-trained on a binarized next-sentence-prediction task that can easily be generated from any monolingual corpus. Specifically, when choosing sentences A and B for each pre-training example, 50% of the time B is the sentence that actually follows A, and 50% of the time B is a sentence chosen at random from the corpus.
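As an illustration (not from the post), a minimal sketch of how such IsNext / NotNext pairs could be sampled from a corpus, represented here as a list of documents, each a list of sentences:

```python
import random

def make_nsp_pair(documents):
    """Sample one next-sentence-prediction example:
    50% of the time B follows A (IsNext), 50% B is a random sentence (NotNext)."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)      # pick a sentence that has a successor
    sent_a = doc[idx]
    if random.random() < 0.5:
        return sent_a, doc[idx + 1], "IsNext"
    other_doc = random.choice(documents)
    return sent_a, random.choice(other_doc), "NotNext"

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_pair(docs))
```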

Input representation:

The input representation is the sum of three parts: the word (token) embedding, the position embedding, and the segment embedding.

WordPiece embeddings with a 30,000-token vocabulary are used; split word pieces are marked with "##" (translator's note: e.g. "playing" → "play ##ing").
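For illustration, a minimal sketch (my own, with a toy vocabulary) of the greedy longest-match-first splitting that produces these "##"-prefixed word pieces:

```python
def wordpiece(word, vocab):
    """Split a word greedily into the longest pieces found in the vocabulary;
    pieces after the first one carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:                     # no matching piece found: treat the word as unknown
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"play", "##ing", "##ed"}
print(wordpiece("playing", vocab))   # ['play', '##ing']
```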

The first token of every sequence is a special classification embedding ([CLS]). The final hidden state corresponding to this token (i.e., the Transformer output at that position) is used as the aggregate sequence representation for classification tasks; for non-classification tasks this vector is ignored. In other words, the [CLS] vector is chosen as the input when handling downstream classification tasks, because it represents the whole sentence.

Sentence pairs are packed together into a single sequence. They are distinguished in two ways: first, the sentences are separated by a special token ([SEP]); second, a learned sentence A embedding is added to every token of the first sentence and a learned sentence B embedding is added to every token of the second sentence.
* For single-sentence inputs, only the sentence A embedding is used.

The segment vector for the tokens of sentence A is all 0s, and the segment vector for the tokens of sentence B is all 1s.
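A minimal sketch of how the three embeddings are summed into the input representation (my own illustration; the token ids are arbitrary stand-ins and the dimensions follow BERT-base):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768

tok_emb = nn.Embedding(vocab_size, hidden)   # WordPiece token embeddings
pos_emb = nn.Embedding(max_len, hidden)      # learned position embeddings
seg_emb = nn.Embedding(2, hidden)            # segment embeddings: 0 = sentence A, 1 = sentence B

# "[CLS] sentence A tokens [SEP] sentence B tokens [SEP]" with arbitrary ids
input_ids   = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 102]])
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1,    1]])
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)

# The Transformer input is the element-wise sum of the three embeddings
embeddings = tok_emb(input_ids) + pos_emb(positions) + seg_emb(segment_ids)
print(embeddings.shape)   # torch.Size([1, 7, 768])
```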

Advantages of BERT:

1) The MLM achieves true bidirectionality; the Bi-LSTM used by ELMo, which concatenates forward and backward representations, achieves only weak bidirectionality.

2) The Transformer is chosen as the feature extractor. Compared with CNNs it captures more global information, and compared with RNN/LSTM-family extractors it can be parallelized. It also allows deep stacking of layers; although "multi-layer" in NLP cannot be compared directly with multi-layer architectures in images, this is still a big step forward, since stacking layers lets the model extract much richer information for NLP.

3) Its biggest contribution is pointing out a clear direction for exploiting massive unsupervised corpora: the linguistic features extracted during pre-training generalize strongly, much like ImageNet pre-training in the image domain, so fine-tuning on a small task-specific corpus is enough to achieve good results on downstream tasks.

The only point worth criticizing may be its position embedding. For NLP tasks that depend heavily on positional information, adding a Bi-LSTM layer on top of BERT might give better results (sketched below), but that in turn defeats BERT's purpose of completely abandoning CNNs and RNNs.
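A minimal sketch of that idea (my own illustration, not from the post): running BERT's per-token outputs through an extra Bi-LSTM layer for a position-sensitive tagging task; the hidden sizes and label count are arbitrary, and whether this helps in practice depends on the task.

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """Bi-LSTM over BERT's per-token hidden states, e.g. for sequence labeling."""
    def __init__(self, bert_hidden=768, lstm_hidden=256, num_labels=9):
        super().__init__()
        self.lstm = nn.LSTM(bert_hidden, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, hidden_states):           # (batch, seq_len, bert_hidden)
        lstm_out, _ = self.lstm(hidden_states)   # (batch, seq_len, 2 * lstm_hidden)
        return self.classifier(lstm_out)         # per-token label logits

head = BiLSTMHead()
fake_bert_output = torch.randn(2, 128, 768)      # stand-in for BERT's last hidden states
print(head(fake_bert_output).shape)              # torch.Size([2, 128, 9])
```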
