Pre-trained Language Models (1)

  BERT has set remarkable records on a variety of NLP tasks, and in the less than a year since, pre-training models have developed enormously. This series of articles briefly reviews the better-known pre-training models that came after BERT; this installment covers a few pre-training models open-sourced in China.

1. ERNIE (Tsinghua University & Huawei Noah's Ark Lab)

  Paper: ERNIE: Enhanced Language Representation with Informative Entities

  GitHub:https://github.com/thunlp/ERNIE

  Tsinghua University and Huawei's Noah's Ark Lab jointly proposed introducing knowledge from a knowledge graph to enhance the pre-training model. In essence, pre-training just adds an entity-alignment task on top of the original BERT. Let's see how this new task works, starting with the architecture of the whole pre-training model:

    (Figure: the overall architecture of the pre-training model; the right half shows the token-entity aggregation.)

  There are two encoders, a T-encoder and a K-encoder. The K-encoder really only plays a role during pre-training; in the fine-tuning stage you only need the T-encoder, so what matters here is just the entity-alignment task that is introduced.

  As shown in the right part of the figure above, we are given a token sequence $w_1, w_2, \ldots, w_n$ that is aligned with an entity sequence $e_1, e_2, \ldots, e_m$, where the entities come from a knowledge graph. Because one entity can span several tokens, in the figure's example the entity $e_1$ = Bob Dylan corresponds to two tokens in the sequence, $w_1$ = bob and $w_2$ = dylan. Therefore, when aligning, each knowledge-graph entity is attached to the first token of its mention in the sequence, i.e. positionally $e_1$ corresponds to $w_1$.
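  For concreteness, here is a minimal sketch of that alignment convention (my own illustration, not the authors' code; the entity names and tokenization are made up): each entity is attached only to the first token of its mention, and every other token carries no entity.

```python
# Minimal sketch of the token-entity alignment convention described above.
def align_entities(tokens, mentions):
    """tokens: list of wordpieces; mentions: list of (entity, start, end) spans."""
    alignment = [None] * len(tokens)      # None = token aligned to no entity
    for entity, start, end in mentions:
        alignment[start] = entity         # only the first token of the mention carries the entity
    return alignment

tokens = ["bob", "dylan", "wrote", "blow", "##in", "1962"]
mentions = [("Bob Dylan", 0, 2), ("Blowin' in the Wind", 3, 5)]
print(align_entities(tokens, mentions))
# ['Bob Dylan', None, None, "Blowin' in the Wind", None, None]
```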

  The role of the T-encoder is to encode the token sequence; its structure is similar to bert-base, but with six layers. The K-encoder aggregates the knowledge-graph entities with the token sequence; the entity embeddings are obtained with TransE. Concretely, the sequence and the entities are first encoded separately and then aggregated, and after aggregation both the token states $W$ and the entity states $E$ are updated. For tokens that are not aligned with any entity, the token states are updated directly, without an entity term. The update formulas are sketched below.
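  Written out, the aggregation looks roughly like the following (my reconstruction based on the information-fusion equations in the ERNIE paper; the notation may differ slightly from the original figures). In the $i$-th aggregator layer, a token $w_j$ aligned with entity $e_k$ is updated as

$$h_j = \sigma\left(\tilde{W}_t^{(i)}\,\tilde{w}_j^{(i)} + \tilde{W}_e^{(i)}\,\tilde{e}_k^{(i)} + \tilde{b}^{(i)}\right), \qquad w_j^{(i)} = \sigma\left(W_t^{(i)} h_j + b_t^{(i)}\right), \quad e_k^{(i)} = \sigma\left(W_e^{(i)} h_j + b_e^{(i)}\right)$$

  where $\tilde{w}_j^{(i)}$ and $\tilde{e}_k^{(i)}$ are the outputs of the token-side and entity-side multi-head self-attention. For a token with no aligned entity, the entity term is simply dropped:

$$h_j = \sigma\left(\tilde{W}_t^{(i)}\,\tilde{w}_j^{(i)} + \tilde{b}^{(i)}\right), \qquad w_j^{(i)} = \sigma\left(W_t^{(i)} h_j + b_t^{(i)}\right)$$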

   Now that we have seen how knowledge-graph entities are brought into the model, let's look at how the pre-training task itself is built. The paper randomly masks some of the token-entity alignments and has the model predict the entity corresponding to each position. In nature this is the same kind of task as MLM (masked language model), since both are denoising autoencoding objectives. The specific masking details are listed below (a small sketch follows the list):

  1) 5% of the time, the aligned entity is replaced with another random entity. This mainly introduces noise, since misaligned entities do occur in real tasks.

  2) 15% of the time, the token-entity alignment is masked, and the model has to predict the aligned entity.

  3) The remaining 80% are kept unchanged.
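  A simplified illustration of this corruption scheme (my own sketch, not the released training code; how the masked alignment is represented here is an assumption):

```python
import random

def corrupt_alignments(aligned_entities, all_entities, ent_mask="[ENT_MASK]"):
    """aligned_entities: one entity (or None) per token position, as above."""
    corrupted, labels = [], []
    for ent in aligned_entities:
        if ent is None:                       # token has no aligned entity
            corrupted.append(None); labels.append(None)
            continue
        r = random.random()
        if r < 0.05:                          # 5%: replace with a random wrong entity (noise)
            corrupted.append(random.choice(all_entities)); labels.append(ent)
        elif r < 0.20:                        # 15%: mask the alignment; model must predict ent
            corrupted.append(ent_mask); labels.append(ent)
        else:                                 # 80%: keep the alignment unchanged
            corrupted.append(ent); labels.append(None)
    return corrupted, labels
```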

  The main work of the paper is the addition of this task. It also proposes new fine-tuning procedures for two downstream tasks, entity typing and relation extraction, shown in the figure below:

    (Figure: the fine-tuning procedures for entity typing and relation extraction.)

  The idea is to introduce some special tokens into the input to mark the entities (a rough illustration follows). Because the entity-alignment task is introduced during pre-training, the model performs better than BERT on a number of knowledge-graph-related downstream tasks.
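  As a rough illustration of what those modified inputs look like (the special-token names here are my assumption based on a reading of the paper's figure, not verified against the released code):

```python
# Hypothetical examples of fine-tuning inputs with entity-marking special tokens.
entity_typing_input = "[CLS] [ENT] bob dylan [ENT] wrote blowin' in the wind in 1962 [SEP]"
relation_cls_input = "[CLS] [HD] bob dylan [HD] wrote [TL] blowin' in the wind [TL] in 1962 [SEP]"
```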

 

2. ERNIE (Baidu)

  Paper: ERNIE: Enhanced Representation through Knowledge Integration

  GitHub:https://github.com/PaddlePaddle/ERNIE

  Baidu's model has the same name as the one above and is likewise described as integrating knowledge, but the approach is completely different: the main change is an improvement to BERT's MLM task. Specifically, as shown below:

    (Figure: token-level, entity-level and phrase-level masking compared with BERT's original masking.)

   BERT only masks single tokens, but in language, words usually appear inside phrases or entities. If the correlations among the words within a phrase or entity are ignored and every word is treated independently, syntactic and semantic information cannot be expressed well. So this paper introduces three levels of masking: token-level, entity-level, and phrase-level; a small sketch of the span-level idea follows.
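  Here is a minimal sketch of that idea, assuming the phrase/entity spans have already been found by some chunker or NER tool (this is my own illustration, not the PaddlePaddle implementation):

```python
import random

def span_mask(tokens, spans, mask_prob=0.15, mask_token="[MASK]"):
    """tokens: the token sequence; spans: (start, end) index pairs covering
    phrases/entities, with single-token units given as (i, i + 1) spans."""
    masked, labels = list(tokens), [None] * len(tokens)
    for start, end in spans:
        if random.random() < mask_prob:       # sample whole units, not single tokens
            for i in range(start, end):
                labels[i] = tokens[i]         # the model must recover every token in the unit
                masked[i] = mask_token
    return masked, labels
```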

  In addition, the paper introduces dialogue corpora, which enriches the sources of training data, and defines a task similar to NSP on the dialogue data, as in the figure below:

    (Figure: the NSP-like task built on the dialogue corpus.)

   The task built here is the DLM (Dialogue Language Model). In practice it is similar to NSP: a number of fake multi-turn QR (query-response) pairs are randomly generated, and the model has to predict whether the current dialogue turns are real or fake.
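  A rough sketch of that negative sampling, under my own assumptions about the data format (not Baidu's code):

```python
import random

def make_dlm_example(dialogue, utterance_pool):
    """dialogue: a real multi-turn sequence such as [Q, R, Q]; returns (example, label)."""
    if random.random() < 0.5:
        return dialogue, 1                                  # real dialogue
    fake = list(dialogue)
    fake[random.randrange(len(fake))] = random.choice(utterance_pool)
    return fake, 0                                          # fake: a query/response was swapped in
```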

  The authors report improvements of 1-2% over BERT on many tasks, and their experiments show that the DLM task brings gains on the NLI task.

 

3. ERNIE 2.0 (Baidu)

  Paper: ERNIE 2.0: A Continual Pre-training Framework for Language Understanding

  GitHub:https://github.com/PaddlePaddle/ERNIE

  This is Baidu's further improvement over its previous model. The paper takes a multi-task approach, introducing seven pre-training tasks, and trains them by adding the tasks one after another. The specific tasks are shown in the figure below:

    (Figure: the seven pre-training tasks of ERNIE 2.0.)

  Because the different tasks have different inputs, the authors introduce a Task Embedding to distinguish them. The training procedure is: first train on task 1 and save the model; then load the saved model and train on tasks 1 and 2 together; and so on, until the last stage trains on all seven tasks at the same time. My personal guess is that training all seven tasks together from the start may not work well, whereas with this schedule the model has already been pre-trained on task 1 by the time task 2 is added, which amounts to a good parameter initialization and lets the model converge better when training tasks 1 and 2.
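  The schedule described above can be sketched as follows (model, tasks and train_stage are placeholders of my own, not the ERNIE 2.0 API):

```python
def continual_pretrain(model, tasks, train_stage):
    """tasks: the pre-training tasks, in the order they are introduced.
    train_stage(model, active_tasks) trains jointly on all tasks given so far,
    with each input carrying a task embedding identifying its task."""
    active = []
    for task in tasks:
        active.append(task)          # add one new task per stage
        train_stage(model, active)   # train on tasks 1..k, starting from the previous stage's weights
    return model
```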

   Compared with the ERNIE 1.0 version, the results improve essentially across the board, with especially large gains on reading-comprehension tasks.

 

4. BERT-wwm

  Paper: Pre-Training with Whole Word Masking for Chinese BERT

  GitHub:https://github.com/ymcui/Chinese-BERT-wwm

  BERT-wwm was open-sourced by HIT (Harbin Institute of Technology). It introduces whole word masking on top of the original bert-base: in effect, whole words obtained by word segmentation are masked together, as shown below:

    (Figure: an example of whole word masking compared with the original masking.)
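   A minimal sketch of the idea for Chinese, assuming a word segmenter (e.g. jieba) has already split the sentence into words (this is my own illustration, not the authors' script):

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    """words: output of a Chinese word segmenter, e.g. ['使用', '语言', '模型', '来', '预测'];
    BERT tokenizes Chinese character by character, so a word covers several tokens."""
    tokens, labels = [], []
    for word in words:
        chars = list(word)
        if random.random() < mask_prob:
            tokens += [mask_token] * len(chars)   # mask every character of the word together
            labels += chars
        else:
            tokens += chars
            labels += [None] * len(chars)
    return tokens, labels
```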

   Because it is trained on top of bert-base, it can be used as a drop-in replacement wherever BERT is used now: just swap in the pre-trained model directly, with no need to change any files. It also brings some improvement over BERT on many Chinese tasks, so it is recommended.
