ERNIE (Enhanced Language Representation with Informative Entities) paper translation

This paper was published at ACL 2019: ERNIE, a language representation model enhanced with informative entities. Note that Baidu has also proposed a different model that is likewise named ERNIE.

  -- By Brisk Yu

I feel the key lies in how the knowledge entities are built; see TransE.

ERNIE: Enhanced Language Representation with Informative Entities

Abstract

  Pre-trained language representation models such as BERT can capture rich semantic patterns from plain text over large-scale corpora, and can be fine-tuned to consistently improve the performance of various NLP tasks. However, existing pre-trained language models rarely incorporate knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we train an enhanced language representation model (ERNIE) on both large-scale textual corpora and KGs, so that it can take full advantage of lexical, syntactic, and knowledge information simultaneously. Experimental results show that ERNIE achieves significant improvements on various knowledge-driven tasks, while remaining comparable with the existing BERT model on other common NLP tasks. The source code and experimental details can be found at: https://github.com/thunlp/ERNIE.

1 Overview

  Pre-trained language representation models can capture rich language information from text, and many NLP applications benefit from them either as text-based features or through fine-tuning. BERT, a recently proposed method, obtains state-of-the-art results on a variety of NLP tasks via simple fine-tuning, including named entity recognition, question answering, natural language inference, and text classification.

 

 

Figure 1: An example of how informative entities in a KG contribute to language understanding. The solid lines represent existing knowledge facts, the red dotted lines represent the facts extracted from the sentence in red, and the green dotted lines represent the facts extracted from the sentence in green.

 

 

  Although pre-trained language representation models have achieved good results and work as routine components in many NLP tasks, they ignore the incorporation of knowledge information for language understanding. As shown in Figure 1, without knowing that Blowin' in the Wind and Chronicles: Volume One are a song and a book respectively, it is difficult to recognize the two occupations of Bob Dylan in the entity typing task, namely songwriter and writer. Furthermore, it is almost impossible for the relation classification task to extract the fine-grained relations, such as composer and author. For existing pre-trained language models, these two sentences are syntactically ambiguous, like "UNK wrote UNK in UNK". Hence, considering rich knowledge information can lead to better language understanding, which benefits various knowledge-driven applications such as entity typing and relation classification.

  Incorporating external knowledge into language representation models poses two challenges: 1) Structured knowledge encoding: for a given text, how to effectively extract and encode its related informative facts in the KG for the language representation model is an important problem; 2) Heterogeneous information fusion: the pre-training procedure for language representation is quite different from the knowledge representation procedure, which leads to two separate vector spaces. How to design a special pre-training task to fuse lexical, syntactic, and knowledge information is another challenge.

  To overcome these two challenges, we propose ERNIE (Enhanced Language Representation with Informative Entities), which pre-trains a language representation model on both large-scale textual corpora and KGs:

  1) For extracting and encoding knowledge information, we first recognize the named entity mentions in the text and then align these mentions with their corresponding entities in the KG. Rather than directly using the graph-based facts in the KG, we encode the graph structure of the KG with knowledge embedding algorithms such as TransE (a toy TransE sketch is given after this list), and then use the resulting informative entity embeddings as input to ERNIE. Based on the alignments between text and the KG, ERNIE integrates the entity representations from the knowledge module into the underlying layers of the semantic module.

  2) Similar to BERT, we adopt masked language modeling and next sentence prediction as pre-training tasks. In addition, for better fusion of textual and knowledge features, we design a new pre-training task: it randomly masks some of the named entity alignments in the input text and asks the model to select the appropriate entities from the KG to complete the alignments. Unlike existing models that predict tokens using only the local context, our task requires the model to aggregate both context and knowledge facts to predict the tokens and entities, which leads to a knowledgeable language representation model.
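To make the idea of TransE-style knowledge embeddings concrete, here is a minimal PyTorch sketch of the TransE scoring function and margin loss (my own illustration, not the authors' code; entity/relation counts and the embedding dimension are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransE(nn.Module):
    """Minimal TransE sketch: a triple (h, r, t) is plausible when h + r ≈ t."""
    def __init__(self, n_entities, n_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def score(self, h, r, t):
        # Higher score = more plausible triple.
        return -torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=2, dim=-1)

    def margin_loss(self, pos, neg, margin=1.0):
        # pos and neg are (h, r, t) index tensors; neg is a corrupted triple.
        return F.relu(margin - self.score(*pos) + self.score(*neg)).mean()
```

The entity embeddings learned this way (self.ent) are the kind of vectors that ERNIE later consumes as its input entity representations.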

  We conduct experiments on two knowledge-driven NLP tasks: entity typing and relation classification. The results show that ERNIE makes full use of lexical, syntactic, and knowledge information, and significantly outperforms BERT on these knowledge-driven tasks. We also evaluate ERNIE on other common NLP tasks, where it achieves comparable results.

2 Related work

  Many efforts have been devoted to pre-training language representation models in order to capture language information from text for specific NLP tasks. These pre-training methods can be divided into two categories: feature-based approaches and fine-tuning approaches.

  Early work focused on feature-based approaches that transform words into distributed representations. Since these pre-trained word representations capture syntactic and semantic information from the corpora, they are often used as input features or initialization parameters for various NLP models, and perform better than randomly initialized parameters. As such word-level models cannot disambiguate polysemous words, Peters et al. adopted a sentence-level model (ELMo) to capture complex word features in different contexts and used ELMo to generate context-aware word embeddings.

  Unlike the feature-based approaches above, which only use the pre-trained language representations as input features, Dai and Le trained auto-encoders on unlabeled text and then used the pre-trained model architecture and parameters as a starting point for other specific NLP tasks. Inspired by Dai and Le, more pre-trained language representation models for fine-tuning have been proposed. Devlin et al. proposed BERT, a deep bidirectional model with multiple Transformer layers, which currently achieves the best results on several NLP tasks (note: as of this paper's publication).

  Although both feature-based and fine-tuning language representation models have achieved great success, they ignore the incorporation of knowledge information. As recent work has shown, injecting extra knowledge information can significantly enhance the original models on tasks such as reading comprehension, machine translation, natural language inference, knowledge acquisition, and dialogue systems. Hence, we argue that external knowledge information can effectively benefit existing pre-trained models. In fact, some work has attempted to jointly learn representations of words and entities to exploit external KGs effectively, and has achieved promising results. Sun et al. proposed a knowledge masking strategy for the masked language model to enhance language representation with knowledge. In this paper, building on BERT, we train an enhanced language representation model using both corpora and KGs.

3 Method

  In this section, we present the overall framework of ERNIE and its implementation details. Section 3.2 describes the model architecture, Section 3.4 introduces the new pre-training task designed for encoding informative entities and fusing heterogeneous information, and Section 3.5 describes the details of the fine-tuning procedure.

 

Figure 2: The left part is the overall architecture of ERNIE. The right part is the aggregator that integrates the input tokens and entities. The information fusion layer takes two kinds of input: one is the token embedding alone, and the other is the concatenation of the token embedding and the entity embedding. After fusion, the information fusion layer outputs new token embeddings and entity embeddings.

3.1 Notation

  We denote the token sequence as $\{w_1, \dots, w_n\}$ (note: tokens here are at the subword level), where n is the length of the token sequence. Meanwhile, we denote the entity sequence aligned to the given tokens as $\{e_1, \dots, e_m\}$, where m is the length of the entity sequence. Note that m and n are not equal in most cases, as not every token can be aligned to an entity in the KG. Furthermore, we denote the whole vocabulary containing all tokens as $\mathcal{V}$, and the entity list containing all entities in the KG as $\mathcal{E}$. If a token $w \in \mathcal{V}$ has a corresponding aligned entity $e \in \mathcal{E}$, their alignment is defined as $f(w) = e$. In this paper, we align an entity to the first token of its corresponding named entity phrase, as shown in Figure 2.
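To make the alignment f(w) = e concrete, here is a toy example (my own illustration; the entity identifiers are made-up placeholders, not real KG ids):

```python
# Toy illustration of the token/entity alignment described above (m != n in general).
tokens = ["bob", "dylan", "wrote", "blowin", "'", "in", "the", "wind", "in", "1962"]
# Each entity is aligned to the FIRST token of its named entity mention: f(w_j) = e_k.
alignment = {0: "Q_BobDylan",           # mention "bob dylan"
             3: "Q_BlowinInTheWind"}    # mention "blowin ' in the wind"
entities = [alignment.get(j) for j in range(len(tokens))]  # None where no entity aligns
```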

 

3.2 Model architecture

  As shown in Figure 2, the whole model architecture of ERNIE consists of two stacked modules: 1) the underlying textual encoder (T-Encoder), which captures basic lexical and syntactic information from the input tokens, and 2) the upper knowledge encoder (K-Encoder), which is responsible for integrating extra token-oriented knowledge information into the textual information from the underlying layer, so that the heterogeneous information of tokens and entities can be represented in a unified feature space. In addition, we denote the number of T-Encoder layers as N and the number of K-Encoder layers as M.

  Specifically, given a token sequence $\{w_1, \dots, w_n\}$ and its corresponding entity sequence $\{e_1, \dots, e_m\}$, the textual encoder first sums the token embedding, segment embedding, and positional embedding of each token to compute its input embedding, and then computes the lexical and syntactic features (in bold):

  $\{\mathbf{w}_1, \dots, \mathbf{w}_n\} = \text{T-Encoder}(\{w_1, \dots, w_n\})$   (1)

T-Encoder(·) is a multi-layer bidirectional Transformer encoder. Its implementation is exactly the same as in BERT, so a detailed description is omitted here.
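For readers who want to see the token-side input described above in code, here is a BERT-style sketch of the input embedding (a simplified illustration under standard BERT conventions, not the released ERNIE implementation; the default sizes are the usual BERT-base values):

```python
import torch
import torch.nn as nn

class TextInputEmbedding(nn.Module):
    """Sketch of the T-Encoder input: token + segment + position embeddings, summed."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(n_segments, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        return self.norm(x)  # then fed through the N Transformer layers of T-Encoder
```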

  After computing $\{\mathbf{w}_1, \dots, \mathbf{w}_n\}$, ERNIE adopts the knowledge encoder K-Encoder to inject knowledge information into the language representation. Specifically, we represent the entities with their entity embeddings (in bold) $\{\mathbf{e}_1, \dots, \mathbf{e}_m\}$, which are pre-trained by the effective knowledge embedding model TransE. Then, both $\{\mathbf{w}_1, \dots, \mathbf{w}_n\}$ and $\{\mathbf{e}_1, \dots, \mathbf{e}_m\}$ are fed into K-Encoder to fuse the heterogeneous information and compute the final output embeddings:

  $\{\mathbf{w}_1^o, \dots, \mathbf{w}_n^o\}, \{\mathbf{e}_1^o, \dots, \mathbf{e}_m^o\} = \text{K-Encoder}(\{\mathbf{w}_1, \dots, \mathbf{w}_n\}, \{\mathbf{e}_1, \dots, \mathbf{e}_m\})$   (2)

 $\{\mathbf{w}_1^o, \dots, \mathbf{w}_n^o\}$ and $\{\mathbf{e}_1^o, \dots, \mathbf{e}_m^o\}$ are used as features for specific tasks. Section 3.3 describes the knowledge encoder in detail.

 

3.3 Knowledge encoder

   As shown in Figure 2, the knowledge encoder K-Encoder consists of stacked aggregators, which not only encode tokens and entities but also fuse their heterogeneous features. In the i-th aggregator, the input token embeddings and entity embeddings from the preceding aggregator are fed into two separate multi-head self-attention (MH-ATT) layers:

  $\{\tilde{\mathbf{w}}_1^{(i)}, \dots, \tilde{\mathbf{w}}_n^{(i)}\} = \text{MH-ATT}(\{\mathbf{w}_1^{(i-1)}, \dots, \mathbf{w}_n^{(i-1)}\})$,
  $\{\tilde{\mathbf{e}}_1^{(i)}, \dots, \tilde{\mathbf{e}}_m^{(i)}\} = \text{MH-ATT}(\{\mathbf{e}_1^{(i-1)}, \dots, \mathbf{e}_m^{(i-1)}\})$   (3)

Then, the i-th aggregator adopts an information fusion layer for the mutual integration of the token and entity sequences, and computes the output embedding for each token and entity. For a token $w_j$ and its aligned entity $e_k = f(w_j)$, the information fusion process is as follows:

  $\mathbf{h}_j = \sigma(\tilde{W}_t^{(i)} \tilde{\mathbf{w}}_j^{(i)} + \tilde{W}_e^{(i)} \tilde{\mathbf{e}}_k^{(i)} + \tilde{\mathbf{b}}^{(i)})$
  $\mathbf{w}_j^{(i)} = \sigma(W_t^{(i)} \mathbf{h}_j + \mathbf{b}_t^{(i)})$
  $\mathbf{e}_k^{(i)} = \sigma(W_e^{(i)} \mathbf{h}_j + \mathbf{b}_e^{(i)})$   (4)

$\mathbf{h}_j$ is the inner hidden state that integrates the information of both the token and the entity. $\sigma(\cdot)$ is the non-linear activation function, usually GELU. For tokens without a corresponding entity, the information fusion layer computes the output embedding directly:

  $\mathbf{h}_j = \sigma(\tilde{W}_t^{(i)} \tilde{\mathbf{w}}_j^{(i)} + \tilde{\mathbf{b}}^{(i)})$
  $\mathbf{w}_j^{(i)} = \sigma(W_t^{(i)} \mathbf{h}_j + \mathbf{b}_t^{(i)})$   (5)

  For simplicity, the operation of the i-th aggregator is denoted as follows:

  $\{\mathbf{w}_1^{(i)}, \dots, \mathbf{w}_n^{(i)}\}, \{\mathbf{e}_1^{(i)}, \dots, \mathbf{e}_m^{(i)}\} = \text{Aggregator}(\{\mathbf{w}_1^{(i-1)}, \dots, \mathbf{w}_n^{(i-1)}\}, \{\mathbf{e}_1^{(i-1)}, \dots, \mathbf{e}_m^{(i-1)}\})$   (6)

 The token and entity output embeddings computed by the topmost aggregator are used as the final output embeddings of K-Encoder.
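The fusion equations above map fairly directly to code. Below is a simplified single-aggregator sketch in PyTorch (my own paraphrase of Equations 3-5; it omits the residual connections, layer normalization, attention masks, and feed-forward details a full implementation would need, and the sizes are placeholder values):

```python
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    """Simplified sketch of one K-Encoder aggregator (Equations 3-5)."""
    def __init__(self, hidden_t=768, hidden_e=100, heads_t=12, heads_e=4):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(hidden_t, heads_t, batch_first=True)
        self.entity_attn = nn.MultiheadAttention(hidden_e, heads_e, batch_first=True)
        # Information fusion (Eq. 4): project token and entity into a shared hidden state h_j.
        self.w_t_in = nn.Linear(hidden_t, hidden_t)
        self.w_e_in = nn.Linear(hidden_e, hidden_t)
        self.w_t_out = nn.Linear(hidden_t, hidden_t)
        self.w_e_out = nn.Linear(hidden_t, hidden_e)
        self.act = nn.GELU()

    def forward(self, tokens, entities, ent_index, has_entity):
        # tokens: (B, n, hidden_t); entities: (B, m, hidden_e)
        # ent_index: (B, n) long, index of each token's aligned entity (0 if none)
        # has_entity: (B, n) float, 1.0 where the token has an aligned entity, else 0.0
        # Eq. 3: separate multi-head self-attention over tokens and over entities.
        t, _ = self.token_attn(tokens, tokens, tokens)
        e, _ = self.entity_attn(entities, entities, entities)
        # Gather, for each token position, the embedding of its aligned entity.
        e_aligned = torch.gather(e, 1, ent_index.unsqueeze(-1).expand(-1, -1, e.size(-1)))
        # Eq. 4 (reducing to Eq. 5 when has_entity == 0): fuse token and aligned entity.
        h = self.act(self.w_t_in(t) + has_entity.unsqueeze(-1) * self.w_e_in(e_aligned))
        new_tokens = self.act(self.w_t_out(h))
        new_entities = self.act(self.w_e_out(h))  # per-token; would be gathered back to entity slots
        return new_tokens, new_entities
```

For tokens without an aligned entity, has_entity is 0 and the entity term drops out, matching Equation 5; in a full implementation the new entity states would also be scattered back to their positions in the entity sequence.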

 

3.4 Pre-training for injecting knowledge

  In order to inject knowledge into the language representation via informative entities, we propose a new pre-training task for ERNIE, which randomly masks some token-entity alignments and then requires the system to predict the corresponding entity based on the aligned token. Since our task is similar to training a denoising auto-encoder, we call this procedure the denoising entity auto-encoder (dEA). Considering that the full entity list $\mathcal{E}$ is too large for the softmax layer, we only require the system to predict entities from the given entity sequence rather than from the whole KG. Given the token sequence $\{w_1, \dots, w_n\}$ and its corresponding entity sequence $\{e_1, \dots, e_m\}$, we define the aligned entity distribution for token $w_i$ as:

  $p(e_j \mid w_i) = \dfrac{\exp(\text{linear}(\mathbf{w}_i^o) \cdot \mathbf{e}_j)}{\sum_{k=1}^{m} \exp(\text{linear}(\mathbf{w}_i^o) \cdot \mathbf{e}_k)}$   (7)

 Equation 7 is used to compute the cross-entropy loss function of dEA.
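A minimal PyTorch sketch of Equation 7 and the dEA cross-entropy (my own simplification; a real implementation would also apply the corruption strategies described in the next paragraph and proper masking):

```python
import torch
import torch.nn.functional as F

def dea_loss(token_out, entity_emb, gold_entity_idx, linear):
    """
    token_out:       (B, n, H)  final token embeddings w^o from K-Encoder
    entity_emb:      (B, m, He) embeddings e_1..e_m of the entities in the input sequence
    gold_entity_idx: (B, n)     index in [0, m) of each token's aligned entity, -100 if none
    linear:          an nn.Linear(H, He) mapping tokens into the entity space (Eq. 7)
    """
    # Dot product of the projected token with every candidate entity -> logits over m entities.
    logits = torch.matmul(linear(token_out), entity_emb.transpose(1, 2))  # (B, n, m)
    # Softmax + cross-entropy; tokens without an alignment (index -100) are ignored.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           gold_entity_idx.reshape(-1),
                           ignore_index=-100)
```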

  Considering that there may be some errors in the token-entity alignments, we perform the following operations for dEA: 1) with 5% probability, we replace the entity in a token-entity pair with another random entity, so that the model learns that a token can be aligned to a wrong entity; 2) with 15% probability, we mask the token-entity pair, so that the model learns that the entity alignment system may not extract all existing token-entity pairs; 3) in the remaining cases, we keep the token-entity pair unchanged, so that the model integrates the entity information into the token representation for better language understanding.
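To make the 5% / 15% / 80% split concrete, here is a toy sketch of how a single token-entity pair might be corrupted before training (an illustration only, not the authors' preprocessing code):

```python
import random

def corrupt_alignment(entity_id, all_entity_ids):
    """Return the entity fed to the model for one token-entity pair (None = alignment masked)."""
    r = random.random()
    if r < 0.05:
        # 5%: swap in a random wrong entity, so the model learns to detect bad alignments.
        return random.choice(all_entity_ids)
    elif r < 0.20:
        # 15%: mask the alignment, so the model learns that some alignments are missing.
        return None
    else:
        # 80%: keep the correct alignment unchanged.
        return entity_id
```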

  Similar to BERT, ERNIE also adopts the masked language model (MLM) and next sentence prediction (NSP) as pre-training tasks, so that ERNIE can capture lexical and syntactic information. The overall pre-training loss is the sum of the dEA, MLM, and NSP losses (note: dEA is the addition relative to BERT).
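In code this is simply the sum of the three task losses (variable names are my own):

```python
# Sketch of the overall pre-training objective: the three task losses are summed.
total_loss = loss_mlm + loss_nsp + loss_dea
```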

 

3.5 Fine-tuning for specific tasks

 

 Figure 3: Modifying the input sequence for specific tasks. To align the tokens among different types of input, we use dotted rectangles as placeholders. The colored rectangles denote specific mark tokens.

   As shown in Figure 3, ERNIE adopts different fine-tuning procedures for different NLP tasks. We take the first token as the marker that identifies the task. For some knowledge-driven tasks, we design special fine-tuning procedures:

  The relation classification task requires the system to classify the relation label of a given entity pair based on context. The most straightforward approach is to apply a pooling layer to the final output embeddings of the given entity mentions and to use the concatenation of the mention embeddings as the representation of the given entity pair for classification. In this paper, we design another approach, which modifies the input token sequence by adding two mark tokens to highlight the entity mentions. These extra mark tokens play a role similar to the position embeddings in conventional relation classification models. We also use the [CLS] token to indicate that this is a classification task. Note that we use [HD] and [TL] to mark the head entity and the tail entity respectively.

   The fine-tuning procedure for the entity typing task is a simplified version of that for relation classification. As previous typing models make full use of both the context embeddings and the entity mention embeddings, we argue that the modified input sequence with the mention mark token [ENT] can guide ERNIE to attentively combine the context information and the entity mention information.
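As a schematic illustration of the modified inputs (the mark-token names follow the paper, while the example sentence and tokenization are my own), the sequences for the two tasks would look roughly like this:

```python
# Relation classification: wrap the head/tail mentions with [HD] / [TL] mark tokens.
relation_input = ["[CLS]", "[HD]", "bob", "dylan", "[HD]", "wrote", "[TL]", "blowin",
                  "'", "in", "the", "wind", "[TL]", "in", "1962", "[SEP]"]

# Entity typing: wrap the single mention of interest with [ENT] mark tokens.
typing_input = ["[CLS]", "[ENT]", "bob", "dylan", "[ENT]", "wrote", "blowin",
                "'", "in", "the", "wind", "in", "1962", "[SEP]"]
```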

4 Experiments

  In this section, we present the details of pre-training ERNIE and fine-tuning it on five NLP datasets, which cover both knowledge-driven tasks and common NLP tasks.

4.1 Pre-training dataset

  The pre-training procedure is consistent with the pre-training methods for language models described in the existing literature. Since training ERNIE from scratch would be too costly, we adopt the BERT parameters released by Google to initialize the Transformer blocks used for encoding tokens. Because the pre-training is a multi-task procedure consisting of NSP, MLM, and dEA, we use English Wikipedia as our pre-training corpus and align the text with Wikidata. After converting the corpus into the formatted data for pre-training, the annotated input has nearly 4.5 billion subwords and 140 million entities, and sentences with fewer than 3 entities are discarded.

  Before pre-training ERNIE, we adopt the knowledge embeddings trained on Wikidata with TransE as the input embeddings of the entities. Specifically, we sample a sub-graph of Wikidata that contains 5,040,986 entities and 24,267,796 fact triples. The entity embeddings are kept fixed during training, and the parameters of the entity encoding modules are initialized randomly.
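In PyTorch terms, keeping the pre-trained TransE vectors fixed can be expressed roughly as follows (a sketch; transe_vectors and the file name are hypothetical placeholders for the pre-trained entity embeddings):

```python
import torch
import torch.nn as nn

# transe_vectors: a FloatTensor of shape (n_entities, entity_dim) from the TransE run.
transe_vectors = torch.load("transe_entity_embeddings.pt")  # hypothetical file name
# freeze=True keeps the entity embeddings fixed while the rest of ERNIE is trained.
entity_embedding = nn.Embedding.from_pretrained(transe_vectors, freeze=True)
```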

5 Conclusion

  In this paper, we propose ERNIE to incorporate knowledge information into language representation models. Accordingly, we propose the knowledge aggregator and the pre-training task dEA to fuse the heterogeneous information from both text and KGs. The experimental results show that ERNIE has better abilities than BERT in denoising distantly supervised data and in fine-tuning with limited data. There are three directions for future research: 1) injecting knowledge into feature-based pre-training models such as ELMo; 2) introducing diverse structured knowledge from databases such as ConceptNet, which differs from the world knowledge base Wikidata; 3) annotating larger real-world corpora heuristically to build larger pre-training data. This may lead to more comprehensive and effective language understanding.

  


Origin www.cnblogs.com/brisk/p/11592699.html