Dive into BERT: Language Model and Knowledge

A few words up front

Recently I have mainly been reading things related to knowledge, including reviewing some knowledge representation models and methods for injecting external knowledge into large-scale language models as icing on the cake. If you are interested, you can take a peek at the previous articles. Today I will take knowledge as the starting point and look more deeply into the recently popular pre-trained models, based on the following papers:

√ Language Models as Knowledge Bases?

√ Linguistic Knowledge and Transferability of Contextual Representations.

√ What does BERT learn about the structure of language?

There is a lot of other interesting research in this area; you can find it in the PLMpapers repository compiled by Tsinghua University.

1. Language Models as Knowledge Bases? (EMNLP 2019)

Can a language model be regarded as a kind of knowledge base?

A knowledge base is an effective solution for storing and using structured knowledge. When actually building or extending a knowledge base, we rely on a series of complex NLP components such as entity extraction, coreference resolution, entity linking, relation extraction, and so on. These pipeline components inevitably require supervised data (everyone knows how hard it is to label data in NLP), and the process is prone to error propagation. A language model, in contrast, is fed a huge amount of data (unsupervised) during pre-training, from which it can learn useful relational knowledge, and a [MASK]-style objective like BERT's is particularly well suited to extracting the kind of relational knowledge a knowledge base stores.


Therefore, the authors compare pre-trained language models against knowledge bases built with traditional relation extraction methods, to explore the extent to which pre-trained models store factual and commonsense knowledge. Specifically, they ask:

  • How much relational knowledge do the pre-trained language models store?

  • How does this differ across types of knowledge, such as facts about entities, common sense, and knowledge needed for question answering?

  • How do language models used without fine-tuning compare with traditional pipelines that automatically extract knowledge to build a knowledge base?

To this end, the authors propose the LAMA (LAnguage Model Analysis) probe to investigate these questions. The knowledge used for testing covers relations between entities in Wikidata, commonsense knowledge between ConceptNet concepts, and the knowledge needed to answer natural language questions in SQuAD. The idea is that if a language model can predict the correct element of a triple in a cloze-filling setting, it has learned that piece of knowledge. For example, for the triple (Dante, born-in, Florence), if the language model fills the blank in "Dante was born in ____" with Florence, the prediction counts as correct.
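
To make the cloze-style querying concrete, here is a minimal sketch (not the authors' LAMA code) that asks BERT to fill in the (Dante, born-in, Florence) triple, assuming the HuggingFace transformers library and the public bert-base-cased checkpoint:

```python
# Minimal sketch of a LAMA-style cloze query against BERT.
# Assumes: pip install transformers torch; the model name is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# Query the (Dante, born-in, Florence) triple as a cloze statement.
for pred in fill_mask("Dante was born in [MASK].", top_k=5):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
# The fact counts as recalled at P@1 if "Florence" is the top prediction.
```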

Language models

The figure below lists the language models evaluated in the paper (haha, I'm sure everyone is familiar with them; ps. the first one is a multi-layer gated convolutional model implemented with the fairseq library). The evaluation uses rank-based metrics: precision at k (P@k) is computed for each relation and then averaged over all relations.

[Figure: language models evaluated with the LAMA probe]
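
As a small sketch of the rank-based evaluation (not the paper's code): P@k for a relation is simply the fraction of its facts whose gold object appears among the model's top-k predictions, and these per-relation scores are then averaged.

```python
def precision_at_k(ranked_predictions, gold_answers, k=10):
    """Fraction of queries whose gold answer appears in the model's top-k list."""
    hits = sum(gold in preds[:k]
               for preds, gold in zip(ranked_predictions, gold_answers))
    return hits / len(gold_answers)

# Toy usage with two cloze queries (illustrative values only):
ranked = [["Florence", "Rome", "Venice"], ["Paris", "Lyon", "Nice"]]
gold = ["Florence", "Nice"]
print(precision_at_k(ranked, gold, k=1))  # 0.5 -- only the first query is a top-1 hit
print(precision_at_k(ranked, gold, k=3))  # 1.0 -- both gold answers are in the top 3
```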

Knowledge sources

To compare how well different models store knowledge, the probe of course needs to cover as many realistic sources of knowledge as possible:

  • Google-RE: ~60K facts; the paper uses three of the five relation types, because the other two mostly have multi-token objects;

  • T-REx: 40 Wikidata relation types are selected, each with about 1,000 facts;

  • ConceptNet: 16 relation types from the English part of the data;

  • SQuAD: 305 context-independent questions, converted into cloze form. For example, "Who developed the theory of relativity?" ----> "The theory of relativity was developed by _."

Conclusions

[Figure: LAMA results (P@k) for each knowledge source and model]

  • The BERT-Large model captures accurate relational knowledge, comparable to a knowledge base built with an off-the-shelf relation extractor and an oracle-based entity linker;

  • Factual knowledge is preserved well in the pre-trained models, but performance on certain relations (N-to-M relations) is very poor;

  • BERT-Large consistently outperforms the other language models at retrieving factual and commonsense knowledge, and is also more robust to how queries are phrased;

  • BERT-Large achieves a remarkable result on open-domain QA, with a P@10 of 57.1%, versus 63.5% for a knowledge base built with a task-specific supervised relation extraction system. As the figure above shows, though, its P@1 is painful to look at....

So, overall, large-scale pre-trained models can indeed hold their own against knowledge bases (somewhat reminiscent of the idea behind GPT-2), especially since none of the models in the comparison were fine-tuned for a specific task or domain. Compared with a knowledge base, a language model used as knowledge storage is mainly attractive because it is flexible and convenient, easily scales to more data, and needs no manually annotated corpora. At the same time, many problems remain, such as the N-to-M relations.

Code Here

PS. In the last few days a new study, BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA, has pushed back against the view of Language Models as Knowledge Bases. It argues that the performance of models such as BERT comes from reasoning over entity names (surface forms), e.g. guessing that a person with an Italian-sounding name is a native Italian speaker, rather than from stored factual knowledge. In their experiments, BERT's precision drops sharply once the easily guessable facts are filtered out.

2. Linguistic Knowledge and Transferability of Contextual Representations (NAACL 2019)

Contextual word representations (CWRs) from pre-trained models achieve good performance across a wide range of NLP tasks, which suggests that they encode the necessary knowledge. The authors therefore take three representative language models, ELMo, GPT, and BERT, and design 17 different probing tasks to study what linguistic knowledge the models encode and how transferable it is.
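
To make the probing setup concrete, here is a minimal sketch of a linear probe trained on frozen contextual representations; the model name, toy sentences, and tags are only illustrative, and the paper of course uses standard datasets (PTB, UD-EWT, etc.) and its own probing suite:

```python
# Minimal linear-probe sketch: a logistic-regression classifier trained on
# frozen BERT token representations. The labelled data here is a toy example.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
model.eval()

# Toy POS-tagged sentences: one coarse tag per whitespace-separated word.
data = [("The cat sat".split(), ["DET", "NOUN", "VERB"]),
        ("A dog barked".split(), ["DET", "NOUN", "VERB"])]

features, labels = [], []
for words, tags in data:
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768), frozen
    word_ids = enc.word_ids()                        # subword -> word alignment
    for i, tag in enumerate(tags):
        features.append(hidden[word_ids.index(i)].numpy())  # first subword of word i
        labels.append(tag)

# The probe itself: the contextual encoder is never fine-tuned.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```

If such a probe performs well while the encoder stays frozen, the relevant information must already be present in the representations; that is the logic behind all 17 tasks.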

Tasks

Let's first look at what the tasks are.

Token Labelling

  • part-of-speech (POS): tests whether basic syntax is captured. Datasets: PTB and UD-EWT;

  • CCG supertagging (CCG): evaluates the fine-grained information the vectors carry about a word's syntactic role in context. Dataset: CCGbank;

  • syntactic constituency ancestor tagging: probes the hierarchical syntactic information in the vectors. Dataset: PTB;

  • semantic tagging: probes semantic information;

  • Preposition supersense disambiguation: classifies the lexical semantic contribution of a preposition and the semantic role or relation it mediates;

  • event factuality: labels phrases with the factuality of the events they describe.

Segmentation

  • Syntactic chunking (Chunk): checks whether span and boundary information is encoded;

  • Named entity recognition (NER): checks whether entity information is encoded;

  • etc.

Pairwise Relations

These tasks probe information about the relations between pairs of words.

  • arc prediction

  • arc classification

  • etc.

Results and discussion

[Figure: probing performance of each model across the tasks]

The figure above shows how each model performs on the various probing tasks. Let's go through it in light of the questions the authors raise at the beginning:

What linguistic information do CWRs capture, and what do they miss?

  • On every task, contextual word representations outperform fixed word vectors (GloVe);

  • Among the ELMo-based models, the transformer variant performs worst;

  • Looking across tasks, BERT-based > ELMo-based > GPT-based models, which suggests that bidirectional encoding captures more information; in addition, GPT is pre-trained on lowercased text, which further limits its performance on case-sensitive tasks such as NER;

  • CWRs do not accurately capture transferable information about entities and coreference in the input.

How and why does transferability vary across the encoder's representation layers?

  • For all models, there is a trade-off between being general and being task-specific;

  • For the LSTM-based model (ELMo), earlier layers are more general and later layers are more task-specific;

  • For the transformer-based models (GPT/BERT), no single layer is consistently the most general; it varies with the task, but it is usually a middle layer (see the layer-extraction sketch below).

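Here is a minimal sketch of how one representation per layer can be pulled out of BERT for this kind of layer-wise probing (assuming the HuggingFace transformers library; the layer indexing and example sentence are illustrative):

```python
# Minimal sketch: extract one representation per layer so that each layer can
# be probed separately (e.g. with the linear probe sketched earlier).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

enc = tokenizer("Colorless green ideas sleep furiously.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc).hidden_states  # embeddings + 12 layers = 13 tensors

for layer_idx, layer in enumerate(hidden_states):
    print(layer_idx, tuple(layer.shape))        # (1, seq_len, 768) at every layer

# Probing layer l only would use: features_l = hidden_states[l][0]
```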

How does the choice of pre-training task affect the linguistic knowledge captured and its transferability?


After studying ELMo-style models pre-trained on different tasks, the authors find that:

  • Representations learned with bidirectional language modelling are, on average, the most transferable;

  • A model pre-trained on a task similar to the target task performs better on that specific task;

  • The more pre-training data, the better the model.

Code Here

3. What does BERT learn about the structure of language? (ACL 2019)

Reading the paper above reminded me of this one, which had been gathering dust in my to-read list. Its main goal is to explore what kind of knowledge each layer of BERT encodes.

Phrasal Syntax

BERT's lower layers capture phrase-level information: span representations computed from the lower layers cluster according to phrase type, and this phrasal information is gradually diluted in the higher layers.

Probing Tasks

To further understand the different types of linguistic knowledge captured by each BERT layer, the authors design ten sentence-level probing tasks in three categories: Surface, Syntactic, and Semantic. They find that information from the shallower layers is more effective for the Surface tasks, information from BERT's middle layers works best for the Syntactic tasks, and information from the higher layers is most effective for the Semantic tasks. In addition, the authors find that the higher layers of an untrained BERT beat the trained BERT on the sentence-length prediction task, which suggests that untrained BERT mostly captures surface-level features, whereas trained BERT acquires more complex knowledge at the cost of surface information.


Subject-Verb Agreement

This is a syntax-level task: predicting whether the verb form agrees with the (singular or plural) subject. In general, the more nouns sit between the subject and the verb, the harder the task. From the results in the table below (each column is the number of nouns between subject and verb, each row is a network layer) we can see that:

  • The middle layers encode this information best, which is consistent with the probing results in the previous part and with the previous paper discussed in this post;

  • The deeper the network layer, the better it handles long-range dependencies.

[Table: subject-verb agreement accuracy; columns are the number of nouns between subject and verb, rows are network layers]
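
A quick way to poke at subject-verb agreement with a masked language model is to compare the probabilities BERT assigns to the competing verb forms; the sentence below and the is/are comparison are illustrative, not the paper's stimuli or its per-layer setup:

```python
# Minimal agreement check: one attractor noun ("cabinet") sits between the
# plural subject ("keys") and the verb slot.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")
sentence = "The keys to the cabinet [MASK] on the table."

# Restrict scoring to the two competing verb forms.
scores = {p["token_str"]: p["score"]
          for p in fill_mask(sentence, targets=["is", "are"])}
print(scores)  # agreement is handled correctly if P("are") > P("is")
```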

Compositional Structure

The authors use Tensor Product Decomposition Networks (TPDN) to explore the compositional structure learned by the BERT model. They find that the attention mechanism captures tree-like structure over the input, and that dependency trees can be reconstructed from the weights of the attention heads.

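As a rough illustration of working with these attention weights, here is a minimal sketch that takes, for each token, the position a chosen head attends to most as a crude stand-in for a dependency arc; the paper instead feeds the attention weights into a maximum-spanning-tree algorithm, which is omitted here:

```python
# Minimal sketch: pull out the attention matrices of one head and treat each
# token's most-attended position as a crude "head word" guess.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_attentions=True)
model.eval()

enc = tokenizer("The dog chased the cat.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions     # 12 tensors of shape (1, 12, seq, seq)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
layer, head = 5, 3                           # arbitrary layer/head for illustration
weights = attentions[layer][0, head]         # (seq_len, seq_len)

for i, tok in enumerate(tokens):
    print(f"{tok:>8} -> {tokens[weights[i].argmax().item()]}")
```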

Code here


For any model, running it according to the steps in the official README is only the most basic step; it also needs to be analyzed in depth. On the one hand, this helps us understand why it works; on the other hand, it exposes its limitations, so that we can go on to research better and more effective methods. Enjoy~

That's all ~
2019.11.16




This article was originally published by the author on the AINLP official account.





