[Knowledge Graph] An Unsupervised Learning Framework for Knowledge Graph Entity Linking

Leifeng.com AI Technology Review note: Alibaba had 11 papers accepted at AAAI 2018, from the Machine Intelligence Technology Laboratory, Business Platform Division, Alimama Division, Artificial Intelligence Laboratory, and Cloud Retail Division. Of these, 5 were invited for Oral & Spotlight presentations at the main conference, and another author presented two papers as posters at the main conference. The papers cover adversarial learning, neural networks, training frameworks that improve the performance of lightweight networks, machine translation, chatbots, unsupervised learning frameworks, extremely low-bit neural networks, and other technical directions.

The following is a contributed walkthrough of a paper written jointly by Alibaba AI Lab, Microsoft Research, and the University of Illinois at Urbana-Champaign.


Main authors: Zexuan Zhong, Yong Cao, Mu Guo, Zaiqing Nie

Paper download address: https://102.alibaba.com/downloadFile.do?file=1518508273059/CoLink%20An%20Unsupervised%20Framework%20for%20User%20Identity%20Linkage.pdf

Summary

Linking information about the same entity across several sub-knowledge graphs (also known as the User Identity Linkage (UIL) problem) is critical for many applications. The entity linking problem poses two main difficulties.

First, it is very expensive to collect human-linked entity-information pairs (user pairs) as training data.

Second, entity attributes of different sub-knowledge graphs usually have very different definitions and formats, which makes attribute alignment very difficult.

In this paper we propose CoLink, a general unsupervised framework for the entity linking problem. CoLink uses a co-training algorithm that operates on two independent models (attribute-based and relation-based) simultaneously, and iteratively lets the two models reinforce each other in an unsupervised fashion. We also show that "sequence-to-sequence" learning is a very effective attribute-based model, because it can treat the attribute alignment problem as a machine translation problem. We applied CoLink to the entity linking task of mapping employees in a corporate network to their LinkedIn profiles. Experimental results show that CoLink outperforms the previous state-of-the-art unsupervised methods by more than 20% on the F1 score.

Introduction

Linking information about the same entity across different sub-knowledge graphs (also known as the User Identity Linkage (UIL) problem) often yields a better and deeper understanding of that entity, which in turn often leads to better business intelligence.

Although machine learning algorithms have been widely used in entity linking problems, the labeling of training data is not simple. First, finding linked entity information pairs is extremely time-consuming, as it requires searching all sub-knowledge graphs and carefully evaluating a large number of candidate pairs. In addition, this work requires human annotators to have extensive domain knowledge. Second, due to privacy protection reasons, not all knowledge graph entity data can be made available to human annotators, especially when these data come from personal social networks or intra-enterprise networks.

Linking entities between two sub-knowledge graphs requires careful comparison of entity attributes in the two sub-graphs, such as name, job title, location, etc. Therefore, the alignment of attribute values is critical to the entity linking problem. However, traditional string similarity functions have two shortcomings:

They offer no one-size-fits-all way to handle variations of the same attribute across different entity networks.
They cannot discover implicit attribute correspondences.

In this paper, we propose CoLink, a general-purpose unsupervised framework for entity linking problems. The entity data in the knowledge graph can be naturally divided into features from two independent perspectives: attributes and relationships, which perfectly fits the requirements of co-training algorithms.

CoLink uses two separate models: an attribute-based model and a relationship-based model. Both are binary classifiers that determine whether two entities can be linked, and both can be built on any machine learning or heuristic algorithm. Therefore, as long as the knowledge graph data contains attributes and relationships, CoLink can be applied to its entity linking problem.

Going a step further, we use a "sequence-to-sequence" learning algorithm to implement CoLink's attribute-based model, which provides a general approach to attribute alignment between different entity networks. Instead of treating attribute alignment as a string similarity comparison, we try to "translate" attribute values from one "language" (the particular style of one network) into another "language". Abbreviations, shorthand forms, synonyms, and even implicit correspondences can be considered special cases of translation. We chose the "sequence-to-sequence" algorithm because it has proven effective on machine translation tasks. Specifically, the "sequence-to-sequence" approach has two advantages for CoLink. First, it learns word-level and sequence-level mappings automatically, with almost no manual feature extraction. Second, it requires only positive examples (aligned attribute pairs) as training data, which removes the need to sample negative examples.

We applied CoLink to the task of linking the same users across social networks, where we tried to link employees in a corporate network with their LinkedIn profiles. We further compare CoLink with previous state-of-the-art unsupervised methods. Experimental results show that CoLink outperforms the previous state-of-the-art unsupervised methods by 20% overall on the F1 score. Our contributions are summarized as follows:

We are the first to apply the co-training algorithm to the entity linking problem in knowledge graphs. Because entity attributes and entity relationships in entity networks are naturally separated, co-training is a natural fit that comes at no extra cost.
We are the first to model the attribute alignment problem as machine translation. We use a sequence-to-sequence approach as the basis of the attribute-based model, which generalizes well with little to no feature extraction.
We conduct extensive experiments comparing our proposed method with previous state-of-the-art unsupervised methods, enumerating different settings and models, and the results demonstrate the effectiveness of our proposed solution.

CoLink

Problem definition

The entity linking problem on knowledge graphs is defined as follows. The input consists of a source knowledge graph and a target knowledge graph. The output is a set of entity link pairs, each pairing an entity in the source graph with the entity in the target graph that it is linked to.

CoLink framework

The CoLink framework is based on the co-training algorithm shown in Algorithm 1. We define two different models in this framework: an attribute-based model f_att and a relation-based model f_rel. Both models make binary classification predictions, classifying a given set of entity pairs as positive (linked) or negative (unlinked). The co-training algorithm strengthens the two models iteratively. In each co-training iteration, both models are retrained on the set of linked pairs S. The high-quality linked pairs produced by the two models are then merged into S for the next iteration, until S converges. At the very beginning, an initial set of linked pairs (the seed set) is needed to start the co-training process; it can be generated by a set of seed rules. Depending on the algorithm each model uses, training the attribute-based and relation-based models may require negative examples. The process of sampling negative examples is not shown in Algorithm 1.


Algorithm 1: Co-training algorithm in CoLink

This co-training algorithm does not modify linked pairs generated in previous iterations, so errors introduced in earlier iterations are never fixed later. An alternative is to make a final revision after co-training has converged: S is reconstructed using the final models obtained from the co-training process.
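The loop described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the `Trainer` interface, the `max_iters` cap, and the two toy models (exact name match and "one linked neighbor pair") are assumptions made only so the example runs.

```python
from typing import Callable, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source entity id, target entity id)
Predictor = Callable[[Iterable[Pair]], Set[Pair]]
# Hypothetical interface: a trainer takes the linked set S and returns a
# predictor that proposes high-confidence linked pairs among candidates.
Trainer = Callable[[Set[Pair]], Predictor]


def co_train(seed: Set[Pair], candidates: Set[Pair],
             train_att: Trainer, train_rel: Trainer,
             max_iters: int = 50) -> Set[Pair]:
    linked = set(seed)  # S starts as the seed set
    for _ in range(max_iters):
        remaining = candidates - linked
        f_att = train_att(linked)  # retrain both models on S
        f_rel = train_rel(linked)
        new_pairs = (f_att(remaining) | f_rel(remaining)) - linked
        if not new_pairs:  # S has converged
            break
        linked |= new_pairs  # pairs from earlier iterations are never revised
    return linked


# Toy data and toy models, purely illustrative.
NAMES_SRC = {"s1": "ann lee", "s2": "bob", "s3": "cy"}
NAMES_TGT = {"t1": "ann lee", "t2": "robert", "t3": "cy"}
NBR_SRC = {"s2": ["s1"]}
NBR_TGT = {"t2": ["t1"]}


def train_att(linked: Set[Pair]) -> Predictor:
    # Stand-in attribute model: exact name match (ignores S).
    def predict(pairs: Iterable[Pair]) -> Set[Pair]:
        return {(s, t) for s, t in pairs if NAMES_SRC[s] == NAMES_TGT[t]}
    return predict


def train_rel(linked: Set[Pair]) -> Predictor:
    known = frozenset(linked)  # snapshot of S at training time

    # Stand-in relation model: link (s, t) if a neighbor pair is in S.
    def predict(pairs: Iterable[Pair]) -> Set[Pair]:
        return {(s, t) for s, t in pairs
                if any((ns, nt) in known
                       for ns in NBR_SRC.get(s, ())
                       for nt in NBR_TGT.get(t, ()))}
    return predict


seed = {("s1", "t1")}
candidates = {("s1", "t1"), ("s2", "t2"), ("s2", "t1"),
              ("s3", "t3"), ("s3", "t2")}
print(sorted(co_train(seed, candidates, train_att, train_rel)))
```

In this toy run the relation model contributes a pair that the attribute model cannot find (and vice versa), which is the mutual reinforcement the framework relies on.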

Seed Rules

Starting the co-training algorithm requires a small seed set of linked entity pairs. A simple and straightforward way to obtain a seed set is to generate it from human-designed rules, which we call seed rules. These seed rules can take into account the following facts from the target knowledge graph:

Entity name uniqueness
Entity attribute value mapping
Entity relationship propagation

The selection of seed rules will directly affect the performance of CoLink.
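As an illustration of the first kind of rule, here is a minimal sketch of a name-uniqueness seed rule. The function name, the normalization, and the requirement that the name be unique in both graphs are assumptions for this example; the article does not specify the exact rules used.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def normalize(name: str) -> str:
    # Lowercase and collapse whitespace so trivially different spellings match.
    return " ".join(name.lower().split())


def seed_by_unique_name(src_names: Dict[str, str],
                        tgt_names: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Link entities whose normalized name occurs exactly once in each graph."""
    src_count = Counter(normalize(n) for n in src_names.values())
    tgt_count = Counter(normalize(n) for n in tgt_names.values())
    tgt_by_name = {normalize(n): t for t, n in tgt_names.items()}
    return {(s, tgt_by_name[normalize(n)])
            for s, n in src_names.items()
            if src_count[normalize(n)] == 1 and tgt_count[normalize(n)] == 1}


src = {"s1": "Ann Lee", "s2": "Bob Ray", "s3": "bob ray"}
tgt = {"t1": "ann  lee", "t2": "Bob Ray"}
print(seed_by_unique_name(src, tgt))  # only the unambiguous name is linked
```

Ambiguous names are deliberately left out of the seed set, since errors in the seeds are carried through every later iteration.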

Attribute-based model

The attribute-based model predicts linked entity pairs by considering the attributes of the entities. It can use any classification algorithm. In this paper, we tried two different machine learning algorithms: "sequence-to-sequence" learning and support vector machines (SVM).

Sequence-to-sequence

Because attributes vary in many different ways, traditional string similarity methods perform poorly at attribute alignment. Since attribute alignment is similar to a machine translation problem, we adopt a "sequence-to-sequence" approach. Abbreviations, shorthand forms, synonyms, and even implicit links can be considered special cases of translation.

We adopt the "sequence-to-sequence" network structure proposed by Sutskever, Vinyals, and Le (2014). The network consists of two parts: a sequence encoder and a sequence decoder. Both the encoder and the decoder use a deep long short-term memory (LSTM) architecture. The encoder deep LSTM reads the input sequence and produces a representation vector for each word position. These vectors are then fed into an attention layer, which produces an overall representation of the input sequence that takes the position of the output word into account. The hidden state of the decoder deep LSTM is then fed into a fully connected layer (whose output dimension equals the vocabulary size) to predict the output word.

We train a sequence-to-sequence network using linked attribute-value pairs, following previous work. However, instead of using the network to predict the output sequence, we use the learned "sequence-to-sequence" network in CoLink for binary classification. First, we use the network to find the probability of matching a pair of attributes. We then choose a matching probability threshold above which entity pairs are considered linked.
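The idea of turning a translation model into a binary classifier can be sketched as below. The trained network is replaced here by a tiny hand-written lexicon (`LEXICON`, `toy_model`), which is purely an assumption so the example runs without a deep learning library; only the scoring and thresholding logic is the point. How the paper normalizes the score (for example by sequence length) is not stated in this article, so the plain joint probability is used.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: given the source tokens and the target prefix
# generated so far, return a probability for each possible next token.
NextTokenDist = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def match_probability(model: NextTokenDist,
                      source: Sequence[str],
                      target: Sequence[str]) -> float:
    """P(target | source) as the product of per-token probabilities."""
    log_p = 0.0
    for i, token in enumerate(target):
        p = model(source, target[:i]).get(token, 0.0)
        if p == 0.0:
            return 0.0
        log_p += math.log(p)  # sum logs to avoid underflow on long sequences
    return math.exp(log_p)


def is_linked(model: NextTokenDist, source: Sequence[str],
              target: Sequence[str], threshold: float) -> bool:
    return match_probability(model, source, target) >= threshold


# Stand-in for a trained network: word-level "translation" probabilities.
LEXICON = {
    "sr": {"senior": 0.9, "sir": 0.1},
    "swe": {"software": 0.5, "engineer": 0.5},
}


def toy_model(source: Sequence[str], prefix: Sequence[str]) -> Dict[str, float]:
    # Averages the per-source-word distributions and ignores the prefix.
    dist: Dict[str, float] = {}
    for word in source:
        for tok, p in LEXICON.get(word, {}).items():
            dist[tok] = dist.get(tok, 0.0) + p / len(source)
    return dist


src = ["sr", "swe"]
print(is_linked(toy_model, src, ["senior", "software", "engineer"], 0.01))
print(is_linked(toy_model, src, ["sales", "manager"], 0.01))
```

A string similarity function would score "sr swe" against "senior software engineer" poorly, whereas a learned translation model can assign it a high conditional probability.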

Support Vector Machines

Traditional classification algorithms such as SVM can also be used in the attribute-based model. Unlike the "sequence-to-sequence" method, which requires only positive training samples (linked pairs), an SVM also requires negative examples. Because the space of entity pairs is very large, positive examples are very sparse in it. In each co-training iteration, given the linked pairs, we therefore also select an equal number of random entity pairs as negatives.
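A minimal sketch of this negative sampling step follows. The function name and the explicit exclusion of already-linked pairs are assumptions for illustration; because positives are so sparse, a uniformly random pair is almost always a true negative even without that check.

```python
import random
from typing import List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def sample_negatives(linked: Set[Pair],
                     src_ids: Sequence[str],
                     tgt_ids: Sequence[str],
                     rng: random.Random) -> List[Pair]:
    """Draw as many random (source, target) pairs as there are linked pairs."""
    total = len(src_ids) * len(tgt_ids)
    if total - len(linked) < len(linked):
        raise ValueError("not enough unlinked pairs to sample from")
    negatives: Set[Pair] = set()
    while len(negatives) < len(linked):
        pair = (rng.choice(src_ids), rng.choice(tgt_ids))
        if pair not in linked:  # skip pairs currently believed to be linked
            negatives.add(pair)
    return sorted(negatives)


linked = {("s1", "t1"), ("s2", "t2")}
negs = sample_negatives(linked, ["s1", "s2", "s3"], ["t1", "t2", "t3"],
                        random.Random(0))
print(negs)
```

The positives and the sampled negatives together form a balanced training set for the classifier in that iteration.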

Relationship-based model

The relationship-based model uses only entity relationships to predict linked entity pairs. The problem of finding equivalent nodes in two networks based only on relationships is often referred to as the network alignment problem.

Relation-based models can use any relation-based network alignment model. Because the focus of this paper is on co-training algorithms and "sequence-to-sequence" attribute-based models, we use in this paper a simple heuristic model based on the assumption that if two entities from different networks both have a large number of linked entities that are related to each other, then the two entities are likely to be linked as well.
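The heuristic can be sketched as counting, for each candidate pair, how many of their neighbors are already linked to each other. The `min_shared` cutoff and the dictionary-based graph representation are assumptions for this example; the article only says "a large number".

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def linked_neighbor_count(s: str, t: str,
                          src_nbrs: Dict[str, Set[str]],
                          tgt_nbrs: Dict[str, Set[str]],
                          linked: Set[Pair]) -> int:
    """Number of (neighbor of s, neighbor of t) pairs that are already linked."""
    t_neighbors = tgt_nbrs.get(t, set())
    return sum(1 for ns in src_nbrs.get(s, set())
               for nt in t_neighbors if (ns, nt) in linked)


def predict_by_relations(candidates: Set[Pair],
                         src_nbrs: Dict[str, Set[str]],
                         tgt_nbrs: Dict[str, Set[str]],
                         linked: Set[Pair],
                         min_shared: int = 2) -> Set[Pair]:
    # min_shared is a hypothetical cutoff standing in for "a large number".
    return {(s, t) for s, t in candidates
            if (s, t) not in linked
            and linked_neighbor_count(s, t, src_nbrs, tgt_nbrs, linked) >= min_shared}


src_nbrs = {"s3": {"s1", "s2"}, "s4": {"s1"}}
tgt_nbrs = {"t3": {"t1", "t2"}, "t4": {"t1"}}
linked = {("s1", "t1"), ("s2", "t2")}
candidates = {("s3", "t3"), ("s4", "t4"), ("s3", "t4")}
print(predict_by_relations(candidates, src_nbrs, tgt_nbrs, linked))
```

Only the pair whose neighborhoods overlap in two already-linked pairs passes the cutoff; pairs supported by a single linked neighbor are left for later iterations.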

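Experiments

Our experiments compare CoLink with the current state-of-the-art unsupervised methods. We also study the choice of seed rules and of the linking probability threshold, to better understand how they may affect the linking results.

Dataset

We chose a real-world dataset containing two social networks to evaluate CoLink. One of the social networks is LinkedIn; the other is an internal enterprise user network.

Table 1: Overview of the dataset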

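Candidate entity pair selection

We built a candidate entity pair filter that removes a large number of entity pairs that cannot possibly be linked. The candidate filter considers the following attributes:

Entity name
Organization

After filtering, we obtained 758,046 candidate entity pairs, which cover all linked pairs in the test set.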
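A candidate filter of this kind can be sketched as a blocking step over the two attributes. The article does not say how names and organizations are compared, so the rule used here (same organization after lowercasing, and at least one shared name token) and the record layout are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Record = Dict[str, str]  # hypothetical layout: {"name": ..., "org": ...}


def candidate_pairs(src: Dict[str, Record],
                    tgt: Dict[str, Record]) -> Set[Tuple[str, str]]:
    """Keep pairs with the same organization and at least one shared name token."""
    # Index target entities by (organization, name token) so that we never
    # enumerate the full cross product of source and target entities.
    index = defaultdict(set)
    for t, rec in tgt.items():
        org = rec["org"].strip().lower()
        for token in rec["name"].lower().split():
            index[(org, token)].add(t)

    pairs: Set[Tuple[str, str]] = set()
    for s, rec in src.items():
        org = rec["org"].strip().lower()
        for token in rec["name"].lower().split():
            for t in index.get((org, token), ()):
                pairs.add((s, t))
    return pairs


src = {"s1": {"name": "Ann Lee", "org": "Contoso"}}
tgt = {
    "t1": {"name": "Ann M. Lee", "org": "contoso"},
    "t2": {"name": "Ann Lee", "org": "Other"},
    "t3": {"name": "Bob Ray", "org": "Contoso"},
}
print(candidate_pairs(src, tgt))
```

A filter like this trades a small risk of dropping true links for a much smaller set of pairs to score, which is why the article checks that the filtered set still covers every linked pair in the test set.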

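Sequence-to-sequence

The "sequence-to-sequence" network in our experiments consists of a deep LSTM encoder with an attention network and a deep LSTM decoder. Both the encoder and the decoder have 2 stacked LSTM layers, because we found that more than 2 layers in the encoder or decoder brings no further improvement on the entity linking task. Each LSTM has a recurrent unit size of 512. Every word is first converted into a 512-dimensional embedding vector before being fed into the encoder and decoder. The training time of the "sequence-to-sequence" model depends on the size of the training data. On average, training the model on 100,000 attribute pairs takes 30 minutes on a single Tesla K40 GPU.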

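Seed rules

To test the robustness of CoLink, we tried the following 3 sets of seed rules:

A coarsely tuned set
A finely tuned set
A noisy set

Figure 1: Comparison of seed sets; P/R/F1 trends after the co-training iterations begin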

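Co-training

We apply co-training by separating the relationship features from the attribute features. Both the attribute-based model and the relationship-based model can find new pairs in each iteration and thereby reinforce each other. Figure 2 shows statistics on the linked pairs obtained by each model. In this task, the attribute-based model generates more pairs than the relationship-based model, because we do not have complete LinkedIn relationship data. We crawled the "People Also Viewed" lists from public LinkedIn profiles, which provide fewer than 10 relationships per user.

Figure 2: Growth of linked pairs over co-training iterations, starting from the coarsely tuned seed pairs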

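Probability threshold

Figure 3 compares different thresholds. A stricter threshold (a smaller percentage) yields higher precision and relatively lower recall. The threshold we chose for this task is 95%.

Figure 3: Comparison of "sequence-to-sequence" linking probability thresholds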

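Comparison results

Table 2: Performance comparison of different methods

Attribute alignment

By using the "sequence-to-sequence" approach, CoLink can handle attribute alignment problems that are hard to address with traditional string similarity functions. Table 3 shows some selected examples of attributes that should be aligned, together with the similarity scores from different methods (all in the interval [0, 1]). With the help of "sequence-to-sequence" learning, this approach can easily be applied to other entity matching tasks with almost no feature extraction.

Table 3: Selected attribute examples and their similarity scores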
