Entity Linking Open-Source Tool: dexter2

Copyright notice: this is an original article by the author and may not be reposted without permission. https://blog.csdn.net/qq_37043191/article/details/81906433

Entity Linking

In natural language processing, entity linking, also known as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), or named-entity normalization (NEN), is the task of determining the identity of entities mentioned in text. For example, given the sentence "Paris is the capital of France", entity linking should determine that "Paris" refers to the city of Paris, not to Paris Hilton or any other entity that could be called "Paris". Likewise, for the sentence "James Bond is cool", we expect to obtain "James_Bond" as the fully linked name.

Dexter2

Dexter is an open-source entity linking framework that links mentions in text to articles in the (English) Wikipedia.

Download

dexter on github
Both precompiled binaries and the source code are available there; this article uses the prebuilt bin file directly.
On Windows, run the following in the extracted directory:

java -Xmx4000m -jar dexter-2.1.0.jar

Or, on Linux:

wget http://hpc.isti.cnr.it/~ceccarelli/dexter2.tar.gz
tar -xvzf dexter2.tar.gz
cd dexter2
java -Xmx4000m -jar dexter-2.1.0.jar

The service now listens on local port 8080. On Windows, or on Linux with a GUI, open a browser at http://localhost:8080/dexter-webapp/dev/ to browse the API. If dexter is running on a remote server, you can instead fetch results from the URLs with Python (see below).

Usage

All APIs are documented both locally and on the official site, each with runnable examples. This article walks through a few of them.

1. annotate, spot

  • annotate
    Performs the entity linking on a given text, annotating maximum n entities.

  • spot
    It only performs the first step of the entity linking process, i.e., find all the mentions that could refer to an entity

Both operate on the words of a query. The difference is that spot only finds candidate mentions, while annotate completes the linking and returns at most the top-n most relevant links. Use whichever fits your needs.

For example, to find the linked entities in the sentence

Bob Dylan and Johnny Cash had formed a mutual admiration society even before they met in the early 1960s

You can try this directly in the browser as a demo. With the linking confidence threshold set to 0.5:

http://localhost:8080/dexter-webapp/api/rest/annotate?text=Bob%20Dylan%20and%20Johnny%20Cash%20had%20formed%20a%20mutual%20admiration%20society%20even%20before%20they%20met%20in%20the%20early%201960s&n=50&wn=false&debug=false&format=text&min-conf=0.5

This returns the annotate result:

"value": "<a href=\"#\" onmouseover='manage(4637590)' >Bob Dylan</a> and <a href=\"#\" onmouseover='manage(11983070)' >Johnny Cash</a> had formed a mutual admiration society even before they met in the early 1960s"

annotate also returns the spot results:

"spots": [
    {
      "mention": "johnny cash",
      "linkProbability": 1,
      "start": 14,
      "end": 25,
      "linkFrequency": 2558,
      "documentFrequency": 1932,
      "entity": 11983070,
      "field": "body",
      "entityFrequency": 2540,
      "commonness": 0.9929632525410477,
      "score": 0.9929632525410477
    },
    {
      "mention": "bob dylan",
      "linkProbability": 1,
      "start": 0,
      "end": 9,
      "linkFrequency": 5588,
      "documentFrequency": 4275,
      "entity": 4637590,
      "field": "body",
      "entityFrequency": 5547,
      "commonness": 0.9926628489620616,
      "score": 0.9926628489620616
    }
  ]

As shown, the program linked two entities, bob dylan and johnny cash, each with confidence above 0.5, and returned the id of each entity. These ids can be used for further operations, described later in this article.

Running the spot API instead:

http://localhost:8080/dexter-webapp/api/rest/spot?text=Bob%20Dylan%20and%20Johnny%20Cash%20had%20formed%20a%20mutual%20admiration%20society%20even%20before%20they%20met%20in%20the%20early%201960s&wn=false&debug=false&format=text

yields:

"spots": [
    {
      "mention": "mutual admiration society",
      "linkProbability": 1,
      "field": "body",
      "start": 39,
      "end": 64,
      "linkFrequency": 33,
      "documentFrequency": 31,
      "candidates": [
        {
          "entity": 2319591,
          "freq": 13,
          "commonness": 0.3939393939393939
        },
        {
          "entity": 2648616,
          "freq": 9,
          "commonness": 0.2727272727272727
        },
        {
          "entity": 2319544,
          "freq": 6,
          "commonness": 0.18181818181818182
        },
        {
          "entity": 3001631,
          "freq": 4,
          "commonness": 0.12121212121212122
        },
        {
          "entity": 32742,
          "freq": 1,
          "commonness": 0.030303030303030304
        }
      ]
    },
    {
      "mention": "johnny cash",
      "linkProbability": 1,
      "field": "body",
      "start": 14,
      "end": 25,
      "linkFrequency": 2558,
      "documentFrequency": 1932,
      "candidates": [
        {
          "entity": 11983070,
          "freq": 2540,
          "commonness": 0.9929632525410477
        },
        {
          "entity": 12326526,
          "freq": 14,
          "commonness": 0.00547302580140735
        }
      ]
    },
    {
      "mention": "bob dylan",
      "linkProbability": 1,
      "field": "body",
      "start": 0,
      "end": 9,
      "linkFrequency": 5588,
      "documentFrequency": 4275,
      "candidates": [
        {
          "entity": 4637590,
          "freq": 5547,
          "commonness": 0.9926628489620616
        },
        {
          "entity": 438899,
          "freq": 35,
          "commonness": 0.006263421617752327
        }
      ]
    }
  ],
  "nSpots": 3,
  "querytime": 264

Notice that dexter found not only bob dylan and johnny cash but also mutual admiration society. However, "mutual admiration society" has many possible senses, e.g. Mutual_Admiration_Society_(song), Mutual_Admiration_Society_(album), Mutual_Admiration_Society_(collaboration), and Mutual_Admiration_Society_–Joe_Locke&_David_Hazeltine_Quartet.
From the sentence alone, a human reader can tell that this "mutual admiration society" should be a song or an album, which suggests dexter's algorithm is context-free, i.e. it ignores the surrounding context. So dexter really only provides the linking interface; resolving this kind of ambiguity requires other tools.
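Rather than escaping spaces as %20 by hand, the request URLs used above can be built with the standard library's urllib.parse. A minimal sketch, assuming the server runs on localhost:8080 as in the examples above (the helper name is my own):

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:8080/dexter-webapp/api/rest"

def annotate_url(text, n=50, min_conf=0.5):
    # quote_via=quote percent-encodes spaces as %20, matching the
    # hand-built URLs above (the default quote_plus would use '+').
    params = urlencode({"text": text, "n": n, "wn": "false",
                        "debug": "false", "format": "text",
                        "min-conf": min_conf}, quote_via=quote)
    return BASE + "/annotate?" + params

url = annotate_url("Bob Dylan and Johnny Cash")
```

The resulting URL can then be fetched with urllib.request or any HTTP client.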

2. get-id

Given an entity title, returns its id (the number assigned to it in the Wikipedia dump).

http://localhost:8080/dexter-webapp/api/rest/get-id?title=johnny%20cash

http://localhost:8080/dexter-webapp/api/rest/get-id?title=johnny_cash

Both return the same result:

{
  "title": "Johnny_cash",
  "url": "",
  "id": 11983070
}
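Since both "johnny cash" and "johnny_cash" resolve to the same id, there is no need to normalize titles before querying. A tiny sketch of building this request URL, assuming the same localhost setup (the helper name is my own):

```python
from urllib.parse import quote

def get_id_url(title, base="http://localhost:8080"):
    # Spaces and underscores in the title are interchangeable on the
    # server side, so the title is only percent-encoded, not rewritten.
    return base + "/dexter-webapp/api/rest/get-id?title=" + quote(title)
```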

3. get-desc

Given an id, returns its description; you can think of it as fetching the entity for an id.

http://localhost:8080/dexter-webapp/api/rest/get-desc?id=11983070&title-only=true

Remember to set the title-only parameter to true; otherwise the entity title is not returned.

4. Batch processing with Python

With port 8080 open, you can use urllib and json to process text in bulk. For example:

import urllib
from urllib import request
from urllib import parse
import json


def GetAnnotateUrl(query, n = 5, conf = 0.5):
  url = 'http://localhost:8080/dexter-webapp/api/rest/annotate?text='
  # Percent-encode the whole query, not just spaces
  url += parse.quote(query)
  url += ('&n=' + str(n))
  url += ('&min-conf=' + str(conf))
  url += '&wn=false&debug=false&format=text'
  return url

def GetId2EntityUrl(id):
  url = 'http://localhost:8080/dexter-webapp/api/rest/get-desc?title-only=true&id='
  url += str(id)
  return url

def GetRequest(url):
  req = request.Request(url)
  data = request.urlopen(req).read().decode('utf-8')
  Json = json.loads(data)
  return Json

def GetEntitiesByQuery(query, n = 5, conf = 0.5 ):
  url = GetAnnotateUrl(query, n, conf)
  AnnoData = GetRequest(url)
  # AnnoData = json.dumps(AnnoData, indent = 4, separators = (',', ':'))
  # print(AnnoData) # Use the above dumps command to print structured json
  Spots = AnnoData['spots']
  Entities = {}
  for session in Spots:
    url = GetId2EntityUrl(session["entity"])
    Entities[session["entity"]] = GetRequest(url)["title"]
  return Entities

Entities = GetEntitiesByQuery('bob dylan and johnny cash')
print(Entities)
output:
{4637590: 'Bob_Dylan', 11983070: 'Johnny_Cash'}

This makes batch processing easy. Unlike the open APIs on Tencent/Alibaba Cloud, there is no rate limit, so you can use it as much as you like.

Enjoy!
