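NLP (5): Named Entity Recognition (NER)

This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).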

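I. What is named entity recognition

1. A brief introduction to named entity recognition

Named entity recognition (NER) is an important basic tool for application areas such as information extraction, question answering, syntactic parsing, and machine translation, and it plays a key role in making natural language processing technology practical. Generally speaking, the NER task is to identify the named entities in the text being processed. These fall into three major classes (entity, time, and number) and seven subclasses (person names, organization names, place names, times, dates, currencies, and percentages).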

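2. An example of named entity recognition

As a simple example, running named entity recognition on the sentence “小明早上8点去学校上课。” ("Xiao Ming goes to school for class at 8 in the morning.") should extract the following information:

Person: 小明 (Xiao Ming); time: 早上8点 (8 in the morning); location: 学校 (school).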
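
To make the expected result concrete, the extraction above can be written down as data. The following is only a minimal sketch: the label names (PERSON, TIME, LOCATION) and the helper `to_spans` are illustrative assumptions, not the output format of any particular library.

```python
# The example sentence and the entities named in the text above.
sentence = "小明早上8点去学校上课。"
# Label names are illustrative assumptions, not a fixed tag set.
entities = [("小明", "PERSON"), ("早上8点", "TIME"), ("学校", "LOCATION")]

def to_spans(sentence, entities):
    """Attach character offsets (start inclusive, end exclusive) to each entity."""
    spans = []
    for text, label in entities:
        start = sentence.find(text)  # first occurrence; -1 if absent
        if start == -1:
            raise ValueError(f"{text!r} not found in sentence")
        spans.append((text, label, start, start + len(text)))
    return spans

spans = to_spans(sentence, entities)
for span in spans:
    print(span)
```

Storing offsets alongside the text is a common choice because it lets the same surface string be told apart when it appears more than once; this sketch only looks up the first occurrence.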

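II. NER categories in NLTK and Stanford NLP

The entity categories used by NLTK and Stanford NLP are shown in the figure below:
[Figure: NER categories in NLTK and Stanford NLP]

In the figure above, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents. LOCATION covers these as well as natural features such as famous mountains and rivers. FACILITY usually denotes well-known monuments or man-made artifacts.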

III. NER with NLTK

1. Sample document

Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows:

FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich,
its membership now comprises 211 national associations. Member countries must each also be members of one of the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America and the Caribbean, Oceania, and South America.

2. Python implementation

The Python code for NER is as follows:

import re
import pandas as pd
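import nltk

# one-time setup: the NLTK resources used below must be downloaded first, e.g.
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')
# (resource names may differ between NLTK versions)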

def parse_document(document):
    # validate the input before applying any string operations
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # join the lines into one string, then split it into sentences
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))

# get unique named entities (deduplicate once, after the loop)
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)

# ------output---------------
        Entity Name   Entity Type
0         Caribbean      LOCATION
1             North           GPE
2       Switzerland           GPE
3              Asia           GPE
4           Denmark           GPE
5            Africa        PERSON
6            Zürich           GPE
7            Europe           GPE
8   Central America  ORGANIZATION
9     South America           GPE
10            Spain           GPE
11      Netherlands           GPE
12             FIFA  ORGANIZATION
13          Oceania           GPE
14           Sweden           GPE
15          Germany           GPE
16          Belgium           GPE
17           France           GPE

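As we can see, NLTK does a reasonably good job on the NER task overall: it recognizes FIFA as an organization (ORGANIZATION) and Belgium and Asia as GPE.
However, some results are less satisfactory. For example, it labels Central America as ORGANIZATION when it should be GPE, and it labels Africa as PERSON when it should also be GPE.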

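IV. NER with Stanford NLP

1. Environment setup before using the Stanford NLP tools

1) Install Java

Before using the Stanford NLP tools, you need to install Java (usually the JDK) on your machine and add it to the system path.
For instructions on installing Java 8 on Ubuntu, see: https://blog.csdn.net/mucaoyx/article/details/82949450
In this setup, Java is located at: /opt/java/jdk1.8.0_261/bin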
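
Before moving on, it can help to confirm that Java is actually reachable from Python. The sketch below uses only the standard library; `find_java` and its `explicit_path` parameter are hypothetical names introduced for illustration.

```python
import os
import shutil

def find_java(explicit_path=None):
    """Return a path to a java executable, or None if none is found.

    explicit_path is a hypothetical override, for example
    '/opt/java/jdk1.8.0_261/bin/java'.
    """
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    return shutil.which("java")  # searches the directories listed in PATH

java = find_java()
if java:
    print("java found at:", java)
else:
    print("java not found; add the JDK bin directory to PATH")
```

If this prints "java not found", the JDK's bin directory is most likely missing from the system path.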

2) Download Stanford NER

Download the English NER package stanford-ner-2018-10-16.zip (172 MB) from: https://nlp.stanford.edu/software/CRF-NER.shtml

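After unzipping the downloaded Stanford NER zip file, the extracted folder is located at: /home/nijiahui/anaconda3/envs/nlp/lib/python3.6/site-packages/stanford-ner-4.0.0
The classifiers folder contains the pretrained models, which recognize the following entity classes: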

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
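
Since the three models differ only in the classes they cover, choosing one can be reduced to a set lookup. This is a rough sketch: the keys `3class`, `4class`, and `7class` and the helper `pick_model` are assumptions made for illustration, and the class names are upper-cased versions of the list above.

```python
# Entity classes per model, as listed above (upper-cased).
MODEL_CLASSES = {
    "3class": {"LOCATION", "PERSON", "ORGANIZATION"},
    "4class": {"LOCATION", "PERSON", "ORGANIZATION", "MISC"},
    "7class": {"LOCATION", "PERSON", "ORGANIZATION",
               "MONEY", "PERCENT", "DATE", "TIME"},
}

def pick_model(needed):
    """Return the model with the fewest classes that covers every needed label."""
    needed = {label.upper() for label in needed}
    candidates = [name for name, classes in MODEL_CLASSES.items()
                  if needed <= classes]
    if not candidates:
        raise ValueError(f"no single model covers {sorted(needed)}")
    return min(candidates, key=lambda name: len(MODEL_CLASSES[name]))

print(pick_model({"PERSON", "LOCATION"}))    # 3class
print(pick_model({"DATE", "ORGANIZATION"}))  # 7class
```

Note that no single model covers both Misc and the date or money classes, so a request such as `{"MISC", "DATE"}` raises an error in this sketch.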

2. Python implementation with Stanford NER

The complete Python code for running Stanford NER is as follows:

import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk

def parse_document(document):
    # validate the input before applying any string operations
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # join the lines into one string, then split it into sentences
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# set java path in environment variables
# (this path and the Stanford NER paths below must match your own installation)
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]
# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_terms = []
    temp_tag = None
    for term, tag in sentence:
        # close the current entity whenever the tag changes
        if tag != temp_tag and temp_terms:
            named_entities.append((' '.join(temp_terms), temp_tag))
            temp_terms = []
        # collect terms with NE tags
        if tag != 'O':
            temp_terms.append(term)
        temp_tag = tag
    # flush an entity that ends the sentence
    if temp_terms:
        named_entities.append((' '.join(temp_terms), temp_tag))

# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)

# ----------output----------
                Entity Name   Entity Type
0                   Belgium      LOCATION
1   North & Central America  ORGANIZATION
2                 Caribbean      LOCATION
3                    France      LOCATION
4                      1904          DATE
5             South America      LOCATION
6                      Asia      LOCATION
7                   Denmark      LOCATION
8                    Zürich      LOCATION
9                      FIFA  ORGANIZATION
10                   Sweden      LOCATION
11                    Spain      LOCATION
12                   Europe      LOCATION
13                  Oceania      LOCATION
14                   Africa      LOCATION
15          the Netherlands      LOCATION
16              Switzerland      LOCATION
17                  Germany      LOCATION

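As we can see, Stanford NER gives better results on this document: it recognizes Africa as LOCATION and 1904 as a date (DATE), which NLTK did not pick up. It still gets North & Central America wrong, however, labeling it as ORGANIZATION.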
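
The step in the code above that joins consecutive tagged tokens into one entity can also be expressed with `itertools.groupby`. The following is a sketch under assumptions: `merge_entities` is a hypothetical helper, and the sample tags are hand-written in the `(token, tag)` shape that `sn.tag()` returns, not real tagger output.

```python
from itertools import groupby

def merge_entities(tagged_sentence):
    """Merge runs of adjacent tokens that share a non-'O' tag into entities."""
    entities = []
    for tag, group in groupby(tagged_sentence, key=lambda pair: pair[1]):
        if tag != "O":
            name = " ".join(term for term, _ in group)
            entities.append((name, tag))
    return entities

# Hand-written (token, tag) pairs; the tags are illustrative only.
sample = [("Headquartered", "O"), ("in", "O"), ("Zürich", "LOCATION"),
          (",", "O"), ("South", "LOCATION"), ("America", "LOCATION"),
          ("FIFA", "ORGANIZATION")]
print(merge_entities(sample))
```

Grouping by tag closes an entity when the tag changes and also when the sentence ends. Because the tagger output carries no begin/inside markers, two distinct entities of the same type that sit directly next to each other would still be merged into one.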

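V. Summary

Note that this does not mean Stanford NER is always better than NLTK NER. The two may differ in their target domains, training corpora, and algorithms, so you should choose a tool based on your own needs.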
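
One way to weigh the two tools for a given text is to compare their outputs on a shared label scheme. The sketch below copies the results from the two output tables above; folding NLTK's GPE into LOCATION via `normalize` is an assumption made only so the labels can be compared.

```python
# Results copied from the two output tables above.
nltk_result = {
    "Caribbean": "LOCATION", "North": "GPE", "Switzerland": "GPE",
    "Asia": "GPE", "Denmark": "GPE", "Africa": "PERSON", "Zürich": "GPE",
    "Europe": "GPE", "Central America": "ORGANIZATION",
    "South America": "GPE", "Spain": "GPE", "Netherlands": "GPE",
    "FIFA": "ORGANIZATION", "Oceania": "GPE", "Sweden": "GPE",
    "Germany": "GPE", "Belgium": "GPE", "France": "GPE",
}
stanford_result = {
    "Belgium": "LOCATION", "North & Central America": "ORGANIZATION",
    "Caribbean": "LOCATION", "France": "LOCATION", "1904": "DATE",
    "South America": "LOCATION", "Asia": "LOCATION", "Denmark": "LOCATION",
    "Zürich": "LOCATION", "FIFA": "ORGANIZATION", "Sweden": "LOCATION",
    "Spain": "LOCATION", "Europe": "LOCATION", "Oceania": "LOCATION",
    "Africa": "LOCATION", "the Netherlands": "LOCATION",
    "Switzerland": "LOCATION", "Germany": "LOCATION",
}

def normalize(label):
    """Fold GPE into LOCATION so both label sets are comparable (an assumption)."""
    return "LOCATION" if label == "GPE" else label

# Entities whose surface form was extracted identically by both tools.
shared = sorted(set(nltk_result) & set(stanford_result))
agree = [e for e in shared if normalize(nltk_result[e]) == stanford_result[e]]
differ = [e for e in shared if e not in agree]
only_nltk = sorted(set(nltk_result) - set(stanford_result))
only_stanford = sorted(set(stanford_result) - set(nltk_result))

print("agree:", len(agree), "differ:", differ)
print("only NLTK:", only_nltk)
print("only Stanford:", only_stanford)
```

The entities found by only one tool mostly reflect different span boundaries (for example Netherlands versus the Netherlands) rather than different labels, which is worth checking before concluding that one tool is more accurate.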