This article will briefly introduce named entity recognition (NER) in natural language processing (NLP).
One, what is named entity recognition
1. Introduction to named entity recognition
Named Entity Recognition (NER) is an important basic tool in application fields such as information extraction, question answering, syntactic analysis, and machine translation, and it plays an important role in putting natural language processing technology into practice. Generally speaking, the task of named entity recognition is to identify the named entities in the text to be processed. These are usually grouped into three major classes (entity, time, and number) and seven subclasses (person name, organization name, place name, time, date, currency, and percentage).
2. Examples of named entity recognition
As a simple example, given the sentence "Xiao Ming goes to school at 8 o'clock in the morning.", named entity recognition should be able to extract the following information:
Name: Xiao Ming, Time: 8 AM, Location: School.
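To make the input and output of the task concrete, here is a minimal sketch that produces (entity, label) pairs for the sentence above. It is only a toy lookup: the GAZETTEER table and the toy_ner function are hypothetical names written for this illustration, and real tools such as NLTK or Stanford NER use trained statistical models rather than a fixed phrase list.

```python
import re

# Hypothetical hand-written lookup table; real NER systems learn this from data.
GAZETTEER = {
    "Xiao Ming": "PERSON",
    "8 o'clock in the morning": "TIME",
    "school": "LOCATION",
}

def toy_ner(sentence):
    """Return (entity, label) pairs for every gazetteer phrase found in the sentence."""
    found = []
    for phrase, label in GAZETTEER.items():
        match = re.search(re.escape(phrase), sentence)
        if match:
            found.append((match.start(), phrase, label))
    # order the results by where they appear in the sentence
    return [(phrase, label) for _, phrase, label in sorted(found)]

print(toy_ner("Xiao Ming goes to school at 8 o'clock in the morning."))
```

The point of the sketch is the shape of the result, a list of text spans with a category each, which is the same shape the NLTK and Stanford NER code later in this article collects.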
Two, classification of named entity recognition in NLTK and Stanford NLP
The classification of named entity recognition in NLTK and Stanford NLP is as follows:
In the above figure, LOCATION and GPE overlap. GPE usually denotes geo-political entities such as cities, states, countries, and continents. LOCATION can cover these as well, and it can also denote famous mountains and rivers. FACILITY usually denotes well-known monuments or artifacts.
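Because GPE and LOCATION overlap, it can help to fold one into the other before comparing labels from different tools. The sketch below shows one possible way; the LABEL_ALIASES mapping and the normalize_label function are assumptions made for this illustration and are not part of NLTK or Stanford NER.

```python
# Assumed mapping for illustration: fold GPE into LOCATION so that
# labels coming from different tools can be compared directly.
LABEL_ALIASES = {"GPE": "LOCATION"}

def normalize_label(label):
    """Return the alias for a label if one is defined, otherwise the label itself."""
    return LABEL_ALIASES.get(label, label)

# Example (entity, label) pairs written by hand for this sketch
entities = [("Switzerland", "GPE"), ("Caribbean", "LOCATION"), ("FIFA", "ORGANIZATION")]
normalized = [(name, normalize_label(label)) for name, label in entities]
print(normalized)
```

Whether merging the two labels is appropriate depends on the application; if the distinction between political and purely geographic entities matters, the mapping should be left out.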
Three, use NLTK to implement NER tasks
1. Sample document
Our sample document (introducing FIFA, from Wikipedia) is as follows:
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich,
its membership now comprises 211 national associations. Member countries must each also be members of one of the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America and the Caribbean, Oceania, and South America.
2. Python implementation code
The Python code to implement NER is as follows:
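import re
import pandas as pd
import nltk
# note: the sentence tokenizer, POS tagger and NE chunker used below
# each need their NLTK data packages to be installed

def parse_document(document):
    """Split a document into a list of stripped sentences."""
    # validate the input before calling re.sub, which fails on non-strings
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # join lines so that sentences spanning a line break stay whole
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences
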
# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""
# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))
# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)
# ------output---------------
    Entity Name      Entity Type
0   Caribbean        LOCATION
1   North            GPE
2   Switzerland      GPE
3   Asia             GPE
4   Denmark          GPE
5   Africa           PERSON
6   Zürich           GPE
7   Europe           GPE
8   Central America  ORGANIZATION
9   South America    GPE
10  Spain            GPE
11  Netherlands      GPE
12  FIFA             ORGANIZATION
13  Oceania          GPE
14  Sweden           GPE
15  Germany          GPE
16  Belgium          GPE
17  France           GPE
It can be seen that NLTK completes the NER task reasonably well overall: it identifies FIFA as an organization (ORGANIZATION) and Belgium and Asia as GPE.
There are also some unsatisfactory results, however. For example, it labels Central America as ORGANIZATION and Africa as PERSON, when both should be GPE.
Four, use Stanford NLP to implement NER tasks
1. Environment setup before using the Stanford NLP tools
1) Install Java
Before using the Stanford NLP tools, you need to install Java (usually the JDK) on your computer and add Java to the system path.
For how to install Java 8 on Ubuntu, see: https://blog.csdn.net/mucaoyx/article/details/82949450
The path where Java is located is as follows: /opt/java/jdk1.8.0_261/bin
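Before running the tagger, it can be worth confirming that Java really is reachable from the system path. The following is a minimal sketch using only the Python standard library; the find_java helper is a hypothetical name introduced here, and the check only tells you whether a java executable is on PATH, not which version it is.

```python
import shutil

def find_java():
    """Return the full path of the `java` executable found on PATH, or None."""
    return shutil.which("java")

java_path = find_java()
if java_path is None:
    print("java not found on PATH; add the JDK bin directory to PATH first")
else:
    print("java found at", java_path)
```

If the check fails, the JDK bin directory (for example the one listed above) still needs to be added to the PATH environment variable.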
2) Download Stanford NER
Download the English NER package stanford-ner-2018-10-16.zip (172 MB) from: https://nlp.stanford.edu/software/CRF-NER.shtml
After the Stanford NER zip file is downloaded and decompressed, the path of the resulting folder is /home/nijiahui/anaconda3/envs/nlp/lib/python3.6/site-packages/stanford-ner-4.0.0,
as shown in the figure below:
The classifiers folder contains the following files:
Their meanings are as follows:
3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
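Which classifier to load depends on the categories you need. As a rough sketch of that decision, the snippet below picks the smallest of the three models that covers a required set of labels. The MODEL_LABELS table simply restates the list above in upper case; the short keys ("3class", "4class", "7class") and the pick_model function are hypothetical names used for illustration, not part of the Stanford NER distribution.

```python
# Label sets restated from the list above; the keys are shorthand for this sketch.
MODEL_LABELS = {
    "3class": {"LOCATION", "PERSON", "ORGANIZATION"},
    "4class": {"LOCATION", "PERSON", "ORGANIZATION", "MISC"},
    "7class": {"LOCATION", "PERSON", "ORGANIZATION", "MONEY", "PERCENT", "DATE", "TIME"},
}

def pick_model(required_labels):
    """Return the model with the fewest labels that covers every required label, or None."""
    required = set(required_labels)
    candidates = [name for name, labels in MODEL_LABELS.items() if required <= labels]
    return min(candidates, key=lambda name: len(MODEL_LABELS[name]), default=None)

print(pick_model({"PERSON"}))
print(pick_model({"PERSON", "DATE"}))
print(pick_model({"MISC", "DATE"}))
```

Note that no single model covers both Misc and Date, so the last call finds no match. The example in this article uses the 7 class model because it also tags dates.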
2. Python implements Stanford NER code
The complete Python code for running Stanford NER is as follows:
import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk

def parse_document(document):
    """Split a document into a list of stripped sentences."""
    # validate the input before calling re.sub, which fails on non-strings
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # join lines so that sentences spanning a line break stay whole
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# set java path in environment variables
# (the paths below are from a Windows machine; adjust them to your own installation)
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')
# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]
# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        # get terms with NE tags
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
            temp_named_entity = (temp_entity_name, tag)  # get NE and its category
        else:
            # an 'O' tag closes the entity collected so far, if any
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # keep an entity that runs up to the last token of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)
# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)
# ----------output----------
    Entity Name              Entity Type
0   Belgium                  LOCATION
1   North & Central America  ORGANIZATION
2   Caribbean                LOCATION
3   France                   LOCATION
4   1904                     DATE
5   South America            LOCATION
6   Asia                     LOCATION
7   Denmark                  LOCATION
8   Zürich                   LOCATION
9   FIFA                     ORGANIZATION
10  Sweden                   LOCATION
11  Spain                    LOCATION
12  Europe                   LOCATION
13  Oceania                  LOCATION
14  Africa                   LOCATION
15  the Netherlands          LOCATION
16  Switzerland              LOCATION
17  Germany                  LOCATION
It can be seen that Stanford NER gives better results on this text: Africa is recognized as LOCATION and 1904 is recognized as a date (DATE), which NLTK did not recognize at all. However, North & Central America is still wrong, since it is labeled as ORGANIZATION.
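The step that turns per-token tags into whole entities can also be written with itertools.groupby from the standard library. The sketch below is an alternative formulation, not the code used to produce the output above: the group_entities function is a hypothetical name, and the tagged list is written by hand for illustration rather than taken from the tagger. Grouping by tag means that two adjacent entities with different tags are kept apart, and an entity at the very end of a sentence is still collected.

```python
from itertools import groupby

def group_entities(tagged_sentence):
    """Merge runs of adjacent tokens that share the same non-'O' tag into (entity, tag) pairs."""
    entities = []
    for tag, run in groupby(tagged_sentence, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(term for term, _ in run), tag))
    return entities

# Hand-written (token, tag) pairs for illustration, not actual tagger output
tagged = [('Headquartered', 'O'), ('in', 'O'), ('Zürich', 'LOCATION'), (',', 'O'),
          ('FIFA', 'ORGANIZATION'), ('covers', 'O'), ('South', 'LOCATION'), ('America', 'LOCATION')]
print(group_entities(tagged))
```

One limitation remains: two separate entities of the same type that are directly adjacent, with no 'O' token between them, would still be merged into one.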
Five, summary
It is worth noting that Stanford NER is not necessarily better than NLTK NER. The two tools may differ in the entity types they target, the corpora they were trained on, and the algorithms they use. Therefore, you need to decide which tool to use according to your needs.
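One way to see how the two tools differ on a given text is to diff their results. The sketch below does this for the two output tables shown earlier in this article, whose rows are copied in by hand. Treating GPE as equivalent to LOCATION is an assumption made only for this comparison, and the variable and function names are illustrative.

```python
# (entity, type) pairs copied from the two output tables in this article
nltk_result = {
    "Caribbean": "LOCATION", "North": "GPE", "Switzerland": "GPE", "Asia": "GPE",
    "Denmark": "GPE", "Africa": "PERSON", "Zürich": "GPE", "Europe": "GPE",
    "Central America": "ORGANIZATION", "South America": "GPE", "Spain": "GPE",
    "Netherlands": "GPE", "FIFA": "ORGANIZATION", "Oceania": "GPE", "Sweden": "GPE",
    "Germany": "GPE", "Belgium": "GPE", "France": "GPE",
}
stanford_result = {
    "Belgium": "LOCATION", "North & Central America": "ORGANIZATION",
    "Caribbean": "LOCATION", "France": "LOCATION", "1904": "DATE",
    "South America": "LOCATION", "Asia": "LOCATION", "Denmark": "LOCATION",
    "Zürich": "LOCATION", "FIFA": "ORGANIZATION", "Sweden": "LOCATION",
    "Spain": "LOCATION", "Europe": "LOCATION", "Oceania": "LOCATION",
    "Africa": "LOCATION", "the Netherlands": "LOCATION",
    "Switzerland": "LOCATION", "Germany": "LOCATION",
}

def normalize(label):
    """Assumption for comparison only: treat GPE as equivalent to LOCATION."""
    return "LOCATION" if label == "GPE" else label

# entity strings that only one of the tools produced
only_nltk = sorted(set(nltk_result) - set(stanford_result))
only_stanford = sorted(set(stanford_result) - set(nltk_result))
# entities found by both tools but labeled differently after normalization
disagreements = {
    name: (nltk_result[name], stanford_result[name])
    for name in set(nltk_result) & set(stanford_result)
    if normalize(nltk_result[name]) != normalize(stanford_result[name])
}

print("Only NLTK:", only_nltk)
print("Only Stanford:", only_stanford)
print("Different labels:", disagreements)
```

On this text the differences come down to entity boundaries (North and Central America versus North & Central America, Netherlands versus the Netherlands), the date 1904, and the label for Africa. A different text could easily favor the other tool.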