NLP (5) Named Entity Recognition (NER)

This article will briefly introduce named entity recognition (NER) in natural language processing (NLP).

One, what is named entity recognition

1. Introduction to named entity recognition

Named Entity Recognition (NER) is an important basic tool for applications such as information extraction, question answering, syntactic analysis, and machine translation, and it plays an important role in putting natural language processing technology into practice. Generally speaking, the task of named entity recognition is to identify named entities of three major classes (entity, time, and number) and seven subclasses (person name, organization name, place name, time, date, currency, and percentage) in the text to be processed.

2. Examples of named entity recognition

As a simple example, given the sentence "Xiao Ming goes to school at 8 o'clock in the morning.", named entity recognition should be able to extract the following information:

Name: Xiao Ming, Time: 8 AM, Location: School.
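One common way to hold such a result in code is a list of records with the entity text, its category, and its position in the sentence. The sketch below hard-codes the entities from the example instead of running a model; the `Entity` class and its field names are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One recognized entity (hypothetical container, for illustration only)."""
    text: str    # surface form as it appears in the sentence
    label: str   # entity category
    start: int   # character offset where the entity begins

    @property
    def end(self) -> int:
        # offset one past the last character of the entity
        return self.start + len(self.text)


sentence = "Xiao Ming goes to school at 8 o'clock in the morning."

# Hand-written annotations for the example; a real NER model would produce these.
surface_forms = [
    ("Xiao Ming", "Name"),
    ("school", "Location"),
    ("8 o'clock in the morning", "Time"),
]
entities = [Entity(text, label, sentence.index(text)) for text, label in surface_forms]

for ent in entities:
    # slicing the sentence with the offsets recovers the surface form
    print(f"{ent.label}: {sentence[ent.start:ent.end]}")
```

Keeping character offsets alongside the text makes it possible to highlight or replace entities in the original sentence later.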

Two, classification of named entity recognition in NLTK and Stanford NLP

The classification of named entity recognition in NLTK and Stanford NLP is as follows:
[Figure: named entity categories in NLTK and Stanford NLP]

In the above figure, LOCATION and GPE overlap. GPE usually denotes geopolitical entities such as cities, states, countries, and continents. LOCATION can cover these as well, and it can also denote famous mountains and rivers. FACILITY usually denotes well-known monuments or artifacts.
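Because the two label sets overlap, it can help to map them onto one scheme before comparing tools. The following is a minimal sketch of such a mapping; folding GPE and FACILITY into LOCATION is an assumption made here for convenience, not a rule defined by either tool, and `NLTK_TO_STANFORD` and `normalize_label` are hypothetical names.

```python
# Assumed mapping from NLTK chunker labels onto Stanford's label set.
# GPE and FACILITY have no direct Stanford counterpart; treating them as
# LOCATION is a simplification chosen for this sketch.
NLTK_TO_STANFORD = {
    "GPE": "LOCATION",
    "FACILITY": "LOCATION",
}


def normalize_label(label: str) -> str:
    """Map an NLTK label to the shared scheme; other labels pass through unchanged."""
    return NLTK_TO_STANFORD.get(label, label)


for label in ("GPE", "FACILITY", "PERSON", "ORGANIZATION"):
    print(label, "->", normalize_label(label))
```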

Three, use NLTK to implement NER tasks

1. Sample document

Our sample document (introducing FIFA, from Wikipedia) is as follows:

FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich,
its membership now comprises 211 national associations. Member countries must each also be members of one of the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America and the Caribbean, Oceania, and South America.

2. Python implementation code

The Python code to implement NER is as follows:

import re
import pandas as pd
import nltk

def parse_document(document):
   # validate the input first, so a non-string raises ValueError instead of failing in re.sub
   if not isinstance(document, str):
       raise ValueError('Document is not string!')
   # join hard-wrapped lines into one line, then trim surrounding whitespace
   document = re.sub('\n', ' ', document).strip()
   # split the document into sentences
   sentences = nltk.sent_tokenize(document)
   return [sentence.strip() for sentence in sentences]

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
   for tagged_tree in ne_tagged_sentence:
       # extract only chunks having NE labels (plain tokens are tuples and have no label)
       if hasattr(tagged_tree, 'label'):
           entity_name = ' '.join(c[0] for c in tagged_tree.leaves()) # get NE name
           entity_type = tagged_tree.label() # get NE category
           named_entities.append((entity_name, entity_type))

# get unique named entities (done once, after all sentences are processed)
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)

# ------output---------------
        Entity Name   Entity Type
0         Caribbean      LOCATION
1             North           GPE
2       Switzerland           GPE
3              Asia           GPE
4           Denmark           GPE
5            Africa        PERSON
6            Zürich           GPE
7            Europe           GPE
8   Central America  ORGANIZATION
9     South America           GPE
10            Spain           GPE
11      Netherlands           GPE
12             FIFA  ORGANIZATION
13          Oceania           GPE
14           Sweden           GPE
15          Germany           GPE
16          Belgium           GPE
17           France           GPE

It can be seen that NLTK handles the NER task reasonably well overall: it identifies FIFA as an organization (ORGANIZATION) and Belgium and Asia as GPE. There are also some unsatisfactory results, however: it labels Central America as ORGANIZATION and Africa as PERSON, when both should be GPE.
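This kind of spot check can also be done in code. The sketch below compares a few predicted labels, copied from the NLTK output above, against the labels the discussion above says they should have. The variable names are illustrative, and the expected labels cover only the entities mentioned in that discussion.

```python
# (entity, label) pairs copied from the NLTK output above
predicted = {
    "FIFA": "ORGANIZATION",
    "Belgium": "GPE",
    "Asia": "GPE",
    "Central America": "ORGANIZATION",
    "Africa": "PERSON",
}

# labels these entities should have, according to the discussion above
expected = {
    "FIFA": "ORGANIZATION",
    "Belgium": "GPE",
    "Asia": "GPE",
    "Central America": "GPE",
    "Africa": "GPE",
}

# keep only the entities whose predicted label differs from the expected one
mismatches = {
    name: (predicted.get(name), expected[name])
    for name in expected
    if predicted.get(name) != expected[name]
}

for name, (got, want) in sorted(mismatches.items()):
    print(f"{name}: predicted {got}, expected {want}")
```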

Four, use Stanford NLP to implement NER tasks

1. Environment setup before using the Stanford NLP tools

1) Install Java

Before using the Stanford NLP tools, you need to install Java (usually the JDK) on your computer and add Java to the system path.
For how to download Java 8 on Ubuntu, see https://blog.csdn.net/mucaoyx/article/details/82949450
In this setup, Java is located at /opt/java/jdk1.8.0_261/bin
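Before moving on, it may be worth confirming that Java is actually reachable from Python. The following is a small sketch using only the standard library; it assumes the executable is named `java` and is found through the PATH.

```python
import shutil
import subprocess

# shutil.which returns the full path of the executable, or None if it is not on the PATH
java = shutil.which("java")

if java is None:
    print("java not found on PATH; install a JDK and add its bin directory to PATH")
else:
    # `java -version` writes its report to stderr rather than stdout
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    print(java)
    print(result.stderr.strip() or result.stdout.strip())
```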

2) Download Stanford NER

Download the English NER package, stanford-ner-2018-10-16.zip (172 MB), from https://nlp.stanford.edu/software/CRF-NER.shtml

After the downloaded Stanford NER zip file is decompressed, the path of the resulting folder is /home/nijiahui/anaconda3/envs/nlp/lib/python3.6/site-packages/stanford-ner-4.0.0

The classifiers folder contains the pre-trained models. They differ in the entity classes they recognize:

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
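Which model to load depends on the entity types you need. As a rough sketch, the helper below picks the model with the fewest classes that still covers a set of required labels; the dictionary simply restates the list above, and `MODEL_LABELS` and `smallest_model_for` are hypothetical names introduced for illustration.

```python
# label sets as listed above, keyed by model family
MODEL_LABELS = {
    "3 class": {"LOCATION", "PERSON", "ORGANIZATION"},
    "4 class": {"LOCATION", "PERSON", "ORGANIZATION", "MISC"},
    "7 class": {"LOCATION", "PERSON", "ORGANIZATION", "MONEY", "PERCENT", "DATE", "TIME"},
}


def smallest_model_for(needed):
    """Return the model with the fewest labels that covers `needed`, or None if none does."""
    candidates = [name for name, labels in MODEL_LABELS.items() if set(needed) <= labels]
    return min(candidates, key=lambda name: len(MODEL_LABELS[name]), default=None)


print(smallest_model_for({"PERSON", "LOCATION"}))
print(smallest_model_for({"LOCATION", "DATE"}))
```

For the FIFA document, recognizing 1904 as a date requires the 7-class model, which is the one loaded in the code that follows.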

2. Python code for Stanford NER

The complete Python code for running Stanford NER is as follows:

import re
from nltk.tag import StanfordNERTagger
import os
import pandas as pd
import nltk

def parse_document(document):
   # validate the input first, so a non-string raises ValueError instead of failing in re.sub
   if not isinstance(document, str):
       raise ValueError('Document is not string!')
   # join hard-wrapped lines into one line, then trim surrounding whitespace
   document = re.sub('\n', ' ', document).strip()
   # split the document into sentences
   sentences = nltk.sent_tokenize(document)
   return [sentence.strip() for sentence in sentences]

# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. Member countries must each also be members of one of 
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America 
and the Caribbean, Oceania, and South America.
"""

sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# set java path in environment variables
# (the paths below are Windows examples; replace them with the paths on your own machine)
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path
# load stanford NER with the 7-class model and the NER jar
sn = StanfordNERTagger('E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
                       path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]
# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
   temp_entity_name = ''
   temp_named_entity = None
   # the trailing ('', 'O') sentinel flushes an entity that ends the sentence
   for term, tag in sentence + [('', 'O')]:
       # get terms with NE tags
       if tag != 'O':
           temp_entity_name = ' '.join([temp_entity_name, term]).strip() # get NE name
           temp_named_entity = (temp_entity_name, tag) # get NE and its category
       elif temp_named_entity:
           # a non-entity token closes the entity collected so far
           named_entities.append(temp_named_entity)
           temp_entity_name = ''
           temp_named_entity = None

# get unique named entities
named_entities = list(set(named_entities))
# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
# display results
print(entity_frame)

# ----------output----------
                Entity Name   Entity Type
0                   Belgium      LOCATION
1   North & Central America  ORGANIZATION
2                 Caribbean      LOCATION
3                    France      LOCATION
4                      1904          DATE
5             South America      LOCATION
6                      Asia      LOCATION
7                   Denmark      LOCATION
8                    Zürich      LOCATION
9                      FIFA  ORGANIZATION
10                   Sweden      LOCATION
11                    Spain      LOCATION
12                   Europe      LOCATION
13                  Oceania      LOCATION
14                   Africa      LOCATION
15          the Netherlands      LOCATION
16              Switzerland      LOCATION
17                  Germany      LOCATION
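The step that merges consecutive tagged tokens into entities can be tried without Java or the Stanford jar. The following is an alternative sketch of that step using `itertools.groupby`, not the loop used in the code above; `group_entities` is a hypothetical helper, and the sample tokens are hand-written in the (token, tag) shape that the tagger returns.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' tag into (name, tag) pairs."""
    entities = []
    # groupby starts a new group whenever the tag changes
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(term for term, _ in group), tag))
    return entities


# hand-written (token, tag) pairs, not real tagger output
sample = [
    ('South', 'LOCATION'), ('America', 'LOCATION'),
    ('FIFA', 'ORGANIZATION'),
    ('in', 'O'),
    ('1904', 'DATE'),
]
print(group_entities(sample))
```

Grouping on the tag also separates adjacent entities of different types and keeps an entity that ends the sentence. Two distinct entities of the same type that sit next to each other are still merged into one, since the tags alone cannot tell them apart.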

It can be seen that Stanford NER gives better results on this document. Africa is recognized as LOCATION and 1904 as DATE (NLTK did not recognize 1904 at all), but North & Central America is still misclassified as ORGANIZATION.
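The two result tables can also be compared programmatically. The sketch below copies both outputs from this article into dictionaries and reports which entity names appear in only one of them, and where the labels disagree. Treating NLTK's GPE as equivalent to LOCATION is an assumption made only for this comparison.

```python
# entity -> label, copied from the NLTK output in this article
nltk_entities = {
    "Caribbean": "LOCATION", "North": "GPE", "Switzerland": "GPE", "Asia": "GPE",
    "Denmark": "GPE", "Africa": "PERSON", "Zürich": "GPE", "Europe": "GPE",
    "Central America": "ORGANIZATION", "South America": "GPE", "Spain": "GPE",
    "Netherlands": "GPE", "FIFA": "ORGANIZATION", "Oceania": "GPE",
    "Sweden": "GPE", "Germany": "GPE", "Belgium": "GPE", "France": "GPE",
}

# entity -> label, copied from the Stanford NER output in this article
stanford_entities = {
    "Belgium": "LOCATION", "North & Central America": "ORGANIZATION",
    "Caribbean": "LOCATION", "France": "LOCATION", "1904": "DATE",
    "South America": "LOCATION", "Asia": "LOCATION", "Denmark": "LOCATION",
    "Zürich": "LOCATION", "FIFA": "ORGANIZATION", "Sweden": "LOCATION",
    "Spain": "LOCATION", "Europe": "LOCATION", "Oceania": "LOCATION",
    "Africa": "LOCATION", "the Netherlands": "LOCATION",
    "Switzerland": "LOCATION", "Germany": "LOCATION",
}


def normalize(label):
    # assumption: count GPE as LOCATION so the two label sets are comparable
    return "LOCATION" if label == "GPE" else label


only_nltk = sorted(set(nltk_entities) - set(stanford_entities))
only_stanford = sorted(set(stanford_entities) - set(nltk_entities))
disagreements = {
    name: (nltk_entities[name], stanford_entities[name])
    for name in set(nltk_entities) & set(stanford_entities)
    if normalize(nltk_entities[name]) != stanford_entities[name]
}

print("only in NLTK:", only_nltk)
print("only in Stanford:", only_stanford)
print("label disagreements:", disagreements)
```

Exact string matching counts "Netherlands" and "the Netherlands" as different entities, so the difference lists show boundary differences as well as entities that one tool missed.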

Five, summary

It is worth noting that Stanford NER is not necessarily better than NLTK NER. The two tools may target different kinds of entities and may be built on different training corpora and algorithms. Therefore, you need to decide which tool to use according to your needs.