Natural language processing: using spaCy for part-of-speech tagging

Part-of-speech (POS) tagging can be done with a language model that contains a dictionary of words and all of their possible parts of speech. The model is then trained on sentences that have been correctly annotated with parts of speech, so that it can recognize the parts of speech of all words in a new sentence composed of other words from the dictionary. Both NLTK and spaCy provide POS tagging functions; we use spaCy here because it is faster and more accurate. POS tagging also makes it possible to extract the relationships between entities in a sentence.
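For comparison, here is a minimal sketch of the same kind of tagging in NLTK; it assumes the punkt and averaged_perceptron_tagger data packages have been downloaded, and the exact tags may vary by NLTK version.

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("Desoto met the Pascagoula.")
print(nltk.pos_tag(tokens))
# e.g. [('Desoto', 'NNP'), ('met', 'VBD'), ('the', 'DT'), ('Pascagoula', 'NNP'), ('.', '.')]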

import spacy
import en_core_web_md
from spacy.displacy import render
import pandas as pd
from collections import OrderedDict
from spacy.matcher import Matcher

en_model = en_core_web_md.load()
sentence = ("In 1541, Desoto met the Pascagoula, Desoto wrote in his journal that the Pascagoula people " +
                "ranged as far north as the confluence of the Leaf and Chickasawhay rivers at 30.4, -88.5.")
parsed_sent = en_model(sentence)
print(parsed_sent)
# spaCy did not recognize the longitude in the latitude/longitude pair
print(parsed_sent.ents)
# spaCy uses the "OntoNotes 5" part-of-speech tagging scheme
print(' '.join(['{}_{}'.format(tok, tok.tag_) for tok in parsed_sent]))
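# spacy.explain() decodes a tag into a human-readable description,
# e.g. 'NNP' -> 'noun, proper singular'
print(spacy.explain('NNP'))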

# Visualize the dependency tree
# A sentence parsed by spaCy also contains the dependency tree, represented as nested dictionaries
# render():
#   docs (list or Doc): the document(s) to visualize.
#   page (bool): render the markup as a full HTML page.
#   options (dict): visualizer-specific options, e.g. colors.
with open('pascagoula.html', 'w') as f:
    f.write(render(docs=parsed_sent, page=True, options=dict(compact=True)))
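
# The dependency tree can also be walked directly in code: each token exposes
# its syntactic head and children. A minimal sketch (to_tree is a hypothetical
# helper, not part of spaCy):
def to_tree(token):
    return {token.orth_: [to_tree(child) for child in token.children]}

for sent in parsed_sent.sents:
    print(to_tree(sent.root))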

# List all token tags in tabular form
def token_dict(token):
    ordered_dict = OrderedDict(ORTH=token.orth_, LEMMA=token.lemma_,
                POS=token.pos_, TAG=token.tag_, DEP=token.dep_)
    return ordered_dict

def doc_dataframe(doc):
    return pd.DataFrame([token_dict(tok) for tok in doc])

print(doc_dataframe(parsed_sent))
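
# Usage sketch: the DataFrame makes filtering easy, e.g. keep only the
# proper nouns (coarse POS tag 'PROPN') to see candidate entity words:
df = doc_dataframe(parsed_sent)
print(df[df['POS'] == 'PROPN'])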

# Example spaCy POS tagging pattern
'''
    A pattern consists of one or more `token_specs`, where a `token_spec`
    is a dictionary mapping attribute IDs to values, optionally with a
    quantifier operator under the key 'OP'. The available quantifiers are:

    '!': negate the pattern, by requiring it to match exactly 0 times.
    '?': make the pattern optional, by allowing it to match 0 or 1 times.
    '+': require the pattern to match 1 or more times.
    '*': allow the pattern to match zero or more times.
'''
pattern = [{'TAG': 'NNP', 'OP': '+'},
           {'IS_ALPHA': True, 'OP': '*'},
           {'LEMMA': 'meet'},
           {'IS_ALPHA': True, 'OP': '*'},
           {'TAG': 'NNP', 'OP': '+'}]
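
# The pattern above uses '+' and '*'; a small sketch of the other two
# quantifiers ('!' forbidden, '?' optional). This pattern is hypothetical,
# purely to illustrate the operators:
negation_pattern = [{'LOWER': 'never', 'OP': '!'},  # the token must NOT be "never"
                    {'POS': 'ADV', 'OP': '?'},      # an optional adverb
                    {'LEMMA': 'meet'}]              # any form of "meet"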

# Create a POS tagging pattern matcher with spaCy
matcher = Matcher(en_model.vocab)
# Add the match rule to the matcher
'''
    key (unicode): the match ID.
    patterns (list): the patterns to add for the given key.
    on_match (callable): an optional callback executed on a match.
'''
matcher.add('met', None, pattern)
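# Note: matcher.add('met', None, pattern) uses the spaCy v2 signature; in
# spaCy v3 the callback moved to a keyword argument: matcher.add('met', [pattern])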
# print(help(matcher.add))
m = matcher(parsed_sent)
print(m)
print(parsed_sent[m[0][1]:m[0][2]])
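# Each match is a (match_id, start, end) tuple; match_id is a hash of the key
# string and can be decoded back through the vocab's string store:
match_id, start, end = m[0]
print(en_model.vocab.strings[match_id])  # -> 'met'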
print('+' * 50)

# Try the POS tagging pattern matcher on similar sentences from Wikipedia
doc = en_model("October 24: Lewis and Clark met their first Mandan Chief, Big White.")
m = matcher(doc)
print(m)
print(doc[m[0][1]:m[0][2]])
print('+' * 50)
doc = en_model("On 11 October 1986, Gorbachev and Reagan met at a house.")
m = matcher(doc)
print(m)
print('+' * 50)

# Combining multiple patterns gives a more robust matcher
# Add another pattern that allows the verb to appear after the subject and object nouns
doc = en_model("On 11 October 1986, Gorbachev and Reagan met at a house, Clark met their first Mandan Chief.")
pattern2 = [{'TAG': 'NNP', 'OP': '+'},
            {'LEMMA': 'and'},
            {'TAG': 'NNP', 'OP': '+'},
            {'IS_ALPHA': True, 'OP': '*'},
            {'LEMMA': 'meet'}]
# Add the new pattern without removing the previous one
matcher.add('met2', None, pattern2)
m = matcher(doc)
print(m)
# The longest match is the last one in the match list
for match_id, start, end in m:
    print(doc[start:end])
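
# Optional sketch: spacy.util.filter_spans() keeps only the longest
# non-overlapping spans, which is usually what you want from a matcher:
from spacy.util import filter_spans
spans = [doc[start:end] for _, start, end in m]
print(filter_spans(spans))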
