Natural language processing with pyltp (POS tagging, named entity recognition, semantic role labeling, etc.)

pyltp is the Python wrapper for LTP (Language Technology Platform), providing word segmentation, POS tagging, named entity recognition, dependency parsing, and semantic role labeling.

If you are just getting started with pyltp, here are the download links for the pyltp package and its models:
pyltp installation: click here
pyltp model download: click here

Note, note, note (important things said three times): pyltp only works with models from the matching version, so before downloading the models, be sure to check which version of pyltp you installed.

Python environment: Python 3.6
System: Windows 10

Sentence splitting

pyltp provides SentenceSplitter, which splits a paragraph of text into sentences according to Chinese and English punctuation rules. The split function of SentenceSplitter returns the sentences, which we can store in a list and then print out.

'''
created on January 22 17:23 2019

@author:lhy
'''
from pyltp import SentenceSplitter

def sentenceSplitter(sentence='机器学习有下面几种定义: “机器学习是一门人工智能的科学,该领域的主要研究对象是人工智能,特别是如何在经验学习中改善具体算法的性能”。 “机器学习是对能通过经验自动改进的计算机算法的研究”。 “机器学习是用数据或以往的经验,以此优化计算机程序的性能标准。” 一种经常引用的英文定义是:A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.'):
    sents=SentenceSplitter.split(sentence)  # split into sentences
    print('\n'.join(sents))

sentenceSplitter()

result:

机器学习有下面几种定义: “机器学习是一门人工智能的科学,该领域的主要研究对象是人工智能,特别是如何在经验学习中改善具体算法的性能”。
“机器学习是对能通过经验自动改进的计算机算法的研究”。
“机器学习是用数据或以往的经验,以此优化计算机程序的性能标准。”
一种经常引用的英文定义是:A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Word segmentation

Using the officially provided model cws.model, a sentence can be split into its component words:

'''
created on January 22 17:34 2019

@author:lhy
'''
from pyltp import Segmentor

def segmentor(sentence=''):
    segmentor=Segmentor()  # initialize instance
    segmentor.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\cws.model')  # load the model
    words=segmentor.segment(sentence)  # segment into words
    segmentor.release()  # release the model
    print('\n'.join(words))

segmentor(sentence='我今天很开心能去看电影')

result:

我
今天
很
开心
能
去
看
电影

We can see that the sentence has been split into individual words according to Chinese grammar.

Part-of-speech tagging

Building on word segmentation, we can determine the part of speech of each word:

'''
created on January 22 17:42 2019

@author:lhy
'''
from pyltp import Segmentor
from pyltp import Postagger


# word segmentation
def segmentor(sentence=''):
    segmentor=Segmentor()  # initialize instance
    segmentor.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\cws.model')  # load the model
    words=segmentor.segment(sentence)  # segment into words
    words_list=list(words)
    segmentor.release()  # release the model
    return words_list

# POS tagging
def posttagger(words):
    postagger=Postagger()  # initialize instance
    postagger.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\pos.model')
    postags=postagger.postag(words)  # POS tagging
    postagger.release()
    return postags

if __name__=='__main__':
    words=segmentor(sentence='我今天很开心能去看电影')
    postags=posttagger(words)
    
    for word,tag in zip(words,postags):
        print(word+'/'+tag)


result:

我/r
今天/nt
很/d
开心/a
能/v
去/v
看/v
电影/n

We can see that LTP tagged every word with its part of speech. LTP's POS tagging uses the 863 tagset; the meaning of each tag is given in the table below:
[Image: 863 POS tagset table]
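For quick reference, here is a minimal sketch mapping the 863 tags that appear in the example output to English glosses. The dictionary is a hand-written partial subset of my own, not the full tagset (see the table above for the complete list):

```python
# Partial mapping of LTP's 863 POS tags to English glosses
# (hand-written subset covering only the tags in the example output).
POS_863_SUBSET = {
    'n': 'general noun',
    'nt': 'temporal noun',
    'v': 'verb',
    'a': 'adjective',
    'd': 'adverb',
    'r': 'pronoun',
}

# The word/tag pairs from the result above
tagged = [('我', 'r'), ('今天', 'nt'), ('很', 'd'),
          ('开心', 'a'), ('能', 'v'), ('去', 'v'),
          ('看', 'v'), ('电影', 'n')]
for word, tag in tagged:
    print('%s/%s (%s)' % (word, tag, POS_863_SUBSET[tag]))
```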

Named entity recognition

Named Entity Recognition (NER) is the task of locating and classifying named entities, such as person names, place names, and organization names, in a sequence of words.

NER requires both the words of a sentence and their corresponding POS tags.

LTP uses the BIESO tagging scheme: B marks the first word of an entity, I an interior word, E the last word, S a single-word entity, and O a word that is not part of any named entity. LTP recognizes three entity types: person names (Nh), place names (Ns), and organization names (Ni). The B, I, E, and S position tags are joined to the entity-type tag with a dash (e.g. B-Ns); O carries no type tag.
Example:

'''
created on January 22 17:56 2019

@author:lhy
'''
from pyltp import Segmentor
from pyltp import Postagger
from pyltp import NamedEntityRecognizer

# word segmentation
def segmentor(sentence=''):
    segmentor=Segmentor()  # initialize instance
    segmentor.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\cws.model')  # load the model
    words=segmentor.segment(sentence)  # segment into words
    words_list=list(words)
    segmentor.release()  # release the model
    return words_list

# POS tagging
def posttagger(words):
    postagger=Postagger()  # initialize instance
    postagger.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\pos.model')
    postags=postagger.postag(words)  # POS tagging
    postagger.release()
    return postags

# named entity recognition
def reco(words,postags):
    recognizer=NamedEntityRecognizer()  # initialize instance
    recognizer.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\ner.model')
    netags=recognizer.recognize(words,postags)  # named entity recognition
    for word ,ntag in zip(words,netags):
        print(word+'/'+ntag)
    recognizer.release()
    return netags

if __name__=='__main__':
    # Some words are blocked by the platform's word filter, so pinyin is used here; replace it with the Chinese characters when you practice
    words=segmentor(sentence='guowuyuanzonglilikeqiang调研上海外高桥时提出,支持上海积极探索新机制')
    postags=posttagger(words)
    reco(words,postags)

result:

guowuyuan/S-Ni
zongli/O
likeqiang/S-Nh
调研/O
上海/B-Ns
外高桥/E-Ns
时/O
提出/O
,/O
支持/O
上海/S-Ns
积极/O
探索/O
新/O
机制/O

"Guowuyuan" is the name of your organization labeled as Ni, "likeqiang" is the name labeled as Nh, "Shanghai Waigaoqiao" place names marked as Ns. The "Shanghai Waigaoqiao" to "Shanghai" for the start of B, the "Waigaoqiao" to end E. guowuyuan a single entity, it is S.

Dependency parsing

Dependency parsing determines the syntactic structure of a sentence by analyzing the dependencies between its linguistic units. It identifies grammatical components such as subject-verb-object and attributive constructions, and analyzes the relations between them.
The dependency parser distinguishes 14 relation types, listed below:
[Image: dependency relation table (14 types)]
Example:

'''
created on January 24 16:05 2019

@author:lhy
'''
from pyltp import Segmentor
from pyltp import Postagger
from pyltp import Parser

# word segmentation
def segmentor(sentence=''):
    segmentor=Segmentor()  # initialize instance
    segmentor.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\cws.model')  # load the model
    words=segmentor.segment(sentence)  # segment into words
    words_list=list(words)
    segmentor.release()  # release the model
    return words_list

# POS tagging
def posttagger(words):
    postagger=Postagger()  # initialize instance
    postagger.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\pos.model')
    postags=postagger.postag(words)  # POS tagging
    postagger.release()
    return postags

# dependency parsing
def parse(words,postags):
    parser=Parser()  # initialize instance
    parser.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\parser.model')
    arcs=parser.parse(words,postags)  # dependency parsing
    i=0
    for word,arc in zip(words,arcs):
        i=i+1
        print(str(i)+'/'+word+'/'+str(arc.head)+'/'+str(arc.relation))
    parser.release()
    return arcs
    

if __name__=='__main__':
    words=segmentor(sentence='guowuyuanzonglilikeqiang调研上海外高桥时提出,支持上海积极探索新机制')
    postags=posttagger(words)
    parse(words,postags)

result:

1/guowuyuan/2/ATT
2/zongli/3/ATT
3/likeqiang/4/SBV
4/调研/7/ATT
5/上海/6/ATT
6/外高桥/4/VOB
7/时/8/ADV
8/提出/0/HED
9/,/8/WP
10/支持/8/COO
11/上海/13/SBV
12/积极/13/ADV
13/探索/10/VOB
14/新/15/ATT
15/机制/13/VOB

In the code, arc.head is the index of the word's parent (head) in the dependency arc, and arc.relation is the dependency relation of the arc. In the result we can see that "提出" (proposed) is the head (HED) of the whole sentence, so its parent index is 0. As for the other relations, "guowuyuan" has parent index 2, "zongli" (prime minister), forming an attributive relation ATT and the phrase "guowuyuanzongli"; "上海" (index 11) has parent index 13, "探索" (explore), forming a subject-verb relation SBV: "Shanghai ... explores".
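To make the head indices easier to read, the arcs can be turned into (relation, dependent, head word) triples. A small sketch, assuming 1-based head indices with 0 meaning the root, as in the output above (the helper name is my own):

```python
def dependency_triples(words, heads, relations):
    """Pair each word with its head word; head index 0 means the root."""
    triples = []
    for word, head, rel in zip(words, heads, relations):
        head_word = 'ROOT' if head == 0 else words[head - 1]
        triples.append((rel, word, head_word))
    return triples

# With pyltp this would be called as:
#   dependency_triples(words, [arc.head for arc in arcs],
#                      [arc.relation for arc in arcs])
print(dependency_triples(['我', '爱', '北京'],
                         [2, 0, 2],
                         ['SBV', 'HED', 'VOB']))
# → [('SBV', '我', '爱'), ('HED', '爱', 'ROOT'), ('VOB', '北京', '爱')]
```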

Semantic role labeling

Semantic Role Labeling (SRL) is a shallow semantic analysis technique that labels certain phrases in a sentence as arguments (semantic roles) of a given predicate, such as agent, patient, time, and location. It plays a role in applications such as question answering, information extraction, and machine translation.

Take the example sentence: when guowuyuanzonglilikeqiang was inspecting Shanghai Waigaoqiao, he proposed supporting Shanghai in actively exploring new mechanisms.

Taking the predicate "探索" (explore) as an example: "积极" (actively) is its manner, typically tagged ADV, and "新机制" (new mechanism) is its patient, typically tagged A1.
There are six core semantic roles, A0-A5: A0 is usually the agent of the action, A1 usually the patient (the thing affected by the action), and A2-A5 take different meanings depending on the verb. The remaining 15 roles are adjunct semantic roles, such as LOC for location and TMP for time. The full list of semantic roles is as follows:
[Image: semantic role table]
Example:

'''
created on January 24 17:05 2019

@author:lhy
'''
from pyltp import Segmentor
from pyltp import Postagger
from pyltp import Parser
from pyltp import SementicRoleLabeller

# word segmentation
def segmentor(sentence=''):
    segmentor=Segmentor()  # initialize instance
    segmentor.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\cws.model')  # load the model
    words=segmentor.segment(sentence)  # segment into words
    words_list=list(words)
    segmentor.release()  # release the model
    return words_list

# POS tagging
def posttagger(words):
    postagger=Postagger()  # initialize instance
    postagger.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\pos.model')
    postags=postagger.postag(words)  # POS tagging
    postagger.release()
    return postags

# dependency parsing
def parse(words,postags):
    parser=Parser()  # initialize instance
    parser.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\parser.model')
    arcs=parser.parse(words,postags)  # dependency parsing
    parser.release()
    return arcs

# semantic role labeling
def role_label(words,postags,arcs):
    labeller=SementicRoleLabeller()  # initialize instance
    labeller.load(r'D:\serve_out\auto-Q&A\ltp_data_v3.4.0\pisrl_win.model')
    roles=labeller.label(words,postags,arcs)  # semantic role labeling
    for role in roles:
        print(role.index,"".join(["%s:(%d,%d)"%(arg.name,arg.range.start,arg.range.end) for arg in role.arguments]))
    labeller.release()

if __name__=='__main__':
    words=segmentor(sentence='guowuyuanzonglilikeqiang调研上海外高桥时提出,支持上海积极探索新机制')
    postags=posttagger(words)
    arcs=parse(words,postags)
    role_label(words,postags,arcs)

result:

7 TMP:(0,6)A1:(9,14)
9 A1:(10,10)

As we can see, the numbers are word indices: index 7 corresponds to the predicate "提出" (proposed). For "提出", the TMP role spanning indices 0-6 gives the time: "when guowuyuanzonglilikeqiang was inspecting Shanghai Waigaoqiao". Its A1 (the content being proposed) spans indices 9-14: "support Shanghai in actively exploring new mechanisms". Index 9 corresponds to the predicate "支持" (support), whose A1, the thing supported, is "上海" (Shanghai). (This last one feels a little odd; shouldn't the thing being supported be "Shanghai actively exploring new mechanisms"?)
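The (start, end) pairs are inclusive, 0-based word indices, so the role spans can be mapped back to text. A minimal sketch (the helper name is my own):

```python
def role_span_text(words, start, end):
    """Join the words of an inclusive [start, end] word-index range."""
    return ''.join(words[start:end + 1])

# The segmented words of the example sentence
words = ['guowuyuan', 'zongli', 'likeqiang', '调研', '上海', '外高桥', '时',
         '提出', ',', '支持', '上海', '积极', '探索', '新', '机制']

# For the predicate at index 7 ('提出'): TMP:(0,6) and A1:(9,14)
print(role_span_text(words, 0, 6))   # the TMP span
print(role_span_text(words, 9, 14))  # the A1 span
```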


Origin blog.csdn.net/qq_41427568/article/details/86598333