NLP introductory study 3 - syntax analysis (based on LTP4)

0. Introduction

This article introduces syntactic analysis in NLP, mainly using the LTP toolkit, and covers LTP's other functions along the way.

1. Introduction to LTP

LTP (Language Technology Platform) provides a series of Chinese natural language processing tools. Users can use these tools to perform word segmentation, part-of-speech tagging, and syntactic analysis on Chinese texts.
GitHub: https://github.com/HIT-SCIR/ltp

2. Installation

2.1 Module installation

Installing the old version of LTP (pyltp) is problematic. On Linux, pip installs version 0.2.1 by default:

pip install pyltp

On Windows the installation runs into problems; see https://github.com/HIT-SCIR/pyltp/issues/125. Installing LTP 4 is recommended instead, which takes a single command:

pip install ltp

2.2 Model download

The official repository provides download links for the models, in two versions (V1 and V2) and three sizes (base, small and tiny).
The repository also lists benchmark metrics for both versions.
[Figure: V2 metrics]
[Figure: V1 metrics]

3. Usage

3.1 Sentence splitting

from ltp import LTP
ltp = LTP('../ltp_model/base')   # load the model; the path is the folder extracted from the model archive
# the remaining snippets omit the model-loading step

sents = ltp.sent_split(["吃葡萄不吐葡萄皮。不吃葡萄倒吐葡萄皮。"])
sents

# [out]:   ['吃葡萄不吐葡萄皮。', '不吃葡萄倒吐葡萄皮。']

3.2 Word segmentation

segment, hidden = ltp.seg(["南京市长江大桥"])
segment

# [out]:  [['南京市', '长江', '大桥']]

Like other word segmentation tools, LTP supports custom dictionaries, which can be loaded from a file path:

ltp.init_dict(path="user_dict.txt", max_window=4)

Custom words can also be added directly, which changes the segmentation result:

ltp.add_words(words=["市长", "江大桥"], max_window=4)

segment, hidden = ltp.seg(["南京市长江大桥"])
segment

# [out]:  [['南京', '市长', '江大桥']]
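
This article does not describe how LTP combines the custom dictionary with its model, so the following is not LTP's implementation. It is a generic forward maximum matching segmenter (the function name forward_max_match is hypothetical), meant only to illustrate how a dictionary plus a match window such as max_window=4 can move word boundaries.

```python
def forward_max_match(text, vocab, max_window=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word of at most max_window characters, else a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_window, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

text = "南京市长江大桥"
default_like = forward_max_match(text, {"南京市", "长江", "大桥"})
with_custom = forward_max_match(text, {"南京", "市长", "江大桥"})
print(default_like)
print(with_custom)
```

The same string yields different boundaries depending only on which words the dictionary contains, which is the effect the add_words call above relies on.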

3.3 Part-of-speech tagging

pos = ltp.pos(hidden)
pos

# [out]:  [['ns', 'ns', 'n']]    # this result corresponds to [['南京市', '长江', '大桥']]

The part-of-speech tags line up one-to-one with the segmented words. The meaning of ns and the other tags is explained in Section 4.
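
Because the tags and words are aligned by position, they can be paired with zip. The sketch below uses the output shown above as literal data, so it runs without a model. The helper names coarse_class and pair_words_with_tags are hypothetical, and the coarse grouping (every tag starting with n counts as a noun) is an assumption based on the tag list in Section 4.1.

```python
segment = [['南京市', '长江', '大桥']]
pos = [['ns', 'ns', 'n']]

def coarse_class(tag):
    """Collapse a fine-grained tag into 'noun', 'verb' or 'other'."""
    if tag.startswith('n'):      # n, nd, nh, ni, nl, ns, nt, nz
        return 'noun'
    if tag == 'v':
        return 'verb'
    return 'other'

def pair_words_with_tags(segment, pos):
    """Return, per sentence, a list of (word, tag, coarse class) triples."""
    return [[(w, t, coarse_class(t)) for w, t in zip(words, tags)]
            for words, tags in zip(segment, pos)]

pairs = pair_words_with_tags(segment, pos)
for word, tag, coarse in pairs[0]:
    print(word, tag, coarse)
```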

3.4 Named Entity Recognition

LTP can also perform named entity recognition, but for harder tasks a larger, dedicated model is the better choice; after all, LTP's models have a limited number of parameters.

ner = ltp.ner(hidden)
ner

# [out]: [[('Nh', 0, 0), ('Ns', 1, 1), ('Nh', 2, 2)]]

for i in range(len(ner[0])):            # there is only one sentence, so ner[0] is used directly
    print("{}:".format(ner[0][i][0]), segment[0][ner[0][i][1]: ner[0][i][2]+1])
# [out]:  Nh: ['南京']
#         Ns: ['市长']
#         Nh: ['江大桥']

As expected, the result is not very accurate: 南京 (Nanjing) is recognized as a person's name. As for part-of-speech tagging, LTP's tag set is very fine-grained, whereas many practical applications only need to tell verbs from nouns. To tag person and place names, a dedicated NER model is the better choice.
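
Whichever model produces the spans, the (tag, start, end) tuples can be turned into entity strings for every sentence in a batch rather than only the first one. This is a minimal sketch that uses the output shown above as literal data; entities_from_spans is a hypothetical helper name, and it assumes the end index is inclusive, as in the loop above.

```python
def entities_from_spans(segment, ner):
    """For each sentence, join the words covered by each (tag, start, end) span."""
    results = []
    for words, spans in zip(segment, ner):
        results.append([(tag, ''.join(words[start:end + 1]))
                        for tag, start, end in spans])
    return results

segment = [['南京', '市长', '江大桥']]
ner = [[('Nh', 0, 0), ('Ns', 1, 1), ('Nh', 2, 2)]]
entities = entities_from_spans(segment, ner)
print(entities)
```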

3.5 Dependency Syntax Analysis

ltp.add_words(words=["市长", "江大桥"], max_window=4)
segment, hidden = ltp.seg(["南京市长江大桥是南京市的市长"])
print(segment)
dep = ltp.dep(hidden)
dep

# [out]: [['南京', '市长', '江大桥', '是', '南京市', '的', '市长']]
# [out]: [[(1, 2, 'ATT'),
#            (2, 3, 'ATT'),
#            (3, 4, 'SBV'),
#            (4, 0, 'HED'),
#            (5, 7, 'ATT'),
#            (6, 5, 'RAD'),
#            (7, 4, 'VOB')]]

Each tuple is (word index, head index, relation), where indices start at 1 and a head index of 0 marks the root of the sentence. For example, ATT denotes the attribute relation between an attributive modifier and its head word: 南京 (Nanjing) modifies 市长 (mayor), and 市长 in turn modifies 江大桥 (Jiang Daqiao). These relations have many practical uses. For example, once named entity recognition has found an entity, the ATT relations can be used to attach its modifiers and recover the complete phrase.
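
The idea of completing an entity with its modifiers can be sketched as follows. The code uses the segmentation and dependency output shown above as literal data, so no model is needed. The helper names readable_arcs and expand_with_attributes are hypothetical, and following only ATT arcs is a simplifying assumption (for instance, the particle 的 is attached by RAD and is therefore dropped).

```python
words = ['南京', '市长', '江大桥', '是', '南京市', '的', '市长']
arcs = [(1, 2, 'ATT'), (2, 3, 'ATT'), (3, 4, 'SBV'), (4, 0, 'HED'),
        (5, 7, 'ATT'), (6, 5, 'RAD'), (7, 4, 'VOB')]

def readable_arcs(words, arcs):
    """Replace 1-based indices with words; head index 0 is the virtual root."""
    def word(i):
        return 'ROOT' if i == 0 else words[i - 1]
    return [(word(dep), rel, word(head)) for dep, head, rel in arcs]

def expand_with_attributes(words, arcs, target):
    """Join the target word (1-based index) with all of its direct and
    indirect ATT modifiers, in sentence order."""
    keep = {target}
    frontier = [target]
    while frontier:
        head = frontier.pop()
        for dep, h, rel in arcs:
            if h == head and rel == 'ATT' and dep not in keep:
                keep.add(dep)
                frontier.append(dep)
    return ''.join(words[i - 1] for i in sorted(keep))

named = readable_arcs(words, arcs)
print(named[:3])
print(expand_with_attributes(words, arcs, 3))
print(expand_with_attributes(words, arcs, 7))
```

Starting from 江大桥 (index 3), the two ATT arcs pull in 市长 and then 南京, which gives back the full phrase 南京市长江大桥.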

4. Description

4.1 Part-of-speech tagging

Tag  Description          Example
a    adjective            美丽 (beautiful)
b    other noun-modifier  大型 (large-scale), 西式 (Western-style)
c    conjunction          和 (and), 虽然 (although)
d    adverb               很 (very)
e    exclamation          哎 (hey)
g    morpheme             茨, 甥
h    prefix               阿, 伪
i    idiom                百花齐放 (let a hundred flowers bloom)
j    abbreviation         公检法 (public security, procuratorate and courts)
k    suffix               界, 率
m    number               一 (one), 第一 (first)
n    general noun         苹果 (apple)
nd   direction noun       右侧 (right side)
nh   person name          杜甫 (Du Fu), 汤姆 (Tom)
ni   organization name    保险公司 (insurance company)
nl   location noun        城郊 (suburbs)
ns   geographical name    北京 (Beijing)
nt   temporal noun        近日 (recently), 明代 (Ming Dynasty)
nz   other proper noun    诺贝尔奖 (Nobel Prize)
o    onomatopoeia         哗啦 (crash)
p    preposition          在 (in), 把
q    quantity             个 (general measure word)
r    pronoun              我们 (we, us)
u    auxiliary            地 (structural particle)
v    verb                 跑 (run), 学习 (study)
wp   punctuation          ,。!
ws   foreign words        CPU
x    non-lexeme           萄, 翱 (characters that do not form a word on their own)

4.2 Dependency Syntax Analysis

Tag  Relation                        Example
SBV  subject-verb                    我送她一束花, "I send her a bouquet of flowers" (我 <- 送)
VOB  verb-object (direct object)     我送她一束花, "I send her a bouquet of flowers" (送 -> 花)
IOB  indirect-object                 我送她一束花, "I send her a bouquet of flowers" (送 -> 她)
FOB  fronting-object                 他什么书都读, "He reads every kind of book" (书 <- 读)
DBL  double (pivotal construction)   他请我吃饭, "He invites me to dinner" (请 -> 我)
ATT  attribute                       红苹果, "red apple" (红 <- 苹果)
ADV  adverbial                       非常美丽, "very beautiful" (非常 <- 美丽)
CMP  complement                      做完了作业, "finished the homework" (做 -> 完)
COO  coordinate                      大山和大海, "mountains and seas" (大山 -> 大海)
POB  preposition-object              在贸易区内, "within the trade zone" (在 -> 内)
LAD  left adjunct                    大山和大海, "mountains and seas" (和 <- 大海)
RAD  right adjunct                   孩子们, "children" (孩子 -> 们)
IS   independent structure           two clauses that are structurally independent of each other
HED  head                            the core (root) of the whole sentence
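
As an illustration of how these tags can be combined, the SBV and VOB relations that share the same head verb give a rough subject-verb-object triple. The sketch below reuses the dependency output from Section 3.5 as literal data; extract_svo is a hypothetical helper name, and the approach is a simplification that ignores coordination, pivotal constructions and fronted objects.

```python
def extract_svo(words, arcs):
    """Return (subject, verb, object) triples for verbs that have both
    an SBV and a VOB dependent. Indices in arcs are 1-based."""
    subjects, objects = {}, {}
    for dep, head, rel in arcs:
        if rel == 'SBV':
            subjects[head] = dep
        elif rel == 'VOB':
            objects[head] = dep
    return [(words[subjects[v] - 1], words[v - 1], words[objects[v] - 1])
            for v in sorted(subjects) if v in objects]

words = ['南京', '市长', '江大桥', '是', '南京市', '的', '市长']
arcs = [(1, 2, 'ATT'), (2, 3, 'ATT'), (3, 4, 'SBV'), (4, 0, 'HED'),
        (5, 7, 'ATT'), (6, 5, 'RAD'), (7, 4, 'VOB')]
triples = extract_svo(words, arcs)
print(triples)
```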

Origin blog.csdn.net/weixin_44826203/article/details/108472878