NLP introductory study 3 - syntax analysis (based on LTP4)
0. Introduction
This article introduces syntactic analysis in NLP using the LTP toolkit, and covers LTP's other functions along the way.
1. Introduction to LTP
LTP (Language Technology Platform) provides a series of Chinese natural language processing tools. Users can use these tools to perform word segmentation, part-of-speech tagging, and syntactic analysis on Chinese texts.
git: https://github.com/HIT-SCIR/ltp
2. Installation
2.1 Module installation
The older Python binding, pyltp, is troublesome to install. Under Linux, running pip installs version 0.2.1 by default:
pip install pyltp
Installation on Windows runs into problems; see https://github.com/HIT-SCIR/pyltp/issues/125
Installing LTP4 is recommended instead, which takes a single command:
pip install ltp
2.2 Model download
The official repository links to the model downloads, covering two versions (V1 and V2) and three model sizes (base, small and tiny).
It also reports benchmark metrics for both versions.
3. Usage
3.1 Sentence splitting
from ltp import LTP
ltp = LTP('../ltp_model/base')  # load the model; the path is the folder the model archive was extracted to
# all later snippets omit this model-loading step
sents = ltp.sent_split(["吃葡萄不吐葡萄皮。不吃葡萄倒吐葡萄皮。"])
sents
# [out]: ['吃葡萄不吐葡萄皮。', '不吃葡萄倒吐葡萄皮。']
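To make the idea of sentence splitting concrete, here is a minimal rule-based sketch in plain Python. It is not how `ltp.sent_split` is implemented; the helper name `split_sentences` and the set of terminators are assumptions chosen for illustration.

```python
import re

# Hypothetical helper, not part of LTP: a sentence is a run of
# non-terminator characters followed by its closing punctuation.
_SENTENCE = re.compile(r"[^。!?]+[。!?]*")

def split_sentences(text):
    """Return the sentences in text, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

print(split_sentences("吃葡萄不吐葡萄皮。不吃葡萄倒吐葡萄皮。"))
```

A rule like this ignores quotation marks and ellipses, which is one reason to prefer the library call on real text.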
3.2 Word segmentation
segment, hidden = ltp.seg(["南京市长江大桥"])
segment
# [out]: [['南京市', '长江', '大桥']]
Like other word segmentation tools, LTP supports custom dictionaries, which are loaded from a file path:
ltp.init_dict(path="user_dict.txt", max_window=4)
You can also pass custom words directly, which changes the segmentation result:
ltp.add_words(words=["市长", "江大桥"], max_window=4)
segment, hidden = ltp.seg(["南京市长江大桥"])
segment
# [out]: [['南京', '市长', '江大桥']]
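The `max_window` argument appears to bound how long a dictionary match may be. The sketch below shows plain forward maximum matching over a small in-memory word set. It illustrates the general technique only; `find_dict_words` is a hypothetical helper, and how LTP combines dictionary matches with its model predictions is not covered here.

```python
def find_dict_words(text, words, max_window=4):
    """Scan left to right and return (start, end, word) for the longest
    dictionary word starting at each position, at most max_window characters."""
    matches = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_window, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in words:
                matches.append((i, i + size, candidate))
                i += size  # continue after the matched word
                break
        else:
            i += 1  # no dictionary word starts at this position
    return matches

print(find_dict_words("南京市长江大桥", {"市长", "江大桥"}))
```

With a window of 2 the three-character word 江大桥 can no longer match, which shows why the window must be at least as long as the longest custom word.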
3.3 Part-of-speech tagging
pos = ltp.pos(hidden)
pos
# [out]: [['ns', 'ns', 'n']] # this result corresponds to [['南京市', '长江', '大桥']]
The part-of-speech tags line up one-to-one with the segmented words. Tags such as ns are explained in Section 4.
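Because the two outputs are parallel lists, they can be paired with `zip`. The snippet below is a small sketch that works on the sample outputs copied from above, so it runs without loading a model; `tag_words` is a hypothetical helper name.

```python
def tag_words(segment, pos):
    """Pair each sentence's words with its tags as (word, tag) tuples."""
    return [list(zip(words, tags)) for words, tags in zip(segment, pos)]

segment = [["南京市", "长江", "大桥"]]  # sample output of ltp.seg shown above
pos = [["ns", "ns", "n"]]             # sample output of ltp.pos shown above
print(tag_words(segment, pos))
```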
3.4 Named Entity Recognition
LTP can also perform named entity recognition, but harder tasks are better served by a larger model; LTP's models only have so many parameters.
ner = ltp.ner(hidden)
ner
# [out]: [[('Nh', 0, 0), ('Ns', 1, 1), ('Nh', 2, 2)]]
for i in range(len(ner[0])):  # there is only one sentence, so take ner[0] directly
    print("{}:".format(ner[0][i][0]), segment[0][ner[0][i][1]: ner[0][i][2]+1])
# [out]: Nh: ['南京']
# Ns: ['市长']
# Nh: ['江大桥']
As expected, the result is not very accurate: 南京 (Nanjing) was recognized as a person's name. LTP's part-of-speech tag set is also very fine-grained, whereas in practice it is often enough to tell verbs from nouns. To tag person and place names, a dedicated NER model is the better choice.
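Each NER item above is a `(label, start, end)` tuple whose end index is inclusive. A reusable way to turn those spans into text, written against the sample output rather than a live model, might look like this; `extract_entities` is a hypothetical helper, not an LTP function.

```python
def extract_entities(words, spans):
    """Convert (label, start, end) spans, end inclusive, into (label, text) pairs."""
    return [(label, "".join(words[start:end + 1])) for label, start, end in spans]

segment = [["南京", "市长", "江大桥"]]                   # sample segmentation from above
ner = [[("Nh", 0, 0), ("Ns", 1, 1), ("Nh", 2, 2)]]     # sample NER output from above
for words, spans in zip(segment, ner):                 # works for any number of sentences
    print(extract_entities(words, spans))
```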
3.5 Dependency Syntax Analysis
ltp.add_words(words=["市长", "江大桥"], max_window=4)
segment, hidden = ltp.seg(["南京市长江大桥是南京市的市长"])
print(segment)
dep = ltp.dep(hidden)
dep
# [out]: [['南京', '市长', '江大桥', '是', '南京市', '的', '市长']]
# [out]: [[(1, 2, 'ATT'),
# (2, 3, 'ATT'),
# (3, 4, 'SBV'),
# (4, 0, 'HED'),
# (5, 7, 'ATT'),
# (6, 5, 'RAD'),
# (7, 4, 'VOB')]]
Take ATT as an example: it marks the attributive relationship between a modifier and its head word. Here 南京 (Nanjing) is the attributive of 市长 (mayor), and 市长 is the attributive of 江大桥 (Jiang Daqiao). These relations have many practical uses. In named entity recognition, for instance, once an entity has been found, its modifiers can be attached to complete it.
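The modifier-completion idea can be sketched in plain Python on the sample output above. The code assumes each triple is `(dependent, head, relation)` with 1-based word indices and 0 standing for the root, which matches the output shown; `readable_arcs` and `expand_with_attributes` are hypothetical helper names.

```python
def readable_arcs(words, arcs):
    """Render each (dependent, head, relation) triple as 'dependent --REL--> head'."""
    names = ["ROOT"] + list(words)  # index 0 is the virtual root
    return [f"{names[dep]} --{rel}--> {names[head]}" for dep, head, rel in arcs]

def expand_with_attributes(words, arcs, index):
    """Return the word at 1-based index plus everything attached to it
    through chains of ATT arcs, joined in sentence order."""
    selected = {index}
    changed = True
    while changed:  # repeat until no new modifier is found
        changed = False
        for dep, head, rel in arcs:
            if rel == "ATT" and head in selected and dep not in selected:
                selected.add(dep)
                changed = True
    return "".join(words[i - 1] for i in sorted(selected))

words = ["南京", "市长", "江大桥", "是", "南京市", "的", "市长"]
arcs = [(1, 2, "ATT"), (2, 3, "ATT"), (3, 4, "SBV"), (4, 0, "HED"),
        (5, 7, "ATT"), (6, 5, "RAD"), (7, 4, "VOB")]

for line in readable_arcs(words, arcs):
    print(line)
print(expand_with_attributes(words, arcs, 3))  # start from 江大桥
```

Only ATT arcs are followed here, so the particle 的 (attached with RAD) is left out when expanding the final 市长; whether to include such arcs depends on the application.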
4. Description
4.1 Part-of-speech tagging
Tag | Description | Example |
---|---|---|
a | adjective | 美丽 (beautiful) |
b | other noun-modifier | 大型 (large-scale), 西式 (Western-style) |
c | conjunction | 和 (and), 虽然 (although) |
d | adverb | 很 (very) |
e | exclamation | 哎 (hey) |
g | morpheme | 茨, 甥 |
h | prefix | 阿, 伪 |
i | idiom | 百花齐放 (let a hundred flowers bloom) |
j | abbreviation | 公检法 (public security, procuratorate and courts) |
k | suffix | 界, 率 |
m | number | 一 (one), 第一 (first) |
n | general noun | 苹果 (apple) |
nd | direction noun | 右侧 (right side) |
nh | person name | 杜甫 (Du Fu), 汤姆 (Tom) |
ni | organization name | 保险公司 (insurance company) |
nl | location noun | 城郊 (suburbs) |
ns | geographical name | 北京 (Beijing) |
nt | temporal noun | 近日 (recently), 明代 (Ming Dynasty) |
nz | other proper noun | 诺贝尔奖 (Nobel Prize) |
o | onomatopoeia | 哗啦 (crash) |
p | preposition | 在 (in), 把 (object marker) |
q | quantity (measure word) | 个 (general measure word) |
r | pronoun | 我们 (we, us) |
u | auxiliary | 地 (adverbial particle) |
v | verb | 跑 (run), 学习 (study) |
wp | punctuation | ,。! |
ws | foreign words | CPU |
x | non-lexeme (does not form a word) | 萄, 翱 |
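When only verbs and nouns matter, the fine-grained tags can be collapsed into a few coarse classes. The sketch below relies on the fact that every noun tag in the table starts with `n`; the helper `coarse_pos` and the example tags are hypothetical, written by hand for illustration rather than produced by LTP.

```python
def coarse_pos(tag):
    """Collapse a fine-grained tag into 'noun', 'verb' or 'other'."""
    if tag == "v":
        return "verb"
    if tag.startswith("n"):  # n, nd, nh, ni, nl, ns, nt, nz
        return "noun"
    return "other"

# Hand-written example tags (an assumption, not real LTP output).
words = ["我们", "学习", "诺贝尔奖", "的", "历史"]
tags = ["r", "v", "nz", "u", "n"]

content_words = [w for w, t in zip(words, tags) if coarse_pos(t) != "other"]
print(content_words)
```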
4.2 Dependency Syntax Analysis
Relation type | Tag | Description | Example |
---|---|---|---|
Subject-verb relationship | SBV | subject-verb | 我送她一束花, "I send her a bouquet of flowers" (我 <– 送) |
Verb-object relationship | VOB | direct object, verb-object | 我送她一束花 (送 –> 花) |
Indirect-object relationship | IOB | indirect object, indirect-object | 我送她一束花 (送 –> 她) |
Fronted object | FOB | fronted object, fronting-object | 他什么书都读, "He reads every kind of book" (书 <– 读) |
Pivot construction | DBL | double | 他请我吃饭, "He invites me to dinner" (请 –> 我) |
Attributive relationship | ATT | attribute | 红苹果, "red apple" (红 <– 苹果) |
Adverbial structure | ADV | adverbial | 非常美丽, "very beautiful" (非常 <– 美丽) |
Verb-complement structure | CMP | complement | 做完了作业, "finished the homework" (做 –> 完) |
Coordinate relationship | COO | coordinate | 大山和大海, "mountains and seas" (大山 –> 大海) |
Preposition-object relationship | POB | preposition-object | 在贸易区内, "within the trade zone" (在 –> 内) |
Left adjunct relationship | LAD | left adjunct | 大山和大海 (和 <– 大海) |
Right adjunct relationship | RAD | right adjunct | 孩子们, "children" (孩子 –> 们) |
Independent structure | IS | independent structure | two clauses that are structurally independent of each other |
Head relationship | HED | head | the core of the whole sentence |
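As one possible use of these labels, subject-verb-object triples can be read off from SBV and VOB arcs that share the same head. The sketch below runs on the sample dependency output from Section 3.5 and assumes `(dependent, head, relation)` triples with 1-based indices; `subject_verb_object` is a hypothetical helper, not part of LTP.

```python
def subject_verb_object(words, arcs):
    """Return (subject, verb, object) for every head word that has
    both an SBV dependent and a VOB dependent."""
    subjects, objects = {}, {}
    for dep, head, rel in arcs:
        if rel == "SBV":
            subjects[head] = dep  # keeps the last subject if there are several
        elif rel == "VOB":
            objects[head] = dep   # keeps the last object if there are several
    return [
        (words[subjects[h] - 1], words[h - 1], words[objects[h] - 1])
        for h in sorted(subjects)
        if h in objects
    ]

words = ["南京", "市长", "江大桥", "是", "南京市", "的", "市长"]
arcs = [(1, 2, "ATT"), (2, 3, "ATT"), (3, 4, "SBV"), (4, 0, "HED"),
        (5, 7, "ATT"), (6, 5, "RAD"), (7, 4, "VOB")]
print(subject_verb_object(words, arcs))
```

The triple contains only the head words; combining it with ATT expansion would recover the full phrases on each side.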