Implementing top-down parsing and CYK-based parsing in Python

Problem Description

  1. Construct a top-down syntactic analyzer, and use the following grammar to perform syntactic analysis on the input string "child/n likes/v dog/n", and obtain the syntactic analysis process of the input string.
    (1.1) S→NP VP
    (1.2) S→VP
    (2.1) NP→n
    (2.2) NP→an
    (3.1) VP→v NP
  2. Construct a syntactic analyzer based on the CYK algorithm, and use the following grammar to perform syntactic analysis on the input string "张三/n 是/v 县长/n 派/v 来/v 的/de" ("Zhang San was sent by the county magistrate"), and obtain the syntactic analysis result corresponding to the input string.
    ① S→NP VP
    ② NP→n
    ③ NP→CS de
    ④ CS→NP V'
    ⑤ VP→v NP
    ⑥ V'→v v


Top-down parsing

Introduction

The grammar of a language can be expressed as a quadruple G = <T, N, P, S>, where T is the set of terminals (here, part-of-speech tags), N is the set of non-terminals (grammatical constituents), P is the set of productions (the syntax rules), and S is the start symbol, which is an element of N.
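As a rough illustration of the quadruple, the sketch below derives T, N, P and S from a list of (left-hand side, right-hand side) rules. The helper name `build_grammar` is hypothetical, and the convention that any symbol never appearing on a left-hand side counts as a terminal is an assumption of this sketch.

```python
# Sketch: derive the quadruple G = <T, N, P, S> from (lhs, rhs) rule tuples.
# build_grammar is a hypothetical helper, not part of the original code.

def build_grammar(rules, start="S"):
    # P: productions, with each right-hand side split into a tuple of symbols.
    productions = [(lhs, tuple(rhs.split())) for lhs, rhs in rules]
    # N: every symbol that appears on a left-hand side.
    nonterminals = {lhs for lhs, _ in productions}
    # T: assumed to be every right-hand-side symbol that is never rewritten.
    terminals = {sym for _, rhs in productions for sym in rhs} - nonterminals
    if start not in nonterminals:
        raise ValueError("the start symbol must be a non-terminal")
    return terminals, nonterminals, productions, start


rules = [
    ("S", "NP VP"),
    ("S", "VP"),
    ("NP", "n"),
    ("NP", "an"),
    ("VP", "v NP"),
]
T, N, P, S = build_grammar(rules)
print(sorted(T))  # ['an', 'n', 'v']
print(sorted(N))  # ['NP', 'S', 'VP']
```

For the first grammar of the problem, the terminals come out as the part-of-speech tags and the non-terminals as the phrase categories.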
Top-down analysis starts from the root of the tree and operates on derivations of the form S→z1→z2→…→zn. At the beginning the derivation contains only the start symbol S, and n=0. The rules that have been applied are kept on a last-in-first-out stack, which is empty at the beginning. The stack records the effect of the most recently applied rules, so that the latest choice can be undone if it leads to a dead end.
Because each token of the input string is a Chinese word followed by its part-of-speech tag, the string must first be split into words and tags. The tags matter most, because the parser operates on them: they are the terminal symbols of the grammar, and the rules rewrite non-terminal symbols until only tags remain.

Examples are as follows:

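rules = [                                               # grammar rules: (left-hand side, right-hand side)
    ('S', 'NP VP'),
    ('S', 'VP'),
    ('NP', 'n'),
    ('NP', 'an'),
    ('VP', 'v NP')
]
s = '孩子/an 喜欢/v 狗/n'                                  # input string of word/tag tokens
t1 = s.split()                                          # list of word/tag tokens
t = []                                                  # part-of-speech tags
word = []                                               # words
for it in t1:
    word.extend(it.split('/'))                          # append the word and its tag
    t.append(word.pop())                                # move the tag from word to t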

The resulting list of part-of-speech tags is then parsed. A "production" is a syntactic rule, and different types of grammar place different restrictions on the form of the rules, so the grammar type must be fixed before parsing. S is the start symbol; the analysis proceeds by repeatedly replacing non-terminal symbols according to the productions until the input tags are derived.
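The analyzer itself is not listed in this section, so the following is only a minimal sketch of one way to do the replacement described above: a backtracking, leftmost-derivation parser. The function name `top_down_parse` is hypothetical, recursion stands in for the explicit rule stack, and the pruning step assumes the grammar has no empty productions (true for the grammar used here).

```python
# Sketch of a backtracking top-down parser over part-of-speech tags.
# top_down_parse is a hypothetical name; this is not the original author's code.

def top_down_parse(rules, tags, start="S"):
    """Return the sentential forms of a leftmost derivation, or None on failure."""
    productions = [(lhs, rhs.split()) for lhs, rhs in rules]
    nonterminals = {lhs for lhs, _ in productions}

    def expand(form, pos, trace):
        # form: symbols still to be matched; pos: index of the next input tag.
        if not form:
            return trace if pos == len(tags) else None
        # Prune: without empty rules, each remaining symbol needs at least one tag.
        if len(form) > len(tags) - pos:
            return None
        head, rest = form[0], form[1:]
        if head not in nonterminals:
            # Terminal: it must equal the next input tag.
            return expand(rest, pos + 1, trace) if tags[pos] == head else None
        # Non-terminal: try each production in order, backtracking on failure.
        for lhs, rhs in productions:
            if lhs == head:
                new_form = rhs + rest
                result = expand(new_form, pos, trace + [tags[:pos] + new_form])
                if result is not None:
                    return result
        return None

    return expand([start], 0, [[start]])


rules = [
    ('S', 'NP VP'),
    ('S', 'VP'),
    ('NP', 'n'),
    ('NP', 'an'),
    ('VP', 'v NP'),
]
steps = top_down_parse(rules, ['n', 'v', 'n'])
print(' => '.join(' '.join(form) for form in steps))
# S => NP VP => n VP => n v NP => n v n
```

With the tag list `['an', 'v', 'n']` the sketch first tries NP→n, fails on the first tag, and backtracks to NP→an; with `['v', 'n']` it falls back from S→NP VP to S→VP.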

Experimental results


Changing the tag of the first word from n to an and running the test again also yields a successful analysis, which confirms that the algorithm works.

Finally, the input string is changed so that the second production of S (S→VP) is used.


CYK algorithm

Process

Initialization: for i=0 and j=0,...,n-1, put the symbol tj of the input string into P(i,j). Here P(i,j) is the set of symbols that can derive the span of i+1 words starting at word j.

  1. Construct a list of rules and store the grammar in the form of tuples
  2. Split the input string into a list of words
  3. Keep track of the lengths of the word list and grammar list
  4. Construct Analysis Table Matrix
rules = [                                               # grammar rules (V1 stands for V')
    ('S', 'NP VP'),
    ('NP', 'n'),
    ('NP', 'CS de'),
    ('CS', 'NP V1'),
    ('VP', 'v NP'),
    ('V1', 'v v')
]
s = '张三/n 是/v 县长/n 派/v 来/v 的/de'                   # input string of word/tag tokens
t1 = s.split()                                          # list of word/tag tokens
t = []                                                  # part-of-speech tags
word = []                                               # words
for it in t1:
    word.extend(it.split('/'))
    t.append(word.pop())
print('Words of the input string: ' + str(word))
print('Part-of-speech tags of the input string: ' + str(t))

n = len(t)                                              # number of words
nRules = len(rules)
P = [[set() for j in range(n-i)] for i in range(n)]     # triangular matrix

# Initialization: for i=0, j=0,...,n-1, put the symbol tj of the input string into P(i,j)
for j in range(n):
    P[0][j].add(t[j])

After initialization, row 0 of the "triangular matrix" holds one tag per word, {n} {v} {n} {v} {v} {de}, and every other cell is an empty set.

Step 1: for i=0 and j=0,...,n-1, if there is a rule A→tj, add the non-terminal symbol A to P(i,j). In other words, traverse row 0 of the analysis table and, for each cell, check whether the right-hand side of any production is in it; if so, add the left-hand non-terminal of that production to P(0,j).

for j in range(n):
    for r in range(nRules):
        if rules[r][1] in P[0][j]:
            P[0][j].add(rules[r][0])

After the first step, NP has been added to the cells of row 0 that contain n, so row 0 reads {n, NP} {v} {n, NP} {v} {v} {de}.

Step 2: for i=1,...,n-1, j=0,1,...,n-i-1, k=0,...,i-1, and for each rule A→BC, if B∈P(k,j) and C∈P(i-k-1,j+k+1), add the non-terminal symbol A to P(i,j). The right-hand sides of the productions have to be processed first: because each rule is stored as a tuple, rules[r][1] is taken out and split on whitespace into a list of symbols, which is appended to ruleList. A right-hand side with a single symbol is padded with 'null' so that every entry has two elements.
Nested loops then check whether the condition holds.

ruleList = []
for r in range(nRules):
    result = rules[r][1].split()
    if len(result) == 1:
        result.append('null')
    ruleList.append(result)
# print(ruleList)
for i in range(1, n):
    for j in range(0, n-i):
        for k in range(0, i):
            for r in range(nRules):
                if (ruleList[r][0] in P[k][j]) and (ruleList[r][1] in P[i-k-1][j+k+1]):
                    P[i][j].add(rules[r][0])

Splitting the right-hand side of each production gives ruleList = [['NP', 'VP'], ['n', 'null'], ['CS', 'de'], ['NP', 'V1'], ['v', 'NP'], ['v', 'v']].

Each pass of the outer loop fills one more row of the table. The cells that become non-empty in each pass are:

  i=1: P(1,1)={VP}, P(1,3)={V1}
  i=2: P(2,0)={S}, P(2,2)={CS}
  i=3: P(3,2)={NP}
  i=4: P(4,1)={VP}
  i=5: P(5,0)={S}
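The index arithmetic of step 2 can be checked in isolation. The sketch below lists, for one cell P(i,j), the pairs of cells that the loop over k compares; `split_points` is a hypothetical helper introduced only for this illustration.

```python
# Sketch: enumerate the (left, right) cell pairs examined for cell P(i, j).
# split_points is a hypothetical helper; cells are written as (row, column).

def split_points(i, j):
    """Yield (k, left_cell, right_cell) for every split of the span at P(i, j)."""
    for k in range(i):
        # Left part: k+1 words starting at j; right part: i-k words starting at j+k+1.
        yield k, (k, j), (i - k - 1, j + k + 1)


for k, left, right in split_points(3, 2):
    print(k, left, right)
# 0 (0, 2) (2, 3)
# 1 (1, 2) (1, 4)
# 2 (2, 2) (0, 5)
```

For P(3,2), the split k=2 pairs P(2,2) with P(0,5), which is where the rule NP→CS de applies in the example sentence.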

Step 3: If S∈P(n-1,0), the analysis is successful, otherwise the analysis fails.

if 'S' in P[n-1][0]:
    print('Analysis succeeded')
else:
    print('Analysis failed')
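The check above only reports whether the string is accepted. One possible way to also recover a parse tree, sketched here as an assumption rather than as the original method, is to store a back-pointer for each symbol added to the table. The names `cyk_parse`, `back` and `build` are hypothetical, and only the first derivation found for each symbol is kept.

```python
# Sketch: CYK with back-pointers so that one parse tree can be rebuilt.
# cyk_parse, back and build are hypothetical names, not from the original code.

def cyk_parse(rules, tags, start="S"):
    """Return a bracketed parse tree for the tag sequence, or None if rejected."""
    n = len(tags)
    unary = [(lhs, rhs) for lhs, rhs in rules if len(rhs.split()) == 1]
    binary = [(lhs, *rhs.split()) for lhs, rhs in rules if len(rhs.split()) == 2]
    # back[i][j] maps each symbol to one way of building it over the span
    # of i+1 words starting at word j (same indexing as P above).
    back = [[{} for _ in range(n - i)] for i in range(n)]

    for j, tag in enumerate(tags):
        back[0][j][tag] = None                  # a tag stands for itself
        for lhs, rhs in unary:
            if rhs == tag:
                back[0][j][lhs] = (tag,)        # unit rule A -> tag

    for i in range(1, n):
        for j in range(n - i):
            for k in range(i):
                for lhs, left, right in binary:
                    if left in back[k][j] and right in back[i - k - 1][j + k + 1]:
                        # Keep the first derivation found; ambiguity is ignored.
                        back[i][j].setdefault(lhs, (k, left, right))

    def build(sym, i, j):
        entry = back[i][j][sym]
        if entry is None:
            return sym
        if len(entry) == 1:
            return f"({sym} {entry[0]})"
        k, left, right = entry
        return f"({sym} {build(left, k, j)} {build(right, i - k - 1, j + k + 1)})"

    if n == 0 or start not in back[n - 1][0]:
        return None
    return build(start, n - 1, 0)


rules = [
    ('S', 'NP VP'),
    ('NP', 'n'),
    ('NP', 'CS de'),
    ('CS', 'NP V1'),
    ('VP', 'v NP'),
    ('V1', 'v v'),
]
print(cyk_parse(rules, ['n', 'v', 'n', 'v', 'v', 'de']))
# (S (NP n) (VP v (NP (CS (NP n) (V1 v v)) de)))
```

The tree is written over tags only; pairing each tag with its word from the `word` list would be a straightforward extension.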


Experimental results

For the input string above, S∈P(5,0), so the program reports that the analysis succeeded.



Origin blog.csdn.net/qq_48068259/article/details/127644818