NLP technology: statistical syntactic parsing with the PCFG-based CYK algorithm

Table of Contents

1. Understanding Syntactic Parsing

2. The CYK Algorithm

3. Python Implementation: the Core CYK Algorithm

3.1 Data structure selection and initialization

3.2 Filling the leaf nodes

3.3 Filling the non-leaf nodes

3.4 Backtracking to build the syntax tree

4. Walkthrough of a Detailed Example

5. Experimental Results: the Syntax Tree

6. Analysis and Summary


1. Understanding Syntactic Parsing

First of all, what does syntactic parsing mean, and what does it do? As the name suggests, it is reminiscent of all the syntax and grammar drills from learning English. Exactly! The idea here is to hand that analysis over to the computer: let it work out the syntactic structure of a sentence so that it can better understand the sentence's semantics. That is the purpose of this NLP task, and a step toward the larger goal of AI.

Syntactic parsing is one of the key technologies in natural language processing. Its basic task is to determine the syntactic structure of a sentence, or the dependency relationships between its words. It divides into two branches: constituent (syntactic structure) parsing and dependency parsing. This post introduces one method of constituent parsing in detail: statistical parsing based on a probabilistic context-free grammar (PCFG), using the CYK algorithm to analyze an input word sequence (a sentence) and find a structure that conforms to the grammar rules. The post records my learning notes in detail: a Python implementation of PCFG parsing with the CYK algorithm, an explanation of the ideas behind statistical parsing through the core algorithm, and the concrete code needed to obtain the syntax tree of a sentence.

This article builds on the first two NLP tasks, moving further toward letting computers understand the meaning of human language. The first two tasks were:

  1. Word segmentation: bidirectional maximum matching, dictionary-rule-based Chinese word segmentation [ https://blog.csdn.net/Charzous/article/details/108817914 ]
  2. Part-of-speech tagging: a Java implementation of HMM + Viterbi POS tagging [ https://blog.csdn.net/Charzous/article/details/109138830 ]

2. The CYK Algorithm

Among the many approaches to constituent parsing, this post uses statistical parsing with a probabilistic context-free grammar (PCFG) and one specific algorithm: CYK. As the name suggests, it was proposed jointly by Cocke, Younger and Kasami, and its core idea cleverly applies Viterbi-style dynamic programming; I really admire it! Let's see what this powerful algorithm looks like.

Given a sentence s and a probabilistic context-free grammar G = (T, N, S, R, P), define π(i, j, X) (i, j ∈ 1…n, X ∈ N) as the maximum probability of any constituent rooted at X spanning words i through j. The goal is to find the highest-probability tree among all trees in π(1, n, S).

  1. T is the set of terminal symbols
  2. N is the set of non-terminal symbols
  3. S is the start (initial) non-terminal
  4. R is the set of production rules
  5. P assigns each rule its statistical probability
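In Python, one simple way to hold all five components is a dict of dicts, similar to the `rules_prob` and `non_terminal` structures used later in this article. This is a minimal sketch with a hypothetical toy grammar, not the grammar used in the experiment:

```python
# A minimal sketch of the quintuple G = (T, N, S, R, P) as Python data.
# R and P live together in one dict-of-dicts: rules_prob[lhs][rhs] = probability.
# Binary rules use a tuple RHS, unary rules a non-terminal string,
# and lexical rules a word string (toy grammar, for illustration only).
rules_prob = {
    "S":  {("NP", "VP"): 1.0},
    "VP": {("V", "NP"): 0.5, "V": 0.5},
    "NP": {("NP", "NP"): 0.3, "N": 0.7},
    "N":  {"fish": 0.6, "people": 0.4},
    "V":  {"fish": 1.0},
}
non_terminal = set(rules_prob)                       # N
start_symbol = "S"                                   # S
terminal = {rhs for rhs_dict in rules_prob.values()  # T: the word strings
            for rhs in rhs_dict
            if isinstance(rhs, str) and rhs not in non_terminal}

# Sanity check: the probabilities of all rules sharing a left-hand side sum to 1.
for lhs, rhs_dict in rules_prob.items():
    assert abs(sum(rhs_dict.values()) - 1.0) < 1e-9
```
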

The following pseudocode, which I compiled from the algorithm's idea, should make it easier to understand:

function CKY(words, grammar):
    // initialization
    score = new double[#(words)+1][#(words)+1][#(nonterms)]
    back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
    // fill the leaf cells
    for i = 0; i < #(words); i++
        for A in nonterms
            if A -> words[i] in grammar
                score[i][i+1][A] = P(A -> words[i])
        // handle unary rules
        boolean added = true
        while added
            added = false
            // newly derivable non-terminals must be added
            for A, B in nonterms
                if score[i][i+1][B] > 0 && A -> B in grammar
                    prob = P(A -> B) * score[i][i+1][B]
                    if prob > score[i][i+1][A]
                        score[i][i+1][A] = prob
                        back[i][i+1][A] = B
                        added = true
    // fill the non-leaf cells bottom-up
    for span = 2 to #(words)
        for begin = 0 to #(words) - span   // number of cells in this layer
            end = begin + span
            for split = begin+1 to end-1
                for A, B, C in nonterms
                    prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
                    // compute every split's probability; keep the maximum-probability path
                    if prob > score[begin][end][A]
                        score[begin][end][A] = prob
                        back[begin][end][A] = new Triple(split, B, C)
            // handle unary rules
            boolean added = true
            while added
                added = false
                for A, B in nonterms
                    prob = P(A -> B) * score[begin][end][B]
                    if prob > score[begin][end][A]
                        score[begin][end][A] = prob
                        back[begin][end][A] = B
                        added = true
    // return the best-path tree
    return buildTree(score, back)

The score array stores the maximum probabilities, and back stores the split-point information for backtracking. In the concrete implementation below, a dedicated data structure stores this information.

score[0][0]  score[0][1]  score[0][2]  score[0][3]
             score[1][1]  score[1][2]  score[1][3]
                          score[2][2]  score[2][3]
                                       score[3][3]

The information is stored in an upper-triangular matrix: each word sits on the diagonal as a leaf node of the tree. Dynamic programming fills in the table cell by cell until the top-right corner is reached, at which point every node of the whole tree has been computed.
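The fill order just described can be sketched in a few lines. Indices here follow the pseudocode's half-open convention, where cell (i, j) covers words i through j-1:

```python
# Sketch of the bottom-up fill order for a 4-word sentence such as
# "fish people fish tanks": spans of length 1 (the diagonal) first,
# then increasingly long spans, ending with the cell covering all words.
n = 4
order = [(begin, begin + span)
         for span in range(1, n + 1)
         for begin in range(n - span + 1)]
print(order[:4])  # [(0, 1), (1, 2), (2, 3), (3, 4)]  -- the leaf cells
print(order[-1])  # (0, 4)  -- the top cell spanning the whole sentence
```
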

3. Python Implementation: the Core CYK Algorithm

3.1 Data structure selection and initialization

Taking advantage of Python, a combination of dictionaries and lists stores both the probabilities and the path information.

    word_list = sentence.split()
    best_path = [[{} for _ in range(len(word_list))] for _ in range(len(word_list))]

    # initialization
    for i in range(len(word_list)):  # indices start at 0
        for j in range(len(word_list)):
            for x in non_terminal:  # give every non-terminal probability 0.0 and an empty path, avoiding missing keys and None errors later
                best_path[i][j][x] = {'prob': 0.0, 'path': {'split': None, 'rule': None}}
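One design note on the nested comprehension above: it guarantees that every cell gets its own independent dictionary. A quick sanity check, using a hypothetical two-word input:

```python
# Each cell must be its own dict; writing `[{}] * n` instead of a
# comprehension would alias a single dict across a whole row.
word_list = "fish people".split()
best_path = [[{} for _ in range(len(word_list))]
             for _ in range(len(word_list))]
best_path[0][0]["NP"] = {"prob": 0.14, "path": None}
print("NP" in best_path[0][1])  # False: the neighbouring cell is untouched
```
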

3.2 Filling the leaf nodes

First, a word about the form of the grammar rules, for example VP → VP PP, S → Aux NP VP, NP → astronomers. Each rule has exactly one non-terminal (a part-of-speech or phrase label) on the left, producing one or more non-terminals or terminals (words) on the right. To let the algorithm treat all rules uniformly, the grammar must be normalized in some way, which leads to Chomsky Normal Form (CNF).

A context-free grammar is in CNF if every production has the form A -> B C or A -> a; that is, the right-hand side is either exactly two non-terminals or a single terminal. The experimental data provides the grammar already in CNF, which simplifies the computation.
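A small helper can verify that a grammar in the dict-of-dicts encoding used here is in CNF. This is a hypothetical convenience check, not part of the original program; with `allow_unary=True` it also accepts the A -> B rules that the extended CYK algorithm in this article handles:

```python
def is_cnf(rules_prob, allow_unary=False):
    """Check every rule is A -> B C (tuple of two non-terminals) or
    A -> word (a string outside the non-terminal set). With
    allow_unary=True, unary rules A -> B are accepted as well."""
    non_terminal = set(rules_prob)
    for rhs_dict in rules_prob.values():
        for rhs in rhs_dict:
            if isinstance(rhs, tuple):
                # binary rule: must be exactly two non-terminals
                if len(rhs) != 2 or any(s not in non_terminal for s in rhs):
                    return False
            elif rhs in non_terminal and not allow_unary:
                return False  # unary rule A -> B
            # otherwise: lexical rule A -> word, always allowed
    return True

toy = {"S":  {("NP", "VP"): 1.0},
       "NP": {"N": 0.7, ("NP", "NP"): 0.3},  # contains the unary rule NP -> N
       "VP": {"runs": 1.0},
       "N":  {"fish": 1.0}}
print(is_cnf(toy))                    # False: NP -> N is unary
print(is_cnf(toy, allow_unary=True))  # True
```
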

The key part is first to apply the ① non-terminal → word lexical rules, and then to scan the rule set a second time and add the "new" ② non-terminal → ① non-terminal unary rules into each leaf cell's grammar set.

    # fill the leaf cells: compute the probability of every grammar symbol for each word
    for i in range(len(word_list)):  # indices start at 0
        for x in non_terminal:  # find and score the non-terminal -> word rule for this word
            if word_list[i] in rules_prob[x].keys():
                best_path[i][i][x]['prob'] = rules_prob[x][word_list[i]]  # save the probability
                best_path[i][i][x]['path'] = {'split': None, 'rule': word_list[i]}  # save the path
                # newly derivable unary rules must be added on top
                for y in non_terminal:
                    if x in rules_prob[y].keys():
                        best_path[i][i][y]['prob'] = rules_prob[x][word_list[i]] * rules_prob[y][x]
                        best_path[i][i][y]['path'] = {'split': i, 'rule': x}
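As a hand check of this step, using rule probabilities consistent with the trace in section 4 (V -> fish 0.6, N -> fish 0.2, and the unary rules NP -> N 0.7, VP -> V 0.1), the leaf cell for the word "fish" works out to:

```python
# Leaf cell for "fish": lexical rules first, then unary rules on top.
p_v_fish, p_n_fish = 0.6, 0.2  # V -> fish, N -> fish
p_np_n, p_vp_v = 0.7, 0.1      # unary rules NP -> N, VP -> V
cell = {
    "V":  p_v_fish,
    "N":  p_n_fish,
    "NP": p_np_n * p_n_fish,   # 0.14: unary rule applied over the lexical N
    "VP": p_vp_v * p_v_fish,   # 0.06: unary rule applied over the lexical V
}
print({k: round(v, 2) for k, v in cell.items()})
# {'V': 0.6, 'N': 0.2, 'NP': 0.14, 'VP': 0.06}
```
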

3.3 Filling the non-leaf nodes

This is the core of the CYK algorithm: filling the non-leaf nodes. The comments explain what each step does.

for l in range(1, len(word_list)):
    # number of cells in this layer
    for i in range(len(word_list) - l):  # layer 1 starts at i = 0, 1, 2
        j = i + l  # layer-2 cells: (0, j=1), (1, 2), (2, 3)
        for x in non_terminal:  # take each non-terminal
            tmp_best_x = {'prob': 0, 'path': None}

            for key, value in rules_prob[x].items():  # iterate over all rules for this non-terminal
                if key[0] not in non_terminal:
                    continue  # skip lexical rules; only rules producing non-terminals combine sub-spans
                # compute the probability of every split point, keeping the maximum
                for s in range(i, j):  # the first cell pair splits only once: (0,0) with (1,1)
                    if len(key) == 2:
                        tmp_prob = value * best_path[i][s][key[0]]['prob'] * best_path[s + 1][j][key[1]]['prob']
                    else:
                        tmp_prob = value * best_path[i][s][key[0]]['prob'] * 0  # unary rule: key[1] does not exist, so its contribution is zeroed
                    if tmp_prob > tmp_best_x['prob']:
                        tmp_best_x['prob'] = tmp_prob
                        tmp_best_x['path'] = {'split': s, 'rule': key}  # save the split point and the rule used
            best_path[i][j][x] = tmp_best_x  # the best rule and split for this cell

        # print("score[", i, "][", j, "]:", best_path[i][j])

The extended CYK algorithm also needs to deal with unary grammar rules, so a judgment statement (len(key) == 2) avoids an index-out-of-bounds error when a unary rule is evaluated.

for s in range(i, j):  # the first cell pair splits only once: (0,0) with (1,1)
    if len(key) == 2:
        tmp_prob = value * best_path[i][s][key[0]]['prob'] * best_path[s + 1][j][key[1]]['prob']
    else:
        tmp_prob = value * best_path[i][s][key[0]]['prob'] * 0  # unary rule: key[1] does not exist

After this step, every cell in the upper triangle holds its maximum-probability grammar rule and split point, ready for the path backtracking that produces the syntax tree.
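One cell can be verified by hand against the trace in section 4: the VP spanning "fish people" comes from VP -> V NP (probability 0.5 in that grammar), splitting after the first word:

```python
# score[0][1]['VP'] = P(VP -> V NP) * P(V over "fish") * P(NP over "people")
p_rule = 0.5    # VP -> V NP
p_left = 0.6    # V covering "fish"
p_right = 0.35  # NP covering "people": 0.5 (N -> people) * 0.7 (NP -> N)
prob = p_rule * p_left * p_right
assert abs(prob - 0.105) < 1e-9  # matches score[0][1]['VP'] in the trace
```
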

3.4 Backtracking to build the syntax tree

This step took a lot of debugging time: traversal kept reaching empty tree nodes, a clear sign that a boundary case was mishandled. My first attempt traversed the tree in pre-order and built the syntax tree recursively:

# backtrack along the saved paths, pre-order traversal of the tree
def back(best_path, left, right, root, ind=0):
    node = best_path[left][right][root]
    if node['path']['split'] is not None:  # a split point exists (its value is an index)
        print('\t' * ind, (root,))
        # recursive calls
        back(best_path, left, node['path']['split'], node['path']['rule'][0], ind + 1)  # left subtree
        back(best_path, node['path']['split'] + 1, right, node['path']['rule'][1], ind + 1)  # right subtree
    else:
        print('\t' * ind, (root,))
        print('--->', node['path']['rule'])

Running it produced the error: TypeError: 'NoneType' object is not subscriptable

After checking for a long time, I found that the recursion was visiting nodes that do not exist: a unary rule has no right child, yet the code recursed into one anyway. The fixed program:

def back(best_path, left, right, root, ind=0):
    node = best_path[left][right][root]
    if node['path']['split'] is not None:  # a split point exists (its value is an index)
        print('\t' * ind, (root, node['prob']))
        # recursive calls
        if len(node['path']['rule']) == 2:  # binary rule, e.g. NP --> NP NP: recurse into left and right subtrees
            back(best_path, left, node['path']['split'], node['path']['rule'][0], ind + 1)  # left subtree
            back(best_path, node['path']['split'] + 1, right, node['path']['rule'][1], ind + 1)  # right subtree
        else:  # unary rule, e.g. NP --> N: recurse into the single child only
            back(best_path, left, node['path']['split'], node['path']['rule'][0], ind + 1)
    else:
        print('\t' * ind, (root, node['prob']))
        print('--->', node['path']['rule'])

4. Walkthrough of a Detailed Example

Given the following PCFG, find the most likely syntax tree for the sentence "fish people fish tanks", and print the final tree in string or tree form.
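The grammar table appeared as an image in the original post and is not reproduced here; the rules below are reconstructed from the probabilities in the trace that follows (only the rules this example actually exercises), in the same dict-of-dicts encoding used by the code above, so treat this as an inference rather than the original table:

```python
# Reconstructed PCFG for "fish people fish tanks"; probabilities are read
# back from the chart trace, e.g. NP over "fish" = 0.7 * 0.2 = 0.14.
rules_prob = {
    "S":  {("NP", "VP"): 0.9},
    "VP": {("V", "NP"): 0.5, "V": 0.1},
    "NP": {("NP", "NP"): 0.1, "N": 0.7},
    "N":  {"fish": 0.2, "people": 0.5, "tanks": 0.2},
    "V":  {"fish": 0.6, "people": 0.1, "tanks": 0.3},
}
# Spot-check two leaf probabilities against the trace below:
assert abs(rules_prob["NP"]["N"] * rules_prob["N"]["fish"] - 0.14) < 1e-9
assert abs(rules_prob["VP"]["V"] * rules_prob["V"]["fish"] - 0.06) < 1e-9
```
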

At the first layer, the leaf-node computation yields the grammar probabilities and split points for each word.

fish ---> {'V': {'prob': 0.6, 'path': {'split': None, 'rule': 'fish'}}, 'N': {'prob': 0.2, 'path': {'split': None, 'rule': 'fish'}}, 'NP': {'prob': 0.13999999999999999, 'path': {'split': 0, 'rule': 'N'}}, 'VP': {'prob': 0.06, 'path': {'split': 0, 'rule': 'V'}}}

people ---> {'V': {'prob': 0.1, 'path': {'split': None, 'rule': 'people'}}, 'N': {'prob': 0.5, 'path': {'split': None, 'rule': 'people'}}, 'NP': {'prob': 0.35, 'path': {'split': 1, 'rule': 'N'}}, 'VP': {'prob': 0.010000000000000002, 'path': {'split': 1, 'rule': 'V'}}}

fish ---> {'V': {'prob': 0.6, 'path': {'split': None, 'rule': 'fish'}}, 'N': {'prob': 0.2, 'path': {'split': None, 'rule': 'fish'}}, 'NP': {'prob': 0.13999999999999999, 'path': {'split': 2, 'rule': 'N'}}, 'VP': {'prob': 0.06, 'path': {'split': 2, 'rule': 'V'}}}

tanks ---> {'V': {'prob': 0.3, 'path': {'split': None, 'rule': 'tanks'}}, 'N': {'prob': 0.2, 'path': {'split': None, 'rule': 'tanks'}}, 'NP': {'prob': 0.13999999999999999, 'path': {'split': 3, 'rule': 'N'}}, 'VP': {'prob': 0.03, 'path': {'split': 3, 'rule': 'V'}}}

The original post displays these results intuitively as a figure (not reproduced here).

At the non-leaf layers, the CYK algorithm computes the cells bottom-up, saving each rule's maximum probability and its split point.

score[ 0 ][ 1 ]: { 'NP': {'prob': 0.004899999999999999, 'path': {'split': 0, 'rule': ('NP', 'NP')}}, 'S': {'prob': 0.0012600000000000003, 'path': {'split': 0, 'rule': ('NP', 'VP')}}, 'VP': {'prob': 0.105, 'path': {'split': 0, 'rule': ('V', 'NP')}}}

score[ 1 ][ 2 ]: {'NP': {'prob': 0.004899999999999999, 'path': {'split': 1, 'rule': ('NP', 'NP')}}, 'S': {'prob': 0.0189, 'path': {'split': 1, 'rule': ('NP', 'VP')}}, 'VP': {'prob': 0.006999999999999999, 'path': {'split': 1, 'rule': ('V', 'NP')}}}

score[ 2 ][ 3 ]: {'NP': {'prob': 0.0019599999999999995, 'path': {'split': 2, 'rule': ('NP', 'NP')}}, 'S': {'prob': 0.00378, 'path': {'split': 2, 'rule': ('NP', 'VP')}}, 'VP': {'prob': 0.041999999999999996, 'path': {'split': 2, 'rule': ('V', 'NP')}}}

score[ 0 ][ 2 ]: {'NP': {'prob': 6.859999999999997e-05, 'path': {'split': 0, 'rule': ('NP', 'NP')}}, 'S': {'prob': 0.0008819999999999999, 'path': {'split': 0, 'rule': ('NP', 'VP')}},'VP': {'prob': 0.0014699999999999997, 'path': {'split': 0, 'rule': ('V', 'NP')}}}

score[ 1 ][ 3 ]: {'NP': {'prob': 6.859999999999997e-05, 'path': {'split': 1, 'rule': ('NP', 'NP')}}, 'S': {'prob': 0.013229999999999999, 'path': {'split': 1, 'rule': ('NP', 'VP')}},'VP': {'prob': 9.799999999999998e-05, 'path': {'split': 1, 'rule': ('V', 'NP')}}}

score[ 0 ][ 3 ]: {'NP': {'prob': 9.603999999999995e-07, 'path': {'split': 0, 'rule': ('NP', 'NP')}}, 'S': {'prob': 0.00018521999999999994, 'path': {'split': 1, 'rule': ('NP', 'VP')}}, 'V': {'prob': 0, 'path': None}, 'VP': {'prob': 2.0579999999999993e-05, 'path': {'split': 0, 'rule': ('V', 'NP')}}}

 

Starting from the start symbol S at the root, follow the previously saved paths to find the highest-probability syntax tree. The original post illustrates the backtracking process with a figure (not reproduced here).

Looking back at the actual data storage structure: the dictionary saves each node's path, including the rule and the split point needed for backtracking, which makes the operation straightforward to implement in the program.

5. Experimental Results: the Syntax Tree

The result agrees with the hand-worked calculation. The resulting syntax tree is drawn as a figure in the original post (not reproduced here).
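Since the figures from the original post are not reproduced here, the pieces above can be assembled into a condensed, runnable sketch. It is a rewrite under the same design, not the author's exact program, and uses the grammar reconstructed in section 4; it recomputes the root probability and prints the tree as a nested tuple:

```python
# Condensed end-to-end sketch: reconstructed grammar, leaf fill,
# bottom-up non-leaf fill, and backtracking into a tree. Binary rules
# are tuple RHSs, unary rules non-terminal strings, lexical rules words.

def cyk(sentence, rules_prob):
    words = sentence.split()
    n = len(words)
    nonterms = list(rules_prob)
    chart = [[{x: {'prob': 0.0, 'path': None} for x in nonterms}
              for _ in range(n)] for _ in range(n)]
    # Leaf cells: lexical rules, then unary rules layered on top.
    for i, w in enumerate(words):
        for x in nonterms:
            if w in rules_prob[x]:
                chart[i][i][x] = {'prob': rules_prob[x][w],
                                  'path': {'split': None, 'rule': w}}
        for y in nonterms:
            for x in nonterms:
                if x in rules_prob[y]:  # unary rule y -> x
                    prob = rules_prob[y][x] * chart[i][i][x]['prob']
                    if prob > chart[i][i][y]['prob']:
                        chart[i][i][y] = {'prob': prob,
                                          'path': {'split': i, 'rule': x}}
    # Non-leaf cells, bottom-up by span length.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            for x in nonterms:
                best = {'prob': 0.0, 'path': None}
                for rhs, p in rules_prob[x].items():
                    if not isinstance(rhs, tuple):
                        continue  # only binary rules combine two sub-spans
                    for s in range(i, j):
                        prob = (p * chart[i][s][rhs[0]]['prob']
                                  * chart[s + 1][j][rhs[1]]['prob'])
                        if prob > best['prob']:
                            best = {'prob': prob,
                                    'path': {'split': s, 'rule': rhs}}
                chart[i][j][x] = best
    return chart

def build_tree(chart, left, right, root):
    path = chart[left][right][root]['path']
    if path['split'] is None:               # lexical rule: a leaf
        return (root, path['rule'])
    rule = path['rule']
    if isinstance(rule, tuple):             # binary rule: two children
        return (root,
                build_tree(chart, left, path['split'], rule[0]),
                build_tree(chart, path['split'] + 1, right, rule[1]))
    return (root, build_tree(chart, left, path['split'], rule))  # unary

rules_prob = {
    "S":  {("NP", "VP"): 0.9},
    "VP": {("V", "NP"): 0.5, "V": 0.1},
    "NP": {("NP", "NP"): 0.1, "N": 0.7},
    "N":  {"fish": 0.2, "people": 0.5, "tanks": 0.2},
    "V":  {"fish": 0.6, "people": 0.1, "tanks": 0.3},
}
chart = cyk("fish people fish tanks", rules_prob)
print(round(chart[0][3]["S"]["prob"], 8))  # 0.00018522, matching the trace
print(build_tree(chart, 0, 3, "S"))
```

Note that this sketch distinguishes binary from unary rules with `isinstance(rule, tuple)` rather than `len(rule) == 2`; the length test can misfire if a unary rule happens to be stored as a two-character string such as 'NP'.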

6. Analysis and Summary

This article implemented statistical syntactic parsing based on a probabilistic context-free grammar (PCFG) using the CYK algorithm. It recorded, step by step, a Python implementation of PCFG parsing with CYK, explained the ideas behind statistical parsing through the core algorithm, and showed the concrete code for obtaining the syntax tree of a sentence.

Given the PCFG grammar rules, the parser analyzes a concrete sentence and produces its most likely syntax tree. The first step of the implementation is choosing suitable data structures to hold the grammar rules and probabilities, the non-terminals and the terminals; that is why a dictionary combined with lists stores the data. The second step is the concrete realization of the core CYK algorithm, i.e. the operations and calculations over that data structure; for this task, unary rules must be handled, so the extended CYK algorithm is used. In the third step, after CYK has found the best paths, the final syntax tree is output by backtracking through the saved split points.

Many problems came up while completing the core CYK part. The main error-prone spots were: adding the leaf nodes' grammar rules to the dictionary, recording the maximum-probability rules for the non-leaf nodes, saving the different split points, and outputting the tree structure by backtracking. Most of the trouble was boundary handling: overflow, out-of-bounds indices and None values all surfaced during the tree backtracking. To solve these problems, setting breakpoints at the important statements and single-stepping through the debugger proved a very important and effective way to locate faults. I kept debugging, observing the program's execution, fixing bugs and tidying the algorithm's structure until the result came out correct. Implementing this as a program took a long time, but the debugging process gave me a much deeper understanding of the ideas behind the whole CYK algorithm.


My blog: https://blog.csdn.net/Charzous/article/details/109671138
