Part-of-speech tagging with the Viterbi algorithm (with Python code)

The training set used in this paper:

Link: https://pan.baidu.com/s/1fK--_PrEhckKUSrajHGkEw
Extraction code: aehg 

Contents

Current research on part-of-speech tagging:

1. What is the Viterbi algorithm?

2. Training set introduction

3. Use the Viterbi algorithm for part-of-speech tagging (code)

Summary


Current research on part-of-speech tagging:

Since the Brown Corpus was built in the 1960s, automatic part-of-speech tagging has developed rapidly, and many tagging techniques now exist both in China and abroad. Based on the theories and methods they rely on, these techniques can be divided into the following categories:

1. Rule-based part-of-speech tagging method

Rule-based methods rely on a database of manually written disambiguation rules that state the conditions under which each ambiguity is resolved. When an ambiguous word is encountered during tagging, its candidate readings are matched in turn against the rules in the disambiguation rule base. If a rule matches and no other rule is violated, disambiguation succeeds; otherwise it fails.

2. Statistical part-of-speech tagging method

Statistical part-of-speech tagging requires a pre-tagged corpus as training data; the probability that a given word takes a certain part of speech is estimated from the word's context. Hidden Markov models are a typical example.

3. Comprehensive algorithm based on the combination of rules and statistics

This approach combines rule-based and statistical tagging. Like a rule-based tagger, it uses rules to decide which tag an ambiguous word should receive; like a statistical tagger, its rules can be inferred automatically from a previously tagged training corpus. It thereby exploits the strengths of both: existing linguistic rules can be used directly, while the statistical component acquires language knowledge that is more objective and has broader coverage.

4. Transformation-based error-driven learning

Because obtaining linguistic rules manually is very difficult, Eric Brill proposed transformation-based error-driven learning in 1995 as a replacement for hand-written rules, allowing the system to derive transformation rules automatically from the training corpus. Rules obtained this way have higher accuracy and wider applicability. The method learns large numbers of rules from a large part-of-speech-tagged corpus via templates in an error-driven manner, then applies those rules to tag new text. It also resolves conflicts among rules and saves the human effort of writing rules by hand.


1. What is the Viterbi algorithm?

Because parts of speech are subdivided into dozens of categories and every word in a sentence must be tagged, exhaustively enumerating all tag sequences is far too slow. To find the optimal tagging of a sentence (the tag sequence with the highest probability), the Viterbi algorithm is much more efficient: its purpose is to find the optimal path with far less computation.
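As a minimal illustration of the idea (all numbers here are made up for the example, not taken from the article's training set), the Viterbi recurrence can be run on a toy two-state HMM:

```python
import numpy as np

# Toy two-state HMM; all probabilities are illustrative only
states = ['Noun', 'Verb']
pi = np.array([0.6, 0.4])            # initial tag probabilities
trans = np.array([[0.3, 0.7],        # trans[k][j]: P(next tag j | current tag k)
                  [0.8, 0.2]])
emit = np.array([[0.5, 0.4, 0.1],    # emit[j][w]: P(word w | tag j), words w0..w2
                 [0.1, 0.3, 0.6]])

obs = [0, 2, 1]                      # observed word ids
T, N = len(obs), len(states)
dp = np.zeros((T, N))                # dp[t][j]: best log-prob of a path ending in tag j
ptr = np.zeros((T, N), dtype=int)    # back-pointers for path recovery

dp[0] = np.log(pi) + np.log(emit[:, obs[0]])
for t in range(1, T):
    for j in range(N):
        scores = dp[t-1] + np.log(trans[:, j]) + np.log(emit[j, obs[t]])
        ptr[t][j] = np.argmax(scores)
        dp[t][j] = scores[ptr[t][j]]

# Trace the back-pointers from the best final tag
path = [int(np.argmax(dp[T-1]))]
for t in range(T-1, 0, -1):
    path.insert(0, int(ptr[t][path[0]]))
print([states[s] for s in path])  # ['Noun', 'Verb', 'Noun']
```

Instead of scoring all N^T tag sequences, the dynamic program only keeps the best score per (position, tag) pair, giving O(T·N²) work.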

The details of the algorithm are somewhat involved; here are two links for further study:

Popular explanation of Viterbi algorithm: https://zhuanlan.zhihu.com/p/28274845

How to explain viterbi algorithm popularly: https://www.zhihu.com/question/20136144

2. Training set introduction

This walkthrough uses the part-of-speech tagging training set word-cixing.txt, which contains many tagged words.

Link: https://pan.baidu.com/s/1fK--_PrEhckKUSrajHGkEw
Extraction code: aehg 

Training set preview:


In each line, the text to the left of '/' is the word and the text to the right is its part of speech.
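For instance, a line in this word/tag format can be parsed as follows (the sample lines are made up for illustration; the real file has its own vocabulary and tag set):

```python
# Hypothetical lines in the same word/tag format as word-cixing.txt
sample = ['The/DT\n', 'dog/NN\n', 'barks/VBZ\n', './.\n']

pairs = []
for line in sample:
    items = line.split('/')
    word, tag = items[0], items[1].rstrip()  # rstrip() drops the trailing newline
    pairs.append((word, tag))
print(pairs)
```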

3. Use the Viterbi algorithm for part-of-speech tagging (code)

Read the training set, build the word dictionary and the tag set, and estimate the model parameters:

The model has three parameters: the initial tag distribution pi, the emission matrix A (tag → word), and the tag transition matrix B. The custom log function prevents taking the logarithm of 0 when a probability in a matrix is zero.

import numpy as np

tag_id, id_tag = {}, {}    # tag -> index and index -> tag
word_id, id_word = {}, {}  # word -> index and index -> word

# First pass: build the vocabulary and the tag set
for line in open('word-cixing.txt'):
    items = line.split('/')
    word, tag = items[0], items[1].rstrip()  # extract the word and its tag
    if word not in word_id:
        word_id[word] = len(word_id)
        id_word[len(id_word)] = word
    if tag not in tag_id:
        tag_id[tag] = len(tag_id)
        id_tag[len(id_tag)] = tag

M = len(word_id)  # vocabulary size
N = len(tag_id)   # number of distinct tags

pi = np.zeros(N)      # pi[j]: probability that a sentence starts with tag j
A = np.zeros((N, M))  # A[j][w]: probability of word w given tag j (emission)
B = np.zeros((N, N))  # B[k][j]: probability of tag j following tag k (transition)

# Second pass: count occurrences; a '.' token ends a sentence
pre_tag = ""
for line in open('word-cixing.txt'):
    items = line.split('/')
    wordId, tagId = word_id[items[0]], tag_id[items[1].rstrip()]
    if pre_tag == '':
        pi[tagId] += 1          # first word of a sentence
        A[tagId][wordId] += 1
    else:
        A[tagId][wordId] += 1
        B[tag_id[pre_tag]][tagId] += 1
    if items[0] == '.':
        pre_tag = ''            # sentence boundary: reset the previous tag
    else:
        pre_tag = items[1].rstrip()

# Normalize the counts into probabilities
pi = pi / sum(pi)
for i in range(N):
    A[i] /= sum(A[i])
    B[i] /= sum(B[i])

def log(v):
    # Floor zero probabilities so np.log never returns -inf
    if v == 0:
        return np.log(0.00001)
    else:
        return np.log(v)
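The floor value 0.00001 in log is arbitrary; any small positive constant works. Its effect can be checked in isolation:

```python
import numpy as np

def log(v):
    # Same guard as above: floor zeros so np.log never returns -inf
    return np.log(0.00001) if v == 0 else np.log(v)

print(log(0))    # about -11.51, instead of -inf
print(log(1.0))  # 0.0
```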

Define the Viterbi algorithm:

def viterbi(x, pi, A, B):
    """Return the most likely tag sequence for a space-separated sentence x."""
    x = [word_id[word] for word in x.split(' ')]  # every word must appear in the training set
    T = len(x)
    dp = np.zeros((T, N))              # dp[i][j]: best log-prob of tagging x[:i+1] ending in tag j
    ptr = np.zeros((T, N), dtype=int)  # back-pointers for recovering the best path
    for j in range(N):
        dp[0][j] = log(pi[j]) + log(A[j][x[0]])
    for i in range(1, T):
        for j in range(N):
            dp[i][j] = -np.inf         # worse than any real score
            for k in range(N):         # try every previous tag k for current tag j
                score = dp[i-1][k] + log(B[k][j]) + log(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k
    # Trace the back-pointers from the best final tag
    best_seq = [0] * T
    best_seq[T-1] = np.argmax(dp[T-1])
    for i in range(T-2, -1, -1):
        best_seq[i] = ptr[i+1][best_seq[i+1]]
    result = [id_tag[best_seq[i]] for i in range(len(best_seq))]
    return result
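One caveat: viterbi raises a KeyError for any word absent from word_id. A possible guard (hypothetical, not part of the original code) is to substitute a fallback id for unknown words before the lookup:

```python
# Hypothetical out-of-vocabulary guard; the word ids here are made up.
def to_ids(sentence, word_id, fallback_id=0):
    # Unknown words fall back to fallback_id instead of raising KeyError
    return [word_id.get(w, fallback_id) for w in sentence.split(' ')]

vocab = {'the': 0, 'dog': 1, 'runs': 2}
print(to_ids('the cat runs', vocab))  # 'cat' is unknown -> falls back to 0
```

A fallback id is a crude stand-in; a proper fix would reserve a dedicated unknown-word column in the emission matrix A during training.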

Run the tagger on custom text:

x = 'The best feeling in the world is when you know your heart is smiling and never put off what you can do today until tomorrow'
result = viterbi(x, pi, A, B)
print("Sentence:")
print(x)
x = x.split()
result_data = dict(zip(x, result))
print("Tags:")
print(result)
print('\n')
print("Word-to-tag mapping:")
print(result_data)

Part-of-speech tagging results: 


Summary

The code is adapted from other authors' work online; I worked through the method myself and am sharing the dataset and code here.

Origin blog.csdn.net/weixin_50706330/article/details/127428301