NLP Learning (3): Statistical Word Segmentation - Chinese Word Segmentation Based on the HMM Algorithm (Python 3 Implementation)

Hidden Markov Model (HMM)

Model introduction
An HMM is defined by a five-tuple:
StatusSet: state value set
ObservedSet: observation value set
TransProbMatrix: transition probability matrix
EmitProbMatrix: emission probability matrix
InitStatus: initial state distribution
Applying an HMM to word segmentation, the problem to solve is: given the parameters (ObservedSet, TransProbMatrix, EmitProbMatrix, InitStatus) and an observation sequence, find the most likely state value sequence. The best-known method for solving this problem is the Viterbi algorithm.
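For intuition, here is a hand-made example (my own, not taken from the article's corpus): the observation sequence is the characters of a sentence, and the hidden states to recover are their BMES tags, which in turn determine the segmentation.

obs    = ['我', '爱', '北', '京']  # observations: the characters of the sentence
states = ['S', 'S', 'B', 'E']      # hidden states: BMES tags, giving the segmentation 我 / 爱 / 北京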

Parameter introduction

1. StatusSet: the set of state values is (B, M, E, S): {B: begin, M: middle, E: end, S: single}. Each state marks the position of a character within a word: B means the character begins a word, M that it is in the middle of a word, E that it ends a word, and S that it forms a word on its own (see the small sketch after this list).
2. ObservedSet: the set of observation values is the set of all Chinese characters, including punctuation marks.
3. TransProbMatrix: the state transition probability matrix gives the probability of moving from state X to state Y; it is a 4×4 matrix, i.e. {B, E, M, S} × {B, E, M, S}.
4. EmitProbMatrix: each element of the emission probability matrix is a conditional probability P(Observed[i] | Status[j]).
5. InitStatus: the initial state probability distribution gives the probability that the first character of a sentence is in each of the four states {B, E, M, S}.
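A minimal sketch of the BMES labeling convention (it mirrors the getList function in the training script below; the example words are my own):

def bmes_tags(word):
    """Return the BMES tag sequence for one word (same logic as getList below)."""
    if len(word) == 1:
        return ['S']
    return ['B'] + ['M'] * (len(word) - 2) + ['E']

print(bmes_tags("我"))      # ['S']
print(bmes_tags("北京"))    # ['B', 'E']
print(bmes_tags("新华网"))  # ['B', 'M', 'E']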

Viterbi algorithm

The core idea of the Viterbi algorithm is to find the most probable state path by dynamic programming (equivalently, a shortest-path search over the state lattice). Following Michael Collins, define a dynamic programming table π(k, u, v):
π(k, u, v) = the maximum probability of a tag sequence ending in tags u, v at position k.
For any k ∈ {1…n}: π(k, u, v) = max over w of ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) ).
A complete treatment of the Viterbi algorithm is widely available online; this article focuses on the implementation.
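Note that Collins' definition above is for a second-order (trigram) tagger. The code in this article uses a first-order (bigram) HMM, so the table only needs to remember one previous state:

V(t, y) = max over y0 of ( V(t-1, y0) × TransProb[y0][y] × EmitProb[y][obs[t]] ), with V(0, y) = InitStatus[y] × EmitProb[y][obs[0]],

which is exactly the recurrence the viterbi function in the test script below computes.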

Model training

# -*- coding: utf-8 -*-

# Bigram hidden Markov model (Bigram HMM)
# 'trainCorpus.txt_utf8' is a manually segmented People's Daily corpus of over 290,000 sentences

import sys

# state_M = 4
# word_N = 0
A_dic = {}
B_dic = {}
Count_dic = {}
Pi_dic = {}
word_set = set()
state_list = ['B', 'M', 'E', 'S']
line_num = -1

INPUT_DATA = "trainCorpus.txt_utf8"
PROB_START = "prob_start.py"  # initial state probabilities
PROB_EMIT = "prob_emit.py"  # emission probabilities
PROB_TRANS = "prob_trans.py"  # transition probabilities


def init():  # initialize the count dictionaries
    # global state_M
    # global word_N
    for state in state_list:
        A_dic[state] = {}
        for state1 in state_list:
            A_dic[state][state1] = 0.0
    for state in state_list:
        Pi_dic[state] = 0.0
        B_dic[state] = {}
        Count_dic[state] = 0


def getList(input_str):  # input: one word; output: its BMES state sequence
    output_str = []
    if len(input_str) == 1:
        output_str.append('S')
    elif len(input_str) == 2:
        output_str = ['B', 'E']
    else:
        M_num = len(input_str) - 2
        M_list = ['M'] * M_num
        output_str.append('B')
        output_str.extend(M_list)  # append an 'M' for every interior character
        output_str.append('E')
    return output_str


def Output():  # write out the three model parameters: initial, transition and emission probabilities
    start_fp = open(PROB_START, mode='w',encoding="utf-8")
    emit_fp = open(PROB_EMIT, mode='w',encoding="utf-8")
    trans_fp = open(PROB_TRANS, mode='w',encoding="utf-8")
    print ("len(word_set) = %s " % (len(word_set)))

    for key in Pi_dic:  # initial state probabilities
        Pi_dic[key] = Pi_dic[key] * 1.0 / line_num
    print (Pi_dic,file=start_fp)

    for key in A_dic:  # state transition probabilities
        for key1 in A_dic[key]:
            A_dic[key][key1] = A_dic[key][key1] / Count_dic[key]
    print (A_dic,file=trans_fp)

    for key in B_dic:  # emission probabilities (conditional probability of a character given a state)
        for word in B_dic[key]:
            B_dic[key][word] = B_dic[key][word] / Count_dic[key]
    print (B_dic,file=emit_fp)

    start_fp.close()
    emit_fp.close()
    trans_fp.close()


def main():
    ifp = open(INPUT_DATA,'r',encoding="UTF-8")
    init()
    global word_set  # initially set()
    global line_num  # initially -1
    for line in ifp:
        line_num += 1
        if line_num % 10000 == 0:
            print (line_num)

        line = line.strip()
        if not line: continue
        #line = line.encode("utf-8", "ignore")  # 'ignore' would silently drop invalid characters (Python 2 leftover)

        word_list = []
        for i in range(len(line)):
            if line[i] == " ": continue
            word_list.append(line[i])
        word_set = word_set | set(word_list)  # set of all characters in the training corpus

        lineArr = line.split(" ")
        line_state = []
        for item in lineArr:
            line_state.extend(getList(item))  # one sentence maps to one continuous state sequence
        if len(word_list) != len(line_state):
            print("[line_num = %d][line = %s]" % (line_num, line), file=sys.stderr)
        else:
            for i in range(len(line_state)):
                if i == 0:
                    Pi_dic[line_state[0]] += 1  # Pi_dic counts the state of each sentence's first character, for the initial state probabilities
                    Count_dic[line_state[0]] += 1  # count the occurrences of each state
                else:
                    A_dic[line_state[i - 1]][line_state[i]] += 1  # for the transition probabilities
                    Count_dic[line_state[i]] += 1
                    if word_list[i] not in B_dic[line_state[i]]:
                        B_dic[line_state[i]][word_list[i]] = 1.0  # first time this character is seen under this state
                    else:
                        B_dic[line_state[i]][word_list[i]] += 1  # for the emission probabilities
    Output()
    ifp.close()


if __name__ == "__main__":
    main()
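
After training, a quick way to inspect the learned parameters (my own addition, assuming the three output files were written to the current directory) is to load one of them back; each file contains a single Python dict literal:

# Inspect the trained transition probabilities.
# Assumes prob_trans.py was produced by the training script above.
with open("prob_trans.py", encoding="utf-8") as f:
    trans = eval(f.read())  # the file holds one Python dict literal

for prev_state, row in trans.items():
    print(prev_state, {s: round(p, 4) for s, p in row.items()})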

Testing the segmenter

# -*- coding: utf-8 -*-

def load_model(f_name):
    ifp = open(f_name, mode='r', encoding="utf-8")
    return eval(ifp.read())  # eval parses the string as a Python expression, i.e. the dict literal written during training


prob_start = load_model("prob_start.py")
prob_trans = load_model("prob_trans.py")
prob_emit = load_model("prob_emit.py")


def viterbi(obs, states, start_p, trans_p, emit_p):  # Viterbi algorithm (dynamic programming)
    V = [{}]
    path = {}
    for y in states:  # initialization
        V[0][y] = start_p[y] * emit_p[y].get(obs[0], 0)  # max probability of a state sequence ending in state y at position 0
        path[y] = [y]
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:  # recurrence: transition from state y0 to state y
            (prob, state) = max(
                [(V[t - 1][y0] * trans_p[y0].get(y, 0) * emit_p[y].get(obs[t], 0), y0) for y0 in states if
                 V[t - 1][y0] > 0])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath  # record the best state paths
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in states])  # max probability of a state sequence ending in state y at the last position
    return (prob, path[state])  # return the probability and the state sequence


def cut(sentence):
    prob, pos_list = viterbi(sentence, ('B', 'M', 'E', 'S'), prob_start, prob_trans, prob_emit)
    return (prob, pos_list)


if __name__ == "__main__":
    test_str = u"新华网驻东京记者报道"
    prob, pos_list = cut(test_str)
    print (test_str)
    print (pos_list)

Result:

新华网驻东京记者报道
['B', 'M', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E']
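
The script prints the BMES tag sequence rather than the segmented words. A small helper (my own addition, not part of the original script) turns the tags into words; applied to the output above it yields 新华网 / 驻 / 东京 / 记者 / 报道:

def tags_to_words(sentence, tags):
    """Group characters into words according to their BMES tags."""
    words, buf = [], ""
    for ch, tag in zip(sentence, tags):
        buf += ch
        if tag in ('E', 'S'):  # a word ends at E or S
            words.append(buf)
            buf = ""
    if buf:  # guard against a dangling B or M at the end
        words.append(buf)
    return words

# print(" / ".join(tags_to_words(test_str, pos_list)))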

Download link for the manually segmented training corpus (trainCorpus.txt_utf8):
https://pan.baidu.com/s/1geZkMif

Origin: blog.csdn.net/qq_30868737/article/details/107563511