Maximum Probability Word Segmentation in Python

Problem Description

  1. Read the given dictionary, take the input string to be segmented, "春节将至,欢乐的气氛已悄悄降临。" ("Spring Festival is approaching; the joyful atmosphere has quietly arrived."), and convert each word's probability in the dictionary into a cost;
  2. Look up candidate words in the dictionary and return them;
  3. Compute the cumulative costs and select the best predecessor for each candidate word;
  4. Output the intermediate calculations of the segmentation and the final segmentation result.


The Unigram Model of the Maximum Probability Method

Among all word strings that the string to be segmented can be divided into, the one with the highest probability is taken as the segmentation result. By Bayes' formula below, p(Z|W) and p(Z) are fixed for a given input string, so only p(W), the probability of the word string, needs to be maximized.
p(W|Z) = p(Z|W) · p(W) / p(Z)
An N-gram model can be used to compute the probability of a word string. With a 1-gram (unigram) model, all words that appear in the input string according to the vocabulary are found first, all possible segmentation paths are then enumerated, and the best path among them is the segmentation result.
Because the maximum probability method looks for the word string that is most likely given the input character string, rather than depending on the order in which words are segmented, a correct segmentation can be found as long as its words have appeared in the corpus and are in the vocabulary. Given a large amount of labeled corpus, the maximum probability method can therefore avoid crossing ambiguity and combination ambiguity to a certain extent.
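As a minimal illustration of the unigram assumption (the probabilities below are invented for illustration and are not taken from the experiment's dictionary), the probability of a word string is the product of its words' probabilities, and comparing negative-log costs gives the same ranking:

import numpy as np

# Hypothetical unigram probabilities, for illustration only
p = {'春节': 0.0012, '将至': 0.0004, '春': 0.0030, '节': 0.0025}

# Probability of a candidate segmentation = product of its word probabilities
seg1 = ['春节', '将至']
seg2 = ['春', '节', '将至']
prob1 = np.prod([p[w] for w in seg1])
prob2 = np.prod([p[w] for w in seg2])

# Costs (negative log probabilities) turn the product into a sum
cost1 = sum(-np.log(p[w]) for w in seg1)
cost2 = sum(-np.log(p[w]) for w in seg2)
print(prob1 > prob2, cost1 < cost2)   # both True: the higher-probability path has the lower cost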

1. Read the given dictionary

Read the vocabulary file provided with the experiment. It is a txt file in CSV format whose lines contain three fields: the word, its frequency count, and its frequency as a percentage. Reading this file provides the information needed to compute each word's cost.

2. Understanding and calculating the cost

Since the probability of each word is a very small positive number, the probability of a long Chinese word string becomes extremely small, close to 0, and cannot be represented. We therefore take the negative logarithm of each probability: multiplication becomes addition, and maximizing the probability becomes minimizing the sum. The negative log probability of a word is its cost, and the cost of a word string is the sum of the costs of its words. The frequencies in the file are used to compute these values: numpy's log function is applied to the percentage after converting it to a floating-point number. (The implementation below stores the log probabilities themselves and later maximizes the cumulative log probability, which is equivalent to minimizing the cost.) A word that does not appear in the dictionary is assigned a fixed cost, derived here from the fixed probability 0.000001.

The code is as follows:

import numpy as np


def cost():
    # Read the word frequency file and build a dictionary mapping each word
    # to its log probability (the percentage in the third column, as a float).
    dic = {}
    with open(file='WordFrequency.txt', mode='r', encoding='utf-8') as file:
        for line in file.readlines():
            item = line.split(',')
            fre = item[2].strip('\n').strip('%')
            dic[item[0]] = np.log(float(fre) / 100)
    return dic


str1 = '春节将至,欢乐的气氛已悄悄降临。'    # the sentence to be segmented
dic = cost()                                # word -> log probability, shared by the functions below
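As a quick check of what cost() produces (the word below is assumed to be present in the dictionary, and the exact value depends on WordFrequency.txt), every entry in dic is a log probability and therefore a negative float:

print(dic['春节'])    # about -6.91 if its frequency were 0.1% (purely illustrative)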

3. Select candidate words

When selecting candidate words, read the characters of the sentence one by one. Every single character is added to the candidate word list; if a character does not appear in the dictionary, it is still added as a single-character word with a fixed cost. The word is then extended with the following characters, up to a fixed length, and every extension that appears in the dictionary is also added to the candidate list together with its start and end positions. The next step traverses this candidate list to find each word's predecessors and accumulate costs.

def get_candidate_words(sentence):
    candidate_words = []                                      # list of candidate words
    for sp in range(len(sentence)):
        w = sentence[sp]
        if w not in dic:
            # Some characters may not be in the corpus; give them a fixed,
            # very small log probability so they can still form a word.
            dic[w] = np.log(0.000001)
        candidate_words.append([w, sp, sp])                   # every single character is a candidate
        for mp in range(1, 20):                               # extend the word with the following characters
            if sp + mp < len(sentence):
                w += sentence[sp + mp]
                if w in dic:
                    candidate_words.append([w, sp, sp + mp])  # store word, start position, end position
    # print('Candidate words: %s' % candidate_words)
    return candidate_words
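A minimal usage sketch (the actual contents of the returned list depend on the dictionary; only the entry format, [word, start position, end position], is fixed by the code above):

candidate_words = get_candidate_words(str1)
# Each entry is [word, start, end]: every single character appears as [char, i, i],
# and every dictionary word starting at position i and ending at position j as [word, i, j].
print(candidate_words)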


4. Choose the best predecessor

While traversing a candidate word's possible predecessors, compute the cumulative cost, i.e. the predecessor's cumulative cost plus the cost of the current word; the predecessor giving the minimum cumulative cost is the best predecessor. First check where the candidate word starts. If it starts at the beginning of the sentence, it has no predecessor and its predecessor is set to null. Otherwise, traverse all candidate words in front of it: every candidate whose end position plus one equals the current word's start position is a possible predecessor. For each such predecessor, add its cumulative cost to the cost of the current word, and keep the predecessor that gives the smallest total.

def get_pro_word(candidate_words):                                              # find each candidate word's best predecessor and cumulative value
    pro_word = {}
    for i in range(len(candidate_words)):
        if candidate_words[i][1] == 0:
            # The word starts the sentence: no predecessor, its cumulative value is its own log probability.
            pro_word[candidate_words[i][0]] = ['null', dic[candidate_words[i][0]]]   # candidate word -> [best predecessor, cumulative log probability]
        else:
            now = candidate_words[i][1]                                         # position of the first character of the current candidate
            for j in range(i - 1, -1, -1):                                      # scan the candidates before it to find its predecessors
                pro = candidate_words[j][2]                                     # position of the last character of the earlier candidate
                if candidate_words[i][0] in pro_word:
                    fi = pro_word[candidate_words[i][0]][1]
                else:
                    fi = -100                                                   # default cumulative value when the word has no entry yet
                # print(pro_word, j)
                if pro == now:
                    pass                                                        # overlaps the current word, not a predecessor
                elif pro + 1 == now:
                    if candidate_words[j][0] in pro_word.keys():
                        temp = dic[candidate_words[i][0]] + pro_word[candidate_words[j][0]][1]
                    else:
                        temp = dic[candidate_words[i][0]] + dic[candidate_words[j][0]]
                    if temp > fi:                                               # maximizing the cumulative log probability = minimizing the cost
                        pro_word[candidate_words[i][0]] = [candidate_words[j][0], temp]
    # print(pro_word)
    return pro_word
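The update above is the dynamic-programming recurrence of the maximum probability method: the cumulative value of a word is its own log probability plus the best cumulative value among the candidates ending immediately before it. A small sketch of that step with hypothetical numbers, assuming '气氛' and '氛' are both candidates ending right before '已' in the example sentence (all values are made up):

import numpy as np

logp_current = np.log(0.002)                  # hypothetical log probability of '已'
best = {'气氛': -12.0, '氛': -15.5}            # hypothetical cumulative log probabilities of the predecessors

# Cumulative log probability of '已' through each possible predecessor;
# the maximum corresponds to the minimum cumulative cost.
totals = {pred: value + logp_current for pred, value in best.items()}
best_pred = max(totals, key=totals.get)
print(best_pred, totals[best_pred])           # '气氛' wins with these numbers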


5. Get the final word segmentation result

In the candidate word list, find the candidates whose last character sits at the end of the sentence and pick the one with the smallest cumulative cost; this is the terminal word of the sentence. Starting from that terminal word, trace the best predecessors backwards to obtain the best predecessor sequence, which is the segmentation result. Here I use a shortcut: assuming every sentence ends with a full stop, and contains exactly one, take the predecessor of the full stop and keep following predecessors until one of them is null. This yields the list of best predecessors starting from the full stop; reversing the list and joining its elements with '/' gives the final segmentation result.

def get_final_word(best_pro_word):
    # Build the best-predecessor list in reverse order; this shortcut only
    # works for sentences that end with a full stop.
    result = ['。']
    if '。' in best_pro_word:
        p = best_pro_word['。'][0]
        while best_pro_word[p][0] != 'null':
            result.append(p)
            p = best_pro_word[p][0]
        result.append(p)                      # the word whose predecessor is null starts the sentence
    result = reversed(result)
    print('/'.join('%s' % w for w in result))
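Putting the pieces together, a minimal driver sketch (assuming WordFrequency.txt is available and the functions above have been defined) could look like this:

if __name__ == '__main__':
    candidate_words = get_candidate_words(str1)    # [word, start, end] triples
    best_pro_word = get_pro_word(candidate_words)  # word -> [best predecessor, cumulative log probability]
    get_final_word(best_pro_word)                  # prints the segmentation joined with '/'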

Experiment screenshot

(Screenshot: the word segmentation process and the final segmentation result printed by the program.)

Origin blog.csdn.net/qq_48068259/article/details/127644789