Forward maximum matching algorithm (Chinese word segmentation)

1. Maximum matching method

  Maximum matching is a dictionary-based segmentation method. The length of the longest word in the dictionary sets the initial window: that many characters are taken from the text as the candidate string and looked up in the dictionary. If the candidate is not found, it is shortened by one character and looked up again, until a match is found or only a single character remains. (To speed up lookups, the dictionary can also be split into several dictionaries by word length, so that each candidate is checked only against the dictionary for its own length.)
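
The lookup described above, including the optional split of the dictionary by word length, can be sketched as follows. This is a minimal illustration and not the code used later in this post: the names `build_length_buckets`, `forward_max_match` and `toy_vocabulary` are hypothetical, and the tiny vocabulary is made up for the example.

```python
from collections import defaultdict


def build_length_buckets(vocabulary):
    """Group dictionary words by length: {length: set of words}."""
    buckets = defaultdict(set)
    for w in vocabulary:
        buckets[len(w)].add(w)
    return buckets


def forward_max_match(text, buckets):
    """Segment text front to back, always trying the longest candidate first."""
    # The longest dictionary word sets the initial window size.
    max_len = max(buckets, default=1)
    result = []
    i = 0
    while i < len(text):
        # Shrink the window one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Look only in the bucket for this length; a single
            # character is always accepted as a fallback.
            if size == 1 or candidate in buckets.get(size, ()):
                result.append(candidate)
                i += size
                break
    return result


# Hypothetical toy vocabulary, for illustration only.
toy_vocabulary = {"研究", "研究生", "生命", "起源"}
buckets = build_length_buckets(toy_vocabulary)
print("/".join(forward_max_match("研究生命的起源", buckets)))
```

Because forward matching is greedy, it picks "研究生" first here and leaves "命" as a single character, which is the kind of ambiguity the other matching directions try to address.
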
There are three maximum matching algorithms:
  1. Forward maximum matching
  2. Reverse maximum matching
  3. Bidirectional (two-way) matching
The three algorithms share the same principle. Taking forward matching as an example, the text is scanned from front to back.
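
To make the relationship between the three variants concrete, here is a rough sketch of all of them on a toy vocabulary. The post itself only implements the forward direction, so the function names and the vocabulary below are hypothetical. The tie-breaking rule in `bidirectional_match` (fewer words, then fewer single-character words, otherwise the reverse result) is one commonly used heuristic and is an assumption here, not something the source specifies.

```python
def forward_max_match(text, vocabulary, max_len):
    """Scan front to back, shrinking the candidate from its right end."""
    result = []
    while text:
        candidate = text[:max_len]
        while len(candidate) > 1 and candidate not in vocabulary:
            candidate = candidate[:-1]
        result.append(candidate)
        text = text[len(candidate):]
    return result


def backward_max_match(text, vocabulary, max_len):
    """Scan back to front, shrinking the candidate from its left end."""
    result = []
    while text:
        candidate = text[-max_len:]
        while len(candidate) > 1 and candidate not in vocabulary:
            candidate = candidate[1:]
        result.append(candidate)
        text = text[:-len(candidate)]
    result.reverse()  # words were collected from the end of the text
    return result


def bidirectional_match(text, vocabulary, max_len):
    """Run both directions and pick one result (assumed heuristic)."""
    fwd = forward_max_match(text, vocabulary, max_len)
    bwd = backward_max_match(text, vocabulary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(words):
        return sum(len(w) == 1 for w in words)

    return fwd if singles(fwd) < singles(bwd) else bwd


# Hypothetical toy vocabulary, for illustration only.
toy_vocabulary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命的起源"
print("/".join(forward_max_match(sentence, toy_vocabulary, 3)))
print("/".join(backward_max_match(sentence, toy_vocabulary, 3)))
print("/".join(bidirectional_match(sentence, toy_vocabulary, 3)))
```

On this sentence both directions produce four words, but the reverse scan leaves only one single-character word, so the heuristic above selects the reverse result.
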

[Figures: illustration of the forward matching process (images not available)]

2. Forward maximum matching with the Peking University training set

1. Data set (the Peking University training set)

[Figures: screenshots of the data set (images not available)]

2. Code implementation


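# -*- coding: utf-8 -*-
"""
@author: Junhui Yu
@Date:2020/08/30
"""

# Path to the PKU word list; the file is read as GBK below.
pku_training_words = 'icwb2-data/gold/pku_training_words.txt'
# Sentences to segment, one string per line.
words = ["欢迎大家来到文本计算与认知智能实验室"]


def get_dic(path):
    """Load the dictionary file and return its unique words."""
    # The with-statement closes the file, so no explicit close() is needed.
    with open(path, 'r', encoding='gbk') as f:
        file_content = f.read().split()
    # A set gives constant-time membership tests during matching.
    return set(file_content)


dic = get_dic(pku_training_words)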

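def positive_max_matching(lines, dictionary, max_length=5):
    """Segment each line by forward maximum matching; return one word list per line."""
    results = []
    for line in lines:  # apply forward maximum matching to each line
        word_list = []
        while line:
            # Start from the longest candidate, capped at max_length characters.
            try_word = line[:max_length]
            # Drop the last character until the candidate is in the dictionary;
            # a single character is always accepted as a word.
            while len(try_word) > 1 and try_word not in dictionary:
                try_word = try_word[:-1]
            word_list.append(try_word)
            line = line[len(try_word):]
        results.append(word_list)
    return results


# Print each segmented line with "/" between words.
for word_list in positive_max_matching(words, dic):
    print("/".join(word_list))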

3. Results

[Figure: segmentation output of the script (image not available)]

Origin blog.csdn.net/yjh_SE007/article/details/108308773