1. Maximum matching method
Maximum matching is a dictionary-based segmentation method. Take the length of the longest word in the dictionary as the initial window size, take that many characters from the text as the scan string, and look the scan string up in the dictionary. (To make the lookup more efficient, you can also split the dictionary into several dictionaries by word length and look each scan string up only in the dictionary for its length.)
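The idea of splitting the dictionary by word length can be sketched as follows. This is only an illustration: the names build_length_buckets, in_dictionary and toy_words are made up for the example and are not part of the code later in this post. Note that with Python sets a single dictionary is already fast, so the split matters mostly when the words are stored in lists or files.

```python
from collections import defaultdict


def build_length_buckets(words):
    """Group dictionary words by length: {length: set of words}."""
    buckets = defaultdict(set)
    for w in words:
        buckets[len(w)].add(w)
    return dict(buckets)


def in_dictionary(candidate, buckets):
    """Look a candidate up only in the bucket that matches its length."""
    return candidate in buckets.get(len(candidate), ())


# Toy word list, assumed for illustration only.
toy_words = ["北京", "大学", "北京大学", "生", "大学生"]
buckets = build_length_buckets(toy_words)

# The longest word length is the initial window size for maximum matching.
max_length = max(buckets)

print(max_length)                          # 4
print(in_dictionary("北京大学", buckets))   # True
print(in_dictionary("北京大", buckets))     # False
```

The largest key of buckets doubles as the window size, so the longest word length does not have to be hard-coded.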
There are three maximum matching algorithms:
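1. Forward maximum matching
2. Reverse maximum matching
3. Two-way (bidirectional) matching
The three algorithms share the same principle. Taking forward matching as an example, the text is scanned from front to back as follows: take as many characters from the front of the text as the window size allows and look them up in the dictionary. If the candidate is a dictionary word, cut it off as a word; otherwise drop its last character and look it up again, until a match is found or only a single character is left. Then repeat the process on the rest of the text.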
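Reverse matching mirrors the forward scan: it takes the candidate from the end of the text and drops the first character when the lookup fails. The sketch below shows all three variants on a toy dictionary. The function names and the dictionary are assumptions for illustration. The rule used to pick a result in bidirectional_match (fewer words, then fewer single-character words, then the reverse result) is one common heuristic, not the only possible one.

```python
def forward_max_match(text, dictionary, max_length):
    """Scan from front to back, always trying the longest candidate first."""
    words = []
    while text:
        candidate = text[:max_length]
        while candidate not in dictionary and len(candidate) > 1:
            candidate = candidate[:-1]   # drop the last character
        words.append(candidate)
        text = text[len(candidate):]
    return words


def reverse_max_match(text, dictionary, max_length):
    """Scan from back to front, always trying the longest candidate first."""
    words = []
    while text:
        candidate = text[-max_length:]
        while candidate not in dictionary and len(candidate) > 1:
            candidate = candidate[1:]    # drop the first character
        words.append(candidate)
        text = text[:-len(candidate)]
    words.reverse()                      # words were collected back to front
    return words


def bidirectional_match(text, dictionary, max_length):
    """Run both scans and keep one result (assumed heuristic, see lead-in)."""
    fwd = forward_max_match(text, dictionary, max_length)
    rev = reverse_max_match(text, dictionary, max_length)
    if len(fwd) != len(rev):
        return fwd if len(fwd) < len(rev) else rev

    def singles(ws):
        return sum(1 for w in ws if len(w) == 1)

    return fwd if singles(fwd) < singles(rev) else rev


# Toy dictionary, assumed for illustration only.
toy_dic = {"研究", "研究生", "生命", "命", "的", "起源"}
sentence = "研究生命的起源"

print("/".join(forward_max_match(sentence, toy_dic, 3)))    # 研究生/命/的/起源
print("/".join(reverse_max_match(sentence, toy_dic, 3)))    # 研究/生命/的/起源
print("/".join(bidirectional_match(sentence, toy_dic, 3)))  # 研究/生命/的/起源
```

Both scans produce four words here, so the tie is broken by the number of single-character words, and the reverse result is kept.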
2. Forward maximum matching with the Peking University training set
1. Data set: the Peking University (PKU) training word list, icwb2-data/gold/pku_training_words.txt
2. Code implementation
# -*- coding: utf-8 -*-
"""
@author: Junhui Yu
@Date:2020/08/30
"""
pku_training_words = 'icwb2-data/gold/pku_training_words.txt'
words=["欢迎大家来到文本计算与认知智能实验室"]
def get_dic(pku_training_words):
    # Read the whitespace-separated word list; "with" closes the file automatically.
    with open(pku_training_words, 'r', encoding='gbk') as f:
        file_content = f.read().split()
    # A set removes duplicates and makes the "in" lookups below fast.
    return set(file_content)


dic = get_dic(pku_training_words)
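def positive_max_matching(sentence, dictionary, max_length=5):
    """Segment one sentence with forward maximum matching."""
    # max_length is fixed at 5 here; max(len(w) for w in dictionary)
    # would give the longest dictionary word instead.
    word_list = []
    while sentence:
        # Start with the longest candidate the window allows.
        try_word = sentence[:max_length]
        # Drop the last character until the candidate is a dictionary word
        # or only a single character is left.
        while try_word not in dictionary and len(try_word) > 1:
            try_word = try_word[:-1]
        word_list.append(try_word)
        sentence = sentence[len(try_word):]
    return word_list


# Apply forward maximum matching to each line and print the words separated by "/".
for sentence in words:
    contents = positive_max_matching(sentence, dic)
    print("/".join(contents))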