Python Natural Language Processing Notes (2): Chinese Word Segmentation

The dataset and code are available in the GitHub repository.

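Forward Maximum Matching

The idea behind forward maximum matching (FMM)

Let m be the length of the longest word in the dictionary. Reading left to right, take the first m characters of the text still to be segmented as the candidate string.
Look the candidate up in the dictionary. If the dictionary contains it, the match succeeds and the candidate is cut off as a word. Otherwise, drop the last character of the candidate and look it up again; a candidate that has shrunk to a single character is cut off as a word on its own.
Repeat this process on the remaining text until all of it has been segmented.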
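The steps above can be traced on a toy example. This is only a minimal sketch, not the code used later in this post: the function name `fmm_segment` and the six-word dictionary are made up for illustration.

```python
def fmm_segment(text, words):
    """Segment text by forward maximum matching against the set `words`."""
    max_len = max(len(w) for w in words)  # m: length of the longest dictionary word
    tokens = []
    i = 0
    while i < len(text):
        # Start with the longest possible candidate and shrink it from the right.
        size = min(max_len, len(text) - i)
        while size > 1 and text[i:i + size] not in words:
            size -= 1
        # size == 1 is the fallback: an unmatched single character is emitted as is.
        tokens.append(text[i:i + size])
        i += size
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_words = {"研究", "研究生", "生命", "命", "的", "起源"}
print("/".join(fmm_segment("研究生命的起源", toy_words)))
# prints: 研究生/命/的/起源
```

The greedy match takes 研究生 first, even though 研究/生命/的/起源 is the intended reading. This kind of error is a known limitation of purely dictionary-based matching.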

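Task:
Using forward maximum matching and the supplied data, segment the corpus corpus.sentence.txt with the dictionary file corpus.dict.txt, and write the segmentation result to the file corpus.out.txt.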

1. Removing punctuation

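import re

# Remove punctuation
def filter_punctuation(line):
    # Character class listing the full-width (CJK) and ASCII punctuation marks to strip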
    punc = "[！？。｡＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+"
    line = re.sub(punc, "", line)
    return line
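As an alternative to listing every punctuation mark by hand, the Unicode character categories can be used. The sketch below is a different technique from the regular expression above, and the helper name `strip_punctuation` is hypothetical. It assumes that every character in a punctuation (P*) or symbol (S*) category should be dropped, which removes more characters than the hand-written list, so the two functions may give different results on some inputs.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


print(strip_punctuation("你好，世界！Hello, world."))
# prints: 你好世界Hello world
```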

Experimental results: [figure]

2. Maximum matching

Implementation logic of the algorithm

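# Forward maximum matching
def max_left_match(line, dictionary):
    input_str = line.strip()
    output_str = ""
    # Length of the longest dictionary word
    max_length = dictionary['max_length']
    word_dict = dictionary['word_dict']
    while input_str != '':
        # Never take a candidate longer than the remaining text (and at least one character)
        num = max(1, min(max_length, len(input_str)))
        # Shrink the candidate from the right until it is a dictionary word;
        # a single character is always accepted as a fallback
        while num > 1 and input_str[0:num] not in word_dict:
            num -= 1
        W = input_str[0:num]
        output_str += W + "/"
        input_str = input_str[len(W):]
    return output_str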

Experimental results: [figure]

3. Segmenting with the jieba library

import jieba

# Segment with the jieba library
def jieba_cut(line):
    # jieba.cut returns a generator of words; join them with spaces
    line_seg = " ".join(jieba.cut(line))
    return line_seg

Experimental results: [figure]

4. Complete code

# encoding=utf-8
import re
import jieba

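# Load the dictionary
def load_word_list():
    max_length = 0
    word_dict = set()
    # One word per line; the context manager closes the file after reading
    with open('./data/corpus.dict.txt', encoding='utf-8', errors='ignore') as f:
        for line in f:
            word = line.strip()
            if not word:
                continue  # skip blank lines
            # Measure the stripped word so the trailing newline is not counted
            if max_length < len(word):
                max_length = len(word)
            word_dict.add(word)
    return {
        'max_length': max_length,
        'word_dict': word_dict
    }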

# Remove punctuation
def filter_punctuation(line):
    # Character class listing the full-width (CJK) and ASCII punctuation marks to strip
    punc = "[！？。｡＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+"
    line = re.sub(punc, "", line)
    return line

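# Forward maximum matching
def max_left_match(line, dictionary):
    input_str = line.strip()
    output_str = ""
    # Length of the longest dictionary word
    max_length = dictionary['max_length']
    word_dict = dictionary['word_dict']
    while input_str != '':
        # Never take a candidate longer than the remaining text (and at least one character)
        num = max(1, min(max_length, len(input_str)))
        # Shrink the candidate from the right until it is a dictionary word;
        # a single character is always accepted as a fallback
        while num > 1 and input_str[0:num] not in word_dict:
            num -= 1
        W = input_str[0:num]
        output_str += W + "/"
        input_str = input_str[len(W):]
    return output_str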

# Segment with the jieba library
def jieba_cut(line):
    # jieba.cut returns a generator of words; join them with spaces
    line_seg = " ".join(jieba.cut(line))
    return line_seg

# Test
def main():
    dictionary = load_word_list()
    with open('./data/corpus.sentence.txt', encoding='utf-8', errors='ignore') as f:
        for line in f:
            # Remove punctuation
            new_line = filter_punctuation(line)
            # Hand-written forward maximum matching
            result = max_left_match(new_line, dictionary)
            # Segmentation with jieba
            # result = jieba_cut(new_line)
            # Print the result
            print(result)

if __name__ == '__main__':
    main()
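The task statement asks for the result to be written to corpus.out.txt, whereas main() above only prints it. A minimal sketch of the file-writing step follows. The helper name `segment_file` is hypothetical, and the `./data/` location of the output file in the usage comment is an assumption. The helper accepts any function that maps one line to its segmented form.

```python
def segment_file(in_path, out_path, segment):
    """Apply segment(line) to every non-empty line of in_path; write one result per line."""
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if line:  # skip blank lines
                dst.write(segment(line) + "\n")


# Possible usage with the functions from this post (output path is assumed):
# words = load_word_list()
# segment_file('./data/corpus.sentence.txt', './data/corpus.out.txt',
#              lambda s: max_left_match(filter_punctuation(s), words))
```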
