pyltp引入外部词典

大家好，今天跟大家介绍一下在文本学习过程中，为什么要引入外部词典以及引入外部词典之后又什么变化。

为什么引入外部词典
怎么引入（外部词典的配置）
一、为什么引入？
pyltp分词支持用户使用自定义词典，分词外部词典本身是一个文本文件（*.txt）。每行指定一个词，编码必须为UTF-8。（保存文件的时候，设置编码为UTF-8）。

代码注意以下几点：
1、改变模型文件路径！
2、外部词典的加载路径代码。（如下图）

完整代码如下：

# -*- coding: utf-8 -*-
import os
from pyltp import Segmentor, Postagger
# 分词
LTP_DATA_DIR = 'E:\Python\pyltp\ltp\ltp\ltp_data'  # ltp模型目录的路径
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')  # 分词模型路径，模型名称为`cws.model`
lexicon_path = os.path.join(LTP_DATA_DIR, 'E:\Python\pyltp\ltp\ltp\ltp_data\lexicon.txt')  # 参数lexicon是自定义词典的文件路径
segmentor = Segmentor()
segmentor.load_with_lexicon(cws_model_path, lexicon_path)
sent = '据韩联社12月28日反映，美国防部发言人杰夫·莫莱尔27日表示，美国防部长盖茨将于2011年1月14日访问韩国。2010年2月28日中国刘军报道'
words = segmentor.segment(sent)  # 分词
# 词性标注
pos_model_path = os.path.join(LTP_DATA_DIR, 'pos.model')  # 词性标注模型路径，模型名称为`pos.model`
postagger = Postagger()  # 初始化实例
postagger.load(pos_model_path)  # 加载模型
postags = postagger.postag(words)  # 词性标注
ner_model_path = os.path.join(LTP_DATA_DIR, 'ner.model')   # 命名实体识别模型路径，模型名称为`pos.model`
from pyltp import NamedEntityRecognizer
recognizer = NamedEntityRecognizer() # 初始化实例
recognizer.load(ner_model_path)  # 加载模型
# netags = recognizer.recognize(words, postags)  # 命名实体识别
# 提取识别结果中的人名，地名，组织机构名
persons, places, orgs = set(), set(), set()
netags = list(recognizer.recognize(words, postags))  # 命名实体识别
print(netags)
# print(netags)
i = 0
for tag, word in zip(netags, words):
    j = i
    # 人名
    if 'Nh' in tag:
        if str(tag).startswith('S'):
            persons.add(word)
        elif str(tag).startswith('B'):
            union_person = word
            while netags[j] != 'E-Nh':
                j += 1
                if j < len(words):
                    union_person += words[j]
            persons.add(union_person)
    # 地名
    if 'Ns' in tag:
        if str(tag).startswith('S'):
            places.add(word)
        elif str(tag).startswith('B'):
            union_place = word
            while netags[j] != 'E-Ns':
                j += 1
                if j < len(words):
                    union_place += words[j]
            places.add(union_place)
    # 机构名
    if 'Ni' in tag:
        if str(tag).startswith('S'):
            orgs.add(word)
        elif str(tag).startswith('B'):
            union_org = word
            while netags[j] != 'E-Ni':
                j += 1
                if j < len(words):
                    union_org += words[j]
            orgs.add(union_org)
    i += 1
print('人名：', '，'.join(persons))
print('地名：', '，'.join(places))
print('组织机构：', '，'.join(orgs))
# 释放模型
segmentor.release()
postagger.release()
recognizer.release()

我加入的外部词典如下图：
在这里插入图片描述
结果如下：

倘若不引入外部词典，那么分词的时对于某些词解析不是很对，导致其他工作的错误。
但是，外部词典的使用倘若用户数据很大，比如一本书，网络上应该是有现有的词典，供大家使用。
本片文章就写到这啦，祝大家生活愉快。

董文博（DWB）

发布了18 篇原创文章 · 获赞 30 · 访问量 3377

私信关注

pyltp引入外部词典

猜你喜欢