Training unigram and bigram language models with HanLP

Contents

Training the bigram model
Integrating a user dictionary

A unigram model simply counts how often each word occurs in a corpus.
A bigram model counts how often each pair of consecutive words occurs in the corpus.

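In Chinese language processing, the corpus used to train unigram and bigram models must consist of sentences that have already been segmented into words.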
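To make the two definitions concrete, here is a minimal pure-Python sketch of the counting step. It is independent of HanLP; the three-sentence corpus and the helper name `count_ngrams` are made up for illustration, and words are assumed to be separated by whitespace.

```python
from collections import Counter


def count_ngrams(segmented_sentences):
    """Count unigrams and bigrams in sentences that are already split into words."""
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        # zip pairs each word with its successor: (w1, w2), (w2, w3), ...
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


# Hypothetical toy corpus: one sentence per entry, words separated by spaces
corpus = ["商品 和 服务", "商品 和服 物美价廉", "服务 和 货币"]
sentences = [line.split() for line in corpus]

unigrams, bigrams = count_ngrams(sentences)
print(unigrams["商品"])          # frequency of the word 商品
print(bigrams[("商品", "和")])   # frequency of 商品 immediately followed by 和
```

This also shows why the corpus has to be segmented beforehand: without word boundaries there is nothing to count.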

Training the bigram model

Below we use MSR, the dataset provided by Microsoft, to train a unigram and bigram model. The code is as follows:

```python
from pyhanlp import *

NatureDictionaryMaker = SafeJClass('com.hankcs.hanlp.corpus.dictionary.NatureDictionaryMaker')
CorpusLoader = SafeJClass('com.hankcs.hanlp.corpus.document.CorpusLoader')

# Path to the dataset; a plain .txt file also works
corpus_path = r'E:\Anaconda3\Lib\site-packages\pyhanlp\static\data\test\icwb2-data\training\msr_training.utf8'
# Path prefix under which the model files are saved
model_path = r'D:\桌面\比赛\msr_model'


def train_bigram(corpus_path, model_path):
    sents = CorpusLoader.convert2SentenceList(corpus_path)  # load the corpus (dataset)
    for sent in sents:
        for word in sent:
            if word.label is None:
                word.setLabel("n")  # give unlabeled words a default tag
    maker = NatureDictionaryMaker()  # model (dictionary) builder
    maker.compute(sents)  # count unigrams and bigrams
    maker.saveTxtTo(model_path)  # write the model files, using model_path as the prefix


train_bigram(corpus_path, model_path)
```
    
Running the code above (it takes quite a while) produces the following three files in the model path:

![Output files](https://img-blog.csdnimg.cn/20200502145745388.png)

Opening msr_model.ngram.txt (the bigram model) shows the model:

![The bigram model](https://img-blog.csdnimg.cn/20200502145900221.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjE0MTM5MA==,size_16,color_FFFFFF,t_70)

## Segmenting with the bigram model

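When segmenting with the model above, the target sentence is matched against the model to build a **word net**, which in turn corresponds to a word graph (a directed graph in which each node represents a word from the model). Based on the distances between nodes, the Viterbi algorithm or Dijkstra's algorithm then finds the most plausible segmentation.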
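The idea can be sketched in plain Python. This is a toy illustration, not HanLP's implementation: the counts are invented, the `<s>` start marker and all function names are hypothetical, and the edge length uses simple add-one smoothing chosen only to keep the example short.

```python
import math

BOS = "<s>"  # hypothetical sentence-start marker used only in this sketch


def build_word_net(sentence, vocab):
    """net[i] holds every dictionary word that starts at character i."""
    net = [[] for _ in sentence]
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            if sentence[i:j] in vocab:
                net[i].append(sentence[i:j])
        if not net[i]:  # no dictionary word starts here: fall back to the single character
            net[i].append(sentence[i])
    return net


def cost(prev, word, unigrams, bigrams):
    """Edge length: negative log of an add-one smoothed bigram probability."""
    p = (bigrams.get((prev, word), 0) + 1) / (unigrams.get(prev, 0) + len(unigrams) + 1)
    return -math.log(p)


def viterbi_segment(sentence, unigrams, bigrams):
    net = build_word_net(sentence, unigrams)
    n = len(sentence)
    # ends[e][word] = (cost of the cheapest path whose last word is `word` and ends at e, previous word)
    ends = [dict() for _ in range(n + 1)]
    ends[0][BOS] = (0.0, None)
    for i in range(n):
        for prev, (acc, _) in ends[i].items():
            for word in net[i]:
                e = i + len(word)
                total = acc + cost(prev, word, unigrams, bigrams)
                if word not in ends[e] or total < ends[e][word][0]:
                    ends[e][word] = (total, prev)
    # Backtrack from the cheapest word that ends at the last character
    e = n
    word = min(ends[n], key=lambda w: ends[n][w][0])
    path = []
    while e > 0:
        path.append(word)
        prev = ends[e][word][1]
        e -= len(word)
        word = prev
    return path[::-1]


# Invented toy counts
unigrams = {"商品": 2, "和": 2, "服务": 2, "和服": 1, "物美价廉": 1, "货币": 1}
bigrams = {("商品", "和"): 1, ("和", "服务"): 1, ("商品", "和服"): 1,
           ("和服", "物美价廉"): 1, ("服务", "和"): 1, ("和", "货币"): 1}

print(viterbi_segment("商品和服务", unigrams, bigrams))
```

With these counts the word net offers both 和 and 和服 at the third character; the path through 和 and 服务 is shorter because 和服 would have to be followed by the unseen single character 务.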

To segment, we first need to point HanLP.Config.CoreDictionaryPath at the unigram model. To segment with bigrams, HanLP.Config.BiGramDictionaryPath must also point at the bigram model:
```python
from pyhanlp import *

ViterbiSegment = JClass('com.hankcs.hanlp.seg.Viterbi.ViterbiSegment')
DijkstraSegment = JClass('com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment')
CoreDictionary = LazyLoadingJClass('com.hankcs.hanlp.dictionary.CoreDictionary')
CoreBiGramTableDictionary = SafeJClass('com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary')
Nature = JClass('com.hankcs.hanlp.corpus.tag.Nature')


def load_bigram(model_path):
    HanLP.Config.CoreDictionaryPath = model_path + ".txt"  # unigram model
    HanLP.Config.BiGramDictionaryPath = model_path + ".ngram.txt"  # bigram model

    # The rest is for compatibility with new tag sets; skip it if you are not interested
    # Part-of-speech transition matrix; not needed for segmentation itself
    HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath = model_path + ".tr.txt"
    with open(HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath, encoding='utf-8') as src:
        # The header row lists the tags; register each one with HanLP
        for tag in src.readline().strip().split(',')[1:]:
            Nature.create(tag)
```

Depending on your needs, you can build the Chinese segmenter with either the Viterbi algorithm or Dijkstra's algorithm:

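```python
segment_1 = ViterbiSegment()  # Viterbi algorithm; the dictionaries were already set in HanLP.Config
segment_1.enableAllNamedEntityRecognize(False)  # turn off named entity recognition
segment_1.enableCustomDictionary(False)  # do not attach the user dictionary

segment_2 = DijkstraSegment()  # Dijkstra's algorithm, same settings
segment_2.enableAllNamedEntityRecognize(False)
segment_2.enableCustomDictionary(False)
```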

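You can also pass a dictionary path straight to the constructor when creating the segmenter, and then split a sentence through the `seg` interface. Note that in HanLP 1.x this constructor argument appears to be read as a custom dictionary path, so the unigram and bigram models are still best configured through HanLP.Config as above: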

```python
a = ViterbiSegment(model_path + ".ngram.txt")
print(a.seg('我爱你'))
```

Integrating a user dictionary

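Segmentation with unigram and bigram models has its advantages over pure dictionary matching, but it still handles internet neologisms and out-of-vocabulary (OOV) words poorly. So can a dictionary be combined with the language model for segmentation? HanLP supports exactly that.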

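There are two ways to integrate a user dictionary:

1. Low priority: the language model first segments the sentence without considering the user dictionary, and the result is then merged according to the user dictionary.
2. High priority: the user dictionary is considered first, though exactly how it is applied is left to the segmenter.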
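The low-priority idea can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not HanLP's code: `merge_low_priority` is a hypothetical helper, and the token list stands in for whatever the language model produced on its own.

```python
def merge_low_priority(tokens, user_dict):
    """Post-process a finished segmentation: join adjacent tokens whose concatenation is a user word."""
    result, i = [], 0
    while i < len(tokens):
        merged, end = tokens[i], i + 1
        # Try the longest run of tokens starting at i first
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in user_dict:
                merged, end = candidate, j
                break
        result.append(merged)
        i = end
    return result


# Hypothetical output of a language model that does not know the word 社会摇
tokens = ["社会", "摇摆", "简称", "社会", "摇"]
user_dict = {"社会摇"}
print(merge_low_priority(tokens, user_dict))
```

Merging can only glue existing tokens together. The trailing 社会 and 摇 become 社会摇, but the leading 社会 and 摇摆 stay as they are, because getting 社会摇 there would mean splitting 摇摆. That limitation is the motivation for a high-priority mode.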

Both integration methods are shown below:

```python
from pyhanlp import *

ViterbiSegment = SafeJClass('com.hankcs.hanlp.seg.Viterbi.ViterbiSegment')
# The following is based on He Han's book 《自然语言处理入门》 (Introduction to Natural Language Processing)
segment = ViterbiSegment()
sentence = "社会摇摆简称社会摇"
segment.enableCustomDictionary(False)
print("No user dictionary:", segment.seg(sentence))
CustomDictionary.insert("社会摇", "nz 100")
segment.enableCustomDictionary(True)
print("Low-priority dictionary:", segment.seg(sentence))
segment.enableCustomDictionaryForcing(True)
print("High-priority dictionary:", segment.seg(sentence))
```
