NLP (13): Trying Out Chinese Word Segmentation Tools

  This article tries out three Chinese word segmentation tools: HIT's LTP, jieba, and Peking University's pkuseg.
  First, prepare the environment. Install the three modules pyltp, jieba, and pkuseg, and download LTP's word segmentation model file cws.model. Then add the following five words to the user dictionary:

经
少安
贺凤英
F-35战斗机
埃达尔·阿勒坎
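
LTP reads its custom dictionary from a file, whereas jieba and pkuseg receive the words directly in the test code. The following is a minimal sketch of creating that file. It assumes the path data/lexicon.txt used in the test code, and it assumes the lexicon format is plain UTF-8 text with one word per line; the helper name `write_lexicon` is hypothetical.

```python
import os

# The five user-dictionary words listed above.
lexicon = ['经', '少安', '贺凤英', 'F-35战斗机', '埃达尔·阿勒坎']

def write_lexicon(words, path):
    """Write one word per line (UTF-8), creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words) + '\n')

# Assumed location: the same path the LTP function in the test code loads.
write_lexicon(lexicon, os.path.join('data', 'lexicon.txt'))
```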

  The Python test code is as follows:

# -*- coding: utf-8 -*-

import os
import jieba
import pkuseg
from pyltp import Segmentor

lexicon = ['经', '少安', '贺凤英', 'F-35战斗机', '埃达尔·阿勒坎']  # custom dictionary

# HIT LTP segmentation
def ltp_segment(sent):
    # Load the model and the custom dictionary
    cws_model_path = os.path.join('data', 'cws.model')  # path to the segmentation model, named `cws.model`
    lexicon_path = os.path.join('data', 'lexicon.txt')  # path to the custom dictionary file
    segmentor = Segmentor()
    segmentor.load_with_lexicon(cws_model_path, lexicon_path)
    words = list(segmentor.segment(sent))
    segmentor.release()

    return words

# jieba segmentation
def jieba_cut(sent):
    for word in lexicon:
        jieba.add_word(word)
    return list(jieba.cut(sent))

# pkuseg segmentation
def pkuseg_cut(sent):
    seg = pkuseg.pkuseg(user_dict=lexicon)
    words = seg.cut(sent)
    return words

sent = '尽管玉亭成家以后,他老婆贺凤英那些年把少安妈欺负上一回又一回,怕老婆的玉亭连一声也不敢吭,但少安他妈不计较他。'
#sent = '据此前报道,以色列于去年5月成为世界上第一个在实战中使用F-35战斗机的国家。'
#sent = '小船4月8日经长江前往小鸟岛。'
#sent = '1958年,埃达尔·阿勒坎出生在土耳其首都安卡拉,但他的求学生涯多在美国度过。'

print('ltp:', ltp_segment(sent))
print('jieba:', jieba_cut(sent))
print('pkuseg:', pkuseg_cut(sent))
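
Comparing printed lists by eye works for four sentences but gets tedious beyond that. Below is a small sketch of a check that reports, for each custom-dictionary word occurring in a sentence, whether it came out as a single token. The helper name `lexicon_report` is hypothetical, and the two token lists are hand-written sample inputs for illustration, not output captured from the tools.

```python
def lexicon_report(sent, tokens, lexicon):
    """Map each lexicon word that occurs in `sent` to True if it is a token."""
    return {word: word in tokens for word in lexicon if word in sent}

lexicon = ['经', '少安', '贺凤英', 'F-35战斗机', '埃达尔·阿勒坎']
sent = '小船4月8日经长江前往小鸟岛。'

# Hand-written sample segmentations (illustrative only).
sample_a = ['小船', '4月', '8日', '经', '长江', '前往', '小鸟', '岛', '。']
sample_b = ['小船', '4', '月', '8', '日经', '长江', '前往', '小', '鸟岛', '。']

print(lexicon_report(sent, sample_a, lexicon))  # {'经': True}
print(lexicon_report(sent, sample_b, lexicon))  # {'经': False}
```

The same function can be applied to the lists returned by `ltp_segment`, `jieba_cut`, and `pkuseg_cut`.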

  For the first sentence, the output is as follows:

Original: 尽管玉亭成家以后,他老婆贺凤英那些年把少安妈欺负上一回又一回,怕老婆的玉亭连一声也不敢吭,但少安他妈不计较他。 (After Yuting got married, his wife He Fengying bullied Shao'an's mother time and again over the years, and the henpecked Yuting never dared utter a word, yet Shao'an's mother did not hold it against him.)

ltp: ['尽管', '玉亭', '成家', '以后', ',', '他', '老婆', '贺凤英', '那些', '年', '把', '少安', '妈', '欺负', '上', '一', '回', '又', '一', '回', ',', '怕', '老婆', '的', '玉亭', '连', '一', '声', '也', '不', '敢', '吭', ',', '但', '少安', '他妈', '不', '计较', '他', '。']

jieba: ['尽管', '玉亭', '成家', '以后', ',', '他', '老婆', '贺凤英', '那些', '年', '把', '少安', '妈', '欺负', '上', '一回', '又', '一回', ',', '怕老婆', '的', '玉亭', '连', '一声', '也', '不敢', '吭', ',', '但', '少安', '他妈', '不', '计较', '他', '。']

pkuseg: ['尽管', '玉亭', '成家', '以后', ',', '他', '老婆', '贺凤英', '那些', '年', '把', '少安', '妈', '欺负', '上', '一', '回', '又', '一', '回', ',', '怕', '老婆', '的', '玉亭', '连', '一', '声', '也', '不', '敢', '吭', ',', '但', '少安', '他妈', '不', '计较', '他', '。']

  For the second sentence, the output is as follows:

Original: 据此前报道,以色列于去年5月成为世界上第一个在实战中使用F-35战斗机的国家。 (According to earlier reports, in May last year Israel became the first country in the world to use the F-35 fighter in actual combat.)

ltp: ['据', '此前', '报道', ',', '以色列', '于', '去年', '5月', '成为', '世界', '上', '第一', '个', '在', '实战', '中', '使用', 'F-35', '战斗机', '的', '国家', '。']

jieba: ['据此', '前', '报道', ',', '以色列', '于', '去年', '5', '月', '成为', '世界', '上', '第一个', '在', '实战', '中', '使用', 'F', '-', '35', '战斗机', '的', '国家', '。']

pkuseg: ['据', '此前', '报道', ',', '以色列', '于', '去年', '5月', '成为', '世界', '上', '第一', '个', '在', '实战', '中', '使用', 'F-35战斗机', '的', '国家', '。']

  For the third sentence, the output is as follows:

Original: 小船4月8日经长江前往小鸟岛。 (On April 8, the boat traveled via the Yangtze River to Little Bird Island.)

ltp: ['小船', '4月', '8日', '经长江', '前往', '小鸟岛', '。']

jieba: ['小船', '4', '月', '8', '日经', '长江', '前往', '小', '鸟岛', '。']

pkuseg: ['小船', '4月', '8日', '经', '长江', '前往', '小鸟', '岛', '。']

  For the fourth sentence, the output is as follows:

Original: 1958年,埃达尔·阿勒坎出生在土耳其首都安卡拉,但他的求学生涯多在美国度过。 (Erdal Arıkan was born in 1958 in Ankara, the capital of Turkey, but spent most of his student years in the United States.)

ltp: ['1958年', ',', '埃达尔·阿勒坎', '出生', '在', '土耳其', '首都', '安卡拉', ',', '但', '他', '的', '求学', '生涯', '多', '在', '美国', '度过', '。']

jieba: ['1958', '年', ',', '埃', '达尔', '·', '阿勒', '坎', '出生', '在', '土耳其', '首都', '安卡拉', ',', '但', '他', '的', '求学', '生涯', '多', '在', '美国', '度过', '。']

pkuseg: ['1958年', ',', '埃达尔·阿勒坎', '出生', '在', '土耳其', '首都', '安卡拉', ',', '但', '他', '的', '求学', '生涯', '多', '在', '美国', '度过', '。']

  Next, a brief summary of the test cases above:

  1. User dictionary: LTP and pkuseg both handle it well, while jieba's results are unsatisfactory, mainly because some of the custom dictionary words contain punctuation. For a solution to this problem, see: https://blog.csdn.net/weixin_42471956/article/details/80795534

  2. Judging from the second and third sentences, pkuseg gives the best segmentation. '经' should be split out as a single word (third sentence), yet LTP and jieba fail to do so even with the custom dictionary added. 'F-35战斗机' (second sentence) is a similar case.
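
One generic workaround for dictionary words that contain punctuation (not necessarily the approach described at the URL in point 1) is to cut the sentence around those words first and pass only the remaining text to the segmenter. The sketch below illustrates the idea. The helper names `split_protected` and `cut_with_protected` are hypothetical, and `list` (a character-by-character split) stands in for a real segmenter so the example runs without any third-party package.

```python
import re

def split_protected(sent, lexicon):
    """Split `sent` into (text, is_protected) pieces around lexicon words."""
    if not lexicon:
        return [(sent, False)] if sent else []
    # Try longer entries first so they win over shorter overlapping ones.
    ordered = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(w) for w in ordered))
    pieces, pos = [], 0
    for m in pattern.finditer(sent):
        if m.start() > pos:
            pieces.append((sent[pos:m.start()], False))
        pieces.append((m.group(), True))
        pos = m.end()
    if pos < len(sent):
        pieces.append((sent[pos:], False))
    return pieces

def cut_with_protected(sent, lexicon, cut):
    """Keep lexicon words whole; segment everything else with `cut`."""
    words = []
    for text, is_protected in split_protected(sent, lexicon):
        if is_protected:
            words.append(text)
        else:
            words.extend(cut(text))
    return words

# `list` is only a stand-in segmenter for this demonstration.
print(cut_with_protected('使用F-35战斗机的国家', ['F-35战斗机'], list))
# ['使', '用', 'F-35战斗机', '的', '国', '家']
```

With jieba installed, `jieba.lcut` could be passed as `cut`. Note that this approach splits out every occurrence of a protected string, so it suits long, unambiguous entries such as 'F-35战斗机' or '埃达尔·阿勒坎' far better than a single character like '经', which also appears inside many ordinary words.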

  Overall, all three tools segment well and the gap between them is small, but when a custom dictionary is involved, pkuseg is clearly the more stable of the three. I will seriously consider pkuseg for future word segmentation work.
  For an introduction to pkuseg and how to use it, see: https://github.com/lancopku/PKUSeg-python

Note: You are welcome to follow the author's WeChat public account: Python Crawlers and Algorithms (WeChat ID: easy_web_scrape).

Origin www.cnblogs.com/jclian91/p/11295526.html