NLP: Three Chinese Word Segmentation Tools

  This article tries out three Chinese word segmentation tools: LTP from Harbin Institute of Technology (HIT), jieba, and pkuseg from Peking University.
  First, prepare the environment. Three modules need to be installed (pyltp, jieba and pkuseg), along with LTP's word segmentation model file, cws.model. Then add the following five words to the user dictionary:

by
Shaoan
He Fengying
F-35 fighter jets
Aida Er · A Lekan
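
  LTP loads its user dictionary from a file rather than from a Python list, so the words have to be written to disk first. The sketch below assumes the file holds one word per line in UTF-8, which is how LTP's external lexicon is usually described. The helper name `write_lexicon` and the temporary directory are illustrative only, and just two of the five words are shown.

```python
import os
import tempfile

def write_lexicon(words, path):
    """Write one dictionary word per line, UTF-8 encoded."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        for word in words:
            f.write(word.strip() + '\n')

# Two of the user-dictionary words, for illustration.
sample_words = ['He Fengying', 'Aida Er · A Lekan']

# A temporary directory stands in for the project's data folder.
lexicon_path = os.path.join(tempfile.mkdtemp(), 'data', 'lexicon.txt')
write_lexicon(sample_words, lexicon_path)

with open(lexicon_path, encoding='utf-8') as f:
    print(f.read().splitlines())
```

  jieba and pkuseg can take the same words directly as a Python list, so only LTP needs the file.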

  The Python test code is as follows:

# -*- coding: utf-8 -*-

import os
import jieba
import pkuseg
from pyltp import Segmentor

# Custom dictionary
lexicon = ['by', 'Shaoan', 'He Fengying', 'F-35 fighter jets', 'Aida Er · A Lekan']

# Segmentation with HIT LTP
def ltp_segment(sent):
    # Load the model and the dictionary file
    cws_model_path = os.path.join('data', 'cws.model')  # path to the segmentation model, named `cws.model`
    lexicon_path = os.path.join('data', 'lexicon.txt')  # path to the custom dictionary file
    segmentor = Segmentor()
    segmentor.load_with_lexicon(cws_model_path, lexicon_path)
    words = list(segmentor.segment(sent))
    segmentor.release()

    return words

# Segmentation with jieba
def jieba_cut(sent):
    for word in lexicon:
        jieba.add_word(word)
    return list(jieba.cut(sent))

# Segmentation with pkuseg
def pkuseg_cut(sent):
    seg = pkuseg.pkuseg(user_dict=lexicon)
    words = seg.cut(sent)
    return words

sent = "Although Yuting later married, his wife He Fengying bullied Shaoan's mother time and again in those years, and the henpecked Yuting did not dare utter a sound, but Shaoan's mother did not hold it against him."
# sent = 'As previously reported, in May last year Israel became the first country in the world to use F-35 fighter jets in actual combat.'
# sent = 'On April 8 the boat went by the Yangtze to Little Bird Island.'
# sent = 'Aida Er · A Lekan was born in 1958 in Ankara, the capital of Turkey, but he spent most of his school years in the United States.'

print('LTP:', ltp_segment(sent))
print('jieba:', jieba_cut(sent))
print('pkuseg:', pkuseg_cut(sent))

  For the first sentence, the output results are as follows:

Original: Although Yuting later married, his wife He Fengying bullied Shaoan's mother time and again in those years, and the henpecked Yuting did not dare utter a sound, but Shaoan's mother did not hold it against him.

ltp: ['although', 'Yuting', 'married', 'future', ',', 'he', 'wife', 'He Fengying', 'who', 'in', 'to', 'Shaoan', 'mother', 'bullying', 'on', 'a', 'back', 'and', 'a', 'back', ',', 'fear', 'wife', 'of', 'Yuting', 'even', 'a', 'sound', 'also', 'no', 'dare', 'throat', ',', 'but', 'Shaoan', 'his mother', 'no', 'care', 'he', '.']

jieba: ['although', 'Yuting', 'married', 'future', ',', 'he', 'wife', 'He Fengying', 'who', 'in', 'to', 'Shaoan', 'mother', 'bullying', 'on', 'back', 'and', 'back', ',', 'henpecked', 'a', 'Yuting', 'even', 'sound', 'also', 'can not', 'say anything', ',', 'but', 'Shaoan', 'his mother', 'no', 'care', 'he', '.']

pkuseg: ['although', 'Yuting', 'married', 'future', ',', 'he', 'wife', 'He Fengying', 'who', 'in', 'to', 'Shaoan', 'mother', 'bullying', 'on', 'a', 'back', 'and', 'a', 'back', ',', 'fear', 'wife', 'of', 'Yuting', 'even', 'a', 'sound', 'also', 'no', 'dare', 'throat', ',', 'but', 'Shaoan', 'his mother', 'no', 'care', 'he', '.']

  For the second sentence, the output results are as follows:

Original: As previously reported, in May last year Israel became the first country to use the F-35 fighter jets in actual combat.

ltp: ['data', 'after', 'reports', ',', 'Israel', 'in', 'last year', 'May', 'become', 'the world', 'on', 'the first', 'a', 'in', 'real', 'in', 'use', 'F-35', 'aircraft', 'a', 'country', '.']

jieba: ['accordingly', 'before', 'reports', ',', 'Israel', 'in', 'last', '5', 'month', 'become', 'world', '', 'first', 'in', 'real', 'in', 'use', 'F', '-', '35', 'aircraft', 'a', 'country', '.']

pkuseg: ['data', 'after', 'reports', ',', 'Israel', 'in', 'last year', 'May', 'become', 'the world', 'on', 'the first', 'a', 'in', 'real', 'in', 'use', 'F-35 fighter jets', 'a', 'country', '.']

  For the third sentence, the output results are as follows:

Original: On April 8 the boat went by the Yangtze to Little Bird Island.

ltp: ['boat', 'April', '8', 'by the Yangtze River', 'go', 'Little Bird Island', '.']

jieba: ['boat', '4', 'month', '8', 'Nikkei', 'Yangtze', 'go', 'small', 'Bird Island', '.']

pkuseg: ['boat', 'April', '8', 'by', 'Yangtze', 'go', 'bird', 'island', '.']

  For the fourth sentence, the output results are as follows:

Original: In 1958, Aida Er · A Lekan was born in Ankara, the capital of Turkey, but he spent most of his school years in the United States.

ltp: ['1958 Nian', ',', 'Aida Er · A Lekan', 'born', 'in', 'Turkey', 'capital', 'Ankara', ',', 'but', 'he', 'and', 'school', 'career', 'more than', 'spend', 'America', '', '.']

jieba: ['1958', 'in', ',', 'Egypt', 'Total', '*', 'Al', 'Hum', 'born', 'in', 'Turkey', 'capital', 'Ankara', ',', 'but', 'he', 'a', 'school', 'career', 'more than', 'in', 'America', 'spend', '.']

pkuseg: ['1958 Nian', ',', 'Aida Er · A Lekan', 'born', 'in', 'Turkey', 'capital', 'Ankara', ',', 'but', 'he', 'and', 'school', 'career', 'more than', 'spend', 'America', '', '.']

  Next, a brief summary of the test cases above:

    1. User dictionary: LTP and pkuseg both give good results, while jieba's performance is unsatisfactory, mainly because the custom dictionary contains a word with punctuation in it. For a solution to this problem, see: https://blog.csdn.net/weixin_42471956/article/details/80795534

    2. Judging from the third sentence, pkuseg gives the best segmentation: 'by' should be split out as a single word, but LTP and jieba fail to do so even with the custom dictionary added (LTP merges it into 'by the Yangtze River', and jieba merges it with the preceding character, giving 'Nikkei'). The same holds for 'F-35 fighter jets' in the second sentence.
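
  One generic way to keep a dictionary word with punctuation in one piece is to cut the sentence around such words before handing the rest to the segmenter. The sketch below illustrates that idea only; it is not necessarily the fix described at the URL in point 1. The names `segment_with_protected` and `char_segment` are made up for this example, and a character-level splitter stands in for a real segmenter so that the code runs without any third-party package.

```python
import re

def segment_with_protected(sent, protected, segment):
    """Segment `sent` with `segment`, keeping each word in `protected` whole."""
    if not protected:
        return list(segment(sent))
    # Longest words first, so overlapping entries prefer the longer match.
    ordered = sorted(protected, key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(w) for w in ordered) + ')'
    tokens = []
    # re.split with a capturing group keeps the matched protected words.
    for piece in re.split(pattern, sent):
        if not piece:
            continue
        if piece in protected:
            tokens.append(piece)
        else:
            tokens.extend(segment(piece))
    return tokens

# Stand-in segmenter for the demo: one token per non-space character.
def char_segment(text):
    return [ch for ch in text if not ch.isspace()]

print(segment_with_protected('AB·CD EF', ['B·C'], char_segment))
# ['A', 'B·C', 'D', 'E', 'F']
```

  In practice `segment` would be something like `jieba.cut`, which yields tokens and so fits the same interface. Only entries that actually contain punctuation should be protected this way: a short ordinary word such as 'by' would also be matched inside longer words.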

  Overall, all three tools segment well and the gap between them is small, but as far as custom dictionaries are concerned, pkuseg is clearly the most stable.
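
  The comparison above was done by eye. A small helper can make the same check mechanical by reporting which dictionary words come out as a single token. This is a minimal sketch: `lexicon_hits` is a hypothetical name, and the token lists are abridged from the second sentence's output.

```python
def lexicon_hits(tokens, lexicon):
    """Map each dictionary word to True if it appears as a single token."""
    token_set = set(tokens)
    return {word: word in token_set for word in lexicon}

# Tail of each tool's output for the second sentence (abridged).
results = {
    'ltp': ['use', 'F-35', 'aircraft', 'a', 'country', '.'],
    'jieba': ['use', 'F', '-', '35', 'aircraft', 'a', 'country', '.'],
    'pkuseg': ['use', 'F-35 fighter jets', 'a', 'country', '.'],
}

for name, tokens in results.items():
    print(name, lexicon_hits(tokens, ['F-35 fighter jets']))
```

  A word that does not occur in the sentence at all is also reported as False, so the check is only meaningful for dictionary words the test sentence contains.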
  For an introduction to pkuseg and its usage, see: https://github.com/lancopku/PKUSeg-python




Origin www.cnblogs.com/chen8023miss/p/11447141.html