Learning jieba word segmentation and HMM

Question 1: How does jieba's Chinese word segmentation work?

Question 2: How is the HMM used in jieba?

Question 3: What applications does the HMM have in other fields?

 

The first step in learning something is to read its official site: https://github.com/fxsjy/jieba

The official site lists the algorithms jieba uses:

    • Efficient word-graph scanning based on a prefix dictionary, which builds a directed acyclic graph (DAG) of all the possible words that the characters in a sentence can form
    • Dynamic programming to find the maximum-probability path, i.e. the best segmentation based on word frequency
    • For unknown (out-of-vocabulary) words, an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm
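The first two points can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not jieba's own code: the dictionary `FREQ` and its counts are invented for the example, and the names `build_dag`, `best_path` and `segment` are hypothetical. The DAG maps each start index to the end indices of the dictionary words that start there, and the dynamic programming step walks the sentence from right to left, keeping the best summed log frequency.

```python
import math

# Toy prefix dictionary: word -> frequency (invented counts, not jieba's dict.txt).
FREQ = {
    "我": 5, "来": 3, "来到": 8, "到": 4, "北": 1, "北京": 10, "京": 1,
    "清": 1, "清华": 6, "清华大学": 7, "华": 1, "华大": 2, "大": 3,
    "大学": 9, "学": 2,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        # A character with no dictionary entry still forms a one-character node.
        dag[start] = ends or [start]
    return dag


def best_path(sentence, dag):
    """Right-to-left dynamic programming: route[i] = (best log-probability, end index)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(TOTAL)
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Follow the best end index stored for each position to read off the words."""
    route = best_path(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print(build_dag("我来到北京清华大学"))
print("/".join(segment("我来到北京清华大学")))
```

With these invented counts, position 5 ("清") has three outgoing edges (清, 清华, 清华大学), and the path through the single word 清华大学 wins because one frequent word costs less than several short ones.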

Feature overview:

Key features include: 1. word segmentation; 2. custom dictionaries: loading and adjusting a dictionary; 3. keyword extraction: the TF-IDF and TextRank algorithms; 4. part-of-speech tagging; 5. parallel segmentation; 6. Tokenize; 7. ChineseAnalyzer for the Whoosh search engine; 8. command-line segmentation

1. Segmentation

# The main functions are jieba.cut() and jieba.cut_for_search().
# jieba.cut takes three parameters: the string to segment; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used.
# jieba.cut_for_search takes two parameters: the string to segment, and whether to use the HMM model. It is suited to building inverted indexes for search engines and produces finer-grained segments.
# The string to segment may be a unicode string, a UTF-8 string, or a GBK string. Note: passing a GBK string directly is not recommended, because it may be unpredictably mis-decoded as UTF-8.

# Both jieba.cut and jieba.cut_for_search return an iterable generator. Use a for loop to get each word (unicode), or call jieba.lcut and jieba.lcut_for_search to get a list directly.
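To make "finer-grained" concrete, here is a rough sketch of the idea as I understand it: take the precise-mode words and, for each long word, also emit the shorter dictionary words found inside it. It is a simplified illustration, not jieba's implementation. `DICTIONARY`, `refine_for_search` and the hard-coded `precise` list are all assumptions made up for the example.

```python
# Hypothetical toy dictionary; jieba consults its own frequency table instead.
DICTIONARY = {"中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"}


def refine_for_search(words, dictionary):
    """Yield the dictionary sub-words (2 then 3 characters) of each word, then the word."""
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    gram = word[i:i + size]
                    if gram in dictionary:
                        yield gram
        yield word


# Assumed precise-mode output for the fragment "中国科学院计算所".
precise = ["中国科学院", "计算所"]
print(", ".join(refine_for_search(precise, DICTIONARY)))
```

Emitting both the short and the long forms means a search for either 科学院 or 中国科学院 can hit the same document in an inverted index.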

# encoding=utf-8
import jieba

# "I came to Tsinghua University in Beijing"
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

# "He came to the NetEase Hangzhou Research building"
seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print(", ".join(seg_list))

# "Xiao Ming got his master's degree at the Institute of Computing Technology, Chinese Academy of Sciences, then studied at Kyoto University in Japan"
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

Output:

[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Precise mode]: 我/ 来到/ 北京/ 清华大学

[New word recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but the Viterbi algorithm still identifies it)

[Search engine mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
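The new-word case above (杭研, "Hangzhou Research", which is not in the dictionary) is where the HMM comes in. Each character is tagged B, M, E or S (begin, middle or end of a word, or a single-character word), and the Viterbi algorithm picks the most likely tag sequence. The sketch below is a minimal illustration with hand-made probabilities. `START`, `TRANS` and `EMIT` are invented numbers rather than jieba's trained tables, and `viterbi` and `tags_to_words` are hypothetical helper names.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a multi-character word, Single-character word
NEG_INF = -1e9   # stand-in for log(0)

# Invented toy parameters; a real model is trained on a corpus.
START = {"B": math.log(0.6), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.4)}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.4), "E": math.log(0.6)},
    "E": {"B": math.log(0.6), "S": math.log(0.4)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}
EMIT = {
    "B": {"杭": 0.5, "大": 0.4, "研": 0.1},
    "M": {"研": 0.5, "大": 0.5},
    "E": {"研": 0.5, "厦": 0.5},
    "S": {"大": 0.5, "研": 0.2, "杭": 0.1, "厦": 0.2},
}


def emit_logp(state, char):
    p = EMIT[state].get(char, 0.0)
    return math.log(p) if p > 0 else NEG_INF


def viterbi(chars):
    """Return the most likely BMES tag sequence for the given characters."""
    # scores[t][s]: best log-probability of any tag sequence ending in state s at position t
    scores = [{s: START[s] + emit_logp(s, chars[0]) for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        scores.append({})
        back.append({})
        for s in STATES:
            prev, best = max(
                ((p, scores[t - 1][p] + TRANS[p].get(s, NEG_INF)) for p in STATES),
                key=lambda item: item[1],
            )
            scores[t][s] = best + emit_logp(s, chars[t])
            back[t][s] = prev
    # A word cannot stop on B or M, so the final tag is chosen from E and S only.
    last = max("ES", key=lambda s: scores[-1][s])
    tags = [last]
    for t in range(len(chars) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return tags[::-1]


def tags_to_words(chars, tags):
    """Cut the text after every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(chars[start:i + 1])
            start = i + 1
    return words


text = "杭研大厦"
tags = viterbi(text)
print(tags, tags_to_words(text, tags))
```

With these toy numbers the tags come out as B, E, B, E, so the text is cut into 杭研 and 大厦 even though neither word is looked up in any dictionary.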
Origin www.cnblogs.com/students/p/11391942.html