Chinese word segmentation in Python with jieba ("stuttering" word segmentation)

While crawling beauty sites, I needed to split keywords into words, and in the end I used jieba, the "stuttering" word segmentation library for Python.

Chinese word segmentation is a basic task in Chinese text processing, and jieba is a library built for it. Its implementation rests on three basic principles:

  1. Efficient word-graph scanning based on a Trie (prefix tree) structure, which generates a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form
  2. Dynamic programming to find the maximum-probability path through that graph, that is, the most likely segmentation based on word frequency
  3. For out-of-vocabulary words, an HMM based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm
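
The first two principles can be sketched in a few lines. The following is a minimal illustration written for Python 3, not jieba's actual code: the FREQ table and its counts are made up, a plain dict stands in for the Trie, and the HMM step for out-of-vocabulary words is left out (an unknown character is simply kept as a one-character word). The function names build_dag, best_path and segment are hypothetical.

```python
from math import log

# Toy word frequencies (made-up numbers, for illustration only).
FREQ = {
    "我": 50, "来": 20, "到": 30, "来到": 40,
    "北": 5, "京": 5, "北京": 60,
    "清": 5, "华": 5, "大": 30, "学": 20,
    "清华": 20, "华大": 2, "大学": 40, "清华大学": 30,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # unknown character: keep it as a one-character word
    return dag


def best_path(sentence, dag):
    """Right-to-left dynamic programming over the DAG.

    route[i] = (best log-probability of sentence[i:], end index of the first word).
    """
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (log(FREQ.get(sentence[start:end + 1], 1)) - log(TOTAL) + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Follow the best path from left to right and collect the words."""
    route = best_path(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print(" ".join(segment("我来到北京清华大学")))  # 我 来到 北京 清华大学
```

Working in log space turns the product of word probabilities into a sum, so long sentences do not underflow, and scanning from right to left means each route entry only depends on entries that are already computed.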

Installation (Linux environment)

Download the toolkit, decompress it, enter the directory, and run: python setup.py install

 

Modes

  1. Default mode, which tries to cut the sentence as precisely as possible; suitable for text analysis
  2. Full mode, which scans out every word that can be formed in the sentence; suitable for search engines
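
To make the difference concrete, here is a rough sketch of what full mode does, written for Python 3 against a toy dictionary. It is an illustration only, not jieba's implementation: the WORDS set and the names build_dag and cut_all_sketch are hypothetical. Every dictionary word found at any position is emitted, even when words overlap, whereas default mode would keep a single non-overlapping segmentation.

```python
# Toy dictionary (hypothetical; jieba ships a much larger one with frequencies).
WORDS = {"来到", "北京", "清华", "清华大学", "华大", "大学"}


def build_dag(sentence):
    """For each start index, list the end indices of every candidate word.

    A single character is always a candidate, so every position is covered.
    """
    n = len(sentence)
    return {
        start: [start] + [end for end in range(start + 1, n)
                          if sentence[start:end + 1] in WORDS]
        for start in range(n)
    }


def cut_all_sketch(sentence):
    """Yield every dictionary word found in the sentence, overlaps included."""
    dag = build_dag(sentence)
    covered = -1  # furthest index reached by any word emitted so far
    for start in range(len(sentence)):
        ends = dag[start]
        if len(ends) == 1 and start > covered:
            # No multi-character word starts here and no emitted word covers
            # this character, so emit it on its own.
            yield sentence[start]
            covered = start
        else:
            for end in ends:
                if end > start:
                    yield sentence[start:end + 1]
                    covered = max(covered, end)


print(" ".join(cut_all_sketch("我来到北京清华大学")))
# 我 来到 北京 清华 清华大学 华大 大学
```

The overlapping output (清华, 清华大学, 华大, 大学) is what makes this mode useful for search indexing: a query for any of those words can match the sentence.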

 

Interface

  • The library provides a single method for word segmentation: jieba.cut
  • The cut method accepts two input arguments:
    • The first argument is the string to be segmented
    • The cut_all argument selects the segmentation mode (cut_all=True for full mode; default mode otherwise)
  • The string to be segmented can be a GBK string, a UTF-8 string, or unicode
  • jieba.cut returns an iterable generator. You can use a for loop to get each word (unicode) produced by segmentation, or call list(jieba.cut(...)) to convert the result into a list
  • Example call: seg = jieba.cut("http://www.gg4493.cn/")

 

Example


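# -*- coding: utf-8 -*-
import jieba

# Full mode: list every word that can be formed, overlaps included
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + " ".join(seg_list))

# Default mode: one precise segmentation of the sentence
seg_list = jieba.cut("我来到北京清华大学")
print("Default Mode: " + " ".join(seg_list))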


Result

 
