While scraping keywords from some websites, I needed to segment them into words, and in the end I settled on Python's jieba ("stutter") library.
Chinese word segmentation is a fundamental task in Chinese text processing, and jieba is a widely used library for it. Its implementation rests on three basic principles:
- Efficient word-graph scanning based on a Trie (prefix tree) structure, producing a directed acyclic graph (DAG) of all possible word formations of the characters in a sentence
- Dynamic programming to find the maximum-probability path through the DAG, i.e. the segmentation with the highest combined word frequency
- For unregistered (out-of-vocabulary) words, an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm
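The first two principles can be sketched in a few lines of plain Python. The mini-dictionary and its word frequencies below are made up purely for illustration; jieba's real dictionary and scoring are far larger and more refined.

```python
# Toy sketch of jieba's DAG + dynamic-programming steps.
# FREQ is a hypothetical mini-dictionary with invented frequencies.
import math

FREQ = {"我": 50, "来到": 30, "来": 40, "到": 35, "北京": 60, "北": 20, "京": 15}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index i, list every end index j where sentence[i:j] is a word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def best_path(sentence, dag):
    """Dynamic programming over the DAG: maximize the summed log frequency."""
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # fill the table right to left
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the chosen path to emit the segmented words
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

sentence = "我来到北京"
print(best_path(sentence, build_dag(sentence)))  # ['我', '来到', '北京']
```

Note how "来到" and "北京" win over their single-character alternatives because the whole-word path has higher total log probability.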
Installation (Linux environment)
Download the toolkit, decompress it, enter the directory, and run: python setup.py install
Modes
- Default mode, which tries to cut the sentence as precisely as possible, suitable for text analysis
- Full mode, which scans every substring of the sentence that forms a word, suitable for search engines
Interface
- The component provides only the jieba.cut method for word segmentation
- The cut method accepts two input parameters:
- The first parameter is the string to be segmented
- The cut_all parameter controls the segmentation mode (full vs. default)
- The string to be segmented may be a GBK string, a UTF-8 string, or unicode
- jieba.cut returns an iterable generator. You can use a for loop to get each word (unicode) produced by the segmentation, or use list(jieba.cut(...)) to convert it into a list
- For example: seg = jieba.cut("http://www.gg4493.cn/")
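The generator behavior described above can be demonstrated without jieba itself. The toy_cut function below is a hypothetical stand-in for jieba.cut (its tiny word list is invented); it shows why you convert the generator with list() if you need the results more than once:

```python
def toy_cut(sentence, words=("我", "来到", "北京")):
    """Hypothetical stand-in for jieba.cut: yields tokens lazily, longest match first."""
    i = 0
    while i < len(sentence):
        for w in words:
            if sentence.startswith(w, i):
                yield w
                i += len(w)
                break
        else:
            yield sentence[i]  # unknown character: emit it on its own
            i += 1

seg = toy_cut("我来到北京")
tokens = list(seg)       # materialize the generator into a list
print("/".join(tokens))  # 我/来到/北京
print(list(seg))         # [] -- a generator is exhausted after one pass
```

The same caveat applies to the real jieba.cut: iterate it once, or store the tokens in a list first.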
Example
# -*- coding: utf-8 -*-
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + " ".join(seg_list))

seg_list = jieba.cut("我来到北京清华大学")
print("Default Mode: " + " ".join(seg_list))
Result