One, basic introduction to the jieba library
1.1 jieba Library Overview
jieba is an excellent third-party library for Chinese word segmentation
- Chinese text has to be segmented in order to obtain individual words
- jieba is a third-party library, so it must be installed separately
- jieba provides three segmentation modes; the simplest way to get started is to master a single function
1.2 Installing the jieba library
pip install jieba
(run this from the cmd command line)
1.3 How jieba word segmentation works
jieba segmentation relies on a Chinese lexicon
- It uses a Chinese lexicon to determine the association probability between Chinese characters
- Characters with a high association probability are grouped into words, which form the segmentation result
- In addition to the built-in lexicon, users can add custom words
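The idea in the bullets above can be sketched in plain Python. This is only a toy illustration, not jieba's implementation: the `LEXICON` counts are made-up numbers, and `segment` and `max_word_len` are hypothetical names chosen for this example. It picks the split whose words have the highest combined probability, using dynamic programming from the end of the text.

```python
import math

# Hypothetical toy lexicon: word -> frequency count (made-up numbers).
LEXICON = {
    "中国": 50, "是": 80, "一个": 40, "伟大": 20, "的": 100, "国家": 30,
    "国是": 1, "中": 5, "国": 5, "一": 10, "个": 10, "伟": 1, "大": 10, "家": 5,
}
TOTAL = sum(LEXICON.values())


def segment(text, max_word_len=4):
    """Return the split of text with the highest total log-probability."""
    n = len(text)
    # best[i] = (score, end index of the first word) for the suffix text[i:]
    best = [(-math.inf, n)] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            word = text[i:j]
            freq = LEXICON.get(word)
            if freq is None:
                if j - i > 1:
                    continue  # unknown multi-character strings are not words
                freq = 1  # unknown single character: fall back to count 1
            score = math.log(freq / TOTAL) + best[j][0]
            if score > best[i][0]:
                best[i] = (score, j)
    # Follow the stored end indices to rebuild the word list.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(segment("中国是一个伟大的国家"))
```

With these made-up counts, "中国" + "是" wins over "中" + "国是" because the product of its word probabilities is larger. Adding a custom word amounts to inserting a new entry into the lexicon with a count high enough for it to win.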
Two, using the jieba library
2.1 The three segmentation modes of jieba
Precise mode, full mode, and search engine mode
- Precise mode: cuts the text precisely into words, with no redundant words
- Full mode: scans out every possible word in the text, so the result contains redundancy
- Search engine mode: starts from the precise mode result and segments long words again
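To make the search engine mode more concrete, here is a rough sketch of "segmenting long words again". It assumes the precise mode result is already available as a list; `resplit_for_search` and the tiny `LEXICON` are hypothetical names for this illustration, and jieba's real dictionary and code may differ in detail.

```python
# Hypothetical mini lexicon; jieba ships a much larger dictionary.
LEXICON = {"中华", "华人", "人民", "共和", "共和国", "中华人民共和国"}


def resplit_for_search(words, lexicon=LEXICON):
    """For each long word, also emit the shorter lexicon words it contains."""
    result = []
    for word in words:
        # Try 2-character pieces first, then 3-character pieces.
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    piece = word[i:i + size]
                    if piece in lexicon:
                        result.append(piece)
        # The original word is always kept, after its shorter pieces.
        result.append(word)
    return result


# Assumed precise mode result for "中华人民共和国是伟大的".
precise = ["中华人民共和国", "是", "伟大", "的"]
print(resplit_for_search(precise))
```

Short words pass through unchanged, so the extra (redundant) entries only come from long words, which is what helps a search engine match partial queries.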
2.2 Commonly used functions of the jieba library
Function | Description |
---|---|
jieba.lcut(s) | Precise mode; returns the segmentation result as a list |
jieba.lcut(s, cut_all=True) | Full mode; returns the segmentation result as a list, with redundancy |
jieba.lcut_for_search(s) | Search engine mode; returns the segmentation result as a list, with redundancy |
jieba.add_word(w) | Adds the new word w to the segmentation dictionary |
import jieba
jieba.lcut("中国是一个伟大的国家")
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.174 seconds.
Prefix dict has been built succesfully.
['中国', '是', '一个', '伟大', '的', '国家']
jieba.lcut("中国是一个伟大的国家",cut_all=True)
['中国', '国是', '一个', '伟大', '的', '国家']
jieba.lcut_for_search("中华人民共和国是伟大的")
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']
jieba.add_word("蟒蛇语言")
2.3 Key points of word segmentation
jieba.lcut(s)