1. Introduction to the jieba library
(1) Overview of the jieba library
jieba is an excellent third-party library for Chinese word segmentation.
- Chinese text must be segmented to obtain individual words.
- jieba is a third-party library, so it must be installed separately.
- jieba offers three segmentation modes; for the simplest use, you only need to learn one function.
(2) How jieba segmentation works
jieba relies on a Chinese dictionary for segmentation.
- It uses the dictionary to determine the association probabilities between Chinese characters.
- Characters with a high association probability are grouped into words, and these words form the segmentation result.
- In addition to the built-in dictionary, users can add custom words.
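To make the dictionary-and-probability idea concrete, here is a minimal toy sketch. It is not jieba's actual implementation: the dictionary `TOY_FREQ`, its frequencies, and the function `segment` are all hypothetical and made up for illustration. The sketch picks the split whose words have the highest combined probability, using simple dynamic programming.

```python
import math

# Hypothetical toy dictionary: word -> frequency (made-up numbers, not jieba's data).
TOY_FREQ = {"中国": 50, "是": 80, "一个": 40, "一": 30, "个": 30,
            "国": 10, "中": 10, "国是": 1}
TOTAL = sum(TOY_FREQ.values())


def segment(text, freq=TOY_FREQ, total=TOTAL, max_len=4):
    """Return the split of `text` with the highest product of word probabilities."""
    n = len(text)
    # best[i] = (log-probability, start index of the last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Unknown single characters get a pseudo-count of 1 so any text can be split.
            count = freq.get(word, 1 if len(word) == 1 else 0)
            if count == 0:
                continue
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("中国是一个"))
```

With these made-up frequencies, "中国" and "一个" are kept together because each is far more probable as one word than as two separate characters.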
2. Using the jieba library
(1) The three segmentation modes of jieba
Precise mode, full mode, and search engine mode
- Precise mode: splits the text precisely, producing no redundant words.
- Full mode: scans out every possible word in the text, so the result contains redundancy.
- Search engine mode: builds on precise mode by further segmenting long words.
(2) Commonly used jieba functions
3. jieba application examples
4. Counting character appearances in Romance of the Three Kingdoms with jieba
import jieba

# Read the full text of the novel
with open("D:\\三国演义.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # map each word to the number of times it appears
for word in words:
    if len(word) == 1:  # skip single-character words
        continue
    counts[word] = counts.get(word, 0) + 1  # add 1 each time the word appears

items = list(counts.items())  # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for word, count in items[:15]:  # print the 15 most frequent words
    print("{0:<5}{1:>5}".format(word, count))
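The output lists the fifteen most frequent words. Cao Cao, a formidable figure of his era, deservedly takes first place. The data still needs further processing, however: some words are not useful, and some different words refer to the same thing.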
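One possible way to do this further processing is sketched below. The exclusion set `EXCLUDES`, the alias table `ALIASES`, and the function `count_names` are hypothetical; in practice both tables would be filled in by inspecting the actual output. The function takes an already segmented word list (for example, the result of `jieba.lcut`), so it can be tried without the novel's text.

```python
from collections import Counter

# Hypothetical examples of frequent words that are not character names.
EXCLUDES = {"将军", "却说", "二人", "不可", "不能", "如此"}

# Hypothetical alias table: alternative names mapped to one canonical name.
ALIASES = {
    "孔明": "诸葛亮", "孔明曰": "诸葛亮",
    "玄德": "刘备", "玄德曰": "刘备",
    "关公": "关羽", "云长": "关羽",
}


def count_names(words, excludes=EXCLUDES, aliases=ALIASES, top=15):
    """Count words, merging aliases and skipping excluded or one-character words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue
        word = aliases.get(word, word)  # merge different names for one person
        if word in excludes:
            continue
        counts[word] += 1
    return counts.most_common(top)


# Small hand-made word list standing in for the output of jieba.lcut(txt).
sample = ["曹操", "孔明", "诸葛亮", "却说", "玄德曰", "刘备", "的", "曹操", "曹操"]
for word, count in count_names(sample):
    print("{0:<5}{1:>5}".format(word, count))
```

Running the count, inspecting the top results, and extending the two tables can be repeated until only character names remain.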