1, Basic introduction to the jieba library
(1), Overview of the jieba library
jieba is an excellent third-party library for Chinese word segmentation
- Chinese text has to be segmented to obtain its individual words
- jieba is a third-party library, so it must be installed separately
- jieba offers three segmentation modes; in the simplest case you only need to master one function
(2), How jieba segmentation works
jieba relies on a Chinese dictionary to segment text
- it uses the dictionary to determine the association probability between Chinese characters
- characters with a high association probability are grouped into a word, and these words form the segmentation result
- in addition to the built-in dictionary, users can add their own custom words
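The idea of choosing the most probable split from a dictionary can be sketched in a few lines of plain Python. This is a simplified toy, not jieba's actual code: the dictionary `TOY_FREQ`, its counts, and the function name `segment` are all invented for illustration. It scores every possible split by the product of word frequencies and keeps the best one using dynamic programming.

```python
import math

# Toy dictionary: word -> frequency. These counts are invented for
# illustration and are NOT taken from jieba's real dictionary.
TOY_FREQ = {
    "研究": 50, "研究生": 30, "生命": 40, "命": 5, "生": 5,
    "的": 200, "起源": 20, "研": 1, "究": 1, "起": 3, "源": 2,
}


def segment(text, freq=TOY_FREQ):
    """Return the split of `text` with the highest total log-probability."""
    total = sum(freq.values())
    max_len = max(len(w) for w in freq)
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            # Unknown single characters get a pseudo-count of 1 so that
            # every position always has at least one candidate.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    # Walk the stored end indices to rebuild the chosen words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

With these toy counts, "研究" + "生命" scores higher than "研究生" + "命", which is why the first split wins. Adding a custom word corresponds to adding a new entry to the dictionary.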
2, Using the jieba library
(1), The three segmentation modes of jieba
Precise mode, full mode, and search engine mode
- Precise mode: cuts the text exactly, with no redundant words
- Full mode: scans out every possible word in the text, so the result contains redundancy
- Search engine mode: builds on precise mode by segmenting long words again
(2), Common jieba functions
3, jieba application example
4, Using the jieba library to count character appearances in Romance of the Three Kingdoms
import jieba

txt = open("D:\\Three Kingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # maps each word to its number of occurrences
for word in words:
    if len(word) == 1:  # single-character words are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1  # add 1 each time the word occurs
items = list(counts.items())  # convert the key-value pairs into a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, largest first
for i in range(15):
    word, count = items[i]
    print("{0:<5}{1:>5}".format(word, count))
The output lists the fifteen most frequent words. Cao Cao, the great figure of his generation, deservedly takes first place. The results still need further processing, however: some of the words are not character names at all, and some different words refer to the same person.
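One possible way to do this cleanup is sketched below. The function name `count_names`, the alias table, and the exclusion set are all hypothetical examples, not part of jieba; in practice they would be built by inspecting the actual output. The sample word list stands in for the result of `jieba.lcut(txt)` so that the sketch runs without the novel's text file.

```python
from collections import Counter


def count_names(words, aliases, excludes, top_n=15):
    """Count words after merging aliases and dropping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:  # single-character words are not counted
            continue
        word = aliases.get(word, word)  # map an alias to one canonical name
        if word in excludes:  # skip words that are not character names
            continue
        counts[word] += 1
    return counts.most_common(top_n)


# Illustrative inputs only; a real run would use the segmented novel.
aliases = {"孔明": "诸葛亮", "孔明曰": "诸葛亮", "玄德": "刘备", "玄德曰": "刘备"}
excludes = {"将军", "却说", "不能", "如此"}
sample = ["曹操", "孔明", "将军", "诸葛亮", "却说",
          "玄德", "曹操", "曰", "孔明曰", "曹操"]

for name, count in count_names(sample, aliases, excludes):
    print("{0:<5}{1:>5}".format(name, count))
```

Keeping the aliases and exclusions in plain data structures makes it easy to extend them after each run as new noise words show up in the top results.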