Python third-party library jieba (Chinese word segmentation)
I. Overview
jieba is an excellent third-party library for Chinese word segmentation.
- Chinese text has no spaces between words, so individual words must be obtained through word segmentation
- jieba is a third-party library and must be installed separately
- jieba offers three segmentation modes; in the simplest case you only need to master one function
II. Installation instructions
Automatic installation (from the cmd command line): pip install jieba
When the installation succeeds, pip prints a confirmation message.
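If you want to confirm the installation from Python rather than from the pip output, a small check along these lines may help. The helper name is_installed is made up for this illustration; it only relies on the standard library's importlib.util.find_spec.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    # find_spec returns None when the package cannot be found.
    return importlib.util.find_spec(package_name) is not None


# Prints True after a successful "pip install jieba", False otherwise.
print("jieba installed:", is_installed("jieba"))
```

This only checks that the package can be found; it does not import it or verify its version.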
III. Characteristics of jieba word segmentation
1. Principle: jieba segments text with the help of a Chinese dictionary
- it uses the dictionary to determine the probability that Chinese characters are associated with one another
- characters with a high probability of association are grouped into phrases, and these phrases form the segmentation result
- in addition to the words in the built-in dictionary, users can add custom words
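The principle above can be sketched with a toy example. This is not jieba's actual code: the dictionary TOY_FREQ and its frequencies are invented for illustration, and the function simply picks the split whose words have the highest combined probability.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real data).
TOY_FREQ = {
    "中国": 50, "是": 80, "一个": 40, "一": 30, "个": 30, "伟大": 20,
    "的": 100, "国家": 35, "国": 10, "家": 10, "中": 10, "国是": 1,
    "伟": 1, "大": 10,
}
TOTAL = sum(TOY_FREQ.values())


def segment(sentence, freq=TOY_FREQ, total=TOTAL):
    """Return the split of `sentence` with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (score, end) for the best segmentation of sentence[i:]
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Dictionary words are candidates; an unknown single character
            # is also allowed, with a small fallback frequency of 1.
            if word in freq or j == i + 1:
                score = math.log(freq.get(word, 1) / total) + best[j][0]
                candidates.append((score, j))
        best[i] = max(candidates)
    # Walk the recorded split points from left to right.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("中国是一个伟大的国家"))
# With the toy frequencies above: ['中国', '是', '一个', '伟大', '的', '国家']
```

Adding a custom word corresponds, in this sketch, to adding a new entry to TOY_FREQ so that it becomes a candidate during the search.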
2. The three segmentation modes:
- Precise mode: cuts the sentence as accurately as possible, with no redundant words; suitable for text analysis;
- Full mode: scans out every possible word in the sentence; very fast, but the result is redundant and ambiguity is not resolved;
- Search engine mode: re-segments long words on top of precise mode to improve recall; suitable for search engine tokenization.
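To make the difference between the modes concrete, here is a rough sketch that imitates full mode and search engine mode with a tiny, made-up word set (TOY_WORDS). It is a simplification for illustration, not jieba's implementation, and the precise-mode tokens are supplied by hand.

```python
# Hypothetical toy dictionary; jieba's real dictionary is far larger.
TOY_WORDS = {
    "中华", "华人", "人民", "共和", "共和国",
    "中华人民共和国", "是", "伟大", "的",
}


def full_mode(sentence, words=TOY_WORDS):
    """List every dictionary word found at any position (redundant output)."""
    found = []
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            if sentence[i:j] in words:
                found.append(sentence[i:j])
    return found


def search_mode(precise_tokens, words=TOY_WORDS):
    """Emit shorter dictionary words inside each long token, then the token."""
    result = []
    for token in precise_tokens:
        for size in (2, 3):
            # Only look for pieces that are strictly shorter than the token.
            if len(token) > size:
                for i in range(len(token) - size + 1):
                    piece = token[i:i + size]
                    if piece in words:
                        result.append(piece)
        result.append(token)
    return result


# Precise-mode tokens, written out by hand for this sketch.
precise = ["中华人民共和国", "是", "伟大", "的"]
print(full_mode("中华人民共和国是伟大的"))
print(search_mode(precise))
```

In this sketch the precise result contains one long token, while the other two modes also surface the shorter words inside it, which is what makes them redundant but better for recall.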
IV. Commonly used functions of the jieba library
1. jieba.lcut(s)  # precise mode; returns the segmentation result as a list
Sample code:
jieba.lcut("中国是一个伟大的国家")  # "China is a great country"
Output: ['中国', '是', '一个', '伟大', '的', '国家']
2. jieba.lcut(s, cut_all=True)  # full mode; returns the segmentation result as a list, with redundancy
Sample code:
jieba.lcut("中国是一个伟大的国家", cut_all=True)
Output: ['中国', '国是', '一个', '伟大', '的', '国家']
3. jieba.lcut_for_search(s)  # search engine mode; returns the segmentation result as a list, with redundancy
Sample code:
jieba.lcut_for_search("中华人民共和国是伟大的")  # "The People's Republic of China is great"
Output: ['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']
4. jieba.add_word(w)  # adds the new word w to the segmentation dictionary
Sample code:
jieba.add_word("蟒蛇语言")  # "Python language"
V. Word frequency statistics with the jieba library
Example: word frequency statistics for the first chapter of Journey to the West
Code:
import jieba

# Location of the text file on this computer
path_txt = r"C:\Users\86136\Desktop\Journey.txt"
# Assumes the file is UTF-8 encoded; change the encoding if it is not
with open(path_txt, "r", encoding="utf-8") as f:
    txt = f.read()

# Punctuation and whitespace tokens to leave out of the count
excludes = {",", ":", "“", "。", "”", "、", ";", " ", "!", "?", "\n"}

words = jieba.lcut(txt)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
for word in excludes:
    counts.pop(word, None)  # pop avoids a KeyError if the token never occurred

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:  # slicing is safe with fewer than 15 distinct words
    print("{0:<10}{1:>5}".format(word, count))
Result
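The counting step of the script above can also be written with collections.Counter from the standard library. The sketch below is one possible variant, not the original author's code: top_words is a made-up helper name, and the token list is a hypothetical stand-in for the output of jieba.lcut so that the example runs without the text file.

```python
from collections import Counter


def top_words(tokens, excludes, n=15):
    """Count tokens, skipping excluded ones, and return the n most common."""
    counts = Counter(token for token in tokens if token not in excludes)
    return counts.most_common(n)


# Hypothetical tokens standing in for jieba.lcut(txt).
tokens = ["悟空", ",", "师父", "悟空", "。", "八戒", "师父", "悟空", "\n"]
for word, count in top_words(tokens, {",", "。", "\n"}, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

Filtering before counting means no deletion step is needed afterwards, and most_common returns fewer items when the text has fewer than n distinct words.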