Using the Python jieba library

1, Basic introduction to the jieba library

  (1), Overview of the jieba library

         jieba is an excellent third-party library for Chinese word segmentation

         -  Chinese text has to be segmented to obtain individual words
         -  jieba is an excellent third-party Chinese word segmentation library and must be installed separately

         -  jieba offers three segmentation modes; at the simplest, you only need to master one function

  (2), How jieba segmentation works

         jieba segmentation relies on a Chinese word dictionary

         -  It uses a Chinese dictionary to determine the association probability between characters
         -  Characters with a high association probability are grouped into phrases, which form the segmentation result

         -  In addition to segmentation, users can also add custom phrases
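
The idea in the list above can be sketched with a toy example. This is a simplified illustration in plain Python, not jieba's actual implementation: the dictionary `FREQ`, its frequencies, and the `segment` function are all hypothetical. It picks the segmentation whose words have the highest combined probability, and shows how adding a custom phrase changes the result.

```python
import math

# Hypothetical toy dictionary: word -> frequency (made-up numbers).
FREQ = {"研究": 50, "研究生": 20, "生命": 40, "生": 5, "命": 5, "的": 100, "起源": 30}


def segment(text, freq):
    """Return the segmentation of `text` with the highest total log probability."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log probability of the best split of text[i:], end index of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Dictionary words are candidates; an unknown single character
            # is allowed as a fallback with an assumed frequency of 1.
            if word in freq or j == i + 1:
                logp = math.log(freq.get(word, 1) / total)
                candidates.append((logp + best[j][0], j))
        best[i] = max(candidates)
    # Walk the stored end indices to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


text = "研究生命的起源"
default_result = segment(text, FREQ)

# Adding a custom phrase (hypothetical frequency) changes the preferred split.
custom_freq = {**FREQ, "生命的起源": 60}
custom_result = segment(text, custom_freq)

print(default_result)
print(custom_result)
```

In the jieba library itself, custom phrases are added with `jieba.add_word(w)` rather than by editing a dictionary by hand.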

2, Using the jieba library

  (1), The three segmentation modes of jieba

         Precise mode, full mode, search engine mode

         -  Precise mode: cuts the text precisely, with no redundant words
         -  Full mode: scans out every possible word in the text, with redundancy

         -  Search engine mode: based on precise mode, long words are segmented again

  (2), Common jieba functions


 

3, jieba application examples

 

 

4, Using the jieba library to count character appearances in Romance of the Three Kingdoms

import jieba

txt = open("D:\\Three Kingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # maps each word to its number of occurrences

for word in words:
    if len(word) == 1:  # single-character words are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1  # add 1 to the word's count each time it occurs

items = list(counts.items())  # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # sort in descending order of count

for i in range(15):
    word, count = items[i]
    print("{0:<5}{1:>5}".format(word, count))

 

The program lists the fifteen most frequent words. Cao Cao, a dominant figure of his era, well deserves first place. However, the data still needs further processing, for example to remove useless words and to merge different words that refer to the same thing.
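
One possible way to do that further processing is sketched below. It assumes a `counts` mapping like the one built by the loop above. The numbers, the `excludes` set, and the `aliases` table are hypothetical placeholders, not results from the actual text; in practice they would be filled in by inspecting the real output.

```python
from collections import Counter

# Hypothetical counts in the shape produced by the word-frequency loop (made-up numbers).
counts = Counter({"曹操": 900, "孔明": 800, "将军": 700, "却说": 600,
                  "玄德": 500, "刘备": 400, "诸葛亮": 300})

# Assumed list of frequent words that are not character names.
excludes = {"将军", "却说"}

# Assumed alias table: alternative names mapped to one canonical name.
aliases = {"孔明": "诸葛亮", "玄德": "刘备"}

merged = Counter()
for word, n in counts.items():
    if word in excludes:
        continue  # drop useless words
    merged[aliases.get(word, word)] += n  # fold aliases into the canonical name

for name, n in merged.most_common(3):
    print("{0:<5}{1:>5}".format(name, n))
```

Both tables usually grow over several runs: print the top words, add any non-names to `excludes` and any new aliases to `aliases`, then run again.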


Origin www.cnblogs.com/w2538060594/p/12652429.html