Word Frequency Statistics with the jieba Library

The Python third-party library jieba (Chinese word segmentation)

I. Overview

jieba is a popular third-party library for Chinese word segmentation.
- Chinese text has no spaces between words, so it must be segmented into words before analysis
- jieba is a third-party library and requires separate installation
- jieba offers three segmentation modes, and the simplest usage needs only a single function

II. Installation

Install from the command line (cmd): pip install jieba

(Screenshot: output of a successful installation)

III. Features of jieba segmentation

1. Principle: jieba relies on a Chinese word dictionary

- it uses the dictionary to estimate the probability that adjacent Chinese characters form a word
- character sequences with a high word-formation probability make up the segmentation result
- besides the built-in dictionary words, users can also add custom words
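To make the principle concrete, here is a toy sketch of dictionary-based segmentation (the lexicon and its probabilities are invented for illustration; this is not jieba's actual algorithm or data): among all ways to split the sentence, pick the one whose words have the highest total log-probability.

```python
import math

# Toy lexicon: word -> relative frequency (made-up numbers, not jieba's dictionary)
lexicon = {"中国": 0.2, "是": 0.1, "一个": 0.1, "伟大": 0.1, "的": 0.2, "国家": 0.1}

def segment(sentence, lexicon, unk_prob=1e-6):
    """Dictionary-based segmentation: choose the split whose words have the
    highest total log-probability (dynamic programming over end positions)."""
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)  # best[i] = (score, start index of last word)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - 5), i):  # consider words up to 5 characters
            word = sentence[j:i]
            # unknown single characters get a tiny fallback probability
            prob = lexicon.get(word, unk_prob if i - j == 1 else 0.0)
            if prob > 0 and best[j][0] + math.log(prob) > best[i][0]:
                best[i] = (best[j][0] + math.log(prob), j)
    # Walk back through the recorded split points to recover the words
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return words[::-1]

print(segment("中国是一个伟大的国家", lexicon))
# → ['中国', '是', '一个', '伟大', '的', '国家']
```

Because "中国" has a much higher probability than the two single characters "中" and "国", the dictionary split wins, which is the behaviour the bullet points above describe.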

2. The three segmentation modes:

  • Precise mode: cuts the sentence as accurately as possible, with no redundant words; suited for text analysis;
  • Full mode: scans out every possible word in the sentence; very fast, but redundant and does not resolve ambiguity;
  • Search engine mode: on top of precise mode, re-segments long words to improve recall; suited for search-engine indexing.
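The contrast between the modes can be sketched in a few lines of plain Python (again with an invented lexicon, not jieba's): "full mode" amounts to listing every dictionary word found anywhere in the sentence, redundancy included.

```python
# Toy illustration of "full mode": list every lexicon word that occurs
# anywhere in the sentence (made-up lexicon; not jieba's real algorithm)
lexicon = {"中国", "国是", "是", "一个", "伟大", "的", "国家"}
sentence = "中国是一个伟大的国家"

hits = [sentence[j:i]
        for i in range(1, len(sentence) + 1)
        for j in range(i)
        if sentence[j:i] in lexicon]
print(hits)
# → ['中国', '国是', '是', '一个', '伟大', '的', '国家']
```

Note the overlap: both "中国" and "国是" appear, because full mode keeps every match and leaves the ambiguity unresolved, exactly as described above.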

IV. Commonly used jieba functions

1. jieba.lcut(s)  # precise mode; returns the segmentation result as a list

Sample code:

jieba.lcut("中国是一个伟大的国家")

Output: ['中国', '是', '一个', '伟大', '的', '国家']

2. jieba.lcut(s, cut_all=True)  # full mode; returns the segmentation result as a list, with redundancy

Sample code:

jieba.lcut("中国是一个伟大的国家", cut_all=True)

Output: ['中国', '国是', '一个', '伟大', '的', '国家']

3. jieba.lcut_for_search(s)  # search-engine mode; returns the segmentation result as a list, with redundancy

Sample code:

jieba.lcut_for_search("中华人民共和国是伟大的")

Output: ['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']

4. jieba.add_word(w)  # add the new word w to jieba's dictionary

Sample code:

jieba.add_word("蟒蛇语言")

V. Example: word frequency statistics for the first chapter of Journey to the West

Code

import jieba

path_txt = "C:\\Users\\86136\\Desktop\\西游记.txt"   # location of the text file on the computer
txt = open(path_txt, "r", encoding="utf-8").read()
excludes = {"，", "。", "：", "；", "！", "？", "“", "”", "、", " ", "\n"}   # punctuation to exclude
words = jieba.lcut(txt)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
for word in excludes:
    counts.pop(word, None)   # pop() avoids a KeyError if a symbol never appears
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
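The counting loop above can also be written more compactly with collections.Counter. In this sketch the word list is stubbed with a few invented tokens so it runs without jieba; in the real script it would come from jieba.lcut(txt).

```python
from collections import Counter

# Stand-in token list; in the real script this comes from jieba.lcut(txt)
words = ["悟空", "师父", "悟空", "八戒", "师父", "悟空", "，", "。"]
excludes = {"，", "。"}  # punctuation to drop, as in the script above

# Counter builds the word -> count mapping in one pass
counts = Counter(w for w in words if w not in excludes)
for word, count in counts.most_common(3):  # top 3 by frequency
    print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) replaces the manual list(), sort(), and range(15) slicing with a single call, which keeps the script focused on the segmentation step.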

Result

(Screenshot: the 15 most frequent words and their counts)


Origin www.cnblogs.com/Mindf/p/12640269.html