Word frequency analysis

Word frequency analysis counts how many times each word appears in a text or paragraph. For an English article we can use the split() function to break a paragraph into words; for Chinese, we can use the jieba library to segment the text.
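For English text the counting idea needs no segmentation library at all; splitting on whitespace with split() is enough. A minimal sketch, with a made-up sentence just for illustration:

```python
# Count word frequencies in an English sentence using str.split().
text = "the quick brown fox jumps over the lazy dog the end"

counts = {}
for word in text.split():
    # get() returns 0 for words we have not seen yet
    counts[word] = counts.get(word, 0) + 1

print(counts["the"])  # 'the' appears three times
```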

import jieba

# jieba offers three segmentation modes
txt = "中华人民共和国万岁，中国共产党万岁，中国人民万岁！"
words1 = jieba.lcut(txt)                # precise mode
words2 = jieba.lcut(txt, cut_all=True)  # full mode
words3 = jieba.lcut_for_search(txt)     # search engine mode
print(words1)
print(words2)
print(words3)

The lists below are the results of segmenting this txt string with the three segmentation modes jieba provides.

['中华人民共和国', '万岁', '，', '中国共产党', '万岁', '，', '中国', '人民', '万岁', '！']
['中华', '中华人民', '中华人民共和国', '华人', '人民', '人民共和国', '共和', '共和国', '万岁', '', '', '中国', '中国共产党', '国共', '共产', '共产党', '万岁', '', '', '中国', '国人', '人民', '万岁', '', '']
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '万岁', '，', '中国', '国共', '共产', '共产党', '中国共产党', '万岁', '，', '中国', '人民', '万岁', '！']
Loading model cost 0.579 seconds.

The three segmentation modes are precise mode, full mode, and search engine mode. For everyday analysis I usually use precise mode.

Next, let's run a word frequency analysis on the Chinese text of the 19th National Congress report, using the dictionary's get() method.

import jieba

txt = open("../text/nineteen Report.txt", 'r', encoding="GBK").read()
words = jieba.lcut(txt)

counts = {}
# punctuation and common function words to exclude from the counts
excludes = {'，', '。', '！', '：', '、', '；', '“', '”', '\n', '的', '和', '在', '了'}
for word in words:
    if word in excludes:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

# Sort by occurrence count. A dictionary is not a sequential
# data type, so convert it to a list of (word, count) pairs first.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    print(items[i])

The result is:

Prefix dict has been built successfully.
('development', 212)
('China', 168)
('the people', 157)
('construction', 148)
('socialism', 146)
('uphold', 130)
('party', 103)
('country', 90)
('comprehensive', 88)
('achieve', 83)
('system', 83)
('advance', 81)
('society', 80)
('politics', 80)
('characteristics', 79)

With that, the word frequency analysis of the 19th National Congress report is complete.
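As an aside, the count-then-sort steps above can also be written more compactly with the standard library's collections.Counter, which bundles counting and ranking; a sketch with toy data standing in for the real segmentation output:

```python
# Counter replaces the manual get()-based counting loop,
# and most_common() replaces the list(...)/sort() step.
from collections import Counter

words = ["development", "China", "development", "people", "development", "China"]
counts = Counter(words)
for word, count in counts.most_common(2):
    print(word, count)  # prints the two most frequent words
```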

Note: for words we do not want to appear in the results, besides the excludes approach above, we can also count every word first and then remove the unwanted entries afterwards with del counts[word].
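A minimal sketch of that alternative, with a toy word list standing in for the real segmentation output:

```python
# Count every token first, then delete unwanted entries with del.
words = ["people", "，", "development", "，", "people"]

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

for word in ["，", "。"]:
    if word in counts:  # del raises KeyError if the key is absent
        del counts[word]

print(counts)
```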

 



Originally published at www.cnblogs.com/0422hao/p/11703570.html