Using the Python jieba library

1, Basic introduction to the jieba library

  (1), jieba Library Overview 

         jieba is an excellent third-party library for Chinese word segmentation

         -  Chinese text has no spaces between words, so it must be segmented to obtain individual words
         -  jieba is a third-party library, so it must be installed separately
         -  jieba offers three segmentation modes; for basic use it is enough to master a single function

  (2), How jieba segmentation works

         jieba relies on a Chinese dictionary (lexicon) to segment text

         -  It uses the dictionary to estimate the probability that adjacent characters belong together
         -  Characters with a high association probability are grouped into words, which form the segmentation result
         -  Besides the built-in dictionary, users can also add custom words
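
The idea of choosing the most probable split can be illustrated with a toy example. The sketch below is not jieba's actual code: the dictionary `TOY_FREQ`, its made-up frequencies, and the fallback score for unknown characters are all assumptions for illustration. jieba's real implementation is more elaborate (its README describes a prefix dictionary, a word graph, and an HMM for unknown words), but the core step of picking the highest-probability path by dynamic programming looks roughly like this:

```python
import math

# Hypothetical toy dictionary: word -> frequency (made-up numbers, not jieba's data).
TOY_FREQ = {
    "研究": 50, "研究生": 30, "生命": 40, "命": 10, "生": 10,
    "起源": 20, "的": 100,
}

def segment(text, freq=TOY_FREQ):
    """Return the split of `text` with the highest total log-probability."""
    total = sum(freq.values())
    max_len = max(len(w) for w in freq)
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last word in that split)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in freq:
                logp = math.log(freq[piece] / total)
            elif len(piece) == 1:
                # Assumed fallback: unknown single characters get a small score.
                logp = math.log(0.5 / total)
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With these toy frequencies, `segment("研究生命的起源")` returns `['研究', '生命', '的', '起源']`: the split 研究 + 生命 scores higher than 研究生 + 命 because both of its words are more frequent in the toy dictionary.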

 

2, Using the jieba library

  (1), The three segmentation modes of jieba

         Precise mode, full mode, and search engine mode

         -  Precise mode: cuts the text into words exactly, with no redundant words
         -  Full mode: scans out every possible word in the text, so the result contains redundancy
         -  Search engine mode: builds on precise mode by splitting long words again
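
A quick way to see the difference between the three modes is to run the same sentence through each of them. The helper below is a small sketch; the name `segment_three_ways` is made up for this example, while `jieba.lcut`, its `cut_all` parameter, and `jieba.lcut_for_search` are the library's own functions. The import sits inside the function only so that defining the helper does not require jieba to be installed.

```python
def segment_three_ways(text):
    """Segment `text` in each of jieba's three modes and return the token lists."""
    import jieba  # third-party: pip install jieba

    return {
        "precise": jieba.lcut(text),                # precise mode (the default)
        "full": jieba.lcut(text, cut_all=True),     # full mode: every possible word
        "search": jieba.lcut_for_search(text),      # search engine mode: long words re-split
    }
```

For example, `for mode, tokens in segment_three_ways("我来到北京清华大学").items(): print(mode, tokens)` prints one token list per mode. The full and search engine lists typically contain overlapping words, while the precise list joins back into the original sentence.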

  (2), Common jieba functions
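
The functions used most often are `jieba.cut` (returns a generator), `jieba.lcut` (returns a list), their `_for_search` variants, and `jieba.add_word` for registering a custom word. The sketch below shows how `cut` and `lcut` relate and where `add_word` would fit; the helper name `cut_vs_lcut` and its `custom_word` parameter are made up for this example.

```python
def cut_vs_lcut(text, custom_word=None):
    """Segment `text` with jieba.cut and jieba.lcut and return both results as lists."""
    import jieba  # third-party: pip install jieba

    if custom_word:
        # Register a new word in jieba's in-memory dictionary before segmenting.
        jieba.add_word(custom_word)

    tokens_from_cut = list(jieba.cut(text))  # cut() yields tokens lazily
    tokens_from_lcut = jieba.lcut(text)      # lcut() returns the same tokens as a list
    return tokens_from_cut, tokens_from_lcut
```

In practice `lcut` is the most convenient choice for short scripts, and `cut` is useful when the text is large and the tokens are consumed one at a time.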


 

3, jieba application examples

 

4, Using jieba to count character appearances in Romance of the Three Kingdoms

 

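import jieba

# Read the full text of the novel; the file is closed automatically.
with open("D:\\三国演义.txt", "r", encoding="utf-8") as f:
    txt = f.read()

words = jieba.lcut(txt)  # segment the text in precise mode
counts = {}  # maps each word to the number of times it occurs

for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1  # add 1 each time the word appears

items = list(counts.items())  # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, largest first

for word, count in items[:15]:  # slicing is safe even with fewer than 15 words
    print("{0:<5}{1:>5}".format(word, count))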

 

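This prints the fifteen most frequent words. Cao Cao (曹操), one of the great warlords of his age, deservedly takes first place. The output still needs further processing, however: some of the words are not useful (they are not character names), and some words duplicate each other's meaning, such as alternative names for the same character.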
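
One possible way to do this further processing is to drop a set of unwanted words and to map alternative names onto a single name before counting. The sketch below is only an illustration: the contents of `EXCLUDES` and `ALIASES` are assumptions chosen for this example (孔明 and 玄德 are the courtesy names of 诸葛亮 and 刘备), and a real run would need both to be extended by inspecting the actual output.

```python
from collections import Counter

# Illustrative assumptions, not taken from the original article.
EXCLUDES = {"将军", "却说", "二人", "不可"}   # frequent words that are not character names
ALIASES = {"孔明": "诸葛亮", "玄德": "刘备"}   # alternative name -> canonical name

def count_characters(words, excludes=EXCLUDES, aliases=ALIASES):
    """Count words after merging aliases and dropping excluded or one-character words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:  # skip single-character tokens
            continue
        word = aliases.get(word, word)  # merge alternative names
        if word in excludes:
            continue
        counts[word] += 1
    return counts
```

With the segmented text from the script above, `count_characters(words).most_common(15)` would give the top fifteen entries after cleaning.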

 

This article is a repost. Original link: https://www.cnblogs.com/wkfvawl/p/9487165.html


Origin www.cnblogs.com/mxk123/p/11789328.html