Python helps you analyze how many times Sun Wukong appears while making havoc in the Heavenly Palace

Table of Contents

Common functions of the jieba library

Precise mode

Full mode

Search engine mode

Add custom word segmentation

Chinese word segmentation case


First of all, I wish all you Internet bigwigs a happy holiday and everlasting wealth!

I am Little Gray Ape, a programmer who can write bugs!

The jieba library is an important third-party Chinese word segmentation library for Python.

Since it is a third-party library rather than a module that ships with Python, it needs to be installed through pip. The installation command is as follows:

pip install jieba

The principle behind the jieba library is a Chinese word dictionary: it matches the text to be segmented against the dictionary, builds a graph of all possible segmentations, and uses dynamic programming to find the most probable word sequence. Of course, no dictionary can cover every word, so jieba lets us add our own custom words to it according to our needs.
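To make that idea concrete, here is a minimal, hedged sketch of dictionary-based maximum-probability segmentation with dynamic programming. This is not jieba's actual implementation; the toy dictionary and its frequencies are invented purely for illustration:

import math

# Toy dictionary: word -> frequency (invented values, for illustration only)
FREQ = {
    "中华": 50, "中华人民共和国": 30, "人民": 80, "共和国": 40,
    "是": 1000, "一个": 500, "伟大": 60, "的": 2000, "国家": 90,
}
TOTAL = sum(FREQ.values())

def dp_cut(s):
    n = len(s)
    # best[i] = (log-probability of the best segmentation of s[i:], next cut index)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = s[i:j]
            # unknown single characters get a tiny fallback frequency of 1
            freq = FREQ.get(word, 1 if j == i + 1 else 0)
            if freq:
                candidates.append((math.log(freq / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    i, result = 0, []
    while i < n:    # follow the back-pointers to recover the chosen words
        j = best[i][1]
        result.append(s[i:j])
        i = j
    return result

print(dp_cut("中华人民共和国是一个伟大的国家"))
# -> ['中华人民共和国', '是', '一个', '伟大', '的', '国家']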

The jieba library supports three word segmentation modes (a quick demo follows this list):

Precise mode:

jieba.cut(s)

Cuts the sentence into the most precise segmentation; well suited for text analysis.

Full mode:

jieba.cut(s, cut_all=True)

Scans out every span of the sentence that can form a word; fast, but it cannot resolve ambiguity.

Search engine mode:

jieba.cut_for_search(s)

On the basis of the precise mode, long words are segmented again to improve recall; suitable for search engine indexing.
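As a quick sketch of all three modes side by side (using the same example sentence as the rest of this article; note that jieba.cut and jieba.cut_for_search return generators, so we join their output for printing):

import jieba

s = "中华人民共和国是一个伟大的国家"
print("/".join(jieba.cut(s)))                  # precise mode
print("/".join(jieba.cut(s, cut_all=True)))    # full mode
print("/".join(jieba.cut_for_search(s)))       # search engine mode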

Common functions of the jieba library

The commonly used functions in the jieba library are as follows:

Function                          Description
jieba.cut(s)                      Precise mode; returns an iterable generator
jieba.cut(s, cut_all=True)        Full mode; outputs all possible words in the text s
jieba.cut_for_search(s)           Search engine mode; results suited to search engine indexing
jieba.lcut(s)                     Precise mode; returns a list (recommended)
jieba.lcut(s, cut_all=True)       Full mode; returns a list (recommended)
jieba.lcut_for_search(s)          Search engine mode; returns a list (recommended)
jieba.add_word(w)                 Adds the new word w to the segmentation dictionary
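A tiny sketch of the practical difference between cut and lcut: the former returns a generator that produces words lazily, the latter returns the same words as a ready-made list:

import jieba

s = "中华人民共和国是一个伟大的国家"
it = jieba.cut(s)       # generator: yields words lazily as you iterate
lst = jieba.lcut(s)     # list: the same words, materialized all at once
print(list(it) == lst)  # True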

Next, let's walk through these common segmentation functions with concrete code:

Precise mode

The jieba.lcut() function performs precise-mode segmentation; the returned words reproduce the original text completely and without redundancy.

# Precise mode
import jieba
str1 = "中华人民共和国是一个伟大的国家"
list1 = jieba.lcut(str1)
print(list1)

['中华人民共和国', '是', '一个', '伟大', '的', '国家']

 

Full mode

The jieba.lcut(s, cut_all=True) function performs full-mode segmentation: every possible word is output, at the cost of heavy data redundancy.

# Full mode
import jieba
str1 = "中华人民共和国是一个伟大的国家"
list2 = jieba.lcut(str1, cut_all=True)
print(list2)

['中华', '中华人民', '中华人民共和国', '华人', '人民', '人民共和国', '共和', '共和国', '国是', '一个', '伟大', '的', '国家']

 

Search engine mode

The jieba.lcut_for_search(s) function performs search-engine-mode segmentation: it first runs precise mode, then further segments the long words in the result.

# Search engine mode
import jieba
str1 = "中华人民共和国是一个伟大的国家"
list3 = jieba.lcut_for_search(str1)
print(list3)

['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '一个', '伟大', '的', '国家']

 

Add custom word segmentation

However, the built-in dictionary of the jieba library is necessarily limited, so when our text contains custom words, the segmenter may not split them the way we need. In that case, we can use the add_word() function to add words to the dictionary. The effect is as follows:

import jieba

str2 = "灰哥哥正在努力的学习Python"
list4 = jieba.lcut(str2)
print(list4)    # without the custom word, "灰哥哥" is typically split into pieces
jieba.add_word("灰哥哥")
list5 = jieba.lcut(str2)
print(list5)    # after add_word, "灰哥哥" is kept as a single word

 

Chinese word segmentation case

Next, let's look at a concrete example that uses jieba's Chinese word segmentation to count how many times each character appears in the "Havoc in the Heavenly Palace" chapter of "Journey to the West".

import jieba

text = open("dntg.txt", encoding="utf-8").read()    # read the text of this chapter
words = jieba.lcut(text)

# put the character names that may appear into a list
nameWords = ["太白金星", "玉皇大帝", "太上老君", "唐僧", "东海龙王", "孙悟空", "马温", "悟空", "齐天大圣"]
swkWords = ["孙悟空", "马温", "悟空", "齐天大圣"]    # aliases of Sun Wukong

counts = {}    # dictionary that stores the counts
for word in words:
    if word not in nameWords:
        continue
    else:
        if word in swkWords:
            word = "孙悟空"    # merge all aliases into the canonical name
        counts[word] = counts.get(word, 0) + 1    # tally each segmented name
wordLists = list(counts.items())    # convert the dictionary items to a list
wordLists.sort(key=lambda x: x[1], reverse=True)    # sort by count, descending

for wordList in wordLists:
    word, count = wordList[0], wordList[1]
    print("{}    {}".format(word, count))

Word segmentation result:

As the results show, the characters appearing most often in the "Havoc in the Heavenly Palace" chapter are Sun Wukong, Taibai Jinxing, the Jade Emperor, Tang Seng, and the Dragon King of the East Sea. Among them, Sun Wukong appears the most often, 18 times.
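As a side note, the same counting logic can be written a bit more compactly with the standard library's collections.Counter. This is just an equivalent sketch of the loop above, assuming the same dntg.txt chapter file:

from collections import Counter
import jieba

words = jieba.lcut(open("dntg.txt", encoding="utf-8").read())

# map every alias of Sun Wukong onto the canonical name before counting
aliases = {"马温": "孙悟空", "悟空": "孙悟空", "齐天大圣": "孙悟空"}
nameWords = {"太白金星", "玉皇大帝", "太上老君", "唐僧", "东海龙王",
             "孙悟空", "马温", "悟空", "齐天大圣"}

counts = Counter(aliases.get(w, w) for w in words if w in nameWords)
for name, count in counts.most_common():    # already sorted, descending
    print("{}    {}".format(name, count))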

Well, that's all I have to share here about the Chinese word segmentation techniques of the jieba library.

If you found it helpful, remember to like and follow!

The Big Bad Wolf will keep making progress together with you!
