Segmenting text with Python's jieba library


Foreword

Hello everyone, I am Kongkong star. In this article, I will show you how to segment text with Python's jieba library.


1. What is the jieba library?

Python's jieba library is a Chinese word segmentation tool: it splits a piece of Chinese text into individual words, which is convenient for downstream natural language processing tasks such as text classification and sentiment analysis. jieba uses a prefix-dictionary-based segmentation method that can handle many of the tricky cases in Chinese, such as ambiguous words and new words. It also provides several segmentation modes, including precise mode, full mode, and search engine mode, to suit different scenarios. In addition, jieba supports user-defined dictionaries, which can make the segmentation results more accurate.
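To illustrate the custom-dictionary support mentioned above, here is a minimal sketch. The word being added and the dictionary file name are assumptions made up for this example; jieba.add_word() and jieba.load_userdict() are the library's own calls.

import jieba

# Register a single custom word at runtime (the word here is only an example)
jieba.add_word("空空star")

# A user dictionary file can also be loaded; "user_dict.txt" is a hypothetical
# path, with one entry per line in the format: word [frequency] [part of speech]
# jieba.load_userdict("user_dict.txt")

print(list(jieba.cut("空空star喜欢自然语言处理")))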

2. Install the jieba library

 pip install jieba

3. Check the jieba version

 pip show jieba

Name: jieba
Version: 0.42.1
Summary: Chinese Words Segmentation Utilities
Home-page: https://github.com/fxsjy/jieba
Author: Sun, Junyi
Author-email: [email protected]
License: MIT
Requires:
Required-by:

4. How to use

1. Import the library

import jieba

2. Define the text that needs word segmentation

text = "我爱发动态,我喜欢使用搜索引擎模式进行分词"

3. Use word segmentation mode for word segmentation

3.1 Precise mode (default)

Tries to split the sentence as precisely as possible; suitable for text analysis.

seg_list = jieba.cut(text)

3.2 Full Mode

Scans out all possible words in the sentence; this is very fast, but it cannot resolve ambiguity.

seg_list = jieba.cut(text, cut_all=True)

3.3 Search Engine Mode

Based on the precise mode, long words are segmented again to improve recall; suitable for segmentation in search engines.

seg_list = jieba.cut_for_search(text)

4. Convert the word segmentation result into a list

word_list = list(seg_list)
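Note that jieba.cut() and jieba.cut_for_search() return generators, which is why list() is used here. jieba also offers jieba.lcut() and jieba.lcut_for_search(), which return lists directly:

word_list = jieba.lcut(text)             # equivalent to list(jieba.cut(text))
word_list = jieba.lcut_for_search(text)  # equivalent to list(jieba.cut_for_search(text))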

5. Print word segmentation results

print(word_list)

6. Comparison of the segmentation results

6.1 Precise mode (default)

['我爱发', '动态', ',', '我', '喜欢', '使用', '搜索引擎', '模式', '进行', '分词']

6.2 Full Mode

['我', '爱', '发动', '动态', ',', '我', '喜欢', '使用', '搜索', '搜索引擎', '索引', '引擎', '模式', '进行', '分词']

6.3 Search Engine Mode

['我爱发', '动态', ',', '我', '喜欢', '使用', '搜索', '索引', '引擎', '搜索引擎', '模式', '进行', '分词']
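For reference, the three outputs above can be reproduced with a short script along these lines (a minimal sketch that simply combines the calls shown earlier):

import jieba

text = "我爱发动态,我喜欢使用搜索引擎模式进行分词"

# Precise mode (default)
print(list(jieba.cut(text)))

# Full mode
print(list(jieba.cut(text, cut_all=True)))

# Search engine mode
print(list(jieba.cut_for_search(text)))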

Summary

This article introduced the jieba library, how to install it, and how to use its precise, full, and search engine modes to segment Chinese text into word lists for further processing.


Reprinted from: blog.csdn.net/weixin_38093452/article/details/130688568