Python basics tutorial: segmenting text with the jieba library

1. What is the jieba library?

jieba is a Chinese word segmentation library for Python. It splits a run of Chinese text into individual words, which is usually the first step for downstream natural language processing tasks such as text classification and sentiment analysis. Segmentation is based on a prefix dictionary, and the library also copes with harder cases such as ambiguous word boundaries and new words that are missing from the dictionary. It offers several segmentation modes (precise mode, full mode, and search engine mode) for different scenarios, and it supports user-defined dictionaries so that results can be tuned to domain-specific vocabulary.
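
To make the "prefix dictionary" idea concrete, here is a minimal pure-Python sketch. It is not jieba's actual code: the word frequencies in TOY_FREQ are made up, all function names are hypothetical, and the real library additionally uses a statistical model for words missing from the dictionary. The sketch builds a graph of every dictionary word that can start at each position, then picks the path with the highest total log probability.

```python
import math

# Toy word frequencies; hypothetical values, NOT taken from jieba's dictionary.
TOY_FREQ = {
    "我": 50, "喜欢": 30, "喜": 5, "欢": 5,
    "搜索": 20, "索引": 10, "引擎": 15, "搜索引擎": 8,
    "搜": 3, "索": 3, "引": 3, "擎": 1,
}


def build_prefix_dict(freq):
    """Copy freq and add every proper prefix of every word with frequency 0."""
    prefix_dict = dict(freq)
    for word in freq:
        for end in range(1, len(word)):
            prefix_dict.setdefault(word[:end], 0)
    return prefix_dict


def build_dag(sentence, prefix_dict):
    """Map each start index to the inclusive end indices of words starting there."""
    dag = {}
    for start in range(len(sentence)):
        ends = []
        end = start
        fragment = sentence[start]
        # Keep extending while the fragment is still a prefix of some word.
        while end < len(sentence) and fragment in prefix_dict:
            if prefix_dict[fragment] > 0:  # a real word, not just a prefix
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        dag[start] = ends or [start]  # unknown characters stand alone
    return dag


def best_route(sentence, dag, prefix_dict):
    """Dynamic programming from right to left over log probabilities."""
    log_total = math.log(sum(prefix_dict.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(prefix_dict.get(sentence[start:end + 1]) or 1)
             - log_total + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence, freq=TOY_FREQ):
    """Segment a sentence by following the best route through the word graph."""
    prefix_dict = build_prefix_dict(freq)
    dag = build_dag(sentence, prefix_dict)
    route = best_route(sentence, dag, prefix_dict)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print(segment("我喜欢搜索引擎"))  # ['我', '喜欢', '搜索引擎']
```

With these toy frequencies the single word 搜索引擎 scores higher than 搜索 followed by 引擎, so it is kept whole. Changing the frequencies changes the result, which is also the intuition behind tuning a user-defined dictionary.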

2. Install the jieba library

 pip install jieba

3. Check the jieba version

 pip show jieba

Name: jieba
Version: 0.42.1
Summary: Chinese Words Segmentation Utilities
Home-page: https://github.com/fxsjy/jieba
Author: Sun, Junyi
Author-email: [email protected]
License: MIT
Requires:
Required-by:
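
The version can also be checked from inside Python. The following is a small sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if jieba is not installed.
print(installed_version("jieba"))
```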

4. How to use

1. Import library

import jieba

2. Define the text that needs word segmentation

text = "我爱发动态,我喜欢使用搜索引擎模式进行分词"

3. Segment the text with one of the three modes

3.1. Precise mode (default)
Tries to cut the sentence as accurately as possible, which makes it suitable for text analysis.

seg_list = jieba.cut(text)

3.2. Full mode
Scans out every word in the sentence that can be formed from the dictionary. This is very fast, but it cannot resolve ambiguity, so the resulting words may overlap.

seg_list = jieba.cut(text, cut_all=True)
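
Why full mode produces overlapping words can be illustrated without jieba at all. The sketch below uses a made-up five-word dictionary (TOY_DICT) and a hypothetical helper, scan_all_words, that lists every dictionary word found at every position. It is a simplification of the idea and ignores how the real library treats single characters and punctuation.

```python
# Toy dictionary; a hypothetical stand-in for jieba's much larger word list.
TOY_DICT = {"搜索", "索引", "引擎", "搜索引擎", "模式"}


def scan_all_words(sentence, dictionary):
    """Return every dictionary word found in the sentence, overlaps included."""
    max_len = max(map(len, dictionary))
    found = []
    for start in range(len(sentence)):
        # Try every candidate length that could still be a dictionary word.
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            candidate = sentence[start:end]
            if candidate in dictionary:
                found.append(candidate)
    return found


print(scan_all_words("搜索引擎模式", TOY_DICT))
# ['搜索', '搜索引擎', '索引', '引擎', '模式']
```

No choice is made between 搜索, 索引, 引擎, and 搜索引擎: all of them are reported, which is what "cannot resolve ambiguity" means in practice.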

3.3. Search engine mode
Builds on precise mode by splitting long words a second time to improve recall, which makes it suitable for building search engine indexes.

seg_list = jieba.cut_for_search(text)
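
The "split long words again" step can be sketched in plain Python. This is a simplified illustration under stated assumptions, not jieba's code: TOY_DICT is a made-up dictionary, resplit_long_words is a hypothetical helper, and the input list stands in for a precise-mode result. For each word longer than two characters, every two-character and three-character substring found in the dictionary is emitted before the word itself.

```python
# Toy dictionary; a hypothetical stand-in for jieba's word list.
TOY_DICT = {"搜索", "索引", "引擎", "搜索引擎"}


def resplit_long_words(words, dictionary):
    """Emit dictionary sub-words of each long word, followed by the word itself."""
    result = []
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    piece = word[start:start + size]
                    if piece in dictionary:
                        result.append(piece)
        result.append(word)
    return result


# The input plays the role of a precise-mode segmentation.
print(resplit_long_words(["使用", "搜索引擎", "模式"], TOY_DICT))
# ['使用', '搜索', '索引', '引擎', '搜索引擎', '模式']
```

Because the shorter words are added alongside the long one, a search for 引擎 can match text that contains 搜索引擎, which is the recall improvement described above.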

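4. Convert the segmentation result into a list

jieba.cut and jieba.cut_for_search return generators, so wrap the result in list() before printing or reusing it.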

word_list = list(seg_list)

5. Print the segmentation result

print(word_list)

6. Comparing the output of the three modes

6.1. Precise mode (default)

['我爱发', '动态', ',', '我', '喜欢', '使用', '搜索引擎', '模式', '进行', '分词']

6.2. Full mode

['我', '爱', '发动', '动态', ',', '我', '喜欢', '使用', '搜索', '搜索引擎', '索引', '引擎', '模式', '进行', '分词']

6.3. Search engine mode

['我爱发', '动态', ',', '我', '喜欢', '使用', '搜索', '索引', '引擎', '搜索引擎', '模式', '进行', '分词']

Origin blog.csdn.net/ooowwq/article/details/130705753