1. What is the jieba library?
jieba is a Chinese word segmentation library for Python. It splits a piece of Chinese text into individual words, which is the usual first step for downstream natural language processing tasks such as text classification and sentiment analysis. Segmentation is based on a prefix dictionary, with additional handling for ambiguous words and new (out-of-dictionary) words. jieba provides several segmentation modes (precise mode, full mode, and search engine mode) to suit different scenarios. It also supports user-defined dictionaries, which can make results more accurate for domain-specific vocabulary.
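To make the prefix-dictionary idea concrete, here is a minimal sketch in plain Python. It is a toy, not jieba's actual code: the word frequencies in `FREQ` are made-up values, and the function names are hypothetical. It builds a prefix dictionary, finds every dictionary word that starts at each position, and then picks the most probable path through those candidates.

```python
import math

# Toy word frequencies (made-up values, not jieba's real dictionary).
FREQ = {"我": 50, "喜欢": 30, "使用": 30, "搜索": 20,
        "索引": 10, "引擎": 15, "搜索引擎": 25}


def build_prefix_dict(freq):
    """Copy the words and add every proper prefix with frequency 0."""
    prefix = dict(freq)
    for word in freq:
        for i in range(1, len(word)):
            prefix.setdefault(word[:i], 0)
    return prefix


def build_dag(sentence, prefix):
    """Map each start index to the end indices of the words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        frag = sentence[start]
        # Keep extending while the fragment is still a known prefix.
        while end < n and frag in prefix:
            if prefix[frag] > 0:  # a real word, not only a prefix
                ends.append(end)
            end += 1
            frag = sentence[start:end + 1]
        # Unknown characters fall back to a single-character token.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, dag, prefix):
    """Right-to-left dynamic programming over log probabilities."""
    log_total = math.log(sum(prefix.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(prefix.get(sentence[start:end + 1]) or 1)
             - log_total + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence, freq):
    prefix = build_prefix_dict(freq)
    route = best_route(sentence, build_dag(sentence, prefix), prefix)
    words, pos = [], 0
    while pos < len(sentence):
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words


print(segment("我喜欢使用搜索引擎", FREQ))
# ['我', '喜欢', '使用', '搜索引擎']

# Without the long word in the dictionary, the result changes.
smaller = {w: f for w, f in FREQ.items() if w != "搜索引擎"}
print(segment("我喜欢使用搜索引擎", smaller))
# ['我', '喜欢', '使用', '搜索', '引擎']
```

The second call hints at why user-defined dictionaries matter: whether a word is present in the dictionary directly changes which segmentation wins.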
2. Install the jieba library
pip install jieba
3. Check the jieba version
pip show jieba
Name: jieba
Version: 0.42.1
Summary: Chinese Words Segmentation Utilities
Home-page: https://github.com/fxsjy/jieba
Author: Sun, Junyi
Author-email: [email protected]
License: MIT
Requires:
Required-by:
4. How to use
1. Import library
import jieba
2. Define the text that needs word segmentation
text = "我爱发动态,我喜欢使用搜索引擎模式进行分词"
3. Choose a segmentation mode
3.1. Precise mode (default)
Tries to cut the sentence as precisely as possible. Suitable for text analysis.
seg_list = jieba.cut(text)
3.2. Full mode
Scans out every word that can be formed in the sentence. This is very fast, but it cannot resolve ambiguity.
seg_list = jieba.cut(text, cut_all=True)
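As a rough illustration of what "scanning all possible words" means, the sketch below emits every dictionary word found at every position, without choosing between overlapping candidates. It is a simplified toy in plain Python, not jieba's implementation; the word set `WORDS` and the function name `full_mode` are assumptions made for this example.

```python
# Toy dictionary (hypothetical, far smaller than jieba's real one).
WORDS = {"搜索", "索引", "引擎", "搜索引擎", "模式"}


def full_mode(sentence, words=WORDS):
    """Emit every multi-character dictionary word, overlapping or not."""
    result = []
    covered_until = -1  # last index covered by an emitted word
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start + 1, n)
                if sentence[start:end + 1] in words]
        if ends:
            for end in ends:
                result.append(sentence[start:end + 1])
                covered_until = max(covered_until, end)
        elif start > covered_until:
            # A character no word covers is emitted on its own.
            result.append(sentence[start])
            covered_until = start
    return result


print(full_mode("搜索引擎模式"))
# ['搜索', '搜索引擎', '索引', '引擎', '模式']
```

Because overlapping candidates such as 搜索, 索引, and 引擎 are all kept, no decision is made about which reading is correct. That is why full mode is fast but leaves ambiguity unresolved.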
3.3. Search engine mode
Builds on precise mode: long words are segmented again into shorter words to improve recall. Suitable for building search engine indexes.
seg_list = jieba.cut_for_search(text)
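The re-segmentation of long words can be sketched as follows. This is a toy in plain Python under stated assumptions: the word set `WORDS` and the function names are hypothetical, and the input is assumed to be a list of words already produced by precise mode. For each long word, the shorter dictionary words inside it (2-character, then 3-character) are emitted first, followed by the long word itself.

```python
# Toy dictionary (hypothetical, for illustration only).
WORDS = {"搜索", "索引", "引擎", "搜索引擎"}


def expand_for_search(word, words=WORDS):
    """Return the in-dictionary 2- and 3-character pieces, then the word."""
    pieces = []
    for size in (2, 3):
        if len(word) > size:
            for i in range(len(word) - size + 1):
                gram = word[i:i + size]
                if gram in words:
                    pieces.append(gram)
    pieces.append(word)
    return pieces


def search_mode(precise_words, words=WORDS):
    """Expand every word from a precise-mode result."""
    result = []
    for word in precise_words:
        result.extend(expand_for_search(word, words))
    return result


print(search_mode(["使用", "搜索引擎", "模式"]))
# ['使用', '搜索', '索引', '引擎', '搜索引擎', '模式']
```

Short words pass through unchanged, so the output is always a superset of the precise-mode result. A query for 引擎 can therefore still match a document that contains 搜索引擎.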
4. Convert the word segmentation result into a list
word_list = list(seg_list)
5. Print word segmentation results
print(word_list)
6. Comparison of segmentation results
6.1. Precise mode (default)
['我爱发', '动态', ',', '我', '喜欢', '使用', '搜索引擎', '模式', '进行', '分词']
6.2. Full mode
['我', '爱', '发动', '动态', ',', '我', '喜欢', '使用', '搜索', '搜索引擎', '索引', '引擎', '模式', '进行', '分词']
6.3. Search engine mode
['我爱发', '动态', ',', '我', '喜欢', '使用', '搜索', '索引', '引擎', '搜索引擎', '模式', '进行', '分词']