A detailed introduction to the jieba package in Python

1. Introduction of Jieba

Jieba is a high-quality Chinese word segmentation component for Python. Its main features are:

  • Supports four word segmentation modes:

    • Precise mode
    • Full mode
    • Search engine mode
    • Paddle mode
  • Supports traditional Chinese (繁體) word segmentation

  • Supports custom dictionaries

  • MIT license

Two, installation and use

1. Installation
pip3 install jieba
2. Use
import jieba

Three, the main word segmentation functions

1. jieba.cut and jieba.lcut

jieba.cut returns a generator, while jieba.lcut returns the same result as a list (a short sketch comparing the two follows the parameter list below).

Parameters of cut:

def cut(self, sentence, cut_all=False, HMM=True, use_paddle=False):
# sentence: the string to be segmented
# cut_all: whether to use full mode
# HMM: whether to use the HMM model
# use_paddle: whether to use paddle mode; paddle mode is loaded lazily, and paddlepaddle-tiny is installed via the enable_paddle interface
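
A minimal sketch of the difference between the two (using the same example sentence as the sections below):

import jieba

gen = jieba.cut("我来到北京清华大学")    # generator
lst = jieba.lcut("我来到北京清华大学")   # list containing the same tokens
print(gen)   # <generator object Tokenizer.cut at 0x...>
print(lst)   # ['我', '来到', '北京', '清华大学']
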
1) Precise mode (default):

Tries to segment the sentence as accurately as possible; suitable for text analysis.

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("精准模式: " + "/ ".join(seg_list))  # precise mode

# -----output-----
精准模式: 我/ 来到/ 北京/ 清华大学
2) Full mode:

Scans the sentence for everything that can form a word; very fast, but it cannot resolve ambiguity.

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("全模式: " + "/ ".join(seg_list))  # full mode

# -----output-----
全模式: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
3) Paddle mode

Paddle mode uses the PaddlePaddle deep learning framework and a sequence-labeling network model (bidirectional GRU) for word segmentation; it also supports part-of-speech tagging.
To use paddle mode, install paddlepaddle-tiny: pip install paddlepaddle-tiny==1.6.1.
Paddle mode currently requires jieba v0.40 or above; for versions below v0.40, upgrade jieba with pip install jieba --upgrade. See the PaddlePaddle official website for details.

import jieba

# paddlepaddle-tiny is installed and imported via the enable_paddle interface
jieba.enable_paddle()  # on first use it can install and import the required code automatically
sentence = "我来到北京清华大学"
seg_list = jieba.cut(sentence, use_paddle=True)
print('Paddle模式: ' + '/'.join(list(seg_list)))

# -----output-----
Paddle enabled successfully......
Paddle模式: 我/来到/北京清华大学
2. jieba.cut_for_search and jieba.lcut_for_search
Search engine mode

On the basis of precise mode, long words are segmented again to improve recall; suitable for search engine tokenization.

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

# -----output-----
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ,, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
3. jieba.Tokenizer(dictionary=DEFAULT_DICT)

Creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are mappings of this tokenizer (the example below uses the global jieba.tokenize function; a sketch with a separate Tokenizer instance follows its output).

import jieba
  
test_sent = "永和服装饰品有限公司"
result = jieba.tokenize(test_sent)  # tokenize: returns each word with its start and end position in the original text
print(result)
for tk in result:
    # print ("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2])    )
    print (tk)
    
# -----output-----
<generator object Tokenizer.tokenize at 0x7f6b68a69d58>
('永和', 0, 2)
('服装', 2, 4)
('饰品', 4, 6)
('有限公司', 6, 10)    
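
To actually work with a separate dictionary alongside the default one, create your own Tokenizer instance; the global functions above keep using jieba.dt. A minimal sketch (the added word and sentence are borrowed from the custom-dictionary example in the next section):

import jieba

# An independent tokenizer: words added here do not affect jieba.dt
my_tokenizer = jieba.Tokenizer()
my_tokenizer.add_word("中信建投")
print(my_tokenizer.lcut("中信建投投资公司"))   # keeps 中信建投 as one word

# The global default tokenizer is unchanged
print(jieba.lcut("中信建投投资公司"))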

Four, adding a custom dictionary

Developers can specify their own custom dictionary to include words that are not in the built-in jieba dictionary. Although jieba can recognize new words on its own, adding them explicitly ensures higher accuracy.

1. Loading a custom dictionary:
 jieba.load_userdict(dict_path)    # dict_path is a file-like object or the path to the custom dictionary file
2. An example custom dictionary looks like this:

Each word occupies one line; a line consists of three parts: the word, its frequency (optional), and its part of speech (optional), separated by spaces. The order of the parts must not be reversed.

创新办 3 i
云计算 5
凱特琳 nz
中信建投
投资公司
3. Examples of using custom dictionaries:
1) Use a custom dictionary file
import jieba

test_sent = "中信建投投资公司投资了一款游戏,中信也投资了一个游戏公司"
jieba.load_userdict("userdict.txt")
words = jieba.cut(test_sent)
print(list(words))

#-----output------
['中信建投', '投资公司', '投资', '了', '一款', '游戏', ',', '中信', '也', '投资', '了', '一个', '游戏', '公司']
2) Use jieba to dynamically modify the dictionary in the program
import jieba

# define an example sentence
test_sent = "中信建投投资公司投资了一款游戏,中信也投资了一个游戏公司"

# add words
jieba.add_word('中信建投')
jieba.add_word('投资公司')

# delete a word
jieba.del_word('中信建投')

words = jieba.cut(test_sent)
print(list(words))

#-----output------
['中信', '建投', '投资公司', '投资', '了', '一款', '游戏', ',', '中信', '也', '投资', '了', '一个', '游戏', '公司']

Five, keyword extraction

1. Keyword extraction based on TF-IDF algorithm
1) TF-IDF interface and example
import jieba.analyse
  • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    The parameters are:
    • sentence: the text to extract keywords from
    • topK: return the topK keywords with the highest TF-IDF weight; the default is 20
    • withWeight: whether to also return each keyword's weight; the default is False
    • allowPOS: only include words with the specified parts of speech; the default is empty, i.e. no filtering
  • jieba.analyse.TFIDF(idf_path=None): create a new TFIDF instance; idf_path is the IDF frequency file
import jieba
import jieba.analyse

# Read the file into a single string using UTF-8 encoding; data.txt sits in the same directory as this script
content = open('data.txt', 'r', encoding='utf-8').read()
tags = jieba.analyse.extract_tags(content, topK=10, withWeight=True, allowPOS=("nr"))
print(tags)

# ----output-------
[('虚竹', 0.20382572423643955), ('丐帮', 0.07839419568792882), ('什么', 0.07287469641815765), ('自己', 0.05838617200768695), ('师父', 0.05459680087740782), ('内力', 0.05353758008018405), ('大理', 0.04885277765801372), ('咱们', 0.04458784837687502), ('星宿', 0.04412126568280158), ('少林', 0.04207588649463058)]
2) The IDF corpus used for keyword extraction can be switched to a custom corpus

Usage:
jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom IDF corpus

Example of a custom IDF corpus (one word and its IDF weight per line):

劳动防护 13.900677652
勞動防護 13.900677652
生化学 13.900677652
生化學 13.900677652
奥萨贝尔 13.900677652
奧薩貝爾 13.900677652
考察队员 13.900677652
考察隊員 13.900677652
岗上 11.5027823792
崗上 11.5027823792
倒车档 12.2912397395
倒車檔 12.2912397395
编译 9.21854642485
編譯 9.21854642485
蝶泳 11.1926274509
外委 11.8212361103
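
A minimal sketch of switching to a custom IDF corpus, assuming the corpus above has been saved as idf.txt next to data.txt:

import jieba.analyse

# Use the custom IDF frequency file instead of the built-in one
jieba.analyse.set_idf_path("idf.txt")

content = open("data.txt", "r", encoding="utf-8").read()
print(jieba.analyse.extract_tags(content, topK=10))
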
3) The stop-word corpus used for keyword extraction can be switched to a custom corpus
  • Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom stop-word file
  • Usage example:
import jieba
import jieba.analyse

# Read the file into a single string using UTF-8 encoding; data.txt sits in the same directory as this script
content = open('data.txt', 'r', encoding='utf-8').read()
jieba.analyse.set_stop_words("stopwords.txt")
tags = jieba.analyse.extract_tags(content, topK=10)
print(",".join(tags))
4) Example of returning keyword weights together with the keywords
import jieba
import jieba.analyse

# Read the file into a single string using UTF-8 encoding; data.txt sits in the same directory as this script
content = open('data.txt', 'r', encoding='utf-8').read()
jieba.analyse.set_stop_words("stopwords.txt")
tags = jieba.analyse.extract_tags(content, topK=10, withWeight=True)
print(tags)
2. Part-of-speech tagging
  • jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom part-of-speech tokenizer. The tokenizer parameter specifies the jieba.Tokenizer used internally. jieba.posseg.dt is the default part-of-speech tagging tokenizer.
  • Tags the part of speech of every word after segmentation, using notation compatible with ictclas.
  • Usage example (a sketch that wraps a custom tokenizer with POSTokenizer follows the output):
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
    print('%s %s' % (word, flag))
    
# ----output--------
我 r
爱 v
北京 ns
天安门 ns
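
As mentioned above, POSTokenizer can wrap an independent jieba.Tokenizer so that part-of-speech tagging uses its own dictionary. A minimal sketch (the added word is only for illustration and receives the default tag for entries without a part of speech):

import jieba
import jieba.posseg as pseg

# Build a POS tagger on top of an independent tokenizer
my_tokenizer = jieba.Tokenizer()
my_tokenizer.add_word("天安门广场")
my_pseg = pseg.POSTokenizer(tokenizer=my_tokenizer)

for word, flag in my_pseg.cut("我爱北京天安门广场"):
    print(word, flag)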

Part-of-speech tag table (shown as an image in the original post, omitted here)

3. Parallel word segmentation

The target text is split by line, the lines are distributed to multiple Python processes for parallel segmentation, and the results are then merged, giving a considerable speed-up. Usage:

  • jieba.enable_parallel(4): enable parallel segmentation mode; the argument is the number of parallel processes
  • jieba.disable_parallel(): turn off parallel segmentation mode

See test_file.py for reference.

Note: this is based on Python's built-in multiprocessing module and currently does not support Windows. A minimal sketch is shown below.
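
The sketch assumes a UTF-8 text file named corpus.txt in the working directory:

import jieba

jieba.enable_parallel(4)    # segment with 4 worker processes
content = open("corpus.txt", "r", encoding="utf-8").read()
words = jieba.lcut(content)
jieba.disable_parallel()    # back to single-process mode
print(len(words))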

4. Tokenize: return the beginning and ending positions of the words in the original text
1) Default mode

Note that the input only accepts unicode strings

import jieba
import jieba.analyse
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
    
# ----output-------
word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限公司		 start: 6 		 end:10
2) Search mode
import jieba
import jieba.analyse
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
    
# ----output-------
word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限		 start: 6 		 end:8
word 公司		 start: 8 		 end:10
word 有限公司		 start: 6 		 end:10
5. ChineseAnalyzer for the Whoosh search engine

jieba and Whoosh together can be used to implement search engine functionality.
Whoosh is a full-text search toolkit implemented in pure Python; it can be installed with pip:

pip install whoosh

The example below builds a simple search engine: three documents are added to an index with add_document(), and the index is then searched for documents containing the keywords "basketball" and "elegant":

import os
import shutil

from whoosh.fields import *
from whoosh.index import create_in
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer


analyzer = ChineseAnalyzer()

schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True,
                             analyzer=analyzer))
if not os.path.exists("test"):
    os.mkdir("test")
else:
    # remove the directory recursively
    shutil.rmtree("test")
    os.mkdir("test")

idx = create_in("test", schema)
writer = idx.writer()

writer.add_document(
    title=u"document1",
    path="/tmp1",
    content=u"Tracy McGrady is a famous basketball player, the elegant basketball style of him attract me")
writer.add_document(
    title=u"document2",
    path="/tmp2",
    content=u"Kobe Bryant is a famous basketball player too , the tenacious spirit of him also attract me")
writer.add_document(
    title=u"document3",
    path="/tmp3",
    content=u"LeBron James is the player i do not like")

writer.commit()
searcher = idx.searcher()
parser = QueryParser("content", schema=idx.schema)

for keyword in ("basketball", "elegant"):
    print("searched keyword ",keyword)
    query= parser.parse(keyword)
    print(query,'------')
    results = searcher.search(query)
    for hit in results:
        print(hit.highlights("content"))
    print("="*50)

Six, lazy loading

Jieba loads lazily: importing jieba or creating jieba.Tokenizer() does not trigger dictionary loading immediately. The dictionary is only loaded to build the prefix dictionary when it is first needed. If you prefer, you can also initialize jieba manually up front:

import jieba
jieba.initialize()  # manual initialization (optional)


Seven, other dictionaries

1. A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
2. A dictionary file with better support for traditional Chinese word segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

Download the dictionary you need and overwrite jieba/dict.txt with it, or point jieba at it with jieba.set_dictionary('data/dict.txt.big').
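
A minimal sketch of switching to the big dictionary, assuming dict.txt.big has been downloaded into a local data/ directory:

import jieba

# Point jieba at the larger dictionary instead of the built-in one
jieba.set_dictionary("data/dict.txt.big")
jieba.initialize()  # optional: load the new dictionary immediately
print(jieba.lcut("我来到北京清华大学"))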

Source: blog.csdn.net/TFATS/article/details/108810284