Detailed usage of Chinese word segmentation with the jieba package in Python (1)

01. Preface

jieba word segmentation has appeared in some of my previous articles, but only at a superficial level. Here I go through its official documentation in an existing Python environment and give a more concrete introduction. Most of the content of this post is taken from the official documentation.

02. Introduction to jieba

02.1 What is jieba?

"jieba" (Chinese for "to stutter") Chiese text segmentation: built to be the best Python Chinse word segmenmtation module.
"jieba" Chinese word segmentation: do the best Python Chinese word segmentation component

02.2 Features

  • Supports three segmentation modes:
    precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis;
    full mode, which scans out all the words in the sentence that can form words; it is very fast, but cannot resolve ambiguity;
    search engine mode, which, on top of precise mode, re-segments long words to improve recall and is suitable for search engine word segmentation.
  • Supports traditional Chinese (繁體) word segmentation
  • Supports custom dictionaries
  • MIT license

02.3 Installation and use

Since the organizations that maintain the major packages are gradually dropping Python 2 support, Python 3 is strongly recommended here as well. Installing jieba is very simple.
Automatic installation: pip install jieba (Windows) or pip3 install jieba (Linux);
Usage: import jieba
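As a quick sanity check after installation, a minimal sketch (the expected output in the comment assumes the default dictionary):

import jieba

print(jieba.__version__)                 # version of the installed jieba package
print(jieba.lcut("我来到北京清华大学"))   # expected: ['我', '来到', '北京', '清华大学']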

02.4 Algorithms involved

  • Efficient word-graph scanning based on a prefix dictionary: a directed acyclic graph (DAG) of all possible word combinations in the sentence is built.
  • Dynamic programming is used to find the maximum-probability path, i.e. the most probable segmentation based on word frequency.
  • For unknown (out-of-vocabulary) words, an HMM model based on the word-forming ability of Chinese characters is used, solved with the Viterbi algorithm.
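A small illustration of the HMM-based new-word discovery step (the sentence is reused in the segmentation example below; the expected results in the comments assume the default dictionary, which does not contain "杭研"):

import jieba

# With HMM disabled, the out-of-vocabulary word 杭研 falls apart into single characters.
print("/".join(jieba.cut("他来到了网易杭研大厦", HMM=False)))  # 他/来到/了/网易/杭/研/大厦
# With HMM enabled (the default), the Viterbi step recovers 杭研 as one word.
print("/".join(jieba.cut("他来到了网易杭研大厦", HMM=True)))   # 他/来到/了/网易/杭研/大厦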

03. Main functions

03.01 Word segmentation

  • The jieba.cut method accepts three input parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; and the HMM parameter, which controls whether the HMM model is used.
  • The jieba.cut_for_search method accepts two parameters: the string to be segmented, and whether to use the HMM model. It is suitable for segmentation used by search engines to build inverted indexes, and its granularity is relatively fine.
  • The string to be segmented can be a unicode/UTF-8 string or a GBK string. Note: passing in GBK strings directly is not recommended, because they may unexpectedly be decoded as UTF-8 incorrectly.
  • jieba.cut and jieba.cut_for_search return an iterable generator; you can use a for loop to obtain each word (unicode) produced by the segmentation, or use
  • jieba.lcut and jieba.lcut_for_search, which return a list directly.
  • jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are mappings of this tokenizer.
    Code example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2018-05-05 22:15:13
# @Author  : JackPI ([email protected])
# @Link    : https://blog.csdn.net/meiqi0538
# @Version : $Id$
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("全模式: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("精准模式: " + "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\JACKPI~1\AppData\Local\Temp\jieba.cache
Loading model cost 1.026 seconds.
Prefix dict has been built succesfully.
全模式: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
精准模式: 我/ 来到/ 北京/ 清华大学
他, 来到, 了, 网易, 杭研, 大厦
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ,, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
[Finished in 1.7s]
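The example above only uses the generator interface. Below is a minimal sketch of the list-returning helpers and of a separate Tokenizer mentioned in the list above (the extra tokenizer simply reuses the default dictionary here):

import jieba

print(jieba.lcut("我来到北京清华大学"))                        # returns a list instead of a generator
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))

# An independent tokenizer; jieba.dt is the default one behind the global functions.
tk = jieba.Tokenizer()  # equivalent to jieba.Tokenizer(dictionary=DEFAULT_DICT)
print(tk.lcut("他来到了网易杭研大厦"))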

03.02 Add custom dictionary

  • Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba is able to recognize new words, adding them yourself guarantees a higher accuracy.
  • Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
  • The dictionary format is the same as that of dict.txt: one word per line; each line has three parts: the word, its word frequency (may be omitted) and its part of speech (may be omitted), separated by spaces, and the order must not be reversed. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
  • When the word frequency is omitted, an automatically calculated frequency that ensures the word can be separated out is used.
    Example of a custom dictionary
创新办 3 i
云计算 5
凱特琳 nz
台中
  • For restricted file systems, change the tmp_dir and cache_file attributes of the tokenizer (jieba.dt by default) to specify the folder that holds the cache file and the cache file name, respectively; a small sketch follows.
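A minimal sketch of redirecting the cache (the directory and file name here are hypothetical):

import os
import jieba

cache_dir = "/tmp/jieba_cache"            # hypothetical writable directory
os.makedirs(cache_dir, exist_ok=True)
jieba.dt.tmp_dir = cache_dir              # folder that will hold the cache file
jieba.dt.cache_file = "my_jieba.cache"    # cache file name used inside tmp_dir
jieba.initialize()                        # build/load the prefix dict with the new cache settings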
    Full usage example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2018-05-05 22:15:13
# @Author  : JackPI ([email protected])
# @Link    : https://blog.csdn.net/meiqi0538
# @Version : $Id$
# import the jieba package
import jieba
# manage the system path
import sys
sys.path.append("../")
# load the custom dictionary
jieba.load_userdict("userdict.txt")
# import the part-of-speech tagging module
import jieba.posseg as pseg

# add words
jieba.add_word('石墨烯')
jieba.add_word('凱特琳')
# delete a word
jieba.del_word('自定义词')
# test data as a tuple of strings
test_sent = (
"李小福是创新办主任也是云计算方面的专家; 什么是八一双鹿\n"
"例如我输入一个带“韩玉赏鉴”的标题,在自定义词库中也增加了此词为N类\n"
"「台中」正確應該不會被切開。mac上可分出「石墨烯」;此時又可以分出來凱特琳了。"
)
# default segmentation
words = jieba.cut(test_sent)
print('/'.join(words))  # join the segmented words with /

print("="*40)
# part-of-speech tagging
result = pseg.cut(test_sent)
# loop over the result and print each word with its POS tag, separated by / and followed by a comma and a space
for w in result:
    print(w.word, "/", w.flag, ", ", end=' ')

print("\n" + "="*40)

# segmenting English text
terms = jieba.cut('easy_install is great')
print('/'.join(terms))
# segmenting mixed English and Chinese text
terms = jieba.cut('python 的正则表达式是好用的')
print('/'.join(terms))

print("="*40)
# test frequency tune
testlist = [
('今天天气不错', ('今天', '天气')),
('如果放到post中将出错。', ('中', '将')),
('我们中出了一个叛徒', ('中', '出')),
]

for sent, seg in testlist:
    print('/'.join(jieba.cut(sent, HMM=False)))
    word = ''.join(seg)
    print('%s Before: %s, After: %s' % (word, jieba.get_FREQ(word), jieba.suggest_freq(seg, True)))
    print('/'.join(jieba.cut(sent, HMM=False)))
    print("-"*40)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\JACKPI~1\AppData\Local\Temp\jieba.cache
Loading model cost 1.063 seconds.
Prefix dict has been built succesfully.
李小福/是/创新办/主任/也/是/云计算/方面/的/专家/;/ /什么/是/八一双鹿/
/例如/我/输入/一个/带/“/韩玉赏鉴/”/的/标题/,/在/自定义/词库/中/也/增加/了/此/词为/N/类/
/「/台中/」/正確/應該/不會/被/切開/。/mac/上/可/分出/「/石墨烯/」/;/此時/又/可以/分出/來/凱特琳/了/。
========================================
李小福 / nr ,  是 / v ,  创新办 / i ,  主任 / b ,  也 / d ,  是 / v ,  云计算 / x ,  方面 / n ,  的 / uj ,  专家 / n ,  ; / x ,    / x ,  什么 / r ,  是 / v ,  八一双鹿 / nz ,  
 / x ,  例如 / v ,  我 / r ,  输入 / v ,  一个 / m ,  带 / v ,  “ / x ,  韩玉赏鉴 / nz ,  ” / x ,  的 / uj ,  标题 / n ,  , / x ,  在 / p ,  自定义 / l ,  词库 / n ,  中 / f ,  也 / d ,  增加 / v ,  了 / ul ,  此 / r ,  词 / n ,  为 / p ,  N / eng ,  类 / q ,  
 / x ,  「 / x ,  台中 / s ,  」 / x ,  正確 / ad ,  應該 / v ,  不 / d ,  會 / v ,  被 / p ,  切開 / ad ,  。 / x ,  mac / eng ,  上 / f ,  可 / v ,  分出 / v ,  「 / x ,  石墨烯 / x ,  」 / x ,  ; / x ,  此時 / c ,  又 / d ,  可以 / c ,  分出 / v ,  來 / zg ,  凱特琳 / nz ,  了 / ul ,  。 / x ,  
========================================
easy_install/ /is/ /great
python/ /的/正则表达式/是/好用/的
========================================
今天天气/不错
今天天气 Before: 3, After: 0
今天/天气/不错
----------------------------------------
如果/放到/post/中将/出错/。
中将 Before: 763, After: 494
如果/放到/post/中/将/出错/。
----------------------------------------
我们/中/出/了/一个/叛徒
中出 Before: 3, After: 3
我们/中/出/了/一个/叛徒
----------------------------------------
[Finished in 2.6s]

03.03 Adjust the dictionary

  • Dictionaries can be modified dynamically in a program with add_word(word, freq=None, tag=None) and del_word(word).
  • Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out.
  • Note: the automatically calculated word frequencies may not take effect when the HMM new-word discovery feature is used.
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
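A minimal sketch of the dynamic add_word/del_word calls mentioned above (the sample sentence and the expectations in the comments are illustrative assumptions, not taken from the original post):

import jieba

jieba.add_word('石墨烯', tag='nz')                      # register a new word, optionally with a POS tag
print('/'.join(jieba.cut('石墨烯的应用', HMM=False)))    # '石墨烯' should now stay in one piece
jieba.del_word('石墨烯')                                # remove it from the in-memory dictionary again
print('/'.join(jieba.cut('石墨烯的应用', HMM=False)))    # without it, the word is split into smaller units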

04. Conclusion

For more on jieba word segmentation, see the next post: Detailed usage of Chinese word segmentation with the jieba package in Python (2). For more about natural language processing, you can follow my WeChat subscription account, which contains a large number of natural language processing and machine learning articles and learning materials.
