NLP-Jieba word segmentation

      As its name suggests (jieba, 结巴, means "to stutter"), the Jieba library is mainly used for Chinese word segmentation: it works through a sentence and produces words one by one, much like stuttering. It is currently one of the most widely used Chinese word segmentation components for Python.

      Jieba word segmentation supports four modes:

  1. Precise mode: tries to cut the sentence into the most precise segmentation; suitable for text analysis;

  1. Full mode: scans out every possible word in the sentence; it is very fast, but it cannot resolve ambiguity;

  1. Search engine mode: based on the precise mode, it further segments long words to improve recall; suitable for word segmentation in search engines;

  1. Paddle mode: uses the PaddlePaddle deep learning framework to train a sequence-labeling (bidirectional GRU) network model for segmentation; part-of-speech tagging is also supported in this mode.


Main functions:

Word segmentation

  1. The jieba.cut method accepts four input parameters: the string to be segmented; the cut_all parameter, which controls whether to use full mode; the HMM parameter, which controls whether to use the HMM model; and the use_paddle parameter, which controls whether to use paddle mode. Paddle mode is loaded lazily: the enable_paddle interface installs paddlepaddle-tiny and imports the related code;

  1. The jieba.cut_for_search method accepts two parameters: the string to be segmented, and whether to use the HMM model. This method is suitable for word segmentation used by search engines to build inverted indexes, and its granularity is relatively fine;

  1. The string to be segmented can be a unicode string, a UTF-8 string, or a GBK string. Note: passing a GBK string directly is not recommended, as it may be unpredictably mis-decoded as UTF-8;

  1. The structure returned by jieba.cut and jieba.cut_for_search is an iterable generator. You can use a for loop to obtain each word (unicode) produced by the segmentation, or use jieba.lcut and jieba.lcut_for_search to return a list directly (a short sketch follows the example output below);

  1. jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use several different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are shortcuts to this tokenizer.

# encoding=utf-8
import jieba

jieba.enable_paddle()  # enable paddle mode; supported since version 0.40, not available in earlier versions
strs = ["我来到北京清华大学", "乒乓球拍卖完了", "中国科学技术大学"]
for sentence in strs:
    seg_list = jieba.cut(sentence, use_paddle=True)  # paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Precise Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    (here, "杭研" is not in the dictionary, but it is still recognized by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
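
As mentioned in the list above, jieba.lcut and jieba.lcut_for_search return lists directly, and jieba.Tokenizer can hold its own dictionary. A minimal sketch (the example sentences are reused from above; everything else is illustrative):

import jieba

print(jieba.lcut("我来到北京清华大学"))            # same result as list(jieba.cut(...))
print(jieba.lcut_for_search("中国科学技术大学"))   # search-engine granularity, returned as a list

# A separate tokenizer instance with the default dictionary; it can be given
# its own user dictionary later without affecting the global jieba.dt.
tk = jieba.Tokenizer()
print(tk.lcut("乒乓球拍卖完了"))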

Custom dictionary

      Load the dictionary:

  1. Developers can specify their own custom dictionary to include words that are not in jieba's built-in dictionary. Although jieba is able to recognize new words, adding them yourself ensures a higher accuracy rate;

  1. Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path of a custom dictionary

  1. The dictionary format is the same as that of dict.txt: one word per line; each line consists of three parts: the word, the word frequency (may be omitted), and the part of speech (may be omitted), separated by spaces, and the order must not be reversed. If file_name is a path or a file opened in binary mode, the file must be encoded in UTF-8 (see the sketch after this list);

  1. When the word frequency is omitted, an automatically calculated frequency that ensures the word can be segmented out is used.
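
A minimal sketch of a custom dictionary and how it would be loaded; the file name userdict.txt and its entries are made up for illustration, following the format described above:

# userdict.txt (UTF-8), one entry per line: word [frequency] [part of speech]
#   云计算 5
#   创新办 3 i
#   台中

import jieba

jieba.load_userdict("userdict.txt")   # hypothetical path to the custom dictionary
print(jieba.lcut("小明是创新办主任，也是云计算方面的专家"))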

      Adjust the dictionary:

  1. Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in the program (a short sketch follows the interactive example below).

  1. Use suggest_freq(segment, tune=True) to adjust the word frequency of a single word so that it can (or cannot) be segmented out.

  1. Note: the automatically calculated word frequencies may not take effect when the HMM new-word discovery feature is enabled.

>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开

Keyword extraction
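
jieba.analyse provides keyword extraction based on TF-IDF and TextRank. A minimal sketch (the example sentence is reused from above; topK and withWeight select how many keywords to return and whether to include their weights):

import jieba.analyse

text = "小明硕士毕业于中国科学院计算所，后在日本京都大学深造"

# TF-IDF based extraction: top 5 keywords with their weights
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)

# TextRank based extraction
print(jieba.analyse.textrank(text, topK=5))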

Part-of-speech tagging
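
jieba.posseg attaches a part-of-speech tag to every segmented word; the usual usage pattern looks like this (the example sentence is illustrative):

import jieba.posseg as pseg

# each item yields the word together with its part-of-speech tag
for word, flag in pseg.cut("我爱北京天安门"):
    print('%s %s' % (word, flag))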


Main source: https://github.com/fxsjy/jieba

Origin: https://blog.csdn.net/fzz97_/article/details/128879357