jieba is an excellent third-party library for Chinese word segmentation
Chinese text is written without spaces between words, so individual words must be extracted through word segmentation
jieba is a third-party library for this task and must be installed separately (pip install jieba)
The jieba library provides three word segmentation modes; for basic use, mastering a single function is enough
The principle of jieba word segmentation
jieba uses a Chinese lexicon to estimate the probability that adjacent Chinese characters belong together
Characters that are highly likely to form a word are grouped together, and these groups make up the segmentation result
Besides segmenting text, users can also add custom words and phrases
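The idea above can be illustrated with a toy sketch. This is not jieba's actual implementation: the dictionary TOY_FREQ, its counts, and the function segment are all made up for illustration. The sketch picks the split whose words have the highest combined probability, using dynamic programming from the end of the text.

```python
import math

# Hypothetical toy lexicon: word -> made-up frequency count (not jieba's data).
TOY_FREQ = {
    "我": 50, "来到": 30, "北京": 40,
    "清华": 10, "清华大学": 20, "华大": 2, "大学": 35,
}

def segment(text, freq=TOY_FREQ):
    """Return the most probable split of text under the toy lexicon."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters get count 1 so any text can be split.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                score = math.log(count / total) + best[j][0]
                candidates.append((score, j))
        best[i] = max(candidates)
    # Follow the stored end indices to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words

toy_result = segment("我来到北京清华大学")
print(toy_result)  # ['我', '来到', '北京', '清华大学']
```

Adding a custom word corresponds, in this sketch, to adding a new entry to TOY_FREQ: once the entry exists with a high enough count, the split that contains it wins.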
Three modes of jieba word segmentation
Exact mode, full mode, search engine mode
Exact mode: splits the text precisely, with no redundant words (the most commonly used mode)
Full mode: scans out every possible word in the text, so the results contain redundancy
Search engine mode: builds on exact mode by splitting long words again
Common functions of the jieba library:
jieba.lcut(s): exact mode; returns the segmentation result as a list (the l in lcut stands for list, cut for segmentation)
jieba.lcut(s, cut_all=True): full mode; returns the segmentation result as a list, with redundancy
jieba.lcut_for_search(s): search engine mode; returns the segmentation result as a list, with redundancy
jieba.add_word(w): adds the new word w to the segmentation dictionary