Install
pip install jieba
Using jieba word segmentation in ElasticSearch
- Install
cd /usr/share/elasticsearch/plugins && \
wget https://github.com/sing1ee/elasticsearch-jieba-plugin/archive/refs/tags/v5.4.0.zip && \
unzip v5.4.0.zip && \
rm -rf v5.4.0.zip
- Hot-update dictionaries
  - Dictionary format: term [frequency] [part-of-speech]
  - Storage path: /usr/share/elasticsearch/plugins/elasticsearch-jieba-plugin-5.4.0/dic/&lt;dictionary name&gt;.dict, reloaded automatically every 60 s
- Analyzers
  - jieba_search: jieba search-mode segmentation, suitable for search engines
  - jieba_index: jieba full-mode segmentation, producing every possible term
Algorithm principle
- Efficiently scan the sentence against a prefix dictionary to build a directed acyclic graph (DAG) of all possible word combinations
- Use dynamic programming to find the maximum-probability path, i.e. the best segmentation according to word frequencies
- Discover new (out-of-vocabulary) words with an HMM model, decoded via the Viterbi algorithm
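The first two steps can be sketched in plain Python. The dictionary and frequencies below are a toy example invented for the demo, not jieba's real data: build a DAG of candidate words from a prefix dictionary, then pick the maximum-probability path by dynamic programming over log frequencies.

```python
import math

# Toy dictionary: word -> corpus frequency (made up for illustration)
FREQ = {"去": 50, "北": 20, "京": 10, "北京": 400,
        "大": 30, "学": 25, "大学": 300, "北京大学": 500}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, list every end index that forms a dictionary word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def max_prob_cut(sentence):
    """DP from right to left: route[i] = (best log-probability from i, split point)."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(max_prob_cut("去北京大学"))  # -> ['去', '北京大学']
```

Because "北京大学" has a higher frequency than the product of its parts, the DP keeps it as one word instead of splitting it into "北京" + "大学".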
Word segmentation modes
- Exact mode: the most accurate segmentation, suitable for text analysis (the default)
- Search mode: further splits long words on top of exact mode, suitable for search engines
- Full mode: emits every possible term; fast, but does not resolve ambiguity
- Paddle mode: deep-learning-based segmentation with part-of-speech tagging (requires installing paddlepaddle-tiny)
Word segmentation APIs
- jieba.cut(s, cut_all=False, HMM=True, use_paddle=False): cut_all enables full mode, HMM enables new-word discovery, use_paddle enables paddle mode
- jieba.cut_for_search(s, HMM=True): search-mode segmentation, suitable for building inverted indexes for search engines
- jieba.posseg.cut(s, use_paddle=True): segmentation with part-of-speech tagging; call jieba.enable_paddle() beforehand for paddle mode
- jieba.tokenize(s): segmentation with start/end position tagging
- Keyword extraction
  - TF-IDF algorithm: jieba.analyse.extract_tags(s, topK=20, withWeight=False, allowPOS=()); topK sets how many top keywords to return, withWeight also returns the weights, allowPOS filters by part of speech
  - TextRank algorithm: jieba.analyse.textrank(s, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')); same parameters as above
Dictionary settings
- jieba.set_dictionary('data/dict.txt.big'): change the main dictionary path
- jieba.load_userdict(filepath): append a custom dictionary to improve segmentation accuracy (one entry per line: term [frequency] [part-of-speech]); frequency = target term count / total term count in the text
Dynamically adjusting entries
- Adjustment methods
  - Entries: jieba.add_word(word, freq=None, tag=None) and jieba.del_word(word)
  - Word frequency: jieba.suggest_freq(segment, tune=True), which only tunes whether an entry is split or kept whole
- Force an entry to stay whole: jieba.add_word(word) or jieba.suggest_freq(word, tune=True)
- Force an entry to be split: jieba.suggest_freq(('part1', 'part2'), tune=True), passing the entry as a tuple of its parts
Part-of-speech table
Go platform usage
// go get github.com/yanyiwu/gojieba
jieba := gojieba.NewJieba() // a custom dictionary path can be passed in
defer jieba.Free()
jieba.AddWord("你好") // dynamically add an entry
words := jieba.Cut(s, hmm) // exact mode; hmm toggles new-word discovery
words = jieba.CutForSearch(s, hmm) // search mode; hmm toggles new-word discovery
words = jieba.CutAll(s) // full mode
posWords := jieba.Tag(s) // segmentation with part-of-speech tags
indexWords := jieba.Tokenize(s, mode, hmm) // segmentation with positions; mode is gojieba.SearchMode or gojieba.DefaultMode
weightWords := jieba.ExtractWithWeight(s, topk) // keyword extraction; topk is the number of top keywords to return