jieba (结巴) Chinese word segmentation

Install

pip install jieba

Using jieba word segmentation in Elasticsearch

  • Install
cd /usr/share/elasticsearch/plugins && \
wget https://github.com/sing1ee/elasticsearch-jieba-plugin/archive/refs/tags/v5.4.0.zip && \
unzip v5.4.0.zip && \
rm -rf v5.4.0.zip
  • Hot-update dictionary
    • Dictionary format, one entry per line: term [frequency] [POS tag]
    • Storage path: /usr/share/elasticsearch/plugins/elasticsearch-jieba-plugin-5.4.0/dic/<dictionary name>.dict, refreshed automatically every 60 s
  • Analyzers
    • jieba_search: jieba search mode, suitable for search engines
    • jieba_index: jieba full mode; segmentation yields all possible terms
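As a sketch, a hot-update dictionary file might look like this (the sample entries follow jieba's userdict demo; frequency and POS tag are both optional):

```
云计算 5 n
创新办 3 i
easy_install 3 eng
```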

Algorithm principle

  • Scans the sentence against a prefix dictionary (trie) to efficiently build a directed acyclic graph (DAG) of all possible word combinations
  • Uses dynamic programming to find the maximum-probability path through the DAG, i.e. the most likely segmentation based on word frequency
  • For out-of-vocabulary words, performs new-word discovery with an HMM model solved by the Viterbi algorithm

Word segmentation modes

  • Exact mode: precise segmentation, suitable for text analysis (the default)
  • Search mode: on top of exact mode, re-splits long words; suitable for search engines
  • Full mode: outputs all possible terms; fast, but does not resolve ambiguity
  • Paddle mode: segmentation by a deep-learning model, with part-of-speech tagging support (requires installing paddlepaddle-tiny)

Word segmentation interfaces

  • jieba.cut(text, cut_all=whether to use full mode, HMM=whether to enable new-word discovery, use_paddle=whether to use paddle mode)
  • jieba.cut_for_search(text, HMM=whether to enable new-word discovery): search-mode segmentation, suitable for building inverted indexes for search engines
  • jieba.posseg.cut(text, use_paddle=True): part-of-speech-tagged segmentation; paddle mode requires calling jieba.enable_paddle() first
  • jieba.tokenize(text): segmentation that also returns each token's start/end position
  • Keyword extraction
    • TF-IDF algorithm: jieba.analyse.extract_tags(text, topK=number of top keywords to return, withWeight=whether to return weights, allowPOS=part-of-speech filter)
    • TextRank algorithm: jieba.analyse.textrank(text, topK=number of top keywords to return, withWeight=whether to return weights, allowPOS=part-of-speech filter)

Dictionary settings

  • jieba.set_dictionary('data/dict.txt.big'): change the main dictionary path
  • jieba.load_userdict(filepath): load an additional custom dictionary to improve segmentation accuracy (each line: term [frequency] [POS tag]); frequency ≈ occurrences of the term / total number of terms in the corpus

Dynamically adjust terms

  • Dynamic adjustment methods
    • Entries: jieba.add_word(word, freq=None, tag=None), jieba.del_word(word)
    • Word frequency: jieba.suggest_freq(segment, tune=True); this can only tune a term's frequency so that it is, or is not, split
  • Force a term to be kept whole: jieba.add_word(word) or jieba.suggest_freq(word, tune=True)
  • Force a term to be split: jieba.suggest_freq((part1, part2), tune=True) or jieba.del_word(word)

Part-of-speech table

Go platform usage

// go get github.com/yanyiwu/gojieba

jieba := gojieba.NewJieba() // a custom dictionary path may be passed in
defer jieba.Free()

jieba.AddWord("你好") // dynamically add a term

words := jieba.Cut(s, hmm)         // exact mode; hmm toggles new-word discovery
words = jieba.CutForSearch(s, hmm) // search mode; hmm toggles new-word discovery
words = jieba.CutAll(s)            // full mode
posWords := jieba.Tag(s)           // part-of-speech-tagged segmentation
indexWords := jieba.Tokenize(s, mode, hmm)      // position-tagged segmentation; mode: gojieba.SearchMode / gojieba.DefaultMode
weightWords := jieba.ExtractWithWeight(s, topk) // keyword extraction; topk = number of top keywords to return


Origin http://43.154.161.224:23101/article/api/json?id=324137613&siteId=291194637