2. The Python jieba Library (Required)

The jieba library (jieba means "to stutter" in Chinese)

  • An important third-party Chinese word-segmentation library
  • Chinese text does not separate words with spaces or punctuation, so word segmentation is a problem specific to Chinese and similar languages
  • jieba relies on a Chinese lexicon: it compares candidate substrings against the lexicon, builds a segmentation graph, and uses dynamic programming to find the word sequence with the maximum probability
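The lexicon-plus-dynamic-programming idea can be sketched in miniature. The lexicon and probabilities below are invented for illustration; this is a toy sketch of the principle, not jieba's actual implementation:

```python
import math

# Toy word lexicon with made-up probabilities (illustration only).
lexicon = {
    "全国": 0.05, "计算机": 0.04, "计算": 0.03, "算机": 0.001,
    "等级": 0.03, "考试": 0.04,
    "全": 0.01, "国": 0.01, "计": 0.01, "算": 0.01, "机": 0.01,
    "等": 0.01, "级": 0.01, "考": 0.01, "试": 0.01,
}

def segment(s, max_len=4):
    """Maximum-probability segmentation via dynamic programming."""
    n = len(s)
    # best[i] = (log-probability, word list) for the prefix s[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        # Try every word of length 1..max_len that ends at position i.
        for j in range(max(0, i - max_len), i):
            word = s[j:i]
            if word in lexicon and best[j][0] > -math.inf:
                score = best[j][0] + math.log(lexicon[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

print(segment("全国计算机等级考试"))  # ['全国', '计算机', '等级', '考试']
```

Because whole words in the lexicon are more probable than their single characters, the dynamic program prefers "计算机" over "计" + "算" + "机".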

The three segmentation modes of the jieba library

  • Precise mode: splits the text into exact, non-overlapping words with low redundancy; best for text analysis
  • Full mode: outputs every word that can possibly be formed from the sentence; very fast, but it cannot resolve ambiguity and has the highest redundancy
  • Search engine mode: starts from the precise-mode result, then re-segments long words; moderate redundancy

Precise mode: jieba.lcut()

The most commonly used Chinese word-segmentation function

>>> import jieba
>>> jieba.lcut("全国计算机等级考试")
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\hy\AppData\Local\Temp\jieba.cache
Loading model cost 1.007 seconds.
Prefix dict has been built successfully.
['全国', '计算机', '等级', '考试']
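Because precise mode produces each word exactly once with no overlap, its output feeds directly into tasks such as word-frequency counting. The token list below is hard-coded from the REPL session above (repeated to simulate a longer text); in practice you would pass the return value of jieba.lcut:

```python
from collections import Counter

# Tokens as returned by jieba.lcut above, repeated for illustration.
tokens = ['全国', '计算机', '等级', '考试', '全国', '考试']

freq = Counter(tokens)
print(freq.most_common(2))  # [('全国', 2), ('考试', 2)]
```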

Search engine mode: jieba.lcut_for_search()

First segments in precise mode, then re-segments any long words

>>> jieba.lcut_for_search("全国计算机等级考试")
['全国', '计算', '算机', '计算机', '等级', '考试']
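The "precise first, then re-segment long words" behaviour can be sketched as a post-processing pass over the precise-mode tokens: for every token longer than two characters, also emit its two- and three-character substrings that appear in the lexicon. The lexicon below is invented for illustration, and this is a toy sketch, not jieba's real algorithm:

```python
# Toy lexicon of known words (illustration only).
lexicon = {"全国", "计算", "算机", "计算机", "等级", "考试"}

def for_search(tokens):
    """Re-segment long precise-mode tokens, search-engine style."""
    out = []
    for tok in tokens:
        if len(tok) > 2:
            # Emit in-lexicon sub-words of length 2, then length 3.
            for size in (2, 3):
                for i in range(len(tok) - size + 1):
                    sub = tok[i:i + size]
                    if sub != tok and sub in lexicon:
                        out.append(sub)
        out.append(tok)  # the original token always survives
    return out

print(for_search(['全国', '计算机', '等级', '考试']))
# ['全国', '计算', '算机', '计算机', '等级', '考试']
```

The extra sub-words make the output more useful for building a search index, at the cost of some redundancy.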

Full mode: jieba.lcut(s, cut_all=True)

>>> jieba.lcut("全国计算机等级考试", cut_all=True)
['全国', '国计', '计算', '计算机', '算机', '等级', '考试']

If you are unsure which mode to use, the search engine mode is usually a good choice: its redundancy is moderate.

jieba.add_word()

Adds a new word to the jieba lexicon

>>> jieba.add_word("python科目")
>>> jieba.lcut("全国计算机等级考试python科目")
['全国', '计算机', '等级', '考试', 'python科目']


Origin blog.csdn.net/weixin_44478378/article/details/104588020