041 Module 5: Using the jieba library

1. Basic introduction to the jieba library

1.1 Overview of the jieba library

jieba is an excellent third-party library for Chinese word segmentation

  • Chinese text has to be segmented before individual words can be obtained
  • jieba is a third-party library, so it must be installed separately
  • jieba provides three segmentation modes; the simplest approach is to master a single function

1.2 Installing the jieba library

pip install jieba (run from the cmd command line)


1.3 How jieba segmentation works

jieba segmentation relies on a Chinese dictionary

  • It uses a Chinese dictionary to determine the association probability between Chinese characters
  • Characters with a high association probability are grouped into words, which form the segmentation result
  • In addition to the built-in dictionary, users can add custom words
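
The idea in the bullets above can be sketched with a toy example. This is only a minimal illustration, not jieba's actual implementation: the frequency table TOY_FREQ and the helper segment are hypothetical names made up for this sketch, and the counts are invented. The function picks the split whose words have the highest combined probability.

```python
import math

# Hypothetical toy dictionary: word -> invented frequency count.
# This is NOT jieba's real dictionary data.
TOY_FREQ = {
    "中国": 50, "是": 100, "一个": 60, "一": 20, "个": 20,
    "伟大": 30, "的": 200, "国家": 40, "国": 10, "家": 10,
    "中": 10, "伟": 1, "大": 10,
}


def segment(text, freq=TOY_FREQ, max_len=4):
    """Return the split of text whose words have the highest total log-probability."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (best score for text[i:], end index of the first word of text[i:])
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in freq:
                count = freq[word]
            elif j == i + 1:
                count = 1  # unknown single character: allow it with a tiny count
            else:
                continue   # unknown multi-character string: not a candidate
            candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)

    # Walk the recorded end indices to rebuild the chosen words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(segment("中国是一个伟大的国家"))
```

Because a dictionary word such as "中国" has a higher probability than the product of the probabilities of "中" and "国", the sketch keeps it as one word. Adding a custom word corresponds to adding an entry to the table.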

2. Using the jieba library

2.1 The three segmentation modes of jieba

Precise mode, full mode, and search engine mode

  • Precise mode: cuts the text apart precisely, with no redundant words
  • Full mode: scans out every possible word in the text, so the result contains redundancy
  • Search engine mode: on top of precise mode, segments long words again

2.2 Commonly used functions of the jieba library

Function                      Description
jieba.lcut(s)                 Precise mode; returns the segmentation result as a list
jieba.lcut(s, cut_all=True)   Full mode; returns the segmentation result as a list, with redundancy
jieba.lcut_for_search(s)      Search engine mode; returns the segmentation result as a list, with redundancy
jieba.add_word(w)             Adds the new word w to the segmentation dictionary

import jieba

jieba.lcut("中国是一个伟大的国家")
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.174 seconds.
Prefix dict has been built succesfully.
['中国', '是', '一个', '伟大', '的', '国家']
jieba.lcut("中国是一个伟大的国家", cut_all=True)
['中国', '国是', '一个', '伟大', '的', '国家']
jieba.lcut_for_search("中华人民共和国是伟大的")
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']
jieba.add_word("蟒蛇语言")

2.3 Key points of word segmentation

jieba.lcut(s)


Origin www.cnblogs.com/nickchen121/p/11200531.html