For English text, extracting the words is easy: you only need the string's split() method, for example on "China is a great country".
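As a minimal sketch of the point above, calling split() with no arguments breaks an English sentence on whitespace. The variable names are only illustrative.

```python
# Split an English sentence into words on whitespace.
sentence = "China is a great country"
words = sentence.split()
print(words)  # ['China', 'is', 'a', 'great', 'country']
```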
Chinese text, however, has no separator between words, so the words have to be identified before they can be extracted. This is the "word segmentation" problem, which is particular to Chinese and similar languages.
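To see why split() is not enough here, the sketch below applies it to a Chinese sentence. The sample sentence is an assumption chosen for illustration (roughly "The People's Republic of China is a great country"). Because the sentence contains no whitespace, split() returns it unchanged as a single piece.

```python
# A Chinese sentence has no spaces, so split() cannot find word boundaries.
text = "中华人民共和国是一个伟大的国家"  # assumed sample sentence
pieces = text.split()
print(len(pieces))        # 1
print(pieces[0] == text)  # True: the whole sentence comes back as one piece
```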
jieba ("stutter") is an important third-party Chinese word segmentation library for Python. Because it is a third-party library and does not ship with the Python installation, it has to be installed with pip.
To install it on Windows, open a command line while connected to the network and enter: pip install jieba
When the installation finishes, a message reports that it completed successfully.
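One way to confirm the installation from Python, without relying on the pip message, is to ask the import system whether the package can be found. This is only a sketch; the helper name is_installed is hypothetical and not part of jieba.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("jieba"):
    print("jieba is installed")
else:
    print("jieba was not found; run: pip install jieba")
```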
- The three segmentation modes of jieba
Precise mode, full mode, and search engine mode
- Precise mode: cuts the text into words exactly, with no redundant words
- Full mode: scans out every possible word in the text, so the result contains redundancy
- Search engine mode: starts from the precise mode result and splits the long words again
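The difference between the modes can be checked with a short sketch. It assumes jieba is installed and uses an assumed sample sentence; the helper name segment_three_modes is hypothetical. In precise mode the tokens should join back into the original text, which is what "no redundancy" means, while full mode normally returns overlapping words.

```python
import importlib.util


def segment_three_modes(text):
    """Return the token lists produced by jieba's three modes."""
    import jieba  # imported here so the helper can be defined without jieba

    return {
        "precise": jieba.lcut(text),               # default precise mode
        "full": jieba.lcut(text, cut_all=True),    # full mode
        "search": jieba.lcut_for_search(text),     # search engine mode
    }


# Run the comparison only when jieba is actually available.
if importlib.util.find_spec("jieba") is not None:
    sample = "中华人民共和国是一个伟大的国家"  # assumed sample sentence
    modes = segment_three_modes(sample)
    for name, tokens in modes.items():
        print(name, tokens)
    # Precise mode covers the text exactly once.
    print("".join(modes["precise"]) == sample)
```

The lcut functions return lists; jieba also offers cut and cut_for_search, which return generators and may be preferable for long texts.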
- Commonly used functions of the jieba library
- For example:
import jieba
# Sample sentence: "The People's Republic of China is a great country"
jieba.lcut("中华人民共和国是一个伟大的国家")                # precise mode
jieba.lcut("中华人民共和国是一个伟大的国家", cut_all=True)  # full mode
jieba.lcut_for_search("中华人民共和国是一个伟大的国家")     # search engine mode
Output: