Learning Search Engines (4): Chinese Tokenizers

I. Introduction

How tokenization works

  1. A Reader reads the data stream.
  2. The data passes through the first filter, which converts uppercase letters to lowercase.
  3. The data passes through the second filter, which splits it on spaces into individual words.
  4. The data passes through the third filter, which removes punctuation, prepositions, and similar noise.
  5. After these three passes, the result is a sequence of vocabulary units (tokens).
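The steps above can be sketched as a small pipeline. This is only an illustration of the idea, not how any real analyzer is implemented: the function names and the `STOP_WORDS` list are hypothetical, and the sketch handles space-separated text only.

```python
import string

# Hypothetical stop-word list for illustration; real analyzers ship their own.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "with"}


def lowercase(text):
    """Pass 1: convert uppercase letters to lowercase."""
    return text.lower()


def split_on_spaces(text):
    """Pass 2: split the text on whitespace into raw words."""
    return text.split()


def drop_punctuation_and_stop_words(words):
    """Pass 3: strip surrounding punctuation and discard stop words."""
    tokens = []
    for word in words:
        cleaned = word.strip(string.punctuation)
        if cleaned and cleaned not in STOP_WORDS:
            tokens.append(cleaned)
    return tokens


def analyze(text):
    """Run the three passes in order and return the resulting tokens."""
    return drop_punctuation_and_stop_words(split_on_spaces(lowercase(text)))


print(analyze("The Quick brown fox, in the Park!"))
# ['quick', 'brown', 'fox', 'park']
```

Because the second pass relies on spaces, the same pipeline returns a Chinese sentence as one long token, which is why Chinese text needs a dedicated tokenizer.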

Common tokenizers (not recommended for Chinese text)

StandardAnalyzer (the officially recommended analyzer): works well for English but cannot segment Chinese correctly. It uses single-character segmentation, so Chinese text is split one character at a time.
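A rough way to picture single-character segmentation: English words stay whole, while every Chinese character becomes its own token. The sketch below imitates that behavior with a regular expression. It is an approximation for illustration, not StandardAnalyzer's actual rules, and `unigram_tokens` is a hypothetical name.

```python
import re

# Either a run of ASCII letters/digits, or one CJK unified ideograph.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def unigram_tokens(text):
    """Keep English words whole; emit each Chinese character separately."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


print(unigram_tokens("Lucene 中文分词"))
# ['lucene', '中', '文', '分', '词']
```

A query for the word 分词 is then matched only through its individual characters, which tends to produce imprecise results.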

CJKAnalyzer (Chinese, Japanese, and Korean analyzer): bigram segmentation, which splits the text into overlapping pairs of adjacent characters.
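Bigram segmentation can be sketched in a few lines. The function below assumes its input is a single unbroken run of Chinese characters and is meant only to show the shape of the output. `cjk_bigrams` is a hypothetical name, not a Lucene API.

```python
def cjk_bigrams(text):
    """Return overlapping two-character tokens for a run of CJK characters."""
    if len(text) < 2:
        # A lone character (or empty input) cannot form a pair.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("中文分词器"))
# ['中文', '文分', '分词', '词器']
```

The output contains real words such as 中文 and 分词, but also meaningless pairs such as 文分, which is the main weakness of this approach.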

SmartChineseAnalyzer: better support for Chinese, but poor extensibility. Extension dictionaries, stop-word dictionaries, and synonym dictionaries are hard to manage.

Third-party Chinese tokenizers (better Chinese support; recommended for Chinese text)

IK Analyzer: currently the more widely recommended Chinese tokenizer.
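To see why an extensible dictionary matters for Chinese, here is a simplified dictionary-based segmenter using forward maximum matching. This is a swapped-in textbook technique chosen for illustration, not IK Analyzer's actual algorithm. The dictionaries and the `forward_max_match` function are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical base dictionary, and the same dictionary with one added word.
BASE_DICTIONARY = {"中文", "分词", "搜索", "引擎"}
EXTENDED_DICTIONARY = BASE_DICTIONARY | {"分词器"}

print(forward_max_match("中文分词器", BASE_DICTIONARY))
# ['中文', '分词', '器']
print(forward_max_match("中文分词器", EXTENDED_DICTIONARY))
# ['中文', '分词器']
```

Adding a single entry to the dictionary changes the segmentation, which is why easy management of extension and stop-word dictionaries is a practical selling point.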

II. The IK Analyzer Tokenizer

Usage

PS: At search time, use the same tokenizer that was used to build the index!
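A toy inverted index shows why the tokenizer must match on both sides. Everything here is a hypothetical stand-in written for illustration (`unigrams`, `bigrams`, `build_index`, `search`), not Lucene or IK Analyzer code.

```python
def unigrams(text):
    """Split text into single characters."""
    return list(text)


def bigrams(text):
    """Split text into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs, analyze):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, analyze):
    """Return ids of documents that contain every token of the query."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = None
    for token in tokens:
        postings = index.get(token, set())
        result = set(postings) if result is None else result & postings
    return result


docs = {1: "中文分词器", 2: "搜索引擎"}
index = build_index(docs, bigrams)

print(search(index, "分词", bigrams))   # {1}
print(search(index, "分词", unigrams))  # set()
```

The index built with bigrams contains the token 分词 but not the single characters 分 and 词, so a query tokenized differently finds nothing even though the document clearly contains the word.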

Origin www.cnblogs.com/riches/p/11448059.html