I. Introduction
How tokenization works
- A Reader reads the data stream
- The first stage converts uppercase letters to lowercase
- The second stage splits the text on spaces into individual words
- The third stage removes punctuation, prepositions, and similar low-value words
- After these three filtering passes, the text has been turned into a series of tokens (lexical units)
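The three stages above can be sketched in a few lines. This is only a language-neutral illustration in Python, not Lucene's API: the `analyze` function and the `STOP_WORDS` list are made up for this example, and the stage order simply follows the description above.

```python
import string

# Hypothetical stop-word list for illustration only; real analyzers ship their own.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "is"}


def analyze(text):
    """Turn raw text into tokens using the three stages described above."""
    lowered = text.lower()      # stage 1: uppercase -> lowercase
    words = lowered.split()     # stage 2: split on whitespace
    tokens = []
    for word in words:          # stage 3: strip punctuation, drop stop words
        cleaned = word.strip(string.punctuation)
        if cleaned and cleaned not in STOP_WORDS:
            tokens.append(cleaned)
    return tokens
```

For example, `analyze("The quick Fox jumps, in the Park!")` yields `["quick", "fox", "jumps", "park"]`.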
Common analyzers (not recommended for Chinese text)
StandardAnalyzer (officially recommended): works well for English but cannot segment Chinese correctly. (It uses single-character segmentation, so Chinese text is split one character at a time.)
CJKAnalyzer (Chinese, Japanese, and Korean analyzer): bigram segmentation, which splits the text into two-character units
SmartChineseAnalyzer: better Chinese support, but poor extensibility; extension dictionaries, stop-word dictionaries, and synonym dictionaries are hard to manage
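To make the difference between single-character and bigram segmentation concrete, here is a minimal sketch in Python. It is not how StandardAnalyzer or CJKAnalyzer are implemented; `unigrams` and `bigrams` are hypothetical helpers, and the input is assumed to be one unbroken run of Chinese characters.

```python
def unigrams(text):
    """Single-character segmentation: every character becomes its own token."""
    return list(text)


def bigrams(text):
    """Bigram segmentation: every pair of adjacent characters becomes a token."""
    if len(text) < 2:
        # A lone character (or empty input) cannot form a pair.
        return list(text)
    return [text[i] + text[i + 1] for i in range(len(text) - 1)]
```

For the input `"中文分词"`, `unigrams` gives `["中", "文", "分", "词"]` and `bigrams` gives `["中文", "文分", "分词"]`. Neither knows where the real word boundaries are, which is why dictionary-based analyzers are preferred for Chinese.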
Third-party Chinese analyzers (better Chinese support; recommended for Chinese text)
IK Analyzer: currently the most commonly recommended Chinese analyzer
II. IK Analyzer
Usage
PS: the analyzer used at search time must be the same one that was used to build the index!
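Why this matters can be shown with a toy inverted index. The sketch below is plain Python, not Lucene or IK Analyzer code; `build_index`, `search`, and the two toy analyzers are hypothetical names used only for illustration.

```python
def unigram_analyzer(text):
    """Toy analyzer: one token per character."""
    return list(text)


def bigram_analyzer(text):
    """Toy analyzer: one token per pair of adjacent characters."""
    if len(text) < 2:
        return list(text)
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents that contain every token of the query."""
    tokens = analyzer(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {1: "中文分词", 2: "英文分词"}
index = build_index(docs, bigram_analyzer)
```

Here `search(index, "中文", bigram_analyzer)` finds document 1, while `search(index, "中文", unigram_analyzer)` finds nothing: the index only contains two-character tokens, so the single-character query tokens never match.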