1: An Analyzer is generally composed of three parts: character filters, a tokenizer, and token filters.
2: Components of an Analyzer: internally, an Analyzer is a pipeline:
Step 1: Character filtering
Step 2: Tokenization
Step 3: Token filtering
3: Analyzer pipeline:
(input)
-----String----->> (CharacterFilters)
-----String----->> (Tokenizer)
-----Tokens----->> (TokenFilters)
-----Tokens----->>
(output)
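The _analyze API lets you exercise this pipeline directly by assembling the three stages ad hoc. A minimal sketch (the sample text is made up):

curl -XPOST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello World</p>"
}'

Here html_strip removes the <p> tags (character filter stage), standard splits the remaining string into Hello and World (tokenizer stage), and lowercase rewrites the tokens to hello and world (token filter stage).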
=========================Example 1==========================
{
  "index": {
    "analysis": {
      "analyzer": {
        "customHTMLSnowball": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  }
}
The custom Analyzer above is named customHTMLSnowball. It does the following:
Removes HTML tags (html_strip character filter), such as <p> <a> <div>.
Splits the text into words and removes punctuation (standard tokenizer).
Converts uppercase letters to lowercase (lowercase token filter).
Filters out stop words (stop token filter), such as "the", "they", "i", "a", "an", "and".
Extracts word stems (snowball token filter; the Snowball algorithm is the most commonly used algorithm for stemming English words), for example:
cats -> cat
catty -> cat
stemmer -> stem
stemming -> stem
stemmed -> stem
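Once an index is created with these settings, the analyzer can be verified with the _analyze API. A minimal sketch, assuming the index was created as my_index (the index name and the sample text are made up):

curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "customHTMLSnowball",
  "text": "<p>The cats are stemming</p>"
}'

Expected output tokens: cat, stem. The <p> tags are stripped, "The" and "are" are dropped as stop words, and "cats" and "stemming" are stemmed.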
=========================Example 1==========================
=========================Example 2==========================
Plain ES search with pinyin search support: a custom default analyzer combining the ik_smart Chinese tokenizer with a pinyin token filter (requires the ik and pinyin plugins):
curl -XPUT "http://localhost:9200/yyyy" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": ["html_strip"],
          "filter": ["pinyin_filter", "lowercase", "stop", "ngram_1_20"]
        },
        "default_search": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": ["html_strip"]
        }
      },
      "filter": {
        "ngram_1_20": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        },
        "pinyin_filter": {
          "type": "pinyin",
          "keep_original": true,
          "keep_joined_full_pinyin": true
        }
      }
    }
  }
}'
=========================Example 2==========================
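Because the analyzer above is registered as default, text fields with no explicit analyzer become searchable by pinyin as well as by the original Chinese. A minimal sketch, assuming ES 7.x with the ik and pinyin plugins installed; the field name title and the sample document are made up, and on ES 7+ the ngram_1_20 filter additionally requires raising index.max_ngram_diff to at least 19 in the index settings:

curl -XPUT "http://localhost:9200/yyyy/_doc/1" -H 'Content-Type: application/json' -d'
{ "title": "刘德华" }'

curl -XPOST "http://localhost:9200/yyyy/_search" -H 'Content-Type: application/json' -d'
{ "query": { "match": { "title": "liudehua" } } }'

At index time pinyin_filter emits the joined full pinyin liudehua (keep_joined_full_pinyin: true) alongside the original token (keep_original: true), so the pinyin query matches; the query string itself is analyzed with default_search, which skips the pinyin and ngram filters.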