Analyzer
An analyzer is a wrapper that combines three functions into a single package; the three functions are executed in order:
- Character filters: preprocess the raw text, for example stripping HTML tags or turning "&" into "and". An analyzer may have zero or more character filters.
- Tokenizer: breaks the string into individual terms (tokens). An analyzer must have exactly one tokenizer.
- Token filters: after tokenization, the token stream passes through the specified token filters in order, for example lowercasing, removing stopwords such as "a" and "the", or adding synonyms such as jump and leap.
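The three stages can be observed directly with the `_analyze` API; the following request (a sketch using only built-in components) assembles a character filter, tokenizer, and token filter ad hoc:

```
GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<b>The QUICK Fox</b>"
}
```

The response should contain the tokens the, quick, and fox: html_strip removes the tags, standard splits on word boundaries, and lowercase normalizes the case.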
Custom analyzer
- Syntax
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": { ... custom character filters ... },
"tokenizer": { ... custom tokenizers ... },
"filter": { ... custom token filters ... },
"analyzer": { ... custom analyzers ... }
}
}
}
char_filter: character filters
tokenizer: tokenizers
filter: token filters
analyzer: analyzers
- Example:
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [ "&=> and "]
}},
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": [ "the", "a" ]
}},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [ "html_strip", "&_to_and" ],
"tokenizer": "standard",
"filter": [ "lowercase", "my_stopwords" ]
}}
}}}
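Once the index exists, the custom analyzer can be verified with the `_analyze` API (request sketch; the sample text is illustrative):

```
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The quick & brown fox"
}
```

The expected tokens are quick, and, brown, fox: "The" is removed by my_stopwords after lowercasing, and "&" is rewritten to "and" by the character filter before tokenization.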
IK Analyzer
Because Elasticsearch's default standard tokenizer is not suitable for Chinese - for example, it splits "幸福家园3期" ("Happy Home Phase 3") into the single characters ["幸","福","家","园","3","期"] - most real-world applications use a dedicated Chinese analyzer. The IK analyzer is one of them.
IK analyzer repository: https://github.com/medcl/elasticsearch-analysis-ik
Download and install IK
- Method 1: download the zip from https://github.com/medcl/elasticsearch-analysis-ik/releases and extract it into Elasticsearch's plugins directory.
- Method 2: install from the command line (versions > v5.5.1). ==Did not work locally==
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip
After installation completes, restart Elasticsearch.
Test the analyzer
GET /employee/_analyze
{
"analyzer": "ik_max_word",
"text": "幸福家园3期"
}
Response:
{
"tokens": [
{
"token": "幸福",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "家园",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "3",
"start_offset": 4,
"end_offset": 5,
"type": "ARABIC",
"position": 2
},
{
"token": "期",
"start_offset": 5,
"end_offset": 6,
"type": "COUNT",
"position": 3
}
]
}
Use the IK analyzer
Once IK is installed and tested, you can specify it as a field's analyzer when creating an index.
PUT /test
{
"mappings": {
"doc": {
"properties": {
"chinese_txt": {
"type": "text",
"analyzer": "ik_max_word",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
The IK analyzer lets users define custom local or remote dictionaries and stopword lists, and supports hot updates. For more detail, see the official IK documentation.
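As a sketch of that customization (keys and paths follow the IK README; verify them against your installed version), the dictionaries are declared in the plugin's config/IKAnalyzer.cfg.xml:

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionaries, semicolon-separated -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- local extension stopword dictionaries -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- remote dictionaries; IK polls these URLs, which enables hot updates -->
    <!-- <entry key="remote_ext_dict">http://example.com/mydict.dic</entry> -->
</properties>
```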
Higher-level queries such as the match query know the field mappings and can apply the correct analyzer to each field being queried. This behavior can be inspected with the validate-query API. In the following example, the english_title field was assigned the english analyzer at index time:
GET /my_index/my_type/_validate/query?explain
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Foxes"}},
{ "match": { "english_title": "Foxes"}}
]
}
}
}
The explanation in the response is: (title:foxes english_title:fox). The standard analyzer leaves "foxes" unchanged on the title field, while the english analyzer stems it to "fox" on english_title.
Different analyzers can be used at search time and at index time
Elasticsearch supports an optional search_analyzer mapping, which is applied only at search time.
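A minimal mapping sketch (index and field names are illustrative) that indexes with the fine-grained ik_max_word but searches with the coarser ik_smart:

```
PUT /test2
{
  "mappings": {
    "doc": {
      "properties": {
        "chinese_txt": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}
```

Indexing with ik_max_word stores every plausible sub-word for recall, while ik_smart keeps query terms coarse so searches stay precise.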