ElasticSearch study notes (4)--analyzer

Analyzer

An analyzer is a wrapper that combines three functions into a single package; the three functions are executed in order:

  • Character filters For example, stripping HTML tags or converting "&" into "and". An analyzer may have zero or more character filters.
  • Tokenizer Breaks the string into individual terms (tokens). An analyzer must have exactly one tokenizer.
  • Token filters After tokenization, the resulting token stream passes through the specified token filters in order. For example: lowercasing, removing stopwords such as "a" or "the", or adding synonyms such as "jump" and "leap".
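The three stages can be observed in isolation with the _analyze API, composing built-in components ad hoc (the sample text is my own):

```
GET /_analyze
{
    "char_filter": [ "html_strip" ],
    "tokenizer":   "standard",
    "filter":      [ "lowercase" ],
    "text":        "<p>The QUICK Brown-Foxes</p>"
}
```

The html_strip character filter removes the tags, the standard tokenizer splits the remaining text into The, QUICK, Brown, Foxes, and the lowercase token filter yields the final tokens the, quick, brown, foxes.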

Custom analyzers

  • Syntax
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

  • char_filter: custom character filters
  • tokenizer: custom tokenizers
  • filter: custom token filters
  • analyzer: custom analyzers

  • Example
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":     "mapping",
                    "mappings": [ "&=> and " ]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":      "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "my_stopwords" ]
                }
            }
        }
    }
}
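You can sanity-check the custom analyzer with the _analyze API (the sample text here is my own):

```
GET /my_index/_analyze
{
    "analyzer": "my_analyzer",
    "text":     "The quick & brown fox"
}
```

The result should be the tokens quick, and, brown, fox: "The" is removed by the my_stopwords filter, and "&" is rewritten to "and" by the &_to_and character filter before tokenization.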

IK Analyzer

Elasticsearch's default standard analyzer is not suitable for Chinese: for example, it splits "幸福家园3期" into the single characters ["幸", "福", "家", "园", "3", "期"]. In practice, therefore, most applications use a dedicated Chinese analyzer, and the IK analyzer is one of them.

IK analyzer repository: https://github.com/medcl/elasticsearch-analysis-ik

Download and install IK

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip

After installation completes, restart Elasticsearch. Note that the plugin version (v6.0.0 above) must match your Elasticsearch version.

Test the analyzer
GET /employee/_analyze
{
    "analyzer": "ik_max_word",
    "text": "幸福家园3期"
}

Response:

{
  "tokens": [
    {
      "token": "幸福",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "家园",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "3",
      "start_offset": 4,
      "end_offset": 5,
      "type": "ARABIC",
      "position": 2
    },
    {
      "token": "期",
      "start_offset": 5,
      "end_offset": 6,
      "type": "COUNT",
      "position": 3
    }
  ]
}
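IK actually ships two analyzers: ik_max_word, which produces the finest-grained (most exhaustive) segmentation, and ik_smart, which produces a coarser segmentation with fewer, longer tokens. Running the same request with ik_smart shows the difference:

```
GET /employee/_analyze
{
    "analyzer": "ik_smart",
    "text":     "幸福家园3期"
}
```

Choose ik_max_word when recall matters most (more tokens to match against) and ik_smart when you prefer precision.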
Using the IK analyzer

After verifying that IK is installed correctly, you can set it as the analyzer of a field when creating an index.

PUT /test
{
  "mappings": {
    "doc": {
      "properties": {
        "chinese_txt": {
          "type": "text",
          "analyzer": "ik_max_word",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
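With this mapping in place, full-text queries against chinese_txt are analyzed with IK as well (the document id below is just for illustration):

```
PUT /test/doc/1
{
    "chinese_txt": "幸福家园3期"
}

GET /test/_search
{
    "query": {
        "match": { "chinese_txt": "幸福" }
    }
}
```

The query string "幸福" is analyzed at search time and matches the token "幸福" produced at index time, so the document is returned.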

The IK analyzer also lets you configure custom local or remote dictionaries and stopword lists, with support for hot reloading. For details, see the official IK documentation.
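Custom dictionaries are configured in the plugin's config/IKAnalyzer.cfg.xml file. A sketch of its shape, where the file names and the URL are placeholders (see the IK README for the exact keys):

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- local extension dictionary -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- local extension stopword dictionary -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- remote extension dictionary, polled periodically for hot updates -->
    <entry key="remote_ext_dict">http://example.com/hot_words.dic</entry>
</properties>
```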

Higher-level queries such as the match query know the field mappings and apply the correct analyzer to each field being queried. You can inspect this behavior with the validate-query API. In the example below, the english_title field was mapped with the english analyzer at index time:

GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}

The explanation in the response shows how each clause was rewritten: (title:foxes english_title:fox). The standard analyzer on title only lowercases "Foxes", while the english analyzer on english_title also stems it to "fox".

Different analyzers can be used at search time and at index time

Elasticsearch supports an optional search_analyzer mapping parameter, which is applied only at search time; the analyzer parameter is then used only at index time.
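A common pattern with IK, for example, is to index with the fine-grained ik_max_word but search with the coarser ik_smart (the index name test2 is hypothetical):

```
PUT /test2
{
    "mappings": {
        "doc": {
            "properties": {
                "title": {
                    "type":            "text",
                    "analyzer":        "ik_max_word",
                    "search_analyzer": "ik_smart"
                }
            }
        }
    }
}
```

Indexing with the exhaustive segmentation maximizes the tokens available for matching, while the coarser search-time segmentation keeps queries from matching on overly fragmented terms.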
