【Elasticsearch】Elasticsearch analyzer


1. Overview

What is analysis?

Analysis is the process Elasticsearch applies to the body of a document before the document is added to the inverted index. Before indexing a document, Elasticsearch runs each analyzed field through several steps:

  1. Character filtering: use character filters to transform the characters of the text
  2. Breaking text into tokens: split the text into one or more tokens
  3. Token filtering: use token filters to transform each token
  4. Token indexing: store the resulting tokens in the inverted index

Next we will discuss each step in more detail, but first let us look at the whole process summarized in a diagram. Figure 5.1 shows the text "share your experience with NoSql & big data technologies" being analyzed into the tokens share, your, experience, with, nosql, big, data, and technologies.
[Figure 5.1: the analysis process: character filter, standard tokenizer, token filter]

The figure shows a custom analyzer made up of a character filter, the standard tokenizer, and a token filter. It succinctly captures the basic components of an analyzer and what each part does.

Whenever a document is ingested, it goes through the following steps before it is finally written into the Elasticsearch index:

The middle part of this pipeline is the analyzer. It consists of three parts: char filters, a tokenizer, and token filters. Their functions are as follows:

  1. Char filter: character filters perform cleanup tasks, such as stripping HTML tags or converting "&" into the string "and"
  2. Tokenizer: next, the text is split into terms called tokens. This is done by the tokenizer, and the split can follow any rule (such as whitespace). For more detail about tokenizers, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
  3. Token filter: once the tokens are created, they are passed to token filters, which normalize them. A token filter can change tokens, delete them, or add new ones. A sketch combining all three parts follows this list.
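
As a sketch of how the three parts chain together (the index name my-index, the analyzer name my_custom_analyzer, and the char filter name and_mapping are made up for illustration), the following settings combine the built-in html_strip character filter, a mapping character filter that rewrites "&" to "and", the standard tokenizer, and the lowercase token filter:

PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_mapping": {
          "type": "mapping",
          "mappings": [ "& => and" ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip", "and_mapping" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}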

2. When analysis happens

An analyzer decomposes an input character stream into tokens. This generally happens on two occasions:

  1. At index time, when a document is indexed
  2. At search time, when the words being searched for are analyzed; a field can even use different analyzers for the two occasions, as sketched below
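
A minimal sketch of controlling this per field (the index name my-index-2 and the field title are hypothetical): the analyzer mapping parameter applies at index time, and search_analyzer overrides it at search time:

PUT /my-index-2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}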


3. Using the out-of-the-box analyzers

Elasticsearch already provides a rich set of out-of-the-box analyzers. We can create our own custom analyzers, or even recombine existing char filters, tokenizers, and token filters into a new analyzer, and define an analyzer for each field of a document. If you are interested in analyzers, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html.

By default, Elasticsearch uses the standard analyzer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html), which:

  1. Has no char filter
  2. Uses the standard tokenizer
  3. Lowercases the tokens and can optionally remove stop words. By default its stop-word list is _none_, that is, no stop words are filtered; a sketch of overriding this follows below.
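
A minimal sketch of enabling stop-word removal (the index name my-index-3 and the analyzer name std_english are made up): a standard-type analyzer configured with the built-in English stop-word list:

PUT /my-index-3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}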

Generally speaking, an analyzer can be divided into the following parts:

zero or more character filters
exactly one tokenizer
zero or more token filters


4. The Analyze API

GET /_analyze
POST /_analyze
GET /<index>/_analyze
POST /<index>/_analyze
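
The first two forms run a named (or the default) analyzer; the index-scoped forms can also test the analyzer mapped to a specific field. A sketch of the field form, assuming a hypothetical index my-index with a text field title:

GET /my-index/_analyze
{
  "field": "title",
  "text": "Quick Brown Foxes!"
}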

Use the _analyze API to test how an analyzer parses a string, for example:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Quick Brown Foxes!"
}

The response:

{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "brown",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "foxes",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

Here the standard analyzer breaks our string into three lowercased tokens and reports the offset and position of each.
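
The _analyze API can also test an ad-hoc combination of components without defining an analyzer first; a sketch (the HTML markup in the input is just for illustration):

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<b>Quick Brown Foxes!</b>"
}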

Another example uses ik_smart, an analyzer provided by the community IK Chinese-analysis plugin (https://github.com/medcl/elasticsearch-analysis-ik); note that the plugin must be installed separately before these requests will work:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是社会主义接班人"
}

# Result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "社会主义",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "接班人",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

Compare this with ik_max_word, which produces the most fine-grained segmentation and emits overlapping tokens:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是社会主义接班人"
}

# Result

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "社会主义",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "社会",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "主义",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "接班人",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "接班",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "人",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 7
    }
  ]
}
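
As the two outputs show, ik_smart yields one coarse segmentation while ik_max_word also emits the overlapping sub-words 社会, 主义, and 接班. A common pattern, sketched here with a hypothetical index zh-articles, is to index with ik_max_word for maximum recall and analyze queries with ik_smart:

PUT /zh-articles
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}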

Reprinted: https://www.cnblogs.com/sanduzxcvbnm/p/12084607.html
