Elasticsearch regular expression queries on Chinese strings return an empty result set: why ES regexp searches find nothing

Problem Description

When I was testing regular expression syntax against ES, I found that Chinese strings were never matched; the only expressions that returned results were loose patterns such as .*. This puzzled me.
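
For concreteness, here is a minimal sketch of the kind of request that kept coming back empty. The index name my_index, the field name content, and the pattern are made up for illustration; the field is assumed to be a text field using the default analyzer:

GET my_index/_search
{
    "query": {
        "regexp": {
            "content": ".*美丽.*"    // any pattern spanning more than one Chinese character returns zero hits
        }
    }
}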

Cause Analysis

The cause is tied to a core principle of Elasticsearch: the inverted index. What does that mean? When we store a sentence, the analyzer in ES splits it into tokens, and it is these tokens that are stored in the inverted index.

  • For example, take the sentence "我真的好美丽" ("I am really beautiful"): what will the analyzer turn it into?

    • That depends on which analyzer is chosen. With the ES default standard analyzer, the sentence is tokenized as follows:

      {
          "tokens": [
              {
                  "token": "我",
                  "start_offset": 0,
                  "end_offset": 1,
                  "type": "<IDEOGRAPHIC>",
                  "position": 0
              },
              {
                  "token": "真",
                  "start_offset": 1,
                  "end_offset": 2,
                  "type": "<IDEOGRAPHIC>",
                  "position": 1
              },
              {
                  "token": "的",
                  "start_offset": 2,
                  "end_offset": 3,
                  "type": "<IDEOGRAPHIC>",
                  "position": 2
              },
              {
                  "token": "好",
                  "start_offset": 3,
                  "end_offset": 4,
                  "type": "<IDEOGRAPHIC>",
                  "position": 3
              },
              {
                  "token": "美",
                  "start_offset": 4,
                  "end_offset": 5,
                  "type": "<IDEOGRAPHIC>",
                  "position": 4
              },
              {
                  "token": "丽",
                  "start_offset": 5,
                  "end_offset": 6,
                  "type": "<IDEOGRAPHIC>",
                  "position": 5
              }
          ]
      }
      

      You will find that every token is a single character! (The _analyze request shown after this list reproduces this output.)

  • With these tokens, ES builds its inverted index. When you then use the regexp query syntax to search strictly for "我真的好美丽", the result is an empty set. Why? A regular expression query is matched against the terms stored in the inverted index, not against the original sentence, and the pattern has to match an entire term. Since the index contains only single Chinese characters, no term can satisfy a multi-character pattern; ES searches through the terms, finds nothing, and returns an empty set.
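
The original post does not show the request that produced the listing above, but it can be reproduced with the _analyze API and the default standard analyzer:

GET _analyze
{
    "analyzer": "standard",
    "text": "我真的好美丽"
}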

Solution

The fix is to rebuild the index and specify which analyzer to use when creating it.

PUT website
{
    "mappings": {
        "properties": {
            "user_id": {
                "type": "text",
                "analyzer": "ik_max_word",      // IK offers two strategies, ik_max_word and ik_smart; this can be omitted, in which case the default analyzer is used
                "search_analyzer": "standard"   // analyzer used at query time
            },
            "name": {
                "type": "text",
                "analyzer": "english"
            },
            "age": {
                "type": "integer"
            },
            "sex": {
                "type": "keyword"
            },
            "birthday": {
                "type": "date",
                "format": "strict_date_optional_time||epoch_millis"
            },
            "address": {
                "type": "text",
                "index": false                  // not indexed, so the field is not analyzed or searchable
            }
        }
    }
}
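
To have something to query later, we can index a test document into the new index. The document ID and field value below are made up for illustration:

PUT website/_doc/1
{
    "user_id": "我真的好美丽"
}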

With this mapping we have switched to the ik_max_word analyzer. Let's look at how it tokenizes the same sentence.
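
The listing below can be reproduced with a request like the following. This assumes the IK analysis plugin is installed, since ik_max_word is provided by that plugin rather than by core Elasticsearch (the exact tokens may vary with the IK version and dictionary):

GET _analyze
{
    "analyzer": "ik_max_word",
    "text": "我真的好美丽"
}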

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "真的",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "好美",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "美丽",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}

Some readers may ask: doesn't this still leave our problem unsolved? These tokens are still not the complete sentence we need to match against.

That is true, so we need to find an analyzer that fits the requirement, or even build one ourselves.

For this example, we can use the simple analyzer, which is a built-in ES analyzer. It only splits text on non-letter characters, so a run of Chinese characters is kept as a single token rather than being broken apart, which meets our needs.
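
Here is a sketch of that approach. The index name website2 and the field name content are hypothetical; the _analyze call checks the tokenization, and the final regexp query should now find the document because the whole sentence is stored as a single term:

GET _analyze
{
    "analyzer": "simple",
    "text": "我真的好美丽"
}
// Expected: one token, "我真的好美丽", since the simple analyzer only splits on non-letter characters

PUT website2
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "simple"
            }
        }
    }
}

PUT website2/_doc/1?refresh
{
    "content": "我真的好美丽"
}

GET website2/_search
{
    "query": {
        "regexp": {
            "content": ".*好美丽"    // matches, because the entire sentence is one term in the inverted index
        }
    }
}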

Recommended reading:

  1. Understand what an analyzer is and which analyzers are available
  2. What a mapping is

Original post: blog.csdn.net/Zilong0128/article/details/120954153