Let's talk about a real-world Elasticsearch problem: trading space for time...

1. A practical problem from production

A question from a community member: does the slop value affect search results on an ngram-analyzed field?

1. Preconditions:

  • The SPUCodeText of product A is: OWBB050C99JER0021001

  • The SPUCodeText of product B is: VSA00293ABBLACKFW2022

  • The SPUCodeText of product C is: 2WHGG0VNT03HHFC99FW2022

2. Current situation: searching SPUCodeText for the prefix of product A's code, OWBB050, the product cannot be found when slop is set to 49-54; product A is only found when slop is set to 55 or above.

3. Goal: searching SPUCodeText with any combination of 4 or more characters should find the product.

For space reasons, the DSL definitions and query statements are omitted here.

— Source of the question: the Dead Elasticsearch Knowledge Planet https://t.zsxq.com/08rmVBnhA

2. Interpreting the problem

The premise: product codes are stored much like the phone numbers we covered in an earlier video, and conventional analyzers (the default standard analyzer, the Chinese ik_max_word analyzer, etc.) cannot handle them.

They require a custom ngram analyzer.

And that raises the problem: on ngram-analyzed data, match_phrase + slop retrieval misbehaves, and slop must be set to a very large value to make it work.

What causes this? Is there a cleaner approach?

3. Trading space for time in Elasticsearch

What does "space for time" mean? The ongoing World Cup offers an analogy.

As a commentator put it: "with 15 players we can win." Fifteen players would be 4 more than the normal 11, meaning extra space (resources) traded for time or results. Of course, the actual match turned out nothing like the commentator said.


Ngram tokenization in Elasticsearch is, in essence, a space-for-time trade. Splitting documents at a very fine granularity inflates storage and slows down indexing, but in exchange retrieval becomes much more efficient.

4. Implementation after simplifying the problem

PUT /products-001
{
  "settings": {
     "max_ngram_diff": 40,
      "analysis": {
        "analyzer": {
          "ruishan_ngram_analyzer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "ruishan_ngram_tokenizer"
          }
        },
        "tokenizer": {
          "ruishan_ngram_tokenizer": {
            "token_chars": [
              "letter",
              "digit"
            ],
            "min_gram": 3,
            "type": "ngram",
            "max_gram": 40
          }
        }
      }
    },
  "mappings": {
     "properties" : {
        "id" : {
          "type" : "keyword"
        },
        "sPUCodeText" : {
          "type" : "text",
          "analyzer" : "ruishan_ngram_analyzer"
        }
      }
  }
}

PUT products-001/_bulk
{"index":{"_id":1}}
{"id":1,"sPUCodeText":"OWBB050C99JER0021001"}
{"index":{"_id":2}}
{"id":2,"sPUCodeText":"VSA00293ABBLACKFW2022"}
{"index":{"_id":3}}
{"id":3,"sPUCodeText":"2WHGG0VNT03HHFC99FW2022"}

GET products-001/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "sPUCodeText": {
              "query": "OWBB050"
            }
          }
        }
      ]
    }
  }
}

As the results below show, a plain match query does the job.

[Screenshot: the match query recalls product A]

Next: does match_phrase work?

[Screenshot: the match_phrase query returns no hits]

What about match_phrase with a larger slop? Will that work?

Repeated tests show that slop must be set to at least 52 to make it work, as shown below.

[Screenshot: match_phrase with slop 52 recalls the document]
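For reference, the query implied by the tests above is roughly the following (a sketch; the index and field names come from the earlier definition, and the slop value is the one found by testing):

```json
GET products-001/_search
{
  "query": {
    "match_phrase": {
      "sPUCodeText": {
        "query": "OWBB050",
        "slop": 52
      }
    }
  }
}
```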

Why? Why exactly 52?

5. What is match_phrase matching really doing?

In plain terms: the tokens produced from the query string (e.g. the prefix OWBB050) must appear with exactly the same order and positions as the corresponding tokens in the document (e.g. OWBB050C99JER0021001)!

You can inspect the tokenization results via the _analyze API, as follows:

POST products-001/_analyze
{
  "field": "sPUCodeText",
  "text": ["OWBB050C99JER0021001"]
}

[Screenshot: _analyze output for OWBB050C99JER0021001]
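To produce the query-side token list for comparison, the same _analyze call can be run on the query string:

```json
POST products-001/_analyze
{
  "field": "sPUCodeText",
  "text": ["OWBB050"]
}
```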

In the figure below, the tokens produced for "OWBB050" are shown on the left, and those for "OWBB050C99JER0021001" on the right.

[Screenshot: side-by-side token lists for the query string and the document]

The two token streams do not line up, and that positional deviation is why the phrase fails to match!

6. What the slop parameter really does in match_phrase

One picture makes it clear!

[Diagram: aligning query tokens with document tokens; matching colors mark identical terms]

Identical colors indicate that a query token matches a token in the source document.

The maximum positional gap is computed like this: the token "050", for example, is the 15th token of the query string, but the 67th token of the source document "OWBB050C99JER0021001".

Difference: 67 - 15 = 52.

So a slop of 52 bridges the largest gap, and the query can recall the data!

With slop set to 51 it still fails; only 52 or more recalls the data.

[Screenshot: verification of the slop threshold]
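To see where 15, 67, and therefore 52 come from, here is a small Python sketch (not the Elasticsearch source code) that reproduces the emission order of the ngram tokenizer under the settings above: for each start offset, grams of increasing length, one position each.

```python
# Sketch: reproduce the ngram tokenizer's token ordering
# (min_gram=3, max_gram=40) to compute a term's rank in the stream.

def ngram_rank(text, term, min_gram=3, max_gram=40):
    """Return the 1-based rank of `term` in the ngram token stream of `text`."""
    rank = 0
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for length in range(min_gram, longest + 1):
            rank += 1
            if text[start:start + length] == term:
                return rank
    return None

# "050" is the 15th token of the query string and the 67th of the document:
q = ngram_rank("owbb050", "050")
d = ngram_rank("owbb050c99jer0021001", "050")
print(q, d, d - q)  # 15 67 52
```

The gap of 52 computed here is exactly the minimum slop found by testing.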

7. Summary

With ngram-style tokenization, we have already paid the price at the space level, so there is no need to pay again at query time!

A plain match query will retrieve the result directly!

[Screenshot: the match query rewritten in filter context]

Queries in filter context can be cached, so this form is recommended.
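A sketch of the filter-context form (assuming the same index and field as above; filter clauses skip scoring and are eligible for caching):

```json
GET products-001/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "sPUCodeText": "OWBB050"
          }
        }
      ]
    }
  }
}
```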

So, is there a faster way to write it?

[Screenshot: a term query version]

Careful readers will notice that a term query only recalls the data when "OWBB05" is lowercased to "owbb05"; querying with the uppercase form directly via term recalls nothing!
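Roughly what that experiment looks like (a sketch, using the same index as above):

```json
GET products-001/_search
{
  "query": {
    "term": {
      "sPUCodeText": "owbb05"
    }
  }
}
```

Replacing "owbb05" with "OWBB05" in this query returns no hits.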

Why? I'll leave that as a question for you to think about!


recommended reading

  1. Elasticsearch 8.X from 0 to 1: a complete video walkthrough

  2. Dead Elasticsearch 8.X methodology cognition checklist (2022 National Day update)

  3. How to learn Elasticsearch systematically?



Source: blog.csdn.net/wojiushiwo987/article/details/128246293