How does Elasticsearch limit the score between 0 and 1?

The topic of this article comes from a WeChat group discussion.


In Elasticsearch, scoring is performed during the query phase to judge how relevant each document is to the query.

The default scoring mechanism is BM25, but you can also customize scoring (for example, with a function_score or script_score query). However, if you want to limit the score to a range between 0 and 1, you need to use a script in the query to do so.

Elasticsearch scores are primarily concerned with relevance ranking, not exact score values. So if you want scores to map proportionally into the range 0 to 1, you need some form of normalization or scaling. This is not a built-in feature of Elasticsearch; you have to implement it yourself.

1. What is normalization?

When we talk about "normalization", we mean transforming a dataset onto a shared, standard scale or range. This is very common in data analysis and machine learning because it lets us make fair comparisons between different datasets.


For example, suppose you have two datasets, one with people's height in centimeters and one with people's weight in kilograms. The two datasets have different ranges and units; if we compare them directly, it is difficult to draw meaningful conclusions. However, if we normalize both to be between 0 and 1, we can compare and understand the two datasets more easily.

A common normalization method is to use Min-Max Normalization. We will use the following formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_max represents the maximum value and X_min the minimum value. Note that when new data arrives, the maximum or minimum may change; at that point, X_max and X_min in the formula must be recomputed to avoid errors.

Reference: https: //www.cupoy.com/collection/0000018008CD5D70000000000000000000000000000463656C6561736355/00000181709BCC8f00000563706F7956C656C656C656C656C656C656C656C 173654349
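The formula above can be sketched in a few lines of Python. The function name and the sample height data are purely illustrative:

```python
def min_max_normalize(values):
    """Scale a list of numbers into [0, 1] using Min-Max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

heights_cm = [150, 160, 170, 180, 190]
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After normalization, heights and weights land on the same 0-1 scale, so the two datasets can be compared directly.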

2. Elasticsearch normalization

In this Elasticsearch case, we are talking about how to normalize the score (_score) between 0 and 1.

By default, Elasticsearch scores can vary widely depending on many factors, such as the complexity of the query, the number of documents, and so on. If we want to compare and understand these scores more easily, we can normalize them so that they all fall between 0 and 1.

In short, normalization is to transform data into a uniform range so that we can compare and understand it more easily.

The normalization method depends on knowing the upper and lower bounds of the score range, or being willing to accept some approximation. One possible approach is to first run a query to obtain the highest and lowest scores, and then use these values to normalize the scores of subsequent queries.

However, note that this method may produce inconsistent results: Elasticsearch's scoring mechanism considers various factors (such as tf-idf, field length, etc.), so the highest and lowest scores will vary from query to query.

Therefore, normalized scoring is a complex task in Elasticsearch that may need to be handled at the query level and/or application level. If you are designing a system that maps scores proportionally between 0 and 1, you may need to reconsider whether Elasticsearch's scoring mechanism is the best fit, or you may need to find other ways to supplement or replace Elasticsearch's scoring.
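One way to handle this at the application level is to normalize the `_score` values of a result set after Elasticsearch returns it. A minimal sketch, assuming hits in the standard response shape (the `normalized_score` field name is my own invention):

```python
def normalize_hits(hits):
    """Min-Max normalize the _score of each hit into [0, 1], in application code."""
    scores = [h["_score"] for h in hits]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [
        # keep the original hit fields, attach a normalized score
        {**h, "normalized_score": (h["_score"] - lo) / span if span else 1.0}
        for h in hits
    ]

hits = [
    {"_id": "1", "_score": 4.47},
    {"_id": "2", "_score": 4.10},
    {"_id": "3", "_score": 3.73},
]
for h in normalize_hits(hits):
    print(h["_id"], round(h["normalized_score"], 2))
```

This normalizes within a single result page only, so the top hit of every query always maps to 1.0; that is a different trade-off from using a fixed global min/max.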

3. Elasticsearch 8.X score normalization

If you want to map Elasticsearch scores proportionally between 0 and 1, you first need to know the range of possible scores. This may require you to first perform a query to find the highest and lowest possible score. Below is a simple example. First, we do a query to find the scoring range:

GET /your_index/_search
{
  "query": { "match_all": {} },
  "size": 1,
  "sort": [ { "_score": "desc" } ]
}

This query returns the document with the highest score; the _score field in the response is the maximum. You can find the lowest score by changing the sort direction to "asc". You can then use these values for normalization.

Assuming you have found the highest score max_score and the lowest score min_score, you can use a script in the query to do the normalization:

{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": {
          "source": "(_score - params.min) / (params.max - params.min)",
          "params": {
            "max": max_score,
            "min": min_score
          }
        }
      }
    }
  }
}

In this query, the script normalizes the raw score (_score) to between 0 and 1. Note that you need to replace max_score and min_score with the values found in the previous queries.

Note that this is just a simple example, and the approach has limitations. For instance, the highest and lowest scores may change as the index is updated, so you may need to refresh these values periodically, or compute them on every query, which can hurt query performance.

Also, this script assumes the score always lies between min_score and max_score. If a new document or query produces a score outside this range, the script will return a value below 0 or above 1.
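To guard against out-of-range scores, the normalized value can be clamped into [0, 1]. A sketch of the clamped formula in Python (the same clamping could be expressed inside the Painless script with Math.max/Math.min):

```python
def normalize_clamped(score, min_score, max_score):
    """Min-Max normalize a score, then clamp the result into [0, 1]."""
    norm = (score - min_score) / (max_score - min_score)
    return max(0.0, min(1.0, norm))

# using the bounds found earlier in the article
print(normalize_clamped(4.4682097, 3.731265, 4.4682097))  # 1.0
print(normalize_clamped(5.2, 3.731265, 4.4682097))        # above max, clamped to 1.0
print(normalize_clamped(3.1, 3.731265, 4.4682097))        # below min, clamped to 0.0
```

Clamping trades a little fidelity at the extremes for a hard guarantee that every returned score stays inside the 0-1 range.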

When using this method, you need to consider these limitations and adjust according to your actual situation.

4. Elasticsearch 8.X normalization practice

Next, we demonstrate this process through a practical operation example.

4.1 Get the maximum score

POST kibana_sample_data_ecommerce/_search
{
  "_source": [""],
  "query": {
    "match": {
      "customer_full_name": "Underwood"
    }
  },
  "size": 10,
  "sort": [
    {
      "_score": "desc"
    }
  ]
}

The highest score returned is 4.4682097.

4.2 Get the minimum score

POST kibana_sample_data_ecommerce/_search
{
  "_source": [""],
  "query": {
    "match": {
      "customer_full_name": "Underwood"
    }
  },
  "size": 10,
  "sort": [
    {
      "_score": "asc"
    }
  ]
}

The lowest score returned is 3.731265.

4.3 Normalize scores into the 0-1 range

POST kibana_sample_data_ecommerce/_search
{
  "from": 0,
  "size": 10,
  "_source": [
    ""
  ],
  "sort": [
    {
      "_score": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "script_score": {
      "query": {
        "match": {
          "customer_full_name": "Underwood"
        }
      },
      "script": {
        "source": "(_score - params.min) / (params.max - params.min)",
        "params": {
          "max": 4.4682097,
          "min": 3.731265
        }
      }
    }
  }
}

Through these steps, we can map the scores proportionally between 0 and 1 in Elasticsearch.
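As a sanity check, the script expression from step 4.3 can be replayed in Python with the scores obtained in 4.1 and 4.2. This is an offline check of the arithmetic, not something Elasticsearch runs:

```python
# bounds obtained from the queries in sections 4.1 and 4.2
max_score, min_score = 4.4682097, 3.731265

def painless_equiv(score):
    """Same expression as the Painless script: (_score - min) / (max - min)."""
    return (score - min_score) / (max_score - min_score)

print(painless_equiv(max_score))  # 1.0 -> the best-matching document
print(painless_equiv(min_score))  # 0.0 -> the worst-matching document
print(0.0 <= painless_equiv(4.1) <= 1.0)  # True -> in-range scores stay in [0, 1]
```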


However, this method has its limitations and challenges, and needs to be adjusted and optimized according to the actual situation.

5. Summary

This article discusses in detail how to implement score normalization in Elasticsearch.

This involves obtaining the highest and lowest scores and then normalizing through a script in the query. While this method can map scores proportionally into the 0-1 range, it has limitations: the score range changes as the index is updated, and new documents or queries may produce scores outside the preset range.

Therefore, although this article gives concrete worked examples, in real applications you will need to adjust and optimize them for your specific situation.

Recommended reading

  1. Exclusive release! Elasticsearch 8.X video course, from 0 to 1

  2. Heavyweight: the "Die-hard Elasticsearch 8.X" methodology knowledge list

  3. How to learn Elasticsearch systematically?

  4. 2023, do something

  5. In-depth: breaking down Elasticsearch BM25 scoring details step by step

  6. Hands-on: N ways to do Elasticsearch custom scoring


Get more quality content in less time!

Grow together with nearly 2000 Elastic enthusiasts around the world!

In the era of large models, get a head start on advanced content!


Origin blog.csdn.net/wojiushiwo987/article/details/131255445