The topic of this article comes from a WeChat group discussion.
In Elasticsearch, scoring is usually done during the query phase to judge how relevant each document is to the query.
The default scoring algorithm is BM25, but you can also customize scoring, for example with a function_score query. However, if you want to limit the score to the range 0 to 1, you need to use a script in the query to do so.
Elasticsearch scores are primarily concerned with relevance ranking, not exact score values, so if you want Elasticsearch scores to map proportionally between 0 and 1, you will need some form of normalization or scaling. This is not a built-in feature of Elasticsearch; you have to implement it yourself.
1. What is normalization?
When we talk about "normalization", we mean transforming a dataset onto a common scale or range. This is very common in data analysis and machine learning, as it lets us make fair comparisons between different datasets.
For example, suppose you have two datasets, one with people's heights in centimeters and one with people's weights in kilograms. The two datasets have different ranges and units, so comparing them directly yields little insight. However, if we normalize both to values between 0 and 1, we can compare and understand them much more easily.
A common normalization method is Min-Max normalization, which uses the following formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_max is the maximum value in the dataset and X_min is the minimum. Note that when new data arrives, the maximum or minimum may change; X_max and X_min in the formula must then be recomputed to avoid errors.
Reference: https://www.cupoy.com/collection/0000018008CD5D70000000000000000000000000000463656C6561736355/00000181709BCC8f00000563706F7956C656C656C656C656C656C656C656C173654349
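The Min-Max formula above can be sketched in a few lines of Python (the sample height values are made up for illustration):

```python
def min_max_normalize(values):
    """Scale a list of numbers into the 0-1 range using Min-Max normalization."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # All values are identical: the formula would divide by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

# Example: heights in centimeters
heights = [150, 160, 170, 180, 190]
print(min_max_normalize(heights))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Note the guard for the degenerate case where all values are equal; the same division-by-zero risk exists in the Elasticsearch script shown later if max and min happen to coincide.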
2. Elasticsearch normalization
In this Elasticsearch case, we are talking about how to normalize the score (_score) between 0 and 1.
By default, Elasticsearch scores can vary widely, depending on many factors such as query complexity and the number of documents. If we want to compare and understand these scores more easily, we can normalize them so that all scores fall between 0 and 1.
In short, normalization is to transform data into a uniform range so that we can compare and understand it more easily.
Normalization requires that you know the upper and lower bounds of the score range, or that you are willing to accept an approximation. One possible approach is to first run a query to obtain the highest and lowest scores, and then use these values to normalize the scores of subsequent queries.
However, note that this method can produce inconsistent results: Elasticsearch's scoring mechanism considers many factors (such as tf-idf and field length), so the highest and lowest scores will vary from query to query.
Score normalization is therefore a complex task in Elasticsearch that may need to be handled at the query level and/or the application level. If you are designing a system that depends on scores mapped proportionally between 0 and 1, you may want to reconsider whether Elasticsearch's scoring mechanism is the best fit, or find other ways to supplement or replace it.
3. Elasticsearch 8.X score normalization
If you want to map Elasticsearch scores proportionally between 0 and 1, you first need to know the range of possible scores. This may require you to first perform a query to find the highest and lowest possible score. Below is a simple example. First, we do a query to find the scoring range:
GET /your_index/_search
{
"query": { "match_all": {} },
"size": 1,
"sort": [ { "_score": "desc" } ]
}
This query returns the document with the highest score; the _score field in the result is the highest score. You can find the lowest score by changing the sort direction to "asc". You can then use these two values for normalization. (Note that a match_all query assigns every document the same constant score; in practice you would run your actual relevance query here, as in the hands-on example in section 4.)
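In application code, the highest score can be read from the first hit of the response body. A minimal sketch, where the response dict is a made-up example of the JSON shape Elasticsearch returns:

```python
def top_score(search_response):
    """Extract the _score of the first hit from an Elasticsearch search response body."""
    hits = search_response["hits"]["hits"]
    return hits[0]["_score"] if hits else None

# Hypothetical response body for the sort=desc query above
response = {"hits": {"hits": [{"_id": "1", "_score": 4.4682097}]}}
print(top_score(response))  # 4.4682097
```

Running the same query with "sort": [{"_score": "asc"}] and passing the response through the same helper yields the lowest score.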
Assuming you have found the highest score max_score and the lowest score min_score, you can use a script in the query to do the normalization:
{
"query": {
"function_score": {
"query": { "match_all": {} },
"script_score": {
"script": {
"source": "(_score - params.min) / (params.max - params.min)",
"params": {
"max": max_score,
"min": min_score
}
}
}
}
}
}
In this query, we use a script that normalizes the raw score (_score) to between 0 and 1. Note that you need to replace max_score and min_score with the values found in the previous queries.
Note that this is just a simple example and the approach has limitations. For example, the highest and lowest scores may change as the index is updated, so you may need to refresh these values periodically, or recompute them on every query, which affects query performance.
Also, the script assumes that every score falls between min_score and max_score. If a new document or query produces a score outside this range, the script will return a value less than 0 or greater than 1.
When using this method, you need to consider these limitations and adjust according to your actual situation.
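One way to guard against out-of-range scores is to clamp the normalized value into [0, 1]. A sketch in Python, using the sample scores from section 4 below (the same clamping could be expressed inside the Painless script with Math.min/Math.max):

```python
def normalize_clamped(score, min_score, max_score):
    """Min-Max normalize a score, clamping the result into the [0, 1] range."""
    norm = (score - min_score) / (max_score - min_score)
    return max(0.0, min(1.0, norm))

print(normalize_clamped(5.0, 3.731265, 4.4682097))  # 1.0 (score above the known max)
print(normalize_clamped(3.0, 3.731265, 4.4682097))  # 0.0 (score below the known min)
```

Clamping trades accuracy at the extremes for a guaranteed output range, which is usually the safer behavior for downstream consumers.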
4. Elasticsearch 8.X normalization practice
Next, we demonstrate this process through a practical operation example.
4.1 Get the maximum score
POST kibana_sample_data_ecommerce/_search
{
"_source": [""],
"query": {
"match": {
"customer_full_name": "Underwood"
}
},
"size": 10,
"sort": [
{
"_score": "desc"
}
]
}
Get the result: 4.4682097.
4.2 Get the minimum score
POST kibana_sample_data_ecommerce/_search
{
"_source": [""],
"query": {
"match": {
"customer_full_name": "Underwood"
}
},
"size": 10,
"sort": [
{
"_score": "asc"
}
]
}
Get the result: 3.731265.
4.3 Normalize the score to between 0 and 1
POST kibana_sample_data_ecommerce/_search
{
"from": 0,
"size": 10,
"_source": [
""
],
"sort": [
{
"_score": {
"order": "asc"
}
}
],
"query": {
"script_score": {
"query": {
"match": {
"customer_full_name": "Underwood"
}
},
"script": {
"source": "(_score - params.min) / (params.max - params.min)",
"params": {
"max": 4.4682097,
"min": 3.731265
}
}
}
}
}
Through these steps, we can map scores proportionally between 0 and 1 in Elasticsearch.
However, this method has its limitations and challenges, and needs to be adjusted and optimized for your actual situation.
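As a sanity check, we can reproduce the script's arithmetic outside Elasticsearch with the two measured values (4.4682097 and 3.731265):

```python
max_score, min_score = 4.4682097, 3.731265

def normalize(score):
    # Same formula as the Painless script:
    # (_score - params.min) / (params.max - params.min)
    return (score - min_score) / (max_score - min_score)

print(normalize(max_score))  # 1.0 -> the top-scoring document
print(normalize(min_score))  # 0.0 -> the lowest-scoring document
```

The highest-scoring document maps exactly to 1 and the lowest to 0, confirming that the endpoints of the measured range anchor the normalized scale.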
5. Summary
This article discusses in detail how to implement score normalization in Elasticsearch.
This involves obtaining the highest and lowest scores and then normalizing with a script in the query. While this method works for mapping scores proportionally between 0 and 1, it has limitations: the score range changes as the index is updated, and new documents or queries may produce scores outside the preset range.
Therefore, although specific operation examples are given in this article, in actual applications, users need to flexibly adjust and optimize according to specific situations.
Recommended reading
First release across the web! Elasticsearch 8.X video course, from 0 to 1
Heavyweight | Elasticsearch 8.X methodology cognition checklist
Hands-on | Breaking down Elasticsearch BM25 scoring details step by step