Precision problems and solutions in Elasticsearch 8.X aggregation queries

1. Online environment problems

Student Gupao asked: "I was running a test while reading the runtime field documentation. When I use an avg aggregation, the result is not exact, whether the field is double or long. How is this solved in a production environment?"


2. Problem classification and occurrence scenarios

The problem above can be classified as a precision problem in Elasticsearch aggregation queries.

In day-to-day data work, we often use Elasticsearch for large-scale querying, statistics, and aggregation. Elasticsearch delivers excellent search performance in practice, but some aggregation operations, such as averaging (avg), can return results whose precision is slightly off.

Below, we walk through the scenarios where this problem occurs, its likely causes, and its solutions.

In Elasticsearch, precision issues mainly arise in aggregation operations. For example, when we run large-number calculations such as sum or avg, the numeric type involved (double or long) can lead to precision problems. To improve performance and efficiency, Elasticsearch performs these aggregation calculations with floating-point arithmetic, and floating-point arithmetic tends to lose some precision when accumulating values.
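The underlying effect is easy to reproduce with ordinary double arithmetic, independent of Elasticsearch. The following minimal Java snippet (an illustration only, not Elasticsearch code) shows the classic kind of binary floating-point drift that can also surface when a sum or avg aggregation accumulates values:

public class DoublePrecisionDemo {
    public static void main(String[] args) {
        // Neither 0.1 nor 0.2 is exactly representable as a binary double,
        // so even a single addition drifts away from the decimal result.
        System.out.println(0.1 + 0.2);   // prints 0.30000000000000004

        // Accumulating many values compounds the rounding error,
        // which is exactly what a sum/avg over many documents does.
        double total = 0.0;
        for (int i = 0; i < 10; i++) {
            total += 0.1;
        }
        System.out.println(total);       // prints 0.9999999999999999, not 1.0
    }
}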

3. Minimal reproduction of the problem

Let's illustrate the problem with a simple example. We have some product data stored in Elasticsearch and want to calculate the average price of all products.

The data and query DSL are as follows (verified in an Elasticsearch 8.X environment):

  • Data:

POST /product/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "商品1", "price" : 1234.56 }
{ "index" : { "_id" : "2" } }
{ "name" : "商品2", "price" : 7890.12 }
  • Query DSL:

GET /product/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}

Although we expect the average price to be (1234.56 + 7890.12) / 2 = 4562.34, floating-point arithmetic can make the returned result deviate slightly, as shown in the figure below.

[Figure: avg_price aggregation result deviating slightly from 4562.34]

4. Solution discussion and implementation

How can we resolve this post-aggregation precision problem? Combining Elasticsearch fundamentals with practical experience, we offer the following three solutions.

  • Solution 1: Use the scaled_float type to improve precision.

  • Solution 2: Use scripted_metric to improve accuracy.

  • Solution 3: Compute the result yourself in application-layer code.

Next, we put each of these three solutions into practice and interpret them one by one.

4.1 Improve precision with scaled_float type

4.1.1 What is scaled_float?

scaled_float is a special numeric data type provided by Elasticsearch for storing numbers with decimals.

Unlike float and double, scaled_float is backed by a long: it stores the floating-point value multiplied by a given scaling factor.

In many application scenarios we need to store numbers with decimals, such as prices and ratings. float and double are the commonly used types, but they have drawbacks: they may lose precision when stored and sorted, and they take up more storage space than integer types. scaled_float, by contrast, multiplies the floating-point value by a scaling factor and stores the result as a long.

For example, if the scaling factor is 100, the number 12.34 is stored as 1234. When querying and returning results, Elasticsearch divides by the scaling factor and returns the original floating-point number.
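Conceptually, this is all the type does. The sketch below is a hypothetical Java illustration of that encode/decode step (scale by the factor, round, store as a long, divide on read); it is not Elasticsearch source code, only the idea behind scaled_float:

public class ScaledFloatIdea {
    // Hypothetical illustration of the scaled_float idea:
    // store value * scaling_factor as a long, divide again on read.
    static final double SCALING_FACTOR = 100.0;

    static long encode(double value) {
        return Math.round(value * SCALING_FACTOR);   // 12.34 -> 1234
    }

    static double decode(long stored) {
        return stored / SCALING_FACTOR;              // 1234 -> 12.34
    }

    public static void main(String[] args) {
        long stored = encode(12.34);
        System.out.println(stored);           // 1234
        System.out.println(decode(stored));   // 12.34
    }
}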

4.1.2 Advantages of scaled_float

  • More accurate, controllable precision

Compared with float and double, scaled_float is more accurate for storage and sorting because it is actually stored as a long integer, which avoids floating-point precision issues.

  • Better performance

Since scaled_float uses the long type, it occupies less storage space and has better performance.

  • More flexible

The scaling factor can be set as needed to balance accuracy and performance. If higher precision is required, a larger scaling factor can be used. If the project needs to focus on performance and storage space, you can use a smaller scaling factor.

4.1.3 Using scaled_float in Elasticsearch

To use scaled_float in Elasticsearch, you need to define the field type in the mapping and provide a scaling factor. For example:

{
  "properties": {
    "price": {
      "type": "scaled_float",
      "scaling_factor": 100.0
    }
  }
}

This mapping defines a scaled_float field called price with a scaling factor of 100. This means that all prices will be multiplied by 100 and stored as long.

For example, a price of 12.34 would be stored as 1234.

Overall, scaled_float is a very useful tool that provides better precision and performance in situations where floating point numbers need to be stored.

4.1.4 Hands-on: solving the problem from the beginning of the article

In this example, we have two products whose prices are floating-point numbers.

To use scaled_float, first define a mapping. Assuming prices should be stored to the cent (two decimal places), set scaling_factor to 100.0. Here is how to define the mapping:

First, create a new index and define the mapping:

PUT /product
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100.0
      }
    }
  }
}

This command creates a new index product and defines two fields: name (type text) and price (type scaled_float, scaling_factor 100.0).

Then bulk-insert the data:

POST /product/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "商品1", "price" : 1234.56 }
{ "index" : { "_id" : "2" } }
{ "name" : "商品2", "price" : 7890.12 }

During indexing, the value of the price field is automatically multiplied by scaling_factor (100.0 here) and stored as a long, so the actual stored values are 123456 and 789012.

When querying, the value you see is still the original floating-point number. For example, if you execute the following query:

GET /product/_doc/1

The returned result will be:

{
  "_index": "product",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "name": "商品1",
    "price": 1234.56
  }
}

Although the price is stored in the index as a scaled long (multiplied by 100), the document's _source still holds the original value, so the price you see is still 1234.56.

In this way, prices can be stored and queried with less storage space and better performance while maintaining high precision.
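The intuition for why this also helps the aggregation can be sketched in plain Java: the sum and the division by the document count operate on exact integers, and only the final division by the scaling factor converts back to a decimal. (This mirrors the idea of working on scaled longs; it is not a claim about Elasticsearch's exact internal execution.)

public class ScaledAverageDemo {
    public static void main(String[] args) {
        // Scaled representations of 1234.56 and 7890.12 with scaling_factor 100
        long price1 = 123456;
        long price2 = 789012;

        // Integer arithmetic is exact, so no floating-point drift accumulates.
        long scaledSum = price1 + price2;   // 912468
        long scaledAvg = scaledSum / 2;     // 456234 (divides evenly here)

        // Only the last step converts back to a decimal value.
        System.out.println(scaledAvg / 100.0);   // prints 4562.34
    }
}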

Finally, we run the same aggregation again:

GET product/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}

As shown below, the resulting accuracy is as expected.

[Figure: avg_price aggregation result of 4562.34, as expected]

4.2 Using scripted_metric to improve accuracy

Alternatively, we can solve this with another powerful Elasticsearch feature: the scripted metric aggregation (scripted_metric).

scripted_metric allows us to customize complex aggregation logic, such as the following DSL:

#### Be sure to delete the existing index first
DELETE product

POST /product/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "商品1", "price" : 1234.56 }
{ "index" : { "_id" : "2" } }
{ "name" : "商品2", "price" : 7890.12 }

GET /product/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "scripted_metric": {
        "init_script": "state.total = 0.0; state.count = 0",
        "map_script": "state.total += params._source.price; state.count++",
        "combine_script": "HashMap result = new HashMap(); result.put('total', state.total); result.put('count', state.count); return result",
        "reduce_script": """
  double total = 0.0; long count = 0; 
  for (state in states) { 
    total += state['total']; 
    count += state['count']; 
  }
  double average = total / count;
  DecimalFormat df = new DecimalFormat("#.00");
  return df.format(average);
        """
      }
    }
  }
}

Elasticsearch is a distributed search and analytics engine, meaning data can be stored and processed across multiple shards. To process distributed data, Elasticsearch uses a programming model called map-reduce. This model is divided into two steps: mapping (Map) and reduction (Reduce). init_script, map_script, combine_script, and reduce_script are all components of this model for more complex aggregations.

In the above script, we defined four steps:

  • init_script: Initialization script that creates a new state for each aggregation on each shard.

  • map_script: A mapping script that processes input documents and converts their state into a format that can be merged.

  • combine_script: A combine script for merging the state of each shard at the node level.

  • reduce_script: A reduce script for merging state globally.

In this way, we can get a more accurate average.

The specific meaning of the above script is explained as follows:

  • init_script: This script is executed once per shard, creating a new state for each shard.

In the above script, it creates a state object which contains a sum (total) and a counter (count). The state object is initialized to {total: 0.0, count: 0}.

  • map_script: This script is executed once per document.

In the above script, it reads each document's price field, adds that value to total, and increments count. After all documents are processed, total holds the sum of all document prices and count holds the number of documents processed.

  • combine_script: This script is executed once per shard, combining the state of each shard.

In the above script, it simply puts total and count into a HashMap and returns it. If there were many states to merge, some pre-processing could be done in this script.

  • reduce_script: This script is executed once when the results are merged, reducing the states of all shards to calculate the final result.

In the above script, it iterates over the states of all shards, sums up total and count, and then calculates the average price. DecimalFormat is used to format the average price to two decimal places and return it as a string.

In simple terms, this is a step-by-step process of calculating the average: first initialize the state, then update the state for each document, then merge the state on each shard, and finally merge the state globally and calculate the result.
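To make the four phases concrete, the following is a minimal Java simulation of the same init/map/combine/reduce flow over two imaginary shards. The shard layout and class name are invented for illustration and are unrelated to Elasticsearch's actual implementation.

import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScriptedMetricSimulation {
    public static void main(String[] args) {
        // Two imaginary shards, each holding the prices of its documents.
        List<double[]> shards = List.of(
                new double[] { 1234.56 },
                new double[] { 7890.12 });

        // init_script + map_script + combine_script, executed per shard.
        List<Map<String, Object>> shardStates = new ArrayList<>();
        for (double[] shardDocs : shards) {
            double total = 0.0;   // init_script: fresh state per shard
            long count = 0;
            for (double price : shardDocs) {   // map_script: once per document
                total += price;
                count++;
            }
            Map<String, Object> state = new HashMap<>();   // combine_script: package the state
            state.put("total", total);
            state.put("count", count);
            shardStates.add(state);
        }

        // reduce_script: merge all shard states and compute the final value.
        double total = 0.0;
        long count = 0;
        for (Map<String, Object> state : shardStates) {
            total += (double) state.get("total");
            count += (long) state.get("count");
        }
        DecimalFormat df = new DecimalFormat("#.00");
        System.out.println(df.format(total / count));   // 4562.34 (with '.' as decimal separator)
    }
}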

The final result is shown in the figure below, achieving the expected accuracy.

[Figure: scripted_metric avg_price result of 4562.34, as expected]

4.3 Compute the result yourself in application code

Precision control at the application level: fetch the raw data into the application layer and perform exact calculations there. The advantage is that very accurate results can be obtained; the disadvantage is that a large amount of data may need to be transferred and processed, adding network and computation overhead.

Dealing with data accuracy issues at the application level usually requires two steps:

  • First, the raw data needs to be obtained from Elasticsearch;

  • Then, precise calculations are performed at the application layer.

The following is an example of handling data precision in Java:

Assuming the application is written in Java, you can use Java's BigDecimal class for exact decimal calculations. Here is a simple example:

import java.math.BigDecimal;
import java.math.RoundingMode;

BigDecimal price1 = new BigDecimal("1234.56");
BigDecimal price2 = new BigDecimal("7890.12");
BigDecimal average = price1.add(price2).divide(new BigDecimal(2), 2, RoundingMode.HALF_UP);

System.out.println(average);  // Output: 4562.34

In the above example, we first created two BigDecimal objects representing two prices. Then we call the add method to add them up, and then call the divide method to calculate the average. Finally, we use the RoundingMode.HALF_UP parameter to control the rounding mode.

Note that this approach requires all data to be processed at the application layer, which may cause performance issues if the data volume is large. In order to reduce the burden of data transmission and calculation, it may be necessary to use more precise queries in Elasticsearch to obtain only the required data, or use the aggregation function of Elasticsearch to reduce the amount of returned data.

In addition, optimizations at the application layer, such as parallel processing and caching, may be needed to improve performance. The specific approach depends on the application's situation and requirements.
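Extending the two-price example to an arbitrary number of prices, a minimal sketch of the application-layer calculation might look like the one below. How the prices are fetched from Elasticsearch (query, scroll, or a client library) is omitted and assumed to have happened already; the class and method names are illustrative.

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

public class AveragePriceCalculator {
    // Averages prices exactly with BigDecimal and rounds the final
    // result to two decimal places using HALF_UP.
    static BigDecimal average(List<BigDecimal> prices) {
        if (prices.isEmpty()) {
            throw new IllegalArgumentException("no prices to average");
        }
        BigDecimal total = BigDecimal.ZERO;
        for (BigDecimal price : prices) {
            total = total.add(price);
        }
        return total.divide(BigDecimal.valueOf(prices.size()), 2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Assume these values were already retrieved from the product index.
        List<BigDecimal> prices = List.of(
                new BigDecimal("1234.56"),
                new BigDecimal("7890.12"));
        System.out.println(average(prices));   // 4562.34
    }
}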

5. Summary

In general, although Elasticsearch aggregations can produce results with imperfect precision, more accurate results can be obtained by using the scaled_float type, using scripted_metric, or computing the result yourself in application code.

When encountering similar problems, choose the solution that best fits the actual situation: weigh the precision requirements against query performance and resource consumption, and apply script-based calculation where the business genuinely needs higher aggregation accuracy.

