【Troubleshooting】Exploring how ES cardinality works, starting from a business problem | JD Cloud technical team

Author: Wang Changchun, Jingdong Technology

Business problem

One of the server-side systems I maintain stores its data in Elasticsearch (ES). Business operators reported that users of the product were seeing an order count in the operations back office that did not match the number of orders they exported!

A wrong transaction order count, possibly even erroneous orders? That is alarming. A financial technology company cannot tolerate this kind of problem, so it had to be located and fixed immediately.

Figure: feedback from production, where the queried order total does not match the exported total

I immediately contacted the business team and the people involved. By tracing the call chain of the upstream systems, I confirmed that the business system uses the ES storage service I maintain. I then reproduced the production behaviour and got a basic picture of the problem:

  1. Total orders in the operations back office: the "Total number of orders" shown on the merchant page is computed with the statistical aggregation function of my ES storage service. It uses the ES cardinality operation to count distinct values of orderId (the order number).
  2. Total orders in the export function: the export uses the conditional query function of the ES storage service and fetches the results page by page.

Identify the problem

The two queries return different numbers. The first thing to check: are the query conditions the same?

After some investigation, I confirmed that the business system uses the same conditions for the total-order query and for the export. In other words, by the time the requests reach my ES service, the statistical aggregation and the paged export carry identical query conditions. So why do the two ES results differ? Is the data in ES incomplete? Is one of the two operations, the statistical aggregation or the paged export, inaccurate?

To find out which operation was at fault, I compared the total from the database with the data in ES under the same conditions. The database total matched the total of the ES conditional query, and the business's orderId field contains no duplicates. The problem therefore had to be in the statistical aggregation that deduplicates by orderId.

Figure: count returned by the database query

Figure: count returned by the operations back-office query

Database query: the business database is sharded across multiple databases and tables, so the database query here goes through the Galaxy table of the company's data department. On day T+1 the data department extracts the day-T incremental data from the business database and loads it into a prepared "big table", so that each business line can use the data conveniently.

Operations back-office query: this is a direct query against the ES storage service.

Row count of the data department's big table = row count across the MySQL sharded databases and tables = count returned by the operations back-office query = number of documents stored in ES

Problem location:
Among the functions the ES storage service exposes to the business, the statistical aggregation that deduplicates by orderId (cardinality) is the likely culprit.

Exploring how ES cardinality works

As mentioned above, the ES storage service I maintain lets the business aggregate and deduplicate by a specified business field, and this is implemented with the ES cardinality aggregation. The business's query conditions are mapped to an ES-level request that combines the query with a cardinality aggregation.

Running the business's query conditions from the ES management console reproduced the production result exactly: the aggregation reports 21514, while the conditional query returns 21427.

This confirms that the cardinality operation is what makes the two results differ. The request is as follows:

GET datastore_big_es_1_index/datastore_big_es_1_type/_search
{
  "size": 3,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "v021.raw": "selfhelp"
          }
        },
        {
          "match": {
            "v012.raw": "1001"
          }
        },
        {
          "match": {
            "typeId": "00029"
          }
        },
        {
          "range": {
            "createdDate": {
              "gte": "2021-02-01",
              "lt": "2021-03-01"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "v031.raw": "113692300"
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "distinct_orderId": {
      "cardinality": {
        "field": "v033.raw"
      }
    }
  }
}

Figure: the cardinality operation run from the ES cluster console

Why does the cardinality operation produce this result?

I had fallen into the trap of taking things for granted: I assumed this was a simple count-distinct function and that ES would simply deduplicate and count for me. That is not the case. The explanation of cardinality in the official Elasticsearch documentation finally gave me the reason.

See the explanation of cardinality in the official Elasticsearch 2.x documentation: cardinality

The core of the documentation's explanation of the cardinality algorithm is:

Figure: introduction to the cardinality algorithm in the ES documentation

It can be summarized as follows:

  1. cardinality is not exact the way a count in a relational database such as MySQL is. It returns an approximate value that ES "estimates" for you. The estimate uses the HyperLogLog++ (HLL) algorithm, which is very fast and deduplicates in a single pass over the data. For details, see the papers recommended in the documentation.
  2. The accuracy of the estimate can be tuned with the precision_threshold parameter, whose range is 0 to 40000. A larger value gives higher accuracy but uses more memory; it trades memory for precision.
  3. For small data volumes, the ES "estimate" is highly accurate and almost equal to the actual count.
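To make the summary above more concrete, here is a minimal sketch of a classic HyperLogLog counter in Java. It is not the ES implementation: ES uses HyperLogLog++, which adds bias correction and a sparse representation, and the class name HllSketch, the hash function, and the parameter p are all assumptions chosen for illustration. What it does show is why the result is an estimate: every value is hashed, only one small register per bucket is kept, and the count is inferred from those registers in a single pass. Duplicates never change a register, which is how deduplication falls out for free.

```java
final class HllSketch {
    // Odd 64-bit constant added before mixing so that id 0 does not hash to 0.
    private static final long GAMMA = 0x9e3779b97f4a7c15L;

    private final int p;             // number of hash bits used to pick a register
    private final int m;             // number of registers, 2^p
    private final byte[] registers;  // each register keeps the largest rank seen

    HllSketch(int p) {
        if (p < 7 || p > 18) {
            throw new IllegalArgumentException("p must be between 7 and 18");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    /** Scrambles an id into 64 well-mixed bits (SplitMix64-style finalizer). */
    static long mix64(long z) {
        z += GAMMA;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Records one value; adding the same id again leaves the sketch unchanged. */
    void add(long id) {
        long h = mix64(id);
        int idx = (int) (h >>> (64 - p));   // top p bits select the register
        long rest = h << p;                 // remaining bits, left-aligned
        // Rank = position of the first 1 bit in the remaining bits.
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Returns the estimated number of distinct ids added so far. */
    long estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // standard constant for m >= 128
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(raw);
    }

    public static void main(String[] args) {
        long exact = 21427;  // the exact order count from this article
        HllSketch sketch = new HllSketch(14);
        for (long id = 1; id <= exact; id++) {
            sketch.add(id);
            sketch.add(id);  // a duplicate does not affect the estimate
        }
        System.out.println("exact = " + exact + ", estimated = " + sketch.estimate());
    }
}
```

The memory used is fixed by the number of registers, not by the number of values, which is the trade-off the precision_threshold parameter exposes: more registers cost more memory and give a tighter estimate.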

Cardinality parameter verification in ES

The precision_threshold parameter of ES cardinality is verified below:

1. With a large amount of data, the result still contains errors even when precision_threshold is set to the maximum value or above:

Figure: verification with a high precision_threshold on a large amount of data

2. With a small amount of data, setting the maximum precision gives a result consistent with the actual count:

Figure: verification with a high precision_threshold on a small amount of data

So why does the online aggregation report 21514 while the conditional query returns 21427?

Neither the production code nor the ES cluster settings set precision_threshold explicitly, so the cluster's default value applies. The production ES cluster runs version 5.4.x. According to the official documentation for version 5.4, the default is precision_threshold=3000, and with that default the aggregation for this query returns 21514.
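To put the gap in perspective, the two totals can be compared directly. The small sketch below computes the absolute and relative difference between the cardinality result and the exact count; the class and method names are illustrative only and are not part of the ES storage service. With the numbers from this case, the gap is 87 orders, roughly 0.4 percent of the exact total.

```java
final class CardinalityGap {
    /** Relative error of an estimate against the exact count, as a fraction. */
    static double relativeError(long estimated, long exact) {
        return Math.abs(estimated - exact) / (double) exact;
    }

    public static void main(String[] args) {
        long exact = 21427;      // conditional query total, which matches the database
        long estimated = 21514;  // result of the cardinality aggregation
        System.out.println("absolute gap = " + (estimated - exact));
        System.out.printf("relative error = %.2f%%%n",
                100.0 * relativeError(estimated, exact));
    }
}
```

A deviation this small is harmless for a dashboard, but for an order count even a single extra order is a visible error.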

In addition, the official documentation studies the precision_threshold parameter of the cardinality operation, charting the relationship between the precision_threshold setting, the error rate of cardinality queries, and the data volume. It is a useful reference for business development, as shown in the following figure:
Figure: relationship between the precision_threshold setting and the cardinality error rate in the official documentation

The official Elasticsearch 5.4 documentation on the precision_threshold parameter of cardinality: precision_threshold

Summary and plan

Having explored how cardinality works, the takeaway is that its usage scenarios must be distinguished.

  1. It is not recommended for business scenarios that need exact counts. For example, it should not be used to count orders, where an approximate result causes confusion.
  2. **For scenarios that do not need exact counts it is very useful, especially with large data volumes, because it delivers high performance while keeping a reasonable degree of accuracy.** Examples include monitoring metrics and market ratio calculations, where approximate statistics are good enough.

In my case, counting merchant orders requires an exact result, so the cardinality operation is not suitable. And because the business never repeats an orderId, each record in our ES cluster should in theory have a unique orderId, so no deduplication is needed: the ES count operation can total the orders directly. The corresponding COUNT API in the Elasticsearch development kit is as follows:

org.springframework.data.elasticsearch.core.ElasticsearchTemplate
#count(org.springframework.data.elasticsearch.core.query.SearchQuery, java.lang.Class<T>)

public <T> long count(SearchQuery searchQuery, Class<T> clazz) {
    QueryBuilder elasticsearchQuery = searchQuery.getQuery();
    QueryBuilder elasticsearchFilter = searchQuery.getFilter();
    return elasticsearchFilter == null ? this.doCount(this.prepareCount(searchQuery, clazz), elasticsearchQuery) : this.doCount(this.prepareSearch(searchQuery, clazz), elasticsearchQuery, elasticsearchFilter);
}

Finally, everyone is welcome to like, bookmark, comment, and forward! ❤️❤️❤️


Origin my.oschina.net/u/4090830/blog/8707473