Elasticsearch: counting distinct values

Finding distinct counts

The first approximate aggregation provided by Elasticsearch is the cardinality metric. It provides the cardinality of a field, that is, the number of distinct or unique values in that field. You may be familiar with the SQL equivalent:

SELECT COUNT(DISTINCT color)
FROM cars

Distinct counts are a very common operation and can answer many basic business questions:

  • How many unique visitors have come to the website?
  • How many distinct cars have we sold?
  • How many unique users purchase a product each month?

We can use the cardinality metric to determine the number of distinct car colors sold by the dealership:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color"
            }
        }
    }
}

The result shows that cars of three different colors have been sold:

...
"aggregations": {
  "distinct_colors": {
     "value": 3
  }
}
...

We can make our example more useful: how many colors of cars were sold each month? To get that metric, we just embed a cardinality metric inside a date_histogram:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
      "months" : {
        "date_histogram": {
          "field": "sold",
          "interval": "month"
        },
        "aggs": {
          "distinct_colors" : {
              "cardinality" : {
                "field" : "color"
              }
          }
        }
      }
  }
}

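The response now contains a distinct_colors count nested inside each monthly bucket. With our sample data, an abridged response would look roughly like the following (the exact bucket key formatting depends on the Elasticsearch version); note that the two red cars sold in November still count as a single distinct color:

...
"aggregations": {
  "months": {
     "buckets": [
        {
           "key_as_string": "2014-10-01",
           "doc_count": 1,
           "distinct_colors": {
              "value": 1
           }
        },
        {
           "key_as_string": "2014-11-01",
           "doc_count": 2,
           "distinct_colors": {
              "value": 1
           }
        },
        ...
     ]
  }
}
...
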
Understanding the trade-offs

As we mentioned at the beginning of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm, which first hashes the input and then uses the bits of the hash to make probabilistic estimates of the cardinality.

We don't need to understand the technical details (although if you are interested, the paper is a great read), but we should pay attention to the properties of this algorithm:

  • Configurable precision, used to control memory usage (more precision = more memory).
  • Excellent accuracy on small data sets.
  • Fixed memory usage: whether there are thousands or billions of unique values, memory usage depends only on the configured precision.

To configure the precision, we specify the value of the precision_threshold parameter. This threshold defines the cardinality below which we expect a near-exact result. Consider this example:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color",
              "precision_threshold" : 100 
            }
        }
    }
}

precision_threshold accepts a number from 0 to 40,000; larger values are treated as 40,000.

This example ensures that fields with 100 or fewer unique values will return highly accurate counts. Although the algorithm cannot guarantee it, a cardinality below the threshold is almost always 100% accurate. Cardinalities above the threshold begin to trade accuracy for memory savings, and some error creeps into the metric.

For a given threshold, the HLL data structure uses roughly precision_threshold * 8 bytes of memory, so you have to balance the memory spent against the additional accuracy gained.
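For example, the threshold of 100 used above works out to roughly 100 * 8 = 800 bytes.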

In practice, a threshold of 100 keeps the error under 5% even when counting millions of unique values.

Optimizing for speed

If you want a distinct count, you usually need to query the whole data set (or nearly all of it). Any operation that runs over all the data has to be fast, for obvious reasons. HyperLogLog is already very fast: it simply hashes your data and then does some bit twiddling.

But if speed matters a great deal, we can optimize a little further. Since HLL only needs the hash values of the field's contents, we can precompute those hashes at index time. At query time, we can then skip the hashing step and load the hash values directly from fielddata.

Precomputing hashes is only useful for very long strings or high-cardinality fields, where the cost of computing the hash at query time is not negligible.

Although numeric fields can be hashed very quickly, storing their original values usually requires the same amount of memory (or less). The same is true for low-cardinality string fields; an internal optimization in Elasticsearch guarantees that each unique value is hashed only once.

In short, precomputing hashes will not make every field faster; it pays off only for high-cardinality and/or very long string fields. Remember that precomputation simply shifts the cost from query time to index time. It is not free; the difference is that you get to choose when to pay it, at index time or at query time.

To do this, we need to add a new multifield to our data. We will delete the index, add a new mapping that includes the hashed field, and then reindex:

DELETE /cars/

PUT /cars/
{
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "string",
          "fields": {
            "hash": {
              "type": "murmur3" 
            }
          }
        }
      }
    }
  }
}

POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

The multifield is of type murmur3, which is a hashing function.

Now when we run the aggregation, we use the color.hash field instead of the color field:

GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color.hash" 
            }
        }
    }
}

Note that we specify the hashed multifield rather than the original field.

Now the cardinality metric will load the values in color.hash (the precomputed hashes) instead of dynamically hashing the original values.
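
Hashing the three distinct colors still yields three distinct values, so the result should be identical to the earlier aggregation on the raw color field; only the hashing work has moved from query time to index time:

...
"aggregations": {
  "distinct_colors": {
     "value": 3
  }
}
...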

The time saved per document is tiny, but if you are aggregating 100 million documents and each hash costs an extra 10 nanoseconds, that adds one second to every query. If you plan to use cardinality over very large amounts of data, it is worth weighing whether precomputing the hashes makes sense; run some performance tests to verify whether precomputed hashes are appropriate for your application.

Origin blog.csdn.net/allway2/article/details/109182858