Elasticsearch usage summary (part six)

An earlier article in this series covered metric aggregations; this one covers bucket aggregations. A bucket aggregation walks through the documents and places each one that meets a given criterion into a bucket created for that criterion.
This article focuses on the terms aggregation, which groups documents by the value of a field. For example, if the gender field holds either male or female, two buckets are created, one for each value. By default each bucket reports only doc_count, the number of documents it contains (here, how many men and how many women), and that is what is returned to the client.
Terms aggregation

{
    "aggs" : {
        "genders" : {
            "terms" : { "field" : "gender" }
        }
    }
}

The response looks like this:

{
    ...

    "aggregations" : {
        "genders" : {
            "doc_count_error_upper_bound": 0, 
            "sum_other_doc_count": 0, 
            "buckets" : [ 
                {
                    "key" : "male",
                    "doc_count" : 10
                },
                {
                    "key" : "female",
                    "doc_count" : 10
                }
            ]
        }
    }
}

Uncertainty of data
The counts returned by a terms aggregation are not always exact.
For example, suppose we want the 5 most frequent values of the name field.
The client sends the aggregation request to Elasticsearch. The node that receives it (the coordinating node, which need not be the master node) forwards the request to each shard involved.
Each shard independently computes the top 5 names in its own data and returns them. Once all shards have answered, the coordinating node merges the lists and returns the 5 names with the highest combined counts to the client.
This can introduce errors, because each shard only sees its own documents and terms are not distributed evenly across shards. Suppose name B occurs 9 times on the first shard but is not among that shard's top 5, so the shard does not report it. B's merged total is then 9 too low: B may really be the most frequent name overall, yet it ends up ranked behind A or missing from the result altogether.
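
The merge step described above can be imitated in a few lines of Python. This is only a sketch with made-up data, not how Elasticsearch is implemented: each Counter stands in for one shard, and top_terms is a hypothetical helper that keeps the shard_size most frequent terms per shard before merging.

```python
from collections import Counter

def top_terms(shards, size, shard_size):
    """Merge per-shard top lists and return the overall top `size` terms.

    Each shard reports only its `shard_size` most frequent terms, so
    counts for terms a shard leaves out are lost in the merge.
    """
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)

# Made-up data: B is the most frequent name overall (9 + 4 = 13),
# but it is not the top name on either shard.
shards = [
    Counter({"A": 10, "B": 9}),
    Counter({"C": 5, "B": 4}),
]

print(top_terms(shards, size=1, shard_size=1))  # [('A', 10)]  B is missed
print(top_terms(shards, size=1, shard_size=2))  # [('B', 13)]  correct
```

Letting each shard report one extra candidate is enough here to recover the right answer. With real data the effect depends on how the terms are spread across shards.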
size and shard_size
The size and shard_size parameters help reduce this error.

  • The size parameter specifies how many terms are returned in the final result (the default is 10)
  • The shard_size parameter specifies how many terms each shard returns (by default size * 1.5 + 10)
  • If shard_size is set lower than size, Elasticsearch resets it to size

With these two parameters, to get the top 5 we set size=5 and choose a shard_size greater than 5. Each shard then returns more candidate terms, which lowers the chance of error, at the cost of more work per shard and more data to merge.
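
As a rough illustration, the request body for that case could be built like this in Python. The aggregation name top_names, the helper top_terms_body, and the value 50 are assumptions made for the example; only field, size, and shard_size come from the text above.

```python
import json

def top_terms_body(field, size, shard_size):
    """Build a search body with a terms aggregation on `field`."""
    return {
        "size": 0,  # top-level size: return no search hits, only aggregations
        "aggs": {
            "top_names": {  # hypothetical aggregation name
                "terms": {
                    "field": field,
                    "size": size,              # terms in the final result
                    "shard_size": shard_size,  # candidate terms per shard
                }
            }
        },
    }

body = top_terms_body("name", size=5, shard_size=50)
print(json.dumps(body, indent=4))
```

The printed JSON can be sent as the body of a search request against the index that holds the name field.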

Sorting with order
The order parameter specifies how the returned buckets are sorted. By default they are sorted by doc_count in descending order. The following request sorts by doc_count in ascending order instead:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_count" : "asc" }
            }
        }
    }
}

Buckets can also be sorted alphabetically by their key. The sort key is _key since Elasticsearch 6.0; older versions used _term:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_key" : "asc" }
            }
        }
    }
}

You can also sort by a single-value metric sub-aggregation, referenced by its name:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "avg_height" : "desc" }
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}

Multi-value metric sub-aggregations are supported too, but you must name the specific metric to sort by:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "height_stats.avg" : "desc" }
            },
            "aggs" : {
                "height_stats" : { "stats" : { "field" : "height" } }
            }
        }
    }
}

min_doc_count and shard_min_doc_count
The aggregated field may contain many terms that occur only rarely. If such terms make up a large share of the values, processing them is a lot of unnecessary work.
Setting min_doc_count and shard_min_doc_count imposes a minimum document count: only terms that reach it are recorded and returned.

As the names suggest:

  • min_doc_count: filters the final merged result (the default is 1)
  • shard_min_doc_count: filters on each shard, before its results are returned for merging (the default is 0)
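
A request that uses both thresholds might look like the following sketch, written as a Python dict so it can be printed as JSON. The aggregation name names and the values 10 and 3 are hypothetical; the shard-level value is set lower than the final one on the assumption that a term's documents are spread over several shards.

```python
import json

# Hypothetical thresholds, chosen only for illustration.
rare_filter_body = {
    "size": 0,  # return only aggregation results, no search hits
    "aggs": {
        "names": {  # hypothetical aggregation name
            "terms": {
                "field": "name",
                "min_doc_count": 10,       # applied to the merged result
                "shard_min_doc_count": 3,  # applied on each shard
            }
        }
    },
}

print(json.dumps(rare_filter_body, indent=4))
```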

