Detailed explanation of Elasticsearch (ES) aggregation usage

Foreword

ES Statistical Analysis Concepts

Aggregation queries in ES are similar to SQL SUM/AVG/COUNT/GROUP BY queries and are mainly used for statistical analysis.
This article first introduces the basic process and the core concepts of an ES aggregation query.

1. ES aggregation query process

An ES aggregation query is similar to SQL's GROUP BY. Statistical analysis generally consists of two steps:

Grouping
Aggregation within a group

Grouping: the queried data is first divided into groups according to a grouping condition. For example, when new students enroll, they are assigned to classes according to their majors. Assigning students to classes is grouping.

Aggregation within a group: statistics are computed over the data in each group, for example a total or an average. Continuing the example, once the students have been divided into classes by major, the number of students in each class can be counted. Counting the students in each class is the aggregation within the group.

Tip: Grouping corresponds to the condition in the SQL GROUP BY clause, and aggregation within a group corresponds to the AVG, SUM, and COUNT functions in the SELECT list. Anyone familiar with SQL knows that SUM and COUNT do not require a GROUP BY clause. Using an aggregate function alone is equivalent to treating all the data as a single group and computing the statistic over all of it.
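
The two steps can be sketched in plain Python. This is only an illustration of the idea, not of how ES works internally; the student names and majors are made-up sample data.

```python
from collections import defaultdict

# Hypothetical enrollment records: (student name, major).
students = [
    ("Ann", "math"),
    ("Bob", "physics"),
    ("Cid", "math"),
    ("Dee", "history"),
    ("Eve", "math"),
]

# Step 1: grouping. Students with the same major go into the same class.
classes = defaultdict(list)
for name, major in students:
    classes[major].append(name)

# Step 2: aggregation within each group. Here: count the students per class.
class_sizes = {major: len(names) for major, names in classes.items()}

# Aggregating without grouping treats all records as one single group.
total_students = len(students)

print(class_sizes)
print(total_students)
```

Step 1 plays the role of GROUP BY, and step 2 plays the role of COUNT.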

2. Core concepts

With the process above in mind, the core concepts of ES aggregation described below are easy to understand.

2.1. Bucket

A collection of documents that meet certain criteria is called a bucket.
In other words, a bucket is a group of data: grouping the data produces groups, and each group is a bucket.
Tip: A bucket is equivalent to a group, and bucketing means the same thing as grouping. ES uses buckets to represent sets of data that share a characteristic.

Bucket aggregation in ES means grouping the data first. ES supports many grouping conditions. For example, it can group by field value like SQL GROUP BY, and it also supports grouping by numeric interval, time interval, and value range to meet various statistical needs.

2.2. Metrics (indicators)

A metric is a statistical calculation performed on documents, also known as a metric aggregation.
In-bucket aggregation means grouping the data (bucketing) first and then performing a metric aggregation on the data in each bucket.
In other words, after one round of bucket aggregation has divided the data into buckets, the metric is computed over the data in each bucket.
Commonly used metrics include SUM, COUNT, MAX, and other statistical functions.
A SQL statement helps illustrate buckets and metrics:

SELECT COUNT(*) 
FROM order
GROUP BY shop_id 

Explanation:
COUNT(*) is the metric, also called a statistical indicator.
GROUP BY shop_id is the bucketing condition, also called the grouping condition. All rows with the same shop_id fall into one bucket.
This SQL statement counts the number of orders for each shop. First, rows with the same shop_id (shop ID) are placed into the same group (bucket) according to the GROUP BY shop_id condition. Then the COUNT(*) function (the metric) is applied to each group to get the total number of orders for each shop. ES follows a similar process.

3. ES aggregation query syntax

First, here is a general overview of the basic syntax of an ES aggregation query:

{
  "aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]? // Nested aggregations; multiple levels of nesting are supported
    }
    [,"<aggregation_name_2>" : { ... } ]* // Multiple aggregations, each with a different name
  }
}

Explanation:
aggregations - the aggregation clause; it can be abbreviated as aggs.
<aggregation_name> - the name of an aggregation. Any name can be used. Because ES can run several aggregations in one request, this name is used later to find the desired result in the response.
<aggregation_type> - the aggregation type, which determines how the data is analyzed. There are two main categories, bucket aggregations and metric aggregations, each with multiple types. For example, sum and avg are metric aggregations; terms and date_histogram are bucket aggregations.
<aggregation_body> - the parameters of the aggregation type. Different aggregation types take different parameters.
<aggregation_name_2> - the name of another aggregation, which shows that several statistics can be computed in one request.
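
To make the template concrete, here is a minimal Python sketch that assembles a request body of the same shape as a plain dictionary. The helper name make_agg, the aggregation names, and the color and price fields are assumptions chosen for illustration. The sketch only builds and prints JSON and does not contact a cluster.

```python
import json


def make_agg(agg_type, body, sub_aggs=None):
    """Return one aggregation definition: {<type>: <body>[, "aggs": {...}]}."""
    definition = {agg_type: body}
    if sub_aggs:
        # Nested aggregations live next to the type key, under "aggs".
        definition["aggs"] = sub_aggs
    return definition


request_body = {
    "size": 0,
    "aggs": {
        # <aggregation_name>: a bucket aggregation with one nested metric.
        "popular_colors": make_agg(
            "terms",
            {"field": "color"},
            sub_aggs={"avg_price": make_agg("avg", {"field": "price"})},
        ),
        # <aggregation_name_2>: a second, independent aggregation.
        "total_price": make_agg("sum", {"field": "price"}),
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON could be pasted as the body of a _search request, assuming an index with these fields exists.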
Let's look at a simple aggregation query.
Assume there is an order index that stores car sales orders, and each document contains a color field holding the car's color.

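GET /order/_search
{
    "size" : 0, // size=0 means: return only the aggregation results, not the ordinary query hits
    "aggs" : { // Abbreviation of aggregations
        "popular_colors" : { // Name of the aggregation: popular_colors
            "terms" : { // Aggregation type: terms, a bucket aggregation that works like SQL GROUP BY; documents with the same field value go into the same group
              "field" : "color" // Parameter of the terms aggregation: group by the color field
            }
        }
    }
}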

The query above uses the terms bucket aggregation and does not specify a metric aggregation explicitly. In that case each bucket still reports doc_count, the number of documents in the bucket. The query as a whole counts the sales of each car color.
The equivalent SQL is as follows:

select color, count(*) from order group by color

Return result:

{
...
   "hits": { // Empty because size=0
      "hits": []
   },
   "aggregations": { // Aggregation results
      "popular_colors": { // Result of the popular_colors aggregation. This is why each aggregation needs a name: with several aggregations, results are looked up by name
         "buckets": [ // A bucket aggregation returns a buckets array with the statistics of each group; here, the sales of each color
            {
               "key": "red",
               "doc_count": 4 // 4 red cars were sold
            },
            {
               "key": "blue",
               "doc_count": 2
            },
            {
               "key": "green",
               "doc_count": 2
            }
         ]
      }
   }
}
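
On the client side, the result is read by walking the response with the aggregation name. The sketch below hard-codes a trimmed copy of the response above as a Python dictionary, with the elided fields left out. How the response is obtained (HTTP client, official library) is deliberately not shown.

```python
# Trimmed copy of the response above; elided fields are simply omitted.
response = {
    "hits": {"hits": []},
    "aggregations": {
        "popular_colors": {
            "buckets": [
                {"key": "red", "doc_count": 4},
                {"key": "blue", "doc_count": 2},
                {"key": "green", "doc_count": 2},
            ]
        }
    },
}

# Look the result up by the aggregation name chosen in the request.
buckets = response["aggregations"]["popular_colors"]["buckets"]

# Turn the bucket list into a color -> sales mapping.
sales_by_color = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
print(sales_by_color)  # {'red': 4, 'blue': 2, 'green': 2}
```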

4. Metric aggregation

ES metric aggregations are statistical functions similar to SQL's. A metric aggregation can be used alone or together with a bucket aggregation.
Commonly used statistical functions are as follows:
Value Count - similar to SQL's count function; counts the number of values
Cardinality - similar to SQL's count(DISTINCT field); counts the number of distinct values
Avg - calculates the average
Sum - calculates the sum
Max - finds the maximum value
Min - finds the minimum value
The following sections describe how to use these functions in Elasticsearch.

4.1. Value Count

The value count aggregation counts the number of values of a field in the matching documents, similar to SQL's count function.
Example:

GET /sales/_search?size=0
{
  "aggs": {
    "types_count": { // Name of the aggregation; any name will do
      "value_count": { // Aggregation type: value_count
        "field": "type" // Count the values of the type field
      }
    }
  }
}

Equivalent SQL:
select count(type) from sales

Return result:

{
    ...
    "aggregations": {
        "types_count": { // Name of the aggregation
            "value": 7 // Result
        }
    }
}

4.2. Cardinality

The cardinality aggregation also counts values. Unlike Value Count, it deduplicates: repeated values are counted only once, similar to count(DISTINCT field) in SQL.
Example:

POST /sales/_search?size=0
{
    "aggs" : {
        "type_count" : { // Name of the aggregation; any name will do
            "cardinality" : { // Aggregation type: cardinality
                "field" : "type" // Count the distinct values of the type field
            }
        }
    }
}

Equivalent SQL:

select count(DISTINCT type) from sales

Return result:

{
    ...
    "aggregations" : {
        "type_count" : { // Name of the aggregation
            "value" : 3 // Result
        }
    }
}

Tip: Saying that the cardinality aggregation is equivalent to SQL's count(DISTINCT field) is not entirely accurate. SQL's count returns an exact result, whereas the count returned by the ES cardinality aggregation is an approximation with a certain error. This trade-off is made for performance: counting distinct values exactly over massive data is very expensive, and many business scenarios only need an approximate value. For example, a small error in the number of visits to a website in one day does not matter.
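
The difference between the two counts can be illustrated with a few lines of Python. The seven type values below are made up so that the results match the 7 and 3 shown above. Note that a Python set gives an exact distinct count, whereas the ES cardinality aggregation returns an approximation.

```python
# Hypothetical values of the type field across seven sales documents.
types = ["hat", "t-shirt", "bag", "hat", "hat", "bag", "t-shirt"]

# Like value_count / SQL count(type): every value counts, duplicates included.
value_count = len(types)

# Like SQL count(DISTINCT type): duplicates are counted once.
# Exact here; the ES cardinality aggregation only approximates this number.
cardinality = len(set(types))

print(value_count, cardinality)  # 7 3
```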

4.3. Avg

Calculates the average.
Example:

POST /exams/_search?size=0
{
  "aggs": {
    "avg_grade": { // Name of the aggregation; any name will do
      "avg": { // Aggregation type: avg
        "field": "grade" // Average of the grade field
      }
    }
  }
}

Return result:

{
    ...
    "aggregations": {
        "avg_grade": { // Name of the aggregation
            "value": 75.0 // Result
        }
    }
}

4.4. Sum

Calculates the sum.
Example:

POST /sales/_search?size=0
{
  "aggs": {
    "hat_prices": { // Name of the aggregation; any name will do
      "sum": { // Aggregation type: sum
        "field": "price" // Sum of the price field
      }
    }
  }
}

Return result:

{
    ...
    "aggregations": {
        "hat_prices": { // Name of the aggregation
           "value": 450.0 // Result
        }
    }
}

4.5. Max

Finds the maximum value.
Example:

POST /sales/_search?size=0
{
  "aggs": {
    "max_price": { // Name of the aggregation; any name will do
      "max": { // Aggregation type: max
        "field": "price" // Maximum of the price field
      }
    }
  }
}

Return result:

{
    ...
    "aggregations": {
        "max_price": { // Name of the aggregation
            "value": 200.0 // Maximum value
        }
    }
}

4.6. Min

Finds the minimum value.
Example:

POST /sales/_search?size=0
{
  "aggs": {
    "min_price": { // Name of the aggregation; any name will do
      "min": { // Aggregation type: min
        "field": "price" // Minimum of the price field
      }
    }
  }
}

Return result:

{
    ...
    "aggregations": {
        "min_price": { // Name of the aggregation
            "value": 10.0 // Minimum value
        }
    }
}

4.7. Comprehensive example

In the preceding examples the metric aggregations were used on their own. In practice, the data in the index is usually searched with a query first, and the statistics are then computed over the query results.
Example:

GET /sales/_search
{
  "size": 0, // size=0: return only the statistics, not the query hits
  "query": { // Query condition; the aggs below are computed only over the documents that match it
    "constant_score": {
      "filter": {
        "match": {
          "type": "hat"
        }
      }
    }
  },
  "aggs": { // Statistics over the query results; without a query clause, all documents are included
    "hat_prices": { // Name of the aggregation; sums price
      "sum": {
        "field": "price"
      }
    },
    "min_price": { // Name of the aggregation; minimum of price
      "min": {
        "field": "price"
      }
    },
    "max_price": { // Name of the aggregation; maximum of price
      "max": {
        "field": "price"
      }
    }
  }
}

Return result:

{
    ...
    "aggregations": {
        "hat_prices": { // Sum
           "value": 450.0
        },
        "min_price": { // Minimum value
            "value": 10.0
        },
        "max_price": { // Maximum value
            "value": 200.0
        }
    }
}
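
The order of operations (filter first, then aggregate over what is left) can be mimicked in Python. The documents below are invented so that the numbers match the result above; the real contents of the sales index are not given in this article.

```python
# Hypothetical sales documents; only the fields used below are shown.
sales = [
    {"type": "hat", "price": 10.0},
    {"type": "hat", "price": 40.0},
    {"type": "hat", "price": 200.0},
    {"type": "hat", "price": 200.0},
    {"type": "bag", "price": 150.0},
]

# Query phase: keep only the documents whose type is "hat".
hat_prices = [doc["price"] for doc in sales if doc["type"] == "hat"]

# Aggregation phase: several metrics computed over the same filtered set.
stats = {
    "hat_prices": sum(hat_prices),
    "min_price": min(hat_prices),
    "max_price": max(hat_prices),
}
print(stats)  # {'hat_prices': 450.0, 'min_price': 10.0, 'max_price': 200.0}
```

The bag document never reaches the aggregation phase, which is why its price does not affect any of the three values.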

5. Bucket aggregation

The purpose of Elasticsearch bucket aggregation is to group data: the data is first divided into groups according to the specified conditions, and statistics are then computed for each group. A group is the same thing as a bucket, and ES uses the term bucket throughout.

ES bucket aggregation serves the same purpose as SQL GROUP BY, but ES offers more ways to group data. A plain SQL GROUP BY on a column groups by the unique values of that column, so the number of groups equals the number of unique values. For example, with group by shop_id there are as many groups as there are distinct shop IDs.

Commonly used ES bucket aggregations are as follows:
Terms aggregation - similar to SQL GROUP BY; groups by the unique values of a field
Histogram aggregation - groups by numeric interval, for example prices grouped in intervals of 100: 0, 100, 200, 300, and so on
Date histogram aggregation - groups by time interval, for example by month, day, or hour
Range aggregation - groups by value range, for example 0-150, 150-200, and 200-500
Tip: Bucket aggregation is generally not used alone but together with metric aggregation, because after the data is grouped, the data in each bucket usually needs to be summarized. If no metric aggregation is specified explicitly, ES still returns doc_count, the number of documents in each bucket.

5.1. Terms aggregation

The terms aggregation works like GROUP BY in SQL. Data is grouped (bucketed) by the unique values of a field, and documents with equal field values go into the same bucket.
Example:

GET /order/_search?size=0
{
  "aggs": {
    "shop": { // Name of the aggregation; any name will do
      "terms": { // Aggregation type: terms
        "field": "shop_id" // Bucket by the value of the shop_id field
      }
    }
  }
}

Equivalent SQL:
select shop_id, count(*) from order group by shop_id

Return result:

{
    ...
    "aggregations" : {
        "shop" : { // Name of the aggregation
            "buckets" : [ // Bucket aggregation result; one entry per bucket
                {
                    "key" : "1", // key identifies the bucket; in a terms aggregation it is the field value the bucket was built from
                    "doc_count" : 6 // Number of documents in the bucket, returned by default
                },
                {
                    "key" : "5",
                    "doc_count" : 3
                },
                {
                    "key" : "9",
                    "doc_count" : 2
                }
            ]
        }
    }
}

5.2. Histogram aggregation

The histogram aggregation groups documents by numeric interval. Its bucket results are typically used to draw bar charts.
Example:

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : { // Name of the aggregation; any name will do
            "histogram" : { // Aggregation type: histogram
                "field" : "price", // Bucket by the price field
                "interval" : 50 // Bucket interval of 50: price values are grouped in steps of 50
            }
        }
    }
}

Return result:

{
    ...
    "aggregations": {
        "prices" : { // Name of the aggregation
            "buckets": [ // Bucketing result
                {
                    "key": 0.0, // Bucket identifier; for histogram this is the lower bound of the bucket's interval
                    "doc_count": 1 // Number of documents in the bucket, returned by default
                },
                {
                    "key": 50.0,
                    "doc_count": 1
                },
                {
                    "key": 100.0,
                    "doc_count": 0
                },
                {
                    "key": 150.0,
                    "doc_count": 2
                }
            ]
        }
    }
}
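
How a value lands in a bucket can be sketched in Python, assuming the bucket key is computed by rounding down to a multiple of the interval: floor((value - offset) / interval) * interval + offset. The price list is hypothetical and was chosen so that the output matches the response above, including the empty 100.0 bucket.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0.0):
    """Lower bound of the histogram bucket that contains value."""
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval):
    """Count values per bucket, keeping empty buckets between the first and last key."""
    counts = Counter(bucket_key(v, interval) for v in values)
    buckets = []
    key = min(counts)
    while key <= max(counts):
        buckets.append({"key": key, "doc_count": counts.get(key, 0)})
        key += interval
    return buckets


prices = [10.0, 80.0, 150.0, 175.0]  # hypothetical values of the price field
for bucket in histogram(prices, 50):
    print(bucket)
```

The sketch assumes a non-empty list of values; an empty input would need a guard before min and max are called.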

5.3. Date histogram aggregation

Similar to the histogram aggregation, except that the date histogram handles date fields well. It is mainly used for bucketing by date and time.
Example:

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : { // Name of the aggregation; any name will do
            "date_histogram" : { // Aggregation type: date_histogram
                "field" : "date", // Group by the date field
                "calendar_interval" : "month", // Grouping interval: month means monthly; minute, hour, day, week, and year are also supported
                "format" : "yyyy-MM-dd" // Date format of the bucket keys in the result
            }
        }
    }
}

Return result:

{
    ...
    "aggregations": {
        "sales_over_time": { // Name of the aggregation
            "buckets": [ // Bucket aggregation result
                {
                    "key_as_string": "2015-01-01", // String form of the bucket key, in the format given by format
                    "key": 1420070400000, // Bucket key as a timestamp in epoch milliseconds
                    "doc_count": 3 // Number of documents in the bucket, returned by default
                },
                {
                    "key_as_string": "2015-02-01",
                    "key": 1422748800000,
                    "doc_count": 2
                },
                {
                    "key_as_string": "2015-03-01",
                    "key": 1425168000000,
                    "doc_count": 2
                }
            ]
        }
    }
}
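
The relationship between key and key_as_string in the response above can be checked with a short Python sketch. It assumes the keys are epoch milliseconds interpreted in UTC; the function name key_as_string is only borrowed from the response field for readability.

```python
from datetime import datetime, timezone


def key_as_string(key_millis):
    """Format an epoch-millisecond bucket key as yyyy-MM-dd in UTC."""
    moment = datetime.fromtimestamp(key_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


# Bucket keys copied from the response above.
keys = [1420070400000, 1422748800000, 1425168000000]
print([key_as_string(k) for k in keys])
# ['2015-01-01', '2015-02-01', '2015-03-01']
```

Each key is the first instant of its month, which is why a monthly interval yields dates that all fall on the first day.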

5.4. Range aggregation

The range aggregation buckets documents by value range. Each range includes its from value and excludes its to value.
Example:

GET /_search
{
    "aggs" : {
        "price_ranges" : { // Name of the aggregation; any name will do
            "range" : { // Aggregation type: range
                "field" : "price", // Bucket by the price field
                "ranges" : [ // Range configuration
                    { "to" : 100.0 }, // Documents with price < 100 go into one bucket
                    { "from" : 100.0, "to" : 200.0 }, // Documents with price >= 100 and price < 200 go into one bucket
                    { "from" : 200.0 } // Documents with price >= 200 go into one bucket
                ]
            }
        }
    }
}

Return result:

{
    ...
    "aggregations": {
        "price_ranges" : { // Name of the aggregation
            "buckets": [ // Bucket aggregation result
                {
                    "key": "*-100.0", // key describes the range of the bucket
                    "to": 100.0, // End value (exclusive)
                    "doc_count": 2 // Number of documents in the bucket, returned by default
                },
                {
                    "key": "100.0-200.0",
                    "from": 100.0, // Start value (inclusive)
                    "to": 200.0, // End value (exclusive)
                    "doc_count": 2
                },
                {
                    "key": "200.0-*",
                    "from": 200.0,
                    "doc_count": 3
                }
            ]
        }
    }
}
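
The boundary behavior is easy to get wrong, so here is a small Python sketch of the bucketing rule, with from treated as inclusive and to as exclusive. The price list is hypothetical and was chosen to reproduce the doc_count values 2, 2, and 3 shown above; note where the boundary values 100.0 and 200.0 end up.

```python
def in_range(value, lower=None, upper=None):
    """True if lower <= value < upper; None means that side is unbounded."""
    return (lower is None or value >= lower) and (upper is None or value < upper)


# Same three ranges as in the request above, as (from, to) pairs.
ranges = [(None, 100.0), (100.0, 200.0), (200.0, None)]

# Hypothetical values of the price field.
prices = [50.0, 99.0, 100.0, 150.0, 200.0, 250.0, 300.0]

doc_counts = [
    sum(1 for price in prices if in_range(price, lower, upper))
    for lower, upper in ranges
]
print(doc_counts)  # [2, 2, 3]
```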

Looking closely, the default keys of the range buckets are not very friendly. During development it is cumbersome to work with keys whose format is not known in advance, so each bucket can be given a meaningful name.
Example:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "keyed" : true,
                "ranges" : [
                    // The key parameter gives each bucket a name
                    { "key" : "cheap", "to" : 100 },
                    { "key" : "average", "from" : 100, "to" : 200 },
                    { "key" : "expensive", "from" : 200 }
                ]
            }
        }
    }
}

5.5. Comprehensive example

In the previous examples the aggs clause was used alone, which means the statistics cover all documents. In practice, a query clause is usually used to find the target documents first, and the aggs clause then analyzes the search results.
Example:

GET /cars/_search
{
    "size": 0, // size=0: return only the aggs results, not the query hits
    "query" : { // Query clause: filter the documents first
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : { // Then compute statistics over the documents matched by the query
        "colors" : { // Name of the aggregation
            "terms" : { // Aggregation type: terms; bucket first
              "field" : "color"
            },
            "aggs": { // Nested aggregations define the metrics computed inside each bucket
              "avg_price": { // Name of the aggregation
                "avg": { // Aggregation type: avg metric aggregation
                  "field": "price" // Average of the price field
                }
              },
              "sum_price": { // Name of the aggregation
                "sum": { // Aggregation type: sum metric aggregation
                  "field": "price" // Sum of the price field
                }
              }
            }
        }
    }
}

Aggregation queries support multiple levels of nesting.
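
What this nested query computes can be sketched in Python: filter by make, bucket by color, then compute avg and sum inside each bucket. The car documents are invented for illustration, and the output only loosely mirrors the response layout, where each nested metric appears inside its bucket under the name it was given.

```python
from collections import defaultdict

# Hypothetical documents in the cars index.
cars = [
    {"make": "ford", "color": "red", "price": 20000},
    {"make": "ford", "color": "red", "price": 30000},
    {"make": "ford", "color": "blue", "price": 25000},
    {"make": "toyota", "color": "red", "price": 15000},
]

# Query: keep only the documents whose make is "ford".
matches = [car for car in cars if car["make"] == "ford"]

# Bucket aggregation: group the matching documents by color.
prices_by_color = defaultdict(list)
for car in matches:
    prices_by_color[car["color"]].append(car["price"])

# Nested metric aggregations: avg and sum inside each bucket.
buckets = [
    {
        "key": color,
        "doc_count": len(prices),
        "avg_price": {"value": sum(prices) / len(prices)},
        "sum_price": {"value": sum(prices)},
    }
    for color, prices in prices_by_color.items()
]
# Largest buckets first.
buckets.sort(key=lambda bucket: bucket["doc_count"], reverse=True)

for bucket in buckets:
    print(bucket)
```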

6. Sorting after aggregation

Bucket aggregations such as terms, histogram, and date_histogram generate buckets dynamically. When many buckets are generated, how can the sort order of the buckets be controlled and the number of returned buckets limited?

6.1. Multi-bucket sorting

By default, the terms aggregation sorts buckets by doc_count in descending order, while histogram and date_histogram return buckets in ascending key order.
ES bucket aggregation supports two sorting methods:
built-in sorting
sorting by metric

Built-in sorting parameters:
_count - sort by number of documents. Valid for terms, histogram, and date_histogram
_term - sort alphabetically by the string value of the term. Only valid for terms (deprecated since Elasticsearch 6.0 in favor of _key, which terms also accepts)
_key - sort by the key value of each bucket. Valid for histogram and date_histogram
Example:

GET /cars/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : { // Name of the aggregation; any name will do
            "terms" : { // Aggregation type: terms
              "field" : "color",
              "order": { // Sort parameter
                "_count" : "asc"  // Sort by _count; asc is ascending, desc is descending
              }
            }
        }
    }
}

6.2. Sorting by metric

After bucketing, several metrics are usually aggregated within each bucket, so the buckets can also be sorted by the result of one of those metric aggregations.
Example:

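GET /cars/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : { // Name of the aggregation
            "terms" : { // Aggregation type: terms; bucket first
              "field" : "color", // Bucket by the color field
              "order": { // Sort parameter
                "avg_price" : "asc"  // Sort ascending by the result of the avg_price metric aggregation
              }
            },
            "aggs": { // Nested aggregation defining the metric inside each bucket
                "avg_price": { // Name of the aggregation; this is the name referenced by order above
                    "avg": { "field": "price" } // Average of the price field
                }
            }
        }
    }
}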
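
The effect of the two ordering styles can be illustrated by sorting a bucket list in Python. The buckets and their values below are invented; in a real request the sorting is done by ES itself according to the order parameter.

```python
# Hypothetical buckets, each carrying a nested avg_price metric.
buckets = [
    {"key": "red", "doc_count": 4, "avg_price": {"value": 32500.0}},
    {"key": "blue", "doc_count": 2, "avg_price": {"value": 20000.0}},
    {"key": "green", "doc_count": 3, "avg_price": {"value": 21000.0}},
]

# Built-in ordering, like "order": {"_count": "asc"}.
by_count_asc = sorted(buckets, key=lambda b: b["doc_count"])

# Metric ordering, like "order": {"avg_price": "asc"}.
by_avg_price_asc = sorted(buckets, key=lambda b: b["avg_price"]["value"])

print([b["key"] for b in by_count_asc])      # ['blue', 'green', 'red']
print([b["key"] for b in by_avg_price_asc])  # ['blue', 'green', 'red']
```

With this sample data both orderings happen to coincide; changing the average price of one bucket is enough to make them differ.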

6.3. Limit the number of returned buckets

If there are too many buckets, the number of returned buckets can be limited by adding a size parameter to the terms aggregation.
Example:

GET /_search
{
    "aggs" : {
        "products" : { // Name of the aggregation
            "terms" : { // Aggregation type: terms
                "field" : "product", // Bucket by the product field
                "size" : 5 // Return at most 5 buckets
            }
        }
    }
}
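
Combined with the default descending doc_count order, size amounts to a top-N query. A rough Python equivalent uses Counter.most_common; the product values are made up, and the limit is 2 instead of 5 to keep the sample short.

```python
from collections import Counter

# Hypothetical values of the product field, one per document.
products = ["phone", "laptop", "phone", "tablet", "phone", "laptop"]

# Count documents per product and keep only the two largest buckets,
# similar to a terms aggregation with "size": 2.
top_two = Counter(products).most_common(2)
print(top_two)  # [('phone', 3), ('laptop', 2)]
```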

Origin blog.csdn.net/chuige2013/article/details/129635792