Introduction to Aggregations
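Aggregation is a core database feature: it computes summary values over the data set returned by a query, such as the maximum or minimum of a field (or of a computed expression), a sum, or an average. Elasticsearch (ES), which serves as both a search engine and a data store, provides powerful aggregation capabilities as well.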
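- Metric aggregations (metric): compute a metric such as the maximum, minimum, sum, or average over a set of documents.
- Bucket aggregations (bucketing): besides aggregate functions, a relational database can group query results with GROUP BY and then compute metrics per group. In ES, this grouping is called bucketing.
- ES also provides matrix aggregations (matrix) and pipeline aggregations (pipeline), which were still maturing at the time of writing.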
Aggregations are defined in the search request body under the aggregations node, using the following syntax (aggregations can be abbreviated as aggs):
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
The values to aggregate can come from a field or from the result of a script.
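To make the nesting in the syntax above concrete, here is a minimal Python sketch that assembles a request body as a plain dictionary. The helper `agg` is a hypothetical convenience function written for this illustration, not part of any Elasticsearch client; the field names `age` and `balance` are the ones used in the examples in this article.

```python
import json


def agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation definition (hypothetical helper).

    Mirrors the syntax: {<aggregation_type>: <aggregation_body>,
    optional "meta", optional nested "aggs"}.
    """
    definition = {agg_type: body}
    if meta is not None:
        definition["meta"] = meta
    if sub_aggs:
        definition["aggs"] = sub_aggs
    return definition


# A terms bucket aggregation with a max metric aggregation nested inside it.
request_body = {
    "size": 0,
    "aggs": {
        "age_terms": agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_balance": agg("max", {"field": "balance"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted as the body of a `_search` request; each aggregation name maps to exactly one aggregation type, and sub-aggregations sit next to the type, not inside it.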
Metric Aggregations
max min sum avg
Find the maximum balance across all customers (size=0 suppresses the search hits, so only the aggregation result is returned):
POST /bank/_search?
{
"size": 0,
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
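Find the maximum balance among customers aged 24 (with size=2 and the sort, the two hits with the highest balance are returned as well):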
POST /bank/_search?
{
"size": 2,
"query": {
"match": {
"age": 24
}
},
"sort": [
{
"balance": {
"order": "desc"
}
}
],
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
Compute the average age of all customers (the value comes from a script):
POST /bank/_search?size=0
{
"aggs" : {
"avg_age" : {
"avg" : {
"script" : {
"source" : "doc.age.value"
}
}
},
"avg_age10" : {
"avg" : {
"script" : {
"source" : "doc.age.value + 10"
}
}
}
}
}
Specify a field, then read its value in the script through _value:
POST /bank/_search?size=0
{
"aggs": {
"sum_balance": {
"sum": {
"field": "balance",
"script": {
"source": "_value * 1.03"
}
}
}
}
}
Supply a value for documents that lack the field; if missing is not set, those documents are ignored:
POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"field": "age",
"missing": 18
}
}
}
}
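The effect of `missing` can be sketched in plain Python. This is only an illustration of the semantics described above, run on made-up documents, not how Elasticsearch computes the average internally; the function name `avg` and the sample data are assumptions of this sketch.

```python
def avg(docs, field, missing=None):
    """Average a field over docs; docs without the field are skipped
    unless a `missing` default is supplied."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # count the doc with the default value
    return sum(values) / len(values) if values else None


# Made-up sample data: the third document has no age.
docs = [{"age": 20}, {"age": 40}, {"name": "no age"}]

print(avg(docs, "age"))              # the document without age is ignored
print(avg(docs, "age", missing=18))  # the document without age counts as 18
```

Without `missing` the average is taken over two documents; with `missing=18` it is taken over all three, which lowers the result.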
Document Counts
Count documents with the _count API:
POST /bank/_doc/_count
{
"query": {
"match": {
"age" : 24
}
}
}
cardinality counts the distinct values of a field:
POST /bank/_search?size=0
{
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
},
"state_count": {
"cardinality": {
"field": "state.keyword"
}
}
}
}
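value_count counts the values of a field; for a single-valued field this is the number of documents that have a value for it: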
POST /bank/_search?size=0
{
"aggs" : {
"age_count" : { "value_count" : { "field" : "age" } }
}
}
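The three kinds of count above are easy to confuse, so here is a small Python sketch that contrasts them on made-up documents. The function names are chosen for this illustration only. Note one deliberate simplification: this sketch counts distinct values exactly, whereas Elasticsearch's cardinality aggregation is an approximation (it is based on the HyperLogLog++ algorithm) and can be slightly off on high-cardinality fields.

```python
# Made-up sample data: the last document has no age.
docs = [
    {"age": 24, "state": "CA"},
    {"age": 24, "state": "NY"},
    {"age": 31, "state": "CA"},
    {"state": "TX"},
]


def doc_count(docs, predicate):
    """Like _count with a query: how many documents match."""
    return sum(1 for doc in docs if predicate(doc))


def cardinality(docs, field):
    """Like cardinality: how many distinct values the field has (exact here)."""
    return len({doc[field] for doc in docs if field in doc})


def value_count(docs, field):
    """Like value_count on a single-valued field: how many docs have a value."""
    return sum(1 for doc in docs if field in doc)


print(doc_count(docs, lambda d: d.get("age") == 24))
print(cardinality(docs, "age"), cardinality(docs, "state"))
print(value_count(docs, "age"))
```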
stats returns five values at once: count, max, min, avg, and sum:
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
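extended_stats adds four more results on top of stats: the sum of squares, the variance, the standard deviation, and the bounds at the mean plus/minus two standard deviations: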
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"extended_stats": {
"field": "age"
}
}
}
}
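For readers who want to check the extra numbers by hand, the following Python sketch computes the same quantities on a small made-up list. It assumes the variance is the population variance (dividing by the count), which is my understanding of the default; the result key names follow the ones Elasticsearch uses in its response, but the functions themselves are written only for this illustration.

```python
import math


def stats(values):
    """The five basic results returned by stats."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


def extended_stats(values, sigma=2.0):
    """stats plus sum of squares, variance, std deviation and bounds."""
    result = stats(values)
    mean = result["avg"]
    # Population variance: mean squared distance from the average (assumption).
    variance = sum((v - mean) ** 2 for v in values) / result["count"]
    std = math.sqrt(variance)
    result.update({
        "sum_of_squares": sum(v * v for v in values),
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {
            "upper": mean + sigma * std,
            "lower": mean - sigma * std,
        },
    })
    return result


ages = [20, 30, 40]  # made-up sample values
print(extended_stats(ages))
```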
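Percentiles: the value at a given percentile
For the values of a field (or script), documents are accumulated in ascending order of value, and the aggregation returns the value at which the cumulative share of matching documents reaches each requested percentage. By default it returns the values at the percentiles [ 1, 5, 25, 50, 75, 95, 99 ]. The middle entry of the result below can be read as: 50% of the matching documents have age <= 31.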
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age"
}
}
}
}
# Response
"aggregations": {
"age_percents": {
"values": {
"1.0": 20,
"5.0": 21,
"25.0": 25,
"50.0": 31,
"75.0": 35,
"95.0": 39,
"99.0": 40
}
}
}
You can also specify which percentiles to return:
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age",
"percents" : [95, 99, 99.9]
}
}
}
}
# Response
"aggregations": {
"age_percents": {
"values": {
"95.0": 39,
"99.0": 40,
"99.9": 40
}
}
}
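The definition of a percentile can be sketched with the simple nearest-rank method in Python. This is a deliberate simplification: Elasticsearch computes percentiles with an approximation algorithm and may interpolate between values, so its results on real data can differ slightly from this exact calculation. The function name and the sample values are assumptions of this sketch.

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile: the smallest value such that at least
    `percent` percent of the values are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


values = list(range(1, 101))  # made-up data: the integers 1..100

for p in [1, 5, 25, 50, 75, 95, 99]:
    print(p, percentile(values, p))
```

With the integers 1 to 100, the value at each percentile equals the percentile itself, which makes the definition easy to verify by eye.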
Percentile Ranks: the percentage of documents whose value is less than or equal to a given value
POST /bank/_search?size=0
{
"aggs": {
"age_perc_rank": {
"percentile_ranks": {
"field": "age",
"values": [
25,
30
]
}
}
}
}
# Response
"aggregations": {
"age_perc_rank": {
"values": {
"25.0": 26.1,
"30.0": 49.3
}
}
}
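Percentile ranks are the inverse question of percentiles, and the definition fits in a few lines of Python. As with percentiles, this is an exact calculation on made-up values for illustration, while Elasticsearch returns an approximation; the function name is an assumption of this sketch.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    return 100.0 * sum(1 for v in values if v <= threshold) / len(values)


ages = [20, 22, 25, 25, 30, 31, 35, 40]  # made-up sample values

print(percentile_rank(ages, 25))
print(percentile_rank(ages, 30))
```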
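Geo Bounds (geo_bounds): the bounding box of the geo points in a document set
Geo Centroid (geo_centroid): the centroid of those geo points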
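The article gives no request for these two aggregations, so the following Python sketch only illustrates what the two results mean. It makes two labeled assumptions: the bounding box ignores wrapping around the date line, and the centroid is taken as the arithmetic mean of latitudes and longitudes. The function names and the sample points are made up for this illustration.

```python
def geo_bounds(points):
    """Bounding box as top-left and bottom-right corners.

    Assumption: no wrapping around the date line.
    """
    lats = [p["lat"] for p in points]
    lons = [p["lon"] for p in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


def geo_centroid(points):
    """Centroid as the plain average of coordinates (assumption)."""
    n = len(points)
    return {
        "lat": sum(p["lat"] for p in points) / n,
        "lon": sum(p["lon"] for p in points) / n,
    }


# Made-up sample points.
points = [
    {"lat": 10.0, "lon": 20.0},
    {"lat": 30.0, "lon": 40.0},
    {"lat": 20.0, "lon": 0.0},
]

print(geo_bounds(points))
print(geo_centroid(points))
```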
Bucket Aggregations
Terms Aggregation: bucket documents by the distinct values (terms) of a field
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age" # bucket by the distinct values of age
}
}
}
}
# Response
"aggregations": {
"age_terms": {
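"doc_count_error_upper_bound": 0, # upper bound of the error on the document counts
"sum_other_doc_count": 463, # number of documents in the buckets that were not returned
"buckets": [
{
"key": 31, # the age value
"doc_count": 61 # number of documents with this value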
},
{
"key": 39,
"doc_count": 60
},
{
"key": 26,
"doc_count": 59
},
….
]
}
}
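By default, the top 10 buckets are returned, ordered by document count from high to low.
size sets how many buckets are returned.
shard_size sets how many buckets each shard returns to the coordinating node. Its defaults are:
- if the index has a single shard, shard_size = size
- if the index has multiple shards, shard_size = size * 1.5 + 10
show_term_doc_count_error controls whether the error bound is shown for each bucket.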
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 5,
"shard_size":20,
"show_term_doc_count_error": true
}
}
}
}
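Why document counts can be off, and why shard_size matters, is easier to see in a simulation. The Python sketch below models each shard as a list of values: every shard returns only its top shard_size terms, and the coordinating step sums what it received. This is a simplified model written for this article, not Elasticsearch's actual implementation; the function names and the two-shard sample data are made up.

```python
from collections import Counter


def default_shard_size(size, num_shards):
    """Default shard_size as described above."""
    return size if num_shards == 1 else int(size * 1.5 + 10)


def top_terms(counter, n):
    """Top n (term, count) pairs: count descending, then term ascending."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def terms_agg(shards, size, shard_size=None):
    """Merge per-shard top lists the way a coordinating node would."""
    if shard_size is None:
        shard_size = default_shard_size(size, len(shards))
    merged = Counter()
    for shard_values in shards:
        # Each shard only reports its own top shard_size terms.
        for term, count in top_terms(Counter(shard_values), shard_size):
            merged[term] += count
    buckets = top_terms(merged, size)
    total_docs = sum(len(shard_values) for shard_values in shards)
    return {
        "buckets": [{"key": k, "doc_count": c} for k, c in buckets],
        "sum_other_doc_count": total_docs - sum(c for _, c in buckets),
    }


# Made-up data: two shards holding age values.
shards = [[31, 31, 31, 39, 39, 26], [39, 39, 39, 31, 26, 26]]

print(terms_agg(shards, size=1, shard_size=3))  # every term reported: exact
print(terms_agg(shards, size=1, shard_size=1))  # truncated per shard: inexact
```

With shard_size=3 every shard reports all of its terms and the top bucket is exact (age 39, five documents). With shard_size=1 each shard reports only its local winner, so the merged result picks age 31 with a count of 3, although four documents actually have that age. A larger shard_size trades more data transfer for more accurate counts.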
order sorts the buckets by document count or by bucket key:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_count" : "asc" } # sort by document count
}
}
}
}
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order" : { "_key" : "asc" } # sort by bucket key
}
}
}
}
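Compute metrics per bucket. For example, bucket by age, then show the minimum and maximum balance for each age (the buckets in this request are also ordered by max_balance):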
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
},
"min_balance": {
"min": {
"field": "balance"
}
}
}
}
}
}
# Response
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 511,
"buckets": [
{
"key": 27,
"doc_count": 39,
"min_balance": {
"value": 1110
},
"max_balance": {
"value": 46868
}
},
{
"key": 39,
"doc_count": 60,
"min_balance": {
"value": 3589
},
"max_balance": {
"value": 47257
}
},
.....
]
}
}
Sort the buckets by a per-bucket metric, for example by the maximum balance:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"max_balance": "asc"
}
},
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
}
}
You can also compute the maximum, minimum, average, and sum of the balance with stats, and sort by any one of those values:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"order": {
"stats_balance.max": "asc"
}
},
"aggs": {
"stats_balance": {
"stats": {
"field": "balance"
}
}
}
}
}
}
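The combination of bucketing, per-bucket stats, and ordering by one of those stats can be sketched in Python as follows. The sample accounts are made up (the balances for ages 27 and 39 reuse the minimum and maximum values from the sample response earlier in this section), and the function is an illustration of the semantics, not of how Elasticsearch executes the request.

```python
from collections import defaultdict

# Made-up sample accounts.
accounts = [
    {"age": 27, "balance": 1110},
    {"age": 27, "balance": 46868},
    {"age": 39, "balance": 3589},
    {"age": 39, "balance": 47257},
    {"age": 31, "balance": 20000},
]


def terms_with_stats(docs, bucket_field, metric_field, order_by, ascending=True):
    """Bucket docs by bucket_field, compute stats of metric_field per bucket,
    then sort the buckets by one of the stats (like "stats_balance.max")."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "stats": {
                "min": min(values),
                "max": max(values),
                "avg": sum(values) / len(values),
                "sum": sum(values),
            },
        }
        for key, values in groups.items()
    ]
    buckets.sort(key=lambda b: b["stats"][order_by], reverse=not ascending)
    return buckets


for bucket in terms_with_stats(accounts, "age", "balance", order_by="max"):
    print(bucket["key"], bucket["stats"])
```

Changing `order_by` to `"min"`, `"avg"`, or `"sum"` corresponds to changing the path after the dot in the `order` clause of the request.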
Filter the buckets: require a minimum document count, or restrict the result to a given list of keys:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"min_doc_count": 60 # only return buckets with at least 60 documents
}
}
}
}
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"include": [20,24] # only return the buckets for ages 20 and 24
}
}
}
}
You can also state which values to include or exclude, either as exact lists or as regular expressions:
GET /_search
{
"aggs" : {
"JapaneseCars" : {
"terms" : {
"field" : "make",
"include" : ["mazda", "honda"] # only buckets for these make values
}
},
"ActiveCarManufacturers" : {
"terms" : {
"field" : "make",
"exclude" : ["rover", "jensen"] # buckets for every make value except these
}
}
}
}
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
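How include and exclude interact in the regular-expression form can be sketched in Python. Two caveats apply: Elasticsearch uses its own regular expression syntax, which differs from Python's `re` module in places, and this sketch models a pattern as having to match the whole term, which is why it uses `re.fullmatch`. The function name and the sample tags are made up for this illustration.

```python
import re


def filter_terms(terms, include=None, exclude=None):
    """Keep terms that fully match include and do not fully match exclude.

    Include is checked first; exclude then removes terms from what is left.
    """
    kept = []
    for term in terms:
        if include is not None and not re.fullmatch(include, term):
            continue
        if exclude is not None and re.fullmatch(exclude, term):
            continue
        kept.append(term)
    return kept


# Made-up sample tags.
tags = ["sport", "water_sport", "motorsport", "music", "sports_car"]

print(filter_terms(tags, include=".*sport.*", exclude="water_.*"))
```

Here `water_sport` matches the include pattern but is then removed by the exclude pattern, and `music` never matches the include pattern at all.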
Handle missing values: some documents may have no tags field or no value for it, and missing sets the bucket key those documents are counted under:
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A"
}
}
}
}
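Filter Aggregation: aggregate only the documents that match a filter
From the documents matched by the query, select those that satisfy the filter and aggregate over them.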
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"filter": {"match":{"gender":"F"}},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
Filters Aggregation: aggregate over several named filters
Index some sample data:
PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
Then run a query that counts documents per named filter:
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
}
}
}
}
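What this request should return for the three log lines above can be worked out with a small Python sketch. It approximates the match query by lowercasing the text and splitting it into word tokens, which is a rough stand-in for the default analysis and an assumption of this sketch; the function names are made up.

```python
import re

# The three documents indexed above.
logs = [
    {"body": "warning: page could not be rendered"},
    {"body": "authentication error"},
    {"body": "warning: connection timed out"},
]


def matches(text, word):
    """Rough stand-in for a match query on a single word."""
    tokens = re.findall(r"\w+", text.lower())
    return word.lower() in tokens


def filters_agg(docs, named_filters, field="body"):
    """One bucket per named filter; a doc may fall into several buckets."""
    return {
        name: {"doc_count": sum(1 for d in docs if matches(d[field], word))}
        for name, word in named_filters.items()
    }


print(filters_agg(logs, {"errors": "error", "warnings": "warning"}))
```

Each named filter produces its own bucket, so the result has one document under errors and two under warnings.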
Range Aggregation: bucket by value ranges
POST /bank/_search?size=0
{
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{"to":25},
{"from": 25,"to": 35},
{"from": 35}
]
},
"aggs": {
"bmax": {
"max": {
"field": "balance"
}
}
}
}
}
}
# Response: three buckets (to only, from and to, from only)
"aggregations": {
"age_range": {
"buckets": [
{
"key": "*-25.0",
"to": 25,
"doc_count": 225,
"bmax": {
"value": 49587
}
},
{
"key": "25.0-35.0",
"from": 25,
"to": 35,
"doc_count": 485,
"bmax": {
"value": 49795
}
},
{
"key": "35.0-*",
"from": 35,
"doc_count": 290,
"bmax": {
"value": 49989
}
}
]
}
}
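A detail that the bucket keys do not spell out is which bucket a boundary value such as 25 falls into. In a range aggregation, from is inclusive and to is exclusive, which the following Python sketch reproduces on made-up ages. The function names are chosen for this illustration, and the key format simply imitates the keys shown in the response above.

```python
def bucket_key(r):
    """Build a key like "*-25.0", "25.0-35.0" or "35.0-*"."""
    low = str(float(r["from"])) if "from" in r else "*"
    high = str(float(r["to"])) if "to" in r else "*"
    return f"{low}-{high}"


def range_agg(values, ranges):
    """Count values per range: from is inclusive, to is exclusive."""
    counts = {bucket_key(r): 0 for r in ranges}
    for value in values:
        for r in ranges:
            above = "from" not in r or value >= r["from"]
            below = "to" not in r or value < r["to"]
            if above and below:
                counts[bucket_key(r)] += 1
    return counts


ranges = [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}]
ages = [20, 24, 25, 34, 35, 40]  # made-up sample values

print(range_agg(ages, ranges))
```

Age 25 lands in the middle bucket and age 35 in the last one, so adjacent ranges that share a boundary never count the same document twice.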
Date Range Aggregation: bucket by date ranges
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{ "to": "now-10M/M" },
{ "from": "now-10M/M" }
]
}
}
}
}
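The date math expression `now-10M/M` in this request reads as: take the current time, subtract 10 months, then round down to the start of that month. The Python sketch below evaluates just this one expression for a fixed, made-up value of now, so the boundary between the two buckets can be checked by hand. It ignores time zones, and the function name is an assumption of this sketch.

```python
from datetime import datetime


def months_ago_rounded_to_month(now, months):
    """Evaluate "now-<months>M/M": subtract months, round down to the month."""
    # Work with a running month index so year boundaries are handled.
    month_index = now.year * 12 + (now.month - 1) - months
    year, month_zero_based = divmod(month_index, 12)
    return datetime(year, month_zero_based + 1, 1)


now = datetime(2024, 3, 15, 10, 30)  # made-up "current" time
boundary = months_ago_rounded_to_month(now, 10)

print(boundary)                     # start of the month ten months earlier
print(boundary.strftime("%m-%Y"))   # month-year rendering of the boundary
```

Documents dated before the boundary fall into the first bucket (`to`), and documents dated on or after it fall into the second bucket (`from`).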
Date Histogram Aggregation: bucket by time interval
This aggregates by day, month, year, and so on. The interval can be year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), or second (1s), or an explicit duration such as 90m.
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
}
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "90m"
}
}
}
}
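The two requests above differ in how the bucket key is derived: a calendar unit such as month rounds each timestamp down to the start of its month, while a fixed duration such as 90m cuts the timeline into equal slices. The Python sketch below illustrates both on made-up UTC timestamps. It assumes the fixed slices are aligned to the Unix epoch and it only lists non-empty buckets; both are simplifications of this sketch, and the function names are made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_bucket(ts):
    """Calendar interval: round down to the first instant of the month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def fixed_bucket(ts, interval):
    """Fixed interval: round down to a multiple of interval since the epoch."""
    slices = (ts - EPOCH) // interval  # whole intervals elapsed (an int)
    return EPOCH + slices * interval


def date_histogram(timestamps, key_fn):
    """Count timestamps per bucket key, sorted by key (non-empty buckets only)."""
    return dict(sorted(Counter(key_fn(ts) for ts in timestamps).items()))


utc = timezone.utc
timestamps = [  # made-up sample data
    datetime(2015, 1, 1, 0, 20, tzinfo=utc),
    datetime(2015, 1, 1, 1, 10, tzinfo=utc),
    datetime(2015, 1, 1, 1, 40, tzinfo=utc),
    datetime(2015, 2, 10, 12, 0, tzinfo=utc),
]

print(date_histogram(timestamps, month_bucket))
print(date_histogram(timestamps, lambda ts: fixed_bucket(ts, timedelta(minutes=90))))
```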
Missing Aggregation: a bucket for missing values
Puts the documents that have no value for the given field into a single bucket for aggregation.
POST /bank/_search?size=0
{
"aggs" : {
"account_without_a_age" : {
"missing" : { "field" : "age" }
}
}
}