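Contents

Introduction to aggregations
Metric aggregations

max, min, sum, avg
Counting documents
Percentiles: the value at a given percentile
Percentile ranks: the share of documents at or below a given value
Geo bounds: the bounding box of the points in a document set
Geo centroid: the center point

Bucket aggregations

Terms Aggregation: group by field value
Filter Aggregation: aggregate the documents that match a filter
Filters Aggregation: aggregate over several named filters
Range Aggregation: group by value range
Date Range Aggregation: group by date range
Date Histogram Aggregation: bucket by time interval
Missing Aggregation: bucket of documents missing a field
Geo Distance Aggregation: group by distance from a point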

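Introduction to aggregations

Aggregation is a core database feature: it computes summary values over the data set a query returns, such as the maximum or minimum of a field (or of a computed expression), a sum, or an average. Elasticsearch (ES), being both a search engine and a data store, offers powerful aggregation capabilities as well.

Metric aggregations (metric): compute a metric such as the max, min, sum, or average over a set of documents.

Bucket aggregations (bucketing): besides aggregate functions, a relational database can group query results with GROUP BY and then compute metrics per group. In ES, this grouping is called bucketing.

ES also provides matrix aggregations (matrix) and pipeline aggregations (pipeline), which were still being refined at the time of writing.

Aggregations are defined under the aggregations node of the search request body, using the following syntax (aggregations can be abbreviated to aggs):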

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
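As a minimal sketch of this structure, the Python snippet below assembles such a request body as a plain dictionary and prints it as JSON. The helper `build_agg` and the aggregation names `by_age` and `max_balance` are illustrative assumptions, not part of Elasticsearch; only the nesting mirrors the syntax above.

```python
import json

def build_agg(agg_type, body, sub_aggs=None, meta=None):
    """Return one aggregation definition: {type: body} plus optional meta and aggs."""
    definition = {agg_type: body}
    if meta:
        definition["meta"] = meta
    if sub_aggs:
        definition["aggs"] = sub_aggs
    return definition

# A terms bucket on age with a max metric computed inside each bucket.
request_body = {
    "size": 0,
    "aggs": {
        "by_age": build_agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_balance": build_agg("max", {"field": "balance"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted as the body of a `_search` request; sub-aggregations nest to any depth in the same way.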

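The values to aggregate can come from a field or from the result of a script.

Metric aggregations

max, min, sum, avg

Find the highest balance across all customers (size=0 means no matching documents are returned, only the aggregation result):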

POST /bank/_search
{
  "size": 0,
  "aggs": {
    "max_balance": {
      "max": {
        "field": "balance"
      }
    }
  }
}

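Highest balance among customers aged 24 (with size=2 and the descending sort, the two documents with the highest balances are returned as well):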

POST /bank/_search?
{
  "size": 2, 
  "query": {
    "match": {
      "age": 24
    }
  },
  "sort": [
    {
      "balance": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "max_balance": {
      "max": {
        "field": "balance"
      }
    }
  }
}

Average age of all customers, with the value taken from a script:

POST /bank/_search?size=0
{
    "aggs" : {
        "avg_age" : {
            "avg" : {
                "script" : {
                    "source" : "doc.age.value"
                }
            }
        },
        "avg_age10" : {
            "avg" : {
                "script" : {
                    "source" : "doc.age.value + 10"
                }
            }
        }
    }
}

Specify field, then read the field's value inside the script through _value:

POST /bank/_search?size=0
{
  "aggs": {
    "sum_balance": {
      "sum": {
        "field": "balance",
        "script": {
            "source": "_value * 1.03"
        }
      }
    }
  }
}

Supply a value for documents that lack the field. If missing is not set, those documents are ignored:

POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": {
      "avg": {
        "field": "age",
        "missing": 18
      }
    }
  }
}
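The effect of missing on an average can be sketched in a few lines of Python. The function `avg_with_missing` and the three sample documents are hypothetical; the sketch only imitates the behavior described above and does not call Elasticsearch.

```python
def avg_with_missing(docs, field, missing=None):
    """Average a field; documents without it are skipped unless a default is given."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # treat the absent field as the default value
    return sum(values) / len(values) if values else None

docs = [{"age": 30}, {"age": 40}, {"name": "no age recorded"}]

print(avg_with_missing(docs, "age"))              # third document is ignored
print(avg_with_missing(docs, "age", missing=18))  # third document counts as 18
```

With the default, the third document pulls the average down because it now takes part in the calculation.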

Counting documents

Count the documents that match a query with the _count API:

POST /bank/_doc/_count
{
  "query": {
    "match": {
      "age" : 24
    }
  }
}

cardinality counts distinct values (the result is approximate for fields with many distinct values):

POST /bank/_search?size=0
{
  "aggs": {
    "age_count": {
      "cardinality": {
        "field": "age"
      }
    },
    "state_count": {
      "cardinality": {
        "field": "state.keyword"
      }
    }
  }
}

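value_count counts the values extracted from a field, which for a single-valued field is the number of documents that have a value for it: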

POST /bank/_search?size=0
{
    "aggs" : {
        "age_count" : { "value_count" : { "field" : "age" } }
    }
}
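The difference between cardinality and value_count can be shown with a small Python sketch. The sample documents and both helper functions are hypothetical, and the sketch counts exactly, whereas Elasticsearch's cardinality is an approximation.

```python
docs = [
    {"age": 24, "state": "CA"},
    {"age": 24, "state": "NY"},
    {"age": 31},
    {"state": "CA"},
]

def value_count(docs, field):
    """Number of documents that carry a value for the field (single-valued fields)."""
    return sum(1 for doc in docs if field in doc)

def cardinality(docs, field):
    """Number of distinct values of the field (exact here, approximate in ES)."""
    return len({doc[field] for doc in docs if field in doc})

print(value_count(docs, "age"), cardinality(docs, "age"))
print(value_count(docs, "state"), cardinality(docs, "state"))
```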

stats returns five values at once: count, max, min, avg, and sum:

POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

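extended_stats adds four results on top of stats: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations: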

POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "extended_stats": {
        "field": "age"
      }
    }
  }
}
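To make the extra fields concrete, here is a minimal Python sketch that computes the same quantities for a list of numbers. The function name and the sample values are assumptions; the result keys follow the field names used in the extended_stats response, and the variance is the population variance.

```python
import math

def extended_stats(values, sigma=2.0):
    """Compute stats plus sum of squares, variance, std deviation, and bounds."""
    count = len(values)
    total = sum(values)
    avg = total / count
    variance = sum((v - avg) ** 2 for v in values) / count  # population variance
    std = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum(v * v for v in values),
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {"upper": avg + sigma * std, "lower": avg - sigma * std},
    }

result = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(result)
```

For these sample values the mean is 5 and the standard deviation is 2, so the bounds are 1 and 9.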

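Percentiles: the value at a given percentile

The values of the field (or script) are sorted in ascending order and, for each value, the share of all matching documents at or below it is accumulated; the aggregation returns the value found at each requested percentile. By default it returns the values at the percentiles [ 1, 5, 25, 50, 75, 95, 99 ]. In the response below, the middle entry can be read as: 50% of the documents have age <= 31, or conversely, documents with age <= 31 make up 50% of all matching documents.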

POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age"
      }
    }
  }
}

# Response
 "aggregations": {
    "age_percents": {
      "values": {
        "1.0": 20,
        "5.0": 21,
        "25.0": 25,
        "50.0": 31,
        "75.0": 35,
        "95.0": 39,
        "99.0": 40
      }
    }
  }

Specific percentiles can also be requested:

POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age",
        "percents" : [95, 99, 99.9] 
      }
    }
  }
}

# Response
"aggregations": {
    "age_percents": {
      "values": {
        "95.0": 39,
        "99.0": 40,
        "99.9": 40
      }
    }
}
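The definition can be illustrated with a nearest-rank percentile in Python. This is a simplified, exact calculation on a hypothetical list of ages; Elasticsearch computes percentiles with an approximate algorithm, so its numbers can differ slightly on real data.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

ages = [20, 21, 25, 25, 31, 31, 35, 39, 40, 40]

for p in (1, 25, 50, 95):
    print(p, percentile(ages, p))
```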

Percentile ranks: the share of documents at or below a given value

POST /bank/_search?size=0
{
  "aggs": {
    "gge_perc_rank": {
      "percentile_ranks": {
        "field": "age",
        "values": [
          25,
          30
        ]
      }
    }
  }
}

# Response
"aggregations": {
    "gge_perc_rank": {
      "values": {
        "25.0": 26.1,
        "30.0": 49.3
      }
    }
  }
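percentile_ranks is the inverse question of percentiles: given a value, what percentage of documents is at or below it. A minimal exact sketch in Python, using a hypothetical list of ages (Elasticsearch itself returns an approximation):

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to the threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

ages = [20, 21, 25, 25, 31, 31, 35, 39, 40, 40]

print(percentile_rank(ages, 25))
print(percentile_rank(ages, 31))
```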

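Geo bounds: the bounding box of the points in a document set

See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html

Geo centroid: the center point

See the official documentation: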
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html

Bucket aggregations

Terms Aggregation: group by field value

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age"  # group by the distinct values of age
      }
    }
  }
}

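# Response
"aggregations": {
	"age_terms": {
	  "doc_count_error_upper_bound": 0,  # upper bound of the error on the document counts
	  "sum_other_doc_count": 463,  # documents in buckets that were not returned
	  "buckets": [
		{
		  "key": 31,  # the age value
		  "doc_count": 61  # number of documents with this value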
		},
		{
		  "key": 39,
		  "doc_count": 60
		},
		{
		  "key": 26,
		  "doc_count": 59
		},
		….
	   ]
	}
}

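By default, the top 10 buckets are returned, ordered by document count from highest to lowest.

size sets how many buckets are returned.

shard_size sets how many buckets each shard returns to the coordinating node. Its default is:

If the index has a single shard, shard_size = size

If the index has several shards, shard_size = size * 1.5 + 10

show_term_doc_count_error controls whether each bucket reports its own error bound.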

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 5,
        "shard_size":20,
        "show_term_doc_count_error": true
      }
    }
  }
}
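Why shard_size matters can be sketched with a small simulation: each shard reports only its own top terms, and the coordinating node adds up what it receives. The two shards, their term counts, and the function names below are hypothetical; the default rule follows the description above.

```python
from collections import Counter

def default_shard_size(size, number_of_shards):
    """Default described above: size for one shard, size * 1.5 + 10 otherwise."""
    return size if number_of_shards == 1 else int(size * 1.5 + 10)

def top_terms(shards, size, shard_size):
    """Merge the top shard_size terms of every shard, then keep the top size."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)

shards = [
    Counter({"a": 10, "b": 9, "c": 8}),
    Counter({"c": 11, "b": 9, "a": 1}),
]

print(default_shard_size(10, 3))
print(top_terms(shards, size=1, shard_size=1))  # count for "c" is too low
print(top_terms(shards, size=1, shard_size=3))  # every shard reports all of its terms
```

With shard_size=1 the first shard never reports its 8 documents for "c", so the merged count is too low. A larger shard_size trades more data per shard for more accurate counts.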

order sorts the buckets by document count or by key:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order" : { "_count" : "asc" }  # order by document count
      }
    }
  }
}

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order" : { "_key" : "asc" }  # order by bucket key
      }
    }
  }
}

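Compute metrics per bucket: for example, group by age and show the minimum and maximum balance for each age (the buckets here are also ordered by max_balance, ascending):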

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "max_balance": "asc"
        }
      },
      "aggs": {
        "max_balance": {
          "max": {
            "field": "balance"
          }
        },
        "min_balance": {
          "min": {
            "field": "balance"
          }
        }      
      }    
    }  
  }
}

# Response
"aggregations": {
    "age_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 511,
      "buckets": [
        {
          "key": 27,
          "doc_count": 39,
          "min_balance": {
            "value": 1110
          },
          "max_balance": {
            "value": 46868
          }
        },
        {
          "key": 39,
          "doc_count": 60,
          "min_balance": {
            "value": 3589
          },
          "max_balance": {
            "value": 47257
          }
        },
        .....
      ]
    }
  }
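In relational terms this is GROUP BY age with MIN and MAX per group, ordered by the maximum. The Python sketch below reproduces that logic on a handful of hypothetical documents; the function name and sample data are assumptions for illustration only.

```python
from collections import defaultdict

def terms_with_min_max(docs, key_field, metric_field):
    """Group docs by key_field, compute min/max of metric_field, order by max ascending."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "min_balance": min(values),
            "max_balance": max(values),
        }
        for key, values in groups.items()
    ]
    return sorted(buckets, key=lambda bucket: bucket["max_balance"])

docs = [
    {"age": 27, "balance": 1000},
    {"age": 27, "balance": 4000},
    {"age": 39, "balance": 500},
    {"age": 39, "balance": 3000},
    {"age": 31, "balance": 2000},
]

for bucket in terms_with_min_max(docs, "age", "balance"):
    print(bucket)
```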

Order buckets by a sub-aggregation metric, for example by the maximum balance:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "max_balance": "asc"
        }
      },
      "aggs": {
        "max_balance": {
          "max": {
            "field": "balance"
          }
        }
      }
    }  
  }
}

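A stats sub-aggregation gives the max, min, avg, and sum of the balance at once, and the buckets can be ordered by any one of these values: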

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "stats_balance.max": "asc"
        }
      },
      "aggs": {
        "stats_balance": {
          "stats": {
            "field": "balance"
          }
        }
      }
    }  
  }
}

Filter buckets: require a minimum document count, or restrict the result to a list of keys:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "min_doc_count": 60  # only return buckets with at least 60 documents
      }
    }
  }
}

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "include": [20,24]  # only return the buckets for ages 20 and 24
      }
    }
  }
}

include and exclude can also list the exact values to keep or drop, or match values with a regular expression:

GET /_search
{
    "aggs" : {
        "JapaneseCars" : {
             "terms" : {
                 "field" : "make",
                 "include" : ["mazda", "honda"]  # keep only these values of make
             }
         },
        "ActiveCarManufacturers" : {
             "terms" : {
                 "field" : "make",
                 "exclude" : ["rover", "jensen"]  # drop these values of make
             }
         }
    }
}

GET /_search
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"
            }
        }
    }
}
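How include and exclude combine can be sketched in Python: a term must match the include pattern in full, and is then dropped if it matches the exclude pattern. The tag values and the function name are hypothetical, and Python's `re` syntax is not identical to the regular expression syntax Elasticsearch uses, although these two simple patterns behave the same in both.

```python
import re

def filter_terms(terms, include=None, exclude=None):
    """Keep terms that fully match include and do not fully match exclude."""
    kept = []
    for term in terms:
        if include is not None and not re.fullmatch(include, term):
            continue
        if exclude is not None and re.fullmatch(exclude, term):
            continue  # exclude wins when both patterns match
        kept.append(term)
    return kept

tags = ["sport", "water_sport", "motorsport", "music"]

print(filter_terms(tags, include=".*sport.*", exclude="water_.*"))
```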

Handling missing values: some documents may have no tags field, or no value in it. missing names the bucket key under which such documents are counted:

GET /_search
{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "missing": "N/A" 
             }
         }
    }
}

Filter Aggregation: aggregate the documents that match a filter

From the documents the query matches, only those that also satisfy the filter are aggregated:

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "filter": {"match":{"gender":"F"}},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Filters Aggregation: aggregate over several named filters

Index some sample data:

PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }

Then run a query that counts the documents per named filter:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }   
    }  
  }
}
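The expected bucket counts for this request can be worked out with a short Python sketch over the three sample log lines. The helper names are hypothetical, and the match query is approximated here by lowercasing and splitting on word characters, which is an assumption about how the body field is analyzed.

```python
import re

bodies = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
]

def contains_token(token):
    """Return a predicate that checks whether a text contains the word token."""
    return lambda text: token in re.findall(r"\w+", text.lower())

def filters_agg(texts, filters):
    """Count, for every named filter, the texts that satisfy it."""
    return {name: sum(1 for t in texts if predicate(t)) for name, predicate in filters.items()}

counts = filters_agg(
    bodies,
    {"errors": contains_token("error"), "warnings": contains_token("warning")},
)
print(counts)
```

Each named filter becomes its own bucket, and a document that matches several filters is counted in each of them.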

Range Aggregation: group by value range

POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "ranges": [
          {"to":25},
          {"from": 25,"to": 35},
          {"from": 35}
        ]
      },
      "aggs": {
        "bmax": {
          "max": {
            "field": "balance"
          }
        }
      }    
    }  
  }
}

# Response: three buckets (to only, from and to, from only); from is inclusive, to is exclusive
"aggregations": {
    "age_range": {
      "buckets": [
        {
          "key": "*-25.0",
          "to": 25,
          "doc_count": 225,
          "bmax": {
            "value": 49587
          }
        },
        {
          "key": "25.0-35.0",
          "from": 25,
          "to": 35,
          "doc_count": 485,
          "bmax": {
            "value": 49795
          }
        },
        {
          "key": "35.0-*",
          "from": 35,
          "doc_count": 290,
          "bmax": {
            "value": 49989
          }
        }
      ]
    }
  }
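The bucket boundaries follow a simple rule: from is inclusive and to is exclusive, so a value of exactly 25 lands in the second bucket. A minimal Python sketch, with a hypothetical list of ages and a hypothetical helper name:

```python
def range_buckets(values, ranges):
    """Count values per range; 'from' is inclusive, 'to' is exclusive."""
    buckets = []
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        count = sum(
            1
            for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        low_label = "*" if low is None else float(low)
        high_label = "*" if high is None else float(high)
        buckets.append({"key": f"{low_label}-{high_label}", "doc_count": count})
    return buckets

ages = [20, 24, 25, 30, 34, 35, 40]
ranges = [{"to": 25}, {"from": 25, "to": 35}, {"from": 35}]

for bucket in range_buckets(ages, ranges):
    print(bucket)
```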

Date Range Aggregation: group by date range

POST /sales/_search?size=0
{
    "aggs": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyyy",
                "ranges": [
                    { "to": "now-10M/M" }, 
                    { "from": "now-10M/M" } 
                ]
            }
        }
    }
}
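The date math expression now-10M/M means "ten months before now, rounded down to the start of that month". The Python sketch below computes that boundary for a fixed, hypothetical value of now and splits some sample dates around it; both function names are assumptions for illustration.

```python
from datetime import datetime

def start_of_month_minus(now, months):
    """Subtract whole months from now and round down to the first day of the month."""
    index = now.year * 12 + (now.month - 1) - months
    year, month_index = divmod(index, 12)
    return datetime(year, month_index + 1, 1)

def date_range_buckets(dates, boundary):
    """Two buckets: dates before the boundary (to) and dates at or after it (from)."""
    return {
        "to": sum(1 for d in dates if d < boundary),
        "from": sum(1 for d in dates if d >= boundary),
    }

now = datetime(2024, 3, 15, 10, 30)
boundary = start_of_month_minus(now, 10)
dates = [datetime(2023, 4, 30), datetime(2023, 5, 1), datetime(2024, 1, 1)]

print(boundary)
print(date_range_buckets(dates, boundary))
```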

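Date Histogram Aggregation: bucket by time interval

Aggregates by day, month, year, and so on. The interval can be a calendar unit, namely year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), or second (1s), or a custom span of time such as 90m.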

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            }
        }
    }
}

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "90m"
            }
        }
    }
}
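The two requests above differ in how a timestamp is mapped to its bucket. A rough Python sketch of both mappings follows; the sample timestamps and function names are hypothetical, and aligning the fixed 90-minute buckets to the Unix epoch is an assumption about how the bucket start is chosen.

```python
from collections import Counter
from datetime import datetime, timedelta

def month_histogram(dates):
    """Calendar interval: one bucket per calendar month."""
    return dict(sorted(Counter(d.strftime("%Y-%m") for d in dates).items()))

def fixed_histogram(dates, interval):
    """Fixed interval: round each timestamp down to a multiple of interval since the epoch."""
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for d in dates:
        steps = (d - epoch) // interval  # whole intervals elapsed since the epoch
        counts[epoch + steps * interval] += 1
    return dict(sorted(counts.items()))

dates = [
    datetime(2015, 1, 1, 0, 10),
    datetime(2015, 1, 1, 1, 20),
    datetime(2015, 1, 1, 1, 40),
    datetime(2015, 2, 10, 12, 0),
]

print(month_histogram(dates))
print(fixed_histogram(dates, timedelta(minutes=90)))
```

Calendar buckets vary in length (months have 28 to 31 days), while fixed buckets always span exactly the given duration.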

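Missing Aggregation: bucket of documents missing a field

Puts the documents that have no value for the given field into a single bucket for aggregation: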

POST /bank/_search?size=0
{
    "aggs" : {
        "account_without_a_age" : {
            "missing" : { "field" : "age" }
        }
    }
}

Geo Distance Aggregation: group by distance from a point

See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html

Elasticsearch Search Engine, Part 12: Aggregation Analysis