37. In-depth explanation of Elasticsearch aggregation - comparison with Mysql implementation (lasitcsearch aggregation advanced)

Aggregate Cognitive Premise

Buckets - Collection 
metrics of documents that meet certain conditions (Metrics) - Statistical calculation of documents in buckets

SELECT COUNT(color) 
FROM table 
GROUP BY color

COUNT(color) is equivalent to the index. 
GROUP BY color is equivalent to buckets.

write picture description here

First, the start of aggregation

1. Create an index

1.1 Create index DSL implementation

put cars
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

1.2 Create mysql library table sql implementation

CREATE TABLE `cars` (
  `id` int(11) NOT NULL,
  `price` int(11) DEFAULT NULL,
  `color` varchar(255) DEFAULT NULL,
  `make` varchar(255) DEFAULT NULL,
  `sold` date DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

2. Count the number of cars of different colors

2.1 DSL implementation of counting cars of different colors

GET /cars/transactions/_search
{
  "size":0,
  "aggs":{
  "popular_colors" : {
  "terms":{
  "field": "color.keyword"
  }
  }
  }
}

Return result:

lve

2.2 mysql implementation of statistics of different colors

select color, count(color) as cnt from cars group by color order by cnt desc;
  • 1

Return result:

red 4
green 2
blue 2

3. Calculate the average price of cars of different colors

3.1 Statistics of the average price of cars of different colors DSL implementation:

GET /cars/transactions/_search
{
  "size":0,
  "aggs":{
  "colors" : {
  "terms":{
  "field": "color.keyword"
  },
  "aggs":{
  "avg_price":{
  "avg": {
  "field": "price"
  }
  }
  }
  }
  }
}

Return the aggregated result:

lve

3.2 Statistical sql implementation of the average price of cars of different colors:

select color, count(color) as cnt, avg(price) as avg_price from cars group by color order by cnt desc;

color cnt avg_price
red 4 32500.0000
green 2 21000.0000
blue 2 20000.0000

4. Distribution of car manufacturers by color

4.1 Statistical distribution dsl implementation of each color car manufacturer

GET /cars/transactions/_search
{
  "size":0,
  "aggs":{
  "colors" : {
  "terms":{
  "field": "color.keyword"
  },
  "aggs":{
  "make":{
  "terms":{
  "field": "make.keyword"
  }
  }
  }
  }
  }
}

Return result:

4.2 Statistical sql implementation of the distribution of each color car manufacturer

Description: It does not strictly correspond to the implementation of dsl

select color, make from cars order by color;

color make
blue toyota
blue ford
green ford
green toyota
red bmw
red honda
red honda
red honda

5. Statistics of the lowest price and highest price of each manufacturer

5.1 Statistics of the lowest and highest price DSL implementations per manufacturer

GET /cars/transactions/_search
{
  "size":0,
  "aggs":{
  "make_class" : {
  "terms":{
  "field": "make.keyword"
  },
  "aggs":{
  "min_price":{
  "min":{
  "field": "price"
  }
  },
  "max_price":{
  "max":{
  "field": "price"
  }
  }
  }
  }
  }
}

Aggregate result:

5.2 The sql implementation that counts the lowest and highest prices for each manufacturer

select make, min(price) as min_price, max(price) as max_price from cars group by make;

make min_price max_price
bmw 80000 80000
ford 25000 30000
honda 10000 20000
toyota 12000 15000

2. Advanced Aggregation

1. Bar chart aggregation

1.1 Segment the sum of the car sales prices in each interval

GET /cars/transactions/_search
{
  "size":0,
  "aggs":{
  "price" : {
  "histogram":{
  "field": "price",
  "interval": 20000
  },
  "aggs":{
  "revenue":{
  "sum":{
  "field": "price"
  }
  }
  }
  }
  }
}

Car sales price range: defined as 20000; 
segmented statistics price and sum statistics.

1.2 Multi-dimensional measurement of automobile indicators of different manufacturers

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs": {
  "makes": {
  "terms": {
  "field": "make.keyword",
  "size": 10
  },
  "aggs": {
  "stats": {
  "extended_stats": {
  "field": "price"
  }
  }
  }
  }
  }
}

A snippet of the output:

{
  "key": "ford",
  "doc_count": 2,
  "stats": {
  "count": 2,
  "min": 25000,
  "max": 30000,
  "avg": 27500,
  "sum": 55000,
  "sum_of_squares": 1525000000,
  "variance": 6250000,
  "std_deviation": 2500,
  "std_deviation_bounds": {
  "upper": 32500,
  "lower": 22500
  }
  }
  }

2. Aggregate by time statistics

2.1 Statistical Manufacturer's Auto Sales DSL Achievement by Month

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs": {
  "sales":{
  "date_histogram":{
  "field":"sold",
  "interval":"month",
  "format":"yyyy-MM-dd"
  }
  }
  }
}

Return result:

2.2 Statistical sql implementation of manufacturers' car sales by month

SELECT make, count(make) as cnt, CONCAT(YEAR(sold),',',MONTH(sold)) AS data_time
FROM `cars`
GROUP BY YEAR(sold) DESC,MONTH(sold)

查询结果如下:
make cnt data_time
bmw 1 2014,1
ford 1 2014,2
ford 1 2014,5
toyota 1 2014,7
toyota 1 2014,8
honda 1 2014,10
honda 2 2014,11

2.3 Contains the December implementation of the Processing DSL

There are no statistics for December in 2.1 above.

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs": {
  "sales":{
  "date_histogram":{
  "field":"sold",
  "interval":"month",
  "format":"yyyy-MM-dd",
  "min_doc_count": 0,
  "extended_bounds":{
  "min":"2014-01-01",
  "max":"2014-12-31"
  }
  }
  }
  }
}

2.4 Statistical DSL implementation on a quarterly basis

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs": {
  "sales":{
  "date_histogram":{
  "field":"sold",
  "interval":"quarter",
  "format":"yyyy-MM-dd",
  "min_doc_count": 0,
  "extended_bounds":{
  "min":"2014-01-01",
  "max":"2014-12-31"
  }
  },
  "aggs":{
  "per_make_sum":{
  "terms":{
  "field": "make.keyword"
  },
  "aggs":{
  "sum_price":{
  "sum":{ "field": "price"}
  }
  }
  },
  "top_sum": {
  "sum": {"field":"price"}
  }
  }
  }
  }
  }

2.5 Search-based (scoped) aggregation operations

2.5.1 Basic query aggregation

GET /cars/transactions/_search
{
  "query" : {
  "match" : {
  "make.keyword" : "ford"
  }
  },
  "aggs" : {
  "colors" : {
  "terms" : {
  "field" : "color.keyword"
  }
  }
  }
}

The corresponding sql implementation:

select make, color from cars
where make = "ford";

结果返回如下:
make color
ford green
ford blue

3. Filter polymerization

1. Filter operation

统计全部汽车的平均价钱以及单品平均价钱;

GET /cars/transactions/_search
{
  "size" : 0,
  "query" : {
  "match" : {
  "make.keyword" : "ford"
  }
  },
  "aggs" : {
  "single_avg_price": {
  "avg" : { "field" : "price" }
  },
  "all": {
  "global" : {},
  "aggs" : {
  "avg_price": {
  "avg" : { "field" : "price" }
  }

  }
  }
  }
}

等价于:

select make, color, avg(price) from cars
where make = "ford" ;
select avg(price) from cars;

2、范围限定过滤(过滤桶)

我们可以指定一个过滤桶,当文档满足过滤桶的条件时,我们将其加入到桶内。

GET /cars/transactions/_search
{
  "size" : 0,
  "query":{
  "match": {
  "make": "ford"
  }
  },
  "aggs":{
  "recent_sales": {
  "filter": {
  "range": {
  "sold": {
  "from": "now-100M"
  }
  }
  },
  "aggs": {
  "average_price":{
  "avg": {
  "field": "price"
  }
  }
  }
  }
  }
}

mysql的实现如下:

select *, avg(price) from cars where period_diff(date_format(now() , '%Y%m') , date_format(sold, '%Y%m')) > 30
and make = "ford";


mysql查询结果如下:
id price color make sold avg
3 30000 green ford 2014-05-18 27500.0000

3、后过滤器

只过滤搜索结果,不过滤聚合结果——post_filter实现

GET /cars/transactions/_search
{
  "query": {
  "match": {
  "make": "ford"
  }
  },
  "post_filter": {
  "term" : {
  "color.keyword" : "green"
  }
  },
  "aggs" : {
  "all_colors": {
  "terms" : { "field" : "color.keyword" }
  }
  }
}

post_filter 会过滤搜索结果,只展示绿色 ford 汽车。这在查询执行过 后 发生,所以聚合不受影响。

小结 
选择合适类型的过滤(如:搜索命中、聚合或两者兼有)通常和我们期望如何表现用户交互有关。选择合适的过滤器(或组合)取决于我们期望如何将结果呈现给用户。

  • 在 filter 过滤中的 non-scoring 查询,同时影响搜索结果和聚合结果。
  • filter 桶影响聚合。
  • post_filter 只影响搜索结果。

四、多桶排序

4.1 内置排序

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
  "colors" : {
  "terms" : {
  "field" : "color.keyword",
  "order": {
  "_count" : "asc"
  }
  }
  }
  }
}

4.2 按照度量排序

以下是按照汽车平均售价的升序进行排序。 
过滤条件:汽车颜色; 
聚合条件:平均价格; 
排序条件:汽车的平均价格升序。

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
  "colors" : {
  "terms" : {
  "field" : "color.keyword",
  "order": {
  "avg_price" : "asc"
  }
  },
  "aggs": {
  "avg_price": {
  "avg": {"field": "price"}
  }
  }
  }
}
}

多条件聚合后排序如下所示:

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
  "colors" : {
  "terms" : {
  "field" : "color.keyword",
  "order": {
  "stats.variance" : "asc"
  }
  },
  "aggs": {
  "stats": {
  "extended_stats": {"field": "price"}
  }
  }
  }
  }
}

4.3 基于“深度”的度量排序

太复杂,不推荐!

五、近似聚合

cardinality的含义是“基数”;

5.1 统计去重后的数量

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
  "distinct_colors" : {
  "cardinality" : {
  "field" : "color.keyword"
  }
  }
  }
}

类似于:

SELECT COUNT(DISTINCT color)  FROM cars;
  • 1

以下: 
以月为周期统计;

GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
  "months" : {
  "date_histogram": {
  "field": "sold",
  "interval": "month"
  },
  "aggs": {
  "distinct_colors" : {
  "cardinality" : {
  "field" : "color.keyword"
  }
  }
  }
  }
  }
}

六、doc values解读

在 Elasticsearch 中,doc values 就是一种列式存储结构,默认情况下每个字段的 doc values 都是激活的,doc values 是在索引时创建的,当字段索引时,Elasticsearch 为了能够快速检索,会把字段的值加入倒排索引中,同时它也会存储该字段的 doc values。 
Elasticsearch 中的 doc vaules 常被应用到以下场景:

- 1)对一个字段进行排序
- 2)对一个字段进行聚合
- 3)某些过滤,比如地理位置过滤
- 4) 某些与字段相关的脚本计算

Because document values ​​are serialized to disk, we can rely on the help of the operating system for fast access. When the working set is much smaller than the available memory of the node, the system will automatically save all document values ​​in the memory, making it very fast to read and write;

When it is much larger than the available memory, the operating system will automatically load the doc values ​​into the system's page cache, thus avoiding the jvm heap memory overflow exception.

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=325492903&siteId=291194637