How to learn ElasticSearch aggregation

Hello everyone, I'm Kaka. My motto: don't chase quick success, just take one small step forward every day.

Although ElasticSearch is built primarily for search, it also provides aggregations for analyzing data in real time. An aggregation runs a series of calculations over complex data and returns the summary we are after.

Aggregation serves a completely different purpose from search, but it uses exactly the same data structures, so it executes very quickly. In other words, a single request can search, filter, and analyze the same data at the same time.

Aggregations in ElasticSearch fall into four categories:

  • Bucket Aggregation: groups documents into buckets, each one a set of documents whose field values meet a given condition

  • Metric Aggregation: performs mathematical operations on the data, such as finding the maximum and minimum values

  • Pipeline Aggregation: aggregates a second time over the results of other aggregations

  • Matrix Aggregation: operates on multiple fields and returns a result matrix

Let's start simple and look at the Bucket and Metric types. A Bucket aggregation behaves like GROUP BY in MySQL, and a Metric aggregation behaves like the MAX and MIN functions in MySQL.

1. Bucket Aggregation

Introduction

As the figure above shows, the data is divided into three buckets: the first holds heights below 300, the second holds heights above 600, and the third holds heights between 300 and 600. In this example the documents are sorted into different buckets according to their height.

With aggregations you can also bucket documents by age, geographic location, gender, salary range, order growth, job position, and so on. Any field whose values are shared across documents can be used for grouping.

Common bucketing strategies

  • terms: bucket by term; for a text field, bucket by the terms produced by word segmentation (analysis)

  • range: bucket by the numeric ranges you specify

  • date_range: bucket by the date ranges you specify

  • histogram: bucket numeric values at a fixed interval

  • date_histogram: a histogram over a date field

Terms

Bucket by destination

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "destcountry_term": {
      "terms": {
        "field": "DestCountry"
      }
    }
  },
  "profile": true
}

The returned result shows the flights grouped by destination country. Notice that only 10 buckets come back: when the size parameter of the terms aggregation itself is not set, ElasticSearch returns the top 10 buckets by default, and sum_other_doc_count reports how many documents fall outside those buckets.

"aggregations" : {
    "destcountry_term" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 3187,
      "buckets" : [
        {
          "key" : "IT",
          "doc_count" : 2371
        },
        {
          "key" : "US",
          "doc_count" : 1987
        },
        {
          "key" : "CN",
          "doc_count" : 1096
        },
        {
          "key" : "CA",
          "doc_count" : 944
        },
        {
          "key" : "JP",
          "doc_count" : 774
        },
        {
          "key" : "RU",
          "doc_count" : 739
        },
        {
          "key" : "CH",
          "doc_count" : 691
        },
        {
          "key" : "GB",
          "doc_count" : 449
        },
        {
          "key" : "AU",
          "doc_count" : 416
        },
        {
          "key" : "PL",
          "doc_count" : 405
        }
      ]
    }
  }
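To make the fields in this response concrete, here is a minimal Python sketch of what a terms aggregation computes. It assumes a single in-memory list instead of a sharded index; the function name terms_buckets, the sample data, and the tie-break by key are assumptions made for illustration, not ElasticSearch's implementation.

```python
from collections import Counter


def terms_buckets(values, size=10):
    """Group values by exact term and keep the `size` most frequent terms."""
    counts = Counter(values)
    # Order by document count (descending); ties are broken by key (assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    buckets = [{"key": key, "doc_count": n} for key, n in ranked[:size]]
    # Documents that belong to terms outside the returned buckets.
    other = sum(n for _, n in ranked[size:])
    return {"sum_other_doc_count": other, "buckets": buckets}


# Hypothetical destination countries, one entry per flight document.
flights = ["IT", "US", "IT", "CN", "US", "IT", "CA"]
result = terms_buckets(flights, size=2)
print(result)
```

In the real response above the same relationship holds: the ten doc_count values plus sum_other_doc_count add up to the total number of flight documents.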

Range

Suppose we want to count the flights whose average ticket price is below 300, between 300 and 600, and 600 or above.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avgticketprice_range": {
      "range": {
        "field": "AvgTicketPrice",
        "ranges": [
          {"to": 300},
          {"from": 300, "to": 600},
          {"from": 600}
        ]
      }
    }
  }
}

The returned result is as follows. Each of the three buckets gets a key derived from its interval.

"aggregations" : {
    "avgticketprice_range" : {
      "buckets" : [
        {
          "key" : "*-300.0",
          "to" : 300.0,
          "doc_count" : 1816
        },
        {
          "key" : "300.0-600.0",
          "from" : 300.0,
          "to" : 600.0,
          "doc_count" : 4115
        },
        {
          "key" : "600.0-*",
          "from" : 600.0,
          "doc_count" : 7128
        }
      ]
    }
  }
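The bucketing rule behind this response can be sketched in a few lines of Python. In a range aggregation the from value is inclusive and the to value is exclusive, so a price of exactly 300 lands in the second bucket. The function name range_buckets and the sample prices are hypothetical; this is an illustration of the rule, not ElasticSearch code.

```python
def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        count = sum(
            1
            for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        # Default key mirrors the "*-300.0" style seen in the response.
        default_key = "{}-{}".format(
            "*" if lo is None else float(lo),
            "*" if hi is None else float(hi),
        )
        bucket = {"key": r.get("key", default_key)}
        if lo is not None:
            bucket["from"] = float(lo)
        if hi is not None:
            bucket["to"] = float(hi)
        bucket["doc_count"] = count
        buckets.append(bucket)
    return buckets


# Hypothetical ticket prices, including the boundary values 300 and 600.
prices = [120.5, 299.99, 300.0, 450.0, 600.0, 950.25]
ranges = [{"to": 300}, {"from": 300, "to": 600}, {"from": 600}]
buckets = range_buckets(prices, ranges)
for b in buckets:
    print(b)
```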

Setting keyed: true returns the buckets as an object keyed by each range's name instead of as an array.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avgticketprice_range": {
      "range": {
        "field": "AvgTicketPrice",
        "keyed": true,
        "ranges": [
          {"to": 300},
          {"from": 300, "to": 600},
          {"from": 600}
        ]
      }
    }
  }
}

Compare the result with the previous one.

"aggregations" : {
    "avgticketprice_range" : {
      "buckets" : {
        "*-300.0" : {
          "to" : 300.0,
          "doc_count" : 1816
        },
        "300.0-600.0" : {
          "from" : 300.0,
          "to" : 600.0,
          "doc_count" : 4115
        },
        "600.0-*" : {
          "from" : 600.0,
          "doc_count" : 7128
        }
      }
    }
  }
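The difference between the two response shapes matters mostly on the client side: the keyed form lets you look up a bucket by name instead of scanning an array. As a small sketch, the hypothetical helper to_keyed below converts the array form into the keyed form, using the numbers from the two responses above.

```python
def to_keyed(buckets):
    """Turn a list of buckets into a dict indexed by each bucket's key."""
    keyed = {}
    for bucket in buckets:
        # Everything except "key" becomes the body of the keyed entry.
        keyed[bucket["key"]] = {k: v for k, v in bucket.items() if k != "key"}
    return keyed


array_form = [
    {"key": "*-300.0", "to": 300.0, "doc_count": 1816},
    {"key": "300.0-600.0", "from": 300.0, "to": 600.0, "doc_count": 4115},
    {"key": "600.0-*", "from": 600.0, "doc_count": 7128},
]
keyed_form = to_keyed(array_form)
# Direct lookup by bucket name.
print(keyed_form["300.0-600.0"]["doc_count"])
```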

Of course, you can also give each interval a name of your own with the key parameter.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avgticketprice_range": {
      "range": {
        "field": "AvgTicketPrice",
        "keyed": true,
        "ranges": [
          {"key": "less than 300", "to": 300},
          {"key": "between 300 and 600", "from": 300, "to": 600},
          {"key": "greater than 600", "from": 600}
        ]
      }
    }
  }
}

Returned result

"aggregations" : {
    "avgticketprice_range" : {
      "buckets" : {
        "less than 300" : {
          "to" : 300.0,
          "doc_count" : 1816
        },
        "between 300 and 600" : {
          "from" : 300.0,
          "to" : 600.0,
          "doc_count" : 4115
        },
        "greater than 600" : {
          "from" : 600.0,
          "doc_count" : 7128
        }
      }
    }
  }

Date Range

Set the bucketing rule by specifying date ranges, for example bucketing the timestamp field into the time periods you define.

POST /kibana_sample_data_flights/_search
{
  "size":0,
  "aggs":{
    "data_range_timestamp":{
      "date_range":{
        "field":"timestamp",
        "format":"yyyy-MM",
        "ranges":[
          {"from":"2022-01","to":"2022-02"},
          {"from":"2022-02","to":"2022-03"}
        ]
      }
    }
  }
}

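The returned result is below. As an exercise, think about how you would give each bucket a fixed key. Also pay attention to the date format: the format pattern must match the from and to values you pass, and a fuller pattern such as yyyy-MM-dd HH:mm:ss can be used as well.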

"aggregations" : {
    "data_range_timestamp" : {
      "buckets" : [
        {
          "key" : "2022-01-2022-02",
          "from" : 1.6409952E12,
          "from_as_string" : "2022-01",
          "to" : 1.6436736E12,
          "to_as_string" : "2022-02",
          "doc_count" : 9580
        },
        {
          "key" : "2022-02-2022-03",
          "from" : 1.6436736E12,
          "from_as_string" : "2022-02",
          "to" : 1.6460928E12,
          "to_as_string" : "2022-03",
          "doc_count" : 1837
        }
      ]
    }
  }
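One possible answer to the question above, as a sketch: like the range aggregation, date_range accepts a key on each range together with keyed. The bucket names january and february are made up for illustration. The helper month_to_epoch_millis (also hypothetical) shows where the from and to numbers in the response come from: they are epoch milliseconds, assuming the dates are interpreted in UTC.

```python
import json
from datetime import datetime, timezone


def month_to_epoch_millis(text):
    """Parse a 'yyyy-MM' string as midnight UTC on the first day of the month."""
    dt = datetime.strptime(text, "%Y-%m").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


# Request body with a fixed key per range (key names are illustrative).
body = {
    "size": 0,
    "aggs": {
        "data_range_timestamp": {
            "date_range": {
                "field": "timestamp",
                "format": "yyyy-MM",
                "keyed": True,
                "ranges": [
                    {"key": "january", "from": "2022-01", "to": "2022-02"},
                    {"key": "february", "from": "2022-02", "to": "2022-03"},
                ],
            }
        }
    },
}

print(json.dumps(body, indent=2))
print(month_to_epoch_millis("2022-01"))
```

The printed JSON can be pasted as the body of the same POST /kibana_sample_data_flights/_search request.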

Histogram

A histogram divides the data into buckets of a fixed interval, for example bucketing the AvgTicketPrice field at intervals of 50.

  • interval: the width of each bucket, 50 here

  • min_doc_count: the minimum number of documents a bucket must contain to be returned; it is 0 here, so empty buckets are returned too

  • extended_bounds: forces the histogram to cover the given min and max; this setting is meaningful only when min_doc_count is 0

In practice you will find that extended_bounds does not filter buckets. If extended_bounds.min is higher than the values extracted from the documents, the documents still dictate what the first bucket will be (and the same goes for extended_bounds.max and the last bucket). To filter buckets, nest the histogram aggregation inside a range filter aggregation with the appropriate from/to settings.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "price_histogram": {
      "histogram": {
        "field": "AvgTicketPrice",
        "interval": 50,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 0,
          "max": 600
        }
      }
    }
  }
}

Returned result (only the first buckets are shown)

"aggregations" : {
    "price_histogram" : {
      "buckets" : [
        {
          "key" : 0.0,
          "doc_count" : 0
        },
        {
          "key" : 50.0,
          "doc_count" : 0
        },
        {
          "key" : 100.0,
          "doc_count" : 380
        },
        {
          "key" : 150.0,
          "doc_count" : 369
        },
        {
          "key" : 200.0,
          "doc_count" : 398
        }
      ]
    }
  }
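The two rules described above, the bucket key formula and the fact that extended_bounds only widens the range, can be sketched in Python. Each value goes into the bucket whose key is floor(value / interval) * interval. The function name histogram_buckets and the sample prices are hypothetical, and the sketch assumes an integer interval, a non-empty input, and min_doc_count set to 0.

```python
import math


def histogram_buckets(values, interval, extended_bounds=None):
    """Fixed-interval histogram that also returns empty buckets."""
    counts = {}
    for v in values:
        # Round each value down to the start of its interval.
        key = math.floor(v / interval) * interval
        counts[key] = counts.get(key, 0) + 1

    first, last = min(counts), max(counts)
    if extended_bounds is not None:
        # Bounds can only extend the covered range, never cut buckets off.
        first = min(first, math.floor(extended_bounds["min"] / interval) * interval)
        last = max(last, math.floor(extended_bounds["max"] / interval) * interval)

    buckets = []
    key = first
    while key <= last:
        buckets.append({"key": float(key), "doc_count": counts.get(key, 0)})
        key += interval
    return buckets


# Hypothetical prices: nothing below 100, and one value above the max bound.
prices = [120.0, 130.5, 160.0, 720.0]
buckets = histogram_buckets(prices, 50, {"min": 0, "max": 600})
for b in buckets:
    print(b)
```

The empty buckets at 0 and 50 appear because of extended_bounds.min, while the bucket at 700 is still returned even though it lies beyond extended_bounds.max.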

Date histogram

A histogram over a date field is one of the most commonly used aggregations in time series analysis, for example bucketing the timestamp field at monthly intervals. On recent ElasticSearch versions the interval parameter of date_histogram is deprecated in favor of calendar_interval and fixed_interval, so you may need to write calendar_interval there instead.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "timestamp_data_histogram": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "month",
        "min_doc_count": 0,
        "format": "yyyy-MM-dd",
        "extended_bounds": {
          "min": "2021-10-10",
          "max": "2022-01-19"
        }
      }
    }
  }
}

Returned result

"aggregations" : {
    "timestamp_data_histogram" : {
      "buckets" : [
        {
          "key_as_string" : "2021-10-01",
          "key" : 1633046400000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2021-11-01",
          "key" : 1635724800000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2021-12-01",
          "key" : 1638316800000,
          "doc_count" : 1642
        },
        {
          "key_as_string" : "2022-01-01",
          "key" : 1640995200000,
          "doc_count" : 9580
        },
        {
          "key_as_string" : "2022-02-01",
          "key" : 1643673600000,
          "doc_count" : 1837
        }
      ]
    }
  }
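The keys in this response are the first instant of each calendar month, expressed as epoch milliseconds. A small Python sketch of that rounding, assuming UTC and using the hypothetical helper name month_bucket_key:

```python
from datetime import datetime, timezone


def month_bucket_key(ts):
    """Round a timezone-aware datetime down to the start of its month (epoch ms)."""
    start = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return int(start.timestamp() * 1000)


# extended_bounds.min is 2021-10-10, which falls into the 2021-10-01 bucket.
lower_bound = datetime(2021, 10, 10, tzinfo=timezone.utc)
print(month_bucket_key(lower_bound))

# Any timestamp in January 2022 maps to the 2022-01-01 bucket.
sample = datetime(2022, 1, 19, 13, 30, tzinfo=timezone.utc)
print(month_bucket_key(sample))
```

This is why the first bucket in the response is 2021-10-01 even though extended_bounds.min was given as 2021-10-10.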

2. Nested aggregations

The five bucketing strategies above were each shown on their own. In real development a single aggregation is rarely enough; in most cases aggregations are nested inside one another.

The following request first buckets the flights by ticket price range, and then computes the count, minimum, maximum, average, and sum of the prices in each bucket.

POST /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "price_range": {
      "range": {
        "field": "AvgTicketPrice",
        "ranges": [
          {"to": 300},
          {"from": 300, "to": 600},
          {"from": 600}
        ]
      },
      "aggs": {
        "price_status": {
          "stats": {
            "field": "AvgTicketPrice"
          }
        }
      }
    }
  }
}

Returned result (truncated to the first bucket)

"aggregations" : {
    "price_range" : {
      "buckets" : [
        {
          "key" : "*-300.0",
          "to" : 300.0,
          "doc_count" : 1816,
          "price_status" : {
            "count" : 1816,
            "min" : 100.0205307006836,
            "max" : 299.9529113769531,
            "avg" : 212.5348257619379,
            "sum" : 385963.2435836792
          }
        }
      ]
    }
  }
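Conceptually, a nested aggregation runs the inner metric once per bucket of the outer aggregation. The Python sketch below illustrates that idea with hypothetical names (stats, range_with_stats) and made-up prices; how an empty bucket is represented here (None for min, max, and avg) is an assumption of the sketch.

```python
def stats(values):
    """Return count, min, max, avg, and sum for a list of numbers."""
    count = len(values)
    if count == 0:
        # Empty bucket: there is nothing to take a min, max, or average of.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0.0}
    total = sum(values)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": total / count,
        "sum": total,
    }


def range_with_stats(values, ranges):
    """Bucket values by range, then compute stats inside each bucket."""
    result = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        members = [
            v for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        ]
        result.append({"doc_count": len(members), "price_status": stats(members)})
    return result


prices = [100.0, 200.0, 300.0, 500.0, 700.0]
ranges = [{"to": 300}, {"from": 300, "to": 600}, {"from": 600}]
nested = range_with_stats(prices, ranges)
for bucket in nested:
    print(bucket)
```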

There are many more operations waiting for us to explore, but let's get the basics right first. Don't chase quick success; take one small step forward every day.

Persistence in learning, writing, and sharing is the belief Kaka has held since starting out. I hope this article brings you a little help somewhere on the vast Internet. I'm Kaka, see you in the next issue.

 


Origin my.oschina.net/u/3828348/blog/5518210