Hello everyone, I'm Kaka. Don't expect quick success; make a little progress every day.
ElasticSearch is built for search, but it can also aggregate and analyze data in real time. An aggregation runs a series of calculations over complex data and returns the summarized result we want.
Although aggregation serves a different purpose from search, it builds on the same underlying index data, so aggregations execute quickly. In other words, a single request can search, filter, and analyze the same data at the same time.
Aggregation in ElasticSearch is divided into four categories:
- Bucket Aggregation: the bucket type, which collects documents whose fields meet certain conditions
- Metric Aggregation: the metric analysis type, which performs mathematical operations on data, such as finding the largest and smallest values
- Pipeline Aggregation: the pipeline analysis type, which aggregates the results of other aggregations a second time
- Matrix Aggregation: the matrix analysis type, which operates on multiple fields and returns a result matrix
Let's start simple and look at the Bucket and Metric types. A Bucket aggregation is comparable to the GROUP BY clause in MySQL, and a Metric aggregation is comparable to the MAX and MIN functions in MySQL.
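To make the MySQL analogy concrete, here is a minimal sketch in plain Python. It does not talk to ElasticSearch; the sample rows and the helper names `bucket_by` and `max_min` are made up for illustration, and only the field names are borrowed from the flights data used later in the article.

```python
from collections import defaultdict

# Hypothetical sample rows; only the field names come from the flights index.
flights = [
    {"DestCountry": "IT", "AvgTicketPrice": 250.0},
    {"DestCountry": "IT", "AvgTicketPrice": 700.0},
    {"DestCountry": "US", "AvgTicketPrice": 480.0},
]

def bucket_by(rows, field):
    """Bucket step: group rows by a field value, like GROUP BY."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[field]].append(row)
    return dict(buckets)

def max_min(rows, field):
    """Metric step: compute max and min over one bucket, like MAX() and MIN()."""
    values = [row[field] for row in rows]
    return {"max": max(values), "min": min(values)}

buckets = bucket_by(flights, "DestCountry")
summary = {key: max_min(rows, "AvgTicketPrice") for key, rows in buckets.items()}
print(summary)
```

The bucket step decides which documents belong together, and the metric step computes numbers over each group. The nested query at the end of the article combines the two in the same way.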
1. Bucket Aggregation
Introduction
Consider data divided into three buckets by height: the first bucket counts heights below 300, the second counts heights above 600, and the third counts heights between 300 and 600. In this case, documents are placed into different buckets according to their height.
With the same mechanism you can also bucket by age, geographic location, gender, salary range, order growth, job position, and so on. Any field whose values are shared across documents can serve as a bucketing criterion.
Common bucketing strategies
- terms: bucket by term; if the field is of type text, bucket by the terms produced by analysis (word segmentation)
- range: bucket by the numeric ranges you specify
- date_range: bucket by the date ranges you specify
- histogram: bucket by a fixed numeric interval
- date_histogram: a histogram over dates, bucketed by a time interval
Terms
Bucket the flights by destination country
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"destcountry_term":{
"terms": {
"field": "DestCountry"
}
}
},
"profile": true
}
The response shows the flights grouped by destination country. Note that the terms aggregation returns only the top 10 buckets by default when its own size parameter is not set (the top-level "size": 0 merely suppresses the search hits). The sum_other_doc_count field reports how many documents fall into buckets that were not returned.
"aggregations" : {
"destcountry_term" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3187,
"buckets" : [
{
"key" : "IT",
"doc_count" : 2371
},
{
"key" : "US",
"doc_count" : 1987
},
{
"key" : "CN",
"doc_count" : 1096
},
{
"key" : "CA",
"doc_count" : 944
},
{
"key" : "JP",
"doc_count" : 774
},
{
"key" : "RU",
"doc_count" : 739
},
{
"key" : "CH",
"doc_count" : 691
},
{
"key" : "GB",
"doc_count" : 449
},
{
"key" : "AU",
"doc_count" : 416
},
{
"key" : "PL",
"doc_count" : 405
}
]
}
}
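The relationship between the bucket size limit and sum_other_doc_count can be sketched in plain Python. This is a simplified single-node model, not how ElasticSearch computes it internally (real counts are gathered per shard and can be approximate, which is what doc_count_error_upper_bound reports). The function name `terms_agg` and the sample values are hypothetical.

```python
from collections import Counter

def terms_agg(values, size=10):
    """Simplified terms aggregation: keep the `size` most frequent terms."""
    counts = Counter(values)
    top = counts.most_common(size)
    shown = sum(count for _, count in top)
    return {
        # Documents that belong to terms outside the returned buckets.
        "sum_other_doc_count": len(values) - shown,
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
    }

# Hypothetical destinations: three to IT, two to US, one to CN.
destinations = ["IT"] * 3 + ["US"] * 2 + ["CN"]
result = terms_agg(destinations, size=2)
print(result)
```

The same arithmetic holds for the response above: the ten doc_count values add up to 9872, and adding sum_other_doc_count (3187) gives 13059, which equals the total of the three range buckets in the next section (1816 + 4115 + 7128).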
Range
Suppose we want to count the flights whose average ticket price is below 300, between 300 and 600, and 600 or above. In each range, from is inclusive and to is exclusive.
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"avgticketprice_range":{
"range": {
"field": "AvgTicketPrice",
"ranges": [
{"to":300},
{"from":300,"to":600},
{"from":600}
]
}
}
}
}
The response is shown below. Each of the three buckets gets a key generated from its interval.
"aggregations" : {
"avgticketprice_range" : {
"buckets" : [
{
"key" : "*-300.0",
"to" : 300.0,
"doc_count" : 1816
},
{
"key" : "300.0-600.0",
"from" : 300.0,
"to" : 600.0,
"doc_count" : 4115
},
{
"key" : "600.0-*",
"from" : 600.0,
"doc_count" : 7128
}
]
}
}
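The bucketing rule and the generated keys can be modeled in a few lines of Python. This is only a sketch under the assumption that from is inclusive and to is exclusive; the helper names `range_key` and `range_agg` and the sample prices are invented for illustration.

```python
def range_key(frm, to):
    """Build a default key such as '*-300.0' or '300.0-600.0'."""
    left = "*" if frm is None else str(float(frm))
    right = "*" if to is None else str(float(to))
    return f"{left}-{right}"

def range_agg(values, ranges):
    """Simplified range aggregation: `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        frm, to = r.get("from"), r.get("to")
        count = sum(
            1 for v in values
            if (frm is None or v >= frm) and (to is None or v < to)
        )
        # A user-supplied key wins over the generated one.
        buckets.append({"key": r.get("key", range_key(frm, to)), "doc_count": count})
    return buckets

# Hypothetical prices chosen to sit right on the boundaries.
prices = [100.0, 299.9, 300.0, 599.9, 600.0, 850.0]
buckets = range_agg(prices, [{"to": 300}, {"from": 300, "to": 600}, {"from": 600}])
print(buckets)
```

Because the upper bound is exclusive, a price of exactly 300 lands in the middle bucket and a price of exactly 600 lands in the last one, so no document is counted twice.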
Setting keyed to true returns the buckets as an object keyed by each interval's name instead of as an array
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"avgticketprice_range":{
"range": {
"field": "AvgTicketPrice",
"keyed": true,
"ranges": [
{"to":300},
{"from":300,"to":600},
{"from":600}
]
}
}
}
}
You can compare the difference with the previous case.
"aggregations" : {
"avgticketprice_range" : {
"buckets" : {
"*-300.0" : {
"to" : 300.0,
"doc_count" : 1816
},
"300.0-600.0" : {
"from" : 300.0,
"to" : 600.0,
"doc_count" : 4115
},
"600.0-*" : {
"from" : 600.0,
"doc_count" : 7128
}
}
}
}
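If a client already has the array form and wants the keyed form, the difference between the two responses amounts to a small transformation. The sketch below uses the bucket values from the responses above; the function name `to_keyed` is hypothetical.

```python
def to_keyed(buckets):
    """Turn a list of buckets into a dict keyed by each bucket's `key`."""
    return {
        bucket["key"]: {k: v for k, v in bucket.items() if k != "key"}
        for bucket in buckets
    }

# Array form, copied from the first range response.
array_form = [
    {"key": "*-300.0", "to": 300.0, "doc_count": 1816},
    {"key": "300.0-600.0", "from": 300.0, "to": 600.0, "doc_count": 4115},
    {"key": "600.0-*", "from": 600.0, "doc_count": 7128},
]
keyed_form = to_keyed(array_form)
print(keyed_form["300.0-600.0"]["doc_count"])
```

The keyed form is convenient when the client looks buckets up by name, while the array form is easier to iterate in order.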
Of course, you can also name each interval yourself by adding a key to the range
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"avgticketprice_range":{
"range": {
"field": "AvgTicketPrice",
"keyed": true,
"ranges": [
{"key":"Below 300","to":300},
{"key":"300 to 600","from":300,"to":600},
{"key":"Above 600","from":600}
]
}
}
}
}
Response:
"aggregations" : {
"avgticketprice_range" : {
"buckets" : {
"Below 300" : {
"to" : 300.0,
"doc_count" : 1816
},
"300 to 600" : {
"from" : 300.0,
"to" : 600.0,
"doc_count" : 4115
},
"Above 600" : {
"from" : 600.0,
"doc_count" : 7128
}
}
}
}
Date Range
Bucket by the date ranges you specify, for example bucketing the timestamp field into the given time periods.
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"data_range_timestamp":{
"date_range":{
"field":"timestamp",
"format":"yyyy-MM",
"ranges":[
{"from":"2022-01","to":"2022-02"},
{"from":"2022-02","to":"2022-03"}
]
}
}
}
}
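The response is shown below. As an exercise, think about how you would give each bucket a fixed key. Also pay attention to the date format: the from and to values must match the format pattern given in the request (yyyy-MM here, or for example yyyy-MM-dd HH:mm:ss for full timestamps).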
"aggregations" : {
"data_range_timestamp" : {
"buckets" : [
{
"key" : "2022-01-2022-02",
"from" : 1.6409952E12,
"from_as_string" : "2022-01",
"to" : 1.6436736E12,
"to_as_string" : "2022-02",
"doc_count" : 9580
},
{
"key" : "2022-02-2022-03",
"from" : 1.6436736E12,
"from_as_string" : "2022-02",
"to" : 1.6460928E12,
"to_as_string" : "2022-03",
"doc_count" : 1837
}
]
}
}
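The from and to values in the response are epoch milliseconds printed in scientific notation, while from_as_string and to_as_string are the same instants rendered with the requested format. A small Python sketch can confirm this, assuming the dates are interpreted in UTC (the default when no time_zone is set). The helper name `millis_to_month` is hypothetical.

```python
from datetime import datetime, timezone

def millis_to_month(millis):
    """Render epoch milliseconds as yyyy-MM in UTC."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m")

# Values copied from the response above.
for raw in (1.6409952e12, 1.6436736e12, 1.6460928e12):
    print(raw, "->", millis_to_month(raw))
```

The three values map to 2022-01, 2022-02, and 2022-03, which matches the from_as_string and to_as_string fields.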
Histogram
A histogram aggregation divides data into buckets of a fixed interval, for example bucketing the AvgTicketPrice field at intervals of 50. The parameters used in the request:
- interval: the width of each bucket, 50 here
- min_doc_count: the minimum number of documents a bucket must contain to be returned; 0 here, so empty buckets are included
- extended_bounds: forces buckets to be generated across the given min and max even where no documents exist; it is only meaningful when min_doc_count is 0
Note that extended_bounds does not filter buckets. If extended_bounds.min is higher than the smallest value found in the documents, the documents still determine the first bucket (and the same applies to extended_bounds.max and the last bucket). To filter buckets, nest the histogram aggregation inside a range filter aggregation with the appropriate from/to settings.
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"price_histogram":{
"histogram": {
"field": "AvgTicketPrice",
"interval": 50,
"min_doc_count": 0,
"extended_bounds":{
"min":0,
"max":600
}
}
}
}
}
Response (truncated):
"aggregations" : {
"price_histogram" : {
"buckets" : [
{
"key" : 0.0,
"doc_count" : 0
},
{
"key" : 50.0,
"doc_count" : 0
},
{
"key" : 100.0,
"doc_count" : 380
},
{
"key" : 150.0,
"doc_count" : 369
},
{
"key" : 200.0,
"doc_count" : 398
}
]
}
}
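The interplay of interval, min_doc_count, and extended_bounds can be illustrated with a simplified model in Python. This is a sketch that assumes an integer interval and a single node, not the actual ElasticSearch implementation; the function name `histogram_agg` and the sample prices are invented.

```python
import math

def histogram_agg(values, interval, min_doc_count=0, extended_bounds=None):
    """Simplified histogram aggregation for an integer interval."""
    def floor_key(value):
        # A value lands in the bucket whose key is the interval multiple at or below it.
        return math.floor(value / interval) * interval

    counts = {}
    for value in values:
        key = floor_key(value)
        counts[key] = counts.get(key, 0) + 1

    if min_doc_count == 0:
        # Empty buckets are filled in between the lowest and highest key;
        # extended_bounds can only widen that span, never narrow it.
        keys = list(counts)
        if extended_bounds is not None:
            keys += [floor_key(extended_bounds["min"]), floor_key(extended_bounds["max"])]
        if keys:
            low, high = min(keys), max(keys)
            for key in range(low, high + interval, interval):
                counts.setdefault(key, 0)

    return [
        {"key": float(key), "doc_count": count}
        for key, count in sorted(counts.items())
        if count >= min_doc_count
    ]

# Hypothetical prices; 710.0 lies beyond extended_bounds.max.
prices = [120.0, 130.0, 260.0, 710.0]
buckets = histogram_agg(prices, 50, 0, {"min": 0, "max": 600})
print(buckets[0], buckets[-1], len(buckets))
```

The first bucket starts at 0 because extended_bounds.min pulls the range down, but the last bucket is 700 rather than 600, because the document priced at 710 still gets its bucket. This is the "does not filter" behavior described above.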
Date Histogram
A histogram over dates is one of the most commonly used aggregations in time series analysis, for example bucketing the timestamp field at monthly intervals
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"timestamp_data_histogram":{
"date_histogram": {
"field": "timestamp",
"calendar_interval": "month",
"min_doc_count": 0,
"format": "yyyy-MM-dd",
"extended_bounds": {
"min": "2021-10-10",
"max": "2022-01-19"
}
}
}
}
}
Response:
"aggregations" : {
"timestamp_data_histogram" : {
"buckets" : [
{
"key_as_string" : "2021-10-01",
"key" : 1633046400000,
"doc_count" : 0
},
{
"key_as_string" : "2021-11-01",
"key" : 1635724800000,
"doc_count" : 0
},
{
"key_as_string" : "2021-12-01",
"key" : 1638316800000,
"doc_count" : 1642
},
{
"key_as_string" : "2022-01-01",
"key" : 1640995200000,
"doc_count" : 9580
},
{
"key_as_string" : "2022-02-01",
"key" : 1643673600000,
"doc_count" : 1837
}
]
}
}
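Each bucket key in the response is the first instant of a month, expressed in epoch milliseconds. The rounding can be sketched in Python as follows, assuming UTC (the default when no time_zone is set); the function name `month_bucket_key` is hypothetical.

```python
from datetime import datetime, timezone

def month_bucket_key(moment):
    """Round a timezone-aware datetime down to the start of its month (UTC), in epoch millis."""
    start = moment.astimezone(timezone.utc).replace(
        day=1, hour=0, minute=0, second=0, microsecond=0
    )
    return int(start.timestamp() * 1000)

# The extended_bounds values from the request fall into these buckets.
print(month_bucket_key(datetime(2021, 10, 10, tzinfo=timezone.utc)))
print(month_bucket_key(datetime(2022, 1, 19, 13, 45, tzinfo=timezone.utc)))
```

This also explains why the first bucket in the response is 2021-10-01 even though extended_bounds.min is 2021-10-10: the bound is rounded down to the start of its month.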
2. Nested query
The five bucketing strategies above were each shown on their own. In real development a single standalone aggregation is rare; most of the time aggregations are nested.
The following request first buckets the flights by ticket price, and then computes the count, minimum, maximum, average, and sum of the prices in each bucket.
post /kibana_sample_data_flights/_search
{
"size":0,
"aggs":{
"price_range":{
"range": {
"field": "AvgTicketPrice",
"ranges": [
{"to":300},
{"from":300,"to":600},
{"from":600}
]
},
"aggs":{
"price_status":{
"stats": {
"field": "AvgTicketPrice"
}
}
}
}
}
}
Response (truncated to the first bucket):
"aggregations" : {
"price_range" : {
"buckets" : [
{
"key" : "*-300.0",
"to" : 300.0,
"doc_count" : 1816,
"price_status" : {
"count" : 1816,
"min" : 100.0205307006836,
"max" : 299.9529113769531,
"avg" : 212.5348257619379,
"sum" : 385963.2435836792
}
}
]
}
}
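The nesting can be read as "bucket first, then compute metrics inside each bucket". Here is a minimal Python sketch of that idea, again as a local model rather than a call to ElasticSearch. The function names `stats` and `range_with_stats` and the sample prices are invented; the sub-aggregation name price_status is taken from the request above.

```python
def stats(values):
    """Mimic the fields of a stats metric for a non-empty list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

def range_with_stats(values, ranges):
    """Bucket values by range, then run stats inside each non-empty bucket."""
    buckets = []
    for r in ranges:
        frm, to = r.get("from"), r.get("to")
        members = [
            v for v in values
            if (frm is None or v >= frm) and (to is None or v < to)
        ]
        bucket = {"doc_count": len(members)}
        if members:
            bucket["price_status"] = stats(members)
        buckets.append(bucket)
    return buckets

# Hypothetical prices.
prices = [100.0, 200.0, 450.0, 900.0]
result = range_with_stats(prices, [{"to": 300}, {"from": 300, "to": 600}, {"from": 600}])
print(result[0])
```

In the real response the count inside price_status equals the bucket's doc_count (1816 in both places), because the stats metric runs over exactly the documents that landed in that bucket.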
There are many more operations waiting for us to explore, but get the basics down first. Don't expect quick success; make a little progress every day.
“Persisting in learning, writing, and sharing is the belief Kaka has held throughout her career. I hope this article brings you a little help on the vast Internet. I am Kaka, see you in the next issue.”