Elasticsearch: An Introduction to Aggregations

The aggregations feature set is one of the most exciting and useful capabilities in all of Elasticsearch, mainly because it offers a very attractive alternative to the earlier facets.

In this tutorial, we explain Elasticsearch aggregations and walk through some examples step by step. We compare metric aggregations with bucket aggregations and show how to use nested aggregations (something that was impossible with facets). You are welcome to copy all of the sample code in this article.

A Little Background on Elasticsearch Facets

If you have ever used Elasticsearch facets, you understand how useful they are. After extensive experience with both, we are here to tell you that Elasticsearch aggregations are even better. Facets let you quickly calculate and summarize data from query results, and they can be used for all sorts of tasks, such as counting the occurrences of a result value or building a histogram. Although facets are quite powerful, their implementation in the Elasticsearch core has some limitations. Because a facet performs its calculation only one level deep, combining facets is not easy.

The aggregations API (https://www.elastic.co/guide/en/elasticsearch/client/java-api/7.4/java-aggs.html) solves these problems and also provides a simple way to perform very precise multi-level calculations at query time, in a single request. In short, Elasticsearch aggregations are a comprehensive improvement over facets.

Prepare data

To follow along with today's exercise, let's prepare some data. We want to create an index called sports. To do so, we first create the index with a mapping:

    PUT sports
    {
      "mappings": {
        "properties": {
          "birthdate": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "location": {
            "type": "geo_point"
          },
          "name": {
            "type": "keyword"
          },
          "rating": {
            "type": "integer"
          },
          "sport": {
            "type": "keyword"
          }
        }
      }
    }

Above, we defined the mapping for the sports index. Next, we import our data into the index through the bulk API.

    POST _bulk/
    {"index":{"_index":"sports"}}
    {"name":"Michael","birthdate":"1989-10-1","sport":"Baseball","rating":["5","4"],"location":"46.22,-68.45"}
    {"index":{"_index":"sports"}}
    {"name":"Bob","birthdate":"1989-11-2","sport":"Baseball","rating":["3","4"],"location":"45.21,-68.35"}
    {"index":{"_index":"sports"}}
    {"name":"Jim","birthdate":"1988-10-3","sport":"Baseball","rating":["3","2"],"location":"45.16,-63.58"}
    {"index":{"_index":"sports"}}
    {"name":"Joe","birthdate":"1992-5-20","sport":"Baseball","rating":["4","3"],"location":"45.22,-68.53"}
    {"index":{"_index":"sports"}}
    {"name":"Tim","birthdate":"1992-2-28","sport":"Baseball","rating":["3","3"],"location":"46.22,-68.85"}
    {"index":{"_index":"sports"}}
    {"name":"Alfred","birthdate":"1990-9-9","sport":"Baseball","rating":["2","2"],"location":"45.12,-68.35"}
    {"index":{"_index":"sports"}}
    {"name":"Jeff","birthdate":"1990-4-1","sport":"Baseball","rating":["2","3"],"location":"46.12,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Will","birthdate":"1988-3-1","sport":"Baseball","rating":["4","4"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Mick","birthdate":"1989-10-1","sport":"Baseball","rating":["3","4"],"location":"46.22,-68.45"}
    {"index":{"_index":"sports"}}
    {"name":"Pong","birthdate":"1989-11-2","sport":"Baseball","rating":["1","3"],"location":"45.21,-68.35"}
    {"index":{"_index":"sports"}}
    {"name":"Ray","birthdate":"1988-10-3","sport":"Baseball","rating":["2","2"],"location":"45.16,-63.58"}
    {"index":{"_index":"sports"}}
    {"name":"Ping","birthdate":"1992-5-20","sport":"Baseball","rating":["4","3"],"location":"45.22,-68.53"}
    {"index":{"_index":"sports"}}
    {"name":"Duke","birthdate":"1992-2-28","sport":"Baseball","rating":["5","2"],"location":"46.22,-68.85"}
    {"index":{"_index":"sports"}}
    {"name":"Hal","birthdate":"1990-9-9","sport":"Baseball","rating":["4","2"],"location":"45.12,-68.35"}
    {"index":{"_index":"sports"}}
    {"name":"Charge","birthdate":"1990-4-1","sport":"Baseball","rating":["3","2"],"location":"46.12,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Barry","birthdate":"1988-3-1","sport":"Baseball","rating":["5","2"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Bank","birthdate":"1988-3-1","sport":"Golf","rating":["6","4"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Bingo","birthdate":"1988-3-1","sport":"Golf","rating":["10","7"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"James","birthdate":"1988-3-1","sport":"Basketball","rating":["10","8"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Wayne","birthdate":"1988-3-1","sport":"Hockey","rating":["10","10"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Brady","birthdate":"1988-3-1","sport":"Football","rating":["10","10"],"location":"46.25,-68.55"}
    {"index":{"_index":"sports"}}
    {"name":"Lewis","birthdate":"1988-3-1","sport":"Football","rating":["10","10"],"location":"46.25,-68.55"}

The bulk request above indexes our documents into the sports index. We can check how many documents it contains with the following request:

    GET sports/_count

The result:

    {
      "count" : 22,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      }
    }

We can see that the index contains 22 documents.
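
If you would rather generate the bulk body from code than type it by hand, the layout is simple: one action line followed by one source line per document, newline-delimited. The following is only a minimal sketch in Python; the helper name build_bulk_body is hypothetical, and sending the body to a cluster is left out on purpose.

```python
import json

def build_bulk_body(index_name, docs):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline character.
    return "\n".join(lines) + "\n"

# Two sample athletes taken from the data set above.
athletes = [
    {"name": "Michael", "birthdate": "1989-10-1", "sport": "Baseball",
     "rating": ["5", "4"], "location": "46.22,-68.45"},
    {"name": "Bob", "birthdate": "1989-11-2", "sport": "Baseball",
     "rating": ["3", "4"], "location": "45.21,-68.35"},
]

body = build_bulk_body("sports", athletes)
print(body)
```

The printed text has the same shape as the lines that follow POST _bulk/ in the request above.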

Hands-on

The two main families of aggregations are metric aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-metrics.html) and bucket aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html).
Metric aggregations calculate some value (for example, an average) over a set of documents; bucket aggregations group documents into buckets. Besides these two families, there are also matrix aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-matrix.html) and pipeline aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html). Before going into detail, let's look at the general structure of an aggregation request.

Aggregation Structure

    "aggregations" : {
        "<aggregation_name>" : {
            "<aggregation_type>" : { 
                <aggregation_body>
            },
            ["aggregations" : { [<sub_aggregation>]* } ]
        }
        [,"<aggregation_name_2>" : { ... } ]*
    }

The aggregations object (you can also use aggs) in the request JSON contains the aggregation name, type, and body. <aggregation_name> is a name defined by the user (written without the angle brackets), and it uniquely identifies the aggregation's name/key in the response.

<aggregation_type> is usually the first key within an aggregation. It may be terms, stats, or geo_distance, but this is the starting point. Within our <aggregation_type> we have an <aggregation_body>. In the <aggregation_body> we specify the properties required for the aggregation. The available properties depend on the type of aggregation.

Optionally, you can provide sub-aggregations so that the results of one aggregation are nested inside the elements of another. You can also put several aggregations (aggregation_name_2) in the same query to get additional, separate top-level aggregations. Although there is no limit on the nesting depth, you cannot nest an aggregation inside a metric aggregation, for reasons explained below. After looking at the different kinds of values that can be aggregated, we will learn the difference between bucket aggregations and metric aggregations.
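
The name/type/body/sub-aggregation layout described above can also be assembled programmatically. The sketch below uses Python dictionaries; make_agg is a hypothetical helper, not part of any Elasticsearch client, and the names the_name and rating_avg are simply the ones used in the example query of this article.

```python
import json

def make_agg(agg_type, body, sub_aggs=None):
    """Build one aggregation: {<aggregation_type>: <aggregation_body>, "aggregations": {...}}."""
    agg = {agg_type: body}
    if sub_aggs:
        # Sub-aggregations sit next to the type key, not inside the body.
        agg["aggregations"] = sub_aggs
    return agg

request = {
    "size": 0,
    "aggregations": {
        # "the_name" is the user-defined <aggregation_name>.
        "the_name": make_agg(
            "terms",
            {"field": "name", "order": {"rating_avg": "desc"}},
            sub_aggs={"rating_avg": make_agg("avg", {"field": "rating"})},
        )
    },
}

print(json.dumps(request, indent=2))
```

Note that the avg aggregation is a leaf here: because it is a metric aggregation, it is given no sub-aggregations of its own.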

Example

Some aggregations work on values derived from the aggregated documents. These values can be taken from a specified document field, or they can be produced by a script that generates a value for each document. The following example runs a terms aggregation on the name field, ordered by the value of the rating_avg sub-aggregation. As you can see, we use a nested metric aggregation to sort the results of a bucket aggregation.

Although we use the index created above, we encourage you to run this query (and the ones that follow) yourself. You get a working result right away and can then adapt the query to your own data set.

Also note that we include "size": 0, because we are interested in the aggregation results rather than in the matching documents. Setting it to 0 means we do not want any documents returned.

    GET sports/_search
    {
      "size": 0, 
      "aggregations": {
        "the_name": {
          "terms": {
            "field": "name",
            "order": {
              "rating_avg": "desc"
            }
          },
          "aggregations": {
            "rating_avg": {
              "avg": {
                "field": "rating"
              }
            }
          }
        }
      }
    }

The results are as follows:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "the_name" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 12,
          "buckets" : [
            {
              "key" : "Brady",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 10.0
              }
            },
            {
              "key" : "Lewis",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 10.0
              }
            },
            {
              "key" : "Wayne",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 10.0
              }
            },
            {
              "key" : "James",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 9.0
              }
            },
            {
              "key" : "Bingo",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 8.5
              }
            },
            {
              "key" : "Bank",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 5.0
              }
            },
            {
              "key" : "Michael",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 4.5
              }
            },
            {
              "key" : "Will",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 4.0
              }
            },
            {
              "key" : "Barry",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 3.5
              }
            },
            {
              "key" : "Bob",
              "doc_count" : 1,
              "rating_avg" : {
                "value" : 3.5
              }
            }
          ]
        }
      }
    }

The results show one bucket per person, sorted in descending order by the average rating computed by the rating_avg sub-aggregation. Only the first ten buckets are returned, which is why sum_other_doc_count is 12.
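
To check the ordering by hand, the same grouping and sorting can be reproduced outside Elasticsearch. This is only an illustrative sketch in plain Python over the sample ratings; breaking ties by name in ascending order is an assumption that happens to match the response shown above.

```python
from collections import defaultdict

# (name, ratings) pairs copied from the bulk request above.
ATHLETES = [
    ("Michael", [5, 4]), ("Bob", [3, 4]), ("Jim", [3, 2]), ("Joe", [4, 3]),
    ("Tim", [3, 3]), ("Alfred", [2, 2]), ("Jeff", [2, 3]), ("Will", [4, 4]),
    ("Mick", [3, 4]), ("Pong", [1, 3]), ("Ray", [2, 2]), ("Ping", [4, 3]),
    ("Duke", [5, 2]), ("Hal", [4, 2]), ("Charge", [3, 2]), ("Barry", [5, 2]),
    ("Bank", [6, 4]), ("Bingo", [10, 7]), ("James", [10, 8]),
    ("Wayne", [10, 10]), ("Brady", [10, 10]), ("Lewis", [10, 10]),
]

def top_names_by_avg_rating(athletes, size=10):
    # Group all rating values by name (one bucket per distinct name).
    ratings_by_name = defaultdict(list)
    for name, ratings in athletes:
        ratings_by_name[name].extend(ratings)
    # The metric for each bucket is the mean of its rating values.
    buckets = [(name, sum(r) / len(r)) for name, r in ratings_by_name.items()]
    # Sort by average descending; break ties by name ascending (assumed).
    buckets.sort(key=lambda b: (-b[1], b[0]))
    return buckets[:size]

for name, avg in top_names_by_avg_rating(ATHLETES):
    print(f"{name}: {avg}")
```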

We can also provide a script that generates the values used in the aggregation:

    GET sports/_search
    {
      "size": 0,
      "aggs": {
        "age_range": {
          "range": {
            "script": {
              "source": 
                """
                ZonedDateTime dob = doc['birthdate'].value;
                return params.now - dob.getYear()
                """
                ,
              "params": {
                "now": 2019
              }
            },
            "ranges": [
              {
                "from": 30,
                "to": 31
              }
            ]
          }
        }
      }
    }

Above, the script in source generates the value that the range aggregation buckets on: the athlete's age, computed as params.now minus the birth year.

The results:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "age_range" : {
          "buckets" : [
            {
              "key" : "30.0-31.0",
              "from" : 30.0,
              "to" : 31.0,
              "doc_count" : 4
            }
          ]
        }
      }
    }

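As shown above, 4 athletes fall into the 30 to 31 age range. Because from is inclusive and to is exclusive, these are the athletes whose computed age is exactly 30, that is, those born in 1989.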
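
The count is easy to verify by hand. The sketch below mirrors the script and the range bucket in plain Python; the function names are hypothetical, and the year is extracted by splitting the date string rather than by real date parsing.

```python
# Birthdates copied from the bulk request above, in document order.
BIRTHDATES = [
    "1989-10-1", "1989-11-2", "1988-10-3", "1992-5-20",
    "1992-2-28", "1990-9-9", "1990-4-1", "1988-3-1",
    "1989-10-1", "1989-11-2", "1988-10-3", "1992-5-20",
    "1992-2-28", "1990-9-9", "1990-4-1", "1988-3-1",
] + ["1988-3-1"] * 6

def age_from_script(birthdate, now=2019):
    # Mirrors the script: params.now - dob.getYear()
    return now - int(birthdate.split("-")[0])

def range_bucket_count(values, lower, upper):
    # A range bucket includes its "from" value and excludes its "to" value.
    return sum(1 for v in values if lower <= v < upper)

ages = [age_from_script(b) for b in BIRTHDATES]
print(range_bucket_count(ages, 30, 31))
```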

Metric Aggregations

Metric aggregations compute metrics over a whole set of documents. There are single-value metric aggregations (such as avg) and multi-value metric aggregations (such as stats). A simple example of a metric aggregation is value_count, which just returns the total number of values indexed for a given field. To find the number of values in the "sport" field of the athletes data set, we can use the following query:

    GET sports/_search
    {
      "size": 0,
      "aggs": {
        "sport_count": {
          "value_count": {
            "field": "sport"
          }
        }
      }
    }

The result:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "sport_count" : {
          "value" : 22
        }
      }
    }

Note that this returns the total number of values for the field, not the number of unique values. In this case, since every document has exactly one value in the "sport" field, the result is simply equal to the number of documents in the index.
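
The difference between counting values and counting distinct values is shown in the small Python sketch below, using the sport values from the sample data. In Elasticsearch itself, a distinct count is what the cardinality aggregation provides (as an approximation); the set used here is exact and serves only as an illustration.

```python
# One "sport" value per document, as in the bulk request above.
SPORTS = ["Baseball"] * 16 + ["Golf"] * 2 + ["Basketball", "Hockey"] + ["Football"] * 2

# What value_count reports: every indexed value is counted.
total_values = len(SPORTS)

# The number of unique values, which value_count does NOT report.
unique_values = len(set(SPORTS))

print(total_values, unique_values)
```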

Bucket Aggregations

Bucket aggregations are a mechanism for grouping documents. Each type of bucket aggregation has its own way of partitioning the document set. Perhaps the simplest type is the terms aggregation. It works very much like a terms facet, returning the unique terms indexed for a given field together with the number of matching documents. If we want to find all the values of the "sport" field in the data set, we can use the following query:

    GET sports/_search
    {
      "size": 0,
      "aggs": {
        "sport": {
          "terms": {
            "field": "sport",
            "size": 10
          }
        }
      }
    }

The response:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "sport" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "Baseball",
              "doc_count" : 16
            },
            {
              "key" : "Football",
              "doc_count" : 2
            },
            {
              "key" : "Golf",
              "doc_count" : 2
            },
            {
              "key" : "Basketball",
              "doc_count" : 1
            },
            {
              "key" : "Hockey",
              "doc_count" : 1
            }
          ]
        }
      }
    }
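
Conceptually, a terms aggregation is a group-and-count over the field values. The following Python sketch imitates it for the sample data; terms_agg is a hypothetical function, and the ordering (document count descending, then key ascending) is an assumption chosen to match the response above.

```python
from collections import Counter

# One "sport" value per document, as in the bulk request above.
SPORTS = ["Baseball"] * 16 + ["Golf"] * 2 + ["Basketball", "Hockey"] + ["Football"] * 2

def terms_agg(values, size=10):
    counts = Counter(values)
    # Largest buckets first; ties ordered by key (assumed tie-breaker).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top, rest = ordered[:size], ordered[size:]
    return {
        # Documents that fell into buckets beyond the requested size.
        "sum_other_doc_count": sum(count for _, count in rest),
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
    }

result = terms_agg(SPORTS, size=10)
for bucket in result["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```

With a smaller size, such as terms_agg(SPORTS, size=2), the documents of the omitted buckets are reported in sum_other_doc_count instead.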

You may find the geo_distance aggregation more compelling. Although it has many options, in the simplest case it takes an origin and a distance range, and then counts how many documents fall within that ring according to a given geo_point field.

Suppose we need to know how many athletes live within 20 miles of the location "46.12,-68.55". We can use the following aggregation:

    GET sports/_search
    {
      "size": 0,
      "aggregations": {
        "baseball_player_ring": {
          "geo_distance": {
            "field": "location",
            "origin": "46.12,-68.55",
            "unit": "mi",
            "ranges": [
              {
                "from": 0,
                "to": 20
              }
            ]
          }
        }
      }
    }

The response:

    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "baseball_player_ring" : {
          "buckets" : [
            {
              "key" : "*-20.0",
              "from" : 0.0,
              "to" : 20.0,
              "doc_count" : 14
            }
          ]
        }
      }
    }
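
The count of 14 can be cross-checked with a great-circle distance calculation. The sketch below uses the haversine formula with a mean Earth radius of 3958.8 miles; both choices are assumptions made for illustration and are not necessarily the exact computation Elasticsearch performs, but the sample points lie far enough from the 20-mile boundary that the result is the same.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed value)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# (lat, lon) -> number of athletes at that location in the bulk request above.
LOCATIONS = {
    (46.22, -68.45): 2, (45.21, -68.35): 2, (45.16, -63.58): 2, (45.22, -68.53): 2,
    (46.22, -68.85): 2, (45.12, -68.35): 2, (46.12, -68.55): 2, (46.25, -68.55): 8,
}

ORIGIN = (46.12, -68.55)

# Count the athletes whose location lies in the ring [0, 20) miles from the origin.
within_ring = sum(
    count
    for (lat, lon), count in LOCATIONS.items()
    if 0 <= haversine_miles(ORIGIN[0], ORIGIN[1], lat, lon) < 20
)
print(within_ring)
```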

Nested Bucket Aggregations

Many developers would agree that the most powerful aspect of bucket aggregations is their ability to nest. You can define a top-level bucket aggregation and, inside it, define a second-level aggregation that operates on the contents of each bucket. This nesting can be extended to as many levels as needed.

Continuing with our example, we can use a nested range aggregation on age (computed by a script from the birthdate field) to further subdivide the results of the geo_distance aggregation. Suppose we want to know how many of the athletes living within the ring just defined fall into each of two age groups. We can use the following aggregation to obtain this information:

    GET sports/_search
    {
       "size": 0,
       "aggregations": {
          "baseball_player_ring": {
             "geo_distance": {
                "field": "location",
                "origin": "46.12,-68.55",
                "unit": "mi",
                "ranges": [
                   {
                      "from": 0,
                      "to": 20
                   }
                ]
             },
             "aggregations": {
                "ring_age_ranges": {
                   "range": {
                     "script": {
                        "source": 
                        """
                        ZonedDateTime dob = doc['birthdate'].value;
                        return params.now - dob.getYear()
                        """
                      ,
                      "params": {
                        "now": 2019
                      }                 
                     }, 
                      "ranges": [
                          { "from": 30, "to": 31 },
                          { "from": 31, "to": 32 }
                      ]
                   }
                }
             }
          }
       }
    }

The results are as follows:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "baseball_player_ring" : {
          "buckets" : [
            {
              "key" : "*-20.0",
              "from" : 0.0,
              "to" : 20.0,
              "doc_count" : 14,
              "ring_age_ranges" : {
                "buckets" : [
                  {
                    "key" : "30.0-31.0",
                    "from" : 30.0,
                    "to" : 31.0,
                    "doc_count" : 2
                  },
                  {
                    "key" : "31.0-32.0",
                    "from" : 31.0,
                    "to" : 32.0,
                    "doc_count" : 8
                  }
                ]
              }
            }
          ]
        }
      }
    }

Now let's use stats (a multi-value metric aggregation) to calculate some statistics on the innermost results. For the athletes living within our ring, in each of the two age groups, we now want statistics on the "rating" field of the matching documents:

    GET sports/_search
    {
       "size": 0,
       "aggregations": {
          "baseball_player_ring": {
             "geo_distance": {
                "field": "location",
                "origin": "46.12,-68.55",
                "unit": "mi",
                "ranges": [
                   {
                      "from": 0,
                      "to": 20
                   }
                ]
             },
             "aggregations": {
                "ring_age_ranges": {
                   "range": {
                     "script": {
                        "source": 
                        """
                        ZonedDateTime dob = doc['birthdate'].value;
                        return params.now - dob.getYear()
                        """
                      ,
                      "params": {
                        "now": 2019
                      }                 
                     }, 
                      "ranges": [
                          { "from": 30, "to": 31 },
                          { "from": 31, "to": 32 }
                      ]
                   },
                  "aggregations": {
                    "rating_stats": {
                      "stats": {
                          "field": "rating"
                        }
                    }
                  }
                }
             }
          }
       }
    }

The response contains the statistics we need:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 22,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "baseball_player_ring" : {
          "buckets" : [
            {
              "key" : "*-20.0",
              "from" : 0.0,
              "to" : 20.0,
              "doc_count" : 14,
              "ring_age_ranges" : {
                "buckets" : [
                  {
                    "key" : "30.0-31.0",
                    "from" : 30.0,
                    "to" : 31.0,
                    "doc_count" : 2,
                    "rating_stats" : {
                      "count" : 4,
                      "min" : 3.0,
                      "max" : 5.0,
                      "avg" : 4.0,
                      "sum" : 16.0
                    }
                  },
                  {
                    "key" : "31.0-32.0",
                    "from" : 31.0,
                    "to" : 32.0,
                    "doc_count" : 8,
                    "rating_stats" : {
                      "count" : 16,
                      "min" : 2.0,
                      "max" : 10.0,
                      "avg" : 7.5,
                      "sum" : 120.0
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
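
The three nesting levels (ring, then age group, then rating statistics) can be traced by hand with the Python sketch below. It is an illustration under stated assumptions, not how Elasticsearch computes the result: distances use the haversine formula with a 3958.8-mile Earth radius, ages are 2019 minus the birth year as in the script, and the function names are hypothetical. The statistics count rating values rather than documents, which is why two documents produce a count of 4.

```python
import math

# (name, birth year, ratings, lat, lon) copied from the bulk request above.
ATHLETES = [
    ("Michael", 1989, [5, 4], 46.22, -68.45), ("Bob", 1989, [3, 4], 45.21, -68.35),
    ("Jim", 1988, [3, 2], 45.16, -63.58), ("Joe", 1992, [4, 3], 45.22, -68.53),
    ("Tim", 1992, [3, 3], 46.22, -68.85), ("Alfred", 1990, [2, 2], 45.12, -68.35),
    ("Jeff", 1990, [2, 3], 46.12, -68.55), ("Will", 1988, [4, 4], 46.25, -68.55),
    ("Mick", 1989, [3, 4], 46.22, -68.45), ("Pong", 1989, [1, 3], 45.21, -68.35),
    ("Ray", 1988, [2, 2], 45.16, -63.58), ("Ping", 1992, [4, 3], 45.22, -68.53),
    ("Duke", 1992, [5, 2], 46.22, -68.85), ("Hal", 1990, [4, 2], 45.12, -68.35),
    ("Charge", 1990, [3, 2], 46.12, -68.55), ("Barry", 1988, [5, 2], 46.25, -68.55),
    ("Bank", 1988, [6, 4], 46.25, -68.55), ("Bingo", 1988, [10, 7], 46.25, -68.55),
    ("James", 1988, [10, 8], 46.25, -68.55), ("Wayne", 1988, [10, 10], 46.25, -68.55),
    ("Brady", 1988, [10, 10], 46.25, -68.55), ("Lewis", 1988, [10, 10], 46.25, -68.55),
]

def haversine_miles(lat1, lon1, lat2, lon2, radius=3958.8):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def stats(values):
    # Multi-value metric: several numbers computed over the same set of values.
    return {"count": len(values), "min": min(values), "max": max(values),
            "avg": sum(values) / len(values), "sum": sum(values)}

def ring_age_stats(athletes, origin, radius_miles, age_ranges, now=2019):
    # Level 1: keep only the athletes inside the ring.
    in_ring = [a for a in athletes
               if haversine_miles(origin[0], origin[1], a[3], a[4]) < radius_miles]
    result = {"doc_count": len(in_ring), "buckets": []}
    for lower, upper in age_ranges:
        # Level 2: split the ring by age (from inclusive, to exclusive).
        docs = [a for a in in_ring if lower <= now - a[1] < upper]
        # Level 3: statistics over every rating value in the bucket.
        ratings = [r for a in docs for r in a[2]]
        result["buckets"].append({
            "key": f"{lower}-{upper}",
            "doc_count": len(docs),
            "rating_stats": stats(ratings) if ratings else None,
        })
    return result

nested = ring_age_stats(ATHLETES, (46.12, -68.55), 20, [(30, 31), (31, 32)])
for bucket in nested["buckets"]:
    print(bucket)
```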

As you can see, you can create buckets inside other buckets, and you can compute metrics for each bucket at any level of nesting. With these simple building blocks, nested aggregations let you draw deep and sophisticated insights from your data.

Origin www.cnblogs.com/sanduzxcvbnm/p/12090671.html