Elasticsearch Terms Aggregation

The previous articles summarized metric aggregations; this one covers bucket aggregations. A bucket can be thought of as a container: Elasticsearch goes through the documents and places each one into the bucket whose criteria it meets, creating buckets as required.

This article focuses on the terms aggregation, which groups documents by the value of a field:

For example, if gender has the values male and female, two buckets are created, one holding the male documents and one the female documents. By default each bucket collects doc_count, that is, how many males and how many females there are, and returns it to the client, which completes a terms statistic.

Terms aggregation

{
    "aggs" : {
        "genders" : {
            "terms" : { "field" : "gender" }
        }
    }
}

The response looks like this:

{
    ...

    "aggregations" : {
        "genders" : {
            "doc_count_error_upper_bound": 0, 
            "sum_other_doc_count": 0, 
            "buckets" : [ 
                {
                    "key" : "male",
                    "doc_count" : 10
                },
                {
                    "key" : "female",
                    "doc_count" : 10
                }
            ]
        }
    }
}

Uncertainty in the results

The results of a terms aggregation may contain some deviation or error.

For example:

Suppose we want the 5 most frequent values of the name field.

The client sends the aggregation request to ES. The node that receives the request (the coordinating node) forwards it to each shard.
Each shard independently computes its own top 5 names and returns them. Once all shards have responded, the coordinating node merges the results, computes the 5 most frequent names overall, and returns them to the client.

This can introduce errors. Suppose the final top 5 contains A with 50 documents and B with 49. Because each shard stores its data independently, the distribution of terms across shards is unpredictable. One shard may hold 2 documents for B that did not make that shard's top 5, so they never appear in the merged result. B's total is therefore undercounted by 2: it should have ranked first, but ends up behind A.
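The effect can be reproduced with a small simulation. The sketch below is not how Elasticsearch is implemented; it only mimics the merge step described above, and the per-shard counts are made up for illustration. To keep the data small it asks for the top 2 instead of the top 5.

```python
from collections import Counter

def merged_top_terms(shards, size, shard_size):
    """Merge each shard's local top `shard_size` terms and keep the global top `size`."""
    merged = Counter()
    for shard in shards:
        # Each shard only reports its own most frequent terms.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)

# Hypothetical per-shard document counts for the name field.
shards = [
    {"A": 25, "B": 49, "C": 3},
    {"A": 25, "C": 4, "D": 3, "B": 2},
]

# Exact totals, computed over all documents on all shards.
exact = sum((Counter(s) for s in shards), Counter()).most_common(2)
# Totals as seen by the coordinating node when every shard returns only its top 2.
approx = merged_top_terms(shards, size=2, shard_size=2)

print(exact)   # [('B', 51), ('A', 50)]
print(approx)  # [('A', 50), ('B', 49)]
```

The second shard's 2 documents for B are cut off by its local top 2, so the merged count for B is 49 instead of 51 and A is wrongly ranked first.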

size and shard_size

To mitigate the problem above, you can use the size and shard_size parameters.

  • size specifies how many terms are returned in the final result (default 10)
  • shard_size specifies how many terms each shard returns
  • If shard_size is less than size, each shard uses the value of size instead

With these two parameters, if we want the top 5, we set size = 5 and can set shard_size to a value greater than 5. Each shard then returns more terms, and the probability of error is reduced accordingly.
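As a rough illustration, the request body for this case could be built as follows. The aggregation name top_names and the shard_size value of 50 are arbitrary choices for this sketch; name is the field from the earlier example.

```python
import json

# "top_names" is a hypothetical aggregation name; 50 is an arbitrary shard_size.
body = {
    "aggs": {
        "top_names": {
            "terms": {
                "field": "name",
                "size": 5,         # terms returned to the client
                "shard_size": 50,  # terms each shard returns for merging
            }
        }
    }
}

# Print the JSON that would be sent as the search request body.
print(json.dumps(body, indent=4))
```

A larger shard_size improves accuracy at the cost of more data sent from each shard, so the value is a trade-off rather than a fixed recommendation.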

Sort order

order specifies how the final results are sorted; by default they are sorted by doc_count in descending order.

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_count" : "asc" }
            }
        }
    }
}

You can also sort alphabetically by term (Elasticsearch 6.0 and later use _key in place of _term):

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_term" : "asc" }
            }
        }
    }
}

You can also sort by the value of a single-value metric sub-aggregation:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "avg_height" : "desc" }
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}

Multi-value metric aggregations are also supported, but you must specify which of the values to sort by:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "height_stats.avg" : "desc" }
            },
            "aggs" : {
                "height_stats" : { "stats" : { "field" : "height" } }
            }
        }
    }
}

min_doc_count and shard_min_doc_count

The aggregated field may contain many very low-frequency terms; if these make up a large share of the terms, they cause a lot of unnecessary computation.
You can therefore set min_doc_count and shard_min_doc_count to require a minimum number of documents; only terms that reach the threshold are returned.

As the names suggest:

  • min_doc_count: the threshold applied when filtering the final results
  • shard_min_doc_count: the threshold applied on each shard when it computes the terms to return
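The difference between the two thresholds can be sketched with a simplified model. This is only an illustration of where each filter applies, not Elasticsearch's actual code, and it ignores size and shard_size; the counts are invented.

```python
from collections import Counter

def terms_with_thresholds(shards, min_doc_count, shard_min_doc_count):
    """Apply the shard-level threshold first, then the threshold on merged counts."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.items():
            # Shard-level filter: terms that are too rare locally are not sent at all.
            if count >= shard_min_doc_count:
                merged[term] += count
    # Final filter: applied to the merged counts.
    return {term: count for term, count in merged.items() if count >= min_doc_count}

# Hypothetical per-shard document counts for a tags field.
shards = [
    {"sport": 8, "rare": 1},
    {"sport": 7, "rare": 1, "news": 3},
]

result = terms_with_thresholds(shards, min_doc_count=5, shard_min_doc_count=2)
print(result)  # {'sport': 15}
```

Here rare is dropped on each shard before merging, while news survives the shard-level filter but is removed from the final result because its merged count is below min_doc_count.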

script

Bucket aggregations also support scripts:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "script" : "doc['gender'].value"
            }
        }
    }
}

As well as external script files:

{
    "aggs" : {
        "genders" : {
            "terms" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "field": "gender"
                    }
                }
            }
        }
    }
}

filter

The terms aggregation can filter the values of the field in two ways: include keeps only the terms that match; exclude does the opposite.
For example:

{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"
            }
        }
    }
}

In the example above, the result contains the terms that include sport, except those that start with water_.
Arrays of exact values are also supported for defining what to include or exclude:

{
    "aggs" : {
        "JapaneseCars" : {
             "terms" : {
                 "field" : "make",
                 "include" : ["mazda", "honda"]
             }
         },
        "ActiveCarManufacturers" : {
             "terms" : {
                 "field" : "make",
                 "exclude" : ["rover", "jensen"]
             }
         }
    }
}

Multi-field aggregation

Normally, a terms aggregation works on a single field. This is because the aggregation has to put the terms into a hash table, and with multiple fields the memory consumption can grow on the order of n^2.

For multiple fields, ES offers the following two approaches:

  • Use a script to merge the fields
  • Use copy_to to combine the two fields into a new field, then run a single-field aggregation on the new field.
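A minimal sketch of the script approach is shown below. It assumes the inline script string form used in the script section of this article, and it combines the gender and name fields purely as an example; the aggregation name gender_and_name and the "|" separator are hypothetical. Newer Elasticsearch versions may expect a different script syntax.

```python
import json

# Assumption: the script joins two field values into one term, e.g. "male|Tom".
merge_script = "doc['gender'].value + '|' + doc['name'].value"

body = {
    "aggs": {
        "gender_and_name": {  # hypothetical aggregation name
            "terms": {
                "script": merge_script
            }
        }
    }
}

# Print the JSON that would be sent as the search request body.
print(json.dumps(body, indent=4))
```

Each distinct combination of the two values becomes its own bucket, which is why the number of buckets can grow quickly with this approach.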

collect mode

Sub-aggregations can be computed in two ways:

  • depth_first: compute the sub-aggregations directly
  • breadth_first: compute the results of the current aggregation first, then compute the sub-aggregations for those results.

By default, ES uses depth_first, but you can set breadth_first manually, for example:

{
    "aggs" : {
        "actors" : {
             "terms" : {
                 "field" : "actors",
                 "size" : 10,
                 "collect_mode" : "breadth_first"
             },
            "aggs" : {
                "costars" : {
                     "terms" : {
                         "field" : "actors",
                         "size" : 5
                     }
                 }
            }
         }
    }
}

Missing value

missing specifies the value to use for documents that have no value in the field:

{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "missing": "N/A" 
             }
         }
    }
}

Reproduced from: https://my.oschina.net/u/204616/blog/545040
