Understanding parent-child relationships in Elasticsearch



The previous article introduced several ways of organizing related data in Elasticsearch (es): array[object], nested, and Parent-Child, which is the subject of this article.

Parent-Child is very similar to nested, and both can be used to handle one-to-many relationships. A many-to-many relationship has to be split into one-to-many relationships first. As mentioned in the previous article, the disadvantage of nested is that any update requires reindexing the entire nested structure, so it is best suited to scenarios with many queries and few updates. In update-heavy scenarios, nested may not perform well. Parent-Child suits update-heavy scenarios much better, because parent and child documents are stored independently; the only requirement is that a parent and its children live in the same shard. Nested documents, by contrast, must not only be in the same shard but also in the same block of the same segment. As a result, nested queries are faster than Parent-Child queries, while Parent-Child updates are faster than nested updates. Compared with nested mode, Parent-Child mainly has the following characteristics:


(1) The parent document can be updated without reindexing its child documents.


(2) Adding, modifying, or deleting a child document does not affect its parent or its sibling documents. Parent-Child performs especially well when the number of child documents is huge and they are added and updated frequently.


(3) Child documents can be returned in search results.


Elasticsearch maintains a mapping table of parent-child relationships in memory to speed up queries. This table is built on doc values, so when it does not fit in memory it is served from disk instead, at the cost of lower performance.




Let's look at an example. First we define the mapping:

{
  "order": 0,
  "template": "company*",
  "settings": {
    "index": {
      "number_of_replicas": "0",
      "number_of_shards": "3"
    }
  },
  "mappings": {
    "employee": {
      "_parent": {
        "type": "branch"
      }
    },
    "branch": {}
  },
  "aliases": {}
}







branch: represents a branch

employee: represents an employee


Relationship: one branch can have multiple employees



Let's start inserting data. First we insert the branch data:

POST /company/branch/_bulk
{ "index": { "_id": "london" }}
{ "name": "London Westminster", "city": "London", "country": "UK" }
{ "index": { "_id": "liverpool" }}
{ "name": "Liverpool Central", "city": "Liverpool", "country": "UK" }
{ "index": { "_id": "paris" }}
{ "name": "Champs Élysées", "city": "Paris", "country": "France" }



Note that the inserted documents have type branch, and each document's id is the lowercase city name.

When adding employee data, you must specify which document is the parent, so that the parent and child end up in the same shard.

PUT /company/employee/1?parent=london
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}



The parent id serves two purposes:

(1) It creates the link between the parent and child documents, and it ensures that the child document is stored in the same shard as its parent.

(2) It acts as the routing value. By default, es picks a shard by hashing a document's routing value modulo the number of primary shards, and the routing value defaults to the document's id. A parent document is therefore routed by its own id, while a child document is routed by the parent value we specify, which is that same id. This guarantees that parent and child land in the same shard. Consequently, every single-document operation on a child (index, get, update, delete) must supply the parent value; otherwise the request may be sent to the wrong shard and produce wrong results. Search requests are forwarded to all shards of the index, so they do not need it.
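
The routing rule described above can be sketched in a few lines of Python. This is only an illustration: CRC32 stands in for the hash function es really uses, and the names shard_for and shard_for_doc are made up for this example.

```python
import zlib
from typing import Optional

NUM_SHARDS = 3  # same value as number_of_shards in the example settings


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing value to a shard number (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shard_for_doc(doc_id: str, parent: Optional[str] = None) -> int:
    """Route a child by its parent's id, and any other document by its own id."""
    routing_value = parent if parent is not None else doc_id
    return shard_for(routing_value)


parent_shard = shard_for_doc("london")            # branch "london"
child_shard = shard_for_doc("1", parent="london")  # employee 1
print(parent_shard == child_shard)
```

Because both calls hash the same string, "london", the employee always lands in the shard of its branch, whatever the hash function is.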


Continue inserting child documents:

POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "london" }}
{ "name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving" }
{ "index": { "_id": 3, "parent": "liverpool" }}
{ "name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking" }
{ "index": { "_id": 4, "parent": "paris" }}
{ "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" }




Note: If the id of a parent document changes, you must delete all child documents under that parent, delete the parent itself, then index the new parent document and re-index the children under it. Otherwise the parent is routed by its new id while the children are still routed by the old one, so parent and children may end up in different shards and queries will return wrong results. The same applies when moving a single child to a different parent: delete the old child first, then index it again under the new parent.



Let's take a look at how to query parent-child data. There are two main query types:



(1) has_child

It uses fields of the child documents as query conditions and returns the parent documents whose children match.

A query example is as follows:
GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}
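
To make the semantics of this query concrete, here is a small in-memory sketch in Python. It does not talk to es; the has_child function and the list-of-dicts layout are assumptions for illustration, using the four employees inserted earlier.

```python
employees = [
    {"id": 1, "parent": "london", "name": "Alice Smith", "dob": "1970-10-24"},
    {"id": 2, "parent": "london", "name": "Mark Thomas", "dob": "1982-05-16"},
    {"id": 3, "parent": "liverpool", "name": "Barry Smith", "dob": "1979-04-01"},
    {"id": 4, "parent": "paris", "name": "Adrien Grand", "dob": "1987-05-11"},
]


def has_child(children, child_matches):
    """Return the ids of parents that have at least one matching child."""
    return sorted({c["parent"] for c in children if child_matches(c)})


# ISO dates compare correctly as strings, so this mirrors "gte": "1980-01-01".
branches_found = has_child(employees, lambda e: e["dob"] >= "1980-01-01")
print(branches_found)  # ['london', 'paris']
```

Only Mark Thomas and Adrien Grand were born in 1980 or later, so the branches returned are london and paris.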


The score of the parent document is computed from the scores of its matching child documents. The method is chosen with the score_mode parameter, which has 5 options:

none: ignore the scores of the child documents
avg: the average score of all child documents
min: the minimum score of all child documents
max: the maximum score of all child documents
sum: the sum of the scores of all child documents
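
The five options can be sketched as plain Python functions over a list of child scores. The scores are made up, and the constant returned for none is an assumption: es assigns parents a fixed score in that mode, but the exact value depends on the version.

```python
from statistics import mean

CONSTANT_SCORE = 1.0  # assumed placeholder for score_mode "none"

SCORE_MODES = {
    "none": lambda scores: CONSTANT_SCORE,  # child scores are ignored
    "avg": mean,
    "min": min,
    "max": max,
    "sum": sum,
}


def parent_score(child_scores, score_mode="none"):
    """Combine the scores of matching children into one parent score."""
    return float(SCORE_MODES[score_mode](child_scores))


child_scores = [0.5, 1.5, 2.5]  # made-up scores of three matching children
for mode in ("none", "avg", "min", "max", "sum"):
    print(mode, parent_score(child_scores, mode))
```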

Through the following query, we can see the impact of scores on sorting:
GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":       "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice Smith"
        }
      }
    }
  }
}


Setting score_mode to none gives faster queries, because no extra score calculation is needed.

In addition, the has_child query accepts two limit parameters, min_children and max_children, which filter parent documents by the number of matching child documents. See the example below:


GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":         "employee",
      "min_children": 2,
      "query": {
        "match_all": {}
      }
    }
  }
}



The above query returns only the parent documents that have at least two child documents matching the inner query. has_child can also be used in a filter context.
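
A minimal Python sketch of how min_children and max_children narrow the result, assuming the four employees inserted earlier all match the inner match_all query. The function name filter_by_child_count is hypothetical.

```python
from collections import Counter

# Parent id of every employee that matched the inner query (match_all here).
parents_of_matches = ["london", "london", "liverpool", "paris"]


def filter_by_child_count(parent_ids, min_children=1, max_children=None):
    """Keep parents whose number of matching children is within the limits."""
    counts = Counter(parent_ids)
    return sorted(
        parent
        for parent, n in counts.items()
        if n >= min_children and (max_children is None or n <= max_children)
    )


print(filter_by_child_count(parents_of_matches, min_children=2))  # ['london']
```

With this data only the london branch has two employees, so it is the only one that passes min_children: 2.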



(2) has_parent

The has_parent query is the opposite of has_child: it queries on fields of the parent document and returns the matching child documents.

An example is as follows:

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}


has_parent also supports score_mode, but with only two settings: none and score. No aggregation of scores is needed, because each child has exactly one parent.
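
The has_parent lookup can be pictured with the same kind of in-memory Python sketch. The has_parent function and the data layout are illustrative assumptions, not es internals.

```python
branches = {
    "london": {"name": "London Westminster", "country": "UK"},
    "liverpool": {"name": "Liverpool Central", "country": "UK"},
    "paris": {"name": "Champs Élysées", "country": "France"},
}
employees = [
    {"name": "Alice Smith", "parent": "london"},
    {"name": "Mark Thomas", "parent": "london"},
    {"name": "Barry Smith", "parent": "liverpool"},
    {"name": "Adrien Grand", "parent": "paris"},
]


def has_parent(children, parents, parent_matches):
    """Return the names of children whose single parent matches."""
    return [c["name"] for c in children if parent_matches(parents[c["parent"]])]


uk_employees = has_parent(employees, branches, lambda b: b["country"] == "UK")
print(uk_employees)  # ['Alice Smith', 'Mark Thomas', 'Barry Smith']
```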


Finally, let's look at aggregations over parent-child data. An example:

GET /company/branch/_search
{
  "size" : 0,
  "aggs": {
    "country": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "employees": {
          "children": {
            "type": "employee"
          },
          "aggs": {
            "hobby": {
              "terms": {
                "field": "hobby"
              }
            }
          }
        }
      }
    }
  }
}

The above aggregation means:

group the branches by country, then collect the employees of the branches in each group and group them by hobby.
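
The same two-level grouping, written as a Python sketch over the sample data, shows what the buckets contain. This runs in memory only; hobbies_by_country is a hypothetical helper.

```python
from collections import Counter, defaultdict

branch_country = {"london": "UK", "liverpool": "UK", "paris": "France"}
employees = [
    {"parent": "london", "hobby": "hiking"},     # Alice Smith
    {"parent": "london", "hobby": "diving"},     # Mark Thomas
    {"parent": "liverpool", "hobby": "hiking"},  # Barry Smith
    {"parent": "paris", "hobby": "horses"},      # Adrien Grand
]


def hobbies_by_country(branch_to_country, children):
    """Bucket by the parent's country, then count the children's hobbies."""
    buckets = defaultdict(Counter)
    for child in children:
        country = branch_to_country[child["parent"]]
        buckets[country][child["hobby"]] += 1
    return {country: dict(hobbies) for country, hobbies in buckets.items()}


result = hobbies_by_country(branch_country, employees)
print(result)  # {'UK': {'hiking': 2, 'diving': 1}, 'France': {'horses': 1}}
```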



Finally, the parent-child model supports multi-level relationships, such as one-to-many-to-many. The official website currently provides a three-level example. According to the community, any number of levels is supported, but the official website gives no example with more than three levels, so such usage has to be tested by the user. In practice, relationship data with more than three levels is rare.



A 3-level example of the mapping:

PUT /company
{
  "mappings": {
    "country": {},
    "branch": {
      "_parent": {
        "type": "country"
      }
    },
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}




This adds one more level, country, to the mapping. The overall relationship is:

a country can have multiple branches, and each branch can have multiple employees.



Look at the data example:


(1) Insert country data first

POST /company/country/_bulk
{ "index": { "_id": "uk" }}
{ "name": "UK" }
{ "index": { "_id": "france" }}
{ "name": "France" }


(2) Insert branch data
POST /company/branch/_bulk
{ "index": { "_id": "london", "parent": "uk" }}
{ "name": "London Westminster" }
{ "index": { "_id": "liverpool", "parent": "uk" }}
{ "name": "Liverpool Central" }
{ "index": { "_id": "paris", "parent": "france" }}
{ "name": "Champs Élysées" }


Note that the parent of each branch is its country, so the branch is routed by the country id; the branch's own id is the city name.

(3) Insert employee data
PUT /company/employee/1?parent=london&routing=uk
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}




Documents at the third level use the parent field to link to their parent document, and the routing field to make sure they land in the same shard as their parent and grandparent documents.

Note that with three or more levels, the routing value must always be the routing value of the topmost document, while parent is the document's actual direct parent. The official website gives no example of a mapping with more than three levels; interested readers can test it themselves. Multi-level parent-child relationships consume more memory and perform worse, so try to avoid them in your design. If you do need them, keep parent ids as short as possible so that they compress better in doc values and use less memory.
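
The rule "routing is the topmost ancestor, parent is the direct parent" can be sketched in Python as a walk up the parent chain. The PARENT_OF table and routing_for function are assumptions for illustration; the table also simplifies by treating ids as unique across types.

```python
from typing import Dict, Optional

# Direct parent of each document; None marks a topmost (country) document.
PARENT_OF: Dict[str, Optional[str]] = {
    "uk": None,
    "france": None,
    "london": "uk",
    "liverpool": "uk",
    "paris": "france",
    "1": "london",  # employee 1, Alice Smith
}


def routing_for(doc_id: str, parent_of: Dict[str, Optional[str]] = PARENT_OF) -> str:
    """Follow parent links up to the topmost ancestor and return its id."""
    current = doc_id
    while parent_of[current] is not None:
        current = parent_of[current]
    return current


print(routing_for("1"))       # uk
print(routing_for("london"))  # uk
```

For employee 1 this yields parent=london and routing=uk, the same pair used in the PUT request above, and the walk works unchanged for deeper hierarchies.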



Reference article:

https://discuss.elastic.co/t/would-it-be-possible-the-relation-grate-grandparent-grate-grandchild-in-elasticsearch/26875/4
