A brief look at how ElasticSearch stores complex relational data


Traditional databases describe relationships between data in no more than three ways: one-to-one, one-to-many, and many-to-many. When data is related, we usually add primary and foreign keys when creating the tables to establish the connection, then use joins at query or reporting time to reassemble the data and get the result we need. So how should this kind of relational data be handled once it is moved into ElasticSearch?


ElasticSearch is a NoSQL-style database that deliberately plays down relationships. Full-text search frameworks such as lucene, es, and solr have relatively high performance requirements, and once a join-like operation is involved, performance becomes very poor. So when using a search framework, we should avoid treating the search engine as a relational database.


Of course, real-world data is always related, so how do we process and manage such related data in es?




ES supports JSON data natively. As long as the data is well-formed JSON, no matter how complicated it is or how many levels it is nested, it can be stored in ES and then queried, analyzed, and retrieved. Within this mechanism, es offers three main ways to process and manage relationships:



### First, use the object and array[object] field types to automatically store multi-level json data

This is the default mechanism of es: without defining any mapping, we can insert a complex piece of json data directly into es, and it will be indexed successfully and be searchable. (This works because es uses dynamic mapping by default, so any well-formed json structure is converted automatically. We can of course also control the mapping ourselves: besides dynamic mapping, es supports static mapping, which can be strict, weak, or general.) Take the following data as an example:
````
{
  "name" : "Zach",
  "car" : [
    {
      "make" : "Saturn",
      "model" : "SL"
    },
    {
      "make" : "Subaru",
      "model" : "Imprezza"
    }
  ]
}

````


The final converted storage structure is as follows:

````
{
  "name" : "Zach",
  "car.make" : ["Saturn", "Subaru"],
  "car.model" : ["SL", "Imprezza"]
}
````

Lucene, which underlies es, natively supports multi-valued fields, so although the structure above looks like arrays, each one is actually stored in es as a single multi-valued field.

At query time, the dotted field names (for example car.make) can be used to retrieve the corresponding content. A document like this already contains both data and a relationship: it looks like a one-to-many relationship in which one person owns several cars. Strictly speaking, though, it is not a real relationship. Because Lucene stores the data flattened, the values of the different cars are mixed together, and you cannot retrieve a single car of this person on its own. The document is one indivisible unit, so whatever the operation, the entire document is returned.
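The effect of this flattening can be imitated in a few lines of Python. This is only a rough sketch of the idea, not how es or Lucene is implemented: the helper names `flatten` and `matches_all` are made up for illustration, and single values are treated as one-element lists for simplicity.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten objects and arrays of objects into dotted multi-valued fields."""
    flat = defaultdict(list)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Merge the values of every inner object under the same dotted name.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat[sub_path].extend(sub_values)
            else:
                flat[path].append(item)
    return dict(flat)


def matches_all(flat_doc, conditions):
    """Return True if every requested field contains the requested value."""
    return all(value in flat_doc.get(field, []) for field, value in conditions.items())


doc = {
    "name": "Zach",
    "car": [
        {"make": "Saturn", "model": "SL"},
        {"make": "Subaru", "model": "Imprezza"},
    ],
}

flat = flatten(doc)
print(flat)
# Zach owns no "Saturn Imprezza", yet the flattened document still matches,
# because the link between each make and its model has been lost.
print(matches_all(flat, {"car.make": "Saturn", "car.model": "Imprezza"}))
```

The second check succeeds even though no single car has that make and model, which is the practical consequence of the values being mixed together.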




### Second, use the nested[object] type to store data with multi-level relationships

In method 1, we pointed out that objects stored in an array are not related in the strict sense, because the second-level data is not kept separate. To separate it, you must explicitly declare the field as the nested type. Only then are the car entries in the second level independent of each other, which means a particular car can be matched or queried on its own.


The same json data:
````
{
  "name" : "Zach",
  "car" : [
    {
      "make" : "Saturn",
      "model" : "SL"
    },
    {
      "make" : "Subaru",
      "model" : "Imprezza"
    }
  ]
}
````


With method 1, one document ends up stored in es. With method 2, if car is declared as nested, the document count reported by es is 3: 1 root document + 2 car documents. With the nested type, each nested object is indexed as a separate document, which is why it can be queried independently. Performance is still good, because es stores the whole structure in the same Lucene segment of one shard. The drawback is that updates are relatively expensive: the entire structure has to be reindexed every time a subdocument is updated. nested is therefore suited to multi-level relationships that are not updated frequently.
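The arithmetic behind the count of 3 can be written down as a small Python sketch. It is a simplification under stated assumptions: the helper `count_lucene_docs` is hypothetical, and it only considers nested fields at the top level of the document.

```python
def count_lucene_docs(doc, nested_fields):
    """Count the root document plus one extra document per nested object."""
    count = 1  # the root document itself
    for field in nested_fields:
        value = doc.get(field, [])
        objects = value if isinstance(value, list) else [value]
        count += len(objects)
    return count


doc = {
    "name": "Zach",
    "car": [
        {"make": "Saturn", "model": "SL"},
        {"make": "Subaru", "model": "Imprezza"},
    ],
}

print(count_lucene_docs(doc, nested_fields=[]))       # car treated as a plain object array
print(count_lucene_docs(doc, nested_fields=["car"]))  # car declared as nested
```

With no nested fields the whole structure counts as one document; declaring car as nested adds one document per car.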


Nested data must be accessed with the dedicated nested queries and aggregations. An ordinary es query only sees the first-level (root-level) attributes and cannot search nested attributes; to search them, you must use a nested query or aggregation.
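As an illustration, a nested query body for the car data might be built like this in Python. It is a sketch that assumes the car field has been mapped as nested; the function name `nested_car_query` is made up, and the body would still have to be sent to the search endpoint of your own index.

```python
import json


def nested_car_query(make, model):
    """Build a search body that matches one car with this make AND this model."""
    return {
        "query": {
            "nested": {
                "path": "car",  # the field assumed to be mapped as nested
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"car.make": make}},
                            {"match": {"car.model": model}},
                        ]
                    }
                },
            }
        }
    }


body = nested_car_query("Saturn", "SL")
print(json.dumps(body, indent=2))
```

Because both conditions sit inside the same nested clause, they have to hold for the same car rather than for any combination of values in the document.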


There are two modes of nested application:


First: nested query. Each query takes effect within a single document, including sorting.

Second: nested aggregation or filtering. It takes effect globally across all documents at the same level, including filtering and sorting.


### Third, the parent/children relationship

The parent/children mode is very similar to nested, but the two are aimed at different application scenarios.

When parent/children is used to manage the relationship, es maintains a relationship table in the memory of each shard, and related data is retrieved through the has_parent and has_child filters. In this mode the parent and child documents are also independent of each other. Query performance is slightly lower than in nested mode: routing places the parent document and its child documents in the same shard when they are inserted, but they are not guaranteed to be in the same Lucene segment. In addition, every search has to look up the relationship in the in-memory table, which also takes some time. The advantage over nested is that updating a parent or child document does not affect the other documents, so parent/children is the best fit for multi-level relationships that are updated frequently.
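The routing idea can be sketched in Python as follows. This is only an illustration of why children end up next to their parent: the shard count is hypothetical, and CRC32 is used as a stand-in for the hash function that es actually uses.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value (CRC32 is a stand-in hash, not the real one)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id, parent=None):
    """A child is routed by its parent's id; any other document by its own id."""
    return shard_for(parent if parent is not None else doc_id)


parent_shard = route("zach")
child_shards = [route(child_id, parent="zach") for child_id in ("home-ohio", "home-sc")]
print(parent_shard, child_shards)
```

Whatever ids the children have, routing by the parent id sends them to the shard that holds the parent, which is what lets the shard resolve the relationship locally.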

The mapping type of the parent document:

````
{
  "mappings":{
    "person":{
      "properties":{
        "name":{
          "type":"string"
        }
      }
    }
  }
}
````

The mapping type of the subdocument:
````
{
  "homes":{
    "_parent":{
      "type" : "person"
    },
    "properties":{
      "state" : {
        "type" : "string"
      }
    }
  }
}
````


When inserting data, you need to insert the parent document first:
````
curl -XPUT 'localhost:9200/test/person/zach/' -d'
{
  "name" : "Zach"
}'
````

Then, when inserting a child document, you need to specify its parent, which is also used for routing:
````
curl -XPOST 'localhost:9200/test/homes?parent=zach' -d'
{
  "state" : "Ohio"
}'
curl -XPOST 'localhost:9200/test/homes?parent=zach' -d'
{
  "state" : "South Carolina"
}'
````


In the end, the parent document zach is associated with two child documents, and at query time the data can be retrieved with the parent/children-specific queries.
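For example, the bodies of such queries could be built in Python roughly as below. This is a sketch that assumes the older type-based API used in the examples above, with the person and homes types; the function names are made up for illustration.

```python
import json


def parents_with_home_in(state):
    """Find person documents that have at least one homes child in the given state."""
    return {
        "query": {
            "has_child": {
                "type": "homes",
                "query": {"match": {"state": state}},
            }
        }
    }


def homes_of(person_name):
    """Find homes documents whose person parent has the given name."""
    return {
        "query": {
            "has_parent": {
                "parent_type": "person",
                "query": {"match": {"name": person_name}},
            }
        }
    }


print(json.dumps(parents_with_home_in("Ohio"), indent=2))
print(json.dumps(homes_of("Zach"), indent=2))
```

The first body returns parents selected by a condition on their children, and the second returns children selected by a condition on their parent.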




Summary:



Method 1:

(1) Simple, fast, and high performance

(2) Good at maintaining one-to-one relationships

(3) No special query is required


Method 2:

(1) Because the data is stored in the same Lucene segment, read and query performance is faster than with method 3

(2) Updating a single subdocument rebuilds the entire data structure, so it is not suitable for nested scenarios with frequent updates

(3) It can maintain one-to-many and many-to-many relationships


Method 3:

(1) The related documents are stored completely independently, although in the same shard, so read and query performance is slightly lower than with method 2

(2) Additional memory is required to maintain and manage the relationship list

(3) Updating a document does not affect the other documents, so it is suitable for scenarios with frequent updates

(4) Sorting and scoring operations are cumbersome and require additional script function support




Reference document:

https://www.elastic.co/blog/managing-relations-inside