How does Elasticsearch handle related data?

The three normal forms of relational databases

What is a normal form? A normal form is a rule for data modeling.

  • First normal form: ensure that every column is atomic.
    Every field in a database table holds an indivisible, atomic value.
  • Second normal form: ensure that every column in the table is related to the primary key.
    A table should store only one kind of data; different kinds of data must not be mixed in the same table. Order-related information, for example, is split into three tables: an orders table, an order line items table, and a products table.
  • Third normal form: ensure that every column is related to the primary key directly, not indirectly.
    For example, an orders table only needs to store userId; it does not need to store the whole user record.

The three normal forms simplify write operations, but read performance suffers (joins are expensive) and scalability is poor. A denormalized design stores redundant data inside the document, so no join is needed and reads are very fast, but it is not suitable for data that changes frequently.
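As a rough illustration of this trade-off, the sketch below models the order example with plain Python dictionaries. No database is involved, and the table contents and names (`users`, `orders`, `order_docs`, `user_name`) are invented for the illustration.

```python
# Normalized: the order row stores only user_id (third normal form).
users = {1: {"name": "Jack"}}
orders = [{"order_id": 100, "user_id": 1, "amount": 25}]

def read_order_normalized(order_id):
    """Reading needs a join: find the order, then the user it points to."""
    order = next(o for o in orders if o["order_id"] == order_id)
    user = users[order["user_id"]]
    return {**order, "user_name": user["name"]}

# Denormalized: the user's name is copied into every order document.
order_docs = {100: {"order_id": 100, "user_id": 1, "amount": 25, "user_name": "Jack"}}

def read_order_denormalized(order_id):
    """Reading is a single lookup, with no join."""
    return order_docs[order_id]

def rename_user(user_id, new_name):
    """One row changes in the normalized model; every redundant copy must be rewritten too."""
    users[user_id]["name"] = new_name
    rewritten = 0
    for doc in order_docs.values():
        if doc["user_id"] == user_id:
            doc["user_name"] = new_name
            rewritten += 1
    return rewritten  # number of denormalized documents that had to be updated
```

The more orders a user has, the more documents `rename_user` has to rewrite, which is why redundant storage fits data that is read often and changed rarely.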

Handling related data in Elasticsearch

Elasticsearch is a non-relational storage engine and follows a denormalized design. How, then, does it handle data that has relationships? There are three approaches, corresponding to three data types.

  • Object type (Object)
  • Nested type (Nested)
  • Join type (Join)

Object type (Object)

Use the Object data type to store a movie and its actors in a single document.

(1) Define the mapping

PUT /my_movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "actors": {
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

(2) Add data

PUT /my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

(3) Search

GET /my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "actors.first_name": "Keanu"
          }
        },
        {
          "match": {
            "actors.last_name": "Hopper"
          }
        }
      ]
    }
  }
}

result:

"hits" : [
  {
    "_index" : "my_movies",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.723315,
    "_source" : {
      "title" : "Speed",
      "actors" : [
        {
          "first_name" : "Keanu",
          "last_name" : "Reeves"
        },
        {
          "first_name" : "Dennis",
          "last_name" : "Hopper"
        }
      ]
    }
  }
]

We expect this search to return nothing, because no single actor is named Keanu Hopper, yet Elasticsearch returns one hit. Why? Because an array of objects is flattened into a key-value structure when it is indexed:

"title":"Speed"
"actors.first_name":["Keanu","Dennis"]
"actors.last_name":["Reeves","Hopper"]

As a result, the search does not return what we want. In other words, the Object type is not suited to handling relationships.
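The flattening can be imitated in a few lines of Python. This is only a sketch of the idea, not Elasticsearch's implementation: `flatten` and `matches_all` are hypothetical helpers, and matching is simplified to exact value comparison (reasonable here because both fields are keyword fields).

```python
def flatten(doc, prefix=""):
    """Flatten objects and arrays of objects into dotted keys holding lists of values."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_key, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_key, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

def matches_all(flat_doc, conditions):
    """bool/must over flattened fields: each condition only looks at its own value list."""
    return all(value in flat_doc.get(field, []) for field, value in conditions.items())

movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}

flat = flatten(movie)
print(flat["actors.first_name"])  # ['Keanu', 'Dennis']
print(flat["actors.last_name"])   # ['Reeves', 'Hopper']
# "Keanu" and "Hopper" come from different actors, but the link between them is gone.
print(matches_all(flat, {"actors.first_name": "Keanu", "actors.last_name": "Hopper"}))  # True
```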

Nested type (Nested)

The example above shows that, with the Object type, the objects in an array are not kept independent when the inverted index is built, which leads to inaccurate results. With the Nested data type, each object in the array is indexed independently, and a nested query returns the results we want.

(1) Define the mapping

PUT /my_movies
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "actors": {
        "type": "nested",
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

(2) Add data

PUT /my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

(3) Search

GET /my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "actors.first_name": "Keanu"
                    }
                  },
                  {
                    "match": {
                      "actors.last_name": "Hopper"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

result:

"hits" : {
  "total" : {
    "value" : 0,
    "relation" : "eq"
  },
  "max_score" : null,
  "hits" : [ ]
}
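The empty result can be reproduced with a small simulation in which every condition must be satisfied by the same object. This is a sketch of the matching semantics only; `nested_match` is a hypothetical helper and compares values exactly, as a keyword field would.

```python
def nested_match(doc, path, conditions):
    """True if at least one object under `path` satisfies every condition by itself."""
    prefix = path + "."
    for obj in doc.get(path, []):
        # Strip the "actors." prefix so the condition can be checked against one object.
        if all(obj.get(field[len(prefix):]) == value for field, value in conditions.items()):
            return True
    return False

movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}

# No single actor is "Keanu Hopper", so the document does not match.
print(nested_match(movie, "actors", {"actors.first_name": "Keanu", "actors.last_name": "Hopper"}))  # False
# "Keanu Reeves" is one actor, so the document matches.
print(nested_match(movie, "actors", {"actors.first_name": "Keanu", "actors.last_name": "Reeves"}))  # True
```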

Join type (Join)

The Nested type has a limitation: every update requires re-indexing the entire document, that is, the root object together with all of its nested objects.

Elasticsearch provides something similar to a join in a relational database: the Join data type. The Join type defines a parent-child relationship between documents so that the two objects are stored separately.

  • The parent document and its child documents are separate documents.
  • Updating the parent document does not require re-indexing its child documents.
  • Adding, updating, or deleting a child document does not affect the parent document or the other child documents.
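The difference in update behavior can be pictured with a toy in-memory store that bumps a version number every time a document is written. `ToyIndex` is a made-up stand-in, not an Elasticsearch API; it only models which documents get rewritten.

```python
class ToyIndex:
    """In-memory stand-in for an index: stores sources and a per-document version."""

    def __init__(self):
        self.sources = {}
        self.versions = {}

    def index(self, doc_id, source):
        """(Re-)index a whole document; every write bumps that document's version."""
        self.sources[doc_id] = source
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1


# Nested style: comments live inside the blog document.
nested = ToyIndex()
nested.index("blog1", {"title": "Learning Elasticsearch",
                       "comments": [{"comment": "I am learning ELK"}]})
# Changing one comment means writing the whole blog document again.
nested.index("blog1", {"title": "Learning Elasticsearch",
                       "comments": [{"comment": "Hello ELK"}]})

# Join style: blog and comment are separate documents.
joined = ToyIndex()
joined.index("blog1", {"title": "Learning Elasticsearch"})
joined.index("comment1", {"comment": "I am learning ELK", "parent": "blog1"})
# Changing the comment only rewrites the comment document.
joined.index("comment1", {"comment": "Hello ELK", "parent": "blog1"})

print(nested.versions)  # {'blog1': 2}
print(joined.versions)  # {'blog1': 1, 'comment1': 2}
```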

Let's look at an example with blogs and comments.

(1) Define the mapping

PUT /my_blogs
{
  "settings": {
    "number_of_shards": 2
  }, 
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword"
      },
      "content": {
        "type": "text"
      },
      "comment": {
        "type": "text"
      },
      "username": {
        "type": "keyword"
      },
      "blog_comments_relation" : {
        "type": "join",
        "relations": {
          "blog": "comment"
        }
      }
    }
  }
}

Note that the number of primary shards is set to 2, and that blog and comment are in a parent-child relationship (blog is the parent, comment is the child).

(2) Add data

a. Add blog data

PUT /my_blogs/_doc/blog1
{
  "title": "Learning Elasticsearch",
  "content": "learning ELK @ tyshawn",
  "blog_comments_relation": {
    "name": "blog"
  }
}

PUT /my_blogs/_doc/blog2
{
  "title": "Learning Hadoop",
  "content": "learning Hadoop @ tyshawn",
  "blog_comments_relation": {
    "name": "blog"
  }
}

blog1 and blog2 are the _id values. Note that an _id does not have to be a number.

b. Add comment data

PUT /my_blogs/_doc/comment1?routing=blog1
{
  "comment": "I am learning ELK",
  "username": "Jack",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}

PUT /my_blogs/_doc/comment2?routing=blog2
{
  "comment": "I like Hadoop!!!!!",
  "username": "Jack",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog2"
  }
}

When adding a comment, specify the routing value so that the parent and child documents are indexed on the same shard. Parent and child must live on the same shard, which also keeps join queries efficient.
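The sketch below shows why passing the parent's _id as the routing value puts a comment on the same shard as its blog. It is a simplification under stated assumptions: `zlib.crc32` stands in for the hash Elasticsearch actually uses (Murmur3), the real shard formula has additional parameters, and `shard_for` and `shard_for_doc` are hypothetical helpers.

```python
import zlib

NUMBER_OF_SHARDS = 2  # same value as number_of_shards in the my_blogs settings

def shard_for(routing, number_of_shards=NUMBER_OF_SHARDS):
    """Pick a shard from the routing value (stand-in hash, not Elasticsearch's Murmur3)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards

def shard_for_doc(doc_id, routing=None):
    """Without an explicit routing value, the document's own _id is used for routing."""
    return shard_for(routing if routing is not None else doc_id)

# The blog is routed by its own _id; the comment is routed by its parent's _id,
# so both are hashed from the same string and end up on the same shard.
print(shard_for_doc("blog1") == shard_for_doc("comment1", routing="blog1"))  # True
print(shard_for_doc("blog2") == shard_for_doc("comment2", routing="blog2"))  # True
```

Without `routing`, a comment would be hashed from its own _id and could end up on a different shard from its parent.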

(3) Query

Query types specific to the Join type:

  • parent_id
    Given the id of a parent document, returns all of its child documents.
  • has_child
    Queries the child documents and returns the parent documents that have matching children. Parent and child documents are on the same shard, so the join is efficient.
  • has_parent
    Queries the parent documents and returns the child documents of the matching parents.
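To make the three query types concrete, here is an in-memory imitation over the four documents indexed earlier. It only mirrors what each query returns; the Python functions are hypothetical, share nothing with Elasticsearch's implementation, and compare field values exactly.

```python
REL = "blog_comments_relation"

docs = {
    "blog1": {"title": "Learning Elasticsearch", REL: {"name": "blog"}},
    "blog2": {"title": "Learning Hadoop", REL: {"name": "blog"}},
    "comment1": {"username": "Jack", REL: {"name": "comment", "parent": "blog1"}},
    "comment2": {"username": "Jack", REL: {"name": "comment", "parent": "blog2"}},
}

def parent_id(child_type, parent):
    """Ids of the children of type `child_type` whose parent is `parent`."""
    return sorted(doc_id for doc_id, doc in docs.items()
                  if doc[REL]["name"] == child_type and doc[REL].get("parent") == parent)

def has_child(child_type, field, value):
    """Ids of the parents that have at least one child matching field == value."""
    return sorted({doc[REL]["parent"] for doc in docs.values()
                   if doc[REL]["name"] == child_type and doc.get(field) == value})

def has_parent(parent_type, field, value):
    """Ids of the children whose parent matches field == value."""
    parents = {doc_id for doc_id, doc in docs.items()
               if doc[REL]["name"] == parent_type and doc.get(field) == value}
    return sorted(doc_id for doc_id, doc in docs.items()
                  if doc[REL].get("parent") in parents)

print(parent_id("comment", "blog2"))                    # ['comment2']
print(has_child("comment", "username", "Jack"))         # ['blog1', 'blog2']
print(has_parent("blog", "title", "Learning Hadoop"))   # ['comment2']
```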

a. parent_id

GET /my_blogs/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "blog2"
    }
  }
}

result:

"hits" : [
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "comment2",
    "_score" : 0.6931472,
    "_routing" : "blog2",
    "_source" : {
      "comment" : "I like Hadoop!!!!!",
      "username" : "Jack",
      "blog_comments_relation" : {
        "name" : "comment",
        "parent" : "blog2"
      }
    }
  }
]

b. has_child

GET /my_blogs/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "username": "Jack"
        }
      }
    }
  }
}

result:

"hits" : [
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "blog1",
    "_score" : 1.0,
    "_source" : {
      "title" : "Learning Elasticsearch",
      "content" : "learning ELK @ tyshawn",
      "blog_comments_relation" : {
        "name" : "blog"
      }
    }
  },
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "blog2",
    "_score" : 1.0,
    "_source" : {
      "title" : "Learning Hadoop",
      "content" : "learning Hadoop @ tyshawn",
      "blog_comments_relation" : {
        "name" : "blog"
      }
    }
  }
]

c. has_parent

GET /my_blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": {
        "match": {
          "title": "Learning Hadoop"
        }
      }
    }
  }
}

result:

"hits" : [
  {
    "_index" : "my_blogs",
    "_type" : "_doc",
    "_id" : "comment2",
    "_score" : 1.0,
    "_routing" : "blog2",
    "_source" : {
      "comment" : "I like Hadoop!!!!!",
      "username" : "Jack",
      "blog_comments_relation" : {
        "name" : "comment",
        "parent" : "blog2"
      }
    }
  }
]

(4) Update a child document

Updating a child document does not affect the parent document.

POST /my_blogs/_update/comment2?routing=blog2
{
  "doc": {
    "comment": "Hello Hadoop??"
  }
}

Fetch the document by id and routing:

GET /my_blogs/_doc/comment2?routing=blog2

result:

{
  "_index" : "my_blogs",
  "_type" : "_doc",
  "_id" : "comment2",
  "_version" : 2,
  "_seq_no" : 4,
  "_primary_term" : 1,
  "_routing" : "blog2",
  "found" : true,
  "_source" : {
    "comment" : "Hello Hadoop??",
    "username" : "Jack",
    "blog_comments_relation" : {
      "name" : "comment",
      "parent" : "blog2"
    }
  }
}

Nested type vs. Join type

The Object data type is not suited to handling related data, so which scenarios suit the Nested type and the Join type? Let's compare the two.

Comparison    Nested                                        Join
Advantage     Stored in a single document, so read          Parent and child documents can be
              performance is high                           updated independently
Disadvantage  Updating a nested object requires             Extra memory is needed to maintain the
              updating the whole document                   relationship; read performance is
                                                            relatively poor
Use case      Child documents are updated occasionally;     Child documents are updated frequently
              the workload is mostly queries

Other approaches

In real projects we can also avoid the Nested and Join types altogether. One option is to map database tables to ES indexes one to one, fetch the data from ES, and resolve the relationships on the application side. Another is to join the related tables first and index the combined result into a single ES index. This is the simplest approach.
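A minimal sketch of the application-side option, assuming blogs and comments have already been fetched from two separate indexes as lists of dictionaries. The field names `id` and `blog_id` and the helper `join_in_application` are hypothetical and chosen only for this illustration.

```python
def join_in_application(blog_hits, comment_hits):
    """Attach each comment to its blog using the blog_id stored on the comment."""
    comments_by_blog = {}
    for comment in comment_hits:
        comments_by_blog.setdefault(comment["blog_id"], []).append(comment)
    # Build new dictionaries so the fetched hits themselves are left unchanged.
    return [dict(blog, comments=comments_by_blog.get(blog["id"], []))
            for blog in blog_hits]

# Stand-ins for the _source of hits returned by two separate searches.
blog_hits = [
    {"id": "blog1", "title": "Learning Elasticsearch"},
    {"id": "blog2", "title": "Learning Hadoop"},
]
comment_hits = [
    {"id": "comment1", "blog_id": "blog1", "comment": "I am learning ELK"},
    {"id": "comment2", "blog_id": "blog2", "comment": "I like Hadoop!!!!!"},
]

combined = join_in_application(blog_hits, comment_hits)
print(combined[1]["comments"][0]["comment"])  # I like Hadoop!!!!!
```

Grouping the comments in a dictionary first means each list is traversed once, instead of scanning all comments for every blog.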

Origin blog.csdn.net/litianxiang_kaola/article/details/103981462