Elasticsearch --- data synchronization, cluster

1. Data synchronization

The hotel data in elasticsearch comes from the mysql database, so when the mysql data changes, elasticsearch must change accordingly. This is the data synchronization between elasticsearch and mysql.

 

Idea analysis:

There are three common data synchronization schemes:

  • synchronous call

  • asynchronous notification

  • monitor binlog

 

1.1. Synchronous call

Solution 1: Synchronous call

The basic steps are as follows:

  • hotel-demo provides an interface to modify the data in elasticsearch

  • After the hotel management service (hotel-admin) completes the database operation, it directly calls the interface provided by hotel-demo to update the index, as sketched below
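
A minimal sketch of this synchronous scheme, assuming hotel-demo exposes an HTTP endpoint such as PUT /hotel that re-indexes a hotel; the endpoint path, mapper, and entity names below are illustrative, not taken from the original project:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

// hotel-admin side: write to mysql first, then synchronously call hotel-demo
@Service
public class HotelAdminService {

    @Autowired
    private HotelMapper hotelMapper;    // MyBatis mapper for the mysql table (assumed)

    @Autowired
    private RestTemplate restTemplate;  // plain HTTP client

    public void updateHotel(Hotel hotel) {
        // 1. update the database
        hotelMapper.updateById(hotel);
        // 2. call the interface exposed by hotel-demo so the index follows;
        //    if this call fails, mysql and elasticsearch silently diverge,
        //    which is the coupling problem discussed in the selection below
        restTemplate.put("http://hotel-demo/hotel", hotel);
    }
}
```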

 

1.2. Asynchronous notification

Solution 2: Asynchronous notification

The process is as follows:

  • Hotel-admin sends an MQ message after adding, deleting, or modifying data in the mysql database

  • Hotel-demo listens to MQ and updates the elasticsearch data after receiving the message, as sketched below
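
A hedged sketch of this scheme with RabbitMQ and Spring AMQP; the exchange, routing key, queue names, and the IHotelService methods are made up for illustration:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// hotel-admin side: after the database write succeeds, publish only the id
@Component
class HotelMqPublisher {
    @Autowired
    private RabbitTemplate rabbitTemplate;

    public void publishInsertOrUpdate(Long hotelId) {
        rabbitTemplate.convertAndSend("hotel.topic", "hotel.insert", hotelId);
    }
}

// hotel-demo side: listen on the queues and update elasticsearch
@Component
class HotelMqListener {
    @Autowired
    private IHotelService hotelService;   // service that writes documents to elasticsearch (assumed)

    @RabbitListener(queues = "hotel.insert.queue")
    public void listenInsertOrUpdate(Long id) {
        hotelService.insertById(id);      // re-query mysql by id and (re)index the document
    }

    @RabbitListener(queues = "hotel.delete.queue")
    public void listenDelete(Long id) {
        hotelService.deleteById(id);      // remove the document from the index
    }
}
```

Passing only the id keeps the message small; the listener re-reads the row from mysql, so the index always reflects the latest state of the database.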

 

1.3. Monitor binlog

Solution 3: Monitor binlog

The process is as follows:

  • Enable the binlog function for mysql

  • The addition, deletion, and modification operations of mysql will be recorded in the binlog

  • Hotel-demo monitors binlog changes based on canal, and updates the content in elasticsearch in real time
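
A rough sketch of this scheme using canal's official Java client, assuming a canal server is already running and tailing mysql's binlog; the address, destination name, and table filter are assumptions:

```java
import java.net.InetSocketAddress;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.Message;

public class BinlogSyncWorker {
    public static void main(String[] args) {
        // connect to the canal server (address and destination "example" are assumptions)
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        // only watch the hotel table (database/table names assumed)
        connector.subscribe("heima\\.tb_hotel");
        while (true) {
            Message message = connector.getWithoutAck(100);  // fetch up to 100 binlog entries
            long batchId = message.getId();
            if (batchId != -1 && !message.getEntries().isEmpty()) {
                // parse message.getEntries() into row changes here and
                // write the corresponding documents to elasticsearch
            }
            connector.ack(batchId);                          // confirm the batch has been handled
        }
    }
}
```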

 

1.4. Selection

Method 1: Synchronous call

  • Advantages: simple and straightforward to implement

  • Disadvantages: high degree of business coupling

Method 2: Asynchronous notification

  • Advantages: low coupling, moderate implementation difficulty

  • Disadvantages: depends on the reliability of mq

Method 3: Monitor binlog

  • Advantages: Complete decoupling between services

  • Disadvantages: enabling binlog adds load to the database, and the implementation complexity is high

  

 

2. Cluster

Storing data on a stand-alone elasticsearch node inevitably faces two problems: massive data storage and single point of failure.

  • Massive data storage problem: logically split the index library into N shards and store them on multiple nodes

  • Single point of failure problem: keep copies of each shard on different nodes (replica)

ES cluster related concepts:

  • Cluster (cluster): A group of nodes with a common cluster name.

  • Node (node): an elasticsearch instance in the cluster

  • Shard: an index can be split into different parts for storage, called shards. In a cluster environment, the shards of an index can be distributed across different nodes

    This solves the problem of a data volume that is too large for the storage capacity of a single node.

Here, we divide the data into 3 pieces: shard0, shard1, shard2

  • Primary shard (Primary shard): defined relative to replica shards; the shard that holds the data itself

  • Replica shard (Replica shard): each primary shard can have one or more replicas whose data is identical to the primary shard

 

Data backup can ensure high availability, but if each shard is backed up, the number of nodes required will double, and the cost is too high!

In order to find a balance between high availability and cost, we can do this:

  • First shard the data and store it in different nodes

  • Then back up each shard and put it on the other node to complete mutual backup

This can greatly reduce the number of service nodes required. As shown in the figure, we take 3 shards, each with one replica, as an example:

Now, each shard has 1 backup, stored on 3 nodes:

  • node0: holds shards 0 and 1

  • node1: holds shards 0 and 2

  • node2: holds shards 1 and 2
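
With the Java RestHighLevelClient, the layout above can be declared when the index library is created; the index name is illustrative and an already-configured client is assumed:

```java
import java.io.IOException;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class HotelIndexSetup {
    // declares 3 primary shards with 1 replica each, i.e. the layout described above
    public static void createHotelIndex(RestHighLevelClient client) throws IOException {
        CreateIndexRequest request = new CreateIndexRequest("hotel");
        request.settings(Settings.builder()
                .put("index.number_of_shards", 3)
                .put("index.number_of_replicas", 1));
        client.indices().create(request, RequestOptions.DEFAULT);
    }
}
```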

 

2.1. Cluster split-brain problem

2.1.1. Division of Cluster Responsibilities

Cluster nodes in elasticsearch have different responsibilities:

  • master eligible: can take part in master election; the elected master manages the cluster state, shard allocation, and the creation and deletion of index libraries (node.master, default true)

  • data: stores the data and performs CRUD, search, and aggregation on it (node.data, default true)

  • ingest: pre-processes documents before they are indexed (node.ingest, default true)

  • coordinating: routes requests to other nodes, merges the results returned by the data nodes, and returns them to the user (every node is always a coordinating node)

By default, any node in the cluster has all four roles at the same time.

 

But a real cluster must separate cluster responsibilities:

  • master node: high CPU requirements, but low memory requirements

  • data node: high requirements for CPU and memory

  • Coordinating node: high requirements for network bandwidth and CPU

Separation of duties allows us to allocate different hardware to each kind of node according to its needs, and avoids interference between the different workloads.

A typical es cluster responsibility division is shown in the figure:

 

2.1.2. Split-brain problem

A split-brain is caused by the disconnection of nodes in the cluster.

For example, in a cluster, the master node loses connection with other nodes:

At this time, node2 and node3 think that node1 is down and re-elect a master. Suppose node3 is elected: the cluster continues to provide services to the outside world, but node2 and node3 now form one cluster while node1 forms another, and the data of the two clusters is not synchronized, resulting in data differences.

When the network is restored, because there are two master nodes in the cluster, the status of the cluster is inconsistent, and a split-brain situation occurs:

The solution to split-brain is to require a node to win at least (number of master-eligible nodes + 1) / 2 votes to be elected master, so the number of master-eligible nodes should preferably be odd. The corresponding configuration item is discovery.zen.minimum_master_nodes; since es7.0 this quorum behavior is applied by default, so split-brain generally no longer occurs.

For example, for a cluster formed by 3 nodes, a node needs at least (3 + 1) / 2 = 2 votes. node3 gets the votes of node2 and node3 and is elected master; node1 has only its own 1 vote and is not elected. There is still only one master node in the cluster, and no split-brain occurs.

 

Summary

What is the role of the master eligible node?

  • Participate in master election

  • The master node can manage the cluster state, manage sharding information, and process requests to create and delete index libraries

What is the role of the data node?

  • CRUD of data

What is the role of the coordinator node?

  • Route requests to other nodes

  • Combine the query results and return them to the user

 

2.2. Cluster distributed storage

When a new document is added, it should be saved in different shards to ensure data balance, so how does the coordinating node determine which shard the data should be stored in?

Principle:

Elasticsearch uses a hash algorithm to calculate which shard a document should be stored in:

shard = hash(_routing) % number_of_shards

Explanation:

  • _routing defaults to the id of the document

  • The algorithm is related to the number of shards, so once the index library is created, the number of shards cannot be modified!
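
A toy illustration of that rule (elasticsearch actually hashes _routing with murmur3, so the numbers below are only meant to show why the shard count is frozen once the index library exists):

```java
public class RoutingDemo {
    public static void main(String[] args) {
        String routing = "1";  // _routing defaults to the document id
        // with 3 shards, this id always maps to the same shard
        int shard = Math.floorMod(routing.hashCode(), 3);
        // if the shard count changed to 5, the same id would map elsewhere,
        // so documents written under the old layout could no longer be located
        int shardAfterResize = Math.floorMod(routing.hashCode(), 5);
        System.out.println("3 shards -> " + shard + ", 5 shards -> " + shardAfterResize);
    }
}
```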

The process of adding new documents is as follows:

 Interpretation:

  1. Add a document with id=1

  2. Perform a hash operation on the id; if the result is 2, the document belongs to shard-2

  3. The primary shard of shard-2 is on node3, so the data is routed to node3

  4. node3 saves the document

  5. The document is synchronized to the replica of shard-2, which is on node2

  6. The result is returned to the coordinating node
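
From the client's point of view all of this routing is transparent; a single index call like the sketch below (RestHighLevelClient, with the index name and JSON body assumed) is simply sent to whichever node acts as the coordinating node:

```java
import java.io.IOException;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class AddDocumentDemo {
    public static void addDocument(RestHighLevelClient client) throws IOException {
        // the coordinating node hashes the id, routes the request to the primary
        // shard, and the primary then synchronizes the document to its replica
        IndexRequest request = new IndexRequest("hotel").id("1");
        request.source("{\"name\": \"test hotel\"}", XContentType.JSON);
        client.index(request, RequestOptions.DEFAULT);
    }
}
```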

 

2.3. Cluster distributed query

The elasticsearch query is divided into two stages:

  • scatter phase: the scatter phase, in which the coordinating node distributes the request to every shard

  • gather phase: the gather phase, in which the coordinating node collects the search results from the data nodes, merges them into the final result set, and returns it to the user
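
The scatter and gather work is likewise hidden behind a single call; for a match_all query against the 3-shard index, the coordinating node fans the request out to every shard and merges the per-shard hits (a sketch, RestHighLevelClient assumed):

```java
import java.io.IOException;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchAllDemo {
    public static void searchAll(RestHighLevelClient client) throws IOException {
        SearchRequest request = new SearchRequest("hotel");
        // scatter: the query is sent to every shard of the index
        request.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()));
        // gather: the coordinating node merges the per-shard results before returning
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        System.out.println("total hits: " + response.getHits().getTotalHits().value);
    }
}
```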

  

2.4. Cluster failover

The master node of the cluster monitors the status of the nodes in the cluster. If it finds that a node has gone down, it immediately migrates that node's shard data to other nodes to keep the data safe. This is called failover.

1) For example, a cluster structure as shown in the figure:

Now, node1 is the master node and the other two nodes are slave nodes.

 

2) Suddenly, node1 fails:

The first thing after the downtime is to re-elect a master, for example node2 is elected:

After node2 becomes the master node, it checks the cluster's shard status and finds that shard-1 and shard-0 no longer have replicas. Therefore, the data on node1 needs to be migrated to node2 and node3:
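
One way to watch a failover from the outside is to poll cluster health: while shards are being re-replicated the status is typically yellow, and it returns to green once every shard has a replica again (a sketch using the high-level client's cluster API):

```java
import java.io.IOException;

import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class ClusterHealthDemo {
    public static void printClusterHealth(RestHighLevelClient client) throws IOException {
        ClusterHealthResponse health = client.cluster()
                .health(new ClusterHealthRequest(), RequestOptions.DEFAULT);
        // GREEN: all primaries and replicas allocated
        // YELLOW: some replicas unassigned (e.g. right after a node goes down)
        // RED: some primary shards unassigned
        System.out.println(health.getStatus() + ", nodes=" + health.getNumberOfNodes());
    }
}
```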


Origin: blog.csdn.net/a1404359447/article/details/130487267