Split-brain problem with Elasticsearch cluster

Original address: http://blog.csdn.net/cnweike/article/details/39083089 (thanks to the original author)

 

 

The so-called split-brain problem (loosely analogous to schizophrenia) occurs when different nodes in the same cluster hold different views of the cluster's state.

 

Today our Elasticsearch cluster experienced extremely slow queries. I checked the cluster status with the following command:

curl -XGET 'es-1:9200/_cluster/health'

The overall status of the cluster was red, and although the cluster has 9 nodes, the result showed only 4. Worse, after sending the same request to different nodes, I found that while the overall status was always red, the reported number of available nodes differed from node to node.
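To see the inconsistency directly, the same health request can be sent through several nodes and the reported node counts compared. This is only a minimal sketch, assuming the nodes are reachable under placeholder hostnames es-1 through es-3 on port 9200:

# Ask several nodes for the cluster health and compare "number_of_nodes";
# under split-brain the figures returned by different nodes disagree.
for host in es-1 es-2 es-3; do
  echo "== $host =="
  curl -s -XGET "http://$host:9200/_cluster/health?pretty"
done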

 

Under normal circumstances, all nodes in the cluster should agree on which node is the master, so the state information they return should also be consistent. Inconsistent state information indicates that different nodes have elected different masters, which is exactly the split-brain problem. Such a split-brain state causes nodes to lose track of the correct cluster state, and the cluster cannot work properly.

 

Possible causes:

 

1. Network: the nodes communicate over the internal network, so network problems could make some nodes believe the master is dead and elect another one. However, this is unlikely here: the Ganglia monitoring for the cluster showed no abnormal intranet traffic, so this cause can be excluded.

2. Node load: since the master-eligible nodes and the data nodes run on the same machines, a worker node under heavy load (and ours indeed was) can make the corresponding ES instance stop responding. If that server happens to hold the master role, some nodes will conclude that the master has failed and elect a new one, and split-brain occurs. In addition, because the ES process on a data node occupies a large amount of memory, large-scale garbage collection can also make the process unresponsive. This is therefore the most likely cause; the JVM statistics can be inspected as sketched below.
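One way to check the memory-pressure hypothesis is to look at JVM heap usage and garbage-collection statistics on the data nodes. A sketch using the nodes stats API (the exact URL form may differ slightly across Elasticsearch versions; es-1 is a placeholder hostname):

# Report per-node JVM statistics; high heap usage and long or frequent
# old-generation collections point to the GC pauses suspected above.
curl -s -XGET 'http://es-1:9200/_nodes/stats/jvm?pretty'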

 

How to deal with the problem:

 

 

1. Following the analysis above, the likely cause is that the master process stopped responding under node load, which in turn led some nodes to disagree about which node is the master. An intuitive remedy is therefore to separate the master role from the data role. To that end we added three servers to the ES cluster that act only as master nodes; they neither store data nor serve searches, so they run as relatively lightweight processes. Their role is restricted with the following configuration:

 

node.master: true
node.data: false

Conversely, the other nodes should no longer be allowed to serve as master, which is done by reversing the above configuration (a combined data-node sketch follows the next snippet). In this way the master nodes are separated from the data nodes. To help the nodes locate the newly added masters quickly, the data nodes' default master discovery method can also be changed from multicast to unicast:

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]
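Putting the two pieces together, the elasticsearch.yml of a data node would then contain something like the following. This is only a sketch; master1 through master3 are the placeholder hostnames of the dedicated master nodes:

# Data node: holds data and serves searches, but is never elected master.
node.master: false
node.data: true
# Find the dedicated masters by unicast instead of multicast.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]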

 

2. Two other parameters can also mitigate the split-brain problem; a configuration sketch follows their descriptions:

discovery.zen.ping_timeout (defaults to 3 seconds): by default a node decides the master is dead if it does not respond within 3 seconds. Increasing this value gives the master more time to respond and reduces misjudgments to some extent.

discovery.zen.minimum_master_nodes (defaults to 1): this parameter controls how many master-eligible nodes a node must see before it can operate as part of the cluster. The official recommendation is (N/2)+1, where N is the number of master-eligible nodes (3 in our case, so we set this parameter to 2). Note that with only 2 master-eligible nodes a value of 2 is problematic: once one of them goes down, the remaining node can never see two master-eligible nodes and the cluster cannot form.
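For reference, a minimal sketch of these two settings in elasticsearch.yml for our cluster with 3 master-eligible nodes (the 10s timeout is an illustrative value, not one given in the original post):

# Wait longer before declaring the master dead, to ride out load spikes and GC pauses.
discovery.zen.ping_timeout: 10s
# Require a majority of the 3 master-eligible nodes before forming a cluster.
discovery.zen.minimum_master_nodes: 2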

 

The measures above only reduce the likelihood of this problem; they do not eliminate it completely, but they do help. If you have better suggestions, you are welcome to discuss them.

 

Addendum:

If ES used ZooKeeper to maintain distributed state consistency, it might behave better; we look forward to an official release with such an integration as soon as possible.
