Introduction to Elasticsearch: routing, the split-brain problem, and solutions

What is Routing

When an ES client inserts a piece of data into ES, that is, when it creates a document, ES must decide which shard of the index the document will live on. This process is called routing.

The role of routing

By default, ES performs routing based on the document Id to evenly distribute data.
The following is the default formula for routing:

shard_num = hash(_routing) % num_primary_shards

Parameter Description:

| Parameter | Meaning |
| --- | --- |
| shard_num | The number of the shard the data finally lands on |
| _routing | Defaults to the document Id (which is unique by default) |
| num_primary_shards | The number of primary shards |
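
As a quick illustration, suppose the index has 5 primary shards and hash(_routing) happens to evaluate to 23 (a made-up value for illustration):

shard_num = 23 % 5 = 3

so the document is written to primary shard number 3.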

Note:
The formula uses num_primary_shards, so the number of primary shards is fixed when the index is created and cannot be changed afterwards. If it were changed, the routing calculation would point existing documents at the wrong shards.
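
For example, a minimal sketch of fixing the number of primary shards at index creation time (the index name my_index and the counts here are arbitrary choices for illustration):

PUT my_index
{
    "settings":{
        "number_of_shards": 3,
        "number_of_replicas": 1
    }
}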

Custom routing

As mentioned above, _routing defaults to the document Id. Does that mean we can also set the routing value ourselves? Yes.
For example:
1. Create a document and specify its routing value as hello

PUT cluster/test-type/1?routing=hello
{
    "title":"Weather Forecast",
    "city":"Hangzhou",
    "count":"West Lake"
}

2. When querying this document by Id, you must also specify the same routing value:

GET cluster/test-type/1?routing=hello
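
By contrast, fetching the same document without the routing parameter makes ES compute the shard from the document Id alone, so it looks on the wrong shard and will typically report that the document is not found:

GET cluster/test-type/1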

3. A terms query can also filter on the routing value explicitly:

GET cluster/_search
{
    "query":{
        "terms":{
            "_routing":["hello"]
        }
    }
}
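
Alternatively, the search API also accepts routing as a URL parameter, which limits the search to the shards that the given routing value maps to (the match query below simply reuses the example document from step 1):

GET cluster/_search?routing=hello
{
    "query":{
        "match":{
            "title":"Weather Forecast"
        }
    }
}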

4. To force every query and insert to specify a routing value, configure the following in the mappings when the index is created:

"mapping":{
    
    
	"_routing":{
    
    
		"required":true
	}
}
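
Putting this together, a minimal sketch of creating an index that requires routing might look like the following (the index name routing_index is made up for illustration; on ES versions before 7.x, where mapping types still exist, the _routing block sits under the type name instead):

PUT routing_index
{
    "mappings":{
        "_routing":{
            "required":true
        }
    }
}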

Problems with custom routing

Custom routing easily leads to uneven data distribution: every document with the same routing value lands on the same shard, so data piles up on a few shards and becomes severely skewed.

Solving the data skew caused by custom routing

Set the index attribute:

routing_partition_size:

Once this is set, the routing formula is no longer the one above; instead it becomes:

shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards

Also pay attention to the following constraints (a creation example follows the list):

  1. 1 < routing_partition_size < number_of_shards (it must be greater than 1 and less than the number of primary shards)
  2. An index that uses it cannot contain join field mappings
  3. Once it is enabled, _routing must be marked as required in the mapping, otherwise an error is reported
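
A minimal sketch of creating such a partitioned index (the index name partitioned_index and the concrete numbers are made up; note that 1 < 3 < 5 satisfies the first constraint, and _routing is marked as required to satisfy the third):

PUT partitioned_index
{
    "settings":{
        "number_of_shards": 5,
        "routing_partition_size": 3
    },
    "mappings":{
        "_routing":{
            "required": true
        }
    }
}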

The ES split-brain problem

Split-brain problem: nodes in the cluster disagree about which node should be the master, several nodes end up competing for (or acting as) the master at the same time, and the cluster no longer agrees on which shard copies are primaries and which are replicas. Operations descend into chaos, a kind of "split personality". This is the ES split-brain problem.

Causes

  1. Memory reclamation: the ES process on a data node occupies too much memory, triggering large-scale JVM garbage collection that makes the node unresponsive (the most common cause)
  2. Node load: the master node also acts as a data node. When traffic becomes too heavy, the overloaded master responds with high latency, so the other nodes conclude that the master is down and start electing a new one (even though it has not actually failed)
  3. Network problems: network latency within the cluster prevents some nodes from reaching the master, so they assume it is down (much like the second cause: the master has not actually failed)

Solutions

1. Change the timeout-related configuration:

discovery.zen.ping.timeout

The default is 3s and can be extended appropriately, for example to 6s or 10s. If a node does not respond within this time, it is considered dead. Extending the timeout reduces, to some extent, the chance that a network delay alone triggers a split brain.
2. Change the election-related configuration:

discovery.zen.minimum_master_nodes: 1

The default is 1. It controls the minimum number of master-eligible nodes that must participate before an election takes place.
Assuming the configured value is n, a new election only starts when both of the following conditions are met at the same time:

  1. The number of candidate (master-eligible) nodes >= n
  2. The number of nodes that believe the master is down >= n

Note: the value recommended by the official ES documentation is n/2 + 1, where n is the number of master-eligible nodes, i.e. the nodes configured with node.master: true.
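
As a concrete sketch for a cluster with 3 master-eligible nodes (so n/2 + 1 = 2), the two zen discovery settings above could be set in elasticsearch.yml as follows; the 10s timeout is just one reasonable choice, and note that ES 7.x and later removed discovery.zen.minimum_master_nodes and manage the election quorum automatically:

discovery.zen.ping.timeout: 10s
# 3 master-eligible nodes: 3 / 2 + 1 = 2
discovery.zen.minimum_master_nodes: 2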

3. Try to separate roles:
Keep the master and data roles on separate nodes.
Master node configuration:

node.master: true
node.data: false

Data node configuration:

node.master: false
node.data: true
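
For completeness, a node that serves purely as a coordinating (client) node, neither master-eligible nor holding data, disables both roles:

node.master: false
node.data: false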

What if a split brain has already occurred? How do we fix it?

Restart the cluster.
First, one point to understand: after an ES cluster starts, the first node to start with node.master: true becomes the master by default.
Note here:
**Pay attention to the startup sequence.** ES treats the shard copies held by the elected node as the primary copies and replicates their data to the other nodes in the cluster, overwriting what those nodes hold. Suppose the first node we start happens to be one whose data is out of date: ES will then distribute that stale data across the cluster and overwrite the newer copies. The consequence is severe, data is lost.

Therefore it is recommended to:

  1. Re-index all of the data if possible.
  2. Start each node separately and assess how important and how up to date the data on it is.
  3. Decide which node holds the most valid data, and start that node first.
  4. Then start the remaining nodes (the _cat commands below can help verify the result).
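
Once the nodes are back up, the _cat APIs can be used to confirm which node was elected master and how the shards of an index are distributed (the index name cluster matches the earlier examples):

GET _cat/master?v
GET _cat/shards/cluster?v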

Origin blog.csdn.net/Zong_0915/article/details/107717324