A detailed explanation of the split-brain problem in disaster recovery architectures

[Abstract] In a disaster recovery architecture, split brain is a catastrophic event. This article covers the priority solution, the arbitration solution, and arbitration conflicts in detail, which should help in understanding the related scenarios and solving the related problems. (For the latest parameters of the technical products mentioned in this article, please refer to the latest release on the official website.)

1. What is the split brain problem in disaster recovery?

Split brain literally means "brain division": one "brain" splits into two or more "brains" for some reason. If a person had several brains that acted independently of one another, problems would be inevitable.

When designing a disaster recovery architecture, we often use high-availability architectures such as HA and Cluster, and generally use a cross-regional L2 network so that a given functional layer forms a single cluster spanning data centers, for example a database cluster or a storage gateway cluster. If a communication failure between the nodes in the two data centers leaves two clusters working independently of each other, that is a split brain, as shown in the figure below.

The first question: why can a cluster produce a split brain?

To answer this question we need to go back to the cluster's arbitration mechanism. Generally speaking, the cluster's arbitration algorithm decides which node is the leader based on the amount of arbitration resources each node can obtain. These arbitration resources are essentially the heartbeat information at the network level and the disk heartbeat resources on shared storage. In a common node-level failure, the failed node can obtain fewer arbitration resources than the other nodes, so no split-brain problem occurs. But in one special case, a failure of the network between the two data centers, the arbitration resources available to the two nodes are the same, so arbitration cannot single out one leader and a split brain can result.
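The resource-counting idea can be illustrated with a minimal Python sketch. It is not the algorithm of any particular cluster product: the NodeView class, the arbitrate function, and the scoring rule (simply adding up reachable network heartbeats and heartbeat disks) are assumptions made for illustration, as is the layout of one node and one heartbeat disk per data center.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Arbitration resources one node can currently reach (hypothetical model)."""
    name: str
    network_heartbeats: int  # peers whose network heartbeat is still received
    disk_heartbeats: int     # shared-storage heartbeat disks still reachable

    @property
    def votes(self) -> int:
        # Assumed scoring rule: every reachable resource counts as one vote.
        return self.network_heartbeats + self.disk_heartbeats


def arbitrate(views):
    """Return the name of the node holding the most resources, or None on a tie."""
    best = max(v.votes for v in views)
    leaders = [v.name for v in views if v.votes == best]
    return leaders[0] if len(leaders) == 1 else None


# Common node-level failure: node B loses all of its links, node A still
# reaches the heartbeat disks in both data centers.
node_failure = [NodeView("A", 0, 2), NodeView("B", 0, 0)]

# Inter-data-center network failure: each node reaches only its local disk
# and hears no heartbeat from its peer, so both hold the same resources.
link_failure = [NodeView("A", 0, 1), NodeView("B", 0, 1)]

print(arbitrate(node_failure))
print(arbitrate(link_failure))
```

In the first scenario A holds more resources than B and is chosen as leader. In the second the counts are equal, arbitrate returns None, and without a further tie-breaker each side may carry on as its own cluster.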


Origin blog.csdn.net/weixin_70923796/article/details/130613727