------ What is split-brain?
In a "dual-machine hot standby" high-availability (HA) system, when the "heartbeat line" connecting the two nodes is cut (that is, the two nodes lose their connection), the HA system, which originally acted as a single coordinated whole, splits into two independent nodes. Having lost contact, each believes the other has failed, and the HA software on the two nodes, like a "split-brained man", instinctively fights over the "shared resources" and "application services". Serious consequences follow: 1) the shared resources get carved up and the "services" on both sides fail to start; or 2) the "services" on both sides come up, both read and write the "shared storage" at the same time, and the data is corrupted (a common symptom is errors in the database's online logs).
The two nodes compete with each other for the shared resources, causing system chaos and data corruption. For the HA of stateless services, split-brain hardly matters; for the HA of stateful services (such as MySQL), split-brain must be strictly prevented [yet some production systems configure stateful services following the stateless-service HA pattern, with results you can imagine].
------ Causes of cluster split-brain
Generally speaking, split-brain occurs for the following reasons:
1. The heartbeat link between the HA server nodes fails, so they cannot communicate normally.
2. The heartbeat cable itself is broken (snapped, aged, etc.).
3. The NIC or its driver is faulty, or the IP configuration is wrong or conflicts (with a direct NIC-to-NIC connection).
4. The devices along the heartbeat line are faulty (NICs, switches).
5. The arbitration machine has a problem (when an arbitration scheme is used).
6. The iptables firewall on the HA servers blocks the transmission of heartbeat messages.
7. The heartbeat NIC address or related settings on the HA servers are misconfigured, so heartbeats cannot be sent.
8. Other improper configurations, such as mismatched heartbeat methods, heartbeat broadcast conflicts, software bugs, and so on.
Tip: If the virtual_router_id configured at the two ends of the same VRRP instance in Keepalived is inconsistent, it will also cause a split-brain problem.
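For illustration, a minimal sketch of a Keepalived VRRP instance (interface, password and addresses are made up for the example); the virtual_router_id must be identical on both nodes of the same instance:

```
vrrp_instance VI_1 {
    state MASTER              # BACKUP on the other node
    interface eth0
    virtual_router_id 51      # MUST match on both ends of this instance
    priority 100              # lower on the backup node, e.g. 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.100
    }
}
```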
------ How to prevent split-brain in an HA cluster [there are currently four generally agreed measures]
The first: add redundant heartbeat lines [i.e., redundant communication channels]
Connect with a serial cable and an Ethernet cable at the same time, i.e., use two heartbeat lines (so the heartbeat itself is HA). If one line breaks, the other is still good and heartbeat messages can still get through, minimizing the chance of the "split-brain" phenomenon.
The second method: set up an arbitration mechanism
When the two nodes disagree, a third-party arbitrator decides whom to trust. The arbitrator may be a lock service, a shared disk, or something else. For example, set a reference IP (such as the gateway IP). When the heartbeat line is completely down, each node pings the reference IP. A node that fails the ping knows the break is on its own end: not just the "heartbeat" but also its local link to the outside "service" network is broken, so starting (or continuing) the application service would be useless anyway. It therefore gives up the competition and lets the node that can ping the reference IP take over the service. To be safer, the node that cannot ping the reference IP simply reboots itself to completely release any shared resources it may still hold.
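The reference-IP arbitration described above can be sketched as a small script that each node runs when all heartbeat lines are down. The address and the exact actions are assumptions; the decision is kept in a separate function so the policy itself is testable:

```shell
#!/bin/bash
# Sketch of reference-IP arbitration (the reference IP is an assumption).
REF_IP="192.168.1.1"   # e.g. the gateway

# True if the reference IP answers ping.
can_ping_ref() {
    ping -c 3 -W 2 "$REF_IP" >/dev/null 2>&1
}

# Decide what to do given the ping result ("ok" or "fail"):
#   keep   - our link is fine, keep/start the service
#   reboot - our own link is broken, give up and reboot to release
#            any shared resources we may still hold
arbitrate() {
    if [ "$1" = "ok" ]; then
        echo "keep"
    else
        echo "reboot"
    fi
}

# Example wiring (commented out; `reboot` is destructive):
# action=$(arbitrate "$(can_ping_ref && echo ok || echo fail)")
# [ "$action" = "reboot" ] && reboot
```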
The third method: a fence mechanism [i.e., for shared resources] [the premise is a reliable fence device]
When a node's state cannot be determined, the fence device forcibly shuts the peer down to guarantee the shared resources are fully released! In effect, when the backup node stops receiving heartbeats, it sends a shutdown command over a separate line to cut the master node's power.
Ideally, neither the second nor the third measure above should be missing. However, if the nodes use no shared resources, such as database HA based on master-slave replication, the fence device can be safely omitted and only the arbitration kept. Besides, in many cases there may be no usable fence device in the production environment, for example on cloud hosts.
So can we omit the arbitration mechanism and keep only the fence device? That is not allowed. When the two nodes lose contact with each other, they will fence each other at the same time. If the fencing method is reboot, the two machines will reboot endlessly; if it is power-off, the outcome may be that both nodes die together, or that one survives. And if the two nodes lost contact because one node's NIC failed, and the survivor happens to be the faulty node, the ending is still a tragedy. So: a plain two-node cluster cannot prevent split-brain in every case.
The fourth method: enable a disk lock.
The serving side locks the shared disk, so that when "split-brain" occurs the other side simply cannot take the shared disk away. But the disk lock has a big problem: if the side holding the shared disk does not actively "unlock" it, the other side can never get the disk. In reality, if the serving node suddenly freezes or crashes, it has no chance to run the unlock command, so the backup node cannot take over the shared resources and application services. So someone designed a "smart" lock for HA: the serving side activates the disk lock only when it finds all heartbeat lines disconnected (i.e., it can no longer sense the peer); normally the disk stays unlocked.
------ Is it safe to have no fence device?
Here, MySQL data replication is taken as an example. In a replication-based scenario, the master and slave have no shared resources (no VIP), so both nodes being alive is not itself a problem. The question is whether clients may access the node that should be dead, which brings up the issue of client routing. There are several ways to route clients: via VIP, via Proxy, via DNS, or simply having the client keep a list of server addresses and determine the master and slave itself. Whichever way is used, the route must be updated when the master and slave switch:
1) DNS-based routing is unreliable, because DNS may be cached by clients and is hard to flush.
2) VIP-based routing has some variables. If the node that should be dead does not remove its own VIP, it may come out to make trouble at any time (even if the new master has updated the ARP caches on all hosts via arping, if some host's ARP entry expires and it sends an ARP query, an IP conflict will occur). So the VIP should be considered a special shared resource too, and it must be removed from the faulty node. The easiest way to remove it: the faulty node removes the VIP itself after it finds it has lost contact, if it is still alive (if it is dead, there is nothing to remove). And if the process responsible for removing the VIP fails to work? Then an unreliable soft fence (such as ssh) can be used.
3) Proxy-based routing is more reliable, because the Proxy is the only service entrance: update the Proxy in one place and clients cannot go astray. But the high availability of the Proxy itself must also be considered.
4) As for the server-address-list method, the client must determine the master and slave by querying the backend service (for example, whether the PostgreSQL/MySQL session is in read-only mode). If there are two masters, the client gets confused. To prevent this, the original master must stop its service once it finds it has lost contact, on the same principle as removing the VIP above.
Therefore, to keep a faulty node from making trouble, it should release its resources by itself after losing contact; and to cope with the resource-releasing process itself failing, a soft fence can be added. Under this premise, running without a reliable physical fence device can be considered safe.
--------------------------------------------------------------------------------
What is a Fence device?
The Fence device is a very important part of a cluster; it averts the "split-brain" phenomenon caused by unpredictable situations. A Fence device issues hardware-management instructions directly to a server or storage, through the server's or storage's own hardware management interface or through an external power management device, to restart or shut the server down, or disconnect it from the network. When a device fails, the Fence is responsible for cutting the device occupying floating resources off from the cluster.
The nodes send detection packets to each other to determine each other's liveness. Usually there is a dedicated line for this detection, called the "heartbeat line" (the figure above simply uses the eth0 line as the heartbeat line). Suppose node1's heartbeat line has a problem: node2 and node3 will think node1 has failed and schedule the resources to run on node2 or node3, but node1 thinks it is fine and will not let node2 or node3 seize the resources, and you get "split-brain". At this point, if there is a device in the environment that can directly power node1 off, split-brain is avoided. Such a device is called fence or stonith (Shoot The Other Node In The Head). With physical machines hosting VMs, virsh can manage a virtual machine over the serial line, e.g. virsh destroy nodeX; here the physical machine is treated as the fence device.
------ Can data be guaranteed not to be lost after a master-slave switch?
Whether data is lost after a master-slave switch and split-brain can be considered two different issues. Again take MySQL data replication as the example. For MySQL, even when semi-synchronous replication is configured, it may automatically degrade to asynchronous replication after a timeout. To prevent the degradation, set an extremely large rpl_semi_sync_master_timeout while keeping rpl_semi_sync_master_wait_no_slave on (the default). But then, if the slave goes down, the master stalls too. One solution is one master with two slaves: as long as the two slaves are not both down, all is well; another is external cluster monitoring software that dynamically switches between semi-synchronous and asynchronous. If asynchronous replication was configured in the first place, you were already prepared to lose data, so losing a little data on a master-slave switch is no big deal; but the number of automatic switches must be controlled, for example the failed-over original master must not come back online automatically. Otherwise, if failover is triggered by network jitter, master and slave will keep switching back and forth, losing data and destroying data consistency.
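The anti-degradation settings described above can be sketched in my.cnf; the timeout value is illustrative, and the semi-sync plugin is assumed to be installed on both sides:

```
# Master side: never silently fall back to asynchronous replication.
rpl_semi_sync_master_enabled       = ON
rpl_semi_sync_master_timeout       = 1000000000   # ms; effectively "wait forever"
rpl_semi_sync_master_wait_no_slave = ON           # the default; keep it on

# Slave side:
rpl_semi_sync_slave_enabled        = ON
```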
------ How to implement the "arbitration mechanism + fence mechanism" strategy to prevent cluster "split-brain"
You could implement a script with the above logic from scratch, but it is recommended to build on mature cluster software, such as Pacemaker + Corosync + a suitable resource agent. Keepalived may not be suitable for the HA of stateful services; even with arbitration and fencing bolted on, the solution still feels awkward.
When using the Pacemaker+Corosync solution, please note: quorum can be regarded as Pacemaker's built-in arbitration mechanism. A majority of all the nodes in the cluster elects a coordinator, and all instructions in the cluster are issued by this coordinator, which neatly prevents the split-brain problem. For this mechanism to work effectively there must be at least three nodes in the cluster, and no-quorum-policy must be set to stop, which is also the default value. (Note: it is best not to set no-quorum-policy to ignore; doing that in a production environment without another arbitration mechanism is very dangerous!)
But what if there are only two nodes?
1. Borrow a machine to make up three nodes, and add location constraints so that resources are never placed on that node.
2. Merge several small clusters that individually lack quorum into one large cluster, again using location constraints to control where resources are placed.
But if you have many two-node clusters, you may not find enough spare nodes to make up the numbers, and you may not want to merge these two-node clusters into one large cluster (for example, because you find it inconvenient to manage). Then consider the third method.
3. Configure a preempted resource, plus the service and a colocation constraint with this preempted resource: whoever seizes the preempted resource provides the service. The preempted resource can be some lock service, for example one wrapped around ZooKeeper, or simply one built from scratch, as in the example of handling the corosync+pacemaker two-node split-brain problem:
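A minimal sketch of such a preempted lock over short HTTP connections. The lock-service URL and its grant/deny behavior are assumptions (a real deployment would more likely wrap a proven lock service such as ZooKeeper); the point is only that the service runs where the lock is held:

```shell
#!/bin/bash
# Hypothetical preempted-resource sketch: a lock service answers HTTP 2xx to the
# node that holds (or successfully grabs) the lock, and an error to the other.
LOCK_URL="http://arbiter.example.com/lock/cluster1"   # assumed endpoint
NODE=$(hostname)

# Try to grab or renew the lock (short HTTP connection per attempt).
grab_lock() {
    curl -sf -X POST "$LOCK_URL?holder=$NODE" >/dev/null 2>&1
}

# The resource agent's decision: serve only while holding the lock.
decide_role() {
    if grab_lock; then
        echo "serve"     # we own the preempted resource -> run the service here
    else
        echo "standby"   # the peer owns it -> stay passive, no split-brain
    fi
}
```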
The example above uses short HTTP connections; a more refined approach is long-lived connections with heartbeat detection, so the server can promptly detect a dropped connection and release the lock. However, you must also ensure the high availability of the preempted resource itself. You can make the lock-providing service itself highly available, or, more simply, deploy three instances: one on each of the two nodes and a third on a dedicated arbitration node, and consider the lock acquired only when at least 2 of the 3 locks are obtained. This arbitration node can provide arbitration for many clusters (because one machine can only run one Pacemaker instance; otherwise an arbitration node running N Pacemaker instances could do the same thing). Still, unless you are forced to, prefer the earlier methods, i.e., satisfying Pacemaker's quorum; they are simpler and more reliable.
------ How to monitor for "split-brain"
1. On which server should "split-brain" be monitored?
Monitor on the standby node; Zabbix can be used for the monitoring.
2. What information should be monitored?
If the VIP appears on the standby node, only two situations are possible:
1) split-brain has occurred;
2) a normal master-standby switchover has taken place.
3. Write a split-brain monitoring script
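A sketch of such a monitoring script for the standby node (the VIP, master IP and alert command are assumptions). The classification logic is a separate function so it can be tested without a live network: if the VIP is present here while the master still answers ping, it is split-brain; if the master is down, it is a normal takeover:

```shell
#!/bin/bash
# Hypothetical split-brain monitor, meant to be called by Zabbix or cron.
VIP="192.168.1.100"        # the cluster VIP (assumed)
MASTER_IP="192.168.1.10"   # the master node's real IP (assumed)

# Classify the situation from two inputs:
#   $1 - output of `ip addr` on this (standby) node
#   $2 - "up" if the master answers ping, "down" otherwise
# Prints: "split-brain", "failover" or "ok".
check_split_brain() {
    if echo "$1" | grep -qw "$VIP"; then
        if [ "$2" = "up" ]; then
            echo "split-brain"   # VIP here AND master alive -> both hold the VIP
        else
            echo "failover"      # master down -> normal master-standby switch
        fi
    else
        echo "ok"
    fi
}

# Real entry point (commented out so the logic above stays testable):
# master=$(ping -c 3 -W 2 "$MASTER_IP" >/dev/null 2>&1 && echo up || echo down)
# state=$(check_split_brain "$(ip addr)" "$master")
# [ "$state" = "split-brain" ] && echo "ALERT: split-brain on $(hostname)"
```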
4) Test. Make sure the load balancing on both nodes serves traffic normally.
Keepalived split-brain: case study 1
1) Solving the keepalived split-brain problem
Detection idea: normally the keepalived VIP lives on the master node; if the VIP is found on the slave node, raise an alarm. The detection script runs on the slave node.
2) A keepalived split-brain pitfall (with iptables enabled, if no rule lets the system receive VRRP traffic, split-brain occurs)
Once, in a keepalived+Nginx master-backup setup, after the backup machine was restarted both machines held the VIP, meaning keepalived split-brain had occurred. The network connectivity between the two hosts was checked and found to be fine. Packets were then captured on the backup machine.
3) Preventing the keepalived split-brain problem
1. Third-party arbitration can be used. In a keepalived setup, the state of each of the master and backup machines depends on the other; if the communication between them runs into network trouble, split-brain occurs, keepalived ends up with two masters, and resource contention follows.
2. Arbitration is generally introduced to solve this, i.e., each node must judge its own state. The simplest approach is to add a check to the keepalived configuration on both master and backup that periodically pings the gateway; if the ping fails, the node considers itself faulty.
3. The easiest implementation uses keepalived's own vrrp_script and track_script.
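A sketch of such a vrrp_script/track_script configuration (the gateway IP, interval, and weight are illustrative values, not a recommendation):

```
vrrp_script chk_gateway {
    script "/usr/bin/ping -c 1 -W 2 192.168.1.1"   # non-zero exit = gateway unreachable
    interval 5     # run every 5 seconds
    fall 3         # declare failure after 3 consecutive misses
    rise 2         # recover after 2 consecutive successes
    weight -20     # drop priority on failure so the peer can take over
}

vrrp_instance VI_1 {
    # ...interface, virtual_router_id, priority, virtual_ipaddress, etc...
    track_script {
        chk_gateway
    }
}
```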
4) Writing your own script is recommended
Write a while loop that pings the gateway each round and counts consecutive failures; once the failures reach a threshold, run service keepalived stop to shut keepalived down. If the gateway becomes pingable again, restart the keepalived service. Finally, add a check at the top of the script so that only one copy runs, and add the script to crontab.
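The loop just described can be sketched as follows; the gateway IP, threshold, and interval are assumptions, and the stop/start policy is factored into a function so it can be tested without a network:

```shell
#!/bin/bash
# Hypothetical keepalived watchdog sketch.
GATEWAY="192.168.1.1"
MAX_FAIL=5

# Policy: given the current consecutive-failure count ($1), this round's ping
# result ("ok"/"fail", $2), and keepalived's state as we know it
# ("running"/"stopped", $3), print "<new_count> <action>" where action is
# "stop", "start" or "none".
decide() {
    if [ "$2" = "ok" ]; then
        [ "$3" = "stopped" ] && echo "0 start" || echo "0 none"
    else
        n=$(( $1 + 1 ))
        if [ "$n" -ge "$MAX_FAIL" ] && [ "$3" = "running" ]; then
            echo "$n stop"
        else
            echo "$n none"
        fi
    fi
}

# Main loop: runs only when invoked as `watchdog.sh run`, so sourcing the
# file (e.g. for tests, or the single-copy check) does not block.
if [ "${1:-}" = "run" ]; then
    count=0; state="running"
    while true; do
        ping -c 1 -W 2 "$GATEWAY" >/dev/null 2>&1 && r="ok" || r="fail"
        set -- $(decide "$count" "$r" "$state")
        count=$1
        case "$2" in
            stop)  service keepalived stop  && state="stopped" ;;
            start) service keepalived start && state="running" ;;
        esac
        sleep 5
    done
fi
```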
Keepalived split-brain: case study 2
When deploying an Nginx+Keepalived high-availability cluster, a similar split-brain symptom may appear. It was handled by examining the keepalived logs on both node machines.
Reading the logs of the two node machines reveals that VRRP is implemented with advertisement packets! The master node sends an advertisement to the backup node at a fixed interval; if the backup receives none, it makes itself master. So the root cause of the "split-brain" above is that the backup did not receive the advertisements sent by the master, and therefore also became master.
There is only one kind of VRRP control packet: the VRRP advertisement. It is encapsulated in IP multicast datagrams with group address 224.0.0.18 and propagates only within the same LAN, which allows a VRID to be reused in different networks. To reduce network bandwidth consumption, only the master router sends periodic VRRP advertisements. A backup router starts a new round of VRRP election if it receives no advertisement within three consecutive advertisement intervals, or if it receives an advertisement with priority 0.
Also note: after installing Keepalived on CentOS 7, if the firewall is not turned off, you must allow the VRRP multicast IP 224.0.0.18 through the firewall. Otherwise the virtual IP cannot float, both machines become master, and dual-machine hot standby fails. [Similar to the situation in "case study 1" above]
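A sketch of firewall commands that allow VRRP through on CentOS 7 (the interface name is illustrative; use the firewalld or iptables variant depending on which the host actually runs, and verify the syntax against your firewalld version):

```
# firewalld: accept the VRRP protocol (IP protocol number 112)
firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
firewall-cmd --reload

# or with classic iptables (iptables-services assumed): accept traffic
# to the VRRP multicast group 224.0.0.18
iptables -I INPUT -i eth0 -d 224.0.0.18 -j ACCEPT
service iptables save
```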