The split-brain problem (Keepalived)

Split-brain means that in a high-availability (HA) system, when the two connected nodes lose contact with each other, the original system splits into two independent nodes that both start competing for shared resources, which leads to system confusion and data corruption.
For HA of stateless services it does not matter much whether split-brain occurs; but for HA of stateful services (such as MySQL), split-brain must be strictly prevented. (Some production systems nevertheless configure stateful services following a stateless-service HA setup, and the results can be imagined...)

How to prevent split-brain in an HA cluster
Generally, two methods are used:
1) Arbitration
When the two nodes disagree, a third-party arbiter decides whose view to trust. The arbiter may be a lock service, a shared disk, or something else.

2) Fencing
When the state of a node cannot be determined, the other side kills it through fencing to make sure the shared resources are completely released, provided that a reliable fence device is available.
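As a hedged illustration of the fencing idea in a Pacemaker cluster (Pacemaker is recommended later in this article), a STONITH device could be registered roughly as follows; the device name, IPMI address, and credentials are placeholders, and option names vary slightly between fence-agents versions:

# Hedged sketch: register an IPMI fence device for node1 with pcs.
# "fence-node1", the IPMI address and the credentials are placeholder values.
pcs stonith create fence-node1 fence_ipmilan \
    ip=192.168.100.11 username=admin password=secret \
    pcmk_host_list=node1 op monitor interval=60s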

Ideally, neither of the above should be missing.
However, if the nodes do not use shared resources, for example database HA based on master-slave replication, the fence device can be safely omitted and only quorum kept. Besides, in many environments no fence device is available at all, for example on cloud hosts.

So, can arbitration be omitted and only the fence devices kept?
No. When the two nodes lose contact with each other, they will fence each other at the same time. If the fencing action is reboot, the two machines will keep restarting each other. If the fencing action is power-off, the outcome may be that both nodes die together, or that one survives. But if the reason the two nodes lost contact is that one node's network card failed, and it is that faulty node which survives, the ending will be tragic.
Therefore, a simple two-node cluster cannot prevent split-brain by any means.

How to implement the above strategy
You could implement a set of scripts that follow the above logic from scratch, but it is recommended to build on mature cluster software, such as Pacemaker + Corosync + a suitable resource agent. Keepalived is not suitable for HA of stateful services; even if you add arbitration and fencing to the solution, it always feels awkward.

There are also some precautions when using the Pacemaker + Corosync solution:
1) Understand the function and principle of the resource agent
Only by understanding the function and principle of the resource agent can you know its applicable scenarios. For example, the pgsql resource agent is relatively complete: it supports synchronous and asynchronous streaming replication, can switch between the two automatically, and can guarantee that no data is lost under synchronous replication. The current MySQL resource agent, however, is very weak: without GTID and without log compensation, it is easy to lose data. It is better not to use it and to keep using MHA instead (but be sure to prevent split-brain when deploying MHA).

2) Ensure quorum
Quorum can be regarded as Pacemaker's built-in arbitration mechanism: a majority of all cluster nodes elects a coordinator, and all cluster instructions are issued by this coordinator, which can perfectly eliminate the split-brain problem. For this mechanism to work effectively, the cluster must have at least 3 nodes, and no-quorum-policy must be set to stop, which is also the default. (Many tutorials set no-quorum-policy to ignore for the convenience of demonstration; if a production environment does the same without any other arbitration mechanism, it is very dangerous!)
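As a hedged illustration with the pcs command line tool (verify the exact syntax against your Pacemaker/pcs version), the policy can be inspected and pinned explicitly:

# Show the effective no-quorum-policy and set it back to the default "stop".
pcs property list --all | grep no-quorum-policy
pcs property set no-quorum-policy=stop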

But what if there are only 2 nodes?

  • One option is to borrow a machine to make up 3 nodes, and then set location constraints so that resources are never allocated to that node.
  • The second is to combine several small clusters that individually do not meet the quorum requirement into one large cluster, again using location constraints to control where resources are allocated (a constraint sketch follows this list).
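A hedged sketch of such a location constraint with pcs; pgsql-ha and quorum-node are hypothetical resource and node names:

# Never let the filler/arbitration node run the service resource.
pcs constraint location pgsql-ha avoids quorum-node=INFINITY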

But if you have many two-node clusters, you may not be able to find that many filler nodes, and you may not want to merge these two-node clusters into one large cluster (for example, because it is inconvenient to manage). Then a third method can be considered.
The third method is to configure a preemptable resource, plus a colocation constraint between the service and this resource, so that whichever node grabs the resource provides the service. The preemptable resource can be a lock service, for example one wrapped around ZooKeeper, or simply one written from scratch, as in the example below. That example uses short HTTP connections; a more thorough approach is to use long connections with heartbeat detection, so that the server can detect a disconnection in time and release the lock. You must also ensure the high availability of the preemptable resource itself. The service providing it can be made highly available in its own right, or, more simply, three instances can be deployed: one on each of the two cluster nodes and a third on a dedicated arbitration node, with the lock considered acquired only when at least 2 of the 3 locks are obtained. This arbitration node can provide quorum services for many clusters (whereas only one Pacemaker instance can be deployed per machine; otherwise an arbitration node running N Pacemaker instances could do the same thing). However, if it is not strictly necessary, try to use the previous methods, i.e. satisfying Pacemaker's own quorum, which is simpler and more reliable.
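A minimal, hypothetical sketch of the "2 out of 3 locks" grab over HTTP; the lock-service URLs, the holder parameter, and the "HTTP 200 means granted" convention are all assumptions for illustration, not part of the original text:

#!/bin/bash
# Hypothetical sketch: try to grab an HTTP-based lock from 3 lock services
# and consider the lock held only if at least 2 of the 3 grants succeed.
NODE=$(hostname)
LOCK_URLS="http://node1:8000/lock http://node2:8000/lock http://arbiter:8000/lock"

granted=0
for url in $LOCK_URLS; do
    # Each lock service is assumed to answer HTTP 200 only to the first holder.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url?holder=$NODE")
    [ "$code" = "200" ] && granted=$((granted + 1))
done

if [ "$granted" -ge 2 ]; then
    echo "lock acquired by $NODE"
    exit 0   # e.g. report success to the cluster manager / colocated service
else
    echo "lock not acquired"
    exit 1
fi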

------------------------------ Keepalived's split-brain problem ------------------------------
1) Detecting a Keepalived split-brain
Detection idea: under normal circumstances the Keepalived VIP is held by the master node. If the VIP is also found on the slave node, an alarm should be raised. The detection script (deployed on the slave node) is as follows:

[root@slave-ha ~]# vim check_split_brain.sh
#!/bin/bash
# Split-brain detection script, deployed on the backup (slave) node.
LB01_VIP=192.168.1.229
LB01_IP=192.168.1.129
LB02_IP=192.168.1.130
while true
do
    # If the VIP answers pings and is also configured on this backup node,
    # assume a split-brain has occurred.
    ping -c 2 -W 3 $LB01_VIP &>/dev/null
    if [ $? -eq 0 -a `ip addr | grep "$LB01_VIP" | wc -l` -eq 1 ];then
        echo "ha is brain."
    else
        echo "ha is ok"
    fi
    sleep 5
done

 

The output under normal conditions:

[root@slave-ha ~]# bash check_split_brain.sh
ha is ok
ha is ok
ha is ok
ha is ok

The output when an anomaly (split-brain) is detected:

[root@slave-ha ~]# bash check_split_brain.sh
ha is ok
ha is ok
ha is ok
ha is ok
ha is brain.
ha is brain.

2) A Keepalived split-brain problem once encountered (if iptables is enabled without a rule allowing the system to receive the VRRP protocol, split-brain occurs)
While building a keepalived + Nginx master/backup setup, after the backup machine was rebooted we found that both machines held the VIP, which means a Keepalived split-brain had occurred. We checked the network connectivity between the two hosts and found the network was fine. We then captured packets on the backup machine:

[root@localhost ~]#  tcpdump -i eth0|grep VRRP
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
22:10:17.146322 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:17.146577 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:17.146972 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:18.147136 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:18.147576 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:25.151399 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:25.151942 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:26.151703 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:26.152623 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:27.152456 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:27.153261 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:28.152955 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:28.153461 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:29.153766 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:29.155652 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:30.154275 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:30.154587 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:31.155042 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:31.155428 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:32.155539 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:32.155986 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:33.156357 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:33.156979 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20
22:10:34.156801 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20
22:10:34.156989 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20

 

The backup machine can receive the VRRP advertisements sent by the master, so why is there still a split-brain?

It then turned out that iptables had been enabled after the reboot. Checking the firewall configuration showed that the system was not accepting the VRRP protocol.

So iptables was modified to add a configuration allowing the system to receive VRRP traffic:

-A INPUT -i lo -j ACCEPT  

-----------------------------------------------------------------------------------------

I added the following iptables rules myself:

-A INPUT -s 192.168.1.0/24 -d 224.0.0.18 -j ACCEPT       # allow traffic to the VRRP multicast address (224.0.0.18)

-A INPUT -s 192.168.1.0/24 -p vrrp -j ACCEPT             # allow VRRP (Virtual Router Redundancy Protocol) traffic

-----------------------------------------------------------------------------------------

 

Finally, after restarting iptables, the VIP on the backup machine was gone.

Although the problem was solved, the backup machine could clearly capture the VRRP advertisements sent by the master yet could not change its own state. This can only mean that the network card receives the packets (where tcpdump sees them) before iptables processes and drops them, so Keepalived itself never got them.

3) Preventing the Keepalived split-brain problem
     1) A third-party arbitration method can be used. In the Keepalived architecture, the state of each of the two (master/backup) machines depends on the other. If communication between them runs into a network problem, split-brain occurs, both nodes become master, and resource contention follows.
     2) Arbitration is generally introduced to solve this, i.e. each node must judge its own state. The simplest approach is to add a check to the Keepalived configuration on both master and backup: the server periodically pings the gateway, and if the gateway cannot be reached, the node considers itself faulty.
     3) The easiest way is to use the vrrp_script and track_script facilities provided by Keepalived, as shown below:

# vim /etc/keepalived/keepalived.conf
   ......
   vrrp_script check_local {
    script "/root/check_gateway.sh"
    interval 5
    }
   ......

   track_script {
   check_local
   }

   Script content:

   # cat /root/check_gateway.sh
   #!/bin/sh
   VIP=$1
   GATEWAY=192.168.1.1
   /sbin/arping -I em1 -c 5 -s $VIP $GATEWAY &>/dev/null

   check_gateway.sh is our arbitration logic: when the gateway cannot be reached, Keepalived is shut down (service keepalived stop).
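The script above only issues the arping and relies on its exit code, and it expects the VIP as its first argument, which the vrrp_script line shown does not pass. A hedged sketch of a variant that hard-codes the addresses used earlier in this article and explicitly stops Keepalived when the gateway check fails (an assumption for illustration, not the original author's script) could look like this:

#!/bin/sh
# Hypothetical variant of check_gateway.sh: addresses are taken from the
# examples above; stopping keepalived on failure is an assumption.
VIP=192.168.1.229
GATEWAY=192.168.1.1

# If the gateway does not answer arping from the VIP, give up the VIP
# by stopping keepalived on this node.
if ! /sbin/arping -I em1 -c 5 -s "$VIP" "$GATEWAY" >/dev/null 2>&1; then
    service keepalived stop
    exit 1
fi
exit 0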
