Solving the "split-brain" problem in HA high-availability cluster

------ What is split-brain?
In a "dual-machine hot standby" high availability (HA) system, when the "heartbeat line" connecting two nodes is disconnected (that is, the two nodes are disconnected) When connected), the HA system, which was originally a whole and coordinated actions, split into two independent nodes (that is, two independent individuals). Because they lost contact with each other, they both thought that the other party was faulty. The HA software on the two nodes was like a "split-brain man", "instinctively" competing for "shared resources" and "application services". Serious consequences will occur: 1) or the shared resources are divided and the "services" on both sides cannot be started; 2) or the "services" on both sides are up, but the "shared storage" is read and written at the same time, resulting in data damage (common such as There is an error in the online log of the database polling).

When the two nodes compete for shared resources, the result is system chaos and data corruption. For the HA of stateless services it does not matter much whether split-brain occurs; but for the HA of stateful services (such as MySQL), split-brain must be strictly prevented. (Some production systems nevertheless configure stateful services following the HA pattern of stateless services, and the results can be imagined.)

------ Causes of cluster split-brain
Generally speaking, split-brain occurs for the following reasons:
1. The heartbeat link between the nodes of the HA cluster fails, so they can no longer communicate normally.
2. The heartbeat cable is broken (severed or aged).
3. The network card or its driver has failed, or the IP addresses are misconfigured or conflict (on a directly connected heartbeat link).
4. A device along the heartbeat path is faulty (network card or switch).
5. The arbitration machine has a problem (when an arbitration scheme is used).
6. The iptables firewall on the HA servers blocks the transmission of heartbeat messages.
7. The heartbeat NIC address or related settings on the HA servers are misconfigured, so heartbeats cannot be sent.
8. Other causes such as improper configuration: mismatched heartbeat methods, heartbeat broadcast conflicts, software bugs, and so on.

Tip: if the virtual_router_id parameter of the same VRRP instance is configured differently on the two ends in the Keepalived configuration, that will also cause a split-brain problem.
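For instance, a quick way to check this (assuming the default configuration path) is to compare the value on both nodes:

# run on both nodes; the same VRRP instance must report the same value
grep -n 'virtual_router_id' /etc/keepalived/keepalived.conf
# e.g. both ends should print something like:  virtual_router_id 51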

------ How to prevent split-brain in an HA cluster [four commonly used measures]
The first method: add redundant heartbeat lines [i.e., redundant communication paths]
Connect the nodes with a serial cable and an Ethernet cable at the same time, i.e., use two heartbeat lines simultaneously (so the heartbeat itself is HA). If one line breaks, the other is still good and heartbeat messages can still be delivered, minimizing the chance of split-brain.

The second method: set up an arbitration mechanism
When the two nodes disagree, a third-party arbiter decides whom to trust. The arbiter may be a lock service, a shared disk, or something else. For example, configure a reference IP (such as the gateway IP). When the heartbeat line is completely broken, each node pings the reference IP. If the ping fails, the break is on the local side: not only the heartbeat, but also the local network link used by the external service is down, so starting (or continuing) the application service there is pointless. That node then voluntarily gives up the competition and lets the node that can still ping the reference IP take over the service. To be even safer, the node that cannot ping the reference IP simply reboots itself to completely release any shared resources it may still be holding.
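A minimal sketch of this arbitration logic (the reference IP and the service name are placeholders, not taken from any particular HA product):

#!/bin/bash
REFERENCE_IP=192.168.10.1      # e.g. the gateway; a hypothetical value
if ! ping -c 3 -W 2 "$REFERENCE_IP" &>/dev/null; then
    # the reference IP is unreachable: the break is on the local side,
    # so give up the competition and release the resources
    service myapp stop          # hypothetical service name
    # to be even safer, reboot so that everything is certainly released
    # reboot
fi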

The third method: a fence mechanism [for shared resources] [the prerequisite is a reliable fence device]
When the state of a node cannot be determined, the fence device forcibly shuts that node down to ensure the shared resources are completely released. In effect, when the standby node stops receiving heartbeats, it sends a shutdown command over a separate channel to cut the power of the master node.
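As an illustration only, a typical fence action issued from the surviving node to the peer's out-of-band management interface (BMC) might look like the following; the address and credentials are placeholders, and production clusters normally call a fence agent such as fence_ipmilan rather than running ipmitool by hand:

# power the unreachable peer off through its BMC
ipmitool -I lanplus -H 192.168.100.12 -U admin -P secret chassis power off
# confirm the power state before taking over the shared resources
ipmitool -I lanplus -H 192.168.100.12 -U admin -P secret chassis power status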

Ideally, neither the second nor the third measure above should be missing. However, if the nodes do not use shared resources, for example database HA based on master-slave replication, the fence device can be safely omitted and only the arbitration kept. Moreover, in many online environments, such as cloud hosts, there may simply be no usable fence device.

So can we omit the arbitration mechanism and keep only the fence device? No. When the two nodes lose contact with each other, they will fence each other at the same time. If the fencing method is reboot, the two machines will keep rebooting endlessly. If the fencing method is power-off, the outcome may be that both nodes die together, or that one survives; but if the loss of contact was caused by a failed network card on one node and the survivor happens to be that faulty node, the ending is still a tragedy. In short: a simple two-node cluster cannot prevent split-brain in every case.

The fourth method: enable a disk lock.
The serving node locks the shared disk, so that when split-brain occurs the other side cannot take the shared disk resource away at all. But an always-locked disk has a big problem: if the side holding the shared disk does not actively "unlock" it, the other side never gets the disk. In practice, if the serving node suddenly freezes or crashes, it cannot execute the unlock command, and the standby node cannot take over the shared resources and application services. So a "smart" lock was designed for HA: the serving node activates the disk lock only when it finds that all heartbeat lines are down (i.e., it can no longer see the peer); normally the disk is left unlocked.

------ Is it safe to have no fence device?
Here, MySQL replication is used as an example. In a replication-based scenario, the master and slave nodes share no resources (no VIP), so there is no problem with both nodes being alive. The real question is whether clients will still access the node that should be considered dead, which is a matter of client routing. There are several routing approaches: VIP-based, Proxy-based, DNS-based, or simply having the client maintain a list of server addresses and determine master and slave by itself. Whichever is used, the route must be updated when the master-slave roles switch:

1) DNS-based routing is unreliable, because DNS records may be cached by clients and are hard to flush.
2) VIP-based routing has some variables. If the node that should be dead does not remove its own VIP, it can come back to cause trouble at any time (even if the new master has updated the ARP caches on all hosts via arping, once some host's ARP entry expires and it sends an ARP query, an IP conflict will occur). So the VIP can be regarded as a special kind of shared resource that must be removed from the faulty node. The simplest way is for the faulty node to drop the VIP itself after it notices it has lost contact, if it is still alive (if it is dead, nothing needs to be done). What if the process responsible for dropping the VIP fails to work? Then an unreliable "soft fence" (such as ssh) can be used. See the arping sketch after this list.
3) Proxy-based routing is more reliable, because the proxy is the single service entry point: as long as the proxy is updated in one place, clients cannot access the wrong node. But the high availability of the proxy itself must also be considered.
4) As for the server-address-list approach, the client has to determine master and slave from the backend service itself (for example, whether the PostgreSQL/MySQL session is in read-only mode). If there are two masters at that moment, the client gets confused. To prevent this, the original master must stop its service once it finds it has lost contact, which is the same principle as dropping the VIP above.
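For the VIP-based case in item 2 above, the gratuitous ARP that the new master sends after taking over the VIP is typically something like the following (the interface, count and VIP are placeholders):

# announce the new MAC/IP binding so that hosts and switches refresh their ARP caches
arping -U -I eth0 -c 3 192.168.10.10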

Therefore, to keep a faulty node from causing trouble, it should release its resources by itself after losing contact; and to cope with the case where the resource-releasing process itself fails, a soft fence can be added. Under this premise, it can be considered safe even without a reliable physical fence device.

------ What is a Fence device?
The fence device is a very important part of a cluster; it prevents the "split-brain" that unpredictable situations can cause. A fence device works through the hardware management interface of the server or storage itself, or through an external power management device, issuing hardware-level commands directly to the server or storage to restart it, shut it down, or disconnect it from the network. When a node fails, the fence is responsible for cutting the device that holds the floating resources off from the cluster.

The nodes send probe packets to each other to determine whether their peers are alive. Usually a dedicated link is used for this probing, called the "heartbeat line" (in the original diagram the eth0 link doubles as the heartbeat line). Suppose node1's heartbeat line has a problem: node2 and node3 will conclude that node1 has failed and schedule the resources to run on node2 or node3, but node1 still thinks it is fine and will not let node2 or node3 take the resources over, so split-brain occurs. If there is a device in the environment that can directly power node1 off at this moment, split-brain is avoided. Such a device is called a fence or STONITH (Shoot The Other Node In The Head) device. In a virtualized test setup, the physical host can manage the virtual machines through virsh, e.g. virsh destroy nodeX; here the physical machine is treated as the fence device.
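In that virtualized setup, treating the physical host as the fence device amounts to something like the following (node names are assumptions):

# on the physical host that runs the cluster VMs
virsh list --all        # check which cluster VMs are running
virsh destroy node1     # hard power-off the unreachable node, like pulling its power cord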

------ Can data be guaranteed not to be lost after a master-slave switch?
Whether data is lost after a master-slave switch and whether split-brain occurs can be regarded as two different issues. Again take MySQL replication as an example. For MySQL, even when semi-synchronous replication is configured, it may automatically degrade to asynchronous replication after a timeout. To prevent this degradation, you can set an extremely large rpl_semi_sync_master_timeout while keeping rpl_semi_sync_master_wait_no_slave on (the default). However, if the slave then fails, the master stalls as well. The solution can be a one-master-two-slave setup, which is fine as long as both slaves are not down at the same time, or external cluster-monitoring software that dynamically switches between semi-synchronous and asynchronous replication. If asynchronous replication was configured in the first place, you have already accepted the possibility of losing data, so losing a little data during a master-slave switch is not a big deal; but the number of automatic switches must be controlled. For example, an original master that has already been failed over must not be allowed to come back online automatically; otherwise, if failover is triggered by network jitter, master and slave will keep switching back and forth, losing data and destroying consistency.
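A hedged sketch of the settings mentioned above (the timeout value is illustrative and in milliseconds; in practice they would also be put in my.cnf so they survive a restart):

# make semi-synchronous replication effectively never degrade to asynchronous
mysql -e "SET GLOBAL rpl_semi_sync_master_timeout = 100000000;"
# keep waiting for an ACK even when no slave is connected (the default value)
mysql -e "SET GLOBAL rpl_semi_sync_master_wait_no_slave = ON;"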

------ How to implement the "arbitration mechanism + fence mechanism" strategy to prevent cluster split-brain
You can write a script implementing the above logic from scratch, but it is recommended to build on mature cluster software, such as Pacemaker + Corosync plus a suitable resource agent. Keepalived is not really suited to the HA of stateful services; even with arbitration and fencing bolted on, the solution still feels awkward.

When using the Pacemaker+Corosync solution, note that quorum can be regarded as Pacemaker's own arbitration mechanism: a majority of all cluster nodes elects a coordinator, and all instructions in the cluster are issued by this coordinator, which reliably prevents split-brain. For this mechanism to work effectively, the cluster needs at least three nodes, and no-quorum-policy should be set to stop, which is also the default value. (Note: it is best not to set no-quorum-policy to ignore; doing so in a production environment without another arbitration mechanism is very dangerous!)
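Assuming the cluster is managed with pcs (crmsh-based setups have an equivalent "crm configure property" command), this corresponds roughly to:

# keep the default policy: stop all resources in a partition that loses quorum
pcs property set no-quorum-policy=stop
# for reference only - this is the dangerous setting warned about above
# pcs property set no-quorum-policy=ignore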

But what if there are only two nodes?
1. Borrow a third machine to make up three nodes, and then use location constraints to keep resources from being placed on that node.
2. Pull several small clusters that do not individually meet quorum together into one large cluster, again using location constraints to control where resources are placed.

But if you have many two-node clusters, you may not find enough spare nodes to make up the numbers, and you may not want to merge these two-node clusters into one large cluster (for example, because it is inconvenient to manage). In that case, consider the third method.
3. The third method is to configure a preempted resource, plus ordering and colocation constraints between the service and this preempted resource: whoever grabs the preempted resource provides the service. The preempted resource can be some lock service, for example one wrapped around ZooKeeper, or simply one written from scratch, as in the following example of handling the corosync+pacemaker two-node split-brain problem:


corosync serves as the membership layer of the HA stack, responsible for cluster membership management and communication (unicast, broadcast, multicast), while pacemaker serves as the CRM (cluster resource manager) layer.

In practice with a corosync+pacemaker two-node active/standby setup, you may run into the split-brain problem.

What split-brain means here: in an HA cluster the nodes communicate over the heartbeat network; once that network fails, the members no longer recognize each other, each acts as the DC of the cluster, and the resources are started on both the active and the standby node at the same time.

Is the split-brain caused by corosync or by pacemaker?

At first you might blame corosync, since the broken heartbeat keeps corosync from communicating normally. Later, however, a split-brain solution was found in the pacemaker documentation: pacemaker, as the CRM, is mainly responsible for managing resources, and another of its roles is electing the leader.

1. Solution: configure a preempted resource for pacemaker.

The principle is that pacemaker can define the order in which resources are started. If the exclusive (preempted) resource is placed first, all later resources depend on it: the service stands or falls with the exclusive resource.

When the heartbeat network fails, whichever node grabs this resource first takes over the service resources and provides the service. Two problems must be solved for this scheme: first, a preempted resource must be defined; second, a custom pacemaker RA (resource agent) must be written to grab it.
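Roughly, the dependency described above would be expressed with pcs using hypothetical resource names (lock-res for the preempted resource, web-res for the real service); this is a sketch, not part of the original article:

# the service may only start after the preempted resource has been acquired
pcs constraint order start lock-res then start web-res
# and it must run on the same node that holds the preempted resource
pcs constraint colocation add web-res with lock-res INFINITY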

   

2. Define the preempted resource

Below, a mutex is used to implement the exclusive resource: a simple web service written in Python that provides lock, unlock and update-lock operations.

   

__author__ = 'ZHANGTIANJIONG629'

from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import time

lock_timeout_seconds = 8
lock = threading.Lock()
lock_client_ip = ""
lock_time = 0


class LockService(BaseHTTPRequestHandler):

    def do_GET(self):
        '''Minimal URL routing: /lock, /update_lock and /unlock.'''
        client_ip = self.client_address[0]
        if self.path.startswith('/update_lock'):
            self.update_lock(client_ip)
        elif self.path.startswith('/unlock'):
            self.unlock(client_ip)
        elif self.path.startswith('/lock'):
            self.lock(client_ip)
        else:
            self.send_response(404, 'not found')
            self.end_headers()

    def lock(self, client_ip):
        global lock_client_ip
        global lock_time
        # if the lock is free, take it (non-blocking acquire)
        if lock.acquire(False):
            lock_client_ip = client_ip
            lock_time = time.time()
            self.send_response(200, 'ok')
            self.end_headers()
            return
        # if the current client already holds the lock, refresh the lock time
        elif lock_client_ip == client_ip:
            lock_time = time.time()
            self.send_response(200, 'ok,update')
            self.end_headers()
            return
        else:
            # the lock has timed out, so grab it from the previous holder
            if time.time() - lock_time > lock_timeout_seconds:
                lock_client_ip = client_ip
                lock_time = time.time()
                self.send_response(200, 'ok,grab lock')
                self.end_headers()
                return
            else:
                self.send_response(403, 'lock is held by another client')
                self.end_headers()

    def update_lock(self, client_ip):
        global lock_client_ip
        global lock_time
        # only the current holder may refresh the lock
        if lock_client_ip == client_ip:
            lock_time = time.time()
            self.send_response(200, 'ok,update')
            self.end_headers()
            return
        else:
            self.send_response(403, 'lock is held by another client')
            self.end_headers()
            return

    def unlock(self, client_ip):
        global lock_client_ip
        global lock_time
        # if nobody holds the lock, there is nothing to release
        if lock.acquire(False):
            lock.release()
            self.send_response(200, 'ok,unlock')
            self.end_headers()
            return
        # only the holder may release the lock
        elif lock_client_ip == client_ip:
            lock.release()
            lock_time = 0
            lock_client_ip = ''
            self.send_response(200, 'ok,unlock')
            self.end_headers()
            return
        else:
            self.send_response(403, 'lock is held by another client')
            self.end_headers()
            return


if __name__ == '__main__':
    http_server = HTTPServer(('127.0.0.1', 8888), LockService)
    http_server.serve_forever()

The example above uses short-lived HTTP connections; a more careful approach is to use long-lived connections with heartbeat detection, so the server can promptly notice a broken connection and release the lock. In any case, the preempted resource itself must be highly available: you can make the lock service itself HA, or, more simply, deploy three instances, one on each of the two cluster nodes and a third on a dedicated arbitration node, and only consider the lock acquired when at least two of the three locks are obtained. This arbitration node can then provide arbitration for many clusters (because one machine can only run one Pacemaker instance; otherwise an arbitration node running N Pacemaker instances could do the same job). But unless you have no other choice, prefer the earlier approach of satisfying Pacemaker quorum, which is simpler and more reliable.
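A minimal sketch of the majority-lock idea (the server addresses are hypothetical, and the /lock route matches the lock service sketched above):

# try to grab the lock on three lock servers; only act with a majority (2 of 3)
LOCK_SERVERS="10.0.0.1:8888 10.0.0.2:8888 10.0.0.3:8888"
acquired=0
for s in $LOCK_SERVERS; do
    if curl -s -o /dev/null -w '%{http_code}' "http://$s/lock" | grep -q '^200$'; then
        acquired=$((acquired + 1))
    fi
done
if [ "$acquired" -ge 2 ]; then
    echo "majority of locks acquired, safe to start/keep the service"
else
    echo "no majority, release resources and stop the service"
fi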

------ How to monitor for "split-brain"
1. On which server should split-brain be monitored?
Monitor on the standby node; zabbix can be used for the monitoring.

2. What information should be monitored?
If the VIP appears on the standby node, there are only two possible reasons:
1) split-brain has occurred;
2) a normal master-standby failover has taken place.

3. Write a split-brain monitoring script


[root@slave-node scripts]# vim check_keepalived.sh

#!/bin/bash
# 192.168.10.10 is the VIP address
while true
do
    if [ `ip a show eth0 | grep 192.168.10.10 | wc -l` -ne 0 ]
    then
        echo "keepalived is error!"
    else
        echo "keepalived is OK !"
    fi
    sleep 5
done

[root@slave-node scripts]# chmod 755 check_keepalived.sh

4) Test. Make sure load balancing works normally on both nodes:


[root@master-node ~]# curl -H Host:www.kevin.cn 192.168.10.30

server-node01 www

[root@master-node ~]# curl -H Host:www.kevin.cn 192.168.10.31

server-node01 www

[root@slave-node ~]# curl -H Host:www.bobo.cn 192.168.10.31

server-node02 bbs

[root@slave-node ~]# curl -H Host:www.kevin.cn 192.168.10.30

server-node03 www

------ Keepalived split-brain problem: case 1
1) Detecting the keepalived split-brain problem
Detection idea: under normal conditions the keepalived VIP lives on the master node; if the VIP is found on the standby node, raise an alarm. The script (deployed on the standby node) is as follows:


[root@slave-ha ~]# vim check_split_brain.sh

#!/bin/bash
# split-brain check script, deployed on the standby node
LB01_VIP=192.168.1.229
LB01_IP=192.168.1.129
LB02_IP=192.168.1.130
while true
do
    ping -c 2 -W 3 $LB01_VIP &>/dev/null
    if [ $? -eq 0 -a `ip add | grep "$LB01_VIP" | wc -l` -eq 1 ]; then
        echo "ha is brain."
    else
        echo "ha is ok"
    fi
    sleep 5
done

The script output looks like this:

[root@slave-ha ~]# bash check_split_brain.sh

ha is ok

ha is ok

ha is ok

ha is ok

And when an anomaly is detected, the output becomes:

[root@slave-ha ~]# bash check_split_brain.sh

ha is ok

ha is ok

ha is ok

ha is ok

ha is brain.

ha is brain.

2) A keepalived split-brain pitfall (if iptables is enabled without a rule allowing the system to receive the VRRP protocol, split-brain occurs)
While building a keepalived+Nginx master/backup setup, after rebooting the backup machine both machines ended up holding the VIP, which means keepalived split-brain had occurred. The network connectivity between the two hosts was checked and found to be fine, so packets were captured on the backup machine:


[root@localhost ~]#  tcpdump -i eth0|grep VRRP 

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode 

listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes 

22:10:17.146322 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:17.146577 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:17.146972 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:18.147136 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:18.147576 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:25.151399 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:25.151942 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:26.151703 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:26.152623 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:27.152456 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:27.153261 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:28.152955 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:28.153461 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:29.153766 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:29.155652 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:30.154275 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:30.154587 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:31.155042 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:31.155428 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:32.155539 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:32.155986 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:33.156357 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:33.156979 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

22:10:34.156801 IP 192.168.1.96 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 50, authtype simple, intvl 1s, length 20 

22:10:34.156989 IP 192.168.1.54 > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 160, authtype simple, intvl 1s, length 20 

The backup machine could receive the VRRP advertisements sent by the master, so why was there still split-brain?

It turned out that iptables was enabled after the reboot. Checking the firewall configuration showed that the system was not accepting the VRRP protocol.

So iptables was modified to allow the system to receive VRRP traffic:

-A INPUT -i lo -j ACCEPT  

-----------------------------------------------------------------------------------------

I added the following iptables rules myself:

-A INPUT -s 192.168.1.0/24 -d 224.0.0.18 -j ACCEPT       # allow traffic to the multicast address
-A INPUT -s 192.168.1.0/24 -p vrrp -j ACCEPT             # allow VRRP (Virtual Router Redundancy Protocol) traffic

-----------------------------------------------------------------------------------------

Finally iptables was restarted, and the VIP disappeared from the backup machine.

The problem was solved, but note that the backup machine could clearly capture the VRRP advertisements from the master and still failed to change its own state. This only shows that the NIC receives the packets (and tcpdump sees them) before iptables processes and drops them.

3) Preventing the keepalived split-brain problem
1. A third-party arbitration method can be used. In a keepalived setup the state of each of the two machines depends on the other; if the communication between master and backup has a network problem, split-brain occurs, keepalived ends up with two masters, and they compete for resources.
2. Arbitration is generally introduced to solve this: each node must judge its own state. The simplest approach is to add a check to the keepalived configuration on both master and backup so that the server periodically pings the gateway; if the gateway cannot be reached, the node assumes the problem is on its own side.
3. The easiest way is to use keepalived's vrrp_script and track_script, as shown below:


# vim /etc/keepalived/keepalived.conf
   ......
   vrrp_script check_local {
       script "/root/check_gateway.sh"
       interval 5
   }
   ......
   track_script {
       check_local
   }

Script contents:

   # cat /root/check_gateway.sh
   #!/bin/sh
   VIP=$1
   GATEWAY=192.168.1.1
   /sbin/arping -I em1 -c 5 -s $VIP $GATEWAY &>/dev/null

check_gateway.sh is our arbitration logic: when it finds it cannot reach the gateway, the keepalived service is shut down with "service keepalived stop".

4) It is recommended to write your own script
Write a while loop that pings the gateway each round and counts consecutive failures; when the consecutive failures reach a threshold, run "service keepalived stop" to stop keepalived. If the gateway becomes reachable again, restart the keepalived service. Finally, add a check at the top of the script so that it does not run twice, and put the script into crontab.
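A minimal sketch of the loop described above (the gateway address, threshold and interval are placeholders, and the already-running check and crontab entry are omitted):

#!/bin/bash
GATEWAY=192.168.1.1
FAIL_LIMIT=3
fails=0
while true; do
    if ping -c 1 -W 2 "$GATEWAY" &>/dev/null; then
        fails=0
        # gateway reachable again: restart keepalived if it was stopped
        if ! pgrep -x keepalived &>/dev/null; then
            service keepalived start
        fi
    else
        fails=$((fails + 1))
        if [ "$fails" -ge "$FAIL_LIMIT" ]; then
            # several consecutive failures: assume the local link is broken
            service keepalived stop
        fi
    fi
    sleep 5
done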

------ Keepalived split-brain problem: case 2
When deploying an Nginx+Keepalived high-availability cluster, the following split-brain symptom may appear. The troubleshooting went as follows:


Checking the logs on the two nodes shows that both the Master machine and the Backup machine started in MASTER mode!

[root@Master-Ha ~]# tail -f  /var/log/messages

.........

.........

Dec 22 20:24:32 Master-Ha Keepalived_healtheckers[22518]:Activating healthchecker for server [xx.xx.xx.xx]

Dec 22 20:24:35 Master-Ha Keepalived_keepalived_vrrp[22519]: VRRP_instance(VI_I) Transition to MASTER STATE

Dec 22 20:24:38 Master-Ha Keepalived_keepalived_vrrp[22519]: VRRP_instance(VI_I) Entering MASTER STATE

Dec 22 20:24:40 Master-Ha Keepalived_keepalived_vrrp[22519]: VRRP_instance(VI_I) setting protocol VIPs

Dec 22 20:24:44 Master-Ha Keepalived_healtheckers[22518]:Netlink reflector reports IP xx.xx.xx.xx added

.........

.........

[root@Slave-Ha ~]# tail -f  /var/log/messages

.........

.........

Dec 22 20:24:34 Slave-Ha Keepalived_healtheckers[29803]:Activating healthchecker for server [xx.xx.xx.xx]

Dec 22 20:24:37 Slave-Ha Keepalived_keepalived_vrrp[29804]: VRRP_instance(VI_I) Transition to MASTER STATE

Dec 22 20:24:40 Slave-Ha Keepalived_keepalived_vrrp[29804]: VRRP_instance(VI_I) Entering MASTER STATE

Dec 22 20:24:43 Slave-Ha Keepalived_keepalived_vrrp[29804]: VRRP_instance(VI_I) setting protocol VIPs

Dec 22 20:24:47 Slave-Ha Keepalived_keepalived_vrrp[29804]: VRRP_instance(VI_I) Sending gratuitous ARPS on ens160 xx.xx.xx.xx

Dec 22 20:24:49 Slave-Ha Keepalived_healtheckers[22518]:Netlink reflector reports IP xx.xx.xx.xx added

.........

.........

From the logs of the two nodes above you can see that VRRP is advertisement-based: the Master sends an advertisement to the Backup at a fixed interval, and if the Backup does not receive it, it promotes itself to Master. So the root cause of the split-brain above is that the Backup did not receive the advertisements sent by the Master, and therefore it also became Master.

There is only one kind of VRRP control packet: the VRRP advertisement. It is encapsulated in IP multicast datagrams with group address 224.0.0.18 and is only propagated within the same LAN, which is why a VRID can be reused in different networks. To save bandwidth, only the master router sends periodic VRRP advertisements. A backup router starts a new round of VRRP election if it receives no advertisement for three consecutive advertisement intervals, or if it receives an advertisement with priority 0.

Another note: after installing Keepalived on CentOS 7, if the firewall is not disabled, the VRRP multicast IP 224.0.0.18 must be allowed through the firewall. Otherwise the virtual IP cannot fail over, both machines become Master, and hot standby does not work. [This is similar to the situation in case 1 above.]


If you are not used to the default firewalld on CentOS 7, you can disable it and use iptables instead:

# systemctl stop firewalld.service
# systemctl disable firewalld.service
# yum install -y iptables-services
# vim /etc/sysconfig/iptables

Add the following lines to the file:

-A OUTPUT -o eth0 -d 224.0.0.18 -j ACCEPT
-A OUTPUT -o eth0 -s 224.0.0.18 -j ACCEPT
-A INPUT -i eth0 -d 224.0.0.18 -j ACCEPT
-A INPUT -i eth0 -s 224.0.0.18 -j ACCEPT

# service iptables restart
# systemctl enable iptables.service

Now the virtual IP can fail over: when the Master goes down the VIP drifts to the Backup, and when the Master comes back up the VIP drifts back.


Origin blog.csdn.net/ensp1/article/details/122021965