Basic Concepts of High Availability Clusters

1. What is a high availability cluster

       High Availability Cluster (HA Cluster for short) refers to server cluster technology aimed at reducing service interruption time. By keeping the user's business applications running without interruption, it minimizes the impact that software, hardware, and human-caused failures have on the business.

       Simply put: it keeps the service running without interruption, so that, for example, you can shop on Taobao at any time and open a chat in WeChat at any time.

2. Metrics for High Availability Clusters

       It is almost impossible to keep a cluster service 100% available at all times. For example, during Taobao's Double Eleven event in recent years, so many people flooded in at once that the traffic spike caused problems: some users could not pay after placing an order. So in practice we can only keep the service as available as possible, although in some scenarios 100% availability can still be achieved.

       Usually the mean time to failure (MTTF) is used to measure the reliability of the system, and the mean time to repair (MTTR) to measure its maintainability. Availability is then defined as: HA = MTTF/(MTTF+MTTR) * 100%.

      Specific HA metrics:

Description                              Colloquial name   Availability level   Annual downtime
Basic availability                       2 nines (2 9s)    99%                  87.6 hours
High availability                        3 nines (3 9s)    99.9%                8.8 hours
Availability with failover capability    4 nines (4 9s)    99.99%               53 minutes
Very high availability                   5 nines (5 9s)    99.999%              5 minutes
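The annual-downtime column follows directly from the availability level: a year has 8760 hours, and downtime is simply the unavailable fraction of it. A quick sanity check from any shell (bc is used here only as a calculator; the numbers just restate the table above):

    # 3 nines of availability over one year (8760 hours)
    echo "8760 * (1 - 0.999)" | bc -l          # about 8.76 hours of downtime
    # 4 nines, expressed in minutes
    echo "8760 * 60 * (1 - 0.9999)" | bc -l    # about 52.6 minutes (the table rounds to 53)
    # 5 nines, in minutes
    echo "8760 * 60 * (1 - 0.99999)" | bc -l   # about 5.3 minutes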

 

3. Implementation principle of high availability cluster

      High-availability clusters mainly implement automatic fault detection (Auto-Detect), automatic switchover/failover (FailOver), and automatic recovery (FailBack).

      To put it simply, high-availability cluster software is used to automate failure checking and failover (failure/backup host switching). Of course, load balancing and DNS distribution can also provide high availability.

3-1. Auto-Detect/Fault Check

     In the automatic detection stage, the software on each host monitors the other hosts' running state over redundant detection links, using monitoring programs and logical checks.

     The most common method is to use heartbeat messages exchanged between the cluster nodes to judge whether a node has failed.

3-1-1. The question: when one group of nodes and another group can no longer receive each other's heartbeat messages, how do we decide which part is running normally and which part has failed and needs to be isolated (to avoid cluster split-brain)?

      This is decided by quorum: when a node failure occurs, the nodes vote to determine which partition is healthy, and a partition whose vote count is greater than half of the total is considered legitimate.

Number of votes:

       Each node can be assigned a number of votes, i.e. the weight it contributes when deciding whether a partition of the cluster is legitimate (healthy). The value can be higher or lower; for example, nodes with better hardware or other advantages can be given more votes.

Quorum:

       As long as a node can still exchange heartbeat messages with another node, it counts that node's votes; the total number of votes a node (or partition) can collect in this way is its quorum.
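As a concrete illustration, corosync's votequorum subsystem implements exactly this votes/quorum idea. Below is a minimal sketch of the relevant corosync.conf fragments for a hypothetical three-node cluster; the node names, addresses and the extra vote for node1 are made-up values, and the exact keys can vary between corosync versions:

    # /etc/corosync/corosync.conf (fragment, corosync 2.x style)
    nodelist {
        node {
            ring0_addr: node1
            nodeid: 1
            quorum_votes: 2    # a stronger node can be given more votes
        }
        node {
            ring0_addr: node2
            nodeid: 2
            quorum_votes: 1
        }
        node {
            ring0_addr: node3
            nodeid: 3
            quorum_votes: 1
        }
    }
    quorum {
        provider: corosync_votequorum   # a partition has quorum when its votes exceed half of the total
    }

The current vote count and quorum state of a running cluster can then be inspected with corosync-quorumtool -s.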

3-1-2. The special case: a two-node cluster, or a split where both sides hold an equal number of votes

       In this case an external reference point can be used, such as pinging a gateway (which acts as a referee node): if a node can reach the reference point but cannot communicate with its peer, the peer is probably the faulty one; if it cannot even reach the reference point, the node itself is probably at fault. Another option is an arbitration device such as a quorum disk: each node writes to the disk at a fixed interval, and if one node detects that the other has stopped writing, it assumes the other node has failed.

      However, it is better to build the cluster with an odd number of nodes (2n+1). When the cluster splits into partitions, any partition holding no more than half of the nodes (≤ n) automatically stops providing external services.
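For the two-node special case, corosync's votequorum also has dedicated options, which can be used in addition to a ping reference node (ocf:pacemaker:ping in Pacemaker) or a quorum disk (qdiskd / corosync-qdevice, depending on the stack). A sketch, again with hypothetical values:

    # /etc/corosync/corosync.conf (fragment) - two-node cluster
    quorum {
        provider: corosync_votequorum
        two_node: 1        # let a single surviving node keep quorum
        wait_for_all: 1    # but only after both nodes have been seen at least once
    }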

3-1-3. Paxos algorithm and Zookeeper

      Regarding "voting", it is worth knowing the famous Paxos algorithm and Zookeeper:

Paxos algorithm:

      The Paxos algorithm solves the problem of ensuring that each node in the cluster executes the same sequence of operations, which can ensure data consistency in a distributed cluster.

      For example, write operations are globally numbered through voting: at any moment only one write operation can be approved, concurrent write operations compete for votes, and only the write that gains more than half of the votes is approved (so at most one write is ever approved at a time). Writes that lose the competition must initiate another round of voting. Through round after round of voting, all write operations end up strictly numbered and ordered.

      The numbers increase strictly. If a node has accepted a write operation numbered 100 and then receives a write numbered 99 (due to network delay or other unforeseen reasons), it immediately realizes that its data is inconsistent, automatically stops serving external requests, and restarts the synchronization process.

      The failure of individual nodes does not affect the data consistency of the whole cluster (with 2n+1 nodes in total, consistency holds as long as no more than n nodes fail).

Zookeeper:

      Zookeeper is an independent component in the Hadoop big-data ecosystem. It is an open-source counterpart of Google's Chubby and can be regarded as an implementation of (an algorithm similar to) Paxos.

      Zookeeper mainly provides distributed coordination services, based on which distributed applications can implement service registration (high availability), synchronization services, configuration maintenance, and naming services.

      What Zookeeper actually provides is something like the directory tree of an ordinary file system, except that its entries (znodes) can be created/deleted/modified/read atomically; to build a distributed coordination service on top of it, you write your own program to operate on the contents of this "directory tree".
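This "directory tree" can be explored directly with the zkCli.sh shell that ships with Zookeeper; the znode path and data below are made-up examples:

    # connect to a (hypothetical) ensemble member
    zkCli.sh -server zk1:2181

    # inside the shell: znodes behave like directories that also carry data
    create /myapp "config-v1"     # atomic create
    ls /                          # list children of the root znode
    get /myapp                    # read the data stored in the znode
    set /myapp "config-v2"        # atomic update
    delete /myapp                 # atomic delete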

      Why can Zookeeper provide coordination services as a distributed system?

      Most importantly, Zookeeper itself runs as a stable, highly available cluster consisting of multiple Zookeeper nodes.

       The high availability of the Zookeeper cluster and the consistency of the "directory" data of each node are guaranteed based on a voting mechanism similar to that implemented by the Paxos algorithm.

      Therefore the number of Zookeeper cluster nodes is preferably odd (2n+1). When the cluster splits, a partition whose node count does not exceed half of the total (≤ n) automatically stops providing external services.
       for example:

      Five ZK nodes form a cluster. If it splits into a 2-node partition and a 3-node partition, the 2-node partition automatically stops serving requests while the 3-node partition continues to provide service.

      If instead a ZK cluster is built with 6 nodes and it splits into two 3-node partitions, both partitions stop serving requests; its fault tolerance is therefore no better than that of a 5-node cluster. Hence clusters should be built with an odd number (2n+1) of nodes.
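The 2n+1 sizing above is exactly what a Zookeeper ensemble configuration encodes: every member lists all voters. A minimal zoo.cfg sketch for a five-node ensemble (hostnames and paths are placeholders):

    # conf/zoo.cfg - identical on all five members
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # voters: peer port 2888, leader-election port 3888
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888
    server.4=zk4:2888:3888
    server.5=zk5:2888:3888
    # each host also needs a myid file in dataDir containing its own server number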

3-2. Automatic switching/failover (FailOver)

      In the automatic switching stage, once a host confirms that its peer has failed, the healthy host not only continues its original work but also, according to the configured fault-tolerance/backup mode, takes over the peer's preset tasks and carries on the subsequent processes and services.

      In layman's terms: when A can no longer serve customers, the system switches over automatically so that B continues serving them promptly, and customers do not notice that the party serving them has changed.

      After a node failure has been determined as described above, the high-availability cluster resources (such as the VIP, httpd, etc., see below) are transferred from the cluster node that has lost quorum to the failover domain (Failover Domain: the set of nodes that may receive the failed-over resources).

3-2-1. High Availability Cluster Resource (HA Resource) and Cluster Resource Type

Cluster resources are the rules, services, and devices used in the cluster, such as VIPs, httpd services, STONITH devices, etc. The types are as follows (a configuration sketch follows the list):

        1. primitive: the basic/main resource type; at any given time it can run on only one node, e.g. a VIP.

        2. group: a resource container that makes several resources start/stop together; it generally contains primitive resources.

        3. clone: a resource that can run on several nodes at once, e.g. the management process of a STONITH device, or the distributed lock manager (dlm) of a cluster file system, which should run on all nodes.

        4. master/slave: a special clone resource that runs on two nodes, one as master and one as slave, e.g. the distributed replicated block device DRBD (merged into the kernel since 2.6.33).
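As a rough sketch of how these resource types look in practice, here are Pacemaker definitions using the pcs tool (pcs 0.9-style syntax; the IP addresses, resource names and options are placeholders and vary by version):

    # primitive: a VIP that runs on exactly one node at a time
    pcs resource create webip ocf:heartbeat:IPaddr2 ip=192.168.10.100 cidr_netmask=24 \
        op monitor interval=30s

    # primitive: an httpd service managed through the OCF apache agent
    pcs resource create webserver ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf \
        op monitor interval=30s

    # group: start/stop webip and webserver together, on the same node
    pcs resource group add webservice webip webserver

    # clone: a resource that should run on every node (here a ping check;
    # newer pcs versions use a trailing "clone" keyword instead of --clone)
    pcs resource create gw-ping ocf:pacemaker:ping host_list=192.168.10.1 --clone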

3-2-2. Which node to transfer to

The target node is chosen according to the resources' preferences (resource stickiness and the score comparison of location constraints):

Resource preferences (the basis for placing resources):

 A. Resource stickiness: how strongly a resource prefers the node it is currently running on, expressed as a score; positive values favor staying on the current node (evaluated together with location constraints).

 B. Resource constraints (Constraint): the relationships between resources, and between resources and nodes:

a. Colocation constraint (colocation): dependency or mutual exclusion between resources, defining whether resources run on the same node; a positive score means run together, a negative score means run apart.

b. Location constraint (location): each node is given a score; positive values attract the resource to that node, negative values push it toward other nodes. The scores of all nodes are compared, and the resource goes to the node with the highest value.

c. Order constraint (order): defines the order in which resources perform actions, e.g. the VIP is brought up first and the httpd service afterwards. Special score values: -inf (negative infinity) and inf (positive infinity).

       In other words, resource stickiness defines a resource's preference for the node it currently runs on, while location constraints define its preference for every node in the cluster. For example, if webip has a resource stickiness of 100 and a location constraint score of 200 for node1, then when webip is running on node2 and node1 comes back online, the resource is transferred to node1, because the stickiness of 100 for the current node node2 is less than the location score of 200 for node1. If instead webip has a stickiness of 200 and a location score of 100 for node1, then when webip is running on node2 and node1 comes back online, the resource is not transferred and stays on node2, because the stickiness of 200 for the current node is greater than the location score of 100 for node1.
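A sketch of how stickiness and the three constraint types are expressed with pcs, continuing the hypothetical webip/webserver resources above (the scores and node names are illustrative):

    # resource stickiness: how strongly resources prefer to stay where they are
    pcs resource defaults resource-stickiness=100

    # location constraint: webip prefers node1 with score 200
    pcs constraint location webip prefers node1=200

    # colocation constraint: webserver must run where webip runs (INFINITY = mandatory)
    pcs constraint colocation add webserver with webip INFINITY

    # order constraint: bring up the VIP first, then the web server
    pcs constraint order webip then webserver

With stickiness 100 and a node1 location score of 200, webip moves back to node1 when that node recovers, exactly as in the example above; raising the stickiness above 200 would keep it on the current node instead.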

3-3. Automatic recovery/FailBack

      In the automatic recovery phase, once the healthy host has taken over for the faulty one, the faulty host can be repaired offline. After repair it reconnects to the cluster over the redundant communication links, and the service can be switched back to the repaired host automatically.

3-3-1. After the fault is fixed, should the service fail back?

      This is governed by resource stickiness and resource constraints. Since the standby machine is usually only a backup and its performance is often lower than the primary's, it makes sense to switch back once the primary has recovered. On the other hand, failing back means transferring resources again, which disturbs clients currently being served, so the cost is not negligible; whether to fail back should therefore be decided according to the actual situation.
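Whether to fail back is an explicit knob in most stacks. In heartbeat v1/v2 it is the auto_failback directive in ha.cf; with Pacemaker the same effect is usually obtained through resource stickiness. A sketch of both, assuming the resources used earlier:

    # /etc/ha.d/ha.cf (heartbeat) - fragment
    auto_failback on       # switch back to the primary once it recovers ("off" to stay put)

    # Pacemaker equivalent in spirit: a very high stickiness prevents automatic failback
    pcs resource defaults resource-stickiness=INFINITY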

3-4. Other concerns

3-4-1. If a node is no longer a legitimate member of the cluster, what should be done with the resources running on it?

      If the node has not yet been isolated by Fencing/STONITH, the behaviour of the partition that has lost quorum can be configured (the no-quorum-policy option in Pacemaker), with the following choices (see the command sketch after this list):

1. stop: stop the services in the partition directly (this is the default);

2. ignore: ignore the loss of quorum; whatever was running keeps running (two-node clusters usually need this option);

3. freeze: keep serving connections that were already established, but accept no new requests;

4. suicide: the nodes in the affected partition kill (fence) themselves.
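A command sketch of setting this property with pcs (pick whichever value fits the scenario):

    # two-node clusters are often run with loss of quorum ignored
    pcs property set no-quorum-policy=ignore

    # stricter alternatives
    pcs property set no-quorum-policy=stop      # stop all resources in the partition (default)
    pcs property set no-quorum-policy=freeze    # keep existing work, accept nothing new
    pcs property set no-quorum-policy=suicide   # fence the nodes in the partition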

3-4-2. Cluster split-brain (Split-Brain) and resource isolation (Fencing)

       Split-brain results from a cluster partition. When a node temporarily stops responding because its CPU is busy or for other reasons, its heartbeat with the other nodes fails even though the node itself is still active; the other nodes may then mistakenly conclude that it "is dead" and start competing for shared resources (such as shared storage), so the cluster splits into independent parts.

       Consequences of split-brain: the two sides compete for the shared resources, leading to system confusion and data corruption.

       Split-brain solution: The above 3-1-1 and 3-1-2 methods can also solve the split-brain problem to a certain extent, but a complete solution requires resource isolation (Fencing).

       Resource isolation (Fencing):

              When the status of a node cannot be determined, the other side is killed via fencing to make sure the shared resources are completely released; this requires a reliable fence device.

   Node level:

            STONITH (Shoot The Other Node In The Head, a hardware-level method): directly cuts or cycles the power of the faulty node, which is thorough and absolute.

    Resource level:

            For example, an FC SAN switch (a software-level method) can deny a node access at the storage-resource level.
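A sketch of node-level fencing in Pacemaker using a power-based fence agent. fence_ipmilan is a real agent, but its parameter names differ between versions (older releases use ipaddr/login/passwd, newer ones ip/username/password), and the addresses and credentials below are placeholders:

    # define one STONITH device per node to be fenced
    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list=node1 ip=192.168.10.201 username=admin password=secret lanplus=1 \
        op monitor interval=60s

    # fencing must be enabled for the cluster to actually use it
    pcs property set stonith-enabled=true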

4. High-availability cluster working model

4-1. Active/Passive: active/standby model

       One node is the active master and the other is an inactive standby. When the master fails, the service is transferred to the standby node, which becomes the new master. The standby node is pure redundancy, which wastes some capacity. Data also needs to be kept in sync between the two nodes, e.g. via MySQL master/slave replication or DRBD.

4-2. Active/Active: Dual main model

       Both nodes are active: each runs a different service and acts as the standby for the other. They can also provide the same service, e.g. ipvs with DNS round-robin at the front end. This model makes fuller use of the hosts, with no idle standby.

4-3、N+1

      N active master nodes running N services, plus one standby node. The extra standby node must be able to replace any of the master nodes: when any master fails, the standby takes over its role and provides the corresponding service. For instance, a single standby node can act as the DRBD standby for the first two masters and as the MySQL standby for the third.

4-4、N+M

      N active master nodes and M standby nodes. In the N+1 model above a single standby may not provide enough redundancy, so the number of standby nodes M is a compromise between cost and reliability requirements.

      There is also another reading, N-M: N nodes run M services with N > M; M nodes are active and the remaining N-M nodes serve as standbys.

4-5、N-to-1

      This is similar to N+1: again N active master nodes and one standby node. The difference is that the standby becomes a master only temporarily; once the original faulty node is repaired, the service must fail back to it.

4-6、N-to-N

      N primary nodes and N backup nodes, a combination of the A/A dual-master and N+M models: all N nodes run services, and if one goes down, any of the remaining nodes can take over. With shared storage available, every node can be a failover target, and Pacemaker can even run multiple copies of a service to spread the workload.

5. High availability cluster architecture level

    

5-1. Node host layer

       This layer consists of the services running on the physical hosts; the high-availability cluster software runs on every host, and the cluster resources live on these hosts as well.

5-2、Messaging and Membership Layer

       The messaging layer is the mechanism that transmits cluster information. Listening on a UDP port (heartbeat, for example, uses UDP port 694), it passes information in real time via unicast, multicast, or broadcast. What it carries are the cluster transactions of the HA cluster, such as heartbeat messages and resource transaction information; this layer only transmits the information, it does not evaluate or compare it.

       Membership layer: its most important job is to let the master node (DC) build a complete membership view from the information supplied by the Messaging layer, via the Cluster Consensus Membership service (CCM or CCS). This layer links the layers above and below: upwards, it turns the information produced by the messaging layer into a membership graph and passes it to the upper layer so that each node's working state is known; downwards, it carries out the upper layer's decision to isolate a particular node or device.
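With corosync, the messaging layer described here is configured in the totem section of corosync.conf (heartbeat's counterpart is ha.cf, shown earlier). A multicast sketch with placeholder addresses:

    # /etc/corosync/corosync.conf (fragment)
    totem {
        version: 2
        cluster_name: webcluster
        transport: udp              # multicast; "udpu" would switch to unicast
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.10.0    # network the cluster ring binds to
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
    }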

5-3、CRM(Cluster Resource Manager)

      The cluster resource manager layer mainly provides high availability for services that are not themselves HA-aware. It relies on the Messaging Layer and therefore sits on top of it.

      The resource manager's main job is to collect the node information passed up by the messaging layer, perform the calculation and comparison on it, and trigger the corresponding actions, such as starting or stopping services, transferring resources, and defining and allocating resources.

      Every node runs a CRM, and every CRM maintains a CIB (Cluster Information Base). Only the CIB on the master node (DC) may be modified; the CIBs on the other nodes are copies replicated from it.

       The CRMs elect one node to do the calculation and comparison; it is called the DC (Designated Coordinator). The calculation is carried out by the PE (Policy Engine), and the actions derived from the result are driven by the TE (Transition Engine).

       Each node also has an LRM (Local Resource Manager), a sub-component of the CRM; it receives the transitions passed down by the TE and performs the corresponding actions on its node, such as running RA scripts.
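The DC, the policy engine's decisions and the CIB can all be observed with standard Pacemaker tools; the commands below exist in any Pacemaker installation, though their output format varies by version:

    crm_mon -1              # one-shot cluster status; the header shows which node is the DC
    cibadmin --query        # dump the current CIB (XML) that is replicated to all nodes
    crm_verify -L -V        # have the policy engine check the live configuration for errors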

5-4、RA (Resource Agent)

      The resource agent layer is simply the scripts that manage cluster resources: scripts that implement start, stop, restart, and status queries. They are invoked by the LRM (local resource manager).

      Resource agents are divided into the following classes (an LSB-style skeleton follows the list):

1. Legacy heartbeat (the resource management scripts of heartbeat v1);

2. LSB (Linux Standard Base): mainly the scripts in the /etc/init.d/* directory, supporting start/stop/restart/status;

3. OCF (Open Cluster Framework): more specialized and more general-purpose than LSB; besides the four operations above it also supports cluster operations such as monitor and validate-all. The OCF specification is at http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD .

4. STONITH: agents that implement node isolation (fencing).
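A minimal LSB-style agent is nothing more than an init script that answers these verbs. The sketch below manages a hypothetical daemon /usr/sbin/mydaemon and only illustrates the expected interface:

    #!/bin/bash
    # /etc/init.d/mydaemon - minimal LSB-style resource agent sketch
    PIDFILE=/var/run/mydaemon.pid

    case "$1" in
        start)
            /usr/sbin/mydaemon &          # hypothetical daemon, run in the background
            echo $! > "$PIDFILE"
            ;;
        stop)
            [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
            ;;
        restart)
            "$0" stop
            "$0" start
            ;;
        status)
            # LSB convention: exit 0 if running, 3 if stopped
            if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
                echo "running"; exit 0
            else
                echo "stopped"; exit 3
            fi
            ;;
        *)
            echo "Usage: $0 {start|stop|restart|status}"; exit 2
            ;;
    esac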

6. High availability cluster software

6-1. Messaging Layer cluster information layer software

1、heartbeat (v1, v2)

2、heartbeat v3

Was split into three projects: heartbeat, pacemaker, and cluster-glue

3、corosync

Project detached from OpenAIS.

4、cman

5、keepalived

Typically used for two-node clusters

6、ultramonkey

6-2. CRM Cluster Resource Manager Software

1、haresources

Included with heartbeat v1/v2; configured through the text configuration file haresources

2、crm

Included with heartbeat v2; can be configured with crmsh or heartbeat-gui

3、pacemaker

A project split out of heartbeat v3. Configuration interfaces: CLI: crmsh, pcs; GUI: hawk (web GUI), LCMC, pacemaker-mgmt

4、rgmanager

Included with cman; uses rgmanager (resource group manager) for management and has the Failover Domain feature. It can also be managed with the RHCS (Red Hat Cluster Suite) tooling: Conga (luci/ricci) provides a full life-cycle interface, so after installing Conga you can use it to install the HA software and then configure it.

6-3. Commonly used combinations

heartbeat v2 + haresources (or crm) (Note: generally used on CentOS 5.X)

heartbeat v3+pacemaker (Note: generally used in CentOS 6.X)

corosync+pacemaker (Note: the most commonly used combination now)

cman + rgmanager (Note: components in the Red Hat Cluster Suite, including gfs2, clvm)

keepalived+lvs (Note: commonly used for high availability of lvs)
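For the keepalived + lvs combination, high availability of the director is configured entirely in keepalived.conf. The sketch below shows one VRRP instance plus one LVS virtual server; all addresses, the interface name and the password are placeholders, and the BACKUP machine would use state BACKUP with a lower priority:

    # /etc/keepalived/keepalived.conf (MASTER side)
    vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 51
        priority 100
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass 1111
        }
        virtual_ipaddress {
            192.168.10.100
        }
    }

    virtual_server 192.168.10.100 80 {
        delay_loop 6
        lb_algo rr              # round-robin scheduling
        lb_kind DR              # LVS direct-routing mode
        protocol TCP
        real_server 192.168.10.11 80 {
            TCP_CHECK {
                connect_timeout 3
            }
        }
    }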

 

7. Shared storage

       Multiple nodes in a high-availability cluster need to access data. If all nodes access the same data files in the same storage space, there is only one shared copy of the data, and that storage space is the shared storage.

       For Web or MySQL high-availability clusters, the data generally needs to live on shared storage that both the master and the slave nodes can access. This is not mandatory, though: block data on the master and slave can instead be synchronized with rsync or DRBD, which costs less than shared storage. Which approach to use depends on the actual scenario. The types of shared storage are briefly described below:

7-1. DAS (Direct attached storage)

       The storage device attaches directly to the host bus; the cable distance is limited, switching it to another host requires re-mounting, and data transfers between hosts and the device add some latency;

       Sharing here is implemented at the block-device driver level. Locks are held locally on each node host and cannot be communicated to the other nodes, so if a multi-node active cluster writes data simultaneously, severe data corruption occurs; a two-node cluster likewise runs into trouble when it splits;

       Common storage devices: RAID arrays, SCSI arrays.

7-2. NAS (network attached storage)

       File-level sharing: the storage device exports a file system to every node of the cluster as the shared storage service. It is an application-layer service that communicates using client/server protocols.

       Common protocols/file services: NFS, FTP, CIFS, etc. With NFS-based shared storage, for example, each node requests files from the shared storage via the NFS protocol.
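A minimal NFS shared-storage sketch: one export on a hypothetical storage host and the corresponding mount on each cluster node (paths and addresses are placeholders):

    # on the storage server: /etc/exports
    /data  192.168.10.0/24(rw,sync,no_root_squash)

    # re-export after editing /etc/exports (the NFS service itself must already be running)
    exportfs -ra

    # on every cluster node: mount the same export
    mount -t nfs 192.168.10.50:/data /var/www/html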

7-3. SAN (storage area network, storage area network)

        Block-level sharing: the transport network is made to behave like a SCSI (Small Computer System Interface) bus. Both the node host (initiator) and the SAN host (target) need SCSI drivers, and SAN packets are carried through network tunnels, so the storage devices attached to the SAN host do not necessarily have to be SCSI devices themselves.

        Commonly used SANs: FC (Fibre Channel, whose optical switch ports are expensive, so the cost is high) and IP SAN (iSCSI: block-level, fast access, and cheap).
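An IP SAN sketch from the initiator side, using the standard open-iscsi tools; the portal address and IQN are placeholders, and the target must already be exported on the storage host:

    # discover the targets offered by the portal
    iscsiadm -m discovery -t sendtargets -p 192.168.10.50:3260

    # log in to a discovered target; the LUN then appears as a local block device (e.g. /dev/sdb)
    iscsiadm -m node -T iqn.2024-01.com.example:storage.lun1 -p 192.168.10.50:3260 --login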
