Four levels of load balancing SLB high availability

Load balancing distributes traffic across multiple ECS instances to improve the service capability of an application system, and has long been the entry point for critical business systems. Taobao, Tmall, Alibaba Cloud, and other services all rely on load balancing products, and handling the Double 11 traffic peak also depends on the scheduling and processing capabilities of load balancing.

A brief introduction to load balancing SLB

The following figure is a simple schematic diagram of load balancing. A user's access request is forwarded to a backend ECS through a listener (port) of the SLB instance. An SLB instance corresponds to one IP address, and a listener is a port of that IP address on the instance. Traffic scheduling is performed per listener (port), and the ECS actually processes the service requests.

[Figure: schematic diagram of load balancing SLB]

Load Balancing SLB Architecture

The following figure shows the architecture of load balancing SLB from the point of view of the traffic forwarding path; it can be regarded as a load balancing SLB cluster. The cluster is deployed across two availability zones in the China East 1 region, and each availability zone contains an LVS cluster and a Tengine cluster. The LVS cluster receives all traffic requests, including TCP, UDP, HTTP, and HTTPS. For TCP and UDP requests, the LVS cluster forwards the traffic directly to the backend ECS; for HTTP/HTTPS requests, it forwards the traffic to the Tengine cluster, which then forwards it to the backend ECS.

[Figure: load balancing SLB cluster architecture]

Four levels of load balancing SLB high availability

To make it easier to understand and explain, load balancing SLB high availability can be divided into four layers. As shown in the figure above, the first layer is the application processing layer, that is, the ECS layer that actually processes requests. The second layer is the cluster forwarding layer, where high availability must be guaranteed for the forwarding clusters themselves. The third layer is the cross-availability-zone disaster recovery layer, which allows switching in the event of an availability-zone-level failure. The fourth layer is the cross-region disaster recovery layer, which is not marked in the figure above. The following sections analyze and describe each of these four layers.

Note in particular that SLB is designed to be as highly available as possible at the product level, but if the user's system itself has no high-availability design, true high availability ultimately cannot be achieved; beyond load balancing there are also high-availability concerns such as databases. The high availability of the product design and the high availability of the user's system design must therefore be combined. The descriptions of the four layers below are analyzed from both the product design perspective and the user perspective.

Layer 1: Application Processing Layer

The application processing layer is where user requests are actually processed; applications are generally deployed on ECS instances.

From a product design perspective:

The high availability of the application processing layer mainly involves two points. First, when an ECS fails, it must be shielded in time so that traffic is not forwarded to the faulty node and user access is not affected; SLB implements this through the health check function. Second, multiple ECS instances can be added to one SLB instance, especially ECS instances in different availability zones, so that user access is not affected even when none of the ECS instances in one availability zone is available. Both points are supported by SLB today.

From the user's point of view:

First, users must enable and correctly configure health checks. SLB supports four protocols: TCP, UDP, HTTP, and HTTPS. For TCP listeners, both TCP and HTTP health check methods are provided, and users can choose according to their needs. For UDP listeners, the user can define the UDP health check port, and can also judge the health check result based on a self-defined health check request and expected response. For HTTP and HTTPS listeners, HTTP health check is used by default: the user defines a health check URL, and the load balancing health check module probes it with an HTTP HEAD request to obtain the status. For details about health checks, please refer to the description of the health check principle.
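As an illustration only (not SLB's actual implementation), a health-check probe of the kind described, an HTTP HEAD request against a user-defined URL, might look like the sketch below; the `/health` path, the timeout, and the "2xx/3xx means healthy" rule are all assumptions:

```python
import http.client


def http_health_check(host, port, path="/health", timeout=2.0):
    """Probe a backend with an HTTP HEAD request, in the style of the
    SLB health-check module described above. Returns True when the
    backend responds and looks healthy, False otherwise."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("HEAD", path)
        status = conn.getresponse().status
        conn.close()
        # Assumption: 2xx/3xx counts as healthy; 4xx/5xx or no answer does not.
        return 200 <= status < 400
    except OSError:
        return False
```

A real health checker would also apply consecutive-success and consecutive-failure thresholds before marking a backend up or down, rather than acting on a single probe.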

Second, users need to consider the case where the ECS instances in a single availability zone are unavailable, and should add ECS instances from multiple availability zones to the load balancing SLB instance. Users may ask: if the ECS instances in one availability zone are unavailable, will the load balancer in that availability zone also be unavailable? This question is answered in the third layer below.

Layer 2: Cluster Forwarding Layer

The cluster forwarding layer refers to the SLB clusters that forward user requests, including the LVS clusters and Tengine clusters. Whether a machine fails or a cluster is upgraded, user requests must be kept uninterrupted as far as possible.

From a product design perspective:

First of all, single points of failure must be avoided. If the traditional active-standby switchover model were adopted, user requests could be affected during a switchover, and the model would still rely on the processing capability of a single machine and thus could not scale horizontally. Therefore the forwarding layer, whether LVS or Tengine, is deployed as a cluster, and user request traffic is spread across multiple machines in the cluster through ECMP equal-cost routing on the upstream switch. Looking at the architecture diagram above, users may wonder: the Tengine cluster sits behind the LVS cluster, so how does the switch's ECMP apply to it? In fact, the LVS cluster also runs health checks against the Tengine cluster; if a machine in the Tengine cluster becomes abnormal, the LVS health check detects this and removes the abnormal Tengine machine from the cluster.
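To illustrate how ECMP keeps all packets of one connection pinned to the same LVS machine, here is a toy flow-hashing sketch; the node names are made up, and real switches implement this in hardware with their own hash functions:

```python
import hashlib


def ecmp_pick(next_hops, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Pick a next hop for a flow by hashing its 5-tuple.
    The same connection always hashes to the same LVS node, while
    different connections spread across the whole cluster."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]
```

Note that this simple modulo scheme reshuffles many flows when a hop is added or removed; production ECMP implementations often use resilient or consistent hashing to limit that disruption.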

Second, when an exception such as a machine going down occurs in the cluster, user requests should be affected as little as possible. Once a cluster is deployed, software and hardware failures on its machines are unavoidable: a machine may become completely unavailable, or may be abnormal and need to be removed from the cluster and repaired through operations work. In either case, the user requests on that machine should be interrupted as little as possible.

The SLB cluster uses session synchronization to ensure that user requests are affected as little as possible when a machine in the cluster fails. As shown in the figure below, when a user's access request passes through one LVS machine in the cluster, the cluster synchronizes the session to the other LVS machines according to preset rules, so that all machines in the LVS cluster hold the session. When the LVS1 machine fails or needs maintenance, user requests are forwarded through the other LVS machines and the user is unaware of the change. However, session synchronization does not solve every problem. It handles most long-lived connections, but for short-lived connections a machine failure may still affect user requests when the connection has not yet been established (the three-way handshake has not completed), or when the connection is established but the session synchronization rule has not yet been triggered. The user's system and programs therefore need to cooperate.
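The session-synchronization idea can be modeled with a toy sketch, assuming a simple replicate-to-all-peers rule; real LVS session sync has its own protocol, triggering rules, and timing, so the class and rule here are purely illustrative:

```python
class LvsNode:
    """Toy model of LVS session synchronization: each node keeps a
    session table mapping a client flow to its backend ECS, and
    replicates new entries to its cluster peers so that any peer can
    take over the flow if this node fails."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # (client_ip, client_port) -> backend ECS
        self.peers = []

    def handle_new_flow(self, flow, backend):
        self.sessions[flow] = backend
        for peer in self.peers:  # assumed sync rule: replicate to all peers
            peer.sessions[flow] = backend

    def forward(self, flow):
        # None means the flow is unknown here, e.g. sync had not yet fired.
        return self.sessions.get(flow)
```

The gap described in the text corresponds to a flow that failed over before `handle_new_flow` had replicated it: the surviving node returns `None` and the connection is lost, which is exactly why client-side retries are needed.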

[Figure: session synchronization within the LVS cluster]

From the user's point of view: be sure to add a corresponding retry mechanism to your code. This further reduces the impact on user access when the situations above occur.
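A minimal retry sketch, assuming a plain HTTP GET; the attempt count and backoff values are arbitrary starting points that should be tuned for the actual service:

```python
import time
import urllib.error
import urllib.request


def get_with_retry(url, attempts=3, backoff=0.5):
    """Fetch a URL, retrying on connection errors with exponential
    backoff. A connection dropped mid-flight (e.g. during an SLB node
    failover) is retried instead of surfacing straight to the user."""
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError):
            if i == attempts - 1:
                raise  # out of attempts: let the caller handle it
            time.sleep(backoff * 2 ** i)
```

Retries are only safe for idempotent operations; for requests such as payments, the application must pair retries with deduplication (for example, an idempotency key).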

Layer 3: Cross-AZ Disaster Recovery Layer

The previous layer covered the high availability of the load balancing forwarding clusters within one availability zone. The cross-availability-zone disaster recovery layer addresses a different problem: when one availability zone becomes unavailable, the load balancer in another availability zone must be able to continue providing service.

From a product design perspective:

Look at the SLB architecture diagram again. First, the load balancing clusters must be deployed across availability zones. In addition, a mechanism is needed to detect in time that one availability zone, say Availability Zone A, has failed, and to switch to another availability zone, Availability Zone B, to continue service. Technically this is implemented with route detection and route priority: an address segment in Availability Zone A (corresponding to a batch of SLB instances 1-N) is configured not only on the cluster in that availability zone, but also, at the underlying level, on the cluster in Availability Zone B. Under normal circumstances, because the address in Availability Zone A has a higher routing priority, all traffic is forwarded by the cluster in Availability Zone A. If the router detects that the entire Availability Zone A is unavailable (its routes are unreachable), or that a certain address segment in Availability Zone A is unreachable, traffic is routed to the cluster in Availability Zone B for forwarding.
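The route-detection and route-priority mechanism can be sketched as choosing, among the reachable advertisements of the same address segment, the one with the best priority; the data layout and numeric priorities here are purely illustrative:

```python
def pick_cluster(routes):
    """Choose the forwarding cluster for an advertised address segment:
    among routes that are currently reachable, take the one with the
    highest priority (modeled as the lowest numeric value, as with
    routing preference). Returns None if no route is reachable."""
    live = [r for r in routes if r["reachable"]]
    if not live:
        return None
    return min(live, key=lambda r: r["priority"])["cluster"]
```

The failover in the text corresponds to the Zone A route being withdrawn (marked unreachable), after which the lower-priority Zone B route for the same addresses wins the selection automatically.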

From the user's perspective, this appears in the product as the primary availability zone and the backup availability zone. However, the primary and backup availability zones of an instance cannot be changed after the user selects them, and the user cannot switch the primary/standby relationship manually. In other words, the primary and backup availability zones are an underlying mechanism; once chosen, the user cannot intervene. The concept of primary and backup availability zones is exposed partly so that users can combine SLB on demand with other products that have availability-zone attributes, such as ECS availability zones and RDS primary/backup availability zones, and partly so that users are more aware of the product's value in terms of high availability. In practice, however, users often have misunderstandings about it, which makes us consider whether to hide the concept of the backup availability zone and expose only the primary availability zone. These misunderstandings are explained later.

Returning to high availability across availability zones: ideally, when an SLB instance in one availability zone becomes abnormal (its route is unreachable), the system would automatically switch, or the user could manually switch, to an SLB instance in another availability zone to continue service. The main reason this cannot be done is that it would require advertising host routes, i.e. /32 routes, which is unacceptable in a cloud environment.

[Figure: cross-availability-zone switchover via route priority]

From the user's point of view:

First, for cross-AZ disaster recovery, the backend ECS instances of an SLB instance must be distributed across multiple availability zones; otherwise, when one availability zone becomes unavailable, the SLB instance has no usable backend ECS and user access is affected. This was explained in the first layer, the application processing layer. Of course, if you also use products such as a database, you need to consider the database's cross-availability-zone disaster recovery as well; refer to the documentation of the relevant database products. Here we focus on the high availability of load balancing itself and of the backend ECS instances closely related to it.

Second, for important business, it is essential to use at least two instances with different primary availability zones for disaster recovery. For example, for an important user registration system, as shown in the figure below, one SLB instance is deployed in Availability Zone A and another in Availability Zone B of China East 1, and the ECS instances in both availability zones are mounted behind each instance.

[Figure: two SLB instances with different primary availability zones serving the registration system]

Users may have two questions. First, the load balancing SLB product itself can switch across availability zones, so why does the registration system need to create two instances in two availability zones for cross-availability-zone disaster recovery? The product's own cross-availability-zone switchover happens only in very extreme cases (the entire availability zone is unavailable, all IPs on the LVS cluster are unroutable, or a certain address segment is unroutable). Less extreme failures that still affect the user's business, such as a partial Tengine exception in one availability zone, or an LVS cluster exception while the IP can still be pinged, do not trigger the switchover. Therefore, for important business systems, it is very necessary to establish instance-level disaster recovery across availability zones.

The second question is how to schedule traffic between the two SLB instances used by the registration system. Users can use Cloud DNS to resolve the registration system's domain name to the two SLB instances above; other domain name resolution systems also support this. For private-network SLB instances, Alibaba Cloud is developing a private-network DNS system; for now, users can build their own private-network DNS or implement the scheduling in their programs.
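A client-side sketch of that scheduling, assuming a domain name that resolves to the addresses of both SLB instances: the client simply tries each resolved address in turn, so losing one instance only costs one failed connection attempt. Real deployments would also rely on DNS health checks and TTLs rather than client logic alone:

```python
import socket


def resolve_and_connect(hostname, port=80, timeout=2.0):
    """Resolve a domain that points at multiple SLB instances and try
    each IPv4 address in turn, falling back to the next address when
    one is unreachable. Returns a connected socket."""
    addrs = {info[4][0]
             for info in socket.getaddrinfo(hostname, port, socket.AF_INET)}
    last_err = None
    for addr in sorted(addrs):  # deterministic order for the sketch
        try:
            return socket.create_connection((addr, port), timeout=timeout)
        except OSError as e:
            last_err = e  # this instance unreachable: try the next one
    raise last_err
```

A production client would typically shuffle or weight the addresses instead of sorting them, so that load spreads across both instances under normal operation.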

To repeat: for important business systems it is necessary to establish instance-level disaster recovery in two availability zones. Even if one instance is configured but unused under normal circumstances, and its public IP fee of 2 cents per hour has to be paid, when the cluster hosting the other instance becomes abnormal and recovery takes a long time, the user can quickly restore the business by modifying the DNS resolution or the address called by the program.

Finally, let's look at the misunderstandings users may have about the primary and backup availability zones:

1) Believing that the load balancer automatically switches to the backup availability zone as long as a single instance has any abnormality.

2) Believing that the abnormalities triggering a primary/backup availability zone switchover include conditions such as an individual instance being unpingable or the services on an instance being abnormal.

In fact, primary/backup availability zone switchover is designed to happen automatically only in extreme cases (the entire availability zone is unavailable, all IPs on the LVS cluster are unroutable, or a certain address segment is unroutable). It is not triggered when a single IP address is unroutable, nor when one or more IP addresses can be pinged but the services on those IPs are abnormal. Therefore, it is very necessary for user systems to achieve cross-availability-zone disaster recovery by establishing instances in multiple availability zones.

Layer 4: Cross-Region Disaster Recovery Layer

Finally, let's talk about the cross-region disaster recovery layer. As business grows, users have ever higher requirements for the high availability of their business systems and are no longer satisfied with disaster recovery only across availability zones. Users hope that even if the systems in one region become unavailable, systems in other regions can continue to provide service: this is cross-region disaster recovery.

Cross-region disaster recovery is a very large topic. It involves not only the network level but also difficult issues such as adapting the application systems, data synchronization, and consistency. Only the network level of cross-region disaster recovery is described here. From a product point of view, cross-region disaster recovery is generally implemented through DNS, as shown in the following figure.

[Figure: cross-region disaster recovery through DNS]

Traditional global load balancing products such as F5's (formerly called GTM, now BIG-IP DNS) have relatively complete solutions, and some DNS service systems provide similar functions. The load balancing SLB product itself does not provide this capability; cross-region disaster recovery is achieved through the Cloud DNS product, which provides global load balancing along with functions such as health checks and routing scheduling optimization. You can refer to the global load balancing cross-region disaster recovery solution.

In addition, cross-region disaster recovery may require synchronizing data between regions or making cross-region private-network calls; high-speed channel products can be used to build communication links between different regions.

From the user's point of view, the network level of cross-region disaster recovery mainly relies on the Cloud DNS product for resolution, and high-speed channel products can be used for communication between different regions. These two products are not the focus of this article; refer to the relevant product documentation on Alibaba Cloud's official website.

This article is the original content of Yunqi Community, and may not be reproduced without permission. If you need to reprint, please send an email to [email protected]
