High-concurrency system design: high availability

Article from: Alibaba's billion-level concurrent system design (2021 version)

Link: https://pan.baidu.com/s/1lbqQhDWjdZe1CBU-6U4jhA Extraction code: 8888 

Table of contents

High Availability (HA)

Measuring availability

Ideas for high-availability system design

Course summary


After the course launched, some students gave feedback that the lessons lean heavily toward theory and that they would like to see more examples. I have been paying attention to this feedback, and thank you for the suggestions. At the start of Lesson 04, I would like to respond to it.

When designing the course, my intention was to use the first five lessons of the basics section to introduce the fundamental concepts of high-concurrency system design and help you build an overall framework, so that the later evolution and practice chapters can expand each knowledge point one by one. For example, this lesson mentions degradation, and I will cover the types of degradation schemes and their applicable scenarios in detail, with concrete cases, in the operations chapter. The idea behind this structure is to link the lessons together: a small amount of space up front establishes the key points, and later lessons expand from those points to the full picture. Of course, differing voices are what motivate me to keep improving the content. I will take every suggestion seriously, keep optimizing the course, and work hard and make progress together with you.

Now let us formally get into the lesson. In this lesson I will continue to walk you through the second goal of high-concurrency system design: high availability. What you need to take away is an intuitive understanding of the ideas and methods for improving system availability, so that when these topics are covered in depth later you can pick them up immediately, and so that you can come back to these methods when your own system runs into availability problems.

High Availability (HA)

High Availability (HA) is a term you will hear often when designing systems. It refers to a system's ability to run continuously without failure.

The HA solutions you see in the documentation of many open-source components are schemes for improving component availability and preventing downtime and loss of service. For example, you may know that the NameNode in Hadoop 1.0 is a single point of failure: once it fails, the whole cluster becomes unavailable. The NameNode HA scheme introduced in Hadoop 2 starts two NameNodes at the same time, one in the Active state and one in the Standby state, sharing the same storage. Once the Active NameNode fails, the Standby NameNode can be switched to the Active state and continue to provide service. This strengthens Hadoop's ability to run continuously without failure, that is, it improves its availability.

Generally speaking, for a system with high concurrency and heavy traffic, a failure hurts the user experience more than poor performance does. Imagine a system with more than a million daily active users: a one-minute outage may affect thousands of users. And as the system's daily active users grow, the number of users affected by that same one-minute outage also grows, so the requirements on availability become higher.

So today I will take you through how to guarantee high availability under high concurrency, to give you some ideas for your own system design.

Measuring availability

Availability is an abstract concept, so you need to know how to measure it. The related concepts are MTBF and MTTR.

MTBF (Mean Time Between Failures): the average time between two failures, that is, the average time the system runs normally. The longer this time, the more stable the system.

MTTR (Mean Time To Repair): the average time it takes to recover from a failure; it can also be understood as the average duration of a failure. The smaller this value, the smaller the impact of a failure on users.

Availability is closely related to the values of MTBF and MTTR. We can express the relationship with the following formula: Availability = MTBF / (MTBF + MTTR). The result is a ratio that represents the availability of the system. Generally speaking, we describe system availability in terms of how many nines it reaches.
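To make the formula concrete, here is a small worked example with assumed numbers: suppose a service fails on average once every 30 days (MTBF ≈ 720 hours) and takes on average 1 hour to recover (MTTR = 1 hour). Then Availability = 720 / (720 + 1) ≈ 99.86%, a little better than two nines. If MTTR is cut to 5 minutes (≈ 0.08 hours), availability rises to 720 / 720.08 ≈ 99.99%, close to four nines, which is one reason shortening recovery time matters so much.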

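For reference, the commonly cited mapping between the number of nines and the annual downtime it allows is roughly as follows (365 days multiplied by the unavailable fraction):

  • One nine (90%): about 36.5 days per year
  • Two nines (99%): about 3.65 days per year
  • Three nines (99.9%): about 8.76 hours per year
  • Four nines (99.99%): about 52.6 minutes per year
  • Five nines (99.999%): about 5.26 minutes per year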
In fact, from this mapping you can see that one nine and two nines of availability are very easy to achieve: as long as an excavator from the Lanxiang Technical School does not dig up your cables, they can basically be reached with manual operations alone.

Beyond three nines, the system's annual downtime drops sharply from about 3 days to about 8 hours; at four nines, annual downtime falls to under an hour. At this level of availability you may need a complete on-call rotation, troubleshooting procedures, and change-management procedures, and you need to think further at design time. For example, during development you have to consider whether the system can recover from a failure automatically, without manual intervention. Of course, you also need to invest more in tooling so that you can quickly locate the cause of a failure and quickly restore the system.

Once you reach five nines, failures can no longer be recovered by hand. Imagine the time from the moment a fault occurs, to receiving the alert, to opening your laptop and logging into the server to deal with it: ten minutes may already have passed. So this level of availability tests a system's fault tolerance and automatic-recovery capabilities; only by letting machines handle failures can the availability metric be pushed this high.

 

Generally speaking, core business systems need to reach four nines of availability, while for non-core systems three nines is usually tolerable. You may have heard similar rules of thumb at work, but systems of different scale and different business scenarios have different availability requirements.

Now that you have some understanding of how availability is measured, let's look at what needs to be considered when designing a highly available system.

Ideas for high-availability system design

The availability of a mature system has to be guaranteed from two aspects: system design and system operations. The two work together and neither can be left out. So how do we approach high availability from each of these two aspects?

1. System Design

"Design for failure" is the first principle of high-availability system design. In a high-concurrency system handling a million QPS, the cluster contains hundreds or even thousands of machines, so the failure of a single machine is the norm and may happen almost every day. Plan ahead and you win from a thousand miles away: when designing the system we must treat failures as an important consideration and think in advance about how to detect them automatically and how to handle them once they occur. Beyond thinking ahead, we also need to master some concrete techniques, such as failover, timeout control, degradation, and rate limiting.

Generally speaking, failover between nodes falls into one of two situations:

  • Failover between completely peer (equivalent) nodes.
  • Failover between non-peer nodes, that is, the system has a primary node and standby nodes.

Failover between peer nodes is relatively simple. In this kind of system all nodes handle read and write traffic, no state is stored on the nodes, and every node can serve as a mirror of any other. In that case, if access to one node fails, simply retry on another randomly chosen node. For example, Nginx can be configured so that when a request to a Tomcat node fails or returns a 500-level response, it retries the request against another Tomcat node, roughly as follows:

 
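A minimal configuration sketch is shown below; the upstream name and server addresses are assumed for illustration, and the directives would live inside the http block of an Nginx configuration:

```nginx
# Assumed upstream name and addresses, purely for illustration
upstream tomcat_servers {
    server 192.168.1.101:8080;
    server 192.168.1.102:8080;
}

server {
    location / {
        proxy_pass http://tomcat_servers;
        # Retry the next upstream node on connection errors, timeouts, or 5xx responses
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
}
```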

The failover mechanism between non-peer nodes is much more complicated. For example, suppose we have one primary node and multiple standby nodes. The standby nodes may be hot standbys (standby nodes that also serve online traffic) or cold standbys (kept purely as backups). In this case we need to build into our code the logic that detects whether the primary and standby machines have failed and that switches between primary and standby. The most widely used failure-detection mechanism is the heartbeat: the client, or the standby nodes, periodically send heartbeat packets to the primary node. If no heartbeat is acknowledged for a period of time, the primary node is considered to have failed and a leader election is triggered. The result of the election must be agreed upon by the standby nodes, so a distributed consensus algorithm such as Paxos or Raft is used.
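To make the heartbeat idea more tangible, here is a minimal sketch in Java; the class and method names, interval, and threshold are assumptions for illustration and are not tied to any particular framework:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A standby node periodically pings the primary; after several consecutive
// missed heartbeats it assumes the primary is down and triggers an election.
public class HeartbeatDetector {
    private static final int MAX_MISSED = 3;       // missed beats before declaring failure
    private static final long INTERVAL_MS = 1000;  // heartbeat interval

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private int missed = 0;                        // only touched by the single scheduler thread

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            if (pingPrimary()) {
                missed = 0;                        // primary responded, reset the counter
            } else if (++missed >= MAX_MISSED) {
                onPrimaryDown();                   // too many misses: assume the primary failed
            }
        }, 0, INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    private boolean pingPrimary() {
        // Placeholder: send a heartbeat request to the primary and report whether it answered in time.
        return true;
    }

    private void onPrimaryDown() {
        scheduler.shutdown();
        // Placeholder: kick off leader election, e.g. through a Raft/Paxos library or a coordination service.
        System.out.println("Primary considered down, triggering leader election");
    }
}
```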

Besides failover, controlling the timeouts of calls between systems is another important consideration in high-availability design.

A complex high-concurrency system usually consists of many modules and also depends on many components and services, such as caches and message queues. In the calls between them, the thing to fear most is delay rather than outright failure, because a failure is usually instantaneous and can be resolved by retrying. But once a call to some module or service develops a large delay, the caller blocks on that call and cannot release the resources it holds; when many requests block this way, the caller itself exhausts its resources and goes down. In the early stages of development, timeout control is usually neglected, or there is simply no good way to pick the right timeout value.

I once worked on a project in which modules called each other through an RPC framework whose default timeout was 30 seconds. Normally the system ran very stably, but once traffic grew and a number of slow requests appeared on the RPC server side, the RPC client threads would block on those slow requests for 30 seconds, and the client hung because all of its calling threads were exhausted. After we uncovered this in the failure review, we adjusted the timeouts for RPC, database, cache, and third-party service calls, so that a slow request now triggers a timeout instead of dragging the whole system into an avalanche. Given that we must do timeout control, how do we determine the timeout value? That is the harder question.

If the timeout is too short, it causes many spurious timeout errors and hurts the user experience; if it is too long, it provides no protection. My suggestion is to collect the call logs between systems, compute the 99th-percentile response time, and set the timeout based on that. If you have no call logs, you can only set it from experience. Either way, the timeout is not fixed once and for all; it needs to be revisited and adjusted as the system is maintained.
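As a rough illustration of that suggestion, here is a small Java sketch that derives a timeout from latency samples pulled out of call logs; the class name and the 20% headroom are assumptions, not a prescription:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TimeoutCalculator {
    /** Suggests a timeout in milliseconds: the observed p99 latency plus 20% headroom. */
    public static long suggestTimeoutMs(List<Long> latenciesMs) {
        if (latenciesMs.isEmpty()) {
            throw new IllegalArgumentException("no latency samples");
        }
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int p99Index = (int) Math.ceil(sorted.size() * 0.99) - 1;  // nearest-rank 99th percentile
        long p99 = sorted.get(Math.max(p99Index, 0));
        return (long) (p99 * 1.2);                                 // 20% headroom is a tunable choice
    }
}
```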

The point of timeout control is not to hold on to a request forever, but to fail it after a certain period and release its resources for the next request. This is lossy for users, but it is necessary: it sacrifices a small number of requests to protect the availability of the system as a whole.

We also have two other lossy techniques for keeping the system highly available: degradation and rate limiting.

Degradation means sacrificing non-core services in order to protect the stability of core services. For example, when we post a Weibo, the content first goes through an anti-spam check to detect whether it is an advertisement, and only after it passes do we write it to the database and run the remaining logic. Anti-spam detection is a relatively heavy operation because it involves matching many policies; under everyday traffic it is time-consuming but still responds normally. Under high concurrency, however, it can become a bottleneck, and it is not on the main path of posting a Weibo, so we can temporarily switch off the anti-spam check to keep the main flow stable.
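A minimal sketch of such a degradation switch might look like the following Java snippet; the class and method names are assumed, and in practice the flag would come from a configuration center so it can be flipped at runtime:

```java
public class PostService {
    // Degradation switch: when true, the non-core anti-spam check is skipped
    // so the core posting path stays fast under heavy load.
    private volatile boolean antiSpamDegraded = false;

    public void postWeibo(String content) {
        if (!antiSpamDegraded && isSpam(content)) {
            throw new IllegalArgumentException("content rejected by anti-spam check");
        }
        saveToDatabase(content);   // core path: always executed
    }

    public void setAntiSpamDegraded(boolean degraded) {
        this.antiSpamDegraded = degraded;
    }

    private boolean isSpam(String content) {
        // Placeholder for the heavy policy-matching anti-spam call
        return false;
    }

    private void saveToDatabase(String content) {
        // Placeholder for the core write logic
    }
}
```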

Rate limiting takes another angle: it protects the system by restricting the rate of concurrent requests. For example, for a web application I might limit a single machine to 1,000 requests per second and directly return an error to the client for anything beyond that. This hurts the user experience, but it is a last resort under extreme concurrency and a short-lived measure, so it is acceptable. In fact, both degradation and rate limiting have many more details worth discussing; I will analyze them in depth in later lessons as the system evolves, and will not expand on them in the basics section.
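For a feel of what the simplest form can look like, here is a fixed-window counter sketch in Java that enforces the 1,000 requests-per-second example above; it is only one of several possible algorithms (token bucket and leaky bucket are common alternatives), and the names are assumed:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleRateLimiter {
    private final int limitPerSecond;
    private final AtomicInteger counter = new AtomicInteger(0);
    private final AtomicLong windowStartMs = new AtomicLong(System.currentTimeMillis());

    public SimpleRateLimiter(int limitPerSecond) {
        this.limitPerSecond = limitPerSecond;
    }

    /** Returns true if the request is allowed, false if it should be rejected. */
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStartMs.get();
        if (now - start >= 1000 && windowStartMs.compareAndSet(start, now)) {
            counter.set(0);   // a new one-second window begins: reset the counter
        }
        return counter.incrementAndGet() <= limitPerSecond;
    }
}

// Usage sketch: reject anything beyond 1,000 requests per second on this machine.
//   SimpleRateLimiter limiter = new SimpleRateLimiter(1000);
//   if (!limiter.tryAcquire()) { /* return an error to the client */ }
```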

2. System Operation and Maintenance

In the system design stage we can adopt the methods above to guarantee availability. What can be done at the operations level? In fact, we can look at two aspects: gray (canary) release and fault drills.

You should know that while the business is running smoothly the system rarely fails; 90% of failures occur during online changes. For example, you ship a new feature and, because of a design problem, the number of slow database queries doubles, requests slow down, and a failure results. If nothing had changed, why would the database suddenly produce so many slow queries? So, to improve availability, it pays to manage changes carefully. Besides preparing the necessary rollback plan so that you can roll back and recover quickly when a problem appears, another major method is gray release.

Gray release means that a change is not pushed to production all at once but is rolled out gradually according to some proportion. Gray release is usually done at the machine level: for example, we first apply the change to 10% of the machines and watch the system metrics and error logs on the dashboard. If the metrics stay stable after running for a while and there is no surge of error logs, we then roll the change out to the whole fleet.

Gray release gives development and operations engineers an excellent opportunity to observe the impact of a change on live traffic, and it is an important line of defense for keeping the system highly available.

Gray release is an operations technique that protects availability while the system is changing under otherwise normal conditions. But how do we know how the system behaves when a failure actually occurs? For that we rely on another technique: the fault drill.

A fault drill means deliberately applying destructive actions to the system and observing how the system as a whole behaves when a local failure occurs, in order to uncover latent availability problems.

A complex high-concurrency system depends on a great many components, such as disks, databases, and network cards. Any of them can fail at any time, and once one fails, will it cascade like a butterfly effect and make the whole service unavailable? We do not know, and that is exactly why fault drills are so important.

In my view, fault drills follow the same line of thinking as the now-popular "Chaos Engineering". As the origin of chaos engineering, the "Chaos Monkey" tool released by Netflix in 2010 is an excellent fault-drill tool: it simulates failures by randomly shutting down nodes in the production system, so that engineers can understand the impact of such failures.

Of course, all of this presupposes that your system can already tolerate some abnormal situations. If it cannot yet, I suggest you build a separate offline environment with exactly the same deployment structure as production and run fault drills there, to avoid affecting the production system.

Course summary

In this lesson I took you through how to measure system availability and how to guarantee high availability when designing a high-concurrency system. You can see that development and operations improve availability in different ways:

Development focuses on how to handle failures, and the keywords are redundancy and trade-off. Redundancy means having spare nodes or clusters that can take over from a failed service, such as the failover described above or a multi-active architecture. Trade-off means sacrificing the pawn to protect the king: giving up what is non-essential to keep the core service safe.

Operations takes a more conservative stance and focuses on how to avoid failures in the first place, for example by paying more attention to change management and by running fault drills.

The combination of the two can form a complete high-availability system.

You also need to note that improving availability sometimes comes at the cost of user experience or system performance, and it takes considerable manpower to build the supporting systems and mechanisms. So we must keep a sense of proportion and not over-optimize. As I mentioned above, four nines of availability for core systems already meets the need; there is no reason to blindly chase five or even six nines.

In addition, systems and components generally pursue extreme performance, so is there anything that does not pursue performance but only extreme availability? The answer is yes. For example, a configuration-delivery system only needs to hand out configuration when other systems start up, so it does not have to respond instantly; returning in seconds, even ten seconds, is fine, since that merely slows the startup of the other systems a little. Its availability requirement, however, is extremely high, perhaps even six nines, because configuration may be fetched slowly, but it must always be possible to fetch it.

I give this example so you understand that there are trade-offs between availability and performance, and the right choice depends on the specific system; it cannot be generalized.

Source: blog.csdn.net/sanmi8276/article/details/113087906