Ensuring System Stability and High Availability

I. Introduction

High concurrency, high availability, and high performance are known as the "three highs" of Internet architecture, and all three are factors that engineers and architects must weigh when designing a system. Today we will talk about high availability, which is what we usually mean by system stability.

> This article only talks about ideas, without too many in-depth details. It takes about 5-10 minutes to read the full text.

II. Definition of High Availability

N nines are commonly used in the industry to quantify the degree of system availability, which can be directly mapped to the percentage of website uptime.

| Availability | Commonly called | Max downtime per year |
| --- | --- | --- |
| 99% | two nines | ≈ 3.65 days |
| 99.9% | three nines | ≈ 8.76 hours |
| 99.99% | four nines | ≈ 52.6 minutes |
| 99.999% | five nines | ≈ 5.26 minutes |

The formula for calculating availability is:

Availability = Uptime / (Uptime + Downtime)

Most companies aim for four nines, meaning no more than about 53 minutes of downtime per year (525,600 minutes × 0.01% ≈ 52.6 minutes). Reaching this goal is genuinely hard and requires every sub-module to pull its weight.

To improve the availability of a system, you first need to know what factors affect the stability of the system.

III. Factors Affecting Stability

First, let's sort out some common problem scenarios that affect system stability, which can be roughly divided into three categories:

  • Human factors: ill-considered changes, external attacks, etc.

  • Software factors: code bugs, design flaws, GC problems, thread pool exceptions, upstream and downstream failures

  • Hardware factors: network failures, machine failures, etc.

Now we can prescribe the right medicine: first, prevention before a failure happens; second, the ability to recover quickly after one does. Below are several common approaches.

IV. Several Ideas for Improving Stability

4.1 System Splitting

The purpose of splitting is not to reduce unavailable time, but to reduce the blast radius of a failure: once a large system is split into several small, independent modules, a problem in one module will not drag down the others. System splitting covers the access layer, the service layer, and the database layer.

  • The access layer and service layer are generally split by dimensions such as business module, importance, and change frequency.

  • The data layer is generally split by business first; where necessary it can be split further through data sharding, read-write separation, and separation of hot and cold data (a minimal routing sketch follows this list).
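
To make the data-sharding idea a bit more concrete, here is a minimal, hypothetical routing sketch in Java. The shard count, the choice of user id as the sharding key, and the data-source names are all illustrative assumptions, not anything prescribed above.

```java
// Hypothetical sketch: pick a physical database for an order record by hashing
// the sharding key (user id), so that write traffic is spread across shards.
public class ShardRouter {

    private static final int SHARD_COUNT = 8;   // illustrative value; sized during capacity planning

    /** Returns the logical data-source name the record should be written to. */
    public static String pickDataSource(long userId) {
        int shard = (int) Math.floorMod(userId, (long) SHARD_COUNT);
        return "order_db_" + shard;             // e.g. order_db_0 ... order_db_7
    }

    public static void main(String[] args) {
        System.out.println(pickDataSource(10086L));   // the same user always maps to the same shard
    }
}
```

In practice this routing usually lives inside a sharding middleware rather than hand-written code, but the idea is the same.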

4.2 Decoupling


After the system is split into multiple modules, there will be strong and weak dependencies between them. If module A depends strongly on module B, a problem in B will drag A down with it. In that case, sort out the call relationships along the whole chain and turn strong dependencies into weak ones where possible. Weak dependencies can be decoupled through MQ, so that even if something goes wrong downstream, the current module is not affected.
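
As one concrete illustration of decoupling through MQ, here is a minimal sketch using the Kafka Java producer. Kafka is just one example of an MQ, and the topic name and event payload are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Instead of calling a weakly dependent downstream (e.g. points or notifications)
// synchronously, the order module only publishes an "order created" event;
// downstream consumers process it at their own pace, so their failures no longer
// block order creation.
public class OrderEventPublisher {

    private final Producer<String, String> producer;

    public OrderEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void publishOrderCreated(String orderId, String payloadJson) {
        // Fire-and-forget publish; the downstream is now decoupled from this module.
        producer.send(new ProducerRecord<>("order-created", orderId, payloadJson));
    }
}
```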

4.3 Technology Selection

Do a full evaluation in terms of fit for the scenario, pros and cons, product reputation, community activity, real-world cases, scalability, and so on, and pick the middleware and database that suit the current business. Do enough research up front: compare, test, and investigate first, then decide. Time spent sharpening the axe is not wasted on cutting the firewood.

4.4 Redundant Deployment & Automatic Failover

Redundant deployment at the service layer is easy to understand: one service is deployed on multiple nodes. But redundancy alone is not enough. If every failure requires manual intervention to recover, the system's unserviceable time inevitably grows. So high availability is usually achieved through automatic failover: after a node goes down, upstream traffic must be removed from it automatically, which can largely be done through the load balancer's health-check mechanism.
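
To make "automatic failover via health detection" concrete, below is a minimal client-side sketch. In real deployments this job is done by Nginx/LVS or a cloud load balancer rather than hand-rolled code; the /health path, probe interval, and timeouts here are assumptions for illustration only.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch: probe each node's /health endpoint periodically and route
// requests only to nodes that pass, so a downed node is pulled out of upstream
// traffic without manual intervention (and added back once it recovers).
public class HealthAwareRouter {

    private final List<String> allNodes;                               // e.g. "http://10.0.0.1:8080"
    private final List<String> healthyNodes = new CopyOnWriteArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    public HealthAwareRouter(List<String> nodes) {
        this.allNodes = List.copyOf(nodes);
        this.healthyNodes.addAll(nodes);
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::probeAll, 0, 3, TimeUnit.SECONDS);
    }

    private void probeAll() {
        for (String node : allNodes) {
            if (isAlive(node)) {
                if (!healthyNodes.contains(node)) healthyNodes.add(node);   // node recovered
            } else {
                healthyNodes.remove(node);                                  // failover: stop routing here
            }
        }
    }

    private boolean isAlive(String node) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(node + "/health").openConnection();
            conn.setConnectTimeout(500);
            conn.setReadTimeout(500);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }

    /** Round-robin over the currently healthy nodes. */
    public String pick() {
        if (healthyNodes.isEmpty()) throw new IllegalStateException("no healthy node available");
        return healthyNodes.get(Math.floorMod(cursor.getAndIncrement(), healthyNodes.size()));
    }
}
```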

The data layer is more complicated, but mature solutions generally exist for reference. Common topologies are one master with one slave, one master with many slaves, and multiple masters with multiple slaves. The general principle is that data replication gives you multiple slaves, and data sharding gives you multiple masters. During failover, an election algorithm picks a new master node, which then serves traffic (if writes are not replicated with strong consistency, some data may be lost during failover). For details, refer to cluster architectures such as Redis Cluster, ZK, and Kafka.

4.5 Capacity Assessment

Before the system goes online, it is necessary to evaluate the capacity of the machines, DBs, and caches used in the entire service. The capacity of the machines can be evaluated in the following ways:

  • Determine the expected traffic metric, i.e. peak QPS;
  • Define acceptable latency and safety water-level indicators (e.g. CPU usage ≤ 40%, core-link RT ≤ 50 ms);
  • Use stress testing to find the highest QPS a single machine can sustain while staying below the safety water level (ideally with mixed scenarios, e.g. stress-testing several core interfaces at the same time in their estimated traffic ratio);
  • Finally, estimate the number of machines required (a worked sketch follows below).

Beyond QPS, evaluating DBs and caches also requires estimating data volume; the approach is otherwise roughly the same. After the system is online, capacity can be scaled up or down based on monitoring metrics.
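
Putting the steps above into numbers, here is a tiny worked sketch. The QPS figures and the redundancy factor are made-up values for illustration only.

```java
// Back-of-the-envelope machine count: all numbers below are hypothetical.
public class CapacityEstimate {

    public static void main(String[] args) {
        double peakQps = 10_000;         // expected peak traffic of the service
        double safeQpsPerNode = 500;     // single-node ceiling measured under the safety water level
        double redundancyFactor = 1.5;   // headroom for bursts and for losing a node or zone

        int nodes = (int) Math.ceil(peakQps * redundancyFactor / safeQpsPerNode);
        System.out.println("machines needed ≈ " + nodes);   // prints 30
    }
}
```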

4.6 Rapid Service Expansion Capability & Flood Release Capability

At this stage, whether the service runs in containers or on ECS instances, simply replicating nodes to scale out is easy. The real focus is to evaluate whether the service itself is stateless, and to answer questions such as:

  • How many instances of the current service can the downstream DB's maximum connection count support?
  • Does the cache need to be warmed up after scaling out?
  • What is the strategy for ramping traffic onto the new nodes?

These factors need to be prepared in advance and written up into a complete SOP document. Better still, run a drill and actually go through the operation by hand, so that you are truly prepared.

Flood release generally means that, under redundant deployment, a few nodes are set aside as backup nodes that normally carry only a small share of traffic. When a traffic peak arrives, part of the traffic on the hot nodes is shifted to the backup nodes by adjusting the routing strategy.

Compared with scaling out on demand, this approach costs more, but the advantage is fast response and low risk.

4.7 Traffic Shaping & Fuse Degradation


Traffic shaping is commonly referred to as rate limiting; its main purpose is to keep the service from being overwhelmed by unexpected traffic. Circuit breaking exists to prevent prolonged blocking and avalanches when the service's own components or its downstream dependencies fail. The open-source component Sentinel provides both rate limiting and circuit breaking out of the box and is simple to use (a minimal rule sketch follows the notes below), but a few points deserve attention.

  • The rate-limit threshold is usually configured as the highest water level a given resource of the service can sustain, which has to be evaluated through stress testing. As the system iterates, this value may need continual adjustment: set it too high and the protection never triggers before the system collapses; set it too low and legitimate traffic gets hurt.

  • Circuit breaking and degradation: after an interface or resource is tripped, you need to decide, based on the business scenario and the importance of that resource, whether to throw an exception or return a fallback result. For example, in the order-placing flow, if the stock-deduction interface is tripped, then since deducting stock is a hard prerequisite for placing an order, the only option is to throw an exception so the whole link fails and rolls back. If the interface for fetching product reviews is tripped, you can simply return an empty value without affecting the rest of the link.
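
For reference, here is a minimal sketch of the Sentinel usage described above, combining a QPS flow rule with a decision in the blocked branch. The resource name, threshold, and fallback behaviour are assumptions; as noted above, the threshold should come from stress testing.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

import java.util.Collections;

public class RateLimitedOrderService {

    static {
        FlowRule rule = new FlowRule();
        rule.setResource("createOrder");               // hypothetical resource name
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);    // limit by QPS
        rule.setCount(200);                            // threshold taken from stress-test results
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }

    public String createOrder(String userId) {
        try (Entry entry = SphU.entry("createOrder")) {
            return doCreateOrder(userId);              // normal business path
        } catch (BlockException e) {
            // Critical resource (e.g. stock deduction): propagate the failure so the link rolls back.
            // A non-critical resource (e.g. product reviews) could return an empty fallback instead.
            throw new IllegalStateException("createOrder blocked by flow control", e);
        }
    }

    private String doCreateOrder(String userId) {
        return "order-ok";
    }
}
```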

4.8 Resource Isolation

If several downstreams of a service block at the same time while no single downstream interface has reached its circuit-breaking threshold (for example, the ratio of errors or slow requests stays below the limit), the throughput of the whole service drops as more and more threads are tied up; in extreme cases the thread pool can be exhausted. With resource isolation, the maximum thread resources a single downstream interface can occupy are capped, so that even before the breaker trips, the impact on the service's overall throughput stays as small as possible.

On the isolation mechanism itself, a bit more can be said. Because each interface differs in traffic and RT, it is hard to set a reasonable cap on the number of threads it may use, and as the business iterates the threshold is hard to maintain. A "shared plus exclusive" scheme helps here: each interface has its own exclusive thread resources, and only when the exclusive resources are full does it spill over into a shared pool; once the shared pool reaches a certain water level, the interface is forced back onto its exclusive resources and queues there. The obvious advantage of this mechanism is that isolation is preserved while resource utilization is maximized.
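
The "exclusive plus shared" idea can be sketched with plain JDK thread pools, roughly as below. Pool sizes and queue lengths are illustrative, and the forced fall-back-and-queue behaviour is approximated with CallerRunsPolicy rather than implemented exactly as described above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Simplified sketch of "exclusive + shared" isolation: each downstream call first
// uses its own small pool; when that pool is saturated the task overflows to a
// shared pool; when the shared pool is also at its ceiling the caller runs the
// task itself and effectively waits, so one slow downstream cannot drain the
// threads of the whole service.
public class IsolatedExecutors {

    // Shared overflow pool, bounded so it acts as a "water level".
    private static final ThreadPoolExecutor SHARED = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(64),
            new ThreadPoolExecutor.CallerRunsPolicy());

    /** Build an exclusive pool for one downstream interface. */
    public static ThreadPoolExecutor exclusivePool(int coreThreads, int queueSize) {
        return new ThreadPoolExecutor(
                coreThreads, coreThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueSize),
                (task, pool) -> SHARED.execute(task));   // overflow to the shared pool when full
    }

    public static void main(String[] args) {
        ThreadPoolExecutor stockPool = exclusivePool(4, 16);    // downstream A
        ThreadPoolExecutor reviewPool = exclusivePool(2, 8);    // downstream B
        stockPool.execute(() -> System.out.println("call stock service"));
        reviewPool.execute(() -> System.out.println("call review service"));
    }
}
```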

Thread count is only one kind of resource here; the resource could equally be connection count, memory, and so on.

4.9 Systemic Protection


Systemic protection is a form of indiscriminate rate limiting. In one sentence: just before the system is about to collapse, apply indiscriminate rate limiting at all entry points, and stop limiting once the system returns to a healthy water level. Concretely, monitoring metrics across several dimensions, such as application load, overall average RT, entry QPS, and thread count, are combined to strike a balance between inbound traffic and system load, so that the system runs as close to its maximum throughput as possible while overall stability is preserved.
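
If Sentinel is used, as in the earlier sections, its system-adaptive rules express exactly this kind of entry-level protection. The thresholds below are illustrative assumptions, not recommended values.

```java
import com.alibaba.csp.sentinel.slots.system.SystemRule;
import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;

import java.util.Collections;

// Indiscriminate, entry-level protection: when any of these system-wide metrics
// crosses its threshold, inbound traffic is shed until the system is healthy again.
public class SystemProtectionConfig {

    public static void init() {
        SystemRule rule = new SystemRule();
        rule.setHighestCpuUsage(0.8);   // CPU usage water level
        rule.setAvgRt(100);             // average RT of entry traffic, in ms
        rule.setQps(5000);              // total entry QPS
        rule.setMaxThread(400);         // concurrent entry threads
        SystemRuleManager.loadRules(Collections.singletonList(rule));
    }
}
```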

4.10 Observability & Alerting


When the system fails, we first need to locate the cause, then fix the problem, and finally restore service. How quickly the cause can be located largely determines how long the whole recovery takes, and the greatest value of observability lies exactly in fast troubleshooting. On top of that, alert rules configured around the three pillars of Metrics, Traces, and Logs can surface risks and problems in advance and prevent failures altogether.
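
As one possible way to get the Metrics pillar in place, here is a small sketch using Micrometer. The metric names, the p99 choice, and the in-memory registry are assumptions; in production the registry would typically export to Prometheus or a similar backend, with alert rules built on top of the exported metrics.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Record RT and error counts for a core interface; alerting (e.g. "p99 RT > 50 ms
// for 5 minutes" or "error rate > 1%") is then configured on top of these metrics.
public class OrderMetrics {

    private final MeterRegistry registry = new SimpleMeterRegistry();
    private final Timer createOrderTimer = Timer.builder("order.create.rt")
            .publishPercentiles(0.99)
            .register(registry);
    private final Counter createOrderErrors = Counter.builder("order.create.errors")
            .register(registry);

    public void createOrder(Runnable businessLogic) {
        try {
            createOrderTimer.record(businessLogic);   // measures wall-clock time of the call
        } catch (RuntimeException e) {
            createOrderErrors.increment();            // count failures for error-rate alerts
            throw e;
        }
    }
}
```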

4.11 The "Three Axes" of Change

Change is the greatest enemy of availability: 99% of failures come from changes, be they configuration changes, code changes, or machine changes. So how do we reduce the failures caused by change?

  • Gray release: verify the change with a small proportion of traffic first, reducing the impact on the wider user base (see the sketch after this list).

  • Rollback: once a problem appears, there must be an effective rollback mechanism. If the change modifies data, dirty data may be written after release, so a reliable rollback procedure is needed to make sure the dirty data gets cleaned up.

  • Observability: by watching how metrics change before and after the release, most problems can be caught early.
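
A gray release can be as simple as hashing a stable key into a percentage bucket. The sketch below assumes the user id is the routing key and that the percentage would be driven by a config center; both are illustrative choices.

```java
// Route a stable slice of users to the new code path; dialing the percentage
// back to 0 effectively "rolls back" the change for all users.
public class GrayReleaseSwitch {

    private volatile int grayPercent;   // 0-100, typically pushed from a config center

    public GrayReleaseSwitch(int initialPercent) {
        this.grayPercent = initialPercent;
    }

    public boolean useNewVersion(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);   // stable bucket per user
        return bucket < grayPercent;
    }

    public void setGrayPercent(int percent) {   // ramp up gradually: 1% -> 5% -> 20% -> 100%
        this.grayPercent = percent;
    }
}
```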

Beyond these three, the rest of the development process should also be standardized: code management, integrated builds, automated testing, static code scanning, and so on.

V. Summary

For a system that keeps evolving, there is no way to drive the probability of failure to zero. What we can do is prevent failures as much as possible and shorten the recovery time when they do happen. And there is no need to pursue availability blindly: improving stability also raises maintenance and machine costs, so the right level is the one that matches the system's business SLO requirements.

How to guarantee stability and high availability is a very large topic. This article deliberately skips the deep details and only lays out the overall ideas, mainly to give you a framework to refer to. Finally, thanks to everyone who read this far patiently.

Text/Shinichi

Recommended offline activities:

Time: June 10, 2023 (Saturday), 14:00-18:00
Theme: Dewu Technology Salon No. 18 - Wireless Technology No. 4
Venue: Training classroom, 12th floor, Dewu Hangzhou R&D Center, No. 77 Xueyuan Road, Xihu District, Hangzhou (Exit G of Wensan Road Station, Metro Line 10 & Line 19)

Highlights of the event: This wireless salon focuses on the latest technology trends and practices and will bring you four talks in Hangzhou / online: "Douyin Creation Tool - iOS Power Consumption Monitoring and Optimization", "Dewu Privacy Compliance Platform Construction Practice", "NetEase Cloud Music - Daily Guarantee Program Practice for Client Mass-Traffic Activities", and "Dewu Android Compilation and Optimization". We believe these topics will be helpful to your work and study, and we look forward to discussing them with you!

Click to register: wireless salon registration

This article is original content from Dewu Technology. Source: Dewu Technology official website.

Reprinting without the permission of Dewu Technology is strictly prohibited; otherwise legal liability will be pursued in accordance with the law.
