How do you design a highly available system? I have briefly summarized 10 methods, and I will go through them all today!


This is a short article on a question that often comes up in interviews. It mainly covers the following:

  1. The definition of high availability
  2. What conditions may cause a system to be unavailable?
  3. What are some ways to improve system availability? These are only mentioned briefly here; follow-up articles will go into more detail. Taking rate limiting as an example, you need to understand: What is rate limiting? How do you do it? Why do it? What is the principle behind it?

1. What is high availability? What is the criterion for availability?

High availability describes a system that is available and able to serve us most of the time. It means that the service remains available even in the event of a hardware failure or a system upgrade.

Under normal circumstances, we use the number of 9s to judge the availability of a system. For example, 99.9999% means that the system is unavailable for only 0.0001% of its total running time. Such a system is very, very highly available! Of course, a system with poor availability may not even reach a single 9.

In addition, availability can also be measured by the ratio of failed requests for a given function to its total number of requests. For example, if a website receives 1000 requests and 10 of them fail, the availability is 99%.
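The two measures above are simple arithmetic, and a small sketch can make them concrete. The class and method names below (`AvailabilityMath`, `requestAvailabilityPercent`, `allowedDowntime`) are made up for this illustration and do not come from any library.

```java
import java.time.Duration;

/** Illustrative helpers for the two availability measures (hypothetical names). */
final class AvailabilityMath {

    private AvailabilityMath() {}

    /** Availability as the percentage of requests that did not fail. */
    static double requestAvailabilityPercent(long totalRequests, long failedRequests) {
        if (totalRequests <= 0 || failedRequests < 0 || failedRequests > totalRequests) {
            throw new IllegalArgumentException("invalid request counts");
        }
        return 100.0 * (totalRequests - failedRequests) / totalRequests;
    }

    /** Downtime that a given availability percentage allows within a period. */
    static Duration allowedDowntime(double availabilityPercent, Duration period) {
        if (availabilityPercent < 0 || availabilityPercent > 100) {
            throw new IllegalArgumentException("availability must be between 0 and 100");
        }
        double unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return Duration.ofMillis(Math.round(period.toMillis() * unavailableFraction));
    }
}
```

With the numbers from the example, `requestAvailabilityPercent(1000, 10)` gives 99.0. Over a 365-day year, "three 9s" (99.9%) leaves room for roughly 8.76 hours of downtime, while 99.9999% leaves only about 31.5 seconds.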

2. What conditions will cause the system to be unavailable?

  1. Hacker attacks.
  2. Hardware failure, such as a broken server.
  3. A surge in concurrent user requests brings down the entire service or makes some services unavailable.
  4. Code smells cause memory leaks or other problems that make the program hang.
  5. A key component of the website architecture, such as Nginx or the database, suddenly becomes unavailable.
  6. Natural disasters or man-made destruction.


3. What are the ways to improve system availability?

1. Pay attention to code quality and test rigorously

I think this is the most important one. Code quality problems, such as the common memory leaks and circular dependencies, can do great damage to the availability of a system. Everyone likes to talk about rate limiting, degradation, and circuit breaking, but I think controlling code quality at the source is the first thing to get right. How do you improve code quality? The most practical way is code review: don't begrudge the extra hour or so it takes each day, because it can be very useful!

In addition, here are some tools I recommend that have a practical effect on improving code quality:

  1. SonarQube: helps ensure that you write safer and cleaner code! (PS: this plugin is used in basically all of our current projects.)
  2. Arthas, Alibaba's open-source Java diagnostic tool, is also a very good choice.
  3. IDEA's built-in code analysis and similar tools are also very good for code scanning.

2. Use clusters to reduce single points of failure

Let's take the commonly used Redis as an example! How do we ensure that our Redis cache is highly available? The answer is to use a cluster to avoid a single point of failure. When we use a single Redis instance as the cache and that instance goes down, the entire cache service may go down with it. With a cluster, even if one Redis instance goes down, another Redis instance takes over in less than a second.

3. Rate limiting

The principle of flow control is to monitor indicators such as the QPS of application traffic or the number of concurrent threads. When a specified threshold is reached, the traffic is controlled so that the application is not overwhelmed by instantaneous traffic peaks, thereby ensuring its high availability. - from alibaba Sentinel's Wiki.
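To make the threshold idea concrete, here is a minimal sketch of a fixed-window limiter: it counts requests in the current time window and rejects anything over the limit. This is not Sentinel's API. The class name `FixedWindowRateLimiter` and its parameters are assumptions made for this illustration, and the clock is injected only so that the behavior can be tested without sleeping.

```java
import java.util.function.LongSupplier;

/** A minimal fixed-window rate limiter (illustrative sketch, hypothetical name). */
final class FixedWindowRateLimiter {
    private final int maxPerWindow;    // threshold: requests allowed per window
    private final long windowMillis;   // window length, e.g. 1000 ms for a QPS limit
    private final LongSupplier clock;  // current time in ms, injectable for tests
    private long windowStart;
    private int count;

    FixedWindowRateLimiter(int maxPerWindow, long windowMillis, LongSupplier clock) {
        if (maxPerWindow <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("limit and window must be positive");
        }
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            // A new window has started: reset the counter.
            windowStart = now;
            count = 0;
        }
        if (count < maxPerWindow) {
            count++;
            return true;
        }
        return false;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`. A fixed window is the simplest choice but lets bursts through at window boundaries; sliding windows and token buckets are common refinements.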

4. Timeout and retry mechanism settings

Once a request has gone unanswered for longer than a set period, an exception is thrown. This is very important: many production failures are caused by missing or incorrectly configured timeouts. Timeouts and retries are especially appropriate when calling third-party services. RPC frameworks generally come with their own timeout and retry configuration. Without a timeout, requests may respond slowly, or even pile up until the system can no longer process requests. The number of retries is generally set to 3; more retries bring no benefit and only increase the pressure on the server (and in some scenarios a retry-on-failure mechanism is not suitable at all).
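As a rough sketch of how a timeout and a bounded retry fit together, the helper below runs a call on an executor, waits at most `timeoutMillis` for each attempt, and gives up after `maxAttempts` attempts. The class and method names are hypothetical, and real RPC frameworks expose this through their own configuration rather than code like this.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Timeout plus bounded retry around a blocking call (illustrative sketch). */
final class TimeoutRetry {

    private TimeoutRetry() {}

    static <T> T callWithTimeoutAndRetry(ExecutorService executor, Callable<T> call,
                                         long timeoutMillis, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = executor.submit(call);
            try {
                // Wait for the result, but never longer than the timeout.
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the slow attempt before retrying
                last = e;
            } catch (ExecutionException e) {
                // The call itself failed: remember the cause and try again.
                Throwable cause = e.getCause();
                last = (cause instanceof Exception) ? (Exception) cause : e;
            }
        }
        throw last; // every attempt failed or timed out
    }
}
```

Retrying is only safe when the operation can be repeated without side effects (for example, a read). For something like a payment, a blind retry could apply the operation twice, which is one of the scenarios where a retry mechanism is not suitable.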

5. Circuit breaker mechanism

In addition to timeout and retry settings, the circuit breaker mechanism is also very important. Circuit breaking means that the system automatically collects the resource usage and performance indicators of the services it depends on. When a dependent service deteriorates, or the number of failed calls reaches a certain threshold, the system fails fast and immediately switches to a backup service. The most commonly used frameworks for flow control, circuit breaking, and degradation are Netflix's Hystrix and alibaba's Sentinel.
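The fail-fast behavior described above can be sketched as a small state machine. This is a simplified illustration and not how Hystrix or Sentinel are implemented: the class name `SimpleCircuitBreaker`, the use of consecutive failures as the only indicator, and the injected clock are all assumptions made to keep the example short.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** A minimal circuit breaker with a fallback (illustrative sketch, hypothetical name). */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to fail fast before a trial call
    private final LongSupplier clock;   // current time in ms, injectable for tests
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Calls primary unless the breaker is open; uses fallback on failure or fail-fast. */
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        // synchronized keeps the sketch simple, at the cost of serializing calls.
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get(); // fail fast: do not touch the unhealthy service
            }
            state = State.HALF_OPEN;   // cool-down elapsed: allow one trial call
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

While the breaker is open, callers get the fallback immediately instead of waiting on a service that is already struggling, which also gives that service room to recover.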

6. Asynchronous calls

With an asynchronous call, we do not need to wait for the final result: we can return a response as soon as the user's request has been accepted and do the actual processing later. This is used a lot in flash-sale scenarios. However, going asynchronous may require appropriate changes to the business process. For example, after a user submits an order, we cannot immediately tell the user that the order succeeded. Only after the message queue's order consumer has really processed the order, or even after the goods have left the warehouse, do we notify the user by email or SMS that the order succeeded. Besides implementing asynchrony inside the program, we often use message queues. Through asynchronous processing, message queues can improve system performance (peak shaving, reduced response time) and reduce system coupling.
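The order flow described above can be sketched in a few lines. Here a single-thread executor stands in for the message queue and its consumer, and an in-memory list stands in for the email or SMS notification; both substitutions, and the name `AsyncOrderService`, are assumptions made only to keep the example self-contained.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Accept the order now, process it later (illustrative sketch, hypothetical name). */
final class AsyncOrderService {
    // Stand-in for a message queue plus its consumer.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stand-in for email or SMS notifications.
    private final List<String> notifications = new CopyOnWriteArrayList<>();

    /** Returns at once with "accepted"; it does not claim that the order succeeded. */
    String submitOrder(String orderId) {
        worker.execute(() -> processOrder(orderId));
        return "Order " + orderId + " accepted, processing";
    }

    private void processOrder(String orderId) {
        // Placeholder for the real work: validate, write to the database, and so on.
        // Only after that work is done is the user told that the order succeeded.
        notifications.add("Order " + orderId + " succeeded");
    }

    List<String> notifications() {
        return notifications;
    }

    /** Stops accepting work and waits for queued orders to finish. */
    boolean shutdownAndWait(long timeoutMillis) throws InterruptedException {
        worker.shutdown();
        return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Note the change to the business process: the immediate response only says the order was accepted, and the success message arrives separately once processing has finished. An in-process executor loses queued work if the process dies, which is one reason a real message queue is usually preferred.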

7. Use cache

If our system has relatively high concurrency and we rely only on the database, the database may go down when a large number of requests hit it directly. Use a cache for hot data; because the cache is stored in memory, it is very fast!
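A minimal sketch of the idea: look in memory first and go to the database only on a miss. The `ReadThroughCache` name is made up for this example, and the `Function` passed to the constructor is a stand-in for a real database query. A production cache (Redis, for instance) would also need expiry and a size limit, which this sketch leaves out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** In-memory cache in front of a slow lookup (illustrative sketch, hypothetical name). */
final class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database; // stand-in for the real database query

    ReadThroughCache(Function<K, V> database) {
        this.database = database;
    }

    /** Returns the cached value, loading it from the database only on a miss. */
    V get(K key) {
        return cache.computeIfAbsent(key, database);
    }

    /** Drops a cached entry, for example after the underlying row changes. */
    void invalidate(K key) {
        cache.remove(key);
    }
}
```

Repeated reads of the same hot key reach the database once; every later read is served from memory until the entry is invalidated.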

8. Other

  1. Prioritize better hardware for core applications and services.
  2. Add monitoring and alerting on system resource usage.
  3. Keep backups, and roll back when necessary.
  4. Gray release: divide the server cluster into several parts and release to only some of the machines each day. If they run stably with no faults, continue releasing to more machines the next day, until the whole cluster has been released over several days. If a problem is found along the way, only the servers already released need to be rolled back.
  5. Regular inspection/replacement of hardware: if you are not using a purchased cloud service, the hardware needs to be inspected regularly. Hardware that needs replacing or upgrading must be replaced or upgraded in time.
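For the gray release item above, the "divide the cluster into several parts" step can be sketched as a simple batching helper. The class name `GrayReleasePlan` and the fixed batch size are assumptions for illustration; real rollout tooling would also gate each batch on health checks.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a cluster into release batches (illustrative sketch, hypothetical name). */
final class GrayReleasePlan {

    private GrayReleasePlan() {}

    /** Returns consecutive batches of at most batchSize servers each. */
    static List<List<String>> batches(List<String> servers, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < servers.size(); i += batchSize) {
            int end = Math.min(i + batchSize, servers.size());
            result.add(new ArrayList<>(servers.subList(i, end)));
        }
        return result;
    }
}
```

Each batch would be released on its own day; if a fault shows up after a batch, only the batches released so far need to be rolled back.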

4. Summary




Origin blog.csdn.net/Java_Caiyo/article/details/112388277