How do you design a highly available system? What areas should you consider?

This article is included in the author's open source project JavaGuide: github.com/Snailclimb (69K+ stars; a Java learning and interview guide covering the core knowledge that most Java programmers need to master). If you find it useful, feel free to give it a star as encouragement!

This is a short article about a question that comes up often in interviews. It covers the following:

The definition of high availability
What can make a system unavailable?
Some ways to improve system availability. Each is only mentioned briefly here; the details will be covered in future articles. Take rate limiting: you need to understand what rate limiting is, why you need it, how to do it, and the principles behind it.

What is high availability? How is availability measured?

High availability means that a system is available and able to serve us most of the time. It also means that the service remains available even during a hardware failure or a system upgrade.

We usually judge a system's availability by how many nines it has. For example, 99.9999% means that the system is unavailable for only 0.0001% of its total running time, which is very, very high availability! Of course, a system with poor availability may not reach even a single 9.
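To make the nines concrete, here is a minimal sketch that turns an availability percentage into the downtime it allows per year. The class and method names are my own for illustration, and it assumes a 365-day year. For instance, 99.9% leaves roughly 8.76 hours of downtime a year, while 99.9999% leaves only about 31.5 seconds.

```java
import java.time.Duration;

final class AvailabilityBudget {
    private static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60; // assumes a 365-day year

    /** Downtime allowed per year for an availability given as a percentage, e.g. 99.9. */
    static Duration downtimePerYear(double availabilityPercent) {
        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        long millis = Math.round(unavailableFraction * SECONDS_PER_YEAR * 1000);
        return Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        // Prints each availability level next to its yearly downtime budget.
        for (double availability : new double[] {99.0, 99.9, 99.99, 99.999}) {
            System.out.println(availability + "% -> " + downtimePerYear(availability));
        }
    }
}
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every additional nine is so much harder to reach.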

What can make a system unavailable?

Hacker attacks.
Hardware failure, such as a broken server.
A surge in concurrency or user requests that takes down all or part of the service.
Poorly written code that causes a memory leak or some other problem that makes the program hang.
A critical component of the site architecture, such as Nginx or the database, suddenly becoming unavailable.
Natural disasters or vandalism.
......

How can you improve system availability?

1. Pay attention to code quality and test rigorously

I think this is by far the most important point. Code quality problems such as memory leaks and circular dependencies are fairly common, and they do great damage to system availability. Everyone loves to talk about rate limiting, degradation, and circuit breaking, but I think guarding code quality at the source is the first thing to do. How do you improve code quality? Code review is a practical approach. Do not begrudge the extra hour a day it takes, because it can do a great deal!

In addition, here are some tools I recommend that are effective in practice for improving code quality:

SonarQube: helps you write cleaner, more secure code! (PS: the projects I currently work on basically all use this plugin.)
Arthas, Alibaba's open source Java diagnostic tool, is also a very good choice.
IDEA's built-in code analysis tools are also very, very good for code scanning.

2. Use clusters to reduce single points of failure

Take Redis, a common example. How do we keep our Redis cache available? The answer is to use a cluster and avoid a single point of failure. If we use a single Redis instance as the cache, the entire cache service may go down when that instance goes down. With a cluster, even if one Redis instance goes down, another Redis instance takes over in less than a second.
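The idea of "another instance takes over" can be sketched on the client side as follows. This is only an illustration under my own assumptions: FailoverCache is a hypothetical class, and each node is modeled as a plain function from key to value. It is not how Redis itself does it. In a real Redis deployment, replica promotion happens on the server side and the client library handles the switch for you.

```java
import java.util.List;
import java.util.function.Function;

final class FailoverCache {
    // Each node is modeled as a lookup function that throws when the node is down.
    private final List<Function<String, String>> nodes;

    FailoverCache(List<Function<String, String>> nodes) {
        this.nodes = nodes;
    }

    /** Tries the nodes in order and returns the first successful answer. */
    String get(String key) {
        RuntimeException lastFailure = null;
        for (Function<String, String> node : nodes) {
            try {
                return node.apply(key);
            } catch (RuntimeException e) {
                lastFailure = e; // this node is unavailable, so try the next one
            }
        }
        throw new IllegalStateException("all cache nodes are unavailable", lastFailure);
    }
}
```

With a single node in the list, one failure makes every lookup fail. With two or more, the caller only sees an error when all of them are down at the same time.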

3. Rate limiting

Flow control: the principle is to monitor metrics such as the application's QPS or number of concurrent threads, and to control traffic once a specified threshold is reached, so that the system is not overwhelmed by a momentary traffic spike and the application stays highly available. (From the alibaba Sentinel wiki.)
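As a rough illustration of "control traffic once a threshold is reached", here is a minimal fixed-window rate limiter. It is my own sketch, not Sentinel's API: the class name, the constructor parameters, and the injected clock are all assumptions made so the example is small and easy to test.

```java
import java.util.function.LongSupplier;

final class FixedWindowRateLimiter {
    private final int maxRequestsPerWindow;
    private final long windowMillis;
    private final LongSupplier clock; // supplies the current time in milliseconds
    private long windowStart;
    private int requestsInWindow;

    FixedWindowRateLimiter(int maxRequestsPerWindow, long windowMillis, LongSupplier clock) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;      // a new window begins
            requestsInWindow = 0;   // so the counter starts over
        }
        if (requestsInWindow < maxRequestsPerWindow) {
            requestsInWindow++;
            return true;
        }
        return false; // threshold reached: reject instead of overloading the service
    }
}
```

In production code you would pass System::currentTimeMillis as the clock, for example new FixedWindowRateLimiter(100, 1000, System::currentTimeMillis) for a limit of 100 requests per second. A fixed window can let through a burst at the boundary between two windows, which is one reason real frameworks use sliding windows or token buckets instead.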

4. Set timeouts and a retry mechanism

Once a user request goes unanswered for longer than a certain time, throw an exception. This is very important: many online system failures are caused by missing or incorrect timeout settings. When calling third-party services, it is especially important to set timeouts and a retry mechanism. The RPC frameworks we typically use come with their own timeout and retry configuration. Without a timeout, requests respond slowly and may even pile up until the system can no longer process them. The retry count is generally set to 3. Retrying more than that brings no benefit and only increases the load on the server (and in some scenarios retrying on failure is not appropriate at all).
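Here is a minimal sketch of a call wrapped in a timeout and a bounded number of attempts, using only java.util.concurrent. TimeoutRetry and its parameters are hypothetical names of my own; an RPC framework would normally give you this through configuration rather than code like this.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutRetry {
    /** Runs the task with a per-attempt timeout, trying at most maxAttempts times. */
    static <T> T call(Callable<T> task, Duration timeout, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            Exception lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = executor.submit(task);
                try {
                    return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupt the slow attempt before retrying
                    lastFailure = e;
                } catch (ExecutionException e) {
                    lastFailure = e;     // the task itself threw, so retry
                } catch (InterruptedException e) {
                    future.cancel(true);
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            throw new IllegalStateException("gave up after " + maxAttempts + " attempts", lastFailure);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Retrying is only safe when the operation can be repeated without side effects, such as a read or an idempotent write. A payment request that timed out may still have gone through, which is the kind of scenario where a retry mechanism does not fit.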

5. Circuit breaker mechanism

Besides timeouts and retries, the circuit breaker mechanism is also very important. A circuit breaker means that the system automatically collects the resource usage and performance metrics of the services it depends on. When a dependency deteriorates or the number of failed calls reaches a threshold, the system fails fast and immediately switches to a backup service. Commonly used frameworks for flow control, circuit breaking, and degradation are Netflix's Hystrix and Alibaba's Sentinel.
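To show the fail-fast idea, here is a minimal count-based circuit breaker. It is a sketch under my own assumptions (the class name, the consecutive-failure threshold, and the injected clock are all hypothetical) and it is much simpler than what Hystrix or Sentinel actually do.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // supplies the current time in milliseconds
    private int consecutiveFailures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Calls the primary service, or the fallback while the breaker is open. */
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (open) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get(); // fail fast without touching the primary
            }
            open = false; // cool-down elapsed: allow one trial call through
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0; // the dependency has recovered
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;          // too many failures: stop calling the primary
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

The point of opening the breaker is that a struggling dependency stops receiving traffic for a while, which gives it room to recover and keeps the caller's threads from piling up on slow calls. The synchronized method keeps the sketch simple but serializes all calls, so a real implementation would track state with less locking.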

6. Asynchronous calls

With an asynchronous call, we do not need to wait for the final result, so we can return to the user immediately after accepting the request and do the actual processing afterwards. Flash sale scenarios use this a lot. However, going asynchronous may require matching changes to the business process. For example, after a user submits an order, you cannot immediately tell the user that the order succeeded. Only after the order consumer on the message queue has actually processed the order, or even after the goods have left the warehouse, do you notify the user by email or SMS that the order succeeded. Besides making calls asynchronous within the program, we also often use message queues. Asynchronous processing through a message queue improves system performance (peak shaving, shorter response times) and reduces coupling between systems.
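The order flow described above can be sketched with an in-memory queue standing in for a real message queue. Everything here is an assumption for illustration: AsyncOrderService is a hypothetical class, the order is just a string ID, and the notifier callback stands in for sending an email or SMS.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

final class AsyncOrderService {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    AsyncOrderService(Consumer<String> notifier) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String orderId = queue.take(); // blocks until an order arrives
                    // Hypothetical slow work would go here: deduct stock, persist the order.
                    notifier.accept(orderId);      // only now is the user told it succeeded
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdown requested: stop the worker
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Queues the order and returns at once; the result is not known yet. */
    String submit(String orderId) {
        queue.add(orderId);
        return "ACCEPTED";
    }

    void shutdown() {
        worker.interrupt();
    }
}
```

Note that submit returns "ACCEPTED" rather than "SUCCESS". That is the business-process change mentioned above: the user is told the order was received, and the success notification arrives later from the consumer.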

7. Use a cache

If our system has high concurrency and we rely on the database alone, a flood of requests hitting the database directly may bring it down. Use a cache for hot data. Because the cache lives in memory, it is very fast!
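A minimal sketch of the idea: look in memory first and only go to the database on a miss. ReadThroughCache is a hypothetical class of my own, and the database is modeled as a plain function. It has no expiry or size limit, which a real cache such as Redis would give you.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database; // stands in for a slow database lookup

    ReadThroughCache(Function<K, V> database) {
        this.database = database;
    }

    /** Returns the cached value, loading it from the database only on a miss. */
    V get(K key) {
        return cache.computeIfAbsent(key, database);
    }

    /** Drops a cached entry, for example after the underlying row has changed. */
    void invalidate(K key) {
        cache.remove(key);
    }
}
```

After the first lookup of a hot key, later requests for it never reach the database until the entry is invalidated, which is what protects the database under high concurrency.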

8. Other

Give core applications and services priority for better hardware.
Monitor system resource usage and set up alerts.
Take backups, and roll back when necessary.
Gray release: divide the server cluster into several parts and release to only some of the machines each day. If they run stably with no failures, release to the next part the following day, and so on for several days until the whole cluster has been released. If a problem is found along the way, only the servers that have already been released need to be rolled back.
Regular hardware inspection and replacement: if you do not use cloud services, you still need to check the hardware regularly and replace or upgrade whatever needs it.
..... (I will add more as I think of it! Additions are also welcome!)
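The gray release item in the list above can be sketched as a batch-by-batch rollout that stops at the first unhealthy batch. This is only an illustration: GrayRelease is a hypothetical class, and deployAndObserve is an assumed callback that releases one batch, watches it for the observation period, and reports whether it stayed healthy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class GrayRelease {
    /**
     * Releases servers in batches of batchSize and stops at the first unhealthy batch.
     * Returns the servers that need a rollback, or an empty list if everything succeeded.
     */
    static List<String> rollout(List<String> servers, int batchSize,
                                Predicate<List<String>> deployAndObserve) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<String> released = new ArrayList<>();
        for (int start = 0; start < servers.size(); start += batchSize) {
            int end = Math.min(start + batchSize, servers.size());
            List<String> batch = servers.subList(start, end);
            released.addAll(batch);
            if (!deployAndObserve.test(batch)) {
                return released; // only the machines released so far are affected
            }
        }
        return List.of(); // the whole cluster was released without problems
    }
}
```

The servers that were never reached keep running the old version, so a bad release affects only part of the cluster and the rollback is limited to the returned list.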

Summary

Recommended open source projects

Other open source projects from the author:

JavaGuide: a Java learning and interview guide covering the core knowledge that most Java programmers need to master.
springboot-guide: a Spring Boot tutorial suitable for beginners and for experienced developers to consult (maintained in my spare time; contributions are welcome).
programmer-advancement: some good habits that I think technical people should have!
spring-security-jwt-guide: start from zero! Spring Security with JWT (including permission verification), backend code.