How to improve system stability?

1. Criteria for judging system stability

Before we talk about guaranteeing stability, let's look at a term the industry mentions often: SLA. The industry likes to use the SLA (service level agreement) to measure system stability. For Internet companies, it is a mutually recognized agreement between the website and its users.

We often hear Internet companies chant slogans such as "we must reach three nines or four nines this year", that is, 99.9% or 99.99%, or even five nines, 99.999%.
The nines describe the share of the year during which the service is available: the more nines, the longer the available time and the more reliable the service. Take the standard 99.99% as an example: the allowed downtime is 52.6 minutes per year, which averages to only about 1 minute per week, so a single bout of network jitter may be enough to use it up.
Service stability is generally calculated as (total number of requests - number of failed requests) / total number of requests, for example (100 - 5) / 100 = 95%. The downtime corresponding to several levels is listed below.

1 year = 365 days = 8760 hours
Three nines   99.9%:    8760 * 0.1% = 8760 * 0.001 = 8.76 hours
Four nines    99.99%:   8760 * 0.0001 = 0.876 hours = 0.876 * 60 = 52.6 minutes
Five nines    99.999%:  8760 * 0.00001 = 0.0876 hours = 0.0876 * 60 = 5.26 minutes
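
The arithmetic in this section can be reproduced with a short script. This is only a minimal sketch: the function names `success_rate` and `yearly_downtime_minutes` are made up for illustration, and the year is assumed to be 365 days, as in the table.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, the same year length used in the table


def success_rate(total_requests: int, failed_requests: int) -> float:
    """Return (total - failed) / total as a fraction between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def yearly_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction * 60


if __name__ == "__main__":
    print(f"success rate: {success_rate(100, 5):.0%}")
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {yearly_downtime_minutes(target):.2f} minutes per year")
```

Dividing the yearly figure by 52 gives the weekly budget, which for 99.99% is roughly one minute.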

2. The significance of improving system stability

I think this is a very important question: what is the purpose of spending so many resources and so much time and energy, and what is the significance of ensuring system stability?

  • It is not about helping the company make more money, but about helping it lose less money (e-commerce, trading systems).
  • It improves the user's experience of the system and reduces user churn (user verdicts range from "smooth" to "garbage", and from "I'll use it again" to "I'll use a competing product").

3. The essence of improving system stability

  • MTTF (Mean Time To Failure) is the average time the system operates without failure: the mean of all periods from when the system starts operating normally to when a failure occurs. MTTF = ∑T1/N
  • MTTR (Mean Time To Repair) is the mean of the periods from when the system fails to when the repair ends. MTTR = ∑(T2+T3)/N
  • MTBF (Mean Time Between Failures) is the mean of the periods between two consecutive system failures. MTBF = ∑(T2+T3+T1)/N

In these formulas, T1 is a period of normal operation, T2 + T3 is the period from a failure to the end of its repair, and N is the number of such periods.

  • Reliability: The metric is the mean time between failures (MTBF), which reflects how long a component runs before it fails and needs repair. Improving reliability means reducing the number of system failures, that is, having no failures or as few as possible, which increases MTTF.
  • Availability: The metric is the share of a period during which the system runs without failure, which depends on both MTTF and MTTR. Improving availability means reducing the time needed to recover from a failure, that is, reducing MTTR.

The essence of system stability is to improve reliability and availability: increase the time to failure (MTTF) and reduce the failure recovery time (MTTR), so as to ensure business continuity and reduce business loss.
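
To make the three metrics concrete, here is a minimal sketch that computes them from a list of failure cycles. The `FailureCycle` type, the `stability_metrics` function, and the sample numbers are all invented for illustration. Availability is taken as MTTF / (MTTF + MTTR), a common convention that the text above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class FailureCycle:
    """One cycle: a period of normal operation followed by a failure and its repair."""
    uptime_hours: float  # T1: from the start of normal operation to the failure
    repair_hours: float  # T2 + T3: from the failure to the end of the repair


def stability_metrics(cycles: list[FailureCycle]) -> dict[str, float]:
    """Compute MTTF, MTTR, MTBF, and availability over the given cycles."""
    if not cycles:
        raise ValueError("at least one failure cycle is required")
    n = len(cycles)
    mttf = sum(c.uptime_hours for c in cycles) / n
    mttr = sum(c.repair_hours for c in cycles) / n
    mtbf = mttf + mttr  # equals the sum of (T1 + T2 + T3) divided by N
    return {"mttf": mttf, "mttr": mttr, "mtbf": mtbf, "availability": mttf / mtbf}


# Made-up sample data: three failures over roughly three months.
sample_metrics = stability_metrics([
    FailureCycle(uptime_hours=700, repair_hours=2),
    FailureCycle(uptime_hours=500, repair_hours=1),
    FailureCycle(uptime_hours=900, repair_hours=3),
])
print(sample_metrics)
```

With this definition, availability rises either when failures become rarer (larger MTTF) or when recovery becomes faster (smaller MTTR), which matches the two levers described above.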

4. Cognitive traps in improving system stability

This section briefly covers some common traps we fall into when maintaining a system, and how we can upgrade our thinking.

Trap 1: My system has never had an accident, so it will never fail

Continuity thinking: People usually assume that the past, present, and future are continuous, but the real world is discontinuous, and continuity is only a cognitive assumption. The default human way of thinking is induction, which only applies within the same curve, where there are no abrupt changes. Our system is constantly changing. Once the underlying assumptions no longer hold, inferring the future from the past is no longer valid.
Cognitive upgrade: Recognize the limits of continuity thinking, switch to discontinuous thinking, and break out of rigid thinking.

Trap 2: The network has a problem, the infrastructure has a problem, there is nothing I can do, and it's not my fault

Design for failure: Our system is built on infrastructure such as hardware and operating systems, and relies on middleware, databases, networks, and third-party systems. Any of these may fail. Since we have no choice but to rely on these dependencies, we must design for their failure.
Cognitive upgrade: Everything may fail, so failure scenarios must be considered.
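
One common way to design for a failing dependency is to retry a few times and then degrade to a fallback. The following is a minimal sketch of that idea, not a production implementation. `call_with_fallback`, `fetch_price_from_remote`, and `cached_price` are hypothetical names, and the remote call is hard-coded to fail so that the fallback path is exercised.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_seconds: float = 0.1,
) -> T:
    """Try `primary` up to `retries + 1` times, then degrade to `fallback`."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Back off before the next attempt; no wait after the final one.
            if attempt < retries:
                time.sleep(delay_seconds * (2 ** attempt))
    return fallback()


def fetch_price_from_remote() -> float:
    """Hypothetical dependency that is currently down."""
    raise ConnectionError("remote pricing service unreachable")


def cached_price() -> float:
    """Stale but usable value, standing in for a local cache."""
    return 9.99


price = call_with_fallback(fetch_price_from_remote, cached_price, retries=2, delay_seconds=0.01)
print(price)
```

Catching every `Exception` keeps the sketch short. A real system would usually retry only on errors that are known to be transient.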

Trap 3: I have considered these abnormal scenarios and designed for them specifically, so there will be no problem

Fault drill verification: Whether our designs actually work should be verified by experiment, as in physics and chemistry; anything that has not been verified cannot be trusted. We need to simulate failure scenarios, verify the reliability design and the availability design according to each scenario's probability of occurrence, degree of harm, and consequences, and prove that the system behaves as we expect.
Cognitive upgrade: Whether a design is effective must be tested by fault drills.
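
A fault drill can be as small as injecting a failure into one dependency and checking that the service degrades the way the design says it should. The sketch below assumes a hypothetical `get_recommendations` service whose intended degraded behavior is to return an empty list. `FlakyDependency` and `run_drill` are likewise illustrative names.

```python
class FlakyDependency:
    """Wraps a real call and lets a drill force it to fail on demand."""

    def __init__(self, real_call):
        self._real_call = real_call
        self.inject_failure = False
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.inject_failure:
            raise TimeoutError("injected fault: dependency timed out")
        return self._real_call()


def get_recommendations(dependency) -> list:
    """Service under test: falls back to an empty list when the dependency fails."""
    try:
        return dependency()
    except Exception:
        return []  # degraded but still a valid response


def run_drill() -> dict:
    """Call the service once normally and once with the fault injected."""
    dependency = FlakyDependency(lambda: ["item-1", "item-2"])
    normal = get_recommendations(dependency)
    dependency.inject_failure = True
    degraded = get_recommendations(dependency)
    return {"normal": normal, "degraded": degraded, "calls": dependency.calls}


drill_result = run_drill()
print(drill_result)
```

The drill passes only if the degraded response is what the design promised. If the injected fault had propagated as an exception, the design would have been shown not to work.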

Trap 4: This failure scenario is too unlikely to ever happen

Murphy's Law: It has four main points:

  • Nothing is as simple as it seems;
  • Everything will take longer than you expected;
  • Whatever can go wrong will go wrong;
  • If you worry about something happening, it's more likely to happen.

The core of Murphy's Law is that for any event with a probability greater than zero, we cannot assume that it will not happen.
Cognitive upgrade: Whatever you worry about will happen sooner or later, so give up wishful thinking.
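
A quick calculation shows why "too unlikely" is a weak argument at scale. Assuming each request independently triggers the failure with probability p, the chance of seeing it at least once in n requests is 1 - (1 - p)^n. The function name and the sample numbers below are illustrative only.

```python
import math


def probability_at_least_once(p: float, n: int) -> float:
    """Probability that an event with per-trial probability p occurs at least once in n independent trials."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # 1 - (1 - p) ** n, computed with log1p/expm1 to stay accurate for very small p.
    return -math.expm1(n * math.log1p(-p))


# A one-in-a-million failure, over ten million requests.
print(f"{probability_at_least_once(1e-6, 10_000_000):.5f}")
```

Under the independence assumption, a one-in-a-million failure is close to certain to show up once traffic reaches ten million requests.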

Trap 5: There have been a lot of alarms lately, but no users have complained, so let's look at it in a few days

Hayne's Law: Any safety accident is preventable. Hayne's Law is a law about flight safety from the aviation industry. It states that behind every serious accident there are 29 minor accidents, 300 near-miss precursors, and 1,000 hidden hazards.

According to Hayne's Law, when a major accident occurs, while handling the accident itself we must also promptly investigate and deal with the "accident symptoms" and "accident signs" of similar problems, so as to prevent similar problems from recurring, remove the hidden dangers of another major accident in time, and nip problems in the bud.

Hayne's Law emphasizes two points: first, accidents are the result of accumulated quantity; second, no matter how good the technology and how complete the regulations, at the level of actual operation they cannot replace people's own competence and sense of responsibility.
Cognitive upgrade: Don't be careless; quantitative change turns into qualitative change.

5. Specific methods to improve system stability

A lot has been covered above about criteria and significance; what follows is the practical part. It is a summary I have made from my own point of view.

6. Summary

The system is like a car running at high speed. New requirements and new problems may be waiting for us at any time. We cannot stop the car to fix problems, so we can only fix them while it is running. This is a very risky operation, so we need to do every aspect well to make sure nothing goes wrong. Improving system stability does not happen overnight; it is a long-term process, so stay alert and solve problems promptly.

 

 

This is a person from the back mountain, and I am a guest in front of me. Drunk Dance Jingge half a volume of books, sitting in the well to talk about the vastness of the sky. Sorry for the bad writing!


Origin blog.csdn.net/qq_42859864/article/details/128707329