Stability Construction Framework | Jingdong Logistics Technical Team

1. Why do we need to build stability

1. The necessity of stability construction derived from the law of entropy increase

In physics, "entropy" describes the degree of disorder of a system. Rudolf Clausius formulated the law of entropy increase: in a closed system, absent any external force, all matter tends to move from an ordered state to a disordered one.

If we don't want a system to become chaotic, what can we do? The answer is to fight the law of entropy increase, and the way to fight it is to apply external force that brings the system back from chaos to order. For example:

In the figure below, we use an "entropy" value to measure the degree of chaos of a "dice system". A value of 1 (the maximum) means "most chaotic": we cannot control the result of rolling the dice, each roll randomly yields 1 to 6, and the system's behavior is unstable. A value of 1/6 (the minimum) means "most orderly": we can control the result of rolling the dice, and the system's behavior is stable. To make every result a 6, we can cheat (that is, apply external force) so that every roll of the dice comes up 6.

The law of entropy increase also applies to software systems. A software system is orderly when it is first released, and its entropy is close to the minimum. As it keeps iterating, it gradually becomes chaotic and fragile, online problems become frequent, and its entropy tends toward the maximum. We need external force, namely stability governance, to reduce the system's entropy and restore the system to stability.

2. Significance of Stability Construction

As shown in the figure below, system instability will result in real money losses. Therefore, the significance of stability construction is: not to make more money for the business, but to keep the business from losing money!

3. Stability measurement formula

① Formula

System stability is measured by the following formula: Availability = MTTF / (MTTF + MTTR)

② Formula description

MTTF (Mean Time To Failure) is the average time the system runs without failure, that is, the average length of the period from the start of normal operation to the next failure: MTTF = ΣT1 / N.

MTTR (Mean Time To Repair) is the average length of the period from the failure of the system to the end of the repair: MTTR = Σ(T2 + T3) / N.

Here N is the number of failures, T1 is a period of normal operation, T2 is the time taken to detect a failure, and T3 is the time taken to repair it.
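The formula can be sketched in a few lines of Python. This is only an illustration: the `Incident` record and its field names are assumptions, with T1 taken as the running time before a failure, T2 as the time to detect it, and T3 as the time to repair it. The sample durations are made up.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One failure cycle; all durations in minutes (hypothetical record layout)."""
    t1_uptime: float   # T1: time the system ran normally before this failure
    t2_detect: float   # T2: time from the failure to its detection
    t3_repair: float   # T3: time from detection to the end of the repair


def mttf(incidents):
    # MTTF = sum(T1) / N
    return sum(i.t1_uptime for i in incidents) / len(incidents)


def mttr(incidents):
    # MTTR = sum(T2 + T3) / N
    return sum(i.t2_detect + i.t3_repair for i in incidents) / len(incidents)


def availability(incidents):
    # Availability = MTTF / (MTTF + MTTR)
    f, r = mttf(incidents), mttr(incidents)
    return f / (f + r)


# Made-up sample data: two failures.
incidents = [
    Incident(t1_uptime=9990, t2_detect=4, t3_repair=6),
    Incident(t1_uptime=19980, t2_detect=5, t3_repair=15),
]
print(f"MTTF={mttf(incidents):.0f} min, MTTR={mttr(incidents):.0f} min, "
      f"availability={availability(incidents):.4%}")
```

With these sample values MTTF is 14985 minutes and MTTR is 15 minutes, which gives an availability of 99.9%.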

③ Formula quantification

Availability is usually expressed as an SLA of "how many nines", corresponding to the following table:
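The relation between the number of nines and the downtime it allows can be computed directly. The following is a minimal sketch, assuming a 365-day year; the function name is chosen here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability_pct = 100 * (1 - 10 ** -n)
    print(f"{n} nines ({availability_pct:.3f}%): "
          f"{allowed_downtime_minutes(n):.1f} minutes of downtime per year")
```

Under this assumption, three nines (99.9%) leaves a budget of roughly 525.6 minutes of downtime per year, and each additional nine divides that budget by ten.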

Frequently Asked Questions

Question: Along which dimension should an SLA be defined: interface, application, or business?

Answer: Any of them works, as long as it is clearly stated whether it is an interface SLA, an application SLA, or a business SLA. Note, however, that an application SLA should equal the worst SLA among its core interfaces, and a business SLA should equal the worst SLA along its golden link.
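The "worst SLA" rule amounts to taking a minimum. A small sketch, where the interface names and availability figures are hypothetical:

```python
# Hypothetical measured availabilities of an application's core interfaces.
core_interface_sla = {
    "createOrder": 0.9995,
    "queryOrder": 0.9999,
    "cancelOrder": 0.9990,
}


def application_sla(interface_sla: dict) -> float:
    """The application SLA is the worst SLA among its core interfaces."""
    return min(interface_sla.values())


print(application_sla(core_interface_sla))
```

The same function applies to a business SLA if the dictionary holds the SLAs of the applications along the golden link.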

Question: How long should the SLA calculation period be?

Answer: Any period works, as long as the calculation period is stated clearly. Using a year as the unit is generally the most representative.

4. Common mistakes

① Don't assume that "a distributed environment is stable"

Mistaken belief: the network is reliable, bandwidth is unlimited, the network topology never changes, latency is 0, and transmission overhead is 0.

Reality: the network jitters, bandwidth has an upper limit, machines going down cause topology changes, responses sometimes time out, and so on.
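One common way to code for this reality is to treat every remote call as something that may time out, and to retry with backoff. The sketch below is illustrative only: `call_with_retry` and `flaky_call` are hypothetical names, and the retry count and delays are assumed values.

```python
import random
import time


def call_with_retry(call, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Invoke `call()`; on TimeoutError retry with exponential backoff and jitter.

    `call` is any zero-argument function that enforces its own timeout and
    raises TimeoutError when the remote side does not answer in time.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Back off 0.1s, 0.2s, 0.4s ... plus jitter, so retries do not pile up.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


# Stand-in for a flaky remote call: times out twice, then succeeds.
outcomes = iter([TimeoutError(), TimeoutError(), "ok"])


def flaky_call():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


print(call_with_retry(flaky_call, sleep=lambda seconds: None))
```

Retrying is only safe for idempotent operations; a non-idempotent call that timed out may already have taken effect on the remote side.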

② Replace "deterministic thinking" with "uncertainty thinking"

Mistaken belief: follow rules of thumb, "if x then y". Example: every swan I have seen is white, so all swans in the world are white; the system has been working fine, so there will be no problems in the future.

Instead: the world is uncertain, "if x then maybe y". Example: black swans exist.

③ Don't "shift the blame"; have an "ownership spirit"

Mistaken belief: the failure happened because their system is down, so we only need to call and inform them, then wait for recovery.

Instead: think ahead about how to keep our users running as normally as possible when a system we depend on fails; when a failure occurs, work together to find a solution.
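Planning for a failed dependency often means serving a degraded answer instead of an error. The following is a minimal sketch of that idea, not a prescribed design: `with_fallback`, the in-memory cache, and the delivery-estimate functions are all hypothetical.

```python
_last_good = {}  # last successful response per key, used as a stale fallback


def with_fallback(key, fetch, default):
    """Return fetch(); if the dependency fails, serve the last good value or `default`."""
    try:
        value = fetch()
    except Exception:
        # The dependency is down: degrade instead of failing the user's request.
        return _last_good.get(key, default)
    _last_good[key] = value
    return value


def delivery_estimate_ok():
    return "arrives tomorrow"


def delivery_estimate_down():
    raise ConnectionError("dependency unavailable")


print(with_fallback("eta:123", delivery_estimate_ok, default="estimate unavailable"))
print(with_fallback("eta:123", delivery_estimate_down, default="estimate unavailable"))
print(with_fallback("eta:456", delivery_estimate_down, default="estimate unavailable"))
```

The second call returns the stale value cached by the first, and the third falls back to the default because nothing was ever cached for that key.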

2. Current status of the industry

1. Technical status

The growth of the Internet has brought ever more traffic, and architectures have kept evolving to support it: monolithic application architecture -> vertical application architecture -> distributed architecture -> SOA architecture -> microservice architecture -> Service Mesh. The currently popular microservice architecture provides mechanisms to ensure stability at both the application level and the infrastructure level:

  • Stability guarantee mechanism at the application level

Taking the Spring Cloud suite as an example, it provides many components that help ensure system stability, as shown in the following figure:

  • Stability guarantee mechanism at the infrastructure level

At the infrastructure level, there will also be some stability guarantee mechanisms, as shown in the following table:

2. Status of implementation

From what we have seen and heard, technical teams today generally adopt one of the following two approaches to stability governance:

  • A wave of stability construction run as a campaign

When online failures become frequent, a "stability governance project" is usually set up to define some governance points, draw up a plan, and then run a campaign. Stability generally improves significantly after such governance, but because it is a one-off campaign, as the business keeps iterating, stability deteriorates again in line with the law of entropy increase.

Disadvantages: no closed loop is formed. Stability improves while governance is under way and worsens once it stops, which gives the impression that the technical team is always having problems.

  • Point-by-point governance, with a dedicated closed loop for each point

For example, set up a "Slow SQL Governance Project": find slow SQL through the monitoring platform, send work orders to R&D, and assess how promptly they are handled. Or set up a "Rate Limiting Governance Project": have every interface configure rate-limiting parameters and rate-limiting alarm policies.

Disadvantages: R&D feels that there are many stability projects whose value is unclear, and sometimes just goes through the motions, so the goal of stability governance is not achieved.
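To make the rate-limiting parameters mentioned above concrete, here is a sketch of a token bucket, one widely used rate-limiting algorithm. The article does not say which algorithm or component the team uses, so the class and its `rate` and `capacity` parameters are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller should reject the request and count it for alerting


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(3)])  # a burst of three requests at t=0
fake_now[0] = 1.0                          # one second later the bucket has refilled
print(bucket.allow())
```

In the burst, the first two requests pass and the third is rejected; after one second the bucket has refilled and requests pass again. Rejected requests are the natural input for a rate-limiting alarm policy.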

3. How should stability system governance be carried out?

Stability construction is divided into three stages: prevention beforehand, stop loss during the event, and review after the event. The construction approach for each stage is as follows:

1. Prevention beforehand

Stability construction is essentially a process of fighting the law of entropy increase. Specifically, technical means (such as timeout governance, rate limiting governance, degradation governance, and slow SQL governance) are used to build countermeasures in advance for possible system failures, so that the system operates according to its design goals.

Note: There are many methods of stability governance, and each one implemented improves stability a little. You can list all known governance methods and then work through them one by one in priority order.

2. Stop loss mid-event

According to the stability measurement formula (as shown in the figure below), reducing T2 or T3 improves the SLA. Therefore, after a fault occurs, T2 and T3 should be reduced as much as possible. The way to reduce T2 is to detect system failures as soon as possible, which relies on monitoring and alarm capabilities. The way to reduce T3 is to solve problems as soon as possible: stop the loss first, then find the cause. A clear set of SOPs is required to improve efficiency.
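As an illustration of how monitoring shortens T2, the sketch below raises an alarm as soon as the error rate over a sliding window of recent requests crosses a threshold. The class name, window size, and threshold are assumptions for the example, not values from the article.

```python
from collections import deque


class ErrorRateAlarm:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.05):
        self.results = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, failed):
        self.results.append(failed)
        # Only judge once the window is full, to avoid alarms on tiny samples.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) > self.threshold


alarm = ErrorRateAlarm(window=10, threshold=0.2)
stream = [False] * 10 + [True] * 3  # ten successes followed by three failures
fired_at = [i for i, failed in enumerate(stream) if alarm.record(failed)]
print(fired_at)
```

Here the alarm fires on the third consecutive failure, when 3 of the last 10 requests have failed. A smaller window or a lower threshold detects failures sooner at the cost of more false alarms.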

3. Post-event review

The goal of the review is not to assign blame, but to prevent recurrence. The review therefore needs to track down both the direct cause and the root cause. The direct cause is what directly triggered this failure, that is, "what caused it"; the root cause lies at the level of process conventions and team understanding, for example "because the branching convention was not mainline-based, code was lost; switching to gitflow would completely avoid the problem of code loss".

An example of direct and root causes: when Chen Sheng and Wu Guang rebelled, the direct cause was heavy rain: they were going to be late, and being late was punishable by death, so they rebelled. The root cause was the harsh system of the Qin Dynasty: even without the rain, and even without Chen Sheng and Wu Guang, some Zhang Sheng or other Guang would have risen up for other reasons.

4. Governance framework of the stability system

As mentioned in the previous chapter, when we look for stability governance methods from the perspective of "prevention beforehand, stop loss during the event, and review after the event", we find that the industry has many popular methods, such as timeout governance, rate limiting governance, system isolation, routine stress testing, and slow SQL governance.

However, technical resources are always limited, and being able to devote 15% of them to stability governance is already very good. In addition, different stages of business development call for different stability methods, and the ROI of different governance methods also differs. We therefore need to answer one question: with limited R&D resources, how do we carry out stability governance step by step?

The best practice is to build a stability governance framework, fill it with stability governance methods, and choose the methods to apply now according to the stage of the business. This can be managed through the following table:

Remarks: After the stability governance framework is established, governance methods can be added or removed at any time. The value of the framework is that it gives us a panoramic view, so that we know what needs doing and what we are doing, instead of working blindly.
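One possible way to hold such a framework in code is as a list of governance methods, each tagged with its stage and a rough ROI, from which the methods for the current business stage are picked within a resource budget. Everything in this sketch is an assumption made for illustration: the fields, the stage labels, the cost and benefit figures, and the greedy selection rule are not taken from the article's table.

```python
from dataclasses import dataclass


@dataclass
class Method:
    name: str
    phase: str        # "before", "during", or "after" the incident
    stages: tuple     # business stages where the method pays off (hypothetical labels)
    cost_days: float  # estimated effort (made-up figure)
    benefit: float    # estimated benefit score (made-up); ROI = benefit / cost_days


def plan(methods, stage, budget_days):
    """Pick the highest-ROI methods for `stage` that fit in `budget_days` (greedy)."""
    candidates = [m for m in methods if stage in m.stages]
    candidates.sort(key=lambda m: m.benefit / m.cost_days, reverse=True)
    chosen, used = [], 0.0
    for m in candidates:
        if used + m.cost_days <= budget_days:
            chosen.append(m)
            used += m.cost_days
    return chosen


framework = [
    Method("timeout governance", "before", ("growth", "mature"), 5, 8),
    Method("rate limiting governance", "before", ("growth", "mature"), 8, 9),
    Method("slow SQL governance", "before", ("mature",), 10, 7),
    Method("monitoring and alerting", "during", ("startup", "growth", "mature"), 6, 9),
    Method("incident review", "after", ("startup", "growth", "mature"), 2, 5),
]

for m in plan(framework, stage="growth", budget_days=15):
    print(m.phase, m.name)
```

Adding or removing a governance method is then a one-line change to the list, and the selection can be re-run whenever the business stage or the available budget changes.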

5. Specific governance plan

Following the stability governance framework in the previous chapter, the next step is to draw up a specific plan for each governance method. The plan must form a closed loop and be integrated into the R&D process, for example:

  • The implementation plan for "slow SQL governance"
  1. Define the standard for slow SQL, that is, the execution time in ms beyond which a statement counts as slow SQL
  2. Discover slow SQL through the monitoring platform
  3. Send a governance work order to the responsible R&D owner
  4. Verify and accept the governance results
  • The implementation plan for "timeout governance"
  1. Define an appropriate timeout for each interface
  2. Check the interfaces once a week and find those with unreasonable timeouts
  3. Fix the timeouts
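The two plans above can be sketched as small closed loops. This is a rough illustration under stated assumptions: the 500 ms threshold, the record layouts, the sample data, and the rule "a timeout is unreasonable if it is missing or more than three times the observed p99 latency" are all hypothetical, and a real setup would read its statistics from the monitoring platform.

```python
SLOW_SQL_THRESHOLD_MS = 500  # step 1: the agreed definition of "slow" (assumed value)


def find_slow_sql(query_stats, threshold_ms=SLOW_SQL_THRESHOLD_MS):
    """Step 2: statements whose average execution time exceeds the threshold."""
    return [q for q in query_stats if q["avg_ms"] > threshold_ms]


def open_work_orders(slow_queries):
    """Step 3: one work order per slow statement, addressed to its owner."""
    return [
        {"owner": q["owner"], "sql": q["sql"], "avg_ms": q["avg_ms"], "status": "open"}
        for q in slow_queries
    ]


def accept(work_orders, query_stats):
    """Step 4: close the orders whose statement is no longer slow."""
    still_slow = {q["sql"] for q in find_slow_sql(query_stats)}
    for order in work_orders:
        if order["sql"] not in still_slow:
            order["status"] = "closed"
    return work_orders


def find_unreasonable_timeouts(interfaces, factor=3):
    """Timeout audit: flag interfaces whose timeout is missing or far above p99."""
    return [
        i["name"] for i in interfaces
        if i["timeout_ms"] is None or i["timeout_ms"] > factor * i["p99_ms"]
    ]


# Made-up monitoring data.
stats = [
    {"sql": "SELECT * FROM waybill WHERE remark LIKE ?", "avg_ms": 1800, "owner": "alice"},
    {"sql": "SELECT id FROM waybill WHERE id = ?", "avg_ms": 3, "owner": "bob"},
]
orders = open_work_orders(find_slow_sql(stats))
print(orders)

interfaces = [
    {"name": "queryWaybill", "timeout_ms": 200, "p99_ms": 80},
    {"name": "createWaybill", "timeout_ms": 5000, "p99_ms": 120},
    {"name": "cancelWaybill", "timeout_ms": None, "p99_ms": 60},
]
print(find_unreasonable_timeouts(interfaces))
```

Running `accept` against fresh statistics after a fix is what closes the loop: a work order is only closed once the statement no longer shows up as slow.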

6. Closing remarks

Stability governance is a long-term process, and stability work should be integrated into the R&D process. On the one hand, we must be mindful and try not to dig pits for ourselves: for example, microservices emphasize middleware isolation, so we don't share middleware across services. On the other hand, each stability issue must be solved thoroughly: for timeout governance, for example, there must be a complete specification that defines timeouts, and during R&D both new and existing interfaces should have reasonable timeouts configured that can be updated dynamically.

Author: JD Logistics Zheng Chuanzhou

Source: Reprinted from Yuanqishuo Tech by JD Cloud developer community, please indicate the source


Origin my.oschina.net/u/4090830/blog/10106419