The essence of high availability

I am Leyang, and I love risk prevention and control. I previously took part in building the Ant Glocal sites from 0 to 1, including their high-availability construction, and I am currently working on high-availability construction for Ant Security. Whether the object is a domain, a BG, or an entire site, the scope and the objects differ, but the concept of high availability is the same. In this article I share my thinking on high availability, summarized as the nPRT formula.

This article follows the logic of: what high availability is, why we need it, how to achieve it, why we do it that way, and where the risks in software lie.

1. High availability is the ability to control risk

High availability is risk-oriented design: it gives a system the ability to control risk and therefore provide higher availability.

2. Why high availability

For a company, "why do we need high availability" can be understood as "why does the company want its systems to be highly available". Taking the company as the object: internally it consists of people, software, and hardware; externally it faces customers, shareholders, and society; and it also has its own interests as a company.

The major premise of high availability: nothing is 100% reliable

  • Everything changes (the only constant is change).
  • No change is 100% reliable.
  • Conclusion: nothing is 100% reliable.

Internal causes: neither people nor things are 100% reliable

  • At the human level: people can make mistakes.
  • At the software level: software can have bugs.
  • At the hardware level: hardware can break.

From a probabilistic point of view, if an error is possible at all, then as the number of changes grows large enough, the probability that an error eventually occurs approaches 1.

External causes: without high availability, the external impact is large

  • From the customers' perspective: without high availability, customer service may be interrupted.
  • From the shareholders' perspective: without high availability, the stock price may fall.
  • From society's perspective: without high availability, social order may be affected.

Root cause (the essence): controlling risk

From the company's own perspective: control risk, protect the company's value, and avoid damaging its foundations.

3. How to achieve high availability

Achieving high availability is essentially a question of how to control risk.

1. Risk-related concepts

  • Risk: a hazard that may occur in the future but has not actually occurred yet, denoted r.
  • Failure: a hazard that has already occurred or is occurring; the result of a risk becoming reality.
  • Risk probability: the probability that a risk turns into a failure; it expresses how hard the risk is to trigger, denoted P(r).
  • Failure impact range: the harmful impact a failure causes per unit of time, denoted R(r).
  • Failure impact duration: how long the failure's impact lasts, denoted T(r).
  • Failure impact surface: the failure impact range multiplied by the failure impact duration; it expresses the total damage done by the failure, denoted F(r).
  • Risk expectation: the sum, over all risks, of each risk's probability of becoming a failure multiplied by the impact surface it would cause; it expresses the total potential harm of the risks, denoted E(r).

2. The risk expectation formula

According to the definitions in the previous section, the risk expectation formula can be written as:

E(r) = Σ P(rᵢ) × F(rᵢ) = Σ P(rᵢ) × R(rᵢ) × T(rᵢ), summed over the risks i = 1 … n

Here r stands for risk. The risk expectation falls as the number of risks n falls and as the P, R, and T of each individual risk fall; hence the name: the nPRT formula.

Note: If you want to quote the formula, please indicate the source.
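
To make the formula concrete, here is a minimal sketch in Python; the risk names and all the numbers are made up purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    p: float   # P(r): probability the risk becomes a failure
    r: float   # R(r): impact range per unit time (e.g. affected requests per minute)
    t: float   # T(r): expected impact duration (e.g. minutes)

def risk_expectation(risks: list[Risk]) -> float:
    """E(r) = sum over the n risks of P(r_i) * R(r_i) * T(r_i)."""
    return sum(x.p * x.r * x.t for x in risks)

# Hypothetical numbers purely for illustration:
risks = [
    Risk("bad config push", p=0.02, r=10_000, t=30),
    Risk("hot key overload", p=0.01, r=2_000, t=15),
]
print(risk_expectation(risks))   # shrinking n, P, R, or T all shrink E(r)
```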

3. Four factors for controlling risk (nPRT)

Reduce the number of risks, n

Stay away from the risk at its source so that it has no connection to you at all; the risk probability is then 0, and you no longer need to care how large the impact of the resulting failure would have been.

  • For example: for major holidays and promotion events, a site-wide change freeze drastically reduces the number of changes; this is a typical way of reducing the number of risks.
  • For example: if System A does not depend on Oracle at all, then System A does not need to care about any Oracle risk. Even if the President of the United States suddenly announced that Oracle could no longer be used in China, it would not matter to System A.
  • For example: take the COVID-19 pandemic. Human-to-human transmission is frightening, but if you choose not to go out today, you do not need to worry about being infected by passers-by or colleagues today.

Reduce the probability of a risk becoming a failure (i.e. make it harder for the risk to turn into a failure), P

Treat the risk as an object and set up checkpoints at every level to raise the threshold for the risk to become a failure. Don't let the tragedy of "an extra space or character slipped in and the system went down" happen easily; a small sketch follows the examples below.

  • For example: Person B wants to make a change to System C. You can require Person B to pass a change-certification exam, require offline (or simulated) testing of the change, and require code review of the change. System C can provide a way to preview the effect of the change (similar to a monitoring or trial-run mode). In case Person B intends a malicious, destructive change, you can also require independent review by someone else, and System C can add error-proofing designs for protection, and so on.
  • For example, with COVID-19: wearing a mask, washing hands frequently, and ventilating rooms all reduce the probability of infection.
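
As a rough sketch of layering such gates, the `Change` fields and gate names below are assumptions for illustration, not a real change-management API.

```python
# Each gate a change must pass lowers P, the probability that a risky change
# becomes a failure. Fields and gate names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Change:
    author: str
    passed_offline_test: bool   # offline or simulated testing
    reviewers: tuple            # code review / independent review
    previewed: bool             # dry-run or trial-mode output was inspected

def failed_gates(change: Change) -> list[str]:
    """Return the gates the change still fails; empty means it may proceed."""
    failed = []
    if not change.passed_offline_test:
        failed.append("offline/simulated test")
    if not change.previewed:
        failed.append("preview / trial run")
    if len(set(change.reviewers) - {change.author}) < 1:
        failed.append("independent review")
    return failed

print(failed_gates(Change("b", passed_offline_test=True, reviewers=("b",), previewed=False)))
```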

Reduce the failure impact range, R

Split the whole into N smaller parts and isolate them from one another. When a single part has a problem, only that part is affected: small and contained.

  • For example: distributed architecture is the classic example. A centralized system loses everything at once, while a distributed system loses only 1/N at a time.
  • For example: with COVID-19, grid-based management, restrictions on movement between provinces and cities, and nucleic-acid testing plus 14-day quarantine for cross-province travel effectively limited the spread of the virus.

Shorten the failure impact duration, T

The failure impact duration is determined by how quickly the failure is detected and how quickly the bleeding is stopped, so detect early and stop the bleeding early.

Detection comes in two forms: early warning (before the failure) and alarms (after it). Prefer early warning wherever possible; it buys time to stop the bleeding and can even kill the risk in the cradle.

Ways to stop the bleeding include: switching over, rolling back, scaling out, degrading or rate limiting, fixing the bug, and so on. When a failure occurs, the first priority is to stop the bleeding quickly (switch over, roll back, scale out); do not get bogged down in locating the root cause first. When the bleeding cannot be stopped quickly, the second priority is to reduce the bleeding, for example by degrading features or limiting traffic.

Bleeding-stopping efficiency: automatic vs. manual, one-click vs. multi-step. Automate wherever possible instead of operating manually; when manual action is unavoidable, make it a one-click operation so the bleeding stops faster.

  • For example: for capacity water levels, draw an early-warning line below the alarm line, so you are warned earlier and can respond calmly.
  • For example: in a distributed application cluster, when any application server has a problem, the load balancer automatically removes it via a heartbeat check and forwards requests to the other (hot) redundant servers (a minimal sketch of this follows below).
  • For example: with COVID-19, because every life is unique there is no switching over, no rolling back, and no degrading (for humanitarian reasons); the only option is to treat the illness with the right medicine, slowly.
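
A minimal sketch of the heartbeat-based removal mentioned in the cluster example above; the server names and the timeout threshold are illustrative assumptions.

```python
# Servers that stop reporting heartbeats are dropped from the traffic list
# automatically, which shortens the failure impact duration T.
import time

HEARTBEAT_TIMEOUT = 5.0                     # seconds of silence before removal
last_heartbeat = {"app-01": time.time(), "app-02": time.time()}

def on_heartbeat(server: str) -> None:
    last_heartbeat[server] = time.time()    # each server reports periodically

def healthy_servers() -> list:
    """Only these servers receive traffic; silent ones are removed automatically."""
    now = time.time()
    return [s for s, ts in last_heartbeat.items() if now - ts <= HEARTBEAT_TIMEOUT]
```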

4. Seven core principles of high-availability architecture design

Following the nPRT formula, there are seven core principles for designing a high-availability architecture:

The less-dependence principle: if you can avoid a dependency, avoid it; the fewer dependencies the better (n)

Since nothing is 100% reliable, once two things are related they affect each other: each is a risk to the other, and a problem in one may spread to the other. Here we use "dependency" to refer to any such relationship.

  • For example: a system depends on three relational databases at once, Oracle, MySQL, and OB. The less-dependence principle says to keep only the most mature and stable one, for example OB, and drop Oracle and MySQL.

When is it appropriate to add a dependency?

When introducing the dependency (which increases n) reduces one or more of P, R, and T enough that E(r) decreases overall.

  • For example: a distributed cache is introduced to mitigate database risk; as long as the cache and the database are not down at the same time, the service stays available (sketched below).
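
A hedged sketch of that read path; `cache_get`, `cache_put`, and `db_get` are illustrative stand-ins rather than a real client API.

```python
# Adding the cache increases n, but for cached keys the read fails only when
# the cache and the database are unavailable at the same time, so E(r) drops.
def read(key, cache_get, cache_put, db_get):
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except Exception:
        pass                       # a cache failure alone must not fail the read
    value = db_get(key)            # fall back to the database
    try:
        cache_put(key, value)      # best-effort refill; ignore cache errors
    except Exception:
        pass
    return value
```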

The weak-dependence principle: if you must depend, depend as weakly as possible; the weaker the better (P)

If a strongly depends on b, then as soon as b has a problem, a has a problem too, and everything is lost.

Therefore, turn strong dependencies into weak ones wherever possible; this directly reduces the probability of failure.

  • For example: the core transaction link must issue loyalty points and benefits to the user after a successful transaction, so the core transaction system depends on the points-and-benefits system. A good approach is to make this a weak dependency by issuing the points asynchronously; then even when the points system is unavailable, the core transaction link will most likely be unaffected (sketched below).
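
A minimal sketch of weakening that dependency with an in-process queue; `commit_payment` and `award_points` are placeholders for the real systems, and in practice a durable message queue would usually take the place of the in-memory queue.

```python
import logging
import queue
import threading

def commit_payment(order): pass        # stands in for the real core transaction step
def award_points(order): pass          # stands in for the real points/benefits system

points_queue = queue.Queue(maxsize=10_000)

def complete_transaction(order):
    commit_payment(order)              # synchronous, strongly depended-on core step
    try:
        points_queue.put_nowait(order) # asynchronous, weakly depended-on side step
    except queue.Full:
        logging.warning("points award deferred for order %s", order.get("id"))

def points_worker():
    while True:
        order = points_queue.get()
        try:
            award_points(order)        # a points outage no longer blocks payment
        except Exception:
            logging.exception("points system unavailable; retry later")

threading.Thread(target=points_worker, daemon=True).start()
```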

The dispersion principle: don't put all your eggs in one basket; spread the risk (R)

Break the whole up into N parts; avoid having only one part overall, otherwise a single problem affects 100% of it.

  • For example: all transaction data sits in a single table in a single database; if that database goes down, every transaction is affected (a sharding sketch follows these examples).
  • For example: if you put all your money into a single stock, it will be miserable if that stock turns out to be LeEco.
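
A minimal sketch of dispersing the transaction data by hash-based sharding; the shard counts and routing scheme are illustrative assumptions.

```python
# Hash-based sharding across N databases and tables: one bad shard affects
# roughly 1/N of the data instead of all of it.
import hashlib

N_DBS, N_TABLES = 10, 100            # e.g. 10 databases x 100 tables

def route(transaction_id: str) -> tuple:
    """Map a transaction id to (database index, table index)."""
    h = int(hashlib.md5(transaction_id.encode()).hexdigest(), 16)
    return (h % N_DBS, (h // N_DBS) % N_TABLES)

print(route("TXN-2021-0001"))        # every id lands on exactly one shard
```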

The balance principle: spread the risk evenly and avoid imbalance (R)

Ideally each of the N parts carries an even share; avoid any one part being too large, otherwise a problem in that oversized part has an oversized impact.

  • For example: an xx application cluster has 1,000 servers, but due to a bug in the traffic-routing component all the traffic was directed to 100 of them; the load became severely imbalanced and those servers finally collapsed under it. Major failures like this have happened more than once (a small imbalance check is sketched after these examples).
  • For example: I split all my money across 10 stocks, but one of them accounts for 99% of the total; it will be miserable if that stock is LeEco.
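
A small sketch of watching for this kind of imbalance; the skew threshold is an illustrative assumption.

```python
# Compare each server's share of requests with the ideal 1/N and flag large skew.
def skewed_servers(request_counts: dict, factor: float = 2.0) -> list:
    total = sum(request_counts.values())
    n = len(request_counts)
    if total == 0 or n == 0:
        return []
    ideal_share = 1.0 / n
    return [s for s, c in request_counts.items()
            if c / total > factor * ideal_share]

counts = {"app-01": 9000, "app-02": 500, "app-03": 500}
print(skewed_servers(counts))   # ['app-01'] carries far more than its 1/3 share
```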

The isolation principle: keep risk from spreading or amplifying (R)

Keep the N parts isolated from one another, so that a problem in one part does not cause problems in the others and widen the impact.

  • For example: the transaction data is split into 10 databases and 100 tables, but they are all deployed on the same physical machine; if a large SQL query on one table saturates the network card, all 10 databases and 100 tables are affected.
  • For example: I split all my money evenly across 10 stocks, each 10%, but all 10 stocks are LeTV-related.
  • For example: the ancient Battle of Chibi is a classic negative example. Chaining the ships together destroyed the isolation between them, and an army of 800,000 was burned in a single fire.

Isolation has levels: the higher the isolation level, the harder it is for risk to spread and the stronger the disaster-tolerance capability.

  • For example: an application cluster consists of N servers, which may be deployed on the same physical machine, on different physical machines in the same data center, in different data centers in the same city, or in different cities; each deployment represents a different level of disaster tolerance.
  • For example: humanity consists of countless people living on different continents of the same Earth, which means we have no planet-level isolation; if the Earth suffers a devastating blow, humanity has no disaster tolerance.

The isolation principle is extremely important and is a precondition for the previous four principles: without proper isolation, the first four are fragile, because risk can easily spread and wipe out their effect. A large number of real system failures come down to poor isolation, for example offline workloads affecting online ones, pre-release affecting production, one bad SQL dragging down an entire database (or an entire cluster), and so on.
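
A hedged sketch of isolation inside a single process, in the spirit of the bulkhead pattern; the pool sizes and dependency names are illustrative assumptions.

```python
# Each downstream dependency gets its own bounded thread pool, so a slow or
# broken dependency can only exhaust its own pool, not the whole application.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pools = {
    "points": ThreadPoolExecutor(max_workers=4, thread_name_prefix="points"),
    "risk":   ThreadPoolExecutor(max_workers=8, thread_name_prefix="risk"),
}

def call_isolated(dependency: str, fn, *args, timeout: float = 0.5):
    """Run fn in the dependency's own pool with a timeout; failures stay local."""
    future = pools[dependency].submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        future.cancel()            # the slow call ties up only this pool
        raise
```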

Dispersion, balance, and isolation are the three core principles for controlling the impact range of risk: split the whole into N parts, keep the parts balanced and isolated from one another, and a problem in one part affects only 1/N.

The no-single-point principle: there must be redundancy or another version, so there is somewhere to fall back to (T)

The quick ways to stop the bleeding are switching over, rolling back, scaling out, and so on; rollback and scale-out are special cases of switching. A rollback switches to an earlier version, and a scale-out switches traffic to newly added machines.

To switch over, there must be somewhere to switch to, so there can be no single point (specifically, no strongly depended-on single point; weakly depended-on ones can simply be degraded) and there must be a redundant backup or another version. A single point caps the overall reliability.

Suppose a single point has 99.99% reliability; pushing it to 99.999% is very hard. But if instead of a single point you rely on two (losing one does not matter as long as both are not down at the same time), the overall reliability becomes 1 - (1 - 99.99%)² = 99.999999%, a qualitative improvement.

A single point means you cannot stop the bleeding quickly, which stretches out the whole recovery, so eliminating single points is very important. "Single point" here covers not only system nodes but also people, such as the one person subscribed to the alerts or the one person who can handle the emergency.

(Important) data nodes must satisfy the no-single-point principle, otherwise in extreme cases data can be lost permanently and never recovered. Once they do satisfy it, guaranteeing data consistency matters more than the availability requirement.

  • For example: a merchant supports only one payment channel, a typical single point; if that channel goes down, customers cannot pay (a failover sketch follows these examples).
  • For example: a family's entire income depends on the father's salary alone; if the father falls ill, the income stops.
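
A minimal sketch of removing the payment-channel single point by failing over to a redundant channel; the channel callables are illustrative stand-ins.

```python
def pay(order, channels):
    """channels: ordered list of callables; each raises an exception on failure."""
    last_error = None
    for channel in channels:
        try:
            return channel(order)          # first channel that succeeds wins
        except Exception as err:
            last_error = err               # switch to the next redundant channel
    raise RuntimeError("all payment channels failed") from last_error

# With two independent channels at 99.99% each, the payment fails only when both
# fail at once: 1 - (1 - 0.9999)**2 = 0.99999999, i.e. 99.999999% (as above).
```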

The relationship between the no-single-point principle and the dispersion principle:

  • When a node is stateless, splitting it into N parts gives N parts with identical function that are redundant with one another. In other words, for stateless nodes the dispersion principle and the no-single-point principle are equivalent, and satisfying either one is enough.
  • When a node is stateful, splitting it into N parts gives N different parts that are not redundant with one another, so each part needs its own redundancy. In other words, for stateful nodes you must satisfy both the dispersion principle and the no-single-point principle.

The self-protection principle: lose less blood by sacrificing one part to protect the rest (P & R & T)

External input is never 100% reliable: sometimes it is an honest mistake, sometimes deliberate sabotage. So build error-proofing around external input and give yourself extra protection.

In extreme cases the bleeding cannot be stopped (quickly); then consider reducing the bleeding by sacrificing one part to protect the rest, for example with rate limiting or degradation.

  • For example: during the peak of a big promotion, many features are degraded in advance and rate limits are applied at the same time, mainly to protect the transaction and payment experience of the majority of users at the peak (a rate-limiter sketch follows these examples).
  • For example: when the human body loses too much blood or suffers too much pain, it goes into shock, which is also a typical self-protection mechanism.
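
A hedged sketch of self-protection through rate limiting, using a simple token bucket; the rate and burst values are illustrative assumptions.

```python
# Requests beyond the budget are rejected (sacrificed) so the rest of the
# system stays healthy.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                 # request is within budget
        return False                    # shed this request to protect the rest

limiter = TokenBucket(rate_per_sec=100, burst=200)
if not limiter.allow():
    pass  # return a degraded response or an explicit "too busy" error
```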

4. Where are the risks in a software system?

The previous sections covered how to control risk. Coming back to software systems: where do the risks lie?

Taking the software system as the object: internally it consists of the computing system and the storage system; externally it involves people, hardware, upstream systems, and downstream systems; and, implicitly, time.

Since every object is itself composed of other objects, each object can be decomposed further (in theory, indefinitely); the decomposition above is mainly to keep things easy to understand.

1. Sources of software system risk

Risk comes from (harmful) change, and an object's risk comes from the (harmful) changes of every object related to it. The sources of software system risk therefore fall into the following seven categories:

Computing system changes: runs slowly, runs incorrectly

The load on the server resources the system depends on (CPU, memory, IO, etc.), on application resources (RPC thread counts, DB connection counts, etc.), and on business resources (business IDs running out, insufficient balance, insufficient business quota, etc.) all affect the risk expectation of the computing system's operation.

Storage system changes: runs slowly, runs incorrectly, data errors

The load on the server resources the system depends on (CPU, memory, IO, etc.), on storage resources (concurrency, etc.), and on data resources (single-database capacity, single-table capacity, etc.), together with data consistency, all affect the risk expectation of the storage system's operation.

Changes by people: the change goes wrong

The number of people making changes, their safety awareness and proficiency, the number of changes, and the way changes are made all affect the risk expectation of changes.

Because so many people make so many changes, change has become the number-one source of failures at Ant. This is why the "three axes of change" is so famous.

The proper order of the "three axes of change" is: gray release, monitorable, and emergency-ready. Gray release corresponds to R, while monitorability and emergency readiness correspond to T.

Food for thought: if you could add a fourth axe to the three axes of change, what should it be?

Hardware changes: hardware breaks

The quantity, quality, service life, and maintenance of the hardware affect its risk expectation, and hardware damage in turn affects the availability of the software systems above it.

Upstream changes: requests grow larger

Requests can be looked at along three dimensions: network traffic (made up of countless APIs), the API (made up of countless KEY requests), and the KEY.

  • Excessive network traffic causes congestion and affects all requests sharing that network channel.
  • Too many requests to one API overload the corresponding service cluster, affect all API requests on those machines, and can even spread further.
  • Too many requests for one KEY (a "hot key") overload the single machine serving it, affect all KEY requests on that machine, and can even spread further.

So when preparing protection for a big promotion, don't only guarantee the capacity of core APIs; also consider network traffic and hot keys.
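
A small sketch of spotting hot keys before a single machine is overloaded; the window length and threshold are illustrative assumptions.

```python
# Count requests per key in a sliding window and flag keys above a threshold;
# flagged keys can then be cached, replicated, or rate limited.
import time
from collections import defaultdict, deque

WINDOW_SEC, HOT_THRESHOLD = 10, 5000
hits = defaultdict(deque)

def record(key: str) -> bool:
    """Record one request for key; return True if the key is currently hot."""
    now = time.monotonic()
    q = hits[key]
    q.append(now)
    while q and now - q[0] > WINDOW_SEC:    # drop hits outside the window
        q.popleft()
    return len(q) > HOT_THRESHOLD
```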

Downstream changes: responses slow down, responses are wrong

The number of downstream services, their service levels, and their availability affect the downstream risk expectation. A slow downstream response can slow the upstream down, and a wrong downstream response can corrupt the upstream's results.

Time changes: something expires

Expiry is easily overlooked, yet it tends to strike suddenly and with global impact. Once something expires and triggers a failure, you are on the back foot, so identify these risks in advance and set early warnings: key expiry, certificate expiry, fee or subscription expiry, crossing time zones, year boundaries, month boundaries, day boundaries, and so on. A certificate check is sketched after the example below.

  • For example: in 2019, the Japanese operator SoftBank suffered a certificate expiry that interrupted communication for 30 million users for about four hours.
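
A hedged sketch of turning certificate expiry into an early warning rather than a surprise; the host and the 30-day early-warning line are illustrative assumptions.

```python
# Check a TLS certificate's notAfter date and alert well before the deadline.
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if days_until_cert_expiry("example.com") < 30:   # early-warning line, not the deadline
    print("certificate expires within 30 days - renew now")
```
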
Each of the above major types of risks can be analyzed and processed one by one based on the nPRT formula.

2. The number of risks: one begets three, three beget all things

Everything is both composed of other things and itself a component of other things, on and on without end; one begets three, and three beget all things, so the number of risks is endless.

Looking inward, things can be decomposed without limit; a problem at the scale of atomic particles can, once it spreads, affect the availability of a software system, just as the 100-nanometer novel coronavirus can affect the availability of the human body.

Looking outward, there is always something further out, indefinitely; if the solar system were destroyed, the availability of our software systems would naturally cease to exist.

Although the risks are endless, with a better understanding of risk and by following these concepts and principles of risk control, we can still meaningfully reduce the risk expectation.

A word on keeping a sense of awe:

  • Our knowledge of the world is limited; that limits our fear, but it also limits our sense of awe.
  • What we should truly be in awe of is not the penalty rules, but the things we do not know, and the things we do not know that we do not know.

5. Concluding remarks

  • Everything changes.
  • Nothing is 100% reliable.
  • Therefore risk exists. Risk is invisible; failure is visible.
  • Risk can never be eliminated entirely, but it can be kept at a distance and reduced.
  • Failure is inevitable, but it can be postponed, its impact range can be reduced, and its impact duration can be shortened.

The nPRT formula is not only applicable to software system risks, but also to other risk areas. I hope it will be useful to everyone.
