What is high availability and how to achieve high availability

Public account Ali Technology (ID: ali_tech)
What is high availability?

Whether the object is a domain, a BG, or a site, the scope and objects differ, but the concept of high availability is the same. Today I will share my thoughts on high availability, summarized as the "nPRT formula".

This article follows the logic of "what is high availability, why high availability, how to achieve high availability, and where are the software risks".

High availability is the ability to control risks

High availability is a risk-oriented design that enables the system to control risks and provide higher availability.

Why high availability

For a company, "why it needs high availability" can be completely understood as "why the company needs (to make the system) highly available".

Taking the company as the object: internally it involves people, software, and hardware; externally it involves customers, shareholders, and society; and there is the company itself.


The basic premise of high availability is that nothing is 100% reliable:

All things change (the only constant is change).
All changes are not 100% reliable.
Conclusion: Not everything is 100% reliable.
Internal factors, neither people nor things are 100% reliable:

From a human level: everyone is likely to make mistakes.
From the software level: Software may have bugs.
From the hardware level: Hardware may break.
From a probabilistic perspective, if something can go wrong, then given enough changes the probability that it eventually does go wrong approaches 1.

External factors: without high availability, the external impact can be huge:

From the customer's perspective: Without high availability, customer service may be interrupted.
From the shareholder level: without high availability, the stock price may fall.
From a social perspective: Without high availability, social order may be affected.
Root cause (essence): control risks.

From the company's own perspective: control risks, protect the company's value, and avoid damaging the fundamentals.

How to make high availability

How to achieve high availability is essentially: how to control risks.

Risk related concepts

Risk: the possibility that harm may occur in the future but has not actually occurred, denoted r.

Failure: a harm that has occurred or is occurring; it is the result of a risk becoming reality.

Risk probability: the probability that a risk turns into a failure, used to express how easily a risk is triggered into a failure, denoted P(r).

Fault impact range: the harmful impact caused by a fault per unit of time, denoted R(r).

Fault impact duration: how long a fault lasts, denoted T(r).

Fault impact area: the product of a fault's impact range and its impact duration, used to represent the total damage caused by the fault, denoted F(r) = R(r) × T(r).

Risk expectation: the sum, over all risks, of the probability of each risk becoming a failure multiplied by the impact area of that failure; it represents the potential harm of the risks, denoted E(r).

Risk expectation formula

Based on the definition in the previous section, the formula for risk expectation can be derived as follows:
E(r) = Σ_{i=1..n} P(r_i) × F(r_i) = Σ_{i=1..n} P(r_i) × R(r_i) × T(r_i)

Here r represents a risk. The risk expectation decreases as the number of risks n decreases and as the P, R, and T of each risk decrease, hence the name "nPRT formula". (Note: if you quote this formula, please indicate the source.)
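
As an illustration only, here is a minimal Python sketch of the nPRT calculation; the risk list and all the numbers in it are invented for the example.

```python
# Minimal sketch of the nPRT formula: E(r) = sum over risks of P(r) * R(r) * T(r).
# The risks and numbers below are invented purely for illustration.

risks = [
    # (name, P: probability of turning into a failure,
    #  R: impact range per unit time, T: impact duration)
    ("config change error", 0.02, 0.30, 10),
    ("hot key overload",    0.01, 0.05, 30),
    ("certificate expiry",  0.001, 1.00, 240),
]

def risk_expectation(risks):
    """E(r) = sum of P(r_i) * R(r_i) * T(r_i) over all n risks."""
    return sum(p * r * t for _, p, r, t in risks)

print(f"n = {len(risks)}, E(r) = {risk_expectation(risks):.3f}")
```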

4 factors to control risk (nPRT)

① Reduce the number of risks, n

Stay away from the risk at the source: if there is no connection or relationship with the carrier of the risk, the probability of that risk is 0, and it no longer matters how large or small the impact of the resulting failure would be.

For example: during major holidays and promotions, the whole site enters a change freeze, which significantly reduces the number of changes; this is a typical way of reducing the number of risks.

For example: System A does not rely on Oracle at all, so System A does not need to care about any Oracle risk. Even if the President of the United States suddenly announced an emergency ban on using Oracle in China, it would not matter to System A.

For example: the recent COVID-19 pandemic, human-to-human transmission is very scary. If you choose not to go to work or go out today, then you don’t have to worry about being infected by pedestrians and colleagues outside today.

② Reduce the probability of risk turning into failure (i.e. increase the difficulty of risk turning into failure), P

Treat the risk as an adversary and set up checkpoint after checkpoint against it, raising the threshold and difficulty for a risk to turn into a failure. Don't let a tragedy like "one accidental extra space or character crashes the system" happen easily.

For example: if Person B wants to change System C, you can require Person B to pass a change certification exam, require offline (or simulated) testing of the change, and have the change code-reviewed (CR); System C can provide a way to preview the effect of the change (similar to a monitoring mode or dry run).

To guard against Person B making malicious, damaging changes, review by a different person can be added, and System C can add error-proofing designs for protection, etc.

For example: Taking COVID-19 as an example, wearing a mask, washing hands frequently, and ventilating more can reduce the probability of contracting COVID-19.

③Reduce the scope of fault influence, R

Break the big into the small: split the whole into N small parts that are isolated from each other, so that a problem in one part affects only that part. Small is beautiful.

For example: distributed architecture is a model of this. When a centralized architecture fails, everything fails; when a distributed architecture fails, only 1/N is lost.

For example: Taking COVID-19 as an example, grid management is implemented, movement between provinces or cities is restricted, and nucleic acid + quarantine is required for 14 days across provinces to effectively control the spread of COVID-19.

④Shorten the impact time of fault, T

The duration of a fault's impact is determined by how quickly the fault is detected and how quickly the bleeding is stopped, so early detection and early hemostasis are essential.

The discovery methods are divided into: early warning beforehand and alarm afterward. Try to provide advance warning as much as possible to buy time to stop the bleeding or even nip the risk in the bud.

Hemostasis methods include: switching, rollback, scale-out, downgrading or rate limiting, bug fixes, etc. When a fault occurs, the first priority is to stop the bleeding quickly (e.g. switch, roll back, scale out); do not get bogged down in locating the root cause first. When the bleeding cannot be stopped quickly, the second priority is to reduce the bleeding, e.g. by downgrading or rate limiting.

Hemostasis efficiency: automatic vs manual; one-click vs multi-step operation. Use automation as much as possible to replace manual operations. If manual operations are performed, try to achieve one-click operation to increase the speed of hemostasis.

For example: for the capacity water level, a warning line can be drawn before the alarm line, to give early warning and allow a calm response.
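
A minimal sketch of the "warning line before the alarm line" idea; the 70%/90% thresholds and the check function are assumptions made up for illustration.

```python
# Two thresholds on a capacity water level: warn early, alarm late.
WARN_LEVEL = 0.70   # early warning line: react calmly, plan capacity
ALARM_LEVEL = 0.90  # alarm line: stop the bleeding immediately

def check_capacity(used: float, total: float) -> str:
    level = used / total
    if level >= ALARM_LEVEL:
        return "ALARM"   # e.g. page the on-call, trigger scale-out
    if level >= WARN_LEVEL:
        return "WARN"    # e.g. notify the owner, schedule expansion
    return "OK"

print(check_capacity(used=75, total=100))  # -> WARN
```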

For example: in a distributed application cluster, when any application server has a problem, the load balancing will automatically remove the problematic application server through heartbeat checks and forward the request to other (hot) backup redundant servers.

For example: Take COVID-19 as an example, but since every life is unique, there is no way to switch, roll back, or downgrade (involving humanitarianism). We can only prescribe the right medicine and treat it slowly.

7 core principles of high availability architecture design

According to the nPRT formula, there are the following 7 core principles in high-availability architecture design:

① Principle of fewer dependencies: if you can avoid depending on something, don't depend on it; the fewer dependencies, the better (n)

Since nothing is 100% reliable, when there is a relationship between two things, they will affect each other and be a risk to each other. A problem with one may affect the other. We use dependency to generally refer to the "relationship" here.

For example: a system relies on three relational databases: Oracle, MySQL, and OB. The principle of less dependence is to only rely on the most mature and stable OB, and not rely on Oracle and MySQL.

What scenarios call for more dependencies? When introducing a dependency (making n larger) reduces one or more of P, R, and T, and lowers the overall E(r).

For example: a distributed cache is introduced to mitigate DB risk; the service remains available as long as the DB and the cache are not down at the same time.

② Principle of weak dependence: if you must depend on it, depend on it as weakly as possible; the weaker, the better (P)

Thing a is strongly dependent on thing b. Once there is a problem with b, then a will also have a problem, and both will suffer. Therefore, any strong dependencies must be converted into weak dependencies as much as possible, which can directly reduce the probability of problems.

For example: the core transaction link must issue points and rights to the user after a transaction succeeds, so the core transaction system depends on the points-and-rights system. A good approach is to make this a weak dependency by using an asynchronous method; then, when the points-and-rights system is unavailable, it most likely will not affect the core transaction link.
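
A minimal sketch of turning that strong dependency into a weak one by issuing points asynchronously; the queue, worker, and function names are hypothetical placeholders, not the actual transaction system.

```python
import queue
import threading
import time

# Hypothetical sketch: the core trade path only enqueues a "grant points" task;
# a background worker calls the points-and-rights system, so its downtime does
# not block or fail the trade itself.
points_tasks: "queue.Queue[dict]" = queue.Queue()

def grant_points(order_id: str) -> None:
    """Placeholder for the real call to the points-and-rights system."""
    pass

def complete_trade(order_id: str) -> str:
    # ... core trade logic (payment, order state) would happen here ...
    points_tasks.put({"order_id": order_id})   # weak, asynchronous dependency
    return "trade succeeded"

def points_worker() -> None:
    while True:
        task = points_tasks.get()
        try:
            grant_points(task["order_id"])     # may be slow or unavailable
        except Exception:
            time.sleep(1)                      # back off, then retry later
            points_tasks.put(task)             # the trade itself is unaffected

threading.Thread(target=points_worker, daemon=True).start()
print(complete_trade("order-001"))
```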

③ Principle of dispersion: don't put all your eggs in one basket; disperse the risk (R)

Break it up and split it into N parts; avoid having only 1 part globally, otherwise the scope of impact will be 100% if there is a problem.

For example: all transaction data are placed in the same database and the same table. If this database goes down, all transactions will be affected.

For example: If you buy the same stock with all your money, it will be a disaster if the stock is LeTV.
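
A minimal sketch of the "split into N parts" idea: route records to shards by a stable hash of the key so that a problem in one shard affects only about 1/N of the data. The shard count is made up for the example.

```python
import hashlib

N_SHARDS = 10  # illustrative: split one logical table into 10 shards

def shard_of(key: str, n: int = N_SHARDS) -> int:
    """Route a record to one of n shards using a stable hash of its key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n

# A failure in the shard this order lands on touches roughly 1/10 of the orders.
print(shard_of("order-10086"))
```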

④ Principle of balance: spread the risk evenly and avoid imbalance (R)

It is best for each of the N parts to be balanced; avoid any single part being too large, otherwise a problem in that oversized part has an outsized impact.

For example: an xx application cluster has 1,000 servers, but due to a bug in the traffic-routing component, all traffic is directed to 100 of them, causing severe load imbalance; those servers finally collapse because they cannot bear the load. Similar major failures have occurred many times.

For example: I bought 10 stocks with all my money, and one of them accounted for 99%. It would be a disaster if the stock was LeTV.

⑤ Principle of isolation: contain risks so they do not spread or amplify (R)

The N parts are isolated from each other; this prevents a problem in one part from spreading to the others and enlarging the impact range.

For example: transaction data is split into 10 databases and 100 tables, but deployed on the same physical machine; if a large SQL statement in a certain table fills up the network card, all 10 databases and 100 tables will be affected.

For example: I bought 10 stocks evenly with all my money, each accounting for 10%, but all 10 stocks are LeTV stocks.

For example: the ancient Battle of Chibi (Red Cliffs) is a classic negative example. Chaining the ships together destroyed the isolation between them, and a single fire burned an 800,000-strong army.

There are levels of isolation. The higher the isolation level, the more difficult it is for risks to spread and the stronger the disaster recovery capability.

For example: an application cluster consists of N servers, which may be deployed on the same physical machine, on different physical machines in the same data center, in different data centers in the same city, or in different cities. Different deployments represent different disaster-recovery capabilities.

For example: Humanity is composed of countless people living on different continents on the same earth, which means that humans do not have the ability to isolate at the planet level. When there is a devastating impact on the earth, humans do not have disaster tolerance.

The isolation principle is an extremely important principle. It is the premise of the previous four principles.

Without isolation, the first four principles are all fragile, and risks can easily spread, destroying the effect of the first four principles.

A large number of real system failures are caused by poor isolation, for example: offline jobs affecting online services, pre-release affecting production, one bad SQL statement affecting an entire database (or an entire cluster), etc.

Dispersion, balance, and isolation are the three core principles for controlling the scope of risk impact. Break it up and split it into N parts. Each part is balanced and isolated from each other. If there is a problem with one part, the impact range is 1/N.

⑥ No-single-point principle: there must be redundancy or another version, so that there is a way out (T)

The ways to stop bleeding quickly are switching, rollback, scale-out, etc.; rollback and scale-out are special cases of switching. Rollback means switching to an earlier version, and scale-out means switching traffic to newly added machines.

There must be a place to switch, so there cannot be a single point (here specifically refers to a single point with strong dependencies, and weak dependencies can be downgraded). There must be redundant backup or other versions; a single point will limit the overall reliability.

Assume the reliability of a single point is 99.99%; raising it to 99.999% is very hard. But if there is no single point and we instead rely on 2 instances (it doesn't matter if one goes down, as long as they don't go down at the same time), the overall reliability becomes 99.999999%, a qualitative improvement.
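
Written out, under the assumption that the two instances fail independently:

```latex
A_{\text{single}} = 99.99\% = 0.9999
A_{\text{pair}} = 1 - (1 - A_{\text{single}})^2 = 1 - (10^{-4})^2 = 1 - 10^{-8} = 99.999999\%
```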

A single point of failure will result in the inability to stop bleeding quickly and prolong the entire hemostasis time. It is crucial to remove the single point. The single point here refers not only to system nodes, but also to personnel, such as people who subscribe to alarms, emergency responders, etc.

For (important) data nodes, the no single point principle must be met, otherwise in extreme cases the data may be permanently lost and can never be recovered; after the (important) data node meets the no single point principle, ensuring data consistency is more important than availability requirements.

For example: a merchant only supports one payment channel, which is a typical single point. If this payment channel fails, payment will not be possible.

For example: a family relies solely on the father's salary for all its income. If the father becomes ill, there will be no income.

The difference between the no-single-point principle and the dispersion principle:

When a node is stateless, it is broken up into N parts that all provide the same function and are redundant with one another. That is: for stateless nodes, the dispersion principle and the no-single-point principle are equivalent, and satisfying either one is enough.
When a node is stateful, it is broken up into N parts that are all different, so the parts are not redundant with one another and each part needs its own redundancy. That is: for stateful nodes, both the dispersion principle and the no-single-point principle must be satisfied.
⑦ Principle of self-protection: bleed less; sacrifice a part to protect the rest (P&R&T)

External input is not 100% reliable. Sometimes it is an unintentional error, and sometimes it is even malicious damage. Therefore, you must have a mistake-proof design for external input to give yourself more protection.

In extreme cases, it may not be possible to stop the bleeding (quickly). You can consider reducing the bleeding and sacrificing one part to protect the other. For example: current limiting, downgrading, etc.

For example: during big-promotion peaks, many functions are usually downgraded in advance and rate limiting is applied, mainly to protect the core transaction and payment experience for most users during the peak.
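
A minimal sketch of rate limiting as self-protection, using a simple token bucket; the capacity and refill rate are illustrative assumptions.

```python
import time

class TokenBucket:
    """Simple token-bucket limiter: reject excess requests to protect the rest."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # allowed burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller should downgrade or reject this request

limiter = TokenBucket(rate=100, capacity=200)   # illustrative numbers
print(limiter.allow())
```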

For example, the human body will trigger shock when it loses too much blood or suffers excessive pain, which is also a typical self-protection mechanism.

Where are the software risks?

We have introduced methods to control risks before. Returning to the field of software systems, what are the risks?

Taking the software system as the object, from the inside it includes: computing system and storage system; from the outside it includes: personnel, hardware, upstream systems, downstream systems; and (implied) time.

Since each object is composed of other objects, each object can continue to be decomposed (theoretically it can be decomposed infinitely). The above decomposition method is mainly to simplify understanding.

Sources of software system risks

Risks originate from (hazardous) changes, and the risk of an object originates from (hazardous) changes in all objects related to it.

Therefore, the sources of software system risks are divided into the following seven categories:

① Computing system changes: slow operation, operation errors

The load on the server resources (CPU, memory, I/O, etc.), application resources (number of RPC threads, number of DB connections, etc.), and business resources (business IDs exhausted, insufficient balance, insufficient business quota, etc.) that the system depends on all affect the risk expectation of the computing system.

② Storage system changes: slow operation, operation errors, data errors

The load on the server resources (CPU, memory, I/O, etc.), storage resources (concurrency, etc.), and data resources (single-database capacity, single-table capacity, etc.) that the system depends on, as well as data consistency, all affect the risk expectation of the storage system.

③Changes in people: changes go wrong

The number of change personnel, safety production awareness, proficiency, number of changes, change methods, etc. will all affect the risk expectations of the change.

Because there are many people making changes and many changes being made, change is the No. 1 source of failures at Ant. This is also why the "three axes of change" are so well known.

The proper order of the "three axes of change" is "gray release, monitorability, emergency response": gray release corresponds to R, while monitorability and emergency response correspond to T.

Thinking: if a fourth axe could be added to the three axes of change, what do you think it should be?

④Hardware changes: damage

The quantity, quality, service life, and maintenance of hardware all affect its risk expectation. Hardware damage can make the software system running on top of it unavailable.

⑤ Upstream changes: Requests become larger

Requests can be viewed at three granularities: network traffic (composed of countless API requests), API (composed of countless KEY requests), and KEY.

Excessive network traffic can cause network congestion, affecting all network traffic requests in the network channel.
Excessively large API request volume will overload the corresponding service cluster, affecting all API requests served by that cluster, and may even spread outward.
Excessively large KEY requests (commonly known as "hot KEY") will cause overload on a single machine, affecting all KEY requests on the single machine, and even spreading to the outside world.
Therefore, when guaranteeing a major promotion, you should not only focus on the capacity guarantee of the core API, but also consider the network traffic and hotspot KEY.
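
A minimal sketch of spotting a hot KEY on a single machine by counting requests in a short window; the window length and threshold are invented for the example.

```python
import time
from collections import Counter

WINDOW_SECONDS = 1.0
HOT_THRESHOLD = 1000          # illustrative: >1000 hits per window marks a hot key

_counts: Counter = Counter()
_window_start = time.monotonic()

def record_and_check(key: str) -> bool:
    """Record one request for `key`; return True if the key looks hot right now."""
    global _window_start
    now = time.monotonic()
    if now - _window_start > WINDOW_SECONDS:
        _counts.clear()           # start a new counting window
        _window_start = now
    _counts[key] += 1
    return _counts[key] > HOT_THRESHOLD
```

A key flagged this way can then be served from a machine-local cache or rate limited per key, so one hot KEY does not take down the whole machine.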

⑥Downstream changes: slow response, wrong response

The number, service level, service availability, etc. of downstream services affect the risk expectations of downstream services. Slower downstream responses may slow down the upstream, and errors in downstream responses may affect the results of upstream operations.

⑦Time change: time expires

Time expiry is often overlooked, yet it tends to be sudden and globally destructive. Once an expiry triggers a fault, the situation becomes very passive, so it must be identified in advance and warned about early, for example: key expiry, certificate expiry, fee or subscription expiry, crossing time zones, crossing years, months, or days, etc.

For example: In 2019, Japanese operator SoftBank caused a 4-hour communication interruption for 30 million users due to certificate expiration.
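
A minimal sketch of checking a certificate's expiry in advance, so expiration becomes an early warning rather than a sudden outage; the host name and the 30-day window are assumptions for the example.

```python
import socket
import ssl
import time

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Return the number of days before the server certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

# Illustrative usage: warn well before expiry turns into a failure.
if days_until_cert_expiry("example.com") < 30:
    print("certificate expires within 30 days: renew now")
```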

Each of the above major categories of risks can be analyzed and processed one by one based on the nPRT formula.

The number of risks: "three begets all things", so risks are endless

Everything is both composed of other things and a component of other things, on and on without end; as the saying goes, "three begets all things", and the number of risks is likewise endless.

Looking inward, things contain smaller things, without limit; even problems at the scale of atoms and particles can spread and affect the availability of a software system, just as the 100-nanometer coronavirus can affect the availability of the human body.

Looking outward, things are contained in larger things, without limit; if the solar system were destroyed, the availability of software systems would naturally cease to exist.

Although the risks are endless, as long as we understand the risks more and follow some concepts and principles of risk control, we can still better reduce our risk expectations.

Let’s talk about awe:

Our knowledge of the world is limited, and that limited knowledge leaves us with too little fear and too little awe.
What we really need to fear is not penalties or regulations, but what we don't know, and what we don't know that we don't know.

Summary

All things change.
Not everything is 100% reliable.
That's why there is risk. Risk is invisible, what is visible is failure.
Risk cannot be completely eliminated, but it can be kept at a distance and reduced.
Failure is inevitable, but it can be postponed, the scope of impact can be reduced, and the impact time can be shortened.
The nPRT formula is not only applicable to software system risks, but also to other risk areas. I hope it will be useful to everyone.

A practical case

We all know that single points are the enemy of high availability; they are often a system's biggest risk, so we should try to avoid them during system design. Methodologically, the principle for ensuring high availability is "clustering", or in other words "redundancy": with only a single instance, the service is affected when it goes down; with redundant backups, another backup can take over when one goes down.

To ensure high system availability, the core principle of architecture design is: redundancy.

Having redundancy alone is not enough: if every fault requires manual intervention to recover, the system's downtime inevitably increases. Therefore, high availability is usually achieved together with "automatic failover".

Next, let’s look at how to ensure the high availability of the system through redundancy + automatic failover in a typical Internet architecture.

Common Internet layered architecture


A common Internet distributed architecture is layered as follows:

(1) Client layer: the typical caller is a browser or a mobile app

(2) Reverse proxy layer: the system entrance, a reverse proxy

(3) Site application layer: implements the core application logic and returns HTML or JSON

(4) Service layer: present if services have been split out (servitization)

(5) Data layer - cache: the cache accelerates access to storage

(6) Data layer - database: persistent data storage in the database

The high availability of the entire system is comprehensively achieved through redundancy + automatic failover at each layer.

Layered High Availability Architecture Practice

High availability of [Client layer->Reverse proxy layer]

The high availability from the [client layer] to the [reverse proxy layer] is achieved through redundancy at the reverse proxy layer. Take nginx as an example: there are two nginx instances, one serving online traffic and the other as a redundant standby to ensure high availability. A common practice is to use keepalived for liveness detection and to serve through the same virtual IP.

Automatic failover: when nginx goes down, keepalived detects it and automatically fails over, migrating traffic to the standby (shadow) nginx. Since the same virtual IP is used, the switch is transparent to the caller.

High availability of [reverse proxy layer->site layer]

The high availability from [reverse proxy layer] to [site layer] is achieved through redundancy at the site layer. Assuming that the reverse proxy layer is nginx, multiple web backends can be configured in nginx.conf, and nginx can detect the viability of multiple backends.

Automatic failover: When the web-server hangs, nginx can detect it, automatically perform failover, and automatically migrate the traffic to other web-servers. The entire process is automatically completed by nginx and is transparent to the caller.

High availability of [site layer -> service layer]


High availability from [site layer] to [service layer] is achieved through redundancy in the service layer. The "service connection pool" will establish multiple connections with downstream services, and each request will "randomly" select a connection to access the downstream service.

Automatic failover: when a service instance goes down, the service connection pool detects it and automatically fails over, migrating traffic to other instances. The whole process is completed automatically by the connection pool and is transparent to the caller (which is why the service connection pool in the RPC client is such an important basic component).
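
A toy sketch of the idea behind such a connection pool: pick a random healthy endpoint and drop endpoints that fail, so traffic fails over to the remaining instances. The `send` function and the endpoint addresses are hypothetical placeholders, not a real RPC framework.

```python
import random

def send(endpoint: str, request: str) -> str:
    """Placeholder for the real RPC call over an established connection."""
    return f"{endpoint} handled {request}"

class ServiceConnectionPool:
    """Toy pool: random selection plus removal of failed endpoints."""
    def __init__(self, endpoints):
        self.healthy = list(endpoints)

    def call(self, request: str) -> str:
        while self.healthy:
            endpoint = random.choice(self.healthy)
            try:
                return send(endpoint, request)
            except ConnectionError:
                self.healthy.remove(endpoint)   # fail over to the others
        raise RuntimeError("no healthy downstream service instance")

pool = ServiceConnectionPool(["10.0.0.1:20880", "10.0.0.2:20880"])
print(pool.call("ping"))
```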

High availability of [service layer -> cache layer]

The high availability from [service layer] to [cache layer] is achieved through the redundancy of cached data.

There are several ways to achieve data redundancy in the cache layer. The first is client-side encapsulation, with the service double-reading or double-writing the cache.

The cache layer can also solve the high availability problem of the cache layer through a cache cluster that supports master-slave synchronization.

Take redis as an example: redis natively supports master-slave synchronization, and redis officially provides the sentinel mechanism for liveness detection.

Automatic failover: When the redis master fails, sentinel can detect it and notify the caller to access the new redis. The entire process is completed by the cooperation of sentinel and the redis cluster and is transparent to the caller.
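
A minimal sketch using the redis-py Sentinel client; the sentinel addresses and the master group name "mymaster" are placeholders for whatever the deployment actually uses.

```python
from redis.sentinel import Sentinel

# Placeholder sentinel addresses and monitored master name.
sentinel = Sentinel([("127.0.0.1", 26379), ("127.0.0.2", 26379)], socket_timeout=0.5)

master = sentinel.master_for("mymaster", socket_timeout=0.5)   # writes go to the current master
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # reads can go to a replica

master.set("greeting", "hello")
print(replica.get("greeting"))
# After a failover, master_for resolves to the newly promoted master,
# so the switch stays transparent to this caller.
```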

Having covered cache high availability, one more point is worth making: the business does not necessarily require "high availability" from the cache. The more common use of a cache is to "accelerate data access" by keeping part of the data in the cache; if the cache goes down or a lookup misses, the data can be fetched from the backend database instead.

For this type of business scenario that allows "cache miss", the recommendations for the cache architecture are:

Encapsulate the kv cache as a service cluster and put a proxy in front of it (the proxy itself can use cluster redundancy for high availability). Behind the proxy, the cache is horizontally sharded into several instances by key, and no high availability is attempted for any individual instance.

Cache instance failures are shielded: when a horizontally sharded instance goes down, the proxy layer simply returns a cache miss, so the failure is transparent to the caller. When the number of sharded instances shrinks, re-hashing is not recommended, because it easily causes cache data inconsistency.
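
A toy sketch of that proxy behavior: shard by a stable hash of the key, keep the shard count fixed, and treat a down shard as a cache miss rather than re-hashing. The class and shard list are hypothetical.

```python
import zlib

class CacheProxy:
    """Toy proxy: key-hash sharding; a dead shard simply means a cache miss."""
    def __init__(self, shards):
        self.shards = shards            # cache clients; None marks a down instance

    def get(self, key: str):
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)  # fixed shard count, no re-hash
        shard = self.shards[index]
        if shard is None:               # instance down: shield it as a cache miss
            return None
        return shard.get(key)
```

On a miss the caller falls back to the backend database, so a dead cache instance stays transparent to it.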

High availability of [service layer -> database layer]

In most Internet technologies, the database layer uses a "master-slave synchronization, read-write separation" architecture, so the high availability of the database layer is divided into two categories: "read database high availability" and "write database high availability".

High availability of [Service Layer>Database Layer "Read"]

The high availability from [service layer] to [database reading] is achieved through the redundancy of the reading database.

Since the reading database is redundant, generally speaking, there are at least 2 slave databases. The "database connection pool" will establish multiple connections to the reading database, and each request will be routed to these reading databases.


Automatic failover: when a read replica goes down, the database connection pool detects it and automatically fails over, migrating traffic to the other read replicas. The whole process is completed automatically by the connection pool and is transparent to the caller (which is why the database connection pool in the DAO is such an important basic component).

High availability of [service layer -> database layer "write"]


The high availability from [service layer] to [database writing] is achieved through the redundancy of the writing database.

Taking mysql as an example, two mysql instances can be set up with dual-master synchronization: one serves online traffic and the other is a redundant standby to ensure high availability. A common practice is to use keepalived for liveness detection and to serve through the same virtual IP.


Automatic failover: when the write database goes down, keepalived detects it and automatically fails over, migrating traffic to the standby (shadow) db-master. Since the same virtual IP is used, the switch is transparent to the caller.

Summary

High availability (HA) is one of the factors that must be considered in distributed system architecture design. It usually means reducing, through design, the time during which the system cannot provide service.

Methodologically, high availability is achieved through redundancy + automatic failover.

The high availability of the entire Internet layered system architecture is comprehensively achieved through redundancy + automatic failover of each layer. Specifically:

(1) High availability from [Client Layer] to [Reverse Proxy Layer] is achieved through the redundancy of the reverse proxy layer. A common practice is keepalived + virtual IP automatic failover

(2) High availability from [reverse proxy layer] to [site layer] is achieved through redundancy at the site layer. Common practices are survival detection and automatic failover between nginx and web-server.

(3) High availability from [site layer] to [service layer] is achieved through redundancy in the service layer. Common practice is to ensure automatic failover through service-connection-pool.

(4) High availability from the [service layer] to the [cache layer] is achieved through redundancy of the cached data. Common practices are client-side double-read/double-write of the cache, or a cache cluster with master-slave synchronization plus sentinel keep-alive and automatic failover; in the more common business scenarios that do not require high availability from the cache, the cache can be wrapped as a service to shield the caller from the underlying complexity.

(5) High availability from [service layer] to [database "read"] is achieved through the redundancy of the read database. Common practice is to ensure automatic failover through db-connection-pool

(6) High availability from [service layer] to [database "write"] is achieved through the redundancy of the write library. A common practice is keepalived + virtual IP automatic failover


Origin blog.csdn.net/qq_26356861/article/details/132717937