Timeout, retry, circuit breaking, and rate limiting

1 Preface

1.1 Explanation of terms

consumer refers to the service caller.

provider refers to the service provider. These terms are generally used in Dubbo.

In the following, "A calls service B" generally means calling an interface of service B.

1.2 Topology Diagram

(topology diagram not reproduced)


2 Thinking from a Micro Perspective

2.1 Timeout

When the consumer calls the provider, the provider may respond slowly. If the provider takes 10 s to respond, the consumer will take at least 10 s to respond as well. If this happens frequently, the overall performance of the consumer service degrades.

This slow response propagates upward like a wave, layer by layer, from the bottom-most system to the top, eventually causing timeouts along the entire call chain.

Therefore, the consumer cannot wait indefinitely for the provider interface to return; a time threshold is set, and once it is exceeded, the consumer stops waiting.

The timeout value is generally chosen based on the provider's normal response time, plus a buffer.
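A minimal consumer-side sketch of such a threshold, using a plain Future with a bounded wait; queryProvider() is a stand-in for the real RPC, and in Dubbo this would normally be expressed through the timeout setting on the reference instead:

```java
import java.util.concurrent.*;

public class TimeoutCallExample {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Wrap a (hypothetical) remote call with a time threshold.
    static String callProviderWithTimeout(long timeoutMillis) throws Exception {
        Future<String> future = POOL.submit(() -> queryProvider()); // the actual RPC
        try {
            // Wait at most timeoutMillis; beyond that, stop waiting.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // give up on the slow call
            throw e;               // let the caller decide: retry, fallback, ...
        }
    }

    // Stand-in for the real provider interface (assumption for illustration).
    static String queryProvider() throws InterruptedException {
        Thread.sleep(50);
        return "ok";
    }

    public static void main(String[] args) throws Exception {
        // Normal response is ~50 ms, so 200 ms = normal response time + buffer.
        System.out.println(callProviderWithTimeout(200));
        POOL.shutdown();
    }
}
```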

2.2 Retry

The timeout protects the consumer service from being slowed down by a slow provider, so that the consumer can maintain its original performance as much as possible.

However, the provider may only jitter occasionally. In that case, giving up immediately after a timeout with no follow-up handling turns the current request into an error and causes business losses.

For such occasional jitter, you can retry after the timeout. If the retry returns normally, the request is saved and the data can be returned to the front end as usual, just a little more slowly than a normal response.

A more refined retry strategy: consider switching to a different machine for the retry, because the original machine may be degraded by a temporary high load, and retrying against it would only worsen its situation. Calling another machine also makes a fast response more likely.
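A sketch of this strategy under simple assumptions: the instance list and callInstance() are hypothetical, and each retry is routed to a different instance rather than the one that just timed out:

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

public class RetryOnAnotherInstance {
    // Hypothetical instance addresses of the provider cluster.
    private static final List<String> INSTANCES =
            List.of("10.0.0.1:20880", "10.0.0.2:20880", "10.0.0.3:20880");

    static String callWithRetry(int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Pick a different instance on each attempt instead of hammering the same one.
            String instance = INSTANCES.get(attempt % INSTANCES.size());
            try {
                return callInstance(instance);   // the actual RPC (stubbed below)
            } catch (TimeoutException e) {
                last = e;                         // timed out: try the next instance
            }
        }
        throw last;   // all attempts timed out (maxAttempts is assumed to be >= 1)
    }

    // Stand-in for a remote call to one concrete instance (assumption for illustration).
    static String callInstance(String address) throws TimeoutException {
        return "ok from " + address;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithRetry(2));
    }
}
```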

2.2.1 Idempotency

If the consumer is allowed to retry, then the provider must be idempotent.

That is, when the consumer issues the same request multiple times, the effect on the provider (mainly write-related operations) is the same as issuing it once.

This idempotency must hold at the service level, not at the level of a single machine: retrying against any machine in the cluster should still be idempotent.
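A sketch of the idea, with the caveat that the dedup store shown here is a local map for readability; to be idempotent at the service level it would have to be shared storage (for example a database unique key on the request id), as noted in the comments:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentWriteSketch {
    // Illustration only: in a real service this must be storage visible to every machine
    // (e.g. a database unique key on requestId, or Redis SETNX); a local map only gives
    // per-machine idempotency, not service-level idempotency.
    private static final Map<String, String> PROCESSED = new ConcurrentHashMap<>();

    // Apply the write only the first time a given requestId is seen; replays of the same
    // request return the result recorded for the first execution.
    static String createOrder(String requestId, String payload) {
        return PROCESSED.computeIfAbsent(requestId, id -> {
            // ... the actual write (insert order, deduct stock, ...) would go here ...
            return "order-created-for-" + payload;
        });
    }

    public static void main(String[] args) {
        System.out.println(createOrder("req-42", "bookA")); // executes the write
        System.out.println(createOrder("req-42", "bookA")); // retry: same effect, no second write
    }
}
```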

2.3 Circuit breaking

Retry handles occasional jitter so that fewer requests are lost.

But what if the provider's response time stays too long for an extended period?

If the provider is on the core path, then when it is down the consumer basically cannot serve at all, and there is nothing more to discuss. But if it is a less important service, and the consumer's core logic is nevertheless dragged down because that service keeps responding slowly, the loss outweighs the gain.

A simple timeout cannot solve this situation, because the timeout value is usually longer than the average response time. Once every request to the provider is timing out, the consumer's average time spent calling the provider becomes equal to the timeout value, and the consumer's load is dragged down as well.

Retrying will exacerbate this problem, making the availability of the consumer even worse.

This is where circuit breaking comes in: if frequent timeouts are detected, the consumer's calls to the provider are short-circuited; instead of actually calling, the consumer directly returns a mock (fallback) value.

Real calls resume once the provider service becomes stable again.

2.3.1 Simple circuit breaker logic

For a simple implementation of circuit breaker logic, refer to Hystrix, open-sourced by Netflix.
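A minimal command in the style of Hystrix's hello-world example; the provider call in run() is stubbed, and the fallback plays the role of the mock value mentioned above:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// run() does the real call; getFallback() supplies the mock value returned when the call
// fails, times out, or the circuit is open.
public class ProviderQueryCommand extends HystrixCommand<String> {
    private final String param;

    public ProviderQueryCommand(String param) {
        super(HystrixCommandGroupKey.Factory.asKey("ProviderB"));
        this.param = param;
    }

    @Override
    protected String run() throws Exception {
        // The actual RPC to the provider would go here (stubbed for illustration).
        return "real result for " + param;
    }

    @Override
    protected String getFallback() {
        // Returned when the circuit is open or run() failed / timed out.
        return "fallback result for " + param;
    }

    public static void main(String[] args) {
        // execute() runs the command; Hystrix tracks recent failures and short-circuits
        // subsequent calls once the error rate crosses its configured threshold.
        System.out.println(new ProviderQueryCommand("42").execute());
    }
}
```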

2.4 Rate limiting

The strategies above are all measures the consumer takes to cope with various situations on the provider side.

The provider, in turn, sometimes has to guard against sudden traffic surges from its consumers.

A typical scenario: the provider is a core service serving N consumers. Suddenly one consumer misbehaves and its traffic soars, occupying most of the provider's machine time, so that other, possibly more important consumers can no longer be served normally.

Therefore, on the provider side, a traffic quota should be set for each consumer according to its importance and its usual QPS. For example, only N threads are reserved for consumer A; requests beyond that limit either wait or are rejected outright.
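A sketch of such a per-consumer quota on the provider side, using one Semaphore per consumer as the "N threads" budget; the consumer names and quota sizes are illustrative:

```java
import java.util.Map;
import java.util.concurrent.Semaphore;

// Each consumer gets a fixed number of concurrent "slots"; requests beyond the quota
// are rejected immediately (they could also be queued instead).
public class PerConsumerQuota {
    private static final Map<String, Semaphore> QUOTAS = Map.of(
            "consumerA", new Semaphore(10),   // important consumer: 10 concurrent calls
            "consumerB", new Semaphore(3));   // less important consumer: 3 concurrent calls

    static String handle(String consumer, Runnable businessLogic) {
        Semaphore quota = QUOTAS.get(consumer);
        if (quota == null || !quota.tryAcquire()) {
            return "rejected: quota exceeded for " + consumer;
        }
        try {
            businessLogic.run();
            return "ok";
        } finally {
            quota.release();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("consumerB", () -> { /* real work */ }));
    }
}
```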

2.4.1 Resource isolation

On the provider side, limiting traffic per consumer prevents the provider from being dragged down.

Similarly, the consumer should isolate the thread resources used to call each provider, so that calls to one provider cannot exhaust the consumer's entire thread pool.
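A sketch of this isolation with one small dedicated pool per provider (pool sizes are illustrative); this is also the idea behind Hystrix's thread-pool isolation:

```java
import java.util.concurrent.*;

// Each downstream provider gets its own small pool, so a slow provider can only saturate
// its own pool, not the consumer's main threads.
public class IsolatedProviderPools {
    private static final ExecutorService PROVIDER_B_POOL = Executors.newFixedThreadPool(8);
    private static final ExecutorService PROVIDER_C_POOL = Executors.newFixedThreadPool(4);

    static Future<String> callProviderB() {
        return PROVIDER_B_POOL.submit(() -> "B result");   // real RPC stubbed
    }

    static Future<String> callProviderC() {
        return PROVIDER_C_POOL.submit(() -> "C result");   // real RPC stubbed
    }

    public static void main(String[] args) throws Exception {
        // Even if provider C hangs and fills its 4 threads, calls to B are unaffected.
        System.out.println(callProviderB().get(200, TimeUnit.MILLISECONDS));
        PROVIDER_B_POOL.shutdown();
        PROVIDER_C_POOL.shutdown();
    }
}
```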

2.4.2 Service degradation

Degradation can be triggered automatically by code, or switched on manually in response to an emergency.

2.4.2.1 Consumer side

If the consumer finds that a provider is behaving abnormally, for example timing out frequently (which a circuit breaker may already handle) or returning bad data, the consumer can degrade the logic that calls that provider, usually by returning fixed data directly.

2.4.2.2 Provider side

When the provider sees traffic surging, it may also degrade its service in order to protect its own stability.

For example:

1. Return fixed data directly to the consumer.

2. For data that would normally be written to the database in real time, buffer it in a queue first and write it to the database asynchronously (see the sketch below).
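A sketch of point 2 above: writes are buffered in an in-memory queue and drained by a background thread; saveToDatabase() is a stand-in for the real persistence call, and a production version would also need to consider queue persistence and overflow handling:

```java
import java.util.concurrent.*;

public class AsyncWriteDegradation {
    private static final BlockingQueue<String> BUFFER = new LinkedBlockingQueue<>(10_000);

    static boolean submitWrite(String record) {
        // offer() fails fast when the buffer is full, rather than blocking request threads.
        return BUFFER.offer(record);
    }

    static void startBackgroundWriter() {
        Thread writer = new Thread(() -> {
            while (true) {
                try {
                    saveToDatabase(BUFFER.take());   // drain the queue asynchronously
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "async-db-writer");
        writer.setDaemon(true);
        writer.start();
    }

    static void saveToDatabase(String record) {
        System.out.println("persisted: " + record);   // stub for the real database write
    }

    public static void main(String[] args) throws InterruptedException {
        startBackgroundWriter();
        submitWrite("order-1");
        Thread.sleep(100);   // give the background writer a moment in this demo
    }
}
```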

3 Thinking from a Macro Perspective

The macro view includes call chains longer and more complex than A -> B.

A long chain is a calling path such as A -> B -> C -> D.

Moreover, a service is deployed on multiple machines, so service A actually consists of instances A1, A2, A3, ...

What is reasonable at the micro level is not necessarily reasonable at the macro level.

The main point of the following discussion: in a complex system, fault-tolerance configuration must be viewed as a whole; only overall control makes it truly meaningful.

3.1 Timeout

If the timeout A sets for B is shorter than the timeout B sets for C, the configuration is clearly unreasonable: A gives up as soon as its own timeout expires, so it is pointless for B to keep waiting on C any longer than that.

Let R denote the execution time of a service's own internal logic, and TAB the time from when consumer A starts calling provider B until the call returns.

A reasonable configuration therefore satisfies TAB > RB + TBC.
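An illustrative calculation with assumed numbers: if B's own logic takes RB = 50 ms and B gives C a timeout of TBC = 200 ms, then A's timeout for B should satisfy TAB > 50 ms + 200 ms = 250 ms, for example 300 ms with some buffer.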

3.2 Retry

Retry faces a problem similar to that of the timeout.

Say service B usually returns in 100 ms, so A sets a 110 ms timeout for B. B retries C and finally returns a correct result at 120 ms, but A's timeout is too tight, so B's retry against C is wasted.

A may also retry B, but, as mentioned above, the real culprit may be C performing poorly: each of B's attempts would succeed given a little more time, yet A still cannot get a correct result even after retrying twice.

Let N denote the configured number of retries.

Adjusting the formula from the previous section: TAB > RB + TBC * N.

The formula itself is fine, but from the perspective of a long chain, the timeout and retry count of every service need to be planned as a whole, not just made to satisfy the formula.

For example, the following situation:

A -> B -> C.

RB = 100 ms, TBC = 10 ms

B is a core service whose computation is expensive, so A should give B a relatively long timeout and avoid retrying B; if B finds that C has timed out, B can retry C a few more times, because retrying C is cheap while retrying B is expensive. So ...
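Continuing with these numbers, and assuming B is allowed N = 3 retries against C (the retry count is an assumption for illustration): TAB > RB + TBC * N = 100 ms + 10 ms * 3 = 130 ms, so A can give B a timeout of roughly 150 ms that covers B's own work plus several cheap retries of C, without A ever retrying the expensive call to B.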

3.3 Circuit breaking

In A -> B -> C, if C has a problem, then B trips its circuit breaker for C, and A does not need to trip its breaker for B.

3.4 Rate limiting

If B allows A to send traffic at QPS <= 5, but C only allows B QPS <= 3, then B's limit for A is too generous; an upstream limit has to take the downstream limits into account.

Moreover, a QPS limit may need to change as machines are added to or removed from the service. It is best to configure the limit at the cluster level and adjust the per-machine share automatically with the cluster size.
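A minimal sketch of such cluster-aware adjustment, using Guava's RateLimiter as an example implementation; the library choice, the 300 QPS quota, and the onClusterSizeChange hook are assumptions for illustration:

```java
import com.google.common.util.concurrent.RateLimiter;

// The cluster-wide quota for a consumer stays fixed; each instance's share is recomputed
// whenever the instance count changes.
public class ClusterAwareLimit {
    private static final double CLUSTER_QPS_FOR_A = 300.0;
    private final RateLimiter limiter;

    public ClusterAwareLimit(int instanceCount) {
        this.limiter = RateLimiter.create(CLUSTER_QPS_FOR_A / instanceCount);
    }

    // Called when the registry reports a new cluster size (machines added or removed).
    public void onClusterSizeChange(int instanceCount) {
        limiter.setRate(CLUSTER_QPS_FOR_A / instanceCount);
    }

    public boolean tryHandle() {
        return limiter.tryAcquire();   // false => reject or queue the request
    }

    public static void main(String[] args) {
        ClusterAwareLimit limit = new ClusterAwareLimit(10);  // 300 / 10 = 30 QPS per machine
        limit.onClusterSizeChange(15);                        // now 300 / 15 = 20 QPS per machine
        System.out.println(limit.tryHandle());
    }
}
```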

3.5 Service degradation

If service degradation is to be handled as a whole:

1. Degrade the lower-priority interfaces first; of two evils, choose the lesser.

2. If no single service in the chain is performing particularly badly, for example when the cause is a sudden surge in external traffic, degrade from the outside in.

3. If a service can detect that its own load is rising, degradation can also start from that service itself.

3.6 Ripple

In A -> B -> C, if C's service jitters and B does nothing about it, B's service starts to jitter too, and when A calls B the jitter appears in A as well.

This temporarily unavailable state is passed from the bottom layer to the top layer like a wave.

Therefore, from the perspective of the whole system, each service must try to contain the jitter of its own downstream services, so that the entire system does not shake along with one of them.

3.7 Cascading failure

One service in the system fails and becomes unavailable, which transitively makes services across the entire system unavailable.

The difference from the "ripple" above (a term I made up) is mainly one of severity.

Ripple describes occasional, unstable service being passed along, while a cascading failure basically renders the system unusable. The former often goes unnoticed because it recovers quickly, while the latter is generally taken very seriously.

3.8 Critical path

The critical path is the chain of downstream services your service absolutely depends on in order to work properly. The database, for example, is usually a node on the critical path.

Minimizing the number of dependencies on the critical path is one way to improve service stability.

The database usually sits at the bottom of the service stack. If your service can fully cache the data it needs and drop its hard dependency on the database, then even if the database goes down, your service stays safe for a while.
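A sketch of shielding reads from a database outage with a local cache; loadFromDatabase() is a stand-in, and the trade-off is that a database failure degrades to potentially stale data instead of an error:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheShieldedRead {
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    static String get(String key) {
        String cached = CACHE.get(key);
        try {
            String fresh = loadFromDatabase(key);   // refresh when the database is healthy
            CACHE.put(key, fresh);
            return fresh;
        } catch (RuntimeException databaseDown) {
            // Database unavailable: fall back to the cached copy if we have one.
            if (cached != null) {
                return cached;
            }
            throw databaseDown;                     // nothing cached: the failure surfaces
        }
    }

    static String loadFromDatabase(String key) {
        return "value-of-" + key;                   // stub for the real query
    }

    public static void main(String[] args) {
        System.out.println(get("user:1"));
    }
}
```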

3.9 Longest Path

To optimize your service's response time, look at the longest path in its call logic; only by shortening the longest path can the overall response time improve.

Reference: http://www.tuicool.com/articles/mE3y2i
