Microservice circuit breaker and isolation

From:  https://yq.aliyun.com/articles/7443

Microservices have been very popular in recent years, and there is no shortage of articles about them. This article does not cover architecture design; it focuses only on fault tolerance for distributed services.

1 What are microservices

Microservices can be understood simply as decoupling a service to reduce the complexity of the business system: the functions of the system are split into multiple lightweight sub-services, and the sub-services communicate with each other through RPC. The advantage of doing this is that the business is simplified, and each sub-service can use its own programming language and model, be maintained and deployed independently, and be reused by other services.

2 Why do we need service isolation and circuit breaking

Since microservices exchange data through RPC, we can make an assumption: in an IO-bound service, suppose service A depends on services B and C, and B and C may in turn depend on other services, so the call chain keeps growing. This is technically called 1->N fan-out. If one or more of the downstream services on A's call chain is unavailable or has high latency, requests to service A will block. Blocked requests consume system resources such as threads and IO; as more and more such requests pile up, they occupy more and more resources, the system hits a bottleneck, other requests become unavailable, and eventually the whole business system collapses. This is also known as the avalanche effect.

Figure: 1->N fan-out

Figure: Avalanche effect

3 Reasons for Service Avalanches

(1) Machine failures: for example, disk errors on a machine, or bugs that only appear on specific machines (such as memory errors or deadlocks).

(2) Changes in server load: sometimes user behavior causes the service to be unable to process requests in time, leading to an avalanche. For example, during Alibaba's Double Eleven event, if machines are not added in advance for the estimated traffic, the pressure on the servers suddenly increases and they simply go down.

(3) Human factors: for example, a code path contains a bug that is only triggered under certain conditions.

4 Solutions to Solve or Mitigate Service Avalanches

In general, there are three main solutions for the protection of service dependencies:

(1) Circuit breaker mode: this mode is named after the electrical fuse: if the voltage on a line is too high, the fuse blows to prevent a fire. In our system, if calls to a target service are slow or time out in large numbers, calls to that service are "blown": subsequent requests are not sent to the target service but return immediately, quickly releasing resources. When the target service recovers, calls are resumed.

(2) Isolation mode: this mode divides system requests into small islands by type, so that when one island burns down it does not affect the others. For example, a thread pool can be used to isolate resources for each type of request, so that the different types do not affect one another: if the thread resources of one type are exhausted, subsequent requests of that type are returned directly without calling downstream resources. This mode has many usage scenarios, such as splitting a service apart, deploying a separate server for important services, or the multi-datacenter architecture the company has been promoting recently.

(3) Rate-limiting mode: the circuit breaker and isolation modes above are fault-tolerance mechanisms that act after an error occurs, while the rate-limiting mode can be called a prevention mechanism. Rate limiting mainly means setting a maximum QPS threshold in advance for each type of request; requests above the threshold are returned directly and downstream resources are not called. This mode cannot solve the service-dependency problem by itself, it only addresses overall allocation of system resources, because requests that are not rate-limited can still cause an avalanche effect. A minimal sketch of such a limiter is shown below.
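To make the rate-limiting mode concrete, here is a minimal fixed-window sketch in Java; the class name SimpleQpsLimiter and the threshold are illustrative assumptions, not part of the original design.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Minimal fixed-window rate limiter: allow at most maxQps requests per second,
// reject the rest immediately instead of calling the downstream service.
public class SimpleQpsLimiter {
    private final int maxQps;
    private final AtomicInteger count = new AtomicInteger();
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis());

    public SimpleQpsLimiter(int maxQps) {
        this.maxQps = maxQps;
    }

    // Returns false when the current one-second window is already over the threshold.
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        // Start a new one-second window if the current one has expired.
        if (now - start >= 1000 && windowStart.compareAndSet(start, now)) {
            count.set(0);
        }
        return count.incrementAndGet() <= maxQps;
    }
}
```

A caller would check `tryAcquire()` before invoking the downstream service and return a fallback response immediately when it fails.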

5 Circuit Breaker Design

The design of the circuit breaker mainly follows the practice of Hystrix. The three most important modules are: the algorithm that decides whether to break a request, the recovery mechanism, and alerting.

(1) Circuit breaking decision algorithm: counting uses a lock-free circular queue. By default each circuit breaker maintains 10 buckets, one bucket per second, and each bucket records the number of successful, failed, timed-out, and rejected requests. By default, the circuit is opened and requests are intercepted when the error rate exceeds 50% and there are more than 20 requests within 10 seconds (see the sketch after this list).

(2) Circuit breaker recovery: for a broken circuit, some requests are allowed through every 5 seconds; if those requests are all healthy (RT < 250 ms), the circuit is closed and traffic is restored.

(3) Circuit breaker alerting: broken requests are logged, and an alert is raised when the number of abnormal requests exceeds a configured threshold.
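As an illustration of the counting and recovery logic above, here is a minimal Java sketch. The class and member names (e.g. SimpleCircuitBreaker) are illustrative, not the actual implementation; timeout/rejection counts, the RT < 250 ms health check, and alerting are omitted for brevity.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified circuit breaker: 10 one-second buckets counting successes and failures.
// It opens when the 10-second window has more than 20 requests and an error rate
// above 50%, and while open it lets one trial request through every 5 seconds.
public class SimpleCircuitBreaker {
    private static final int BUCKETS = 10;
    private final AtomicInteger[] success = new AtomicInteger[BUCKETS];
    private final AtomicInteger[] failure = new AtomicInteger[BUCKETS];
    private final long[] bucketSecond = new long[BUCKETS];
    private volatile boolean open = false;
    private volatile long lastTrialAt = 0;

    public SimpleCircuitBreaker() {
        for (int i = 0; i < BUCKETS; i++) {
            success[i] = new AtomicInteger();
            failure[i] = new AtomicInteger();
        }
    }

    // Map the current second onto a bucket, clearing the slot when it is reused.
    private synchronized int bucketIndex() {
        long second = System.currentTimeMillis() / 1000;
        int idx = (int) (second % BUCKETS);
        if (bucketSecond[idx] != second) {
            bucketSecond[idx] = second;
            success[idx].set(0);
            failure[idx].set(0);
        }
        return idx;
    }

    // Called before the RPC: false means the request should be rejected immediately.
    public boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = System.currentTimeMillis();
        if (now - lastTrialAt >= 5000) {   // half-open: one trial request every 5 seconds
            lastTrialAt = now;
            return true;
        }
        return false;
    }

    public void recordSuccess() {
        success[bucketIndex()].incrementAndGet();
        if (open) {
            open = false;                  // a healthy trial request closes the circuit
        }
    }

    public void recordFailure() {
        failure[bucketIndex()].incrementAndGet();
        int total = 0, failed = 0;
        for (int i = 0; i < BUCKETS; i++) {
            total += success[i].get() + failure[i].get();
            failed += failure[i].get();
        }
        // Trip condition: more than 20 requests in the window and error rate above 50%.
        if (total > 20 && failed * 2 > total) {
            open = true;
        }
    }
}
```

A caller wraps the RPC with `allowRequest()`, then reports the outcome via `recordSuccess()` or `recordFailure()`; when `allowRequest()` returns false, a fallback is returned without touching the target service.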

6 Isolation Design

There are generally two ways to implement isolation:

(1) Thread pool isolation mode: a thread pool holds and processes the current requests, a timeout is set on task completion, and requests that cannot be handled immediately accumulate in the thread pool's queue. This approach requires a separate thread pool for each dependent service, which has some resource cost. Its advantage is that it can absorb bursts of traffic (when a traffic peak arrives, requests can queue in the thread pool and be processed gradually). A sketch is given after this list.

(2) Semaphore isolation mode: an atomic counter (or semaphore) records how many threads are currently running. A request first checks the counter: if the configured maximum number of threads is exceeded, new requests of that type are discarded; otherwise the counter is incremented, the request executes, and the counter is decremented when the request returns. This approach strictly limits threads and returns immediately, but it cannot absorb bursts of traffic (when a traffic peak arrives and the number of in-flight requests exceeds the limit, the remaining requests are returned directly and the dependent service is not called). A sketch is also given below.
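Here is a minimal Java sketch of thread-pool isolation for a hypothetical dependency "service B"; the pool size, queue length, and timeout values are illustrative assumptions.

```java
import java.util.concurrent.*;

// Thread-pool isolation: each dependent service gets its own small pool with a bounded
// queue, so a slow dependency can only exhaust its own threads, not the whole application.
public class ThreadPoolIsolationDemo {
    private static final ExecutorService SERVICE_B_POOL = new ThreadPoolExecutor(
            10, 10,                                    // fixed worker threads for service B
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(100),            // bounded queue absorbs short bursts
            new ThreadPoolExecutor.AbortPolicy());     // fail fast when the queue is full

    public static String callServiceB() throws Exception {
        Future<String> future = SERVICE_B_POOL.submit(() -> {
            // the actual RPC call to service B would go here
            return "response from B";
        });
        // Bound how long the caller waits for B before giving up.
        return future.get(1, TimeUnit.SECONDS);
    }
}
```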
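And a corresponding sketch of semaphore isolation for a hypothetical "service C", using java.util.concurrent.Semaphore in place of a bare atomic counter; the concurrency limit is an assumed value.

```java
import java.util.concurrent.Semaphore;

// Semaphore isolation: an upper bound on concurrent calls to one dependency.
// Requests beyond the limit are rejected immediately on the caller's thread,
// so there is no extra thread pool and no queue to absorb bursts.
public class SemaphoreIsolationDemo {
    private static final Semaphore SERVICE_C_PERMITS = new Semaphore(20);

    public static String callServiceC() {
        if (!SERVICE_C_PERMITS.tryAcquire()) {
            // Over the concurrency limit: fail fast instead of calling service C.
            return "fallback";
        }
        try {
            // the actual RPC call to service C would go here
            return "response from C";
        } finally {
            SERVICE_C_PERMITS.release();
        }
    }
}
```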

7 Design of Timeout Mechanism

There are two types of timeout, one is the request waiting timeout, and the other is the request running timeout.

Waiting timeout: record the enqueue time when a task is queued, and check whether the task at the head of the queue has waited longer than the timeout; if so, the task is discarded.

Running timeout: the get method on the Future returned by the thread pool can be used directly (see the sketch below).
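A combined sketch of both timeouts, assuming a per-dependency thread pool as in the isolation design above. As a simplification, the waiting timeout here is checked when the task starts running rather than by inspecting the head of the queue; the timeout values are illustrative.

```java
import java.util.concurrent.*;

// Two timeout checks used together with a thread pool:
// 1) waiting timeout - the task records when it was enqueued and gives up if it
//    waited too long before a worker picked it up;
// 2) running timeout - the caller bounds Future.get() so a slow task cannot
//    block the calling thread indefinitely.
public class TimeoutDemo {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(10);
    private static final long MAX_QUEUE_WAIT_MS = 500;

    public static String call() throws Exception {
        final long enqueuedAt = System.currentTimeMillis();
        Future<String> future = POOL.submit(() -> {
            // Waiting timeout: the task sat in the queue too long, skip the work.
            if (System.currentTimeMillis() - enqueuedAt > MAX_QUEUE_WAIT_MS) {
                throw new RejectedExecutionException("queue wait timeout");
            }
            // the actual business call would go here
            return "ok";
        });
        // Running timeout: wait at most 1 second for the result.
        return future.get(1, TimeUnit.SECONDS);
    }
}
```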

8 Isolation and Circuit Breaker Code Implementation

The code will be put on GitHub later.

9 Performance loss test

Due to the overhead of counting and thread switching, each request incurs a certain performance loss. Test results show that in thread pool isolation mode, the average overhead per request is within 0.5 ms.

Test method: requests are issued sequentially; the total request time and the business running time are recorded; the number of requests is 500.

Variable definitions:

Single request time: the running time of the business logic (simulated with Thread.sleep());

Isolation consumption = total request time - business time;

Isolation average consumption = isolation consumption / number of requests.
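For example, in the first row of the table below, the isolation consumption is 586 - 510 = 76 ms, and the average per request is 76 / 500 = 0.152 ms.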

Test time statistics (unit: ms):

| Single request time | Total request time | Business time | Isolation consumption | Isolation average consumption |
| --- | --- | --- | --- | --- |
| 1 | 586 | 510 | 76 | 0.152 |
| 5 | 2637 | 2514 | 124 | 0.248 |
| 10 | 5248 | 5136 | 112 | 0.024 |
| 50 | 25261 | 25111 | 150 | 0.3 |
| 100 | 50265 | 50130 | 135 | 0.27 |
| 200 | 100657 | 100284 | 373 | 0.746 |

10 References

Some existing designs and articles were referenced during design and implementation:

1. Hystrix official documentation: https://github.com/Netflix/Hystrix/wiki

2. Hystrix usage and analysis: http://hot66hot.iteye.com/blog/2155036

3. Facebook article: http://queue.acm.org/detail.cfm?id=2839461

4. Facebook article: http://queue.acm.org/detail.cfm?id=2209336

5. Distributed service fault tolerance mode and practice: http://www.atatech.org/articles/31559
