10 pictures to show you what rate limiting, circuit breaking, and service degradation are

In a distributed system, if a service node fails or the network becomes abnormal, callers may be blocked waiting for a response. If the timeout is set too long, the caller's own resources are likely to be exhausted, which in turn exhausts the resources of the caller's upstream systems and eventually leads to a system avalanche.

As shown below:

If service D fails and cannot respond, service B can only block and wait when it calls D. If service B's timeout for calling D is 10 seconds and requests arrive at 100 per second, then within 10 seconds 1000 request threads will be blocked waiting. With its resources exhausted, B can no longer serve external requests, which in turn affects the entry system A and eventually brings the whole system down.

Improving the overall fault tolerance of the system is an effective means to prevent system avalanches.

In the article "Microservices: a definition of this new architectural term" [1] by Martin Fowler and James Lewis, 9 characteristics of microservices are proposed, one of which is fault-tolerant design.

To prevent avalanches, the system must be designed for fault tolerance. When sudden traffic arrives, the common practice is to apply circuit breaking and service degradation to non-core business functions in order to protect the core business functions, and to apply rate limiting to the core services.

Today, let's talk about rate limiting, circuit breaking, and service degradation in system fault tolerance.

1 Rate limiting

When the system's processing capacity cannot cope with a sudden burst of external requests, rate limiting measures must be taken to keep the system from crashing.

1.1 Rate limiting metrics

1.1.1 TPS

System throughput is a key measure of system performance, and limiting traffic by the number of completed transactions is, in principle, the most reasonable approach.

In practice, however, limiting by transactions is not realistic. Completing a transaction in a distributed system requires the cooperation of multiple systems. For example, placing an order in an e-commerce system involves the order, inventory, account, and payment services, and some of them respond asynchronously, so a single transaction may take a long time to complete. If we limited traffic by TPS, the time granularity could be very large, making it hard to accurately evaluate the system's response performance.

1.1.2 HPS

HPS (hits per second) is the number of requests the server receives from clients per second.

If one request completes one transaction, then TPS and HPS are equivalent. In a distributed scenario, however, a transaction may require multiple requests, so TPS and HPS cannot be treated as the same metric.

1.1.3 QPS

QPS is the number of client queries the server can respond to per second.

If there is only one server in the backend, HPS and QPS are equivalent. In a distributed scenario, however, each request may need multiple servers cooperating to produce a response.

Most mainstream rate-limiting approaches use HPS as the limiting metric.

1.2 Rate limiting methods

1.2.1 Flow Counter

This is the simplest and most direct method: for example, limit the number of requests per second to 100 and reject anything beyond that.

But this method has two obvious problems:

  • The unit of time (such as 1 s) is hard to align, as shown in the figure below: counted against the windows at the bottom of the timeline, HPS never exceeds 100, but a window drawn as at the top, straddling two counting windows, does exceed 100.
  • Traffic that briefly exceeds the limit does not necessarily need to be limited. As shown in the figure below, the system's HPS limit is 50; although the traffic exceeds it during the first 3 s, if the timeout is set to 5 s, rate limiting is not actually required.
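For reference, here is a minimal fixed-window counter sketch (the class and method names are illustrative, not from the original article); it suffers from exactly the boundary problem described above:

// Allows at most `limit` requests per one-second window; the window resets on a fixed boundary.
public class FlowCounterLimiter {
    private final int limit;
    private long windowStart = System.currentTimeMillis();
    private int count = 0;

    public FlowCounterLimiter(int limit) {
        this.limit = limit;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {
            windowStart = now;   // a new one-second window begins
            count = 0;
        }
        return ++count <= limit; // reject once this window's quota is used up
    }
}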

1.2.2 Sliding time window

The sliding time window algorithm is a popular rate-limiting algorithm. Its main idea is to treat time as a window that rolls forward, as shown in the following figure:

At the beginning, we treat t1~t5 as the time window, with each slice lasting 1 s. If our limiting target is 50 requests per second, the total number of requests within the t1~t5 window must not exceed 250.

The window then slides: in the next second it becomes t2~t6. At that point the statistics of the t1 slice are discarded and the t6 slice starts being counted. The number of requests within this new window must also not exceed 250.
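A minimal in-memory sketch of this idea (class, field, and parameter names are assumptions for this example):

// Sums the last `slots.length` one-second slices; a request is allowed only while the total stays under `limit`.
public class SlidingWindowLimiter {
    private final int limit;          // max requests allowed across the whole window, e.g. 250
    private final int[] slots;        // per-second counters, e.g. 5 slices for t1~t5
    private long lastSlotTime = System.currentTimeMillis();
    private int currentSlot = 0;

    public SlidingWindowLimiter(int slotCount, int limit) {
        this.slots = new int[slotCount];
        this.limit = limit;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Slide the window: each elapsed second discards the oldest slice.
        while (now - lastSlotTime >= 1000) {
            currentSlot = (currentSlot + 1) % slots.length;
            slots[currentSlot] = 0;
            lastSlotTime += 1000;
        }
        int total = 0;
        for (int c : slots) {
            total += c;
        }
        if (total >= limit) {
            return false;             // window quota exhausted: reject or degrade
        }
        slots[currentSlot]++;
        return true;
    }
}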

The sliding time window fixes the defects of the flow counter algorithm, but it still has two problems:

  • Traffic that exceeds the limit can only be discarded or sent through degradation logic.
  • Flow control is not precise enough: it cannot restrict traffic that is concentrated within a very short period, nor can it cut peaks and fill valleys.

1.2.3 Leaky Bucket Algorithm

The idea of the leaky bucket algorithm is as follows:

Before a client's request reaches the server, it is buffered in a leaky bucket. The leaky bucket can be a fixed-length queue, and the requests in the queue are sent to the server at an even rate.

If clients send requests too fast and the leaky bucket's queue fills up, new requests are rejected or handled by degradation logic. In this way, the server is never hit by burst traffic.

The advantages of the leaky bucket algorithm are that it is simple to implement and that a message queue can be used to cut peaks and fill valleys.

But there are also three issues to consider:

  • If the leaky bucket is too large, it may put heavy pressure on the server; if it is too small, a large number of requests may be discarded.
  • The rate at which the leaky bucket sends requests to the server must be chosen.
  • Buffering requests this way makes response times longer.

The bucket size and the sending rate are usually chosen from test results in the early stage of a project, but as the architecture evolves and the cluster scales, these two values need to be adjusted accordingly.
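A simplified leaky-bucket sketch (class names, capacity, and drain interval are assumptions for this example):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Requests queue up in a bounded buffer (the bucket) and are drained to the server at a fixed rate.
public class LeakyBucketLimiter {
    private final BlockingQueue<Runnable> bucket;
    private final ScheduledExecutorService drainer = Executors.newSingleThreadScheduledExecutor();

    public LeakyBucketLimiter(int capacity, long drainIntervalMillis) {
        this.bucket = new ArrayBlockingQueue<>(capacity);
        // Leak one request to the server every drainIntervalMillis, regardless of how bursty the input is.
        drainer.scheduleAtFixedRate(() -> {
            Runnable request = bucket.poll();
            if (request != null) {
                request.run();
            }
        }, drainIntervalMillis, drainIntervalMillis, TimeUnit.MILLISECONDS);
    }

    // Returns false when the bucket is full, so the caller can reject or degrade the request.
    public boolean submit(Runnable request) {
        return bucket.offer(request);
    }
}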

1.2.4 Token Bucket Algorithm

The token bucket algorithm works like registering at a hospital: before you can see a doctor you must register, and the hospital releases only a limited number of registration slots each day. When today's slots are used up, a new batch is released the next day.

The basic idea of the algorithm is that tokens are put into the bucket periodically, following the process below:

When a client sends a request, it must first obtain a token from the token bucket. If it gets one, it can send the request to the server; if not, the request can only be rejected or handled by service degradation logic. As shown below:

The token bucket algorithm solves the problems of the leaky bucket algorithm, its implementation is not complicated (a semaphore is enough), and it is the most widely used algorithm in real rate-limiting scenarios. For example, Google's Guava provides a token-bucket-style rate limiter; it is worth studying if you are interested.
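As a usage sketch of the Guava limiter mentioned above (the request-handling method and the rate of 50 permits per second are assumptions for this example):

import com.google.common.util.concurrent.RateLimiter;

public class TokenBucketDemo {
    // Issues tokens at a steady rate of 50 permits per second.
    private final RateLimiter rateLimiter = RateLimiter.create(50.0);

    public String handleRequest() {
        // tryAcquire() returns immediately: true if a token was available.
        if (rateLimiter.tryAcquire()) {
            return "processed";
        }
        // No token: reject the request or run degradation logic instead.
        return "rejected";
    }
}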

1.2.5 Distributed rate limiting

Are the four rate-limiting algorithms introduced above still applicable in a distributed scenario?

Take the token bucket algorithm as an example: suppose a customer places an order in an e-commerce system, as shown below:

If we keep a single token bucket for the entire distributed system in one place (such as Redis), then the client calling the composite service and the composite service calling the order, inventory, and account services all have to interact with the token bucket, and the number of interactions increases significantly.

One improvement is for the client to obtain four tokens before calling the composite service, consume one token when making that call and pass the remaining three to the composite service, which then consumes one token for each of the three downstream services it calls.
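A simplified sketch of a Redis-backed token bucket shared by all instances (the key name, the Jedis wiring, and the refill job are assumptions for this example; strict accuracy would require a Lua script to make the check-and-take atomic):

import redis.clients.jedis.Jedis;

// Every service instance takes tokens from the same Redis counter, so the limit holds across the cluster.
public class RedisTokenBucket {
    private static final String TOKEN_KEY = "order:tokens";
    private final Jedis jedis;

    public RedisTokenBucket(Jedis jedis) {
        this.jedis = jedis;
    }

    // A scheduled refill job (not shown) periodically resets the counter, e.g. jedis.set(TOKEN_KEY, "200").
    // The caller can take several tokens at once, as described above, to save round trips to Redis.
    public boolean tryAcquire(int tokens) {
        long remaining = jedis.decrBy(TOKEN_KEY, tokens);
        if (remaining < 0) {
            // Not enough tokens: give them back and reject or degrade the request.
            jedis.incrBy(TOKEN_KEY, tokens);
            return false;
        }
        return true;
    }
}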

1.2.6 Hystrix rate limiting

Hystrix can use semaphores or thread pools for rate limiting.

1.2.6.1 Semaphore-based limiting

Hystrix can limit concurrency with a semaphore, for example by adding the following annotation to the service method. Only 20 concurrent threads can then access the method, and any excess is routed to the degradation method errMethod.

@HystrixCommand(
    commandProperties = {
        @HystrixProperty(name = "execution.isolation.strategy", value = "SEMAPHORE"),
        @HystrixProperty(name = "execution.isolation.semaphore.maxConcurrentRequests", value = "20")
    },
    fallbackMethod = "errMethod"
)
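For context, a minimal sketch of a service class this annotation could sit on, together with its fallback (class and method names are assumptions for this example):

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

public class OrderQueryService {

    @HystrixCommand(
        commandProperties = {
            @HystrixProperty(name = "execution.isolation.strategy", value = "SEMAPHORE"),
            @HystrixProperty(name = "execution.isolation.semaphore.maxConcurrentRequests", value = "20")
        },
        fallbackMethod = "errMethod"
    )
    public String queryOrder(String orderId) {
        // normal business logic goes here
        return "order detail of " + orderId;
    }

    // The fallback must accept parameters compatible with the original method.
    public String errMethod(String orderId) {
        return "service is busy, please try again later";
    }
}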

1.2.6.2 Thread-pool-based limiting

Hystrix can also limit traffic with a thread pool. Add the following annotation to the service method; when the number of concurrent requests exceeds what the thread pool and its queue can absorb, the excess is routed to the degradation method errMethod.

@HystrixCommand(
    commandProperties = {
        @HystrixProperty(name = "execution.isolation.strategy", value = "THREAD")
    },
    threadPoolKey = "createOrderThreadPool",
    threadPoolProperties = {
        @HystrixProperty(name = "coreSize", value = "20"),
        @HystrixProperty(name = "maxQueueSize", value = "100"),
        @HystrixProperty(name = "maximumSize", value = "30"),
        @HystrixProperty(name = "queueSizeRejectionThreshold", value = "120")
    },
    fallbackMethod = "errMethod"
)

Note: in a standard Java thread pool, once the number of running threads reaches coreSize, new tasks go into the queue first; only when the queue is full are additional threads created, up to maximumSize, after which the rejection policy applies. The thread pool configured by Hystrix has one extra parameter, queueSizeRejectionThreshold:

  • If queueSizeRejectionThreshold < maxQueueSize, the rejection policy is applied as soon as the queue reaches queueSizeRejectionThreshold, so maximumSize never takes effect.
  • If queueSizeRejectionThreshold > maxQueueSize, maximumSize does take effect: once the queue reaches maxQueueSize, threads keep being created until the count reaches maximumSize.

See also: Hystrix thread pool setting pitfalls [2]
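To make the standard JDK behavior concrete, here is a small standalone sketch (the pool values mirror the configuration above; the task body is a placeholder):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                20,                                 // corePoolSize (like Hystrix coreSize)
                30,                                 // maximumPoolSize (like Hystrix maximumSize)
                60, TimeUnit.SECONDS,               // idle time before extra threads are reclaimed
                new ArrayBlockingQueue<>(100));     // bounded queue (like Hystrix maxQueueSize)

        for (int i = 0; i < 200; i++) {
            try {
                pool.execute(ThreadPoolDemo::doWork);
            } catch (RejectedExecutionException e) {
                // Reached only after the queue holds 100 tasks AND 30 threads are running.
                System.out.println("rejected");
            }
        }
        pool.shutdown();
    }

    private static void doWork() {
        try {
            Thread.sleep(1000);                     // simulate a slow downstream call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}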

2 Circuit breaking

Everyone is probably familiar with circuit breakers. A circuit breaker is essentially a switch that, when open, stops traffic from passing through. A household fuse is a good analogy: when the current is too large, it blows, preventing components from being damaged.

Service circuit breaking means the caller accesses the service through a circuit breaker that acts as a proxy. The circuit breaker continuously observes the success and failure signals returned by the service; when failures exceed the configured threshold, the breaker opens and requests no longer actually reach the service.

For better understanding, I have drawn the sequence diagram below:

2.1 Circuit breaker states

A circuit breaker has three states:

  • CLOSED: the default state. The breaker observes that the proportion of failed requests has not reached the threshold and considers the proxied service healthy.
  • OPEN: the breaker observes that the proportion of failed requests has reached the threshold, considers the proxied service faulty, and opens the switch. Requests no longer reach the proxied service and instead fail fast.
  • HALF OPEN: after being open for a while, the breaker switches to half open to try to restore access automatically, sending a probe request to see whether the proxied service has recovered. On success it moves to CLOSED, otherwise back to OPEN.

The state transition diagram of the circuit breaker is as follows:
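The same transitions expressed as a minimal code sketch (the thresholds, field names, and probe handling are assumptions for this example; for simplicity it trips on a failure count rather than a failure percentage):

public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 5;       // failures before opening
    private static final long OPEN_DURATION_MS = 10_000;  // how long to stay OPEN before probing

    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0;

    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= OPEN_DURATION_MS) {
                state = State.HALF_OPEN;  // let probe requests through (no single-probe gating here)
                return true;
            }
            return false;                 // fail fast while OPEN
        }
        return true;                      // CLOSED or HALF_OPEN
    }

    public synchronized void recordSuccess() {
        failureCount = 0;
        state = State.CLOSED;             // probe succeeded or service is healthy
    }

    public synchronized void recordFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= FAILURE_THRESHOLD) {
            state = State.OPEN;           // trip (or re-trip after a failed probe)
            openedAt = System.currentTimeMillis();
        }
    }
}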

2.2 Issues to consider

There are a few things to consider when using circuit breakers:

  • For different exceptions, define different handling logic to run after the breaker trips.
  • Set a duration for the open state; after it elapses, switch to HALF OPEN and retry.
  • Record request-failure logs for monitoring.
  • Probe actively: for example, when the breaker was tripped by connection timeouts, an asynchronous thread can check the network (e.g. with telnet) and switch to HALF OPEN to retry once the network is reachable again.
  • Provide a compensation interface so that operations staff can close the breaker manually.
  • When retrying, the previously failed request can be replayed, but make sure the business allows this.

2.3 Usage scenarios

  • Let the client fail fast when the service fails or is being upgraded
  • When failure-handling logic is easy to define
  • When responses take a long time and the client's read timeout is set relatively long, preventing the client's connections and threads from being tied up by large numbers of retried requests

3 Service degradation

We have covered rate limiting and circuit breaking. By contrast, service degradation looks at the system from an overall perspective.

After a circuit breaker trips, requests are generally routed to a pre-configured handler; that is a form of degradation logic.

Service degradation means degrading non-core, non-critical services.

3.1 Usage scenarios

  • The service hits an exception and feeds the exception information straight back to the client, with no other logic.
  • The service hits an exception, caches the request, returns an intermediate state to the client, and retries the cached request later.
  • When the monitoring system detects a sudden surge in traffic, non-core business functions are switched off so they do not consume system resources.
  • When database pressure is high, consider returning cached data instead (a sketch follows this list).
  • Time-consuming write operations can be changed into asynchronous writes.
  • Temporarily stop batch jobs to save system resources.
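For the cache-fallback scenario above, a hedged sketch (the cache, the data-access call, and all names are assumptions for this example):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProductQueryService {
    // Last known good values; in practice this would more likely be Redis or a local cache library.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String getProduct(String productId) {
        try {
            String fresh = queryDatabase(productId);  // may fail or time out under heavy load
            cache.put(productId, fresh);
            return fresh;
        } catch (Exception e) {
            // Degrade: serve possibly stale cached data instead of failing the request.
            return cache.getOrDefault(productId, "product temporarily unavailable");
        }
    }

    private String queryDatabase(String productId) {
        // placeholder for the real database call
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}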

3.2 Degradation with Hystrix

3.2.1 Degradation on exception

When degrading with Hystrix, you can choose to ignore certain exceptions by adding the @HystrixCommand annotation to the method.

The following code sets the degradation method to errMethod, but does not degrade on ParamErrorException or BusinessTypeException.

@HystrixCommand(
    fallbackMethod = "errMethod",
    ignoreExceptions = {ParamErrorException.class, BusinessTypeException.class}
)

3.2.2 Degradation on call timeout

This applies specifically to timeouts when calling third-party interfaces.

The following configuration degrades a third-party call to the errMethod method if no response is received within 3 seconds.

@HystrixCommand(
    commandProperties = {
        @HystrixProperty(name = "execution.timeout.enabled", value = "true"),
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
    },
    fallbackMethod = "errMethod"
)

Summary

Rate limiting, circuit breaking, and service degradation are important design patterns for system fault tolerance. In a sense, rate limiting and circuit breaking are themselves means of service degradation.

Circuit breaking and service degradation mainly target non-core business functions, while rate limiting is needed when core business traffic exceeds the estimated peak.

For rate limiting, choosing a reasonable algorithm is essential. The token bucket algorithm has clear advantages and is the most widely used.

When designing a system, these patterns need thresholds configured from business-volume estimates and performance-test data, and those thresholds are best kept in a configuration center so they can be adjusted in real time.
