How can enterprises enhance service stability and system availability through circuit breakers and downgrades?

API call stability is regarded as the most important indicator of a data service, and many factors influence it. "Kangaroo Cloud Data Service Platform DataAPI" has not only carried out repeated stress testing and tuning of call performance and stability, but also provides a variety of configuration options so customers can tune the platform themselves. Even so, API calls may still fail under unexpected heavy traffic or other sudden conditions.

When traffic keeps growing and reaches or exceeds the carrying capacity of the service itself, it becomes important to establish a self-protection mechanism for system services. "Kangaroo Cloud Data Service Platform DataAPI" combines API calls with microservice flow-control concepts and provides a circuit breaker and downgrade function that preserves the stability of API calls and the availability of the system to the greatest extent possible.

This article uses plain-language explanations and concrete examples to help readers understand what circuit breaking and downgrading are.

Overview of Circuit Breaking and Downgrading

Generally speaking, when microservice traffic protection comes up, three techniques are mentioned: rate limiting, circuit breaking, and downgrading. All three are important design patterns for system fault tolerance.

Rate limiting, circuit breaking, and downgrading

● Rate limiting

Rate limiting restricts the frequency of requests to the system and the execution frequency of certain internal functions, preventing the whole system from becoming unavailable due to a sudden traffic surge. It is mainly a defensive, protective measure: it controls traffic at the source and avoids problems before they start.

● Circuit breaking

The circuit breaker mechanism is an automated response measure. When traffic is too heavy or a downstream service has a problem, it automatically cuts off interaction with the downstream service to stop the failure from spreading further. The circuit breaker can also probe whether the downstream errors have been corrected, or whether upstream traffic has fallen back to a normal level, and recover on its own.

Circuit breaking is closer to an automated remediation method: when a service cannot handle the volume of requests, or some other failure occurs inside the service, request processing is restricted and recovery is attempted automatically.

● Service downgrading

Service downgrading mainly targets non-core business functions; if the core business flow exceeds its estimated peak, rate limiting is what is needed. Downgrading usually considers the distributed system as a whole and cuts off traffic at its source. It is more of a predictive measure: on the assumption that a traffic peak is coming, the service experience is reduced through pre-arranged configuration, or minor functions are suspended, so that the main workflow of the system keeps responding smoothly.

Circuit breaking and downgrading

Rate limiting and circuit breaking can also be regarded as forms of service degradation. Under a microservice architecture, calls between services usually revolve around traffic: developers need to consider traffic routing, flow control, traffic shaping, circuit breaking and downgrading, adaptive overload protection, hot-spot traffic protection, and other dimensions to keep microservices stable. Among these, circuit breaking and downgrading focus on the stability of key links or key services along the microservice call chain.

As shown in the figure below, when Service D becomes abnormal and unavailable, it affects Services A, B, G, and F; if nothing is done, the entire microservice system may eventually be paralyzed. What the circuit breaker does is stop calling Service D once a certain error threshold is reached; downgrading then returns content predefined by the developer, so that at least the overall availability of the call chain is preserved.

[Figure: failure of Service D cascading through the microservice call chain]

Introduction to circuit breaker rules

"Kangaroo Cloud Data Service Platform DataAPI" currently provides three circuit breaker strategies, and each API can be associated with exactly one of them. The strategy types are listed below (a Sentinel-style configuration sketch follows the list):

● Slow call ratio (SLOW_REQUEST_RATIO)

When the slow call ratio is selected as the threshold, you also need to set the allowed slow-call RT (i.e. the maximum response time); any request whose response time exceeds this value is counted as a slow call. When the number of requests within the statistics window (statIntervalMs) is greater than the configured minimum number of requests and the proportion of slow calls exceeds the threshold, requests are automatically broken during the following circuit-breaking period.

After the circuit-breaking duration has elapsed, the circuit breaker enters the probing recovery state (HALF-OPEN). If the response time of the next request is below the configured slow-call RT, circuit breaking ends; if it is above the configured slow-call RT, the circuit is broken again.

● Error ratio (ERROR_RATIO)

When the number of requests within the statistics window (statIntervalMs) is greater than the configured minimum number of requests and the proportion of errors exceeds the threshold, requests are automatically broken during the following circuit-breaking period.

After the circuit-breaking duration has elapsed, the circuit breaker enters the probing recovery state (HALF-OPEN). If the next request completes successfully (without error), circuit breaking ends; otherwise the circuit is broken again. The threshold range for the error ratio is [0.0, 1.0], representing 0% - 100%.

● Error count (ERROR_COUNT)

When the number of errors within the statistics window exceeds the threshold, the circuit is broken automatically. After the circuit-breaking duration has elapsed, the circuit breaker enters the probing recovery state (HALF-OPEN). If the next request completes successfully (without error), circuit breaking ends; otherwise the circuit is broken again.
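The three strategy names match Sentinel's built-in circuit breaker strategies, and the article later notes that DataAPI is built on Sentinel. Under that assumption, the sketch below shows how each strategy could be expressed as a Sentinel DegradeRule; the resource name "api:demo" and every threshold are illustrative, and since each API in DataAPI is bound to exactly one strategy, the three rules appear together here only for comparison.

```java
import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import com.alibaba.csp.sentinel.slots.block.degrade.circuitbreaker.CircuitBreakerStrategy;

public class CircuitBreakerRuleDemo {

    public static void main(String[] args) {
        // Slow-call ratio: requests slower than 500 ms count as slow calls; trip when
        // >60% of at least 10 requests in a 10 s window are slow, then stay open 30 s.
        DegradeRule slowRatio = new DegradeRule("api:demo")
                .setGrade(CircuitBreakerStrategy.SLOW_REQUEST_RATIO.getType())
                .setCount(500)                 // max RT in milliseconds
                .setSlowRatioThreshold(0.6)    // slow-call ratio threshold
                .setMinRequestAmount(10)
                .setStatIntervalMs(10_000)
                .setTimeWindow(30);            // circuit-breaking duration, seconds

        // Error ratio: trip when more than 50% of requests in the window fail.
        DegradeRule errorRatio = new DegradeRule("api:demo")
                .setGrade(CircuitBreakerStrategy.ERROR_RATIO.getType())
                .setCount(0.5)                 // threshold in [0.0, 1.0]
                .setMinRequestAmount(10)
                .setStatIntervalMs(10_000)
                .setTimeWindow(30);

        // Error count: trip when more than 5 exceptions occur in the window.
        DegradeRule errorCount = new DegradeRule("api:demo")
                .setGrade(CircuitBreakerStrategy.ERROR_COUNT.getType())
                .setCount(5)
                .setStatIntervalMs(10_000)
                .setTimeWindow(30);

        DegradeRuleManager.loadRules(Arrays.asList(slowRatio, errorRatio, errorCount));
    }
}
```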


Applications of circuit breaking and downgrading in DataAPI

The following example introduces how circuit breaking and downgrading are applied in "Kangaroo Cloud Data Service Platform DataAPI".

Circuit breaking

The circuit breaking and downgrading of the data service is implemented on the Sentinel framework. Sentinel defines resources mostly at the service level, but it also allows resources to be defined for specific code or content. DataAPI therefore defines the API ID as a unique resource, on which the circuit-breaking threshold is judged and the concrete circuit-breaking actions are performed. At the same time, by controlling the rules used to generate resource names, test APIs and production APIs are kept isolated from each other.
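A minimal sketch of what "API ID as the resource" might look like with Sentinel's public API; the test:/prod: prefix convention, the JSON bodies, and the helper names are illustrative assumptions, not DataAPI's actual code.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class ApiResourceGuard {

    /**
     * Build the Sentinel resource name from the API id, prefixed by the environment
     * so that test and production APIs never share statistics (assumed convention).
     */
    static String resourceName(String apiId, boolean testEnv) {
        return (testEnv ? "test:" : "prod:") + apiId;
    }

    /** Wrap a single API invocation in a Sentinel entry for that resource. */
    static String invokeApi(String apiId, boolean testEnv) {
        String resource = resourceName(apiId, testEnv);
        Entry entry = null;
        try {
            entry = SphU.entry(resource);      // throws BlockException when the breaker is open
            return doRealApiCall(apiId);       // placeholder for the real data query
        } catch (BlockException blocked) {
            return "{\"code\":429,\"msg\":\"circuit breaker open, degraded response\"}";
        } catch (RuntimeException businessError) {
            Tracer.trace(businessError);       // let Sentinel count the exception
            throw businessError;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }

    private static String doRealApiCall(String apiId) {
        return "{\"data\":[]}";
    }
}
```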


The biggest difficulty during development was that Sentinel natively supports cluster-wide control for rate-limiting rules but not for circuit-breaker rules. DataAPI's approach is to use a master node to achieve cluster-wide control, which mainly involves the following two points.

● Loading circuit breaker rules only on the master node

First, circuit breaker rules are loaded through DegradeRuleManager, which is memory-based. Because the rules live in memory, they must be persisted, otherwise they would be lost whenever the program restarts. Here a circuit breaker rule table is created in MySQL to store and modify the rule details.
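A sketch of the persistence idea, assuming a hypothetical MySQL table api_degrade_rule whose columns mirror the DegradeRule fields; DataAPI's actual table structure is not public.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

public class DegradeRulePersistence {

    /**
     * Reload all circuit-breaker rules from a MySQL table into Sentinel's
     * in-memory DegradeRuleManager. Table and column names are assumptions.
     */
    static void reloadRulesFromMysql(String jdbcUrl, String user, String password) throws Exception {
        List<DegradeRule> rules = new ArrayList<>();
        String sql = "SELECT resource, grade, count_value, time_window, "
                   + "min_request_amount, stat_interval_ms, slow_ratio_threshold "
                   + "FROM api_degrade_rule";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                rules.add(new DegradeRule(rs.getString("resource"))
                        .setGrade(rs.getInt("grade"))
                        .setCount(rs.getDouble("count_value"))
                        .setTimeWindow(rs.getInt("time_window"))
                        .setMinRequestAmount(rs.getInt("min_request_amount"))
                        .setStatIntervalMs(rs.getInt("stat_interval_ms"))
                        .setSlowRatioThreshold(rs.getDouble("slow_ratio_threshold")));
            }
        }
        // Replaces whatever was previously held in memory.
        DegradeRuleManager.loadRules(rules);
    }
}
```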

Second, at startup the gateway obtains the instance list through Nacos and selects the primary node: it calls Nacos' namingService.getAllInstances method to obtain all gateway instances, selects the first healthy instance as the master node, and stores the master node's IP in Redis as a key-value pair. This Redis key is updated whenever the master node is re-elected. Every instance in the gateway cluster first checks at startup whether the current node is the master; only the master node performs the initial loading of the circuit breaker rules.
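A sketch of that election step under the stated approach: list the gateway instances from Nacos, take the first healthy one as master, and record its IP in Redis. The service name, Redis key, and the choice of Jedis as the Redis client are assumptions.

```java
import java.util.List;

import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;
import redis.clients.jedis.Jedis;

public class MasterElection {

    private static final String MASTER_KEY = "dataapi:gateway:master";   // assumed Redis key

    /** Pick the first healthy gateway instance in Nacos as master and record its IP in Redis. */
    static String electMaster(String nacosAddr, String gatewayService,
                              String redisHost, int redisPort) throws Exception {
        NamingService naming = NamingFactory.createNamingService(nacosAddr);
        List<Instance> instances = naming.getAllInstances(gatewayService);

        Instance master = instances.stream()
                .filter(Instance::isHealthy)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no healthy gateway instance"));

        try (Jedis jedis = new Jedis(redisHost, redisPort)) {
            jedis.set(MASTER_KEY, master.getIp());
        }
        return master.getIp();
    }

    /** A node loads circuit-breaker rules only if it is the elected master. */
    static boolean isCurrentNodeMaster(String localIp, String redisHost, int redisPort) {
        try (Jedis jedis = new Jedis(redisHost, redisPort)) {
            return localIp.equals(jedis.get(MASTER_KEY));
        }
    }
}
```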


The master election in DataAPI is automatic. Even if the current master node goes down, a timer runs once per minute, fetches the surviving node instances, and re-elects the primary node, guaranteeing high availability. When the master node changes, a Redis notification is sent to all nodes; after receiving it, the new master fetches the latest circuit breaker rules from MySQL and loads them into memory. Note, however, that this action clears the previous traffic statistics and resets the time window.
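A sketch of the once-a-minute re-election and change notification described above; the Redis key and channel names are placeholders, and the election helper simply stands in for a Nacos query like the one in the previous sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import redis.clients.jedis.Jedis;

public class MasterReelectionJob {

    private static final String MASTER_KEY = "dataapi:gateway:master";           // assumed key
    private static final String MASTER_CHANGE_CHANNEL = "dataapi:master-change"; // assumed channel

    /**
     * Every minute, re-run the election against the surviving instances. If the
     * master changed, publish a notification so the new master can reload the
     * circuit-breaker rules from MySQL (statistics restart from zero).
     */
    static void startReelectionTimer(String redisHost, int redisPort) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            try (Jedis jedis = new Jedis(redisHost, redisPort)) {
                String oldMaster = jedis.get(MASTER_KEY);
                String newMaster = pickFirstHealthyInstanceIp();   // query Nacos for survivors
                if (newMaster != null && !newMaster.equals(oldMaster)) {
                    jedis.set(MASTER_KEY, newMaster);
                    jedis.publish(MASTER_CHANGE_CHANNEL, newMaster);
                }
            } catch (Exception e) {
                // Failing one round is acceptable; the next tick retries.
            }
        }, 1, 1, TimeUnit.MINUTES);
    }

    private static String pickFirstHealthyInstanceIp() {
        return null; // placeholder: list healthy instances from Nacos as in the election sketch
    }
}
```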


Finally, there is the question of how rule modifications are synchronized to the master node. The data service uses Redis channel notifications: every gateway listens for channel messages, but only the gateway that determines that the current node is the master performs the in-memory rule loading operation.
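A sketch of that listener: every node subscribes, but only the node whose IP matches the master key reloads the rules into memory. The channel and key names are assumptions, and reloadRulesIntoMemory() stands in for the MySQL reload sketched earlier.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class RuleChangeListener {

    private static final String MASTER_KEY = "dataapi:gateway:master";              // assumed key
    private static final String RULE_CHANGE_CHANNEL = "dataapi:degrade-rule-change"; // assumed channel

    /** Every gateway subscribes, but only the current master reloads rules into memory. */
    static void listenForRuleChanges(String localIp, String redisHost, int redisPort) {
        JedisPubSub subscriber = new JedisPubSub() {
            @Override
            public void onMessage(String channel, String message) {
                boolean isMaster;
                try (Jedis jedis = new Jedis(redisHost, redisPort)) {
                    isMaster = localIp.equals(jedis.get(MASTER_KEY));
                }
                if (isMaster) {
                    reloadRulesIntoMemory();
                }
            }
        };
        // subscribe() blocks, so run it on a dedicated thread.
        new Thread(() -> {
            try (Jedis jedis = new Jedis(redisHost, redisPort)) {
                jedis.subscribe(subscriber, RULE_CHANGE_CHANNEL);
            }
        }, "degrade-rule-subscriber").start();
    }

    private static void reloadRulesIntoMemory() {
        // Placeholder: read the MySQL rule table and call DegradeRuleManager.loadRules(...).
    }
}
```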

● Cluster threshold determination

The master node performs rule loading and threshold judgment. All instances in the cluster execute API requests as usual; only the threshold judgment is delegated: an HTTP request is sent to the master node, which makes the judgment and returns whether the request may pass. The overall workflow is shown in the figure below:

[Figure: workflow of cluster-wide threshold judgment via the master node]

Threshold judgment is divided into two modes:

· Error mode, which is further divided into error count and error ratio

· Slow-call mode, which refers to the proportion of slow-call requests

Because the judgment is made on the primary node (say, instance B) while the error actually occurred on instance A or C, exceptions and slow calls must be reproduced manually on instance B. When a request fails on some node, the master node receives the error flag and manually throws an exception; Sentinel then observes it and the error count is incremented by one. Slow calls are reproduced by modifying the request's completeTime so that the counter registers a slow call.
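A sketch of the master-side judgment, assuming the forwarded request carries the resource name and an error flag; the slow-call path (adjusting the entry's completion time) relies on DataAPI internals and is only noted in a comment.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class MasterThresholdJudge {

    /**
     * Handle a threshold-judgment call forwarded from another gateway node.
     * Returns true if the request may proceed, false if the breaker is open.
     */
    static boolean judge(String resource, boolean remoteErrorFlag) {
        Entry entry = null;
        try {
            entry = SphU.entry(resource);      // throws BlockException once the breaker is open
            if (remoteErrorFlag) {
                // The real failure happened on another instance; replay it here so
                // the master's Sentinel statistics count one more exception.
                Tracer.trace(new RuntimeException("remote instance reported an error"));
            }
            // Slow calls are reproduced in DataAPI by adjusting the entry's
            // completion time so the slow-call counter increments (not shown here).
            return true;
        } catch (BlockException blocked) {
            return false;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```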


Downgrade

The downgraded content is configured on the API editing page, and a circuit breaker policy must be configured first. When the circuit breaker opens, the gateway returns fully user-defined downgrade content, which must be in JSON format.

Downgrading mainly addresses the contradiction between limited resources and growing access: with finite resources, the system still has to cope with high concurrency and a large number of requests. This matters especially when the data source behind an API cannot carry heavy traffic. To achieve this with limited resources, some service functions are restricted rather than made completely unavailable, and the system returns a default value, so that the system as a whole keeps running smoothly.


The downgrade code itself is relatively simple: once the triggering condition takes effect, the gateway just rewrites the Response of the ServerWebExchange object.
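A minimal sketch of that step in a Spring Cloud Gateway context; the method name and the way the downgrade JSON arrives are assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.springframework.core.io.buffer.DataBuffer;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.server.reactive.ServerHttpResponse;
import org.springframework.web.server.ServerWebExchange;

import reactor.core.publisher.Mono;

public class DowngradeResponseWriter {

    /**
     * Write the user-configured downgrade JSON directly to the response of a
     * Spring Cloud Gateway exchange once the circuit breaker is open.
     */
    static Mono<Void> writeDowngradeBody(ServerWebExchange exchange, String downgradeJson) {
        ServerHttpResponse response = exchange.getResponse();
        response.setStatusCode(HttpStatus.OK);
        response.getHeaders().setContentType(MediaType.APPLICATION_JSON);
        DataBuffer buffer = response.bufferFactory()
                .wrap(downgradeJson.getBytes(StandardCharsets.UTF_8));
        // Completes the exchange with the downgraded body instead of calling the backend.
        return response.writeWith(Mono.just(buffer));
    }
}
```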

"Dutstack Product White Paper" download address:https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address:https://www.dtstack.com/resources/1001?src=szsm

For more information on big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg


Reposted from: my.oschina.net/u/3869098/blog/10321023