Common Tools for High-Availability Systems (1): Service Degradation with Hystrix

0. Preface

High-concurrency Internet systems typically handle high QPS and TPS. Under heavy traffic, three techniques are commonly used to keep a system highly available and stable:

Caching
Service degradation and circuit breaking
Rate limiting

Caching reduces pressure on the database and speeds up access. Using a cache means dealing with cache penetration, cache invalidation, and inconsistency between the database and the cache under high concurrency. Service degradation and circuit breaking keep a core service from being affected by a non-core service. Take the coupon display on the user's order page: when the coupon display service goes down, it can be cut off so that users can still place orders normally. Some scenarios, such as placing orders or flash sales, cannot be solved with caching or service degradation. These call for rate limiting, which caps the number of requests a service accepts within a given period in order to protect the system.
This article explains how to use Hystrix for service degradation.

1. What is Hystrix

In distributed Internet applications, core business functions are often extracted into independent services that other services call. For example, an e-commerce system may be split into services for orders, inventory, reviews, and the consumer-facing (C-end) display, each maintained by dedicated developers.

When a user places an order, the request first calls the order-creation interface in the order service, which in turn calls the inventory service to check whether the selected product is in stock. If the inventory service hangs because of a network problem, a bug, or a similar issue, the thread creating the order blocks while waiting for it. If a large number of requests arrive at this point, a large number of threads block, which can paralyze the entire system.

One solution is to apply the circuit-breaker idea: if the inventory service is unavailable, fail fast and return an "insufficient inventory" response, so that the calling service does not become unavailable as well. (This can be implemented with fault-tolerant code plus a call timeout on the RPC interface.)
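The fail-fast idea can be sketched in plain Java, without Hystrix. This is only an illustration under stated assumptions: the class FailFastCall, the method callWithTimeout, and the 200 ms timeout are hypothetical names and values chosen for the example, not part of any library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Hystrix): bounds how long a caller waits for a remote call.
class FailFastCall {
    // Daemon threads, so a stuck remote call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "remote-call");
        t.setDaemon(true);
        return t;
    });

    // Runs remoteCall on another thread and returns fallback if it is slow or fails.
    static <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // stop waiting and interrupt the worker
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the remote call itself threw
        }
    }

    public static void main(String[] args) {
        // Simulated inventory check that takes far longer than the 200 ms budget.
        String slow = callWithTimeout(() -> { Thread.sleep(2000); return "in stock"; },
                200, "insufficient inventory");
        String fast = callWithTimeout(() -> "in stock", 200, "insufficient inventory");
        System.out.println(slow);
        System.out.println(fast);
    }
}
```

The caller's thread is released after at most the timeout, which is the property that prevents the thread pile-up described above. What this sketch lacks is any memory of past failures; that is what a circuit breaker adds.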

Another example is the coupon display on the user's order page. If the coupon query interface fails, the user cannot continue with the order. Here you can apply degradation: have the coupon query interface return an empty result directly. At worst the user does not see the available coupons, but the order can still be placed and the core business is unaffected.
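A minimal sketch of this kind of degradation in plain Java follows. The class CouponDegradation and the method queryCouponsOrEmpty are hypothetical names used only for illustration, and the coupon query is modeled as a simple Supplier rather than a real RPC client.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical wrapper: a failing non-core call degrades to an empty result
// instead of propagating the failure into the order flow.
class CouponDegradation {
    static List<String> queryCouponsOrEmpty(Supplier<List<String>> couponQuery) {
        try {
            List<String> coupons = couponQuery.get();
            return coupons != null ? coupons : Collections.<String>emptyList();
        } catch (RuntimeException e) {
            // Degrade: the order page shows no coupons, but ordering continues.
            return Collections.<String>emptyList();
        }
    }
}
```

The order-placement code can then treat an empty list as "no coupons available" and never needs to know whether the coupon service was down.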

However, it is not appropriate to trip the breaker or degrade immediately after a single timeout. How many failures to tolerate, how long the breaker stays open, and what happens after it opens all need to be managed by some component. Hystrix provides such an implementation of service degradation and circuit breaking.

Introduction to Hystrix

Hystrix is a fault-tolerance framework open-sourced by Netflix. It addresses the problem of failing external dependencies dragging down business systems and even causing cascading (avalanche) failures (wiki: https://github.com/Netflix/Hystrix/wiki).
Hystrix means porcupine, an animal covered in spines. The name suggests that Hystrix protects your application the way quills protect a porcupine. The framework gives distributed systems resilience by improving their tolerance of latency and faults.

Hystrix provides the following capabilities:

Degradation: Degrade on timeout or when resources (threads or semaphores) are exhausted. A degraded call can use a fallback interface to return default data;
Isolation (thread pool isolation and semaphore isolation): Limit the resources used to call each distributed service, so that a problem with one service does not affect calls to other services;
Circuit breaking: When the failure rate reaches the threshold (for example, because of network failures or timeouts), degradation is triggered automatically and calls fail fast; the fast-failure state triggered by the breaker recovers quickly;
Caching: Provides request caching and request collapsing (merging);
Monitoring: Supports real-time monitoring, alerting, and control (configuration changes).

2. How Hystrix works

The following briefly introduces how Hystrix implements degradation, circuit breaking, and caching.

2.1 Main workflow

1. Build a Hystrix command by extending HystrixCommand or HystrixObservableCommand;
2. Execute the command:
- execute(): synchronous; returns the result directly. execute() = queue().get();
- queue(): asynchronous; returns a Future. queue() = toObservable().toBlocking().toFuture();
- observe(): returns an Observable. You subscribe to it to handle the result. (Execution is triggered as soon as observe() is called);
- toObservable(): returns a lazy Observable; execution is triggered only once a subscriber has subscribed.
3. Check whether the cache already holds a response for this request;
4. Check whether the circuit breaker is open; if it is, go to step 8, otherwise go to step 5;
5. Check whether the thread pool or semaphore is full; if it is, go to step 8, otherwise go to step 6;
6. Execute the command. If execution fails, Hystrix goes to step 8 and discards the eventual return value; if execution completes, Hystrix returns the result and records logs and metrics;
7. Calculate circuit health. Hystrix reports successes, failures, and rejections to the circuit breaker. The circuit breaker maintains rolling counters and uses these statistics to decide whether to open;
8. Execute the fallback. If step 4, 5, or 6 is not satisfied or throws an exception, the command enters the fallback, implemented in HystrixCommand.getFallback() or HystrixObservableCommand.resumeWithFallback().

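Steps 3 to 8 of the workflow can be modeled in a few lines of plain Java. This is a deliberately simplified sketch, not Hystrix's actual implementation: the class CommandFlowSketch and its fields are hypothetical, the circuit state is a plain flag that the caller sets, and step 7 (health reporting) is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical, simplified model of the command flow; names are illustrative only.
class CommandFlowSketch {
    final Map<String, String> requestCache = new ConcurrentHashMap<>();
    final AtomicBoolean circuitOpen = new AtomicBoolean(false);
    final Semaphore permits;

    CommandFlowSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    String execute(String cacheKey, Supplier<String> run, Supplier<String> fallback) {
        String cached = requestCache.get(cacheKey);         // step 3: cached response?
        if (cached != null) {
            return cached;
        }
        if (circuitOpen.get()) {                            // step 4: breaker open -> step 8
            return fallback.get();
        }
        if (!permits.tryAcquire()) {                        // step 5: no capacity -> step 8
            return fallback.get();
        }
        try {
            String result = run.get();                      // step 6: execute
            requestCache.put(cacheKey, result);
            return result;
        } catch (RuntimeException e) {
            return fallback.get();                          // step 6 failed -> step 8
        } finally {
            permits.release();
        }
    }
}
```

For example, `new CommandFlowSketch(10).execute("sku-1", () -> "in stock", () -> "insufficient inventory")` returns the real result, while the same call with `circuitOpen` set to true returns the fallback without running the command at all.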

2.2 Circuit breaker

When the number of requests in the statistics window reaches the volume threshold (by default 20 requests within 10 seconds) and the error rate reaches the error threshold (50% by default), the circuit breaker opens and subsequent requests are rejected.

After the configured sleep window has elapsed, one request is allowed through as a trial call. If it succeeds, the circuit breaker closes; if it fails, the breaker stays open.
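The open / trial / close cycle can be sketched as a small state machine. This is an illustration under assumptions, not Hystrix's code: SimpleCircuitBreaker is a hypothetical class, it counts requests in a single fixed window instead of the rolling buckets Hystrix uses, and the clock is injected so that the behavior can be checked without waiting.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified circuit breaker (fixed window, not rolling buckets).
class SimpleCircuitBreaker {
    private final int requestVolumeThreshold;    // e.g. 20
    private final int errorThresholdPercentage;  // e.g. 50
    private final long windowMillis;             // e.g. 10_000
    private final long sleepWindowMillis;        // e.g. 5_000
    private final LongSupplier clock;            // current time in milliseconds

    private long windowStart;
    private int total;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long windowMillis, long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.windowMillis = windowMillis;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // True if the call may proceed. While open, one trial call is allowed per sleep window.
    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = clock.getAsLong();
        if (now - openedAt >= sleepWindowMillis) {
            openedAt = now;          // restart the sleep window; this caller is the trial
            return true;
        }
        return false;
    }

    synchronized void recordSuccess() {
        if (open) {                  // the trial call succeeded: close and start fresh
            open = false;
            resetWindow();
            return;
        }
        rollWindowIfExpired();
        total++;
    }

    synchronized void recordFailure() {
        if (open) {
            return;                  // the trial call failed: stay open
        }
        rollWindowIfExpired();
        total++;
        failures++;
        boolean enoughVolume = total >= requestVolumeThreshold;
        boolean tooManyErrors = failures * 100 >= errorThresholdPercentage * total;
        if (enoughVolume && tooManyErrors) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }

    private void rollWindowIfExpired() {
        if (clock.getAsLong() - windowStart >= windowMillis) {
            resetWindow();
        }
    }

    private void resetWindow() {
        windowStart = clock.getAsLong();
        total = 0;
        failures = 0;
    }
}
```

A caller would check `allowRequest()` before each call, run the fallback when it returns false, and report the outcome with `recordSuccess()` or `recordFailure()` afterwards. Requiring both conditions keeps a handful of failures during a quiet period from opening the breaker.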

3. Using Hystrix

Let's first look at how to build an example that uses Hystrix for circuit breaking.

1. Add the Maven dependencies

(These may conflict with the Guava dependency; any such conflicts need to be resolved.)

    <dependency>
      <groupId>com.netflix.hystrix</groupId>
      <artifactId>hystrix-core</artifactId>
      <version>1.5.12</version>
    </dependency>

    <dependency>
      <groupId>com.netflix.hystrix</groupId>
      <artifactId>hystrix-javanica</artifactId>
      <version>1.5.12</version>
    </dependency>

2. Configure the Hystrix aspect: add the following to applicationContext.xml

<aop:aspectj-autoproxy/>
<bean id="hystrixAspect" class="com.netflix.hystrix.contrib.javanica.aop.aspectj.HystrixCommandAspect"></bean>
<context:annotation-config/>

3. Demo

Write a StockService that queries product inventory, and configure the call so that when it takes longer than 400 ms the caller enters the fallback and fails fast.

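import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Slf4j
@Service("stockService")
public class StockService {
    public String queryStock() {
        return "Stock is sufficient";
    }
}

Use Hystrix to wrap StockService:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Slf4j
@Component
public class StockServiceCommand {

    @Autowired
    private StockService stockService;

    @HystrixCommand(groupKey = "stockService", commandKey = "queryStock", fallbackMethod = "queryStockFallback",
            commandProperties = {
                    // Timeout in milliseconds; when it is exceeded, the fallback runs
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "400"),
                    // Minimum number of requests in a statistics window before the breaker is evaluated at all
                    // (default 20; set to 10 here)
                    @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
                    // Error percentage within a statistics window at which the breaker opens
                    // (default 50, meaning 50% of requests failed; set to 10 here)
                    @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "10")
            }
    )
    public String queryStock() {
        // Simulate network latency: anything over 400 ms triggers the fallback
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // Hystrix interrupts the worker thread on timeout; keep the interrupt status
            Thread.currentThread().interrupt();
            log.error("thread sleep error, e:", e);
        }
        return stockService.queryStock();
    }

    @HystrixCommand(groupKey = "stockService", commandKey = "queryStockFallback")
    public String queryStockFallback(Throwable e) {
        log.error("Hystrix fallback method, queryStockFallback error, e:", e);
        return "Insufficient stock, failing fast";
    }

}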

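Test:

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service("orderService")
@Slf4j
public class OrderService {

    @Autowired
    private StockServiceCommand stockServiceCommand;

    public String createOrder() {
        log.info("User starts placing an order");
        // Call the inventory service
        String result = stockServiceCommand.queryStock();
        return result;
    }
}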

@Controller
public class OrderController {

    @Autowired
    private OrderService orderService;

    @RequestMapping(value = "/createOrder", method = RequestMethod.GET)
    @ResponseBody
    public String createOrder() {
        return orderService.createOrder();
    }
}

Test result:
When the order service calls StockService, the call times out and execution enters the fallback method queryStockFallback, which fails fast.

The caller also receives the fast-failure result.

In this way, the caller is not left waiting because of a network timeout.

4. Common Hystrix parameters

The following describes the common parameters of Hystrix.

groupKey: The command group name. It affects how commands are grouped for display and, in THREAD mode, which thread pool is selected. Default when not set: the name of the class containing the @HystrixCommand-annotated method;
commandKey: The command name. Commands are distinguished from one another only by commandKey; two methods with the same commandKey share the same configuration and health statistics. Default when not set: the name of the @HystrixCommand-annotated method;
fallbackMethod: The fallback method. Default: none, in which case Hystrix throws an exception when a fallback would be needed.

The fallback method must keep the same parameter list and return type as the original method.
You can append a Throwable parameter to the end of the parameter list to receive the execution exception passed in by Hystrix. The execution exception may be one of the following:

1. Business exceptions: any uncaught exception thrown while the original method executes;
2. HystrixTimeoutException: Hystrix detected that the original method timed out;
3. HystrixSemaphoreRejectionException: a semaphore could not be acquired in SEMAPHORE mode;
4. HystrixShortCircuitOpenException: the circuit is open or the command has been manually degraded;
5. RejectedExecutionException: the thread pool rejected the task in THREAD mode.

Hystrix provides two isolation modes: SEMAPHORE and THREAD.

1. SEMAPHORE mode + general parameter description

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

import static com.netflix.hystrix.contrib.javanica.conf.HystrixPropertiesManager.*;

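/*
 * Note: the @HystrixCommand annotation relies on AOP, so it does not take effect for direct calls
 * between methods of the same class. The annotated class must be injected as a bean and called through it.
 */
public class DemoCircuitBreakerAnnotation {

    /**
     * SEMAPHORE mode and general parameters
     */
    @HystrixCommand(
            groupKey = "GroupAnnotation",
            commandKey = "HystrixAnnotationSemaphore",
            fallbackMethod = "HystrixAnnotationSemaphoreFallback",
            commandProperties = {
                /*
                 * Execute in SEMAPHORE mode: the original method runs on the thread that calls it.
                 *
                 * If the original method needs no semaphore limit, NONE mode can be used instead.
                 * NONE mode skips semaphore acquisition and checking, so it is slightly more efficient;
                 * the rest of the execution flow is the same as in SEMAPHORE mode.
                 * Note: open-source Netflix Hystrix defines only THREAD and SEMAPHORE as isolation strategies,
                 * so NONE may be available only in a customized build.
                 *
                 * Default: THREAD
                 */
                @HystrixProperty(name = EXECUTION_ISOLATION_STRATEGY, value = "SEMAPHORE"),
                /*
                 * Maximum number of semaphore permits for the run method, i.e. the maximum number of threads
                 * that may be inside run at the same time because execution has not finished.
                 * A thread releases its permit when it leaves run; threads that cannot get a permit cannot execute run.
                 *
                 * Default: 10, effective in SEMAPHORE mode
                 */
                @HystrixProperty(name = EXECUTION_ISOLATION_SEMAPHORE_MAX_CONCURRENT_REQUESTS, value = "100"),
                /*
                 * Maximum number of semaphore permits for the fallback method
                 *
                 * Note: fallback execution is subject to this limit in every isolation mode
                 *
                 * Default: 10
                 */
                @HystrixProperty(name = FALLBACK_ISOLATION_SEMAPHORE_MAX_CONCURRENT_REQUESTS, value = "1000"),
                /*
                 * Timeout
                 * In SEMAPHORE mode, Hystrix does not interrupt the thread running the original method after a timeout.
                 * It only marks this execution as failed (which affects the method's health statistics),
                 * runs the fallback on another thread, and returns the fallback's result.
                 *
                 * Default: 1000
                 */
                @HystrixProperty(name = EXECUTION_ISOLATION_THREAD_TIMEOUT_IN_MILLISECONDS, value = "500"),
                /*
                 * Length of the rolling time window over which the method's metrics (for example its health
                 * statistics) are kept; data older than the window is dropped.
                 *
                 * Default: 10000
                 */
                @HystrixProperty(name = METRICS_ROLLING_STATS_TIME_IN_MILLISECONDS, value = "10000"),
                /*
                 * Number of buckets the rolling window is divided into. More buckets mean more frequent
                 * aggregation and fresher metrics, at the cost of more CPU.
                 * Sampling: the window is split evenly into buckets; each time a bucket elapses, the statistics
                 * accumulated in the newest bucket are merged into the metrics kept for the window.
                 *
                 * Note: this value only affects how often method metrics refresh in Hystrix Monitor;
                 * it does not affect the circuit breaker decision.
                 *
                 * Default: 10
                 */
                @HystrixProperty(name = METRICS_ROLLING_STATS_NUM_BUCKETS, value = "10"),
                /*
                 * Health sampling interval, i.e. the length of one time slice. Each time the interval elapses,
                 * the statistics accumulated in that slice are merged into the health statistics kept for the window.
                 *
                 * Health statistics mainly include the method's total, successful, and failed executions
                 * within the rolling window.
                 *
                 * Default: 500
                 */
                @HystrixProperty(name = METRICS_HEALTH_SNAPSHOT_INTERVAL_IN_MILLISECONDS, value = "500"),
                /*
                 * Within one rolling window, the method's health statistics affect the breaker state
                 * only after the number of executions reaches this value.
                 *
                 * Default: 20
                 */
                @HystrixProperty(name = CIRCUIT_BREAKER_REQUEST_VOLUME_THRESHOLD, value = "10"),
                /*
                 * Within one rolling window, once the percentage of failed executions reaches this value and the
                 * execution count requirement above is met, the breaker opens and later requests run the fallback.
                 *
                 * Default: 50
                 */
                @HystrixProperty(name = CIRCUIT_BREAKER_ERROR_THRESHOLD_PERCENTAGE, value = "50"),
                /*
                 * How long the breaker stays open. After opening, Hystrix waits this long before trying the
                 * original method again to re-evaluate health. If that trial request succeeds, the health
                 * statistics are reset.
                 *
                 * Default: 5000
                 */
                @HystrixProperty(name = CIRCUIT_BREAKER_SLEEP_WINDOW_IN_MILLISECONDS, value = "5000")
            })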
    public String HystrixAnnotationSemaphore(String param) {
        return "Run with " + param;
    }
    public String HystrixAnnotationSemaphoreFallback(String param, Throwable ex) {
        return String.format("Fallback with param: %s, exception: %s", param, ex);
    }
}

2. THREAD mode + general parameter description

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

import static com.netflix.hystrix.contrib.javanica.conf.HystrixPropertiesManager.*;

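/*
 * Note: the @HystrixCommand annotation relies on AOP, so it does not take effect for direct calls
 * between methods of the same class. The annotated class must be injected as a bean and called through it.
 */
public class DemoCircuitBreakerAnnotation {

    /**
     * THREAD mode, thread pool parameters, and general parameters
     */
    @HystrixCommand(
            groupKey = "GroupAnnotation",
            commandKey = "HystrixAnnotationThread",
            fallbackMethod = "HystrixAnnotationThreadFallback",
            /*
             * Thread pool name; methods with the same thread pool name run in the same pool
             *
             * Default: the method's groupKey
             */
            threadPoolKey = "GroupAnnotationxThreadPool",
            threadPoolProperties = {
                /*
                 * Core size of the thread pool, which is also its maximum size
                 *
                 * Default: 10
                 */
                @HystrixProperty(name = CORE_SIZE, value = "10"),
                /*
                 * KeepAliveTime of pool threads, in minutes
                 *
                 * Default: 1
                 */
                @HystrixProperty(name = KEEP_ALIVE_TIME_MINUTES, value = "1"),
                /*
                 * Maximum queue length of the thread pool
                 *
                 * Default: -1, in which case a SynchronousQueue is used
                 */
                @HystrixProperty(name = MAX_QUEUE_SIZE, value = "100"),
                /*
                 * Once the queue reaches this length, the pool starts rejecting further tasks
                 *
                 * Default: 5, effective only when MaxQueueSize > 0
                 */
                @HystrixProperty(name = QUEUE_SIZE_REJECTION_THRESHOLD, value = "90"),
            },
            commandProperties = {
                /*
                 * Execute in THREAD (thread pool) mode: the run method is executed by a thread from a pool
                 *
                 * Note: because of the extra thread scheduling overhead, THREAD mode performs worse than
                 * NONE and SEMAPHORE modes, but it isolates dependencies better
                 *
                 * Default: THREAD
                 */
                @HystrixProperty(name = EXECUTION_ISOLATION_STRATEGY, value = "THREAD"),
                /*
                 * Whether to interrupt the executing thread when the method times out
                 *
                 * Default: true, effective in THREAD mode
                 */
                @HystrixProperty(name = EXECUTION_ISOLATION_THREAD_INTERRUPT_ON_TIMEOUT, value = "true"),
                /*
                 * Timeout
                 * In THREAD mode, Hystrix by default interrupts the thread running the original method after a
                 * timeout and marks this execution as failed (which affects the method's health statistics).
                 * It runs the fallback on another thread and returns the fallback's result.
                 *
                 * Default: 1000
                 */
                @HystrixProperty(name = EXECUTION_ISOLATION_THREAD_TIMEOUT_IN_MILLISECONDS, value = "500")
                /*
                 * For the remaining parameters, see the previous example or use the defaults
                 */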
            })
    public String HystrixAnnotationThread(String param) {
        return "Run with " + param;
    }
    public String HystrixAnnotationThreadFallback(String param, Throwable ex) {
        return String.format("Fallback with param: %s, exception: %s", param, ex);
    }
}
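
To make the thread pool parameters more concrete, here is a plain-Java sketch of what thread pool isolation amounts to: a bounded pool with a bounded queue per dependency, a timeout on the caller's wait, and a fallback on rejection, timeout, or failure. It is only an approximation under stated assumptions. ThreadPoolIsolationSketch is a hypothetical class, it does not model queueSizeRejectionThreshold, and it is not how Hystrix is implemented internally.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of thread pool isolation for a single dependency.
class ThreadPoolIsolationSketch {
    final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    // maxQueueSize must be at least 1 here because ArrayBlockingQueue needs a positive capacity.
    ThreadPoolIsolationSketch(int coreSize, int maxQueueSize, long timeoutMillis) {
        this.pool = new ThreadPoolExecutor(
                coreSize, coreSize,                      // core size equals maximum size
                1, TimeUnit.MINUTES,                     // keep-alive time
                new ArrayBlockingQueue<>(maxQueueSize),  // bounded queue
                r -> {
                    Thread t = new Thread(r, "dependency-pool");
                    t.setDaemon(true);
                    return t;
                });                                      // default handler throws when saturated
        this.timeoutMillis = timeoutMillis;
    }

    <T> T run(Callable<T> action, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(action);
        } catch (RejectedExecutionException e) {
            return fallback;                             // all threads busy and queue full
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                         // comparable to interrupting on timeout
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                             // the action itself threw
        }
    }
}
```

Because each dependency gets its own small pool, a slow dependency can exhaust only its own threads; callers of other dependencies keep their capacity, which is the isolation benefit that THREAD mode trades extra scheduling overhead for.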