Hystrix in Plain Language (1): A First Look at Its Background and Features

It's finally time to get to know Hystrix. The name Hystrix comes from a genus of porcupines. Here it refers to a library open-sourced by Netflix that provides features such as rate limiting and circuit breaking. It gives a system the ability to fail fast and recover fast, making it more "resilient".

 

Flow control, circuit breaking, and fast recovery are basic fault-tolerance capabilities that every service node in a large distributed system should have. When an emergency hits, can the system stop the loss quickly (for example, by degrading service) and keep the whole distributed system from avalanching? After the emergency passes, can the whole system return to normal service in a short time? This is one of the topics of this series, and Hystrix may be able to help you get there.

 

Hystrix was developed by the Netflix API team in 2011 and began to be rolled out within Netflix in 2012. It has run at Netflix for a long time and is now a very mature system. With the recent popularity of microservice architecture and Spring Cloud, Hystrix has become a piece of supporting infrastructure and is gradually catching on in China. Still, some users complain that Hystrix is not easy to use. Why? Let's lift the veil on Hystrix step by step.

 

A simple rate limiter is easy to build; the article "Design and Implementation of Common Current Limiting Schemes" covers the implementation clearly. More precise control, finer-grained handling, and fast recovery take a lot more work. Many RPC frameworks, such as Dubbo, ship their own flow control and circuit breaking, but these features are not powerful enough: most require manual operation and are still a long way from automatic. That is why this problem deserves a separate, dedicated solution.

  

Here is a brief explanation of some key terms:

 

Rate limiting: capping the maximum amount of traffic. It is one form of flow control.

 

Circuit breaking (fusing): "open" is actually one state in the circuit breaker pattern, so let's start with the circuit breaker itself. A circuit breaker is originally a physical component; here it refers to a software design that lets a system stop losses quickly in an emergency (see http://martinfowler.com/bliki/CircuitBreaker.html for details). The design has three states: closed (traffic enters normally), open (the tripped state: once errors reach the threshold, the breaker opens and rejects all traffic), and half-open (after a period of time the open state automatically moves to this state and accepts traffic again; if a request fails, the breaker returns to open, and if the number of successes reaches the threshold, it moves to closed). The original post illustrated these transitions with a figure.
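The three states can be sketched as a small state machine. This is only an illustration, not Hystrix's implementation: the class name `CircuitBreakerSketch`, the use of consecutive failures as the trip condition, and closing after a single successful trial are all simplifying assumptions. Time is passed in explicitly so the behavior is easy to follow.

```java
/** Toy three-state circuit breaker; names and thresholds are illustrative assumptions. */
class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long sleepWindowMillis; // how long to stay open before probing again
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    CircuitBreakerSketch(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    /** Decides whether a request arriving at nowMillis may pass. */
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // sleep window elapsed: let trial traffic through
        }
        return state != State.OPEN;
    }

    /** A success resets the failure count and closes the breaker. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failure in half-open, or too many failures in a row, opens the breaker. */
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest` before doing the protected work and report the outcome with `recordSuccess` or `recordFailure`; a production breaker would normally trip on an error rate over a time window rather than on a run of failures.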



Degradation: what we usually call service degradation. The term comes from service levels (service tiering): degradation means lowering the service level when certain conditions or special scenarios occur.

 

To avoid getting trapped in a terminology debate (like the difference between concurrency and parallelism), this article uses degradation, or service degradation, to mean the whole set of contingency plans and measures taken when the system is in "danger". For a service to degrade automatically, the following pieces need to be in place:

  1. Configurable degradation policy: degradation policy = degradation condition + handling after degradation. The policy must be configurable, because different services define service quality differently and therefore need different degradation plans.
  2. Identifiable degradation boundary: know exactly what needs to be degraded, whether it is an external service, a downstream dependency, or an internal piece of processing logic. The boundary is where the degradation logic gets injected.
  3. Data collection: whether the degradation condition is met depends on collected data, which may cover a recent period of time or a long span of history.
  4. Behavioral intervention: once in the degraded state, the normal business flow is interfered with. This may mean rate limiting, circuit breaking, or turning a synchronous flow into an asynchronous one (for example, an MQ send becomes oneway).
  5. Result intervention: whether to return null, return a default value, change the flow from synchronous to asynchronous, and so on.
  6. Fast recovery: how to get from the degraded state back to the normal state, which also requires certain conditions to be met.

 

Let's look step by step at how Hystrix handles each of these points.

 

### Configurable degradation policy ###

Hystrix provides three kinds of degradation policy: concurrency, latency, and error rate. Its design supports adjusting these policies dynamically (in short, adjusting the thresholds for concurrency, latency, and error rate). How to adjust them dynamically, however, is left to the user: Hystrix only provides the entry point and does not offer a server-side interface for changing policies, which is somewhat regrettable. For the specific policy settings, see the two classes HystrixCommandProperties and HystrixThreadPoolProperties.

 

### Identifiable degradation boundary ###

The first difficulty a degradation tool faces is how to inject degradation logic into business code. Developers must identify and define the risk points in advance and then extract the logic at those points. Hystrix wraps the business logic to be degraded using the Command design pattern. The Command pattern encapsulates a request as an object so that the request can be handled like any other object. This suits Hystrix well: once the business logic and data to be protected are encapsulated in a Command object and handed to Hystrix, Hystrix takes over execution of that logic and decides when to call it, or whether to call it at all. Here is the command interface defined by Hystrix (actually an abstract class, simplified here):

 

public abstract class HystrixCommand<R> {

	protected abstract R run() throws Exception;

	protected R getFallback() {
		throw new UnsupportedOperationException("No fallback available.");
	}
	
	public R execute() {
	    try {
	        return queue().get();
	    } catch (Exception e) {
	        throw Exceptions.sneakyThrow(decomposeException(e));
	    }
	}
	
	public Future<R> queue() {
	    // ... too long, omitted ...
	}
	
}

 

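Simply extending HystrixCommand is enough to integrate with Hystrix. The generic type R is the return type. Implement the normal business logic directly in run() and return a result of type R; if a special value should be returned after degradation, just override getFallback(). As an example, suppose we run a lottery: every registered user gets one draw, provided they are not on our blacklist.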

 

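import java.util.concurrent.ThreadLocalRandom;

// "ChouJiang" is pinyin for "lottery draw".
public class ChouJiangService {

    /**
     * Try a lottery draw.
     *
     * @param userId the user taking the draw
     * @return the draw result: false = no prize, true = won
     */
    public boolean tryChouJiang(Long userId) {

        if(!checkUserStatus(userId)) {
            return false;
        }

        // Return whether the user won
        return ThreadLocalRandom.current().nextInt(2) == 1;
    }

    /**
     * Check whether the user is on the blacklist.
     *
     * @param userId the user to check
     * @return true = not on the blacklist, false = on the blacklist
     */
    private boolean checkUserStatus(Long userId) {
        // Never mind how odd the logic is; this is only a demo
        return System.currentTimeMillis() % 2 == 1;
    }
}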

 

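For the lottery service, checking the blacklist is not essential. If something goes wrong inside checkUserStatus (it may depend on an external service), the normal draw logic should not be affected, since blacklisted users are a small minority. To apply Hystrix to the checkUserStatus logic, we first create a CheckUserStatusCommand class that encapsulates the blacklist check: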

 

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class CheckUserStatusCommand extends HystrixCommand<Boolean> {

    private final Long userId;

    public CheckUserStatusCommand(Long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("ChouJiangCommandGroup"));
        this.userId = userId;
    }

    @Override
    protected Boolean run() throws Exception {
        // Never mind how odd the logic is; this is only a demo
        return System.currentTimeMillis() % 2 == 1;
    }
}

 

With that in place, the ChouJiangService#tryChouJiang method should be written like this:

 

public boolean tryChouJiang(Long userId) {

    if(!new CheckUserStatusCommand(userId).execute()) {
        return false;
    }

    // Return whether the user won
    return ThreadLocalRandom.current().nextInt(2) == 1;
}

 

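As you can see, we create a CheckUserStatusCommand instance and call execute() to get the result, and that is basically it: the Hystrix library has now attached automatic degradation logic to the blacklist check. Of course, this relies heavily on Hystrix's default policy configuration (this article is not a detailed Hystrix tutorial, so the focus here is on usage rather than on specific policy settings). It also shows why adjusting configuration dynamically is easy: every request creates a new Command object (note that Command objects are stateful and cannot be reused), so you only need to adjust the policy parameters at creation time. That part, again, is up to the user to implement.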
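As a hedged illustration of setting policy parameters at creation time, the sketch below passes a timeout into the constructor and applies it through Hystrix's `Setter` builders. The class name `TunedCheckUserStatusCommand`, the `timeoutMillis` parameter, and every threshold value are assumptions chosen for the example, not recommendations; it needs the Hystrix library on the classpath.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

// Hypothetical variant of CheckUserStatusCommand with explicit policy settings.
public class TunedCheckUserStatusCommand extends HystrixCommand<Boolean> {

    private final Long userId;

    public TunedCheckUserStatusCommand(Long userId, int timeoutMillis) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("ChouJiangCommandGroup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        // latency policy: give up on run() after this long
                        .withExecutionTimeoutInMilliseconds(timeoutMillis)
                        // error-rate policy: minimum traffic and error percentage to trip
                        .withCircuitBreakerRequestVolumeThreshold(20)
                        .withCircuitBreakerErrorThresholdPercentage(50)
                        // recovery: how long to stay open before a trial request
                        .withCircuitBreakerSleepWindowInMilliseconds(5000))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        // concurrency policy: size of the isolating thread pool
                        .withCoreSize(10)));
        this.userId = userId;
    }

    @Override
    protected Boolean run() throws Exception {
        // Placeholder logic carried over from the demo above
        return System.currentTimeMillis() % 2 == 1;
    }
}
```

The caller would then write something like `new TunedCheckUserStatusCommand(userId, 200).execute()`, reading the timeout from whatever configuration source the application already uses.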

 

It looks simple, but experienced developers will spot the problems right away:

 

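  1. Every piece of logic that needs degradation must be wrapped in its own Command class, even if the method is a single line of code. A system with a hundred degradation points needs a hundred new Command classes, which is sometimes hard to accept.
  2. For a legacy system, adopting Hystrix means a huge amount of work, because a lot of logic has to be wrapped into Commands. You may accept that, but your testers may not.
  3. Every request creates a Command object. Because the Command object carries most of the degradation logic, it is a heavily stateful object and cannot be reused. At high QPS this produces a large number of short-lived objects, which puts some pressure on memory allocation and GC.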

 

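Many users have indeed complained: why is Hystrix so intrusive? The Hystrix designers have their reasons (see the "Why is it so intrusive?" section of https://github.com/Netflix/Hystrix/wiki/FAQ%20:%20General). In their view, an application's dependencies need a clear barrier around them, and the Command pattern is used not only for functional reasons but also as a standard mechanism: the Command object tells users that the resource is protected. So the designers do not recommend annotations or AOP as the way to adopt Hystrix, though they still say: If you still feel strongly that you shouldn't have to modify libraries and add command objects then perhaps you can contribute an AOP module. (Loosely: if creating all these Command objects is too much trouble, go ahead and implement AOP yourself! Just kidding (*^__^*) ).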

 

### Data collection ###

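Collecting data is an essential step. The data for each degradation point (a point that needs degradation protection) is independent, so each point can be given its own policy. These policies are usually based on what we know about each point; early on, you can even observe the collected data first and then decide on a policy. What data is collected? How is it stored? How is it reported? Hystrix has to answer all of these. It collects data with a sliding window divided into buckets (see a separate article for details). This avoids the jumps that occur when statistics switch from one period to the next (through the time window) and controls the granularity of that switching (through the bucket size). Another interesting point: unlike the usual synchronous way of gathering statistics, Hystrix uses RxJava to aggregate statistics asynchronously over event streams, similar to the observer pattern (see a separate article for details). The benefit is a lower risk of statistics blocking business logic, and in some cases a performance gain from multi-core CPUs.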
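To make the sliding window with buckets concrete, here is a minimal sketch. It is not Hystrix's implementation (which is asynchronous and built on RxJava): `RollingCounterSketch`, its synchronous counters, and the explicit `nowMillis` parameter are assumptions made to keep the idea visible. Each slot in a fixed ring holds one time slice, and a slot is reset when its slice falls out of the window.

```java
import java.util.Arrays;

/** Toy rolling counter built from a fixed ring of time buckets. */
class RollingCounterSketch {
    private final long bucketMillis;
    private final long[] bucketStart; // start time of the slice each slot currently holds
    private final int[] successes;
    private final int[] failures;

    RollingCounterSketch(int bucketCount, long bucketMillis) {
        this.bucketMillis = bucketMillis;
        this.bucketStart = new long[bucketCount];
        this.successes = new int[bucketCount];
        this.failures = new int[bucketCount];
        Arrays.fill(bucketStart, -1L); // -1 marks a slot that has never been used
    }

    /** Records one request outcome in the bucket that covers nowMillis. */
    synchronized void record(long nowMillis, boolean success) {
        int i = slotFor(nowMillis);
        if (success) {
            successes[i]++;
        } else {
            failures[i]++;
        }
    }

    /** Error percentage over the buckets still inside the window ending at nowMillis. */
    synchronized int errorPercentage(long nowMillis) {
        long windowMillis = bucketMillis * bucketStart.length;
        int ok = 0;
        int bad = 0;
        for (int i = 0; i < bucketStart.length; i++) {
            if (bucketStart[i] >= 0 && nowMillis - bucketStart[i] < windowMillis) {
                ok += successes[i];
                bad += failures[i];
            }
        }
        int total = ok + bad;
        return total == 0 ? 0 : (bad * 100) / total;
    }

    /** Finds the slot for nowMillis, resetting it if it still holds an older slice. */
    private int slotFor(long nowMillis) {
        long start = nowMillis - (nowMillis % bucketMillis);
        int i = (int) ((nowMillis / bucketMillis) % bucketStart.length);
        if (bucketStart[i] != start) {
            bucketStart[i] = start;
            successes[i] = 0;
            failures[i] = 0;
        }
        return i;
    }
}
```

With ten buckets of 100 ms each, the counter covers roughly the last second, and old data ages out one bucket at a time instead of all at once at a period boundary.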

 

### Behavioral intervention ###

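Once the collected data hits a degradation policy, the degradation tool intervenes in how requests are handled. Behavioral intervention is a key measure of how good a degradation tool is, because its design directly determines how much "resilience" the system really has. Sometimes behavioral intervention and the data collection described above happen in the same step, for example when degradation is driven by a semaphore, a thread pool, or a token bucket algorithm. Designing the intervention takes some skill. Broadly there are two options: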

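  1. Collect data in real time (for the current period) and evaluate the policy on every request (each request is added to the data and analyzed). If the request hits the policy, intervene on that request immediately; otherwise run the normal business logic.
  2. Collect data in real time (for the current period) and evaluate the policy on every request (each request is added to the data and analyzed). Once a single request hits the policy, intervene on all requests for a (configurable) period afterwards, even if no later request hits the policy, until that period has passed.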

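The first option seems the more reasonable one: it always keeps the system's behavior as close as possible to what we expect (every metric stays under the configured policy). In most cases, though, policies are configured loosely rather than strictly, and then the first option still leaves the system exposed to some risk. That is where the more aggressive second option comes in: once some requests push the statistics past the policy, the system degrades all requests for a period afterwards, which is the familiar idea of extended degradation. Hystrix combines the two, which makes its behavioral intervention more flexible.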
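The difference between the two options, and one way to combine them, can be sketched with a semaphore plus a penalty period. This is an illustrative assumption about structure, not how Hystrix is implemented: `InterventionSketch`, `penaltyMillis`, and the explicit `nowMillis` parameter are hypothetical names. With `penaltyMillis` set to 0 the class behaves like the first option; a positive value adds the extended degradation of the second.

```java
import java.util.concurrent.Semaphore;

/** Toy gate combining a per-request check with an extended rejection period. */
class InterventionSketch {
    private final Semaphore permits;   // per-request concurrency check (first option)
    private final long penaltyMillis;  // how long to keep rejecting (second option)
    private long rejectUntil = 0L;

    InterventionSketch(int maxConcurrent, long penaltyMillis) {
        this.permits = new Semaphore(maxConcurrent);
        this.penaltyMillis = penaltyMillis;
    }

    /**
     * Returns true if the caller may run the business logic.
     * A caller that gets true must call release() when it finishes.
     */
    synchronized boolean tryEnter(long nowMillis) {
        if (nowMillis < rejectUntil) {
            return false; // still inside the extended degradation period
        }
        if (!permits.tryAcquire()) {
            // this request hit the concurrency policy: reject it and start the penalty
            rejectUntil = nowMillis + penaltyMillis;
            return false;
        }
        return true;
    }

    void release() {
        permits.release();
    }
}
```

A rejected caller would go straight to its degraded result instead of running the protected logic.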

 

### Result intervention ###

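Should a degraded request return null? A default value? Throw an exception? That depends on the business. HystrixCommand provides the getFallback method so that users can conveniently return a result after degradation.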
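The role of the fallback can be shown with a small stand-in that needs no library. `MiniCommand` below is a toy and is not Hystrix: it only mirrors the shape of `run()`, `getFallback()`, and `execute()`, without thread pools, timeouts, or metrics. `BlacklistCheck` and its `dependencyDown` flag are hypothetical, and returning `true` (treat the user as not blacklisted) is just one possible business decision.

```java
/** Toy stand-in for the command idea: execute() falls back when run() fails. */
abstract class MiniCommand<R> {
    protected abstract R run() throws Exception;

    /** Degraded result; override to return null, a default value, or to throw. */
    protected R getFallback() {
        throw new UnsupportedOperationException("No fallback available.");
    }

    public R execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }
}

/** Hypothetical blacklist check whose dependency can be made to fail. */
class BlacklistCheck extends MiniCommand<Boolean> {
    private final boolean dependencyDown;

    BlacklistCheck(boolean dependencyDown) {
        this.dependencyDown = dependencyDown;
    }

    @Override
    protected Boolean run() throws Exception {
        if (dependencyDown) {
            throw new IllegalStateException("blacklist service unavailable");
        }
        return Boolean.FALSE; // pretend the lookup found the user on the blacklist
    }

    @Override
    protected Boolean getFallback() {
        return Boolean.TRUE; // degraded result: treat the user as not blacklisted
    }
}
```

Here `new BlacklistCheck(false).execute()` returns the real lookup result, while `new BlacklistCheck(true).execute()` returns the degraded default so the draw can continue.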

 

### Fast recovery ###

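Fast recovery matters most for systems that often enter the degraded state because of external factors. A major goal of any degradation system or tool is automation: leaving behind the "primitive era" in which people flip switches by hand to trip a circuit. So once external conditions have recovered, the system should return to normal service in the shortest possible time. This requires that the degradation system, while putting the business system into the degraded state, also gives it a chance to probe the outside environment. Most degradation systems let one request through after a period of time to "give it a try"; if it succeeds, the business system is restored to the normal state. Hystrix takes the same approach.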

 

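If you have read this far, you already have a fair understanding of what Hystrix does. Here is one more diagram from the official documentation: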


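This diagram makes it clear that the officially recommended approach is to isolate and manage business functionality through Commands plus thread pools. Used carelessly, all these thread pools, large and small, become a hidden hazard, so take care not to let this style of using Hystrix turn into an anti-pattern.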

 

To finish, here is a brief summary of what characterizes Hystrix:

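  1. Hystrix makes heavy internal use of the reactive programming model: through the RxJava library, everything that can be asynchronous has been made asynchronous. This seems to reduce code complexity (for people who know RxJava, that is) and can bring unexpected performance gains on multi-core servers.
  2. Hystrix can degrade based on concurrency, latency, and exceptions, can trip the circuit breaker when the error rate (from concurrency limits, rate limiting, or internally raised exceptions) reaches a threshold, and can recover quickly from the degraded state.
  3. Hystrix wraps degradable business logic using the Command pattern, which sometimes raises the cost of adoption.
  4. Hystrix only provides the entry point for changing policies; visualizing policies and configuring them dynamically is left to the user, which is genuinely awkward.
  5. Hystrix's default dashboard only shows simple real-time data; persisting historical data is also left to the user.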

 

Hystrix is not perfect, but perhaps simplicity is a kind of beauty too. Later articles will go deeper into Hystrix's internal design.

 
