Microservice Tool--The Design of Hystrix

Microservices are everywhere today, but like any architecture they bring drawbacks along with their benefits.

As the Internet has grown, applications have evolved from monolithic mainframe systems to distributed computing and now to microservices, and the rise of Docker has made microservices more practical still. Their advantages are well documented elsewhere, so this article focuses on the problems microservices introduce and how to solve them. Working through these problems shows why Hystrix has become a go-to tool for building microservices.

Problems with microservices:

1. Increased complexity of service dependencies: After a monolithic application is split into microservices, functions that used to run in a single JVM process are spread across several JVM processes, and the resulting microservices call one another over RPC to deliver the same functionality. This inevitably makes service dependencies more complex.

2. Fail fast: In the original system everything ran in one JVM process with no network calls, so an exception failed immediately and never left processing hanging. With microservices, the logic is provided by another service across the network. If the provider is very slow, or eventually fails with a timeout, the caller hangs until the failure comes back. Moreover, if 90% of the requests in a given period have failed, the next request need not be sent to the server, where it would only add load; it should fail fast and follow the failure path, and the call can be retried after a while.

3. Resources cannot be allocated fairly: With microservices, a service has more dependencies, and each dependency's servers differ in performance and throughput, so the resources consumed by each dependency need to be isolated. For example, machine P1 hosts three services, SA, SB and SC, which depend on A, B and C respectively. After a while, the servers behind A start responding slowly. The CPU, threads, I/O and other resources of P1 are then consumed by calls to A, leaving nothing idle to handle calls to B and C. As a result, P1 not only fails to serve SA, but SB and SC suffer as well, and eventually P1 goes down. The two figures below illustrate this:

The first figure shows the situation when all dependencies work correctly:

When the response of one dependency, I, starts to slow down, all threads end up allocated to dependency I and hang there, and no other function can be served.

The three problems above map onto the core functions Hystrix provides: monitoring, circuit breaking, and isolation.

Introduction to Hystrix

Hystrix was developed by Netflix to protect microservices. Netflix currently handles tens of billions of thread-isolated requests, and even more semaphore-isolated requests, every day. The code is hosted on GitHub (https://github.com/Netflix/Hystrix). Hystrix monitoring lets you check a microservice's health and its thread and semaphore utilization at any time; circuit breakers provide fast failure while protecting back-end services; and the isolation mechanism keeps microservices running efficiently, degrading gracefully as soon as a dependency misbehaves.

Isolation

The CPU, thread, I/O and other resources mentioned above are scarce and limited; once usage hits the ceiling, the machine is close to going down. In addition, one container often provides several services to the outside world, each of which may have its own dependency. When a dependency misbehaves, only the service that relies on it should become unavailable, while the other services keep running normally.

In practice, when one service becomes unavailable, the second and third services often fail too, precisely because dependencies are not isolated. Once one dependency misbehaves, all the CPU, threads and other resources on the machine are drawn into handling requests to that dependency, and nothing is left to handle requests for other services. This is why each dependency needs its own isolated resources: when a dependency has a problem, only the threads, I/O and so on assigned to it should be busy, while the remaining resources keep serving the other dependencies normally.
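The idea of giving each dependency its own resources can be illustrated with a minimal sketch. It uses only the JDK and is not Hystrix code; the class name, the pool sizes and the returned strings are assumptions made for illustration. Dependency A hangs and occupies every thread in its own pool, yet a call to dependency B still completes because B's threads are separate.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one fixed pool per dependency, so a stuck dependency cannot starve the others.
final class BulkheadSketch {

    /** Blocks every thread of dependency A, then returns what dependency B answers meanwhile. */
    static String callBWhileAIsStuck() {
        ExecutorService poolA = Executors.newFixedThreadPool(2); // sizes are arbitrary for the sketch
        ExecutorService poolB = Executors.newFixedThreadPool(2);
        CountDownLatch dependencyARecovers = new CountDownLatch(1);
        try {
            // Dependency A hangs: both of its threads wait until the latch opens.
            for (int i = 0; i < 2; i++) {
                poolA.submit(() -> {
                    dependencyARecovers.await();
                    return "A";
                });
            }
            // Dependency B runs on its own threads, so it is unaffected by A.
            Future<String> b = poolB.submit(() -> "B is still served");
            return b.get(1, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("dependency B was not served", e);
        } finally {
            dependencyARecovers.countDown(); // let A's tasks finish so the pools can shut down
            poolA.shutdown();
            poolB.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(callBWhileAIsStuck());
    }
}
```

Had both dependencies shared a single two-thread pool, the call to B would have waited behind A's stuck tasks and timed out.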

Hystrix provides two strategies for isolating dependent resources: thread pool isolation and semaphore isolation.

Thread pool isolation

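Using threads incurs a performance cost in the following scenarios: 1. creating and destroying threads; 2. thread context switching, because thread pool scheduling requires the operating system to step in: the system has to switch from user space to kernel space and then back to user space once scheduling is done.

Hystrix solves the first problem by using a fixed-size thread pool. When creating a thread pool, Hystrix (in the source shown here) only lets you set CoreSize, not MaxSize; when the pool is initialized, MaxSize equals CoreSize. This avoids the cost of frequently creating and destroying threads. The thread pool is created in the getThreadPool method of the HystrixConcurrencyStrategy class: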

public ThreadPoolExecutor getThreadPool(final HystrixThreadPoolKey threadPoolKey, HystrixProperty<Integer> corePoolSize, HystrixProperty<Integer> maximumPoolSize, HystrixProperty<Integer> keepAliveTime, TimeUnit unit, BlockingQueue<Runnable> workQueue) {
    return new ThreadPoolExecutor(corePoolSize.get(), maximumPoolSize.get(), keepAliveTime.get(), unit, workQueue, new ThreadFactory() {

        protected final AtomicInteger threadNumber = new AtomicInteger(0);

        @Override
        public Thread newThread(Runnable r) {
            Thread thread = new Thread(r, "hystrix-" + threadPoolKey.name() + "-" + threadNumber.incrementAndGet());
            thread.setDaemon(true);
            return thread;
        }
    });
}

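This looks no different from creating an ordinary thread pool: a pool is built from the parameters passed in. The trick lies at the call site, where the same value is passed for both corePoolSize and maximumPoolSize. The call site is line 173 of the HystrixThreadPool class: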

public HystrixThreadPoolDefault(HystrixThreadPoolKey threadPoolKey, HystrixThreadPoolProperties.Setter propertiesDefaults) {
    this.properties = HystrixPropertiesFactory.getThreadPoolProperties(threadPoolKey, propertiesDefaults);
    HystrixConcurrencyStrategy concurrencyStrategy = HystrixPlugins.getInstance().getConcurrencyStrategy();
    this.queueSize = properties.maxQueueSize().get();
    this.queue = concurrencyStrategy.getBlockingQueue(queueSize);
    this.metrics = HystrixThreadPoolMetrics.getInstance(
            threadPoolKey,
            // The thread pool is created here; coreSize() is passed as both corePoolSize and maximumPoolSize
            concurrencyStrategy.getThreadPool(threadPoolKey, properties.coreSize(), properties.coreSize(), properties.keepAliveTimeMinutes(), TimeUnit.MINUTES, queue),
            properties);
    this.threadPool = metrics.getThreadPool();

    /* strategy: HystrixMetricsPublisherThreadPool */
    HystrixMetricsPublisherFactory.createOrRetrievePublisherForThreadPool(threadPoolKey, this.metrics, this.properties);
}
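To see what a pool with equal core and maximum sizes does under load, here is a small JDK-only sketch. It is not taken from Hystrix: the class and method names, the "sketch-" thread name prefix and the use of a SynchronousQueue as the work queue are assumptions for illustration. Once every thread is busy, additional tasks are rejected immediately instead of causing new threads to be created, which gives the caller a natural place to fail fast.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a fixed-size pool; names are illustrative, not Hystrix APIs.
final class FixedPoolSketch {

    /** Builds a pool whose maximum size equals its core size, with named daemon threads. */
    static ThreadPoolExecutor newFixedPool(String name, int coreSize) {
        AtomicInteger threadNumber = new AtomicInteger(0);
        return new ThreadPoolExecutor(
                coreSize, coreSize,       // core == max: the pool never grows beyond coreSize
                1, TimeUnit.MINUTES,
                new SynchronousQueue<>(), // no buffering: a task gets a thread or is rejected
                r -> {
                    Thread t = new Thread(r, "sketch-" + name + "-" + threadNumber.incrementAndGet());
                    t.setDaemon(true);
                    return t;
                });
    }

    /** Submits blocking tasks and counts how many the saturated pool rejects. */
    static int countRejections(int coreSize, int submissions) {
        ThreadPoolExecutor pool = newFixedPool("demo", coreSize);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulate a dependency that does not answer
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the caller could fail fast or run a fallback here
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + countRejections(2, 5));
    }
}
```

With two threads and five blocking submissions, the first two tasks each occupy a thread and the remaining three are rejected.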

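Semaphore isolation

To address the second source of overhead in thread pool isolation, Hystrix offers semaphore isolation. Semaphore isolation manages permits with the atomic class AtomicInteger, which relies on CAS operations. Unlike a lock, CAS is an atomic operation implemented in hardware, so it costs less. The source for acquiring a permit is shown below: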

public boolean tryAcquire() {
    // count is a member variable of type AtomicInteger
    int currentCount = count.incrementAndGet();
    if (currentCount > numberOfPermits.get()) {
        // over the limit: undo the increment and refuse the permit
        count.decrementAndGet();
        return false;
    } else {
        return true;
    }
}

When tryAcquire succeeds in obtaining a permit, the caller must release it in a finally block by calling the release method.

public void release() {
    // decrement the permit count by one
    count.decrementAndGet();
}
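The acquire-then-release-in-finally pattern described above can be sketched as a self-contained class. The counting logic mirrors the excerpts; the class name, the execute helper, the inUse accessor and the fallback parameter are hypothetical additions for illustration and are not Hystrix APIs.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of semaphore isolation built on an AtomicInteger counter.
final class SemaphoreIsolationSketch {
    private final AtomicInteger count = new AtomicInteger(0);
    private final int numberOfPermits;

    SemaphoreIsolationSketch(int numberOfPermits) {
        this.numberOfPermits = numberOfPermits;
    }

    boolean tryAcquire() {
        int currentCount = count.incrementAndGet();
        if (currentCount > numberOfPermits) {
            count.decrementAndGet(); // undo the optimistic increment
            return false;
        }
        return true;
    }

    void release() {
        count.decrementAndGet();
    }

    /** Number of permits currently held. */
    int inUse() {
        return count.get();
    }

    /** Runs the call on the caller's own thread if a permit is free; otherwise returns the fallback at once. */
    <T> T execute(Supplier<T> call, T fallback) {
        if (!tryAcquire()) {
            return fallback;
        }
        try {
            return call.get();
        } finally {
            release(); // always give the permit back, even if the call throws
        }
    }

    public static void main(String[] args) {
        SemaphoreIsolationSketch semaphore = new SemaphoreIsolationSketch(1);
        System.out.println(semaphore.execute(() -> "ok", "fallback"));
    }
}
```

Because the call runs on the caller's thread, no thread hand-off or extra scheduling is involved; the trade-off is that a slow call still occupies the calling thread until it returns.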

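It is recommended to isolate different businesses from one another with thread pools, and to isolate the different dependencies within one business with semaphores, to improve throughput and performance.

Circuit breaker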

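A circuit breaker works like the automatic trip switch in a household circuit: when the current load is too high, or there is an unsafe condition such as leakage, the switch trips automatically and protects the appliances. Circuit breakers in software are similar: when errors occur on the request path, the breaker opens. The difference is that Hystrix tries to close the circuit again after a while, letting some requests through to check whether the fault has cleared.

The circuit breaker mechanism means that when the availability of a back-end service drops below a threshold, new requests are no longer sent to it. They fail immediately, so the caller gets a quick answer and proceeds with its failure handling. This both blocks needless requests, easing the pressure on a back-end service that is already struggling, and implements the fail-fast mechanism.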
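The threshold decision can be sketched as follows. This is a deliberately simplified illustration, not how Hystrix tracks health: it counts over the whole lifetime of the object rather than over a rolling time window, and the class name, method names and threshold parameter are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: decide whether to fail fast from a running failure percentage.
final class FailureRateSketch {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private final int thresholdPercent;

    FailureRateSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    /** Records the outcome of one request to the back-end service. */
    void record(boolean success) {
        total.incrementAndGet();
        if (!success) {
            failures.incrementAndGet();
        }
    }

    /** Failure percentage so far, rounded down; 0 when nothing has been recorded. */
    int errorPercentage() {
        long t = total.get();
        return t == 0 ? 0 : (int) (failures.get() * 100 / t);
    }

    /** True when new requests should be rejected without calling the back end. */
    boolean shouldFailFast() {
        return total.get() > 0 && errorPercentage() >= thresholdPercent;
    }

    public static void main(String[] args) {
        FailureRateSketch health = new FailureRateSketch(50);
        health.record(true);
        health.record(false);
        System.out.println(health.errorPercentage() + "% failures, fail fast: " + health.shouldFailFast());
    }
}
```

A production implementation would also expire old samples, so that a service that has recovered is not judged by failures from long ago.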

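The circuit breaker is enabled by default in Hystrix. It can also be enabled explicitly through HystrixCommandProperties.Setter when the HystrixCommand is constructed, as follows: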

HystrixCommandProperties.Setter()
        // whether to enable the circuit breaker
        .withCircuitBreakerEnabled(true)

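With the circuit breaker enabled, Hystrix's default threshold is 50%. If the request failure rate reaches 50% or more within 10 seconds, Hystrix automatically opens the circuit; subsequent requests are no longer sent to the back-end server, and a failure is returned to the client directly. After 5 seconds, Hystrix tries to close the circuit by letting a trial request through to check whether the fault has cleared. This is implemented in the allowSingleTest method of the HystrixCircuitBreakerImpl class, whose source is as follows: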

public boolean allowSingleTest() {
    long timeCircuitOpenedOrWasLastTested = circuitOpenedOrLastTestedTime.get();
    // 1) if the circuit is open
    // 2) and it's been longer than 'sleepWindow' since we opened the circuit
    if (circuitOpen.get() && System.currentTimeMillis() > timeCircuitOpenedOrWasLastTested + properties.circuitBreakerSleepWindowInMilliseconds().get()) {
        // We push the 'circuitOpenedTime' ahead by 'sleepWindow' since we have allowed one request to try.
        // If it succeeds the circuit will be closed, otherwise another singleTest will be allowed at the end of the 'sleepWindow'.
        if (circuitOpenedOrLastTestedTime.compareAndSet(timeCircuitOpenedOrWasLastTested, System.currentTimeMillis())) {
            // if this returns true that means we set the time so we'll return true to allow the singleTest
            // if it returned false it means another thread raced us and allowed the singleTest before we did
            return true;
        }
    }
    return false;
}

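properties.circuitBreakerSleepWindowInMilliseconds().get() returns the configured interval, which defaults to 5 seconds. Each time the circuit breaker opens, the current time is saved; when Hystrix next decides whether the circuit can be closed, it checks whether the time elapsed since the circuit was opened or last tested has reached the configured value. Hystrix makes heavy use of CAS, which keeps updates to member variables atomic while also improving performance.

Monitoring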
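The open, wait, single-trial cycle can be sketched in a self-contained form. The field names follow the excerpt, but everything else is an assumption for illustration: the class name, the markOpen and markSuccess methods, and the injectable clock (added only so the behavior can be checked without sleeping). Failure-rate tracking is left out; the sketch assumes some other component calls markOpen when the threshold is crossed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of the sleep-window logic; not the Hystrix implementation.
final class CircuitBreakerSketch {
    private final AtomicBoolean circuitOpen = new AtomicBoolean(false);
    private final AtomicLong circuitOpenedOrLastTestedTime = new AtomicLong();
    private final long sleepWindowMillis;
    private final LongSupplier clock; // assumption: injectable time source in milliseconds

    CircuitBreakerSketch(long sleepWindowMillis, LongSupplier clock) {
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** Opens the circuit and remembers when it happened. */
    void markOpen() {
        if (circuitOpen.compareAndSet(false, true)) {
            circuitOpenedOrLastTestedTime.set(clock.getAsLong());
        }
    }

    /** Closes the circuit again after a trial request has succeeded. */
    void markSuccess() {
        circuitOpen.set(false);
    }

    /** True if the request may be sent to the back end. */
    boolean allowRequest() {
        return !circuitOpen.get() || allowSingleTest();
    }

    private boolean allowSingleTest() {
        long lastTested = circuitOpenedOrLastTestedTime.get();
        long now = clock.getAsLong();
        // Only the thread that wins the CAS may send the trial request for this sleep window.
        return circuitOpen.get()
                && now > lastTested + sleepWindowMillis
                && circuitOpenedOrLastTestedTime.compareAndSet(lastTested, now);
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(5000, System::currentTimeMillis);
        breaker.markOpen();
        System.out.println("allowed right after opening: " + breaker.allowRequest());
    }
}
```

If the trial request fails, the circuit simply stays open and the next trial is allowed one sleep window later, because the CAS has already moved the timestamp forward.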

http://blog.csdn.net/a_fengzi_code_110/article/details/53643527
