How to reduce an application's MTTR (mean time to recovery) during chaos engineering drills | JD Cloud technical team

In the enterprise business domain, Jinli is a one-stop solution for scenarios such as employee benefits, marketing, and incentive procurement, including a flexible SaaS incentive platform for employees and members. Since it serves all employees of a company directly, the high availability of its services is particularly important. This article introduces how we reduced the application's MTTR through chaos engineering drills on the eve of the Jinli Mall promotion.

MTTR (mean time to recovery) is the average time required to recover from a product or system failure. It covers the entire duration of the outage, from the moment the system or product fails until it is restored to full operation.
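For example, if a system suffers three outages in a quarter lasting 10, 20, and 30 minutes, its MTTR for that quarter is (10 + 20 + 30) / 3 = 20 minutes.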

How can we reduce an application's MTTR in a chaos drill scenario? Must we rely on monitoring and fault localization, followed by manual handling? Can the response be automated, and are there solutions that reduce the impact during the chaos drill itself? If so, we can stop the bleeding quickly and further improve the stability of the system.

This article will answer the above questions based on some thinking and practice.

Failures are ubiquitous and unavoidable.

We will start with our investigation of, and response to, two incidents: host machine restarts and a chaos drill on an underlying service.

Background

[Client perspective]: A large number of interfaces (including order submission) reported timeout errors, the availability rate showed sudden drops, and some customers were affected, resulting in customer complaints.

Through troubleshooting, we found that the causes were host machine restarts and a chaos drill on underlying services during the early stage of preparation for the big promotion, which degraded the availability and performance of our system for an extended period. In particular, problems with core interfaces affect the availability of many interfaces on a large scale and further hurt the experience of purchasing customers.

This matters especially in the To-B field, where major customers carry a word-of-mouth effect: if a top customer happens to run into the problem, it is easily magnified and intensified.

Temporary measures

On the one hand, we worked with the operations team to confirm why the host machine restarted without timely notification; on the other hand, we communicated the drill's impact to the underlying service provider and recommended that they follow drill principles: minimize the blast radius, control the scope of impact, and keep the drill controllable.

Beyond this coordination with external parties, we also reflected internally. First of all, such failures are inherently uncontrollable: whether it is a host machine restart or a chaos drill, these scenarios can occur in production with some probability (and already have). Is our only option then to monitor and locate the fault, and then manually remove the machine or notify the service provider? Can the response be automated, and are there options to reduce the impact? If so, we can stop the bleeding quickly and further improve the stability of the system.

Long-term plan - JSF middleware capability practice

Since failures cannot be avoided, we embrace them and build, through technical means, the application's ability to cope with failures, so as to ensure high availability.

Since more than 90% of our internal calls are JSF RPC calls, we focus on the fault tolerance of the JSF middleware. The following mainly introduces the theory and practice of JSF middleware's failover capabilities: timeout and retry, adaptive load balancing, and service circuit breaking.

Practice is the sole criterion for testing truth.

About timeouts and retries

In actual development, I believe you have seen plenty of failures caused by missing or unreasonable timeout settings. When a timeout is not set, or is set unreasonably, request responses slow down, and the continuous accumulation of slow requests causes a chain reaction, or even an application avalanche.

A reasonable timeout and retry strategy should be set, and paid attention to, not only for our own services but also for external dependencies, and not only for HTTP services but also for middleware services.

First of all, the timeout and retry strategies for read and write services differ greatly. Read services are naturally suited to retries (for example, retry twice after setting a reasonable timeout), whereas most write services cannot be retried safely; if the writes are designed to be idempotent, however, retries become possible.
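To illustrate this distinction, here is a minimal sketch in Java (an illustration only, not the JSF retry mechanism; the class and parameter names are hypothetical) that retries a call only when the operation is declared idempotent and otherwise fails after the first attempt.

import java.util.concurrent.Callable;

// Minimal retry helper, an illustrative sketch only (not the JSF retry implementation);
// the class and parameter names here are hypothetical.
public class IdempotentRetry {

    // Retries only when the operation is declared idempotent (a read, or a write
    // protected by something like a unique request id); otherwise it fails fast.
    public static <T> T call(Callable<T> rpc, boolean idempotent, int maxRetries) throws Exception {
        int attempts = idempotent ? maxRetries + 1 : 1; // non-idempotent writes get a single attempt
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return rpc.call();
            } catch (Exception timeoutOrTransportError) {
                last = timeoutOrTransportError; // safe to retry only because the call is idempotent
            }
        }
        throw last;
    }
}

For example, a read such as queryActivityConfig can be retried safely, whereas an order-submission write should only be retried if a unique request identifier makes it idempotent.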

In addition, before setting the caller's timeout, you need to know the TP99 response time of the dependent service (if the dependency's performance fluctuates greatly, TP95 can also be used as a reference). The caller's timeout can then be set to roughly 50% above that value. Of course, a service's response time is not constant; some long-tail requests need more computing time, so the timeout must be generous enough to allow such long-tail responses to complete.
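As a concrete example of this rule of thumb: the queryActivityConfig interface used in the drill below has a TP99 of about 6 ms, so a caller timeout of roughly 6 ms + 50% ≈ 9 ms, rounded up to 10 ms, matches the configuration used later in this article.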

Finally, the number of retries should not be too high (generally 2, at most 3). Although more retries mean higher service availability, under high concurrency the retries multiply the request traffic, similar to a self-inflicted DDoS attack, and in severe cases can even accelerate cascading failures. Therefore, timeout retries work best in combination with circuit breaking, fast failure, and similar mechanisms, which will be covered later.

Besides introducing these mechanisms, it is just as important to verify that they are effective.

Simulated scenario (the two mechanisms discussed later are also verified with this scenario)

Approach: use fault injection (a network delay of 3000-5000 ms on 50% of the machines) to simulate a similar scenario and verify the behavior.

The machine deployment is as follows:

Key monitoring points for the load-test interface (QPS 300) and the fault-injected interface:

1. Load-test interface: jdos_b2b2cplatform.B2b2cProductProviderImpl.queryProductBpMap

2. Service consumer: jdos_b2b2cplatform.ActivityConfigServiceRPCImpl.queryActivityConfig

3. Service provider: jdos_b2b2cshop.com.jd.ka.b2b2c.shop.service.impl.sdk.ActivityConfigServiceImpl.queryActivityConfig

[Note] The network fault scenario is not supported in the following situations:

1. The computer room where the application container is located: xxx

2. The kernel version of the physical machine: xxx

Normal case (no fault injected)

Fault injected, with an unreasonable timeout setting (timeout 2000 ms, 2 retries)

Fault injected, with a reasonable timeout setting (timeout 10 ms, 2 retries)

The interface's TP99 is around 6 ms, so the timeout is set to 10 ms with 2 retries, namely: <jsf:method name="queryActivityConfig" timeout="10" retries="2"/>

Timeout retry summary

With reasonable timeout and retry settings, requests remain stable overall, and the failover that a retry provides greatly improves interface availability.

Timeout retry supplement

When the interface is not split along reasonable dimensions, we can use method-level timeout and retry configuration for finer granularity. One caveat: JSF's current annotation-based configuration does not support method-level timeout and retry settings, only interface-level ones. If annotation-based configuration is already in use, you can migrate to XML configuration (as in the consumer configuration shown in the next section) to get method-level settings.

About adaptive load balancing

The shortestresponse adaptive load-balancing strategy is designed to address uneven capacity among provider nodes: providers with weaker processing capacity receive less traffic, so that individual poorly performing providers do not drag down the consumer's overall call latency and availability.

Those who are able work more than those who are clumsy, and those who are wise worry more than those who are foolish.

However, there are some problems with this strategy:

  1. Traffic may concentrate excessively on high-performance instances, so a provider's single-machine traffic limit may become a bottleneck.
  2. Response time does not always reflect a machine's throughput capacity.
  3. In most scenarios, when there is no obvious difference in response time among providers, shortestresponse behaves the same as random.

The existing shortestresponse implementation is similar to the P2C (Power of Two Choices) algorithm, except that its calculation does not rely only on the number of connections currently being processed. By default it randomly selects two service providers to take part in a fastest-response comparison: it tracks request latency, call counts, exceptions, and concurrent requests for each provider, compares average response time multiplied by the current number of requests, and uses that as the load estimate to pick the winner, which avoids the herd effect. In this way the throughput capacity of provider machines is measured adaptively, and traffic is routed as much as possible to machines with higher throughput, improving the overall service performance of the system. The consumer configuration below enables this strategy:

    <jsf:consumer id="activityConfigService"
                  interface="com.jd.ka.b2b2c.shop.sdk.service.ActivityConfigService"
                  alias="${jsf.activityConfigService.alias}" timeout = "3000" filter="jsfLogFilter,jsfSwitchFilter"
                  loadbalance="shortestresponse">
        <jsf:method name="queryActivityConfig" timeout="10" retries="2"/>
    </jsf:consumer>
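To make the selection logic more tangible, here is a simplified sketch (illustrative only, not JSF's internal implementation; the names ProviderStats, avgResponseMs, and inflight are assumptions): pick two providers at random and route the request to the one with the lower estimated load, where load is approximated by average response time multiplied by the number of in-flight requests.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative P2C-style "shortest response" selection; not JSF's internal code.
class ProviderStats {
    final String address;           // provider node address
    volatile double avgResponseMs;  // sliding-window average response time
    volatile int inflight;          // requests currently being processed on this node

    ProviderStats(String address) { this.address = address; }

    // average response time * concurrent requests ~ expected wait on this node
    double estimatedLoad() { return avgResponseMs * (inflight + 1); }
}

class ShortestResponseSelector {
    // Randomly sample two providers and keep the one with the lower estimated load,
    // which avoids the herd effect of always picking the single "best" node.
    static ProviderStats select(List<ProviderStats> providers) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        ProviderStats a = providers.get(rnd.nextInt(providers.size()));
        ProviderStats b = providers.get(rnd.nextInt(providers.size()));
        return a.estimatedLoad() <= b.estimatedLoad() ? a : b;
    }
}

A production implementation would also need sliding-window statistics and warm-up handling for newly started providers, which this sketch omits.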

Fault injected (with adaptive load balancing configured)

Adaptive load balancing summary

With adaptive load balancing, the "let the capable do more" pattern takes effect from the very first calls to the interface, and the better-performing machines that are selected carry more traffic. After the fault is injected, the brief window of degraded interface availability disappears and the availability drop points are gone, which further guarantees the high availability and performance of the service.

About service circuit breaker

When a circuit is short-circuited or severely overloaded, the fuse blows automatically to protect the circuit and prevent serious damage to equipment, or even fire.

Service circuit breaking (fusing) is a protection mechanism for call links that involve unstable services.

The basic idea behind it is very simple: wrap the protected call in a circuit breaker object that monitors failures. When a service on the call link becomes unavailable or responds too slowly, and the failure metrics reach the configured threshold, the breaker trips and no further calls are made to that node's service within the open window, minimizing the impact of the unstable downstream service on its upstream callers.
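To make the state machine concrete, here is a minimal single-class sketch (an illustration of the general pattern under simplifying assumptions, not the JSF reduceCircuitBreakerStrategy implementation; the rolling statistics window and half-open state described in the configuration below are omitted for brevity): the breaker opens once the error rate over a sample of calls reaches a threshold, fails fast while open, and probes the service again after the open window elapses.

import java.util.concurrent.Callable;

// Minimal circuit breaker sketch (illustration only, not the JSF implementation).
// A real breaker would use a rolling statistics window and a half-open state.
public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int minRequests;        // minimum calls before the error rate is judged
    private final double errorPercentage; // error rate (%) that opens the breaker
    private final long openedDurationMs;  // how long calls are rejected once open

    private State state = State.CLOSED;
    private int total, errors;
    private long openedAt;

    public SimpleCircuitBreaker(int minRequests, double errorPercentage, long openedDurationMs) {
        this.minRequests = minRequests;
        this.errorPercentage = errorPercentage;
        this.openedDurationMs = openedDurationMs;
    }

    public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openedDurationMs) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.CLOSED;   // open window elapsed: probe the downstream service again
            total = errors = 0;
        }
        try {
            T result = protectedCall.call();
            total++;
            return result;
        } catch (Exception e) {
            total++;
            errors++;
            if (total >= minRequests && 100.0 * errors / total >= errorPercentage) {
                state = State.OPEN; // too many errors: trip the breaker
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}

In JSF itself this behavior is configured declaratively; the strategy configuration below shows the relevant knobs.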

<!-- Service circuit-breaker strategy configuration.
     enable:                           whether the strategy is enabled
     rollingStatsTime:                 rolling sampling window for breaker metrics, in ms (default 5000)
     triggerOpenMinRequestCount:       minimum number of requests within the window before the breaker may open (default 20)
     triggerOpenErrorCount:            error-count threshold within the window (default 0; a value <= 0 means the error count is not used to open the breaker)
     triggerOpenErrorPercentage:       error-percentage threshold within the window (default 50, i.e. a 50% error rate)
     triggerOpenSlowRT:                request duration in ms above which a call counts as slow (default 0, i.e. slow calls are not judged)
     triggerOpenSlowRequestPercentage: slow-call percentage within the sampling window that triggers the breaker (default 0, i.e. slow-call breaking is disabled)
     openedDuration:                   how long the breaker stays open, in ms (default 5000)
     halfOpenPassRequestPercentage:    percentage of traffic let through per unit time in the half-open state (default 40)
     halfOpenedDuration:               duration of the half-open state; must be >= rollingStatsTime (defaults to rollingStatsTime)
     failBackType:                     failback strategy: FAIL_BACK_EXCEPTION (throw an exception), FAIL_BACK_NULL (return null), or FAIL_BACK_CUSTOM (custom strategy, used together with failBackRef)
     failBackRef:                      required when failBackType is FAIL_BACK_CUSTOM; a bean implementing com.jd.jsf.gd.circuitbreaker.failback.FailBack<Invocation> -->
<jsf:reduceCircuitBreakerStrategy id="demoReduceCircuitBreakerStrategy"
    enable="true"
    rollingStatsTime="1000"
    triggerOpenMinRequestCount="10"
    triggerOpenErrorCount="0"
    triggerOpenErrorPercentage="50"
    openedDuration="10000"
    halfOpenPassRequestPercentage="30"
    halfOpenedDuration="3000"/>

<jsf:consumer id="activityConfigService" interface="com.jd.ka.b2b2c.shop.sdk.service.ActivityConfigService"
              alias="${consumer.alias.com.jd.ka.b2b2c.shop.sdk.service.ActivityConfigService}" timeout="2000" check="false"
              serialization="hessian" loadbalance="shortestresponse"
              connCircuitBreakerStrategy="demoReduceCircuitBreakerStrategy">
    <jsf:method name="queryActivityConfig" timeout="10" retries="2"/>
</jsf:consumer>

A small episode here: because of JSF's own heartbeat mechanism, a faulty machine is detected and automatically removed (probed every 30 s and removed after three consecutive abnormal probes). As a result, the circuit-breaker behavior we configured was not obvious, so we reset the fault (network delay 800-1500 ms) and ran the drill again.

Fault injected (with the service circuit breaker configured)

Service circuit breaker summary

From the availability perspective, calls to the abnormal machine nodes are indeed cut off within the open window. However, because no failback strategy is implemented and the open window is short, calls still come back as failures once the breaker opens, so availability is still affected. Therefore, rather than simply failing after the breaker trips, the best approach is to combine circuit breaking with service degradation: invoke pre-set degradation logic and return its result as the final call result, giving the service caller a more graceful response.
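As a sketch of this "degrade instead of fail" idea on the caller side (an illustrative assumption, not the JSF failback mechanism; the names DegradingCaller and callWithFallback are hypothetical): when the call fails or is rejected by the breaker, return a pre-set default result so the caller still gets a usable, if degraded, response.

import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Caller-side degradation sketch: return a fallback result instead of propagating the failure.
public class DegradingCaller {
    public static <T> T callWithFallback(Callable<T> rpc, Supplier<T> fallback) {
        try {
            return rpc.call();
        } catch (Exception rejectedOrFailed) {
            // e.g. breaker open, or timeout/retries exhausted: fall back to cached or default data
            return fallback.get();
        }
    }
}

For queryActivityConfig, for example, the fallback could return an empty activity configuration so that page rendering degrades gracefully instead of surfacing an error.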

Service circuit breaker supplement

  1. The group has built a unified circuit-breaker component and corresponding platform capabilities on the Taishan platform. Teams that need circuit-breaking capability can integrate it directly, avoiding duplicated construction.
  2. One mechanism may defeat another.

In fact, to make systems more flexible and robust in the face of various failures and unpredictable situations, distributed systems are usually designed to fail partially: even if not all customers can be served, the system can still serve some of them. Circuit breaking, by contrast, turns a partial failure into a complete failure of that dependency in order to prevent the fault from propagating further. There is therefore a tension between service circuit breaking and this design principle of distributed systems, so careful analysis and thought, as well as subsequent tuning, are required before using it.

Conclusion

Capability is only a means; stability is the goal.

No matter which methods we use to build stability, we always need to think about how to strike a balance between business needs and stability building, so as to build a high-availability architecture that supports long-term business growth.


That is all for this article. If you have any questions, you are welcome to discuss them. I hope some of the experience shared here brings you something useful, or at least prompts you to think about which technical solutions and means you would use to solve similar problems. Feel free to leave a comment, and I look forward to exchanging ideas with more like-minded partners.

Finally, as usual, everyone is welcome to like, favorite, and follow.

Reference documents

The power of two random choices: https://brooker.co.za/blog/2012/01/17/two-random.html

Load balancing: https://cn.dubbo.apache.org/zh-cn/overview/core-features/load-balance/#shortestresponse

Author: JD Retail Li Mengdong

Source: JD Cloud Developer Community
