RPC timeout settings: one careless mistake and you have an online incident!

Author | Luo Junwu

Source | Advanced Workplace of IT People (ID: BestITer)

Monitoring charts full of timeout alarms are all too familiar to server-side developers: in day-to-day system maintenance, "service timeout" is probably the kind of problem that triggers the most monitoring alerts.

This is especially true under a microservice architecture, where a request may traverse a very long call chain and the result can only be returned after multiple service calls. When a timeout occurs, developers have to analyze not only the performance of their own system but also that of the services it depends on, which is why service timeouts are harder to troubleshoot than service errors or failed calls.

Taking a real online incident as a starting point, this article systematically explains how to correctly understand and set the timeout of RPC interfaces under a microservice architecture, so that you have a more global perspective when developing server-side interfaces. The content is divided into the following four parts:

  • Starting from an online incident caused by an RPC interface timeout

  • How is the timeout mechanism implemented?

  • What problems are timeouts and retries meant to solve?

  • How to set a reasonable timeout period?

Starting from an online incident

The incident happened in the homepage recommendation module of an e-commerce APP. One day around noon we suddenly received user feedback: apart from the banner and navigation area at the top of the APP homepage, the recommendation module below had become a blank page (the recommendation module takes up 2/3 of the homepage and shows a list of products recommended in real time by the algorithm according to user interests).

The above business scenario can be understood with the help of the following call chain:

  • The APP initiates an HTTP request to the service gateway

  • The service gateway makes an RPC call to the recommendation service to obtain the list of recommended products

  • If the call in step 2 fails, the gateway degrades the service and instead makes an RPC call to the product ranking service to obtain a list of hot-selling products

  • If the call in step 3 also fails, it downgrades again and reads the hot-selling product list directly from the Redis cache

At first glance, the two downgrade strategies for the dependent services seem to cover everything: in theory, even if both the recommendation service and the product ranking service went down, the server should still be able to return data to the APP. Yet the recommendation module on the APP side was indeed blank, so the downgrade strategies clearly did not take effect. The troubleshooting process is described in detail below.

1. The troubleshooting process

Step 1: By capturing packets, the APP side found that the HTTP request had timed out (the timeout was set to 5 seconds).

Step 2: From its logs, the service gateway found that RPC calls to the recommendation service were timing out on a large scale (the timeout was set to 3 seconds).

Step 3: From its logs, the recommendation service found that Dubbo's thread pool was exhausted.

Through the above three steps, the problem was basically narrowed down to the recommendation service. Further investigation showed that the Redis cluster the recommendation service depends on had become unavailable, causing its calls to time out, which in turn exhausted the Dubbo thread pool. The detailed root cause is not expanded on here, as it is not relevant to the topic of this article.

2. Why the downgrade strategy did not take effect

Let's continue the analysis: why did the service gateway's downgrade strategy not take effect when the recommendation service call failed? In theory, shouldn't it have downgraded to the product ranking service as a fallback?

The root cause was finally tracked down: the timeout for the APP calling the service gateway is 5 seconds, the timeout for the service gateway calling the recommendation service is 3 seconds, and 3 retries on timeout were also configured. Each failed attempt consumes the full 3 seconds, so the original call plus a single retry already takes 6 seconds; by then the APP's HTTP request has long since timed out, and none of the service gateway's downgrade strategies ever gets a chance to take effect.
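
To make the arithmetic concrete, here is a minimal sketch of the time budget, assuming (as in the incident) that every attempt runs to its full 3-second timeout before failing; the figures are simply the ones quoted above:

public class TimeoutBudget {
    public static void main(String[] args) {
        long perAttemptTimeoutMs = 3000; // gateway -> recommendation service timeout
        int retriesOnTimeout = 3;        // retries configured at the gateway
        long appHttpTimeoutMs = 5000;    // APP -> gateway HTTP timeout

        // Worst case: every attempt runs to its full timeout before failing
        long worstCaseGatewayMs = perAttemptTimeoutMs * (1 + retriesOnTimeout); // 12000 ms

        System.out.println("Gateway worst-case wait: " + worstCaseGatewayMs + " ms");
        System.out.println("APP gives up after:      " + appHttpTimeoutMs + " ms");
        // Even one retry alone (2 * 3000 ms = 6000 ms) already exceeds the APP's 5000 ms budget,
        // so the gateway's downgrade logic never gets a chance to run.
    }
}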

3. Solutions

  • Changed the timeout for the service gateway's call to the recommendation service to 800 ms (the recommendation service's TP99 is about 540 ms), and changed the number of retries on timeout to 2

  • Changed the timeout for the service gateway's call to the product ranking service to 600 ms (the product ranking service's TP99 is about 400 ms), and changed the number of retries on timeout to 2 (a configuration sketch follows below)
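
For illustration only, here is a minimal sketch of what these settings could look like with Dubbo's programmatic configuration API; RecommendService and ProductRankService are hypothetical interface names, and the same values would more commonly live in XML or annotation config:

import com.alibaba.dubbo.config.ReferenceConfig;

public class GatewayReferences {

    ReferenceConfig<RecommendService> recommendReference() {
        ReferenceConfig<RecommendService> ref = new ReferenceConfig<>();
        ref.setInterface(RecommendService.class);
        ref.setTimeout(800); // slightly above the recommendation service's TP99 (~540 ms)
        ref.setRetries(2);   // at most 2 retries on timeout
        return ref;
    }

    ReferenceConfig<ProductRankService> rankReference() {
        ReferenceConfig<ProductRankService> ref = new ReferenceConfig<>();
        ref.setInterface(ProductRankService.class);
        ref.setTimeout(600); // slightly above the product ranking service's TP99 (~400 ms)
        ref.setRetries(2);
        return ref;
    }
}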

Setting the timeout and the number of retries requires weighing many factors, such as the latency of every dependent service along the whole call chain and whether each service is a core service. These considerations are not expanded on here; the concrete method is introduced in detail later in this article.

How is the timeout mechanism implemented?

Only by understanding how the RPC framework implements timeouts can we configure them well. Whether it is Dubbo, Spring Cloud, or an in-house microservice framework from a large company (such as JD's JSF), the timeout implementation is basically similar. Let's take the Dubbo 2.8.4 source code as an example and look at the concrete implementation.

Anyone familiar with Dubbo knows that the timeout can be configured in two places: on the provider (the server, i.e. the service provider) and on the consumer (the service caller). The provider's timeout configuration serves as the consumer's default: as long as the provider sets a timeout, consumers do not need to set one themselves, because the value is passed to them through the registry. This simplifies configuration on the one hand, and on the other hand it is reasonable to leave the configuration to the provider, since the provider knows the performance of its own interfaces best.

Dubbo supports very fine-grained timeout settings: method level, interface level, and global. If all levels are configured at the same time, the priority is: consumer method level > provider method level > consumer interface level > provider interface level > consumer global > provider global.
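
As a rough sketch of what multi-granularity configuration looks like with Dubbo's programmatic API (the interface and method names below are hypothetical; the same settings can be expressed in XML or annotations), an interface-level value acts as the default and a method-level value overrides it, with the consumer's method-level value winning overall:

import java.util.Collections;
import com.alibaba.dubbo.config.MethodConfig;
import com.alibaba.dubbo.config.ReferenceConfig;
import com.alibaba.dubbo.config.ServiceConfig;

public class TimeoutGranularitySketch {

    void configure(RecommendService impl) {
        // Provider side: interface-level default of 1000 ms,
        // overridden to 800 ms for the getRecommendList method
        ServiceConfig<RecommendService> service = new ServiceConfig<>();
        service.setInterface(RecommendService.class);
        service.setRef(impl);
        service.setTimeout(1000);
        MethodConfig providerMethod = new MethodConfig();
        providerMethod.setName("getRecommendList");
        providerMethod.setTimeout(800);
        service.setMethods(Collections.singletonList(providerMethod));

        // Consumer side: a method-level timeout here (600 ms) has the highest priority
        // and overrides everything configured on the provider
        ReferenceConfig<RecommendService> reference = new ReferenceConfig<>();
        reference.setInterface(RecommendService.class);
        MethodConfig consumerMethod = new MethodConfig();
        consumerMethod.setName("getRecommendList");
        consumerMethod.setTimeout(600);
        reference.setMethods(Collections.singletonList(consumerMethod));
    }
}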

Starting from the source code, let's first look at the provider-side timeout handling logic:

public class TimeoutFilter implements Filter {

    public TimeoutFilter() {
    }

    public Result invoke(...) throws RpcException {
        // Execute the actual invocation and measure the elapsed time
        long start = System.currentTimeMillis();
        Result result = invoker.invoke(invocation);
        long elapsed = System.currentTimeMillis() - start;

        // Check whether the call exceeded the configured timeout
        if (invoker.getUrl() != null && elapsed > timeout) {
            // Only print a warn log
            logger.warn("invoke time out...");
        }

        return result;
    }
}


As you can see, even when the provider side times out, it only prints a warn log. The provider's timeout setting therefore does not affect the actual invocation: even if the timeout is exceeded, the whole processing logic is still executed.

Next, let's look at the consumer-side timeout handling logic:

public class FailoverClusterInvoker {

    public Result doInvoke(...) {
        ...
        // Loop for the configured number of retries
        for (int i = 0; i < retryTimes; ++i) {
            ...
            try {
                Result result = invoker.invoke(invocation);
                return result;
            } catch (RpcException e) {
                // If it is a business exception, stop retrying
                if (e.isBiz()) {
                    throw e;
                }

                le = e;
            } catch (Throwable e) {
                le = new RpcException(...);
            } finally {
                ...
            }
        }

        throw new RpcException("...");
    }
}

FailoverCluster is the default cluster fault-tolerance mode: when a call fails, it switches to another server and tries again. Looking at the doInvoke method, when a call fails it first checks whether the failure is a business exception; if so, it stops retrying, otherwise it keeps retrying until the retry limit is reached.

Following the invoker's invoke method further, you can see that after the request is sent, the result is obtained through the Future's get method. The source code is as follows:

public Object get(int timeout) {
    if (timeout <= 0) {
        timeout = 1000;
    }

    if (!isDone()) {
        long start = System.currentTimeMillis();
        this.lock.lock();

        try {
            // Loop until the result arrives or the timeout is exceeded
            while (!isDone()) {
                // Release the lock and wait
                done.await((long)timeout, TimeUnit.MILLISECONDS);

                // Check whether a result has arrived or the timeout has elapsed
                long elapsed = System.currentTimeMillis() - start;
                if (isDone() || elapsed > (long)timeout) {
                    break;
                }
            }
        } catch (InterruptedException var8) {
            throw new RuntimeException(var8);
        } finally {
            this.lock.unlock();
        }

        if (!isDone()) {
            // If no result has been returned, throw a timeout exception
            throw new TimeoutException(...);
        }
    }

    return returnFromResponse();
}


As soon as the method is entered, the clock starts ticking; if no result is returned within the configured timeout, a TimeoutException is thrown. The consumer-side timeout logic is therefore controlled by two parameters: the timeout and the number of retries. Non-business failures such as network exceptions and response timeouts are retried again and again until the retry limit is reached.

What problems are timeouts and retries meant to solve?

What problems does the RPC framework's timeout-and-retry mechanism actually solve? At the macro level of a microservice architecture, it is there to keep the service call chain stable and to provide framework-level fault tolerance. How should it be understood at the micro level? The following concrete cases illustrate it:

1. When a consumer calls a provider without a timeout, the consumer's response time is necessarily at least the provider's response time. When the provider's performance deteriorates, the consumer's performance is dragged down with it, because it has to wait indefinitely for the provider's response. If the entire call chain runs through services A, B, C, and D, then a slowdown in D alone propagates upward to C, B, and A, and can eventually cause the whole chain to time out or even collapse. Setting a timeout is therefore essential.

2. Suppose the consumer is the core product service and the provider is a non-core review service. When the review service runs into performance problems, the product service can tolerate getting no review data back, which ensures it can keep serving external traffic. In this case a timeout must be set: once the review service exceeds that threshold, the product service simply stops waiting.

3. A provider timeout is very often caused by momentary network jitter or a temporarily overloaded machine. Giving up immediately after the timeout can cause business losses in some scenarios (for example, a timeout on the inventory interface causing an order to fail). For this kind of transient jitter, retrying after the timeout can save the request, so a retry mechanism is necessary.

But once the timeout-and-retry mechanism is introduced, not everything is perfect; it also brings side effects. These are issues that must be considered when developing RPC interfaces, and they are also the ones most easily overlooked:

  1. Repeated requests: the provider may have finished executing, but network jitter makes the consumer believe the call timed out. In this case the retry mechanism produces repeated requests, which may lead to dirty data, so the provider must make the interface idempotent (a minimal dedup sketch follows this list).

  2. Reduced throughput on the consumer side: if the provider is not suffering a transient jitter but a genuine performance problem, retrying several times will not succeed; it only makes the consumer's average response time longer. For example, if the provider's average response time is normally 1 s, the consumer's timeout is 1.5 s, and the retry count is 2, a single request can then take 3 s or more, dragging down the consumer's overall throughput; if the consumer is a high-QPS service, the chain reaction can end in an avalanche.

  3. Retry storms: suppose a call chain passes through four services and the bottommost service D times out, so every upstream service starts retrying. Assuming the retry count is set to 3, B faces 3 times its normal load, C 9 times, and D 27 times, and the whole service cluster may avalanche.
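
As a minimal illustration of the first point (repeated requests), here is a sketch of deduplicating a write by a caller-supplied unique request ID. The in-memory map is purely for demonstration; in a real system the dedup state would live in shared storage such as a Redis key with SETNX semantics or a database unique constraint:

import java.util.concurrent.ConcurrentHashMap;

public class IdempotentOrderService {

    // Demo only: per-process map; production code needs shared, durable dedup state
    private final ConcurrentHashMap<String, Boolean> processed = new ConcurrentHashMap<>();

    public String createOrder(String requestId, String orderPayload) {
        // putIfAbsent returns null only for the first request carrying this requestId
        if (processed.putIfAbsent(requestId, Boolean.TRUE) != null) {
            // A timeout-triggered retry of a request that already succeeded: do not write again
            return "DUPLICATE_IGNORED";
        }
        return doCreateOrder(orderPayload);
    }

    private String doCreateOrder(String orderPayload) {
        // Placeholder for the real business logic
        return "ORDER_CREATED";
    }
}

The caller generates the requestId once (for example from the business document number) and sends the same value on every retry, so a retry caused by a timeout can never create a second order.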

How to set a reasonable timeout period?

Having understood how the RPC framework implements timeouts and what side effects they can bring, you can set them according to the following guidelines:

  • Before setting the caller's timeout, first find out the TP99 response time of the dependent service (if the dependent service's performance fluctuates a lot, look at TP95 instead); the caller's timeout can then be set roughly 50% above that value

  • If the RPC framework supports multi-granularity timeouts: the global timeout should be slightly larger than the slowest interface-level timeout, each interface-level timeout slightly larger than the slowest method of that interface, and each method-level timeout slightly larger than the method's actual execution time

  • Distinguish retryable from non-retryable services: if an interface does not implement idempotency, retries must not be enabled for it. Note that read interfaces are naturally idempotent, while write interfaces can use a business document ID, or a unique ID generated by the caller and passed to the provider, for deduplication so that dirty data is never introduced

  • If the RPC framework supports provider-side timeout settings, configure them following the same rules above, so that a consumer that forgets to set its own timeout still gets a sensible default and the hidden risk of missing configuration is reduced

  • From a business perspective, if availability requirements are not that high (for example, internal application systems), you can skip configuring automatic retries and simply retry manually when needed; this reduces the complexity of the interface implementation and makes later maintenance easier

  • The larger the retry count, the higher the service availability and the smaller the business loss, but the greater the performance risk; the value needs to be weighed holistically (generally 2 retries, at most 3)

  • If the caller is a high-QPS service, you must also plan downgrade and circuit-breaking strategies for when the provider times out (for example, once more than 10% of requests fail, stop retrying and trip the circuit breaker, switching to another service, an asynchronous MQ path, or the caller's own cached data; a simplified sketch follows this list)
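
To make the last point concrete, here is a deliberately simplified, hand-rolled sketch of an error-rate circuit breaker; the threshold and structure are illustrative assumptions, and a production system would normally use a mature component (such as Sentinel or Resilience4j) with a sliding window and a half-open state instead:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {

    private final AtomicInteger total = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean open = false;             // true = breaker tripped
    private static final int MIN_SAMPLES = 100;        // don't judge on too few requests
    private static final double FAILURE_RATE = 0.10;   // trip once >10% of requests fail

    public <T> T call(Supplier<T> rpcCall, Supplier<T> fallback) {
        if (open) {
            // Breaker is open: skip the RPC (and all its timeout/retry cost) and serve the fallback
            return fallback.get();
        }
        int seen = total.incrementAndGet();
        try {
            return rpcCall.get();
        } catch (RuntimeException timeoutOrError) {
            int failed = failures.incrementAndGet();
            if (seen >= MIN_SAMPLES && (double) failed / seen > FAILURE_RATE) {
                open = true; // stop paying the timeout-plus-retry cost from now on
            }
            return fallback.get();
        }
    }
}

The fallback here stands for whatever the caller degrades to: another service, an asynchronous MQ path, or its own cached data.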

Finally, a brief summary:

Setting RPC interface timeouts looks simple, but there is actually a lot of depth to it. It involves not only many technical issues (such as interface idempotency, service degradation and circuit breaking, and performance evaluation and optimization), but also an assessment of necessity from the business perspective. I hope this knowledge gives you a more global perspective when developing RPC interfaces.

Writing is not easy. If you found this article valuable, please share it with your friends. Thank you for your encouragement and support!

【END】
