A record of reduced interface availability caused by JSF asynchronous calls | JD Cloud Technical Team

Preface

This article records how we troubleshot a drop in interface availability caused by JSF asynchronous call timeouts. It covers the troubleshooting approach and how a JSF asynchronous call is processed, in the hope of helping readers understand the principle of JSF asynchronous calls and offering some ideas for troubleshooting. The JSF source code analyzed in this article is based on version 1.7.5-HOTFIX-T6.

Origin

Problem background

1. The advertising delivery system is a typical I/O-intensive (I/O-bound) service. A single operation on some interfaces may depend on more than a dozen external interfaces, which makes the interface slow and seriously affects the user experience. It is therefore necessary to switch these external calls to asynchronous mode, reducing overall latency through concurrency and improving the interface's response time.
2. With synchronous calls, the interface takes a long time and performs poorly. To shorten the response time, thread pools are generally used to fetch data in parallel. However, different businesses then require different thread pools, which eventually makes maintenance difficult. As the number of threads scheduled by the CPU grows, resource contention becomes more severe and precious CPU time is wasted on context switching. Threads themselves also occupy system resources, so their number cannot grow without limit.
3. Reading the JSF documentation, we found that JSF supports an asynchronous calling mode. Since the middleware already supports this, we adopted it. JSF currently supports three asynchronous calling methods: the ResponseFuture method, the CompletableFuture method, and defining the return value in the interface signature as CompletableFuture.

(1) Obtaining a ResponseFuture from RpcContext

This method requires first setting the async attribute on the Consumer side to true to enable asynchronous calls, and then calling RpcContext.getContext().getFuture() at the point where the Provider is invoked to obtain a ResponseFuture. Once you have the Future, you can call its get method to block until the result returns. This method is no longer recommended, because the second mode, CompletableFuture, is more powerful.

Code example:

asyncHelloService.sayHello("The ResponseFuture One");
ResponseFuture<Object> future1 = RpcContext.getContext().getFuture();
asyncHelloService.sayNoting("The ResponseFuture Two");
ResponseFuture<Object> future2 = RpcContext.getContext().getFuture();
try {
    future1.get();
    future2.get();
} catch (Throwable e) {
    LOGGER.error("catch " + e.getClass().getCanonicalName() + " " + e.getMessage(), e);
}

(2) Obtaining a CompletableFuture from RpcContext (supported in version 1.7.5 and above)

This method requires first setting the async attribute on the Consumer side to true to enable asynchronous calls, and then calling RpcContext.getContext().getCompletableFuture() at the point where the Provider is invoked to obtain a CompletableFuture for subsequent operations. CompletableFuture implements Future and can process results through callbacks. It supports combining and further orchestrating operations, which alleviates callback hell to some extent.

Code example:

asyncHelloService.sayHello("The CompletableFuture One");
CompletableFuture<String> cf1 = RpcContext.getContext().getCompletableFuture();
asyncHelloService.sayNoting("The CompletableFuture Two");
CompletableFuture<String> cf2 = RpcContext.getContext().getCompletableFuture();

CompletableFuture<String> cf3 = RpcContext.getContext().asyncCall(() -> {
    asyncHelloService.sayHello("The CompletableFuture Three");
});
try {
    cf1.get();
    cf2.get();
    cf3.get();
} catch (Throwable e) {
    LOGGER.error("catch " + e.getClass().getCanonicalName() + " " + e.getMessage(), e);
}
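
To illustrate the callback and composition support mentioned for CompletableFuture, here is a minimal sketch that uses only the JDK. The class ComposeSketch and the methods fetchName and fetchScore are hypothetical stand-ins for two remote calls; they are not part of the JSF API.

```java
import java.util.concurrent.CompletableFuture;

final class ComposeSketch {
    // Hypothetical stand-ins for two remote calls (not JSF API).
    static CompletableFuture<String> fetchName() {
        return CompletableFuture.supplyAsync(() -> "ad-42");
    }

    static CompletableFuture<Integer> fetchScore() {
        return CompletableFuture.supplyAsync(() -> 7);
    }

    // Starts both calls, transforms one result in a callback,
    // then merges the two results once both are available.
    static String combined() {
        return fetchName()
                .thenApply(String::toUpperCase)
                .thenCombine(fetchScore(), (name, score) -> name + ":" + score)
                .join(); // block only once, at the very end
    }

    public static void main(String[] args) {
        System.out.println(combined()); // prints AD-42:7
    }
}
```

The two calls overlap in time, and the caller blocks only once on the combined result instead of once per call.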

(3) Using an interface whose methods return CompletableFuture (supported in version 1.7.5 and above)

This mode requires code changes: the service provider must declare the method's return type as CompletableFuture in advance. With such an interface, the caller can make asynchronous calls without any configuration.

Code example:

CompletableFuture<String> cf4 = asyncHelloService.sayHelloAsync("The CompletableFuture Fore");
cf4.whenComplete((res, err) -> {
    if (err != null) {
        LOGGER.error("interface async cf4 now complete error " + err.getClass().getCanonicalName() + " " + err.getMessage(), err);
    } else {
        LOGGER.info("interface async cf4 now complete : {}", res);
    }
});
CompletableFuture<Void> cf5 = asyncHelloService.sayNotingAsync("The CompletableFuture Five");

try {
    LOGGER.info("interface async cf4 now is : {}", cf4.get());
    LOGGER.info("interface async cf5 now is : {}", cf5.get());
} catch (Throwable e) {
    LOGGER.error("catch " + e.getClass().getCanonicalName() + " " + e.getMessage(), e);
}

Comparing the three asynchronous calling modes: the third requires the provider to change the method signature to support asynchronous calls, which is difficult to carry out. To minimize changes while getting the better API, we finally chose the second method: set the async attribute to true on the Consumer side, and after initiating the call obtain a CompletableFuture object from RpcContext for subsequent operations.

Problem phenomenon

After the switch to asynchronous mode, the latency of some interfaces that depend on many external services dropped significantly. On the surface the system looked healthy, but an occasional drop in interface availability was a very dangerous signal. The following is the availability monitoring of one interface that uses asynchronous calls.





 

The monitoring shows that the availability of this interface drops occasionally. A drop in availability is usually caused by timeouts or by some hidden problem being triggered. However, the logic of this interface is very simple: it queries the database by ID, so in theory availability should not drop this much. Checking the logs, we found that a TimeoutException was thrown while blocking on the get method of the CompletableFuture returned by the asynchronous call. The timeout configured for this interface is 5s. Interface timeouts are a common problem, but when we checked the provider's logs, we found that the request took only a few milliseconds. The provider clearly returned within a few milliseconds or tens of milliseconds, so why did the consumer still time out? With this question in mind, we went on to analyze whether the problem was caused by JSF's asynchronous mode.

Locating the cause

Reading the JSF source code, we learned the basic flow of a JSF asynchronous call. Before the client sends a request to the server, it first determines whether the request is an asynchronous call. If so, it creates a JSFCompletableFuture object; this class extends CompletableFuture. A futureMap object caches the request's unique msgId together with a MsgFuture object, which holds the channel, message, timeout, compatibleFuture and other attributes of the call. When the server's response comes back, the corresponding MsgFuture object can be found by msgId for subsequent processing.

First, the doSendAsyn method creates the mapping between the msgId and the MsgFuture object, then the data is serialized, and finally the data is written to the channel over Netty's long-lived connection.
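
The idea of correlating a request with its response through a msgId can be sketched as follows. This is only an illustration under simplifying assumptions, not the JSF implementation: the class PendingCalls and its methods are hypothetical, and a plain CompletableFuture stands in for the MsgFuture object.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class PendingCalls {
    private final AtomicInteger nextMsgId = new AtomicInteger();
    // msgId -> future waiting for the matching response (plays the role of futureMap)
    private final Map<Integer, CompletableFuture<String>> futureMap = new ConcurrentHashMap<>();

    // Called before the request is written to the channel.
    int register(CompletableFuture<String> future) {
        int msgId = nextMsgId.incrementAndGet();
        futureMap.put(msgId, future);
        return msgId;
    }

    // Called when a response carrying msgId arrives.
    // Returns false if no caller was waiting for that msgId.
    boolean onResponse(int msgId, String payload) {
        CompletableFuture<String> future = futureMap.remove(msgId);
        return future != null && future.complete(payload);
    }

    int pending() {
        return futureMap.size();
    }

    public static void main(String[] args) {
        PendingCalls calls = new PendingCalls();
        CompletableFuture<String> future = new CompletableFuture<>();
        int msgId = calls.register(future);
        calls.onResponse(msgId, "hello");
        System.out.println(future.join() + " pending=" + calls.pending());
    }
}
```

Removing the entry when the response arrives keeps the map from growing and makes a duplicate response for the same msgId a no-op.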

(1) Generate JSFCompletableFuture





 

(2) Maintain the relationship between msgId and MsgFuture





 

(3) Maintain the relationship between msgId and MsgFuture





 

(4) Initiate a call





 

When the server receives the request, the channelRead method of the server-side ServerChannelHandler class is called back. This method verifies the serialization protocol, creates a JSFTask, and submits it to the JSF business thread pool for execution. Once the task has finished in the business thread pool, the write method is called to write the return value back to the client through the channel.

(1) Server receives and processes the request





 

(2) Server writes back response





 

When the client receives the response, the channelRead method of the client-side ClientChannelHandler class is triggered. This method uses the msgId returned by the server to find the MsgFuture object cached on the client, and then checks whether the object's compatibleFuture attribute is non-null. If it is, a task is submitted to the Callback thread pool. The main job of this task is to call the complete or completeExceptionally method of the CompletableFuture, which triggers execution of the CompletableFuture's next stages.
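
The hand-off described here, where the I/O thread only submits a task and a callback thread completes the future, might look roughly like the sketch below. The class CallbackDispatch, the onResponse signature and the two-thread pool are assumptions made for illustration; the real JSF handler is not reproduced here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CallbackDispatch {
    // Hypothetical response shape: either a result or an error.
    static void onResponse(ExecutorService callbackPool, CompletableFuture<String> future,
                           String result, Throwable error) {
        // The receiving thread only submits a task; completion runs on a callback thread.
        callbackPool.execute(() -> {
            if (error != null) {
                future.completeExceptionally(error);
            } else {
                future.complete(result);
            }
        });
    }

    static String demo() {
        ExecutorService callbackPool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> ok = new CompletableFuture<>();
            CompletableFuture<String> failed = new CompletableFuture<>();
            onResponse(callbackPool, ok, "pong", null);
            onResponse(callbackPool, failed, null, new IllegalStateException("boom"));
            // handle() sees either the value or the exception of the completed future.
            String second = failed
                    .handle((value, ex) -> ex == null ? value : "error:" + ex.getMessage())
                    .join();
            return ok.join() + "," + second;
        } finally {
            callbackPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints pong,error:boom
    }
}
```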

(1) The client receives the response





 

(2) Find the local MsgFuture





 

(3) Add MsgFuture to the thread pool





 

(4) Trigger the complete or completeExceptionally method of CompletableFuture





 

Although the source code analysis showed us the whole flow of a JSF asynchronous call, it still did not explain the occasional timeouts that should not have happened (that is, the server clearly did not time out, yet the client reported a timeout). By ruling out the steps one by one, we finally suspected the step where, after the JSF asynchronous callback, a task is added to the Callback thread pool to execute the complete method of the CompletableFuture. That method goes on to execute the subsequent stages of the CompletableFuture, and after our business code obtains the CompletableFuture returned from RpcContext, it generally uses the single-dependency method thenApply to do some follow-up processing. The complete method is what triggers these subsequent stages.

Business code that makes the asynchronous call:



 

Here is some background on CompletableFuture. Each CompletableFuture can be regarded as an observed subject. It has a member variable stack, a linked list of type Completion, which stores all the observers registered on it. When the observed CompletableFuture completes, the stack is popped and the registered observers are notified in turn. This is the point at which the thenApply callback in our program is invoked. The following figure shows the key attributes inside CompletableFuture.



Figure 12 thenApply diagram
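
A small JDK-only sketch makes the consequence of this observer mechanism concrete: a thenApply stage registered before completion runs on whichever thread calls complete. The class name and the thread name callback-1 are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ObserverThreadSketch {
    // Returns the name of the thread that executed the thenApply callback.
    static String callbackThread() {
        ExecutorService callbackPool =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "callback-1"));
        try {
            CompletableFuture<String> source = new CompletableFuture<>();
            // Registered before completion, so the stage is pushed onto the source's stack.
            CompletableFuture<String> observer =
                    source.thenApply(value -> Thread.currentThread().getName());
            // Completing the source pops the stack and runs the stage on this pool thread.
            callbackPool.execute(() -> source.complete("response"));
            return observer.join();
        } finally {
            callbackPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(callbackThread()); // prints callback-1
    }
}
```

Any work done inside such a thenApply callback therefore occupies a thread of the pool that completed the future.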



 

If the asynchronous calling process above is not clear, you can look at the calling relationship diagram below.





 



Looking at the default configuration of the Callback thread pool, we found that it has 20 core threads, a queue length of 256, and a maximum of 200 threads. We therefore guessed that the core threads might not be enough, so that some callback tasks were backlogged in the queue and not executed in time, resulting in timeouts. Since there was no other way at the time to observe the running state of the Callback thread pool, we verified the guess by modifying the business code to record the pool's current state whenever a timeout exception occurred.
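
Why a backlog can build up even though the maximum is far above the core size follows from how the JDK's ThreadPoolExecutor behaves: threads beyond the core size are created only once the queue is full. The sketch below is a scaled-down illustration (core 2, maximum 8, queue capacity 4) and assumes the Callback pool is a standard ThreadPoolExecutor with a bounded queue, which the article does not confirm. It also shows the kind of status values that can be read from a pool.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolBacklogSketch {
    // Submits the given number of slow tasks and reports the pool state.
    static String snapshotAfter(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 8, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(4));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // simulate a slow callback
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return "poolSize=" + pool.getPoolSize() + " queued=" + pool.getQueue().size();
        } finally {
            release.countDown(); // let the slow tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(snapshotAfter(6)); // prints poolSize=2 queued=4
        System.out.println(snapshotAfter(7)); // prints poolSize=3 queued=4
    }
}
```

With six slow tasks the pool stays at its two core threads and four tasks wait in the queue; a third thread appears only with the seventh task, once the queue is full. Under the same assumption, a pool with 20 core threads and a queue of 256 would keep queuing new tasks while its core threads are busy, long before it grows toward 200 threads.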

(1) Code that reads the thread pool status





 

After the modified code went live and had run for a while, interface availability dropped again, and we checked the logs. They show that when the timeout exception occurred, all core threads of the JSF Callback thread pool were busy and 71 tasks were backlogged in the queue. From this log we can conclude that the timeouts occurred because tasks were queuing while all core threads of the JSF Callback thread pool were occupied.





 

Problem analysis

1. The log above tells us the problem was caused by the asynchronous thread pool being full. In theory, normal requests should still be processed quickly even with some queuing. However, checking the business code, we found that some of our business logic performed time-consuming operations inside thenApply, and some called another asynchronous method inside thenApply.

2. The first case keeps threads of the pool occupied so that other tasks have to queue. This is actually acceptable. The second case, however, may cause a deadlock through a circular dependency inside the thread pool. The reason is that the parent task's asynchronous callback runs in the thread pool, and the asynchronous callbacks of its subtasks run in the same pool. The Callback thread pool has 20 core threads. When 20 requests arrive at the same time, the core threads are all occupied and the subtasks enter the blocking queue when they request a thread. But the parent task can complete only after its subtask completes, and the subtask cannot get a thread, so the parent task cannot complete. The main thread calling get then blocks and, unless a timeout is set, never recovers.
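
The second case can be reproduced in miniature with plain JDK classes. The sketch below is an illustration under assumptions, not the business code from the article: a fixed-size pool stands in for the core threads of the Callback pool, a latch forces the parent callbacks to overlap like a burst of simultaneous responses, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class CallbackStarvationSketch {
    // Returns true if the caller timed out because the callbacks starved each other.
    static boolean timesOut(int poolSize, int requests) {
        ExecutorService callbackPool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allCallbacksStarted = new CountDownLatch(requests);
        try {
            List<CompletableFuture<String>> parents = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                CompletableFuture<String> parentResponse = new CompletableFuture<>();
                parents.add(parentResponse.thenApply(value -> {
                    try {
                        allCallbacksStarted.countDown();
                        allCallbacksStarted.await(); // all parent callbacks run at once
                        // Nested async call: its response is completed on the same pool.
                        CompletableFuture<String> child = new CompletableFuture<>();
                        callbackPool.execute(() -> child.complete(value + "+child"));
                        return child.get(); // holds a callback thread while waiting
                    } catch (Exception e) {
                        throw new IllegalStateException(e);
                    }
                }));
                // The parent's response arrives and is completed on the callback pool.
                callbackPool.execute(() -> parentResponse.complete("parent"));
            }
            try {
                CompletableFuture.allOf(parents.toArray(new CompletableFuture[0]))
                        .get(500, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        } finally {
            callbackPool.shutdownNow(); // interrupt stuck callbacks so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println(timesOut(2, 2)); // prints true: every thread waits on a queued child
        System.out.println(timesOut(4, 2)); // prints false: spare threads complete the children
    }
}
```

With two threads and two overlapping requests, both threads wait for children that sit in the queue behind them, so the caller times out even though every "remote" response arrived immediately.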

Solution

Short-term solution: since queuing occurred because the core threads of the thread pool were all occupied, we raised the number of core threads of the JSF Callback thread pool from 20 to 200.

Long-term solution: optimize the code so that time-consuming operations in thenApply are not executed in the Callback thread pool, and remove the logic that starts another asynchronous call inside the thenApply method.
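
One possible shape for the long-term fix, sketched with JDK classes only: chain the nested call with thenCompose instead of blocking on it inside the callback, and move slow processing to a separate executor with thenApplyAsync. This is an assumption about how such a refactoring could look, not the actual change made in the project; businessPool and the other names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class NonBlockingCallbackSketch {
    static List<String> run(int callbackThreads, int requests) {
        ExecutorService callbackPool = Executors.newFixedThreadPool(callbackThreads);
        ExecutorService businessPool = Executors.newFixedThreadPool(2); // hypothetical pool for slow work
        try {
            List<CompletableFuture<String>> results = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                CompletableFuture<String> parentResponse = new CompletableFuture<>();
                results.add(parentResponse
                        // Return the nested future instead of waiting for it in the callback.
                        .thenCompose(value -> {
                            CompletableFuture<String> child = new CompletableFuture<>();
                            callbackPool.execute(() -> child.complete(value + "+child"));
                            return child;
                        })
                        // Run time-consuming processing off the callback threads.
                        .thenApplyAsync(String::toUpperCase, businessPool));
                callbackPool.execute(() -> parentResponse.complete("parent"));
            }
            List<String> out = new ArrayList<>();
            for (CompletableFuture<String> result : results) {
                out.add(result.get(5, TimeUnit.SECONDS));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            callbackPool.shutdown();
            businessPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1, 3)); // prints [PARENT+CHILD, PARENT+CHILD, PARENT+CHILD]
    }
}
```

Because no callback ever waits on another callback, even a single callback thread is enough to finish all three requests.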

Comparison before and after adjustment:





 

The monitoring shows that interface availability has stayed at 100% since the optimization.

Author: JD Retail Song Weifei

Source: JD Cloud Developer Community. Please indicate the source when reprinting.


Origin my.oschina.net/u/4090830/blog/10456180