Author | Wang Ruixian, R&D engineer in the Infrastructure Department at Zhangmen Education
Source | Alibaba Cloud Native Official Account

Background

We received feedback from developers in the business department: after their application was upgraded to the new version of the company's internal framework, the interface performance stress test in the UAT (pre-production) environment no longer met the standard.
Stress test report before the upgrade:

Stress test report after the upgrade:

With the same machine configuration (1C4G), throughput dropped from 53.9/s to 6.4/s, and CPU load was higher.

The developers also reported that, according to the trace data queried from SkyWalking, the company's full-link monitoring system, most requests spent an abnormally long time in the Feign call (390 ms), even though the downstream service actually responded very quickly (3 ms).

Locating the problem

After receiving the feedback, I immediately requested access to the corresponding machine and uploaded Arthas (version 3.4.3) to it.

I asked the business team to keep the stress test running and started locating the problem.

1. Execute the profiler command to analyze CPU performance

[arthas@17962]$ profiler start -d 30 -f /tmp/arthas/1.txt

After waiting 30 s, open 1.txt to view the CPU profiling results. The beginning of the file looks like this:

--- 1630160766 ns (4.24%), 141 samples
  ......
  [14] org.springframework.boot.loader.LaunchedURLClassLoader.definePackageIfNecessary
  [15] org.springframework.boot.loader.LaunchedURLClassLoader.loadClass
  [16] java.lang.ClassLoader.loadClass
  [17] java.lang.Class.forName0
  [18] java.lang.Class.forName
  [19] org.springframework.util.ClassUtils.forName
  [20] org.springframework.http.converter.json.Jackson2ObjectMapperBuilder.registerWellKnownModulesIfAvailable
  [21] org.springframework.http.converter.json.Jackson2ObjectMapperBuilder.configure
  [22] org.springframework.http.converter.json.Jackson2ObjectMapperBuilder.build
  [23] org.springframework.web.servlet.config.annotation.WebMvcConfigurationSupport.addDefaultHttpMessageConverters
  [24] org.springframework.web.servlet.config.annotation.WebMvcConfigurationSupport.getMessageConverters
  [25] org.springframework.boot.autoconfigure.http.HttpMessageConverters$1.defaultMessageConverters
  [26] org.springframework.boot.autoconfigure.http.HttpMessageConverters.getDefaultConverters
  [27] org.springframework.boot.autoconfigure.http.HttpMessageConverters.<init>
  [28] org.springframework.boot.autoconfigure.http.HttpMessageConverters.<init>
  [29] org.springframework.boot.autoconfigure.http.HttpMessageConverters.<init>
  [30] com.zhangmen.xxx.DefaultFeignConfig.lambda$feignDecoder$0
  [31] com.zhangmen.xxx.DefaultFeignConfig$$Lambda$704.256909008.getObject
  [32] org.springframework.cloud.openfeign.support.SpringDecoder.decode
  [33] org.springframework.cloud.openfeign.support.ResponseEntityDecoder.decode
 ......

2. Execute the trace command on the suspicious method and output the time spent at each node on the method path

The CPU profiling results from the previous step show that the stack consuming the most CPU does contain Feign-related frames.

Frames from com.zhangmen also appear right next to the Feign-related frames: com.zhangmen.xxx.DefaultFeignConfig$$Lambda$704.256909008.getObject and com.zhangmen.xxx.DefaultFeignConfig.lambda$feignDecoder$0.

Searching 1.txt for com.zhangmen.xxx.DefaultFeignConfig returned 340 hits, so this class looked very suspicious.

Execute the trace command to output the time spent at each node on the method path:

[arthas@17962]$ trace com.zhangmen.xxx.DefaultFeignConfig * '#cost>200' -n 3
`---[603.999323ms] com.zhangmen.xxx.DefaultFeignConfig:lambda$feignEncoder$1()
    `---[603.856565ms] org.springframework.boot.autoconfigure.http.HttpMessageConverters:<init>() #42

org.springframework.boot.autoconfigure.http.HttpMessageConverters:<init>() turned out to be time-consuming, so I continued tracing it layer by layer:

[arthas@17962]$ trace org.springframework.boot.autoconfigure.http.HttpMessageConverters <init> '#cost>200' -n 3
......
[arthas@17962]$ trace org.springframework.http.converter.json.Jackson2ObjectMapperBuilder registerWellKnownModulesIfAvailable '#cost>200' -n 3

Finally, I found that org.springframework.util.ClassUtils:forName() is time-consuming and throws an exception.

Use the watch command to view the specific exception:

[arthas@17962]$ watch org.springframework.util.ClassUtils forName -e "throwExp" -n

Solving the problem

I reported the identified problem to the relevant business developers and suggested adding the jackson-datatype-joda dependency.

Stress test report after adding the dependency:

Throughput rose from 6.4/s to 69.3/s, which is higher than the 53.9/s measured before the framework upgrade.

At this point the business developers pointed out that the problem is triggered by the custom Feign codec in their code (shown in the figure below), and that this codec had already been in place before the framework upgrade.

Therefore, I stress-tested the code from before the framework upgrade and executed the following command with Arthas during the test:

The same exception occurred there as well. After adding the jackson-datatype-joda dependency and running the stress test again, the report was as follows:

Summary of the stress test results so far:

This raises a new question: why is the throughput gap between the old and new versions nearly 8x when neither includes the dependency, but only about twofold when both include it?

Locating the problem further

To investigate this new question, I stress-tested two versions: the one without the framework upgrade but with the dependency added, and the one with both the framework upgrade and the dependency added. During each test I sampled CPU profiling data with Arthas's profiler command, obtaining sample 1 and sample 2 respectively, and then picked similar stacks from the two samples for comparison:

The comparison shows that the first 17 lines of the two similar stacks differ. Trace the suspicious stack frame from sample 2:

[arthas@10561]$ trace org.apache.catalina.loader.WebappClassLoaderBase$CombinedEnumeration * '#cost>100' -n 3
`---[171.744137ms] org.apache.catalina.loader.WebappClassLoaderBase$CombinedEnumeration:hasMoreElements()
    `---[171.736943ms] org.apache.catalina.loader.WebappClassLoaderBase$CombinedEnumeration:inc() #2685
        `---[171.724546ms] org.apache.catalina.loader.WebappClassLoaderBase$CombinedEnumeration:inc()

This shows that, after the framework upgrade, the class loader has become time-consuming.

Tracing the same part in sample 1, however, showed no call taking more than 100 ms.

Next, I used the profiler command to generate flame graphs for both versions under stress testing and compared similar stacks:

[arthas@10561]$ profiler start -d 30 -f /tmp/arthas/1.svg

The version with the upgraded framework and the added dependency additionally contains some org/springframework/boot/loader/ stacks.

Solving the problem further

I reported the new findings to the relevant business developers.

They said that, besides the framework upgrade, the Spring Boot deployment had also been changed from war to jar: from deploying a war on a standalone Tomcat to running java -jar with Spring Boot's embedded Tomcat. I therefore suspected that the class loader performs differently under the two deployment methods.
While I was working on the previous step, the business developers, going by the problem I had initially located, searched Google for feign com.fasterxml.jackson.datatype.joda.JodaModule and found a related article, "LoadClass Causes Online Service Stall Analysis".

The author of that article had encountered a problem similar to ours.

After reading this article, I debugged part of the source code and finally learned the root cause of the problem: SpringEncoder / SpringDecoder call ObjectFactory<HttpMessageConverters>.getObject().getConverters() on every encode/decode to obtain the HttpMessageConverter list, and the ObjectFactory<HttpMessageConverters> configured in our custom DefaultFeignConfig creates a new HttpMessageConverters object every time.

By default, the HttpMessageConverters constructor executes the getDefaultConverters method to obtain the default HttpMessageConverter collection and initializes each of these default converters. Among them, MappingJackson2HttpMessageConverter (there are two, see the figure below) tries to load com.fasterxml.jackson.datatype.joda.JodaModule and then com.fasterxml.jackson.datatype.joda$JodaModule (when org.springframework.util.ClassUtils fails to load a class, it retries with the inner-class name). Neither is on the classpath, so a ClassNotFoundException is thrown, and the exception is eventually swallowed.

In addition, some XML-related default converters, SourceHttpMessageConverter and Jaxb2RootElementHttpMessageConverter (two of each, see the figure below), execute TransformerFactory.newInstance() every time they are initialized. That call uses SPI to scan the META-INF/services directories on the classpath for a concrete implementation. Every scan fails to find one, so the default com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl is used in the end.
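
To get a feel for the cost of that lookup, the following minimal Java sketch calls TransformerFactory.newInstance() repeatedly and reports the average time per call. The class and method names are illustrative assumptions, not code from the project, and the measured time depends heavily on the class loader in use and on the size of the classpath.

```java
import javax.xml.transform.TransformerFactory;

final class TransformerFactoryLookupDemo {
    /** Average nanoseconds per TransformerFactory.newInstance() call (calls must be > 0). */
    static long averageNanosPerLookup(int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            // Every call runs the provider lookup again and builds a new factory.
            TransformerFactory.newInstance();
        }
        return (System.nanoTime() - start) / calls;
    }

    /** Checks that two consecutive calls hand back two separate factory objects. */
    static boolean returnsDistinctInstances() {
        return TransformerFactory.newInstance() != TransformerFactory.newInstance();
    }

    public static void main(String[] args) {
        System.out.println("implementation: "
                + TransformerFactory.newInstance().getClass().getName());
        System.out.println("avg ns per newInstance(): " + averageNanosPerLookup(1_000));
    }
}
```

Running the same class once from an unpacked directory and once from inside a fat jar is one way to compare the two class loaders on this particular lookup.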

As a result, every Feign call (encoding plus decoding) makes 4 attempts to load com.fasterxml.jackson.datatype.joda.JodaModule and com.fasterxml.jackson.datatype.joda$JodaModule, neither of which is on the classpath (8 load attempts in total), and uses SPI 8 times to scan the META-INF/services directories on the classpath for implementations that cannot be found. After the switch from war to jar, the class loader became slower at this kind of frequent resource lookup and loading, which ended up seriously degrading interface performance.
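
The lookup pattern for the missing class can be modelled in plain Java. The sketch below is only an illustration under stated assumptions: CountingClassLoader, isPresent and the class name com.example.missing.JodaModule are hypothetical, and isPresent merely imitates the "try the name, then retry as an inner class" behaviour described above rather than reproducing Spring's code. It shows that failed lookups are not remembered, so each initialization pays for both attempts again.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Class loader that counts how many lookups it is asked to perform. */
final class CountingClassLoader extends ClassLoader {
    final AtomicInteger lookups = new AtomicInteger();

    CountingClassLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        lookups.incrementAndGet();
        return super.loadClass(name);
    }
}

final class MissingClassLookupDemo {
    /** Hypothetical helper: tries "a.b.C" and, if that fails, "a.b$C". */
    static boolean isPresent(String name, ClassLoader loader) {
        try {
            loader.loadClass(name);
            return true;
        } catch (ClassNotFoundException first) {
            int lastDot = name.lastIndexOf('.');
            if (lastDot < 0) {
                return false;
            }
            String innerName = name.substring(0, lastDot) + '$' + name.substring(lastDot + 1);
            try {
                loader.loadClass(innerName);
                return true;
            } catch (ClassNotFoundException second) {
                return false; // swallowed, as in the behaviour described above
            }
        }
    }

    /** Number of loader lookups caused by the given number of converter initializations. */
    static int lookupsFor(int initializations) {
        CountingClassLoader loader =
                new CountingClassLoader(MissingClassLookupDemo.class.getClassLoader());
        for (int i = 0; i < initializations; i++) {
            isPresent("com.example.missing.JodaModule", loader);
        }
        return loader.lookups.get();
    }

    public static void main(String[] args) {
        System.out.println("lookups for 4 initializations: " + lookupsFor(4));
    }
}
```

With 4 initializations per Feign call, the counter ends at 8, which matches the total given above.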

The default HttpMessageConverter collection:

Some key code is shown below.

org/springframework/boot/autoconfigure/http/HttpMessageConverters.<init>:

org/springframework/http/converter/json/Jackson2ObjectMapperBuilder.registerWellKnownModulesIfAvailable:

org/springframework/util/ClassUtils.forName:

org/springframework/http/converter/xml/SourceHttpMessageConverter:

javax/xml/transform/FactoryFinder.find:

The article also provides two solutions to this problem:

The first is the one I originally suggested: add the jackson-datatype-joda dependency, so that the ClassLoader no longer repeatedly tries to load com.fasterxml.jackson.datatype.joda.JodaModule and com.fasterxml.jackson.datatype.joda$JodaModule, which are not on the classpath, every time the default MappingJackson2HttpMessageConverter is initialized.

The second is to skip initializing the default HttpMessageConverter instances. Since we only need the custom FastJsonHttpMessageConverter for encoding and decoding here, there is no need to execute getDefaultConverters and repeatedly initialize many default converters that are never used. So, when creating the HttpMessageConverters object, set the addDefaultConverters parameter to false:

ObjectFactory<HttpMessageConverters> objectFactory = () -> new HttpMessageConverters(false, new HttpMessageConverter[] { (HttpMessageConverter)fastJsonHttpMessageConverter });

In fact, we can go one step further and change the ObjectFactory<HttpMessageConverters> implementation in DefaultFeignConfig so that it does not create (and re-initialize) a new HttpMessageConverters object on every call.
I therefore recommended that the business developers change DefaultFeignConfig to the following code:
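
As a dependency-free sketch of the idea only, and not the project's actual DefaultFeignConfig, the difference between a factory that builds a new object on every call and one that hands out a single shared instance can be modelled with java.util.function.Supplier. Here Supplier stands in for ObjectFactory<HttpMessageConverters> and a plain Object stands in for HttpMessageConverters; all names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ConverterFactoryModel {
    /**
     * Simulates {@code calls} encode/decode operations against a factory and
     * returns how many "converter" objects had to be constructed in total.
     */
    static int constructionsAfter(boolean shareInstance, int calls) {
        AtomicInteger constructed = new AtomicInteger();

        // Problematic pattern: every get() builds (and initializes) a new object.
        Supplier<Object> perCall = () -> {
            constructed.incrementAndGet(); // imagine class loading and SPI scans here
            return new Object();
        };

        Supplier<Object> factory;
        if (shareInstance) {
            // Improved pattern: build once, then always return the same reference.
            Object single = perCall.get();
            factory = () -> single;
        } else {
            factory = perCall;
        }

        for (int i = 0; i < calls; i++) {
            factory.get();
        }
        return constructed.get();
    }

    public static void main(String[] args) {
        System.out.println("per-call factory: " + constructionsAfter(false, 100) + " constructions");
        System.out.println("shared factory:   " + constructionsAfter(true, 100) + " constructions");
    }
}
```

For 100 simulated calls the per-call factory constructs 100 objects and the shared one constructs a single object. In the real configuration this corresponds to creating the HttpMessageConverters object once and letting the ObjectFactory lambda return that same object.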

After the business developers applied the improved DefaultFeignConfig to both the old and new versions of the code and deployed them to the FAT (test) environment, I used JMeter on my own machine to stress-test the FAT environment.

Stress test results for the improved old version:

Stress test results for the improved new version:

At this point the interface performance of the two versions is basically the same.

In the UAT environment, the testers then stress-tested the code with both the upgraded framework and the improved DefaultFeignConfig. The results are as follows:

Throughput rose from the initial, substandard 6.4/s to 160.4/s.
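
For reference, the throughput figures quoted in this article can be put side by side with a few lines of Java. The helper names are hypothetical; the numbers 53.9, 6.4, 69.3 and 160.4 come from the stress test results above.

```java
final class ThroughputComparison {
    /** How many times higher "after" is than "before". */
    static double ratio(double after, double before) {
        return after / before;
    }

    /** Relative drop from "before" to "after", in percent. */
    static double dropPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        double beforeUpgrade = 53.9;  // requests/s before the framework upgrade
        double afterUpgrade = 6.4;    // requests/s after the framework upgrade
        double withDependency = 69.3; // after adding jackson-datatype-joda
        double withFixedConfig = 160.4; // after improving DefaultFeignConfig

        System.out.println("drop after upgrade (%): " + dropPercent(beforeUpgrade, afterUpgrade));
        System.out.println("gain from dependency (x): " + ratio(withDependency, afterUpgrade));
        System.out.println("gain from fixed config (x): " + ratio(withFixedConfig, afterUpgrade));
    }
}
```

The drop after the upgrade works out to roughly 88%, adding the dependency gives roughly an 11-fold gain over the degraded state, and the improved configuration gives roughly a 25-fold gain.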

So why does switching from war to jar deployment make the class loader slower when it frequently looks up and loads resources?

After studying how Spring Boot executable jars work, my initial suspicion was that it is related to the following: to be able to start from a fat jar, Spring Boot extends the JDK's JarFile URL protocol and provides its own ClassLoader and its own Handler for the jar file protocol, implementing the jar-in-jar and jar-in-directory loading methods.

Readers interested in how Spring Boot executable jars work can refer to: "Executable jar package".

Investigating the root cause of the war-to-jar class loader performance degradation

To verify my guess, I built a simple demo on my own machine.

The demo contains two services, A and B. Both register with the Eureka registry, and A calls B through Feign.

Next, I used JMeter to stress-test various scenarios under the same configuration, and used Arthas's profiler command to generate flame graphs for each scenario during the tests.

The stress test results are as follows (-Xms512m -Xmx512m):

Comparing Table 3 and Table 4 shows that, after the code optimization, adding or omitting the dependency has almost no effect on throughput.

Tables 3 and 4 also show that, after the code optimization, the three deployment methods have essentially the same throughput when nonexistent resources are no longer frequently looked up and loaded.

Table 2 shows that Tomcat war deployment performs better when SPI is frequently used to look up implementations that cannot be found on the classpath.

Table 1 shows that, when nonexistent classes are frequently loaded, starting through JarLauncher after unpacking the jar performs better.

Compare the flame graphs of similar stacks for ③ and ② in Table 1:

The two differ in how org/springframework/boot/loader/LaunchedURLClassLoader.loadClass loads classes.

② executes not only java/lang/ClassLoader.loadClass but also org/springframework/boot/loader/LaunchedURLClassLoader.definePackageIfNecessary.

View the source code of org/springframework/boot/loader/LaunchedURLClassLoader.loadClass:

It contains a conditional branch.

View the source code of org/springframework/boot/loader/Launcher.createArchive:

The value of this condition depends on whether the application is started from an executable jar file or from a file directory.

Stress-test ② again and trace org/springframework/boot/loader/LaunchedURLClassLoader.definePackageIfNecessary:

`---[165.770773ms] org.springframework.boot.loader.LaunchedURLClassLoader:definePackageIfNecessary()
    +---[0.00347ms] org.springframework.boot.loader.LaunchedURLClassLoader:getPackage() #197
    `---[165.761244ms] org.springframework.boot.loader.LaunchedURLClassLoader:definePackage() #199

This call is indeed time-consuming.

According to the comments in this part of the source code, definePackageIfNecessary tries to define the package of a class, based on the class name, before findClass is called, so that the manifest of a jar nested inside the fat jar can be associated with the package.

Debugging definePackageIfNecessary shows that definePackage traverses every jar under BOOT-INF/lib/ as well as BOOT-INF/classes/. If the specified class is found in one of these resources, the definePackage method continues to be called; otherwise null is returned once the traversal completes.

As mentioned earlier, every Feign call makes 4 attempts to load com.fasterxml.jackson.datatype.joda.JodaModule and com.fasterxml.jackson.datatype.joda$JodaModule, which are not on the classpath (8 load attempts in total). My simple demo application depends on 117 jars (real enterprise projects have more), so every Feign call executes this loop 8 * (117 + 1) = 944 times. Inside the loop, the org.springframework.boot.loader.jar.Handler.openConnection method involves relatively expensive IO operations, which ultimately has a serious impact on interface performance. This processing logic is also visible in the generated flame graph.
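
The arithmetic above can be written out as a small sketch that also shows how the cost grows with the number of dependencies. The class and method names are hypothetical; the figures 8 and 117 come from the text, and the 300-jar case is only an illustrative assumption.

```java
final class DefinePackageCost {
    /** Archives visited per failed lookup: every jar in BOOT-INF/lib/ plus BOOT-INF/classes/. */
    static int archivesPerLookup(int nestedJars) {
        return nestedJars + 1;
    }

    /** Loop iterations caused by one Feign call that triggers the given number of failed lookups. */
    static int iterationsPerFeignCall(int failedLookups, int nestedJars) {
        return failedLookups * archivesPerLookup(nestedJars);
    }

    public static void main(String[] args) {
        System.out.println("demo (117 jars): " + iterationsPerFeignCall(8, 117));
        // Hypothetical larger project, to show that the cost scales linearly with the jar count.
        System.out.println("300 jars:        " + iterationsPerFeignCall(8, 300));
    }
}
```

For the demo this gives 944 iterations per Feign call, and a hypothetical project with 300 jars would reach 2408.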

This confirms that the switch from war to jar deployment is what makes the class loader slower when it frequently looks up and loads resources. The root cause: to be able to start from a fat jar, Spring Boot adds some custom processing logic, and this logic has a considerable impact on program performance when it is executed frequently.

Two questions remain open: why does starting through JarLauncher after unpacking the jar perform better than Tomcat war deployment when nonexistent classes are frequently loaded, and why does Tomcat war deployment perform better than starting through JarLauncher after unpacking the jar when SPI is frequently used to look up implementations that cannot be found on the classpath? Due to space limitations I will not go into them here. Interested readers can explore them further using the method introduced in this article together with the relevant source code.

Summary

When customizing Feign's codec with SpringEncoder / SpringDecoder, avoid initializing HttpMessageConverters repeatedly. If you do not need the default HttpMessageConverter instances, pass false as the first constructor argument when creating HttpMessageConverters so that the defaults are not initialized.

In addition, you should understand that different deployment methods have performance differences when the class loader frequently finds and loads resources.

When we write code, we should also avoid repeated initialization, and repeated search and loading of non-existent resources.

Finally, making good use of SkyWalking and Arthas can help us troubleshoot program errors and performance bottlenecks more efficiently.

Easter eggs

If the application uses the SkyWalking Agent and then uses Arthas, some Arthas commands (trace, watch and other commands that enhance the class) may not work properly.

Solution: https://github.com/apache/skywalking/blob/master/docs/en/FAQ/Compatible-with-other-javaagent-bytecode-processing.md

Once Arthas works normally, when running commands such as trace against methods of a class that the SkyWalking Agent has enhanced, it is best to append * to the method name for fuzzy matching. Arthas then aggregates and displays the trace results of all matching methods.

Trace without * after the method name:

Trace with * after the method name:

You can see that after adding * to the method name, the result obtained by trace is our ideal result.

This is because the SkyWalking Agent uses ByteBuddy for bytecode enhancement. Each time ByteBuddy enhances a method, it generates an auxiliary inner class for it (HelloController$auxiliary$jiu2bTqU), renames the original method (test1) in the current class (HelloController) to test1$original$lyu0XDob, and generates both a method with the same name as the original (test1) and a differently named method that is called only by the auxiliary inner class (test1$original$lyu0XDob$accessor$8F82ImAF).
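
To illustrate why the wildcard helps, here is a minimal sketch of * matching against the three method names mentioned above. It only models wildcard matching in plain Java and is not Arthas's actual matcher; the class and helper names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WildcardMethodMatch {
    /** Turns an expression such as "test1*" into a regex; only '*' is treated as special. */
    static Pattern toPattern(String expression) {
        String regex = Arrays.stream(expression.split("\\*", -1))
                .map(Pattern::quote)
                .collect(Collectors.joining(".*"));
        return Pattern.compile(regex);
    }

    /** Returns the method names that match the whole expression. */
    static List<String> matching(String expression, List<String> methodNames) {
        Pattern pattern = toPattern(expression);
        return methodNames.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    /** Method names present in HelloController after enhancement, as listed in the text. */
    static List<String> enhancedMethods() {
        return Arrays.asList(
                "test1",
                "test1$original$lyu0XDob",
                "test1$original$lyu0XDob$accessor$8F82ImAF");
    }

    public static void main(String[] args) {
        System.out.println("test1  -> " + matching("test1", enhancedMethods()));
        System.out.println("test1* -> " + matching("test1*", enhancedMethods()));
    }
}
```

The exact name test1 selects only the generated wrapper method, while test1* also selects the renamed original method, which is where the real business logic now lives.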

Use the Java decompiler tool developed by colleagues to visually see the relevant code:

In addition, when using Arthas it is best to use the latest version. For example, in versions before 3.4.2 the trace command may cause a JVM Metaspace OOM when tracing large methods. For details, see: "Remember a Metaspace OOM Problem Caused by Arthas".

If you want to build an enterprise-level online diagnostic platform based on Arthas, you can refer to "Exploration and Practice of ICBC to Build an Online Diagnostic Platform" .

About the Author

Wang Ruixian, an open source enthusiast, is currently an R&D engineer in the Infrastructure Department at Zhangmen Education, mainly responsible for the R&D of the company's full-link monitoring system and application diagnostic platform.

Spring Boot microservice performance drops 90%! Use Arthas to locate root causes