Performance tuning: a small log with big pitfalls

Introduction

"Only those who have been severely beaten by online service problems understand how important logs are!"
Let me start with the conclusion, who is in favor and who is against? If you feel the same, then congratulations you are a social person :)

The importance of logging to a program is self-evident. It is lightweight, simple, and requires no extra thought; it shows up everywhere in program code and helps us troubleshoot and locate problems. Yet this seemingly harmless tool hides all kinds of "pits": used improperly, it not only fails to help but becomes a service "killer".
This article focuses on the "pits" caused by improper logging in production, especially in high-concurrency systems, and the guidelines for avoiding them. It also provides a set of practices that let the program and its logs "coexist in harmony".

A guide to avoiding the pits

In this section I will walk through logging problems encountered in production and analyze their root causes one by one.

Non-standard ways of writing log statements

Scenario

// Format 1
log.debug("get user" + uid + " from DB is Empty!");

// Format 2
if (log.isDebugEnabled()) {
    log.debug("get user" + uid + " from DB is Empty!");
}

// Format 3
log.debug("get user {} from DB is Empty!", uid);

I believe everyone has seen all three of these styles in project code at one time or another, so what is the actual difference between them, and what impact do they have on performance?
If the DEBUG level is turned off, the difference shows up immediately: format 1 still performs the string concatenation even though nothing is output, which is pure waste.

The drawback of format 2 is the extra guard logic it needs, which adds boilerplate code and is not elegant at all.
Format 3 is therefore recommended: the message is only assembled when the statement actually executes, so once the corresponding log level is turned off there is no performance loss.

Printing large volumes of logs in production eats performance

Plentiful logs let you string a user request together and make it easier to pinpoint the problematic code. With today's distributed systems and complex business logic, any missing log is a serious obstacle for a programmer trying to locate an issue, so programmers who have suffered through production problems log as much as they can during development.
To be able to locate and fix future problems quickly, programmers try to record as many key logs as possible while implementing the code. After going live, problems can indeed be located quickly, but a new challenge follows: as the business grows rapidly, user traffic keeps climbing and system pressure keeps rising. Now that flood of INFO logs, especially during peak hours, hammers the disk with writes and badly erodes service performance.
It turns into a trade-off: more logs make troubleshooting easier but "eat" service performance; fewer logs leave service stability untouched but make troubleshooting hard, and the programmers "suffer".

Question: why does a large volume of INFO logs hurt performance (with CPU usage running high at the same time)?

Root cause 1: synchronous logging turns disk I/O into a bottleneck, blocking large numbers of threads

Imagine all logs being written to the same log file by multiple threads at once; without coordination the output would be a jumbled mess. The fix is to take a lock so the file output stays in order, but during peak traffic contention for that lock is precisely what costs the most. While one thread holds the lock, the other threads can only block and wait, which seriously drags down user threads. The symptom is upstream call timeouts and users experiencing the app as stuck.

Below is the stack of a thread stuck writing to the log file:

Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.appender.OutputStreamManager.writeBytes(OutputStreamManager.java:352)
- waiting to lock <0x000000063d668298> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.writeEncodedText(TextEncoderHelper.java:96)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:65)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:228)
.....

So is simply reducing INFO logs in production enough? The volume of ERROR logs should not be underestimated either. If a burst of abnormal data arrives, or a downstream dependency times out on a large scale, a huge number of ERROR logs is produced in an instant; disk I/O is still saturated and user threads still block.

Question: suppose you do not care about INFO-level troubleshooting at all; is there no performance problem if production only prints ERROR logs?

Root cause 2: printing exception stacks under high concurrency causes threads to block

Once, a downstream dependency produced a flood of timeouts and every exception was caught by our service. Fortunately, the disaster-recovery design had anticipated this and fell back to default values, so we thought there would be no impact. Then the servers started to "teach us a lesson": monitoring fired alarms, CPU usage climbed rapidly all the way past 90%, and we had to scale out urgently to stop the bleeding, pull one machine out of the traffic pool, and capture its thread dump.
Examining the dumped thread stacks together with a flame graph, most threads were stuck at the following location:

Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
- waiting to lock <0x000000064c514c88> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.logging.log4j.core.impl.ThrowableProxyHelper.loadClass(ThrowableProxyHelper.java:205)
at org.apache.logging.log4j.core.impl.ThrowableProxyHelper.toExtendedStackTrace(ThrowableProxyHelper.java:112)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:112)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:96)
at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:629)
...

The stack here is long. Most threads were blocked in java.lang.ClassLoader.loadClass, and walking further down the stack showed they were all triggered by this line:

at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java:319)

// The corresponding business code is
log.error("ds fetcher get error", e);

Wait, what? This seemed absurd: why would printing a log load classes? And why would loading classes block this many threads?
After some digging and analysis, I reached the following conclusions:

  • When Log4j's Logger.error is used to print an exception stack, it needs to load classes via the Classloader in order to print the location information for each class in the stack;
  • Classloader loading is thread-safe. Parallel loading improves the efficiency of loading different classes, but when multiple threads load the same class they still have to synchronize and wait for one another. This is especially true when different threads print identical exception stacks, which raises the risk of thread blocking; and when the Classloader tries to load a class that cannot be loaded, its efficiency drops sharply, making the blocking even worse;
  • Because reflective calls are slow, the JDK optimizes them by dynamically generating Java classes to replace the original native call path. These generated dynamic classes are loaded by a DelegatingClassLoader and cannot be loaded by any other Classloader. When the exception stack contains such reflection-optimized dynamic classes, thread blocking becomes very likely under high concurrency.

Combined with the stack above, it becomes clear why threads were stuck here:

  • A large number of threads flooded in, the downstream service timed out, and the timeout exception stack was printed over and over. For every frame in the stack, the class, version, line number and other details are obtained through class loading driven by reflection, which requires synchronized waiting: one thread holds the lock in loadClass, so most of the other threads block waiting for the class to finish loading, and performance suffers.
  • To be fair, even if most threads are waiting on one thread's loadClass, that should only be a momentary stall. Why did the blocking persist? Analyzing the program code in light of the conclusions above: the downstream-request logic in these threads includes Groovy script execution, which generates classes dynamically. According to the third conclusion, such dynamic classes cannot be correctly resolved by Log4j's reflective lookup under high concurrency, so the lookup is attempted again and again, effectively looping, and more and more threads can only pile up, blocked and waiting.

Best Practices

1. Remove unnecessary exception stack printing

For obvious, well-understood exceptions, don't print the stack; save the performance. Anything combined with high concurrency takes on a different meaning :)

try {
    System.out.println(Integer.parseInt(number) + 100);
} catch (Exception e) {
    // Before: prints the full exception stack
    log.error("parse int error : " + number, e);
    // After: logs the message only, no stack
    log.error("parse int error : " + number);
}

If Integer.parseInt throws an exception, the cause can only be that the incoming number is not a valid integer. In that case there is no need to print the exception stack at all, and the stack printing can simply be removed.

2. Convert the stack information to a string and print it

public static String stacktraceToString(Throwable throwable) {
    // Render the full stack trace into a plain string
    StringWriter stringWriter = new StringWriter();
    throwable.printStackTrace(new PrintWriter(stringWriter));
    return stringWriter.toString();
}

The stack obtained through log.error is more complete: the JDK version, class path information, and for classes inside jars even the jar's name and version. All of that comes from loading classes reflectively, and that is exactly what drags performance down.
Calling stacktraceToString converts the exception stack into a plain string instead. The trade-off is that some of the version and jar metadata is lost, so you need to decide whether you actually need that information (for example, it is still very useful when checking class conflicts by version).
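For illustration, here is a minimal usage sketch; the downstream call and variable names are hypothetical, not taken from the original code:

try {
    dsFetcher.fetch(dsId); // hypothetical downstream call
} catch (Exception e) {
    // Pass the stack as an ordinary string argument, so the logging framework
    // never receives the Throwable and never builds its extended stack trace.
    log.error("ds fetcher get error, stack: {}", stacktraceToString(e));
}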

3. Disable reflection optimization

When Log4j prints stack information and the stack contains a dynamic proxy class generated by reflection optimization, that class cannot be loaded by other Classloaders, and printing the stack then severely hurts execution efficiency. Disabling reflection optimization avoids this, but it has its own side effect: reflective calls themselves become slower.
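As a hedged example (HotSpot JDK 8; verify the behavior on your own JDK before relying on it), reflection "inflation" can effectively be switched off by raising its threshold so the JDK keeps using the native accessor path and never generates the DelegatingClassLoader-loaded accessor classes:

# Illustrative only: keep reflective calls on the native path so that no
# GeneratedMethodAccessor classes appear in exception stacks.
java -Dsun.reflect.inflationThreshold=2147483647 -jar your-service.jar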

4. Print logs asynchronously

In production, especially for services with high QPS, asynchronous logging must be enabled. Of course, once asynchronous logging is on there is a chance of losing log entries, for example when the server is forcibly killed; this is part of the trade-off.
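As a hedged sketch of what this can look like with Log4j2's async loggers (it requires the LMAX Disruptor on the classpath; the file names, sizes, and pattern below are illustrative, not taken from this article):

<Configuration>
  <Appenders>
    <RollingFile name="App" fileName="logs/app.log"
                 filePattern="logs/app-%d{yyyy-MM-dd}-%i.log.gz">
      <PatternLayout pattern="[%d{yy-MM-dd.HH:mm:ss.SSS}] [%thread] [%-5p %-22c{0} -] %m%n"/>
      <Policies>
        <SizeBasedTriggeringPolicy size="256 MB"/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <!-- AsyncRoot hands log events to a background thread instead of writing to
         disk on the caller's thread; includeLocation="false" also skips the
         costly caller-location lookup discussed in the next section. -->
    <AsyncRoot level="info" includeLocation="false">
      <AppenderRef ref="App"/>
    </AsyncRoot>
  </Loggers>
</Configuration>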

5. Log output format

Consider the difference between the following two log output patterns:

// Format 1
[%d{yyyy/MM/dd HH:mm:ss.SSS}][%X{traceId}] %t [%p] %C{1} (%F:%M:%L) %msg%n

// Format 2
[%d{yy-MM-dd.HH:mm:ss.SSS}] [%thread] [%-5p %-22c{0} -] %m%n

The Log4j2 documentation also carries an explicit performance warning: output that uses the following fields takes a significant performance hit.

%C or %class, %F or %file, %l or %location, %L or %line, %M or %method


To obtain the method name and line number, Log4j relies on the exception mechanism: it creates an exception, captures its stack trace, and then parses the caller's location out of that stack. The implementation also involves acquiring locks and parsing the stack contents, so under high concurrency the performance cost is easy to imagine.
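A rough, framework-agnostic illustration of the technique (not Log4j2's actual code): the caller's location is parsed out of the stack trace of a freshly created Throwable.

// Illustrative only: derive the caller's class, method, file and line number
// from a new Throwable's stack trace, which is the general trick logging
// frameworks use to locate the caller.
StackTraceElement[] stack = new Throwable().getStackTrace();
StackTraceElement caller = stack[1]; // index of the calling frame in this sketch
System.out.println(caller.getClassName() + "." + caller.getMethodName()
        + "(" + caller.getFileName() + ":" + caller.getLineNumber() + ")");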

The following pattern fields affect performance; configure them as appropriate:

%C - caller's class name (slow, not recommended)
%F - caller's source file name (extremely slow, not recommended)
%l - caller's method name, file name, and line number (strongly discouraged; very expensive)
%L - caller's line number (extremely slow, not recommended)
%M - caller's method name (extremely slow, not recommended)

Solution - Dynamic adjustment of log level

Project code needs to print a large number of INFO-level logs to support problem location and test troubleshooting, but in production this volume of INFO logging is largely useless and eats CPU. What is needed is the ability to adjust the log level dynamically: turn INFO on whenever logs need to be inspected, and turn it off again when it is not needed, without affecting service performance.

Solution: combine Apollo with Log4j2's features to dynamically control the log level at fine granularity, globally or per class file, from the API level. The advantage is that changes take effect immediately: for production troubleshooting you can open up the log level for a single class, and close it again as soon as the investigation is done.
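As a hedged sketch of the core idea only (the configuration key and listener wiring below are assumptions for illustration, not the author's full implementation): register an Apollo change listener and reset the Log4j2 level through Configurator whenever the configured value changes.

import com.ctrip.framework.apollo.Config;
import com.ctrip.framework.apollo.ConfigService;
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class DynamicLogLevel {

    // Hypothetical Apollo keys: "log.level.root" or "log.level.<fully.qualified.ClassName>"
    private static final String KEY_PREFIX = "log.level.";

    public static void init() {
        Config config = ConfigService.getAppConfig();
        config.addChangeListener(event -> {
            for (String key : event.changedKeys()) {
                if (!key.startsWith(KEY_PREFIX)) {
                    continue;
                }
                String target = key.substring(KEY_PREFIX.length());
                Level level = Level.toLevel(event.getChange(key).getNewValue(), Level.INFO);
                if ("root".equals(target)) {
                    Configurator.setRootLevel(level);      // global adjustment
                } else {
                    Configurator.setLevel(target, level);  // per-class adjustment
                }
            }
        });
    }
}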

Due to space constraints, the full implementation code is not included here. It is actually very simple: cleverly use Apollo's change-notification mechanism to reset the log level. If you are interested, send me a private message or leave a comment and I will write a separate article explaining the implementation in detail.

Summary and Outlook

This article walked through the common logging problems in everyday services and their corresponding solutions. Remember: simple things + high concurrency = no longer simple! Stay in awe of production!

If you read all the way to the end, you are a true fan. If you have any questions, send a private message or leave a comment and I will reply as soon as I see it. If you found this content valuable, please like, follow, and share; that is the greatest encouragement for me. Thank you for your support!


Origin blog.csdn.net/weixin_43975482/article/details/126647980