An easily overlooked detail: severe 0:00 interface timeouts caused by the Log4j configuration

Author: vivo Internet Server Team - Jiang Ye

This article is a detailed record of troubleshooting severe timeouts of the 0:00 interfaces. It walks through the process from the author's perspective, from locating the problem to investigating specific causes, and finally resolves it through root cause analysis. The whole process requires clear troubleshooting ideas and solid problem-solving experience, and it also relies on the company's call chain (tracing) and comprehensive monitoring infrastructure.

1. Problem discovery

The mall activity system I am responsible for supports the marketing activities of the company's official online mall. Recently, I suddenly started receiving service timeout alarms at 0:00 a.m.

The marketing activity system has the following characteristics:

  1. Marketing activities usually start at 0:00, such as red envelope rain and large coupon grabs.

  2. Daily activity opportunities are refreshed at 0:00, such as daily tasks, daily check-ins, and daily lottery chances.

The incentives of marketing activities attract a large number of real users as well as malicious bots, so traffic sees a small peak at 0:00. Because of this, the occasional service timeout alarms did not attract my attention at first. But over the next few days I received a service timeout alarm at 0:00 every single day, which raised my vigilance, and I decided to get to the bottom of it.

2. Troubleshooting

First, I checked the per-minute P95 response time of each interface around 0:00 through the company's application monitoring system. As shown in the figure below, the interface response time peaks at 8s at 0:00. Further inspection shows that the most time-consuming interface is the product list interface, so the investigation below focuses on this interface.

2.1 Troubleshooting ideas

Before diving into the investigation, let me share my general approach to troubleshooting interface timeouts. The figure below is a simplified request flow.

  1. The user initiates a request to the application

  2. The application service performs its business logic

  3. The application service calls downstream applications via RPC and performs database reads and writes

A service timeout may be caused by slowness in the application service itself or by slow responses from downstream dependencies. The specific investigation ideas are as follows:

2.1.1 Troubleshooting slow downstream dependencies

(1) Locate slow downstream dependencies through call chain technology

Call chain (distributed tracing) technology is an important part of system observability; common open source solutions include Zipkin and Pinpoint. A complete call chain records, in chronological order, the time consumed by each downstream dependency, such as RPC calls, SQL execution, and Redis access. Tracing therefore makes it possible to quickly locate slow downstream services, such as Dubbo interface timeouts or slow SQL. But reality falls short of the ideal: because the volume of call link data is huge, collecting the full link information requires a lot of storage and computing resources, so in practice a sampling strategy is usually adopted. The downside of sampling is that request link information may be lost or incomplete.

(2) Slow service troubleshooting when there is no call chain

If the call chain is lost or incomplete, we need to combine other means to locate the problem.

Downstream RPC service response timeout: if the Dubbo framework is used, timeout-related logs are printed when a provider's response times out; if the company provides application monitoring, the downstream service's P95 response time can also be checked for a combined judgment.

Slow SQL: MySQL supports setting a slow-SQL threshold, and any statement exceeding it is recorded as a slow query; Druid, the database connection pool we commonly use, can also print slow SQL logs through configuration. If a slow SQL is confirmed in the request link, further analyze its execution plan. If the execution plan looks fine, then check the system load of the MySQL host at the time the slow SQL occurred.
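As a hedged illustration of the Druid configuration mentioned above (a minimal sketch, not our project's actual setup; the URL and credentials are placeholders), a DruidDataSource can be wired programmatically to log any statement slower than 1s through its StatFilter:

import java.util.ArrayList;
import java.util.List;

import com.alibaba.druid.filter.Filter;
import com.alibaba.druid.filter.stat.StatFilter;
import com.alibaba.druid.pool.DruidDataSource;

public class SlowSqlLoggingConfig {

    // Build a DruidDataSource whose StatFilter records any SQL taking longer than 1000 ms.
    public static DruidDataSource slowSqlDataSource(String url, String user, String password) {
        StatFilter statFilter = new StatFilter();
        statFilter.setSlowSqlMillis(1000); // slow-SQL threshold: 1s
        statFilter.setLogSlowSql(true);    // print slow SQL to the application log

        List<Filter> filters = new ArrayList<>();
        filters.add(statFilter);

        DruidDataSource ds = new DruidDataSource();
        ds.setUrl(url);
        ds.setUsername(user);
        ds.setPassword(password);
        ds.setProxyFilters(filters);
        return ds;
    }
}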

When downstream dependencies include storage services such as Redis, ES, and Mongo, the troubleshooting approach for slow services is similar to that for slow SQL and will not be repeated here.

2.1.2 Application troubleshooting

(1) Time-consuming application logic

Time-consuming application logic is common, for example serialization and deserialization of large numbers of objects, or heavy use of reflection. Troubleshooting such problems usually starts with analyzing the source code, and they should be avoided as much as possible when coding.

(2) Pauses caused by garbage collection (stop-the-world)

Garbage collection pauses the application, and the pause is especially noticeable when an Old GC or Full GC occurs. However, it also depends on the garbage collector chosen by the application and the related tuning. For example, the CMS collector can usually keep pauses short, while the Parallel Scavenge collector pursues higher throughput and its pauses are relatively longer.

With the JVM startup parameter -XX:+PrintGCDetails, we can print detailed GC logs and observe the type, frequency, and duration of GC.
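Besides GC logs, GC frequency and accumulated collection time can also be sampled at runtime through the standard management API. The following is a small illustrative sketch (not part of the original troubleshooting, just a complementary check):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsProbe {

    // Print the cumulative collection count and total collection time of each garbage collector.
    public static void dumpGcStats() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("collector=%s, count=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }

    public static void main(String[] args) {
        dumpGcStats();
    }
}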

(3) Thread synchronization blocking

With thread synchronization, if the thread currently holding a lock holds it for a long time, the queued threads stay in the BLOCKED state, causing service responses to time out. The jstack tool can print the thread stacks to find out whether any threads are in the BLOCKED state. Of course, jstack can only capture thread stacks at the current moment; to view historical stack information, it generally needs to be collected and stored by a monitoring system such as Prometheus.
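Output similar to jstack can also be collected in-process. Here is a minimal sketch (an illustration only, not the company's monitoring agent) that lists BLOCKED threads together with the owner of the lock they are waiting for:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadDump {

    // Print every thread currently in the BLOCKED state and the thread that owns the contended lock.
    public static void dumpBlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("\"%s\" BLOCKED on %s owned by \"%s\"%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            }
        }
    }

    public static void main(String[] args) {
        dumpBlockedThreads();
    }
}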

2.2 Troubleshooting process

The following investigation follows this approach.

2.2.1 Troubleshooting slow downstream dependencies

(1) View downstream slow services through the call chain

First, I went to the company's application monitoring platform, filtered the call chains from five minutes before and after 0:00, and sorted them in descending order of total duration. The slowest request took 7399 ms. Checking the details of that call chain shows that the downstream dependencies all complete at the millisecond level. Because the call chain is sampled, link information may be missing, so other means are needed to further check the downstream dependencies.

(2) Investigate downstream slow services by other means

Then I checked the system logs around 0:00 and found no Dubbo call timeouts. Next, I checked the P95 response time of the downstream applications through the company's application monitoring. As shown in the figure below, at 0:00 the response time of some downstream services is indeed slower, with the highest reaching 1.45s. Although this has some impact on the upstream, it is not enough to cause such a large delay.

(3) Slow SQL troubleshooting

The next step was to check for slow SQL. Our system's connection pool is Alibaba's open-source Druid; if SQL execution exceeds 1s, a slow SQL log is printed. The log center shows no trace of slow SQL.

So far, slow downstream dependencies can be preliminarily ruled out as the cause of the service timeouts; we continue to troubleshoot the application itself.

2.2.2 Application troubleshooting

(1) Complex and time-consuming logic troubleshooting

First, I reviewed the source code of the interface. The overall logic is relatively simple: the downstream product system is called through Dubbo to obtain product information, which is then sorted and lightly processed locally. There is no complex, time-consuming logic here.

(2) Troubleshoot garbage collection pauses

Checking the application's GC status through the company's application monitoring shows that no Full GC occurred around 0:00, and no Old GC either. Garbage collection pauses are therefore also ruled out.

(3) Thread synchronization blocking troubleshooting

Next, check whether there are synchronously blocked threads through the company's application monitoring, as shown in the figure below:

Seeing this, I finally felt the effort paying off. From 00:00:00 to 00:02:00, many threads in the BLOCKED state appeared, and the timed-out interface is most likely related to these blocked threads. We only need to further analyze the JVM stack information to reveal the truth.

We picked a representative machine at random and examined its blocked-thread stacks. The stack was collected at 2022-08-02 00:00:20.

// Log printing operation, blocked by thread catalina-exec-408
"catalina-exec-99" Id=506 BLOCKED on org.apache.log4j.spi.RootLogger@15f368fa owned by "catalina-exec-408" Id=55204
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    -  blocked on org.apache.log4j.spi.RootLogger@15f368fa
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF(Category.java:391)
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF$accessor$pRDvBPqB(Category.java)
    at org.apache.log4j.Category$auxiliary$JhXHxvpc.call(Unknown Source)
    at com.vivo.internet.trace.agent.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:46)
    at org.apache.log4j.Category.forcedLog(Category.java)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:324)
    ...
 
// Log printing operation, blocked by thread catalina-exec-408
"catalina-exec-440" Id=55236 BLOCKED on org.apache.log4j.spi.RootLogger@15f368fa owned by "catalina-exec-408" Id=55204
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    -  blocked on org.apache.log4j.spi.RootLogger@15f368fa
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF(Category.java:391)
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF$accessor$pRDvBPqB(Category.java)
    at org.apache.log4j.Category$auxiliary$JhXHxvpc.call(Unknown Source)
    at com.vivo.internet.trace.agent.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:46)
    at org.apache.log4j.Category.forcedLog(Category.java)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:444)
    ...
 
// Log printing operation, blocked by thread catalina-exec-408
"catalina-exec-416" Id=55212 BLOCKED on org.apache.log4j.spi.RootLogger@15f368fa owned by "catalina-exec-408" Id=55204
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    -  blocked on org.apache.log4j.spi.RootLogger@15f368fa
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF(Category.java:391)
    at org.apache.log4j.Category.forcedLog$original$mp4HwCYF$accessor$pRDvBPqB(Category.java)
    at org.apache.log4j.Category$auxiliary$JhXHxvpc.call(Unknown Source)
    at com.vivo.internet.trace.agent.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:46)
    at org.apache.log4j.Category.forcedLog(Category.java)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:444)
    ...

Two points can be drawn from the stack information:

  1. All of the blocked threads are blocked while printing logs

  2. All of them are blocked by the thread named "catalina-exec-408"

Tracing to this point, the surface cause of the slow service becomes clear: the threads blocked by catalina-exec-408 stay in the BLOCKED state, causing the service responses to time out.

3. Root cause analysis

Now that the surface cause has been found, let's push through the layers of fog and find the truth behind it!

All slow service threads are blocked by thread catalina-exec-408 when printing logs. So what is thread catalina-exec-408 doing?

It turns out that at 00:00:18.858 the thread was printing a log about a failed login-state check, with no complicated processing logic at all. Could it be that this thread is printing the log slowly and blocking the others? With this question in mind, I started digging into the source code of the logging framework for the answer.

The logging framework used by our project is slf4j + Log4j. Based on the blocked threads' stack traces, we locate the following code:


public void callAppenders(LoggingEvent event) {
  int writes = 0;

  for(Category c = this; c != null; c=c.parent) {
    // Protected against simultaneous call to addAppender, removeAppender,...
    // This is line 204: the synchronized block seen in the stack traces
    synchronized(c) {
      if(c.aai != null) {
        writes += c.aai.appendLoopOnAppenders(event);
      }
      if(!c.additive) {
        break;
      }
    }
  }

  if(writes == 0) {
    repository.emitNoAppenderWarning(this);
  }
}

Line 204 in the stack trace is indeed a synchronized block, which blocks the other threads. So what is the logic inside the synchronized block, and why does it take so long? The following is its core logic:

public int appendLoopOnAppenders(LoggingEvent event) {
  int size = 0;
  Appender appender;

  if(appenderList != null) {
    size = appenderList.size();
    for(int i = 0; i < size; i++) {
      appender = (Appender) appenderList.elementAt(i);
      appender.doAppend(event);
    }
  }
  return size;
}

As you can see, this logic writes the log event to all configured appenders. We configure two appenders: one is the console appender, whose output ends up in the catalina.out file; the other outputs JSON according to the collection requirements of the company's log center. It can be inferred that thread catalina-exec-408 is spending its time writing the log to the appenders.
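To make the blocking mechanism concrete, here is a minimal, self-contained sketch (hypothetical code, not taken from our system) in which a single artificially slow appender attached to the root logger makes concurrent logging threads pile up in the BLOCKED state, because callAppenders holds the logger monitor while each appender writes:

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

public class SlowAppenderDemo {

    // An appender whose write path is artificially slow, standing in for a disk whose IO is saturated.
    static class SlowAppender extends AppenderSkeleton {
        @Override
        protected void append(LoggingEvent event) {
            try {
                Thread.sleep(5000); // simulate a 5s write while disk IO is exhausted
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        @Override
        public void close() {
        }

        @Override
        public boolean requiresLayout() {
            return false;
        }
    }

    public static void main(String[] args) {
        Logger.getRootLogger().addAppender(new SlowAppender());

        // Several "request" threads log through the same root logger. While one thread
        // is inside the slow appender, the others block on the synchronized(c) block in
        // Category.callAppenders, just like the catalina-exec-* threads in the stacks above.
        for (int i = 0; i < 5; i++) {
            final int id = i;
            new Thread(() -> Logger.getLogger(SlowAppenderDemo.class).info("request " + id),
                    "worker-" + id).start();
        }
    }
}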

I naturally began to suspect that the machine load at the time, especially the IO load, was high. Through the company's machine monitoring, we checked the related metrics:

Sure enough, from 00:00:00 the disk IO utilization stayed high, and the first peak did not end until 1 minute 20 seconds later. At 00:00:20, IO utilization reached a peak of 99.63%, close to 100%. No wonder the application struggled to write even a log line!

So who exhausted all the IO resources, leaving barely a scrap behind? With this question, I checked the host snapshot through the company's machine monitoring:

It turns out that at 00:00:20, the tomcat user was executing the script /bin/sh /scripts/cutlog.sh, and the script was running the command cp catalina.out catalina.out-2022-08-02-00, with IO consumption reaching 109475612 bytes/s (about 104 MB/s).

The truth was about to surface, so we kept digging. The operations colleague logged in to the machine, switched to the tomcat user, and viewed the scheduled task list (crontab -l), with the following result:

00 00 * * * /bin/sh /scripts/cutlog.sh

It is exactly the /bin/sh /scripts/cutlog.sh script seen in the snapshot, executed at 0:00 every day. The script content is as follows:

$ cat /scripts/cutlog.sh
#!/bin/bash

files=(
  xxx
)
 
time=$(date +%F-%H)
 
for file in ${files[@]}
do
  dir=$(dirname ${file})
  filename=$(echo "xxx"|awk -F'/' '{print $NF}')
  # Archive the catalina.out log and truncate the current catalina.out file
  cd ${dir} && cp ${filename} ${filename}-${time} && > ${filename}
done

From the script we found the culprit of the high IO consumption: the cp command, whose purpose is to archive the catalina.out log and then truncate catalina.out.

This routine operations script inevitably consumes IO resources, and its execution time depends on the file size. The operations colleague also checked the size of the archived log:

[root@xxx:logdir]
# du -sh *
1.4G    catalina.out
2.6G    catalina.out-2022-08-02-00

The archived file is 2.6 GB; at roughly 104 MB/s, the copy takes about 25 seconds. That is, from 00:00:00 to 00:00:25, writing business logs is very slow, a large number of threads block, and the interface responses time out.

4. Problem solving

Once the root cause is located, the right remedy can be prescribed. There are several options to choose from:

4.1 Stop printing logs to the console in production

The IO-heavy operation is archiving catalina.out. If logs are not written to this file, the problem of waiting on log IO goes away. However, local debugging, stress testing, and other environments still rely on console logs, so different console appenders would have to be configured per environment. Logback and Log4j2 already support different configurations based on the Spring profile, but the Log4j we use does not. Switching the underlying logging framework is also relatively expensive, and the company's earlier middleware is tightly coupled to Log4j, making an easy switch impossible, so we did not adopt this option.

4.2 Configure log asynchronous printing

Log4j provides AsyncAppender to support asynchronous log printing. Asynchronous logging removes the IO wait of synchronous logging and avoids blocking business threads.

Side effects of asynchronous logging:

Asynchronous logging adds the log event to a buffer queue when a log is printed. The default buffer size is 128 and is configurable. There are two handling strategies when the buffer is full:

(1) Blocking

When the blocking attribute is set to true (the default), the blocking strategy is used: once the buffer is full, the logging call waits synchronously, the business thread blocks, and logging effectively degrades back to synchronous logging.

(2) Discarding

When blocking is set to false, log events are discarded once the buffer is full.

4.3 Final solution

In the end we chose option 2: configure asynchronous log printing, with the buffer queue size set to 2048. Since the business can tolerate losing some logs, we sacrifice a small amount of reliability for higher performance and set blocking to false.
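For illustration, the chosen settings could be wired up programmatically with Log4j 1.x roughly as follows (a sketch under the assumptions above, not our actual configuration, which lives in the Log4j config file; the wrapped console appender and its pattern are just examples):

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class AsyncLoggingSetup {

    public static void configure() {
        // The real appender that performs the actual (potentially slow) write.
        ConsoleAppender console = new ConsoleAppender(new PatternLayout("%d %p [%t] %c - %m%n"));

        // Wrap it in an AsyncAppender so business threads only enqueue log events.
        AsyncAppender async = new AsyncAppender();
        async.setBufferSize(2048); // enlarge the buffer from the default 128
        async.setBlocking(false);  // when the buffer is full, discard events instead of blocking
        async.addAppender(console);

        Logger.getRootLogger().addAppender(async);
    }
}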

4.4 Summary

I gained a few insights from this troubleshooting experience and would like to share them:

1) Be in awe of online alerts

We should be in awe of every online alarm and pay attention to every error log. The reality is that most alarms are caused by network jitter, short traffic bursts, and the like, and recover on their own. Like the boy who cried wolf, this lulls us into relaxing our vigilance, so we miss the real problems, bringing serious damage to the system and losses to the business.

2) Get to the bottom of it

Alarms are just the surface; we need to figure out both the superficial cause and the root cause of each one. For example, only after analyzing the root cause of this interface timeout alarm, "the file copy exhausts disk IO and blocks the log-printing threads", could we give an elegant and reasonable solution. It sounds simple, but in practice there may be many difficulties. It requires clear troubleshooting ideas, good system observability, solid technical fundamentals, and the determination not to give up until the real culprit is found.

Finally, I hope my troubleshooting experience gives you something useful and some inspiration. I have also organized the timeout troubleshooting approach used in this article into a flow chart for your reference. What online glitches have you encountered, and how did you investigate them? Feel free to leave a comment for discussion.


Origin blog.csdn.net/vivo_tech/article/details/130266359