Troubleshooting CPU glitches on Linux machines

Author: jasonzxpan, Tencent IEG Operations Development Engineer

This article investigates a CPU usage glitch on a Linux machine. The investigation does not change any process state and does not affect the online service. Finally, the risk posed by the CPU glitch is analyzed and verified.

The CPU statistics and core-file generation tools mentioned in this article can be found in the simple-perf-tools repository.

Problem Description

Monitoring statistics for the machine hosting a service show that its CPU usage has glitches (brief spikes) during peak hours.

No negative feedback has been received from the service's callers yet.

Preliminary investigation

Checking the 1-minute load average shows that it swings between high and low values with obvious fluctuations, which indicates that some processes on the machine vary greatly in their CPU usage.
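
One simple way to observe this (a minimal sketch; the monitoring platform collects the same data automatically) is to sample the 1-minute load average, the first field of /proc/loadavg, every few seconds:

# Print a timestamped 1-minute load average every 5 seconds
while sleep 5; do
    echo "$(date +%T) $(awk '{print $1}' /proc/loadavg)"
done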

Log in to the machine and inspect the processes with top. Because the CPU usage rises significantly at times, processes with a high total CPU time are the first suspects. After opening top, press Shift+T to sort by CPU TIME.
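
An equivalent non-interactive check (a sketch using procps ps; the chosen columns are illustrative) is:

# List the 20 processes with the highest cumulative CPU time
ps -eo pid,etime,time,pcpu,comm --sort=-time | head -n 20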

Intuitively, several spp_worker-related processes have relatively high CPU TIME.

The first process has been running for a long time, so its CPU TIME is naturally large. The following script calculates each process's CPU usage since it started:

uptime=`awk '{print $1}' /proc/uptime`   # seconds since boot (why is this so slow in docker?)
hertz=`zcat /proc/config.gz | grep CONFIG_HZ= | awk -F"=" '{print $2}'`   # kernel clock ticks per second
# For each /proc/<pid>/stat: CPU% = 100 * (utime + stime) / (ticks elapsed since the process started)
awk -v uptime=$uptime -v hertz=$hertz -- '{printf("%d\t%s\t%11.3f\n", $1, $2, (100 * ($14 + $15) / (hertz * uptime - $22)));}' /proc/*/stat 2> /dev/null | sort -gr -k 3 | head -n 20

This also shows that these spp_worker processes use relatively more CPU:

Pick one of the worker processes, PID 45558, and monitor its CPU usage:

Its CPU usage is very low most of the time, but at certain points it rises sharply for about 1 second, and most of that time is spent in user mode rather than in system calls.
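
This kind of per-second monitoring can be reproduced with pidstat from the sysstat package (a sketch; the data in this article was collected with the simple-perf-tools scripts):

# Report PID 45558's user/system CPU usage once per second
pidstat -u -p 45558 1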

The CPU usage sampling strategy described in "Linux Agent Collection Item Description - CPU Usage" is:

The Linux agent collects the average CPU usage over a 15-second window, four times every minute. To avoid missing CPU peaks, the network-management agent reports the maximum of the four values collected in that minute.

Because a sample may land on a high point or a low point, a minute in which the CPU bursts shows up as a spike, while a minute in which none of the four samples catches a burst shows up as a trough.
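
The reporting rule can be illustrated with a small sketch (assumed behavior based on the description above, not the agent's actual code; sar's column layout may vary slightly across sysstat versions):

# Take four 15-second average CPU usages and report the maximum for the minute
max=0
for i in 1 2 3 4; do
    idle=$(sar -u 15 1 | awk '/Average/ {print $8}')            # %idle from sar's average line
    usage=$(awk -v i="$idle" 'BEGIN {printf "%d", 100 - i}')    # integer CPU usage percent
    [ "$usage" -gt "$max" ] && max=$usage
done
echo "reported CPU usage for this minute: ${max}%"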

At this point it is confirmed that this batch of worker processes causes the glitch, but which part of the code is responsible still needs further investigation.

Further investigation

It has already been confirmed that there are not many system calls, so there is no need to use strace.

Use the perf tool

Observe the process with perf; the specific command is perf top -p 45558. When CPU usage is low, the sampling looks like this:

But when the CPU spikes, the perf sampling hotspots become the following:

Looking at the entries in the red box, the configuration-update code is a likely suspect, because:

  • Many of the hotspots are in Protobuf code performing update operations (MergeFrom and Delete appear)

  • std::map is used heavily (std::_Rb_tree appears, along with string comparisons)

Although the perf results let us guess where the heavy computation happens, there are two inconveniences:

  • The CPU spikes occur rarely, so catching them by watching perf top manually is time-consuming

  • It does not clearly show which function in which file is responsible

Use gcore

The initial statistics show that the CPU stays high for more than 1 second at a time. If gcore {pid} is run while the CPU load is high, the stack can be preserved and the exact location of the heavy load identified.

The gcore command is added to the statistics tool so that it is triggered when CPU usage crosses a threshold.
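
A minimal sketch of such a trigger is shown below (the real logic lives in the simple-perf-tools scripts; the PID and threshold are illustrative). It computes per-second CPU usage from /proc/<pid>/stat and calls gcore when the threshold is crossed:

PID=45558
THRESHOLD=80                                        # percent CPU regarded as high load
HERTZ=$(getconf CLK_TCK)                            # clock ticks per second
prev=$(awk '{print $14 + $15}' /proc/$PID/stat)     # utime + stime, in ticks
while sleep 1; do
    cur=$(awk '{print $14 + $15}' /proc/$PID/stat) || break
    usage=$(( (cur - prev) * 100 / HERTZ ))         # approximate %CPU over the last second
    prev=$cur
    if [ "$usage" -ge "$THRESHOLD" ]; then
        gcore -o "core_${PID}_$(date +%s)" "$PID"   # dump a core without killing the process
    fi
done

Note that gcore briefly stops the process while dumping, so the threshold should be set so that it only fires on the rare spikes.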

After reading several of the core dump files with gdb, I found that the stacks and function calls are basically identical. It is clear that a lot of the time is spent in the function AddActInfoV3:
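
For reference, each core can be inspected non-interactively like this (a sketch; ./spp_worker and the core file name are placeholders for the actual worker binary and dump):

# Print the backtraces of all threads in a core dump to a text file
gdb -batch -ex "thread apply all bt" ./spp_worker core.45558 > stack_45558.txt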

At this point we have pinpointed the specific location where the heavy computation occurs.

Risk points

Is a sudden surge in CPU actually risky? Is it harmless as long as computing resources are sufficient?

In this example, the SPP microthread feature is used, and each worker process runs only a single thread.

If the CPU is tied up by heavy computation, the normal request-handling logic struggles to get scheduled. This inevitably increases request-processing latency and even risks responses timing out.

Use spp's cost_stat_tool

Use spp's own statistics tool to verify this risk by viewing the worker's latency statistics for front-end requests; execute the command ./cost_stat_tool -r 1:

In the example above, in the 5 seconds around the configuration update, 3 of the 231 requests handled by the worker took more than 500 ms to process, far longer than a normal request.

Use tcpdump capture to confirm

Because the service does not have detailed logging enabled, we want to further verify that these 500 ms+ requests are normally processed requests rather than abnormal ones; this can be analyzed by capturing packets.

# Capture up to 5000 packets on service port 20391 with full payload (-s0) and save them to a pcap file
tcpdump -i any tcp port 20391 -Xs0 -c 5000 -w service_spp.pcap

Open the capture in Wireshark and filter for requests whose response time minus request time exceeds 500 ms. Translated into a Wireshark display filter (tcp.time_delta is the time since the previous packet in the same TCP stream, and tcp.dstport != 20391 keeps only the packets sent by the server side), the expression is:

tcp.time_delta > 0.5 && tcp.dstport != 20391

This picks out one request that meets the condition:

Right-click on the record -> Follow -> TCP Stream to view the IP packets before and after the request:

The 4 packets above are:

  1. +0ms The client sends a request to the server

  2. +38ms The server replies with an ACK that carries no data

  3. +661ms The server sends the response back to the client

  4. +662ms The client replies with an ACK

Examining the packet contents in detail, this is an ordinary request with simple logic that should normally return within 20 ms. At that moment the process's CPU usage was indeed at high load:

The statistics above corroborate each other:

  • When the CPU is high, even normal network requests are blocked (merely replying with the ACK took 38 ms; since this is below the 40 ms minimum of TCP delayed acknowledgment, delayed ACK is not the cause)

  • A request that normally returns in about 20 ms took 660 ms

A sudden CPU spike is risky and needs to be taken seriously.



Source: blog.csdn.net/Tencent_TEG/article/details/109088648