(Good article, reposted) Alibaba's Yang Yong: A systematic approach to analyzing high load on Linux

Original article by Yang Yong, Linux阅码场 (Linux Reading Field), 2019-12-26


Introduction to this article:
How to troubleshoot high Load on Linux is a well-worn topic, but most articles focus on only a few isolated points and never lay out an overall troubleshooting approach. As the saying goes, "Give someone a fish and you feed them for a day; teach them to fish and you feed them for a lifetime." This article tries to establish a method and a routine to help readers gain a more complete understanding of how to troubleshoot high-Load problems.
About the author
Oliver Yang is a Linux kernel engineer in the Alibaba Cloud system group (WeChat subscription account: Kernel Monthly Talk). He previously worked at EMC and the Sun China Engineering Research Institute on storage systems and Solaris kernel development. In recent years he has focused on the Linux kernel, with a strong interest in Linux performance optimization and the memory management subsystem. Contact: http://oliveryang.net/about .


The Alibaba Cloud system team ( http://kernel.taobao.org ) grew out of the original Taobao kernel group. In 2013, answering the call of Alibaba Group, the Taobao kernel group was reorganized and transferred to Alibaba Cloud, where it began building complete system-level support for the lowest layer of the cloud computing stack. The team is made up of kernel developers with a strong sense of mission and self-motivation, most of whom are active community kernel developers. Its current work areas include (but are not limited to) Linux kernel memory management, file systems, networking, kernel maintenance and build infrastructure, as well as user-space libraries and tools related to the kernel.


How to troubleshoot high Load on Linux is a commonplace topic, but most articles focus on only a few isolated points and never lay out an overall troubleshooting approach. As the saying goes, "Give someone a fish and you feed them for a day; teach them to fish and you feed them for a lifetime." This article tries to establish a method and a routine to help readers gain a more complete understanding of how to troubleshoot high-Load problems.

Start by eliminating misunderstandings

A Load value without a baseline is not a reliable Load value

Many people encounter the monitoring metric System Load Average on their very first day of Unix/Linux system administration, yet not everyone knows what this metric really means. Generally speaking, the following misconceptions are often heard:

Misconception 1: a high Load means a high CPU load.

Traditional Unix and Linux differ by design here. On Unix systems a high Load is indeed caused by an increase in runnable processes, but that is not the case for Linux. On Linux, the Load can rise in two situations: an increase in the number of R-state processes in the system, or an increase in the number of D-state processes in the system.

Misconception 2: if the loadavg value exceeds some particular number, there must be a problem.

The loadavg value is a relative value, affected by the number of CPUs and IO devices, and even by software-defined virtual resources. Judging whether the Load is high requires a historical baseline, and Load values must not be casually compared across different systems.
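
As a quick illustration (a shell sketch added here, with made-up output numbers), the loadavg value should always be read alongside the number of online CPUs of that machine:

# The three load averages, plus runnable/total tasks and the last PID used
$ cat /proc/loadavg
8.12 7.90 7.45 3/1482 21904

# Number of online CPUs: a 1-minute load of 8 means something very different
# on a 4-core box than on a 64-core machine
$ nproc
64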

Misconception 3: a system with a high Load must be busy.

A system with a high Load can indeed be busy; for example, when the CPU load is high, the CPU is busy. But a high Load does not necessarily mean the system is busy overall: when the IO load is high, the disks can be very busy while the CPU stays relatively idle, which shows up as high iowait. Note that iowait is essentially a special kind of CPU idle state. There is also a third case in which the Load is high while both the CPU and the disk peripherals are idle; this is typically caused by lock contention, and the CPU time then shows high idle rather than high iowait.

In his recent blog post [Linux Load Averages: Solving the Mystery] ( http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html ), Brendan Gregg discusses the difference between the Unix and Linux Load Average, goes back to a Linux community discussion from 24 years ago, and digs out why Linux modified the Unix definition of Load Average. The post argues that it is precisely the inclusion of D-state threads in the calculation, introduced by Linux, that makes the cause of a high Load confusing: there are far too many reasons for a task to enter the D state, by no means just IO load or lock contention. It is exactly because of this ambiguity that Load values are hard to compare across systems and across application types, and any judgment of a "high" Load should be based on a historical baseline. This article will not repeat too much of that post; for further details, reading the original is recommended.

How to troubleshoot high load issues

As mentioned earlier, because Load is an ambiguously defined indicator in Linux, troubleshooting a high loadavg is a fairly complex process. The basic idea is to follow different troubleshooting paths depending on whether the root cause of the Load change is an increase in R-state tasks or an increase in D-state tasks.

Here is a general troubleshooting routine for a Load increase, for reference:

[Figure: overall flowchart for troubleshooting a high Load]
In a Linux system, the number of R-state processes can be obtained by reading the /proc/stat file; for the number of D-state tasks, the most direct way is probably the ps command. Note that procs_blocked in /proc/stat only gives the number of processes waiting for disk IO:

$ cat /proc/stat 
.......
processes 50777849
procs_running 1
procs_blocked 0
......

By simply distinguishing whether it is R-state tasks or D-state tasks that are increasing, we can enter different troubleshooting flows. Below we briefly walk through the troubleshooting ideas in the big picture above.
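
For instance, a quick way to list the current D-state tasks directly with ps (a small sketch, not taken from the original article; the sample output line is illustrative):

# List tasks whose state starts with D, together with the kernel function they sleep in
$ ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
  PID STAT WCHAN                            COMMAND
 4321 D    wait_on_page_bit                 dd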

Increase in R-state tasks

This usually means the CPU load is high. The main idea for troubleshooting such problems is to analyze where the CPU run time of the system, containers, and processes is spent, and to find the hot path on the CPU, that is, the code that consumes most of the CPU run time.

The split between CPU user and sys time usually helps quickly determine whether the problem lies in user-space processes or in the kernel. In addition, the length of the CPU run queue, the scheduling wait time, and the number of involuntary context switches all help to roughly understand the problem scenario.
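
These signals can be collected with standard tools; a rough sketch (vmstat from procps, pidstat from the sysstat package):

# 'r' column: run-queue length; 'b': blocked tasks; us/sy/id/wa: CPU time split
$ vmstat 1

# cswch/s vs nvcswch/s: voluntary vs involuntary context switches per task
$ pidstat -w 1

# Per-task %usr and %system CPU usage
$ pidstat -u 1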

To go further and associate the problem scenario with the relevant code, dynamic tracing tools such as perf, SystemTap, and ftrace are usually needed.
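
For example, a common perf workflow for locating the CPU hot path (an illustrative sketch, not prescribed by the original article):

# Sample on-CPU stacks system-wide at 99 Hz for 10 seconds
$ sudo perf record -F 99 -a -g -- sleep 10

# Summarize the hottest call chains
$ sudo perf report

# Or watch the hot functions live
$ sudo perf top -g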

Once the problem has been tied to a code path, the next step, analyzing where that code spends its time, should focus first on time that is effectively wasted, for example spinning on user-space or kernel spin locks.

Of course, if the CPU is running meaningful and efficient code, the only remaining question is whether the workload is simply too large for the machine.

Increase in D-state tasks

By the design of the Linux kernel, a D-state task is essentially a task that went to sleep in the TASK_UNINTERRUPTIBLE state, so there are many possible causes. However, because the kernel gives CPU idle time caused by sleeping in the IO stack a special definition, namely iowait, iowait becomes an important reference for deciding whether a D-state-driven Load increase is caused by IO.

Of course, as mentioned earlier, the trend of procs_blocked in /proc/stat is also a very good reference for judging whether a high Load is caused by iowait.
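
As a quick check (a shell sketch, not from the original text), per-CPU iowait and the blocked-task count can be watched side by side:

# %iowait per CPU, once per second (mpstat comes with the sysstat package)
$ mpstat -P ALL 1

# 'b' column: tasks in uninterruptible sleep; 'wa': CPU iowait percentage
$ vmstat 1

# Trend of tasks currently blocked on disk IO
$ watch -n1 'grep procs_blocked /proc/stat'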

High CPU iowait

Many people have a misconception about CPU iowait, believing that a high iowait means the CPU is busy doing IO. The opposite is true: when iowait is high, the CPU is idle and has no task to run; it is only because there is outstanding disk IO at that moment that the idle time is accounted as iowait rather than idle.

If we use the perf probe command at this point, we can clearly see that a CPU in the iowait state is actually running the idle thread, whose pid is 0:


$ sudo perf probe -a account_idle_ticks
$sudo perf record -e probe:account_idle_ticks -ag sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.418 MB perf.data (843 samples) ]

$sudo perf script
swapper     0 [013] 5911414.451891: probe:account_idle_ticks: (ffffffff810b6af0)
             2b6af1 account_idle_ticks (/lib/modules/3.10.0/build/vmlinux)
             2d65d9 cpu_startup_entry (/lib/modules/3.10.0/build/vmlinux)
             24840a start_secondary (/lib/modules/3.10.0/build/vmlinux)

The relevant code in the idle loop, showing how CPU time is accounted to iowait versus idle, is as follows:

/*       
 * Account multiple ticks of idle time.
 * @ticks: number of stolen ticks
 */   
void account_idle_ticks(unsigned long ticks)
{        
    if (sched_clock_irqtime) {
        irqtime_account_idle_ticks(ticks);
        return;
    }   
    account_idle_time(jiffies_to_cputime(ticks)); 
}        

/*
 * Account for idle time.
 * @cputime: the cpu time spent in idle wait
 */
void account_idle_time(cputime_t cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
    else
        cpustat[CPUTIME_IDLE] += (__force u64) cputime;
}

The Linux IO stack and file system code call io_schedule to wait for disk IO completion. Before the task goes to sleep, the atomic variable rq->nr_iowait is incremented; it is this count that determines whether idle CPU time is accounted as iowait. Note that before io_schedule is called, the caller usually sets the task to the TASK_UNINTERRUPTIBLE state explicitly first:

/*           
 * This task is about to go to sleep on IO. Increment rq->nr_iowait so
 * that process accounting knows that this is a task in IO wait state.
 */          
void __sched io_schedule(void)
{
    io_schedule_timeout(MAX_SCHEDULE_TIMEOUT);
}            
EXPORT_SYMBOL(io_schedule);

long __sched io_schedule_timeout(long timeout)
{            
    int old_iowait = current->in_iowait;
    struct rq *rq; 
    long ret;

    current->in_iowait = 1; 
    if (old_iowait)
        blk_schedule_flush_plug(current);
    else 
        blk_flush_plug(current);

    delayacct_blkio_start();
    rq = raw_rq();
    atomic_inc(&rq->nr_iowait);
    ret = schedule_timeout(timeout);

    current->in_iowait = old_iowait;
    atomic_dec(&rq->nr_iowait);
    delayacct_blkio_end();

    return ret; 
}            
EXPORT_SYMBOL(io_schedule_timeout);

High CPU idle

As mentioned earlier, a lot of kernel blocking, i.e. TASK_UNINTERRUPTIBLE sleep, has nothing to do with waiting for disk IO: for example, lock contention in the kernel, sleeping in direct page reclaim of memory, or code paths in the kernel that deliberately block while waiting for a resource.
In the blog post [Linux Load Averages: Solving the Mystery] ( http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html ) mentioned above, Brendan Gregg uses perf to generate a flame graph of TASK_UNINTERRUPTIBLE sleeps, which shows how diverse the causes of high CPU idle can be. This article will not repeat the details.
[Figure: flame graph of TASK_UNINTERRUPTIBLE sleeps, from Brendan Gregg's blog post]
Therefore, analyzing high CPU idle essentially means analyzing which kernel code paths are the main cause of the blocking. Usually we can use perf inject to post-process the context switch events recorded by perf record, correlate the kernel code paths at which a process was switched out of the CPU and later switched back in, and generate a so-called Off-CPU flame graph (a sketch of this workflow follows below).
Of course, for simple problems such as lock contention, the Off-CPU flame graph alone is enough to pinpoint what went wrong in one step. But for the more complicated latency problems behind D-state blocking, the Off-CPU flame graph may only give us a starting point for the investigation.
For example, if we see in the Off-CPU flame graph that most of the sleep time is spent waiting in epoll_wait, then what we should investigate next is network stack latency, the Net Delay part of the big picture in this article.
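
A rough sketch of that Off-CPU workflow, following Brendan Gregg's perf-based recipe (the sched_stat_sleep tracepoint requires scheduler statistics to be enabled in the kernel, and rendering the flame graph assumes the FlameGraph scripts from https://github.com/brendangregg/FlameGraph ):

# Record scheduler sleep and context-switch events system-wide for 10 seconds
$ sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch \
       -e sched:sched_process_exit -a -g -o perf.data.raw sleep 10

# Merge the sched_stat and sched_switch events so each off-CPU stack carries its sleep time
$ sudo perf inject -v -s -i perf.data.raw -o perf.data

# Inspect the blocked stacks directly, or feed 'perf script' output into the
# FlameGraph stackcollapse/flamegraph scripts to render the Off-CPU flame graph
$ sudo perf report --stdio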
At this point you may have realized that analyzing CPU iowait and CPU idle is, in essence, latency analysis. Following the major resource-management areas of the kernel, the big picture breaks latency analysis down into six kinds of latency:

  • CPU latency
  • Memory latency
  • File system latency
  • IO stack latency
  • Network stack latency
  • Lock and synchronization primitive contention

Any TASK_UNINTERRUPTIBLE sleep caused by any of the code paths above is an object of our analysis!
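
One practical way to attribute such sleeps to code paths (a hedged sketch: it assumes the bcc tools are installed under /usr/share/bcc/tools, and that the --state bitmask value 2 corresponds to TASK_UNINTERRUPTIBLE) is bcc's offcputime:

# Sum off-CPU time of uninterruptible sleeps per kernel stack for 10 seconds
$ sudo /usr/share/bcc/tools/offcputime -K --state 2 10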

Limited by space, this article cannot expand on all the details involved. By this point you may have realized that Load analysis is really a comprehensive analysis of the load on the whole system, which is exactly why it is called System Load. This is also why it is hard to cover Load analysis completely in a single article.

(The End)


Source: blog.51cto.com/15015138/2555488