Song Baohua: Linux performance analysis with off-cpu flame graph

Original: Song Baohua, Linux Reading Field, 2019-12-22


In the article "Song Baohua: Flame Graph: Linux Performance Analysis from a Global View", we mainly looked at the on-cpu flame graph and learned how to analyze where the system's CPU time goes. In many cases, however, looking only at the on-cpu side (which code is consuming the CPU) does not solve the performance problem, because the bottleneck is not necessarily on the CPU; it may be in the off-cpu time, that is, the time a process spends not running on a CPU. For example:

  1. The process enters a system call to perform I/O, and the I/O is delayed
  2. The process waits for a mutex
  3. The process's memory has been swapped out, and it waits for the swap
  4. Memory is insufficient, and the process spends time in direct memory reclaim
  5. The process is preempted, or its time slice runs out, and it waits to be scheduled again (the runqueue is too long)

And so on.
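The idea behind the list above can be illustrated with a minimal Python sketch (not from the original article). It assumes a single-threaded process and treats off-cpu time as wall-clock time minus CPU time; the function names measure, sleeper and spinner are made up for illustration.

```python
import time


def measure(fn):
    """Run fn once and return (wall, cpu, off_cpu) in seconds."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time consumed by this process
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    # Wall-clock time that was not spent on a CPU was spent off-cpu
    # (sleeping, blocked on I/O or a lock, or waiting in the runqueue).
    return wall, cpu, max(wall - cpu, 0.0)


def sleeper():
    time.sleep(0.2)  # blocks in the kernel, similar to usleep() in C


def spinner():
    total = 0
    for i in range(2_000_000):  # pure computation, stays mostly on-cpu
        total += i
    return total


if __name__ == "__main__":
    for name, fn in (("sleeper", sleeper), ("spinner", spinner)):
        wall, cpu, off = measure(fn)
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s off-cpu={off:.3f}s")
```

For sleeper almost all of the wall-clock time should show up as off-cpu time, while for spinner most of it should be CPU time. This only gives a total; it does not say where the process was blocked, which is what the stack traces collected later in this article provide.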
Basically, the off-cpu state diagram is as follows (picture from:
http://www.brendangregg.com/offcpuanalysis.html )

[Figure: off-cpu state diagram]

For example, take an http server: if users generally feel that access gets slow as more of them log in, the bottleneck may be a slow network, slow hard disk reads, mutex contention, and so on. In this case on-cpu may not be the problem at all, and the main problem may lie in the off-cpu part. Off-cpu analysis is therefore also crucial to tuning performance.
Below we write the simplest possible program:
[Figure: source code of the test program, which repeatedly prints hello world with printf() and calls usleep()]
Compile it with gcc to get a.out. Suppose the performance goal we pursue is: the more hello world lines printed per second, the better, since that shows the process can serve more print requests. Of course, real-life programs are 10,000 times more complicated than this, but the example is enough to explain the principle.
Experimental environment:
Ubuntu 18.10 with kernel version 4.18. Use apt install to install bpfcc-tools, the tools of the BPF Compiler Collection (BCC), and git clone
https://github.com/brendangregg/FlameGraph
Below we collect the off-cpu time of a.out:

barry@barryUbuntu:~$ sudo offcputime-bpfcc  -K -p `pgrep -nx a.out`
Tracing off-CPU time (us) of PID 5593 by kernel stack... Hit Ctrl-C to end.

After pressing Ctrl-C to stop, we see two main off-cpu stack traces:
[Figure: offcputime-bpfcc output showing the two stack traces]
One occurs when usleep() enters the nanosleep system call and goes through hrtimer_nanosleep -> do_nanosleep. The other occurs during printf(): after entering the sys_write system call, tty_write calls n_tty_write, which waits for a mutex.
At this point we can look further at the Linux kernel code:
https://lxr.missinglinkelectronics.com/linux+v4.18/drivers/tty/n_tty.c#L2285
We think the delay appears at this place:

[Figure: kernel code excerpts from drivers/tty/n_tty.c]
So, if we want to print as many hello world lines as possible per second, we should obviously delete the usleep, then analyze why the mutex_lock takes so long and see whether there is room for optimization in the kernel.
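As a user-space analogy only (this is not the kernel's tty code), the following Python sketch shows how waiting for a lock turns into off-cpu time. The helper names timed_acquire and contended_wait are hypothetical, and time.thread_time() is assumed to be available (Python 3.7 or later on Linux).

```python
import threading
import time


def timed_acquire(lock):
    """Acquire lock and return (wall, cpu) seconds spent getting it."""
    wall_start = time.perf_counter()
    cpu_start = time.thread_time()  # CPU time of the calling thread only
    lock.acquire()
    cpu = time.thread_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def contended_wait(hold_seconds=0.2):
    """Let a helper thread hold a lock, then measure how long we wait for it."""
    lock = threading.Lock()
    held = threading.Event()

    def holder():
        with lock:
            held.set()                # signal that the lock is now taken
            time.sleep(hold_seconds)  # keep the lock for a while

    helper = threading.Thread(target=holder)
    helper.start()
    held.wait()                       # make sure the helper owns the lock first
    wall, cpu = timed_acquire(lock)   # blocks until the helper releases it
    lock.release()
    helper.join()
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = contended_wait()
    print(f"waited {wall:.3f}s for the lock, using {cpu:.3f}s of CPU")
```

The waiting thread should report a wait close to the hold time while using almost no CPU, which is the same pattern the off-cpu stack trace shows for the mutex in n_tty_write.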
To draw an off-cpu flame graph for the running a.out process, we first collect 30 seconds of off-cpu stack data into out.stacks:

sudo offcputime-bpfcc -df -p `pgrep -nx a.out` 30 > out.stacks
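With -f, offcputime writes folded output: one line per stack, with frames separated by semicolons, followed by a space and the total off-cpu time in microseconds; -d inserts a delimiter frame between the user and kernel stacks. Before drawing anything, such a file can be summarized with a short Python sketch like the one below. The sample lines are simplified and their values are made up for illustration; they are not output from the article's run.

```python
from collections import defaultdict

# Hypothetical, simplified sample in folded format. Frame names follow the
# stacks discussed in this article; the microsecond values are invented.
SAMPLE = """\
a.out;-;do_syscall_64;__x64_sys_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule 20000000
a.out;-;do_syscall_64;ksys_write;vfs_write;tty_write;n_tty_write;mutex_lock;schedule 5000000
"""


def parse_folded(text):
    """Parse folded stacks into a list of (frames, microseconds) tuples."""
    stacks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The value is the last space-separated field on the line.
        frames, _, value = line.rpartition(" ")
        stacks.append((frames.split(";"), int(value)))
    return stacks


def time_by_frame(stacks, frame_names):
    """Sum off-cpu microseconds of all stacks containing each given frame."""
    totals = defaultdict(int)
    for frames, us in stacks:
        for name in frame_names:
            if name in frames:
                totals[name] += us
    return dict(totals)


if __name__ == "__main__":
    # For real data, use: parse_folded(open("out.stacks").read())
    stacks = parse_folded(SAMPLE)
    totals = time_by_frame(stacks, ["do_nanosleep", "n_tty_write"])
    for name, us in totals.items():
        print(f"{name}: {us / 1e6:.1f} s off-cpu")
```

This gives only per-function totals; the flame graph shows the same data with the full call paths.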

Next, we enter the FlameGraph project directory that was cloned, and use flamegraph.pl to draw the flame graph:

./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us ~/out.stacks > output.svg

Open output.svg in an image viewer (a web browser works and supports clicking to zoom):

[Figure: off-cpu flame graph of a.out]
The figure also shows that there are two main causes of off-cpu time: one is nanosleep, and the other is the mutex used in n_tty_write after the write system call is entered.
Click on the write path to zoom in on that part of the stack trace:
[Figure: flame graph zoomed in on the write path]

This is consistent with the results of the text analysis above. To optimize performance, first eliminate the usleep, and second analyze why mutex_lock has to wait so long and what can be improved.
This is an introductory article on off-cpu flame graphs. If you are interested in the topic, you can refer to:
http://www.brendangregg.com/offcpuanalysis.html

(Finish)


Origin blog.51cto.com/15015138/2555511