Notes on tracking down a Docker performance problem (performance analysis with perf, iostat, and friends)

   Background   

My most recent project provides container management capabilities on top of OpenStack, enriching the company's IaaS platform. The daily work mainly consists of enhancing the open-source novadocker project (which the open-source community has stopped developing) and integrating it with the company's other business components.

Over the weekend we upgraded the IaaS platform for a downstream department, mainly to upgrade the underlying operating system. The basic use cases all passed without problems, and everyone was happy.

Then the bad news arrived as soon as I got to the office on Monday: performance testing had found that the throughput of the business process running in a container was 100 times lower than when it ran directly on the host, which made Monday even gloomier.

   Monday   

First I went to the downstream team to understand their business model. They said it had already been simplified. The current model is: the business process runs inside the container and, by sharing the host's IPC namespace, uses shared memory and semaphores to communicate with a daemon process on the host. The whole workflow involves no disk reads or writes, no network interaction, and so on.

Time to roll up my sleeves and get to work. The first step was to locate the bottleneck and figure out where the problem lay.

   top   

The first command I reached for was naturally top, a very powerful Linux command that shows basically all of the system's key metrics.

The numbers below refer to the more important indicators in a typical top display:

1 indicates the system load average, which reflects the number of processes that are running or waiting to run (on Linux it also counts tasks in uninterruptible sleep). When this value is lower than the number of vCPUs in the system (i.e. the number of hyperthreads), things are normal; once it exceeds the number of vCPUs, too many processes are competing for the CPU and some of them will not get CPU time for long stretches. The symptom users notice is that every command feels stuck.

2 indicates the total number of tasks in the system. If this value is very large, the load average is usually also high.

3 indicates the CPU idle percentage, which reflects how busy the CPU is: a high value means the CPU is fairly idle, a low value means it is busy. Note that sometimes this value is high (the CPU looks idle) while the load average is still high; that usually means there are too many processes and the process switching itself is eating the CPU time that the business would otherwise use.

4 indicates the time spent waiting for I/O. A high value suggests that the system's bottleneck may be the disk or the network.

5 indicates the system's free memory, which reflects memory usage.

6 indicates the CPU and memory usage of a single process. For a more detailed description of each field in top, see:

http://www.jb51.net/LINUXjishu/34604.html
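
As a quick sanity check for point 1, the comparison between load average and vCPU count can be done from the shell (a minimal sketch; the exact threshold depends on the workload):

    # number of logical CPUs (vCPUs) to compare the load average against
    nproc
    # 1/5/15-minute load averages
    uptime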

Checking the system with top, everything looked normal: the load average was not high, there were not many tasks, the CPU and memory were largely idle, even the I/O wait time was low, and no process showed high CPU or memory usage. Everything was harmonious, with no bottleneck in sight!

Of course, it is impossible for there to be no bottleneck. Since our containers are all pinned to dedicated cores, it was quite possible that the cores allocated to the container were busy while the large total number of cores pulled the overall CPU usage down. So I pressed the "1" key to switch to the per-CPU view:

In this mode you can see the usage of each vCPU. Still harmonious; strangely harmonious.

Nothing could be seen from the CPU side, so let's see whether the disk is the ghost.

iostat

iostat is the command used to view disk usage. Experience tells us that the disk and the network are the biggest suspects when performance drops.

When using iostat, you usually only need to look at the last column (%util), which reflects how busy the disk is. The downstream department had said that their use case is a pure in-memory scenario with no disk reads or writes at all, but if a customer's word could always be trusted, pigs would be climbing trees, so I still ran iostat to see how the disk was doing. The result was, again, very harmonious: disk utilization was essentially zero.
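
For reference, a typical invocation looks like this (a sketch; it assumes the sysstat package is installed and the device names will differ per machine):

    # extended disk stats, refreshed every second, 3 samples; %util is the last column
    iostat -x 1 3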

Later I also checked the network metrics and found that there really was no network throughput. At that point it seemed the problem was not that simple after all.
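
The network check was along similar lines; one common way to watch per-interface throughput is sar from the sysstat package (a sketch, assuming sysstat is available):

    # per-interface rx/tx throughput, sampled every second, 3 samples
    sar -n DEV 1 3
    # interface counters can also be read directly
    ip -s link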

   Tuesday   

Although Monday seemed like a whole day of busy work for nothing, one important conclusion had been reached: this problem is not simple! Analysis at the resource level was clearly going nowhere, so it was time to bring out the heavy artillery of performance analysis: perf.

perf + flame graph

perf is a very powerful performance analysis tool on Linux; with it you can analyze where a running process actually spends its time.

I had not used perf much before, so when I started analyzing I naturally went with the most common combination, perf + flame graph:

  1. Install perf.

    yum install perf

  2. Download the flame graph tool.

    git clone https://github.com/brendangregg/FlameGraph.git

  3. sampling.

    perf record -e cpu-clock -g -p 1572 (business process id)

    After a period of time (usually 20s is enough), ctrl+c ends sampling.

  4. Use the perf script tool to parse perf.data.

    perf script -i perf.data &> perf.unfold

    PS: If the program running in the container has many dependencies, the symbols parsed by this command may contain a lot of "Unregistered symbol..." errors. In that case you need to point perf at the container's rootfs with the --symfs parameter to resolve the symbols (see the example after these steps). How to obtain the container's rootfs depends on Docker's storage driver: with devicemapper you can find the rootfs location via docker inspect; with overlay you need to export the container's rootfs with docker export; a rich container usually has an external rootfs that can be used directly.

  5. Fold the symbols in perf.unfold.

    ./stackcollapse-perf.pl perf.unfold &> perf.folded

  6. Finally, generate the svg image.

    ./flamegraph.pl perf.folded > perf.svg
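
For the --symfs case mentioned in step 4, the parsing step might look like this (a sketch; the rootfs path is a placeholder that depends on your storage driver):

    # resolve symbols against the container's rootfs instead of the host's
    perf script -i perf.data --symfs /path/to/container/rootfs &> perf.unfold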

The end result is a flame graph. If some functions in the program take up a lot of CPU time, they appear as long horizontal bars, and the longer the bar, the more CPU time that function consumes.

However, perf + flame graph did not help much this time. No matter how many times I repeated the sampling, the long horizontal bar of my dreams never showed up. Everything was still very harmonious.

perf stat

Since perf + flame graph did not work out, I wanted to switch tools and keep trying, but after searching around and consulting the gurus I could not find anything better, so I had to keep digging into perf itself.

Besides the record (record events) and script (parse recorded events) commands used above, perf has other subcommands. The commonly used ones are report (like script, it parses the events recorded by perf record; the difference is that report directly analyzes the hot spots in the program, while script is more extensible and can call external scripts to process the event data), stat (count how many times events are triggered by a process over a period of time), top (analyze a running program's hot spots in real time), and list (list the events perf can record).
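
As a quick reference, these subcommands are used roughly like this (the pid is a placeholder):

    perf report -i perf.data    # browse the hot spots recorded by perf record
    perf top -p <pid>           # watch a running process's hot spots in real time
    perf list                   # list the events perf can record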

After trying these commands one by one, I finally got a breakthrough with perf stat:

I used perf stat to profile the business process twice, once running on the physical machine and once running in the container, and found that when the business process runs in the container, the counts of most events (task-clock, context-switches, cycles, instructions, and so on) are only about one percent of what they are on the physical machine.
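
The comparison was along these lines, run once against the process on the host and once against the one in the container (a sketch; the pid and the 20-second window are placeholders):

    # count scheduling and CPU events of an existing process for 20 seconds
    perf stat -e task-clock,context-switches,cycles,instructions -p <pid> sleep 20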

What could cause this? Something must be blocking the program from running. But what?

We had already ruled out the disk, the network, the memory, and the CPU. What else is there??

   Wednesday   

What is blocking the program? I couldn't figure it out, couldn't figure it out, still couldn't figure it out. But the work had to go on, so I fell back on the control-variable method.

What is different between a program running in a container and one running on a physical machine? We know that a docker container = cgroup + namespace + seccomp + capability + selinux, so let's strip these technologies away one by one and see which one is responsible.

Of the five technologies, the last three are security-related and can be turned off with switches (as sketched below). Testing showed that performance was still terrible with all three of them off, so those three are innocent. That leaves cgroup and namespace to investigate.
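
For reference, those switches correspond roughly to the following docker run flags (a sketch; the image and command are placeholders):

    # disable seccomp filtering and SELinux labeling, and grant all capabilities
    docker run --security-opt seccomp=unconfined \
               --security-opt label=disable \
               --cap-add ALL \
               <image> <command>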

The first suspect was, of course, cgroup. After all, it is the one that limits resources, and a wrong limit could easily throttle the business by accident.

cgexec

cgexec is a tool from the cgroup toolset that launches a program directly inside a given cgroup. So we can run the business program on the physical machine, but place it in the business container's cgroup, and see whether its performance drops.

The specific usage is as follows:

cgexec -g *:/system.slice/docker-03c2dd57ba123879abab6f7b6da5192a127840534990c515be325450b7193c11.scope ./run.sh

With this command, run.sh runs in the same cgroup as container 03c2dd57. After many tests, the business process ran at full speed in this setup, so cgroup was cleared of suspicion. Only one truth remains: the culprit is the namespace.
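
To double-check that the process really lands in the container's cgroup, something like this can be used (a sketch; the container name and ID are placeholders):

    # the full container ID used in the docker-<id>.scope name
    docker inspect --format '{{.Id}}' <container>
    # print the cgroup membership of a process started through cgexec
    cgexec -g *:/system.slice/docker-<full id>.scope cat /proc/self/cgroup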

   Thursday   

Although the culprit has been identified as the namespace, the namespace family has many members, including the IPC namespace, the PID namespace, the mount namespace, and so on, so the actual culprit still needed to be pinned down.

nsenter

nsenter is a namespace tool that lets you enter the namespaces a given process is in. Before docker exec appeared, it was the only way to get inside a Docker container; after docker exec appeared, nsenter remained an extremely important tool for Docker troubleshooting because it lets you choose exactly which namespaces to enter.

With the following command, you can enter the mount namespace where the container is located.

nsenter --target $(docker inspect --format '{{.State.Pid}}' <container id>) --mount bash

Similarly, the following command enters the container's IPC and PID namespaces.

nsenter --target $(docker inspect --format '{{.State.Pid}}' <container id>) --ipc --pid bash

By repeatedly running the business process in different combinations of the container's namespaces, the real culprit was finally pinned down: the mount namespace. The tests showed that as soon as the business process is placed in the container's mount namespace, its performance drops sharply.
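
The decisive test looked roughly like this: start the business program on the host, but inside only the container's mount namespace (a sketch; run.sh stands for the business start script used earlier):

    # enter only the container's mount namespace, then launch the business program
    nsenter --target $(docker inspect --format '{{.State.Pid}}' <container id>) --mount ./run.sh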

Why? Why? What does the mount namespace actually do, and how could it possibly have such a big impact?

It was already 12 o'clock when I left the office, and the expert at the next desk was still on a conference call: "I see, the frame is still there at this node, but by the time it gets here, it's gone, it...", "So is that what's causing the packet loss?", "Don't worry, let me take a look at this node."

In the taxi home I kept turning the problem over; it had been four days already. I thought about how even the expert at the next desk, who is practically a god already, still runs into problems this tough. Suddenly a sense of collapse washed over me, and I barely held back the tears, afraid that otherwise I would really scare the taxi driver.

   Friday   

When I woke up in the morning I talked it over with my girlfriend, telling her I had run into a big problem, the hardest I had seen in years. The last time I met a problem at this level was in the lab during my senior year, and that one took a whole week to solve. If I couldn't crack this one today, I would break that record.

That time I had to set up a Eucalyptus cluster in the lab. Every previous build had gone smoothly, but that one was strange: no matter what I did, it would not come up. After a week of digging I found that the server's BIOS clock had been reset, so during installation Eucalyptus saw that the current time was outside its valid range (too early) and the installation kept failing.

Back to the problem at hand: why is the mount namespace so powerful? What exactly does it affect? I really couldn't figure it out, so I went to ask a guru. He thought for a moment and replied: try ldd?

ldd

What is ldd?

Of course I didn't ask the guru that out loud; I turned around and looked it up myself.

ldd stands for List Dynamic Dependencies, i.e. it lists a program's dynamic library dependencies. Then it clicked: the mount namespace isolates the container's file system, so the dependency libraries used inside and outside the container can be different, and different dependency libraries can cause all kinds of effects.

So I used ldd to compare the libraries the business process depends on inside the container with those on the host, and finally found that the glibc version inside the container was different from the glibc version on the host. That was very likely the cause of the performance drop.
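
The comparison was along these lines (a sketch; the binary path is a placeholder):

    # dependencies of the business binary as seen from the host
    ldd /path/to/business_binary | grep libc
    # the same binary as seen from inside the container's mount namespace
    nsenter --target $(docker inspect --format '{{.State.Pid}}' <container id>) --mount \
        ldd /path/to/business_binary | grep libc
    # glibc versions on the host and inside the container
    ldd --version | head -1
    docker exec <container id> ldd --version | head -1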

After replacing the glibc in the container with the same version as the host's, the performance of the business in the container finally recovered, confirming the conjecture.

Time to knock off, go home, and relax!

On my way out I passed the expert at the next desk, who had just gotten up to fetch some water, and asked casually, "How is it going, Brother Ping, have you found the missing frame?" "Ah, don't ask. Still looking."

   Postscript   

Why does an inconsistent glibc version inside and outside the container lead to performance degradation?

This comes back to the business model. As mentioned earlier, the downstream business shares the host's IPC namespace and uses shared memory and semaphores to communicate with the daemon process on the host. In one glibc upgrade, the in-memory data structure of the semaphore was changed, so when the two sides of the shared-memory communication use different glibc versions the data layouts do not match, every semaphore operation ends up timing out, and the program's efficiency collapses.

 
