[Linux] 22. CPU evaluation indicators, performance tools, locating bottlenecks, optimization methodologies: applications and systems

1. Indicators for evaluating CPU

1.1 CPU usage

CPU usage is the percentage of total CPU time that is not idle. Depending on what the CPU is running, it is broken down into user CPU, system CPU, iowait, soft interrupts, hard interrupts, and so on; a command-level sketch follows the list below.

  • User CPU usage, including user mode CPU usage (user) and low-priority user mode CPU usage (nice), represents the percentage of time the CPU is running in user mode. High user CPU usage usually indicates a busy application.
  • System CPU usage, which represents the percentage of time the CPU is running in kernel mode (excluding interrupts). High system CPU usage indicates that the kernel is busy.
  • CPU usage waiting for I/O, also commonly called iowait, represents the percentage of time spent waiting for I/O. A high iowait usually indicates that the I/O interaction time between the system and the hardware device is relatively long.
  • The CPU usage of soft interrupts and hard interrupts represents the percentage of time the kernel spends in soft-interrupt and hard-interrupt handlers, respectively. High values usually indicate that the system is handling a large number of interrupts.
  • Beyond these, there are two metrics used in virtualized environments: steal CPU usage (steal) and guest CPU usage (guest). They represent, respectively, the percentage of CPU time stolen by other virtual machines on the same host and the percentage of CPU time spent running guest virtual machines.
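
To see this breakdown in practice, here is a minimal sketch using mpstat from the sysstat package (assuming it is installed) and the raw counters in /proc/stat:

```bash
# Per-CPU usage split into the categories above:
# %usr/%nice = user, %sys = system, %iowait = waiting for I/O,
# %irq/%soft = hard/soft interrupts, %steal/%guest = virtualization.
mpstat -P ALL 1 5

# The same split system-wide, as raw tick counters:
# user nice system idle iowait irq softirq steal guest guest_nice
head -1 /proc/stat
```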

1.2 Load Average

Load average is the average number of active processes in the system, that is, processes that are either runnable or in the uninterruptible state. It reflects the overall load of the system and is reported as three values: the averages over the past 1 minute, 5 minutes, and 15 minutes.

Ideally, the load average equals the number of logical CPUs, meaning every CPU is exactly fully utilized. A load average higher than the number of logical CPUs means the system is overloaded.
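
A quick way to compare the load average with the number of logical CPUs (the sample output is illustrative):

```bash
# The last three numbers are the 1-, 5- and 15-minute load averages.
uptime
#  10:32:41 up 3 days,  2:01,  1 user,  load average: 0.63, 0.83, 0.88

# Number of logical CPUs, for comparison.
nproc
```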

1.3 Context switching

Process context switching, including:

  • Voluntary context switches due to inability to obtain resources;
  • Involuntary context switches, forced by the system's scheduler (for example, when a time slice expires).

Context switching is a core kernel function that keeps Linux running normally. Excessive context switching, however, spends the CPU time of the running processes on saving and restoring registers, kernel stacks, virtual memory, and other state, which shortens the time processes actually spend doing work and can become a performance bottleneck.
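
To observe context switching, a small sketch using vmstat and pidstat:

```bash
# System-wide: cs = context switches per second, in = interrupts per second.
vmstat 1

# Per process: cswch/s = voluntary, nvcswch/s = involuntary switches;
# add -t to also show individual threads.
pidstat -w 1
pidstat -wt 1
```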

1.4 CPU cache hit rate

CPUs have evolved much faster than memory, so a CPU's processing speed far exceeds the access speed of main memory; whenever the CPU accesses memory, it inevitably has to wait for memory to respond. To bridge this huge performance gap, the CPU cache (usually a multi-level cache) was introduced.

The CPU cache sits between the CPU and main memory in speed and holds hot memory data. By size, these caches are divided into three levels: L1, L2, and L3. L1 and L2 are typically private to each core, while L3 is shared among multiple cores.

From L1 to L3, the cache size increases level by level, and access speed decreases accordingly (though it is still much faster than memory). The cache hit rate measures how well the CPU cache is reused: the higher the hit rate, the better the performance.
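
One way to estimate the cache hit rate is perf stat with hardware cache events; a sketch, where ./your_app and <PID> are placeholders and event availability depends on the CPU:

```bash
# Misses divided by references approximates the (last-level) cache miss rate;
# the hit rate is 1 minus that ratio.
perf stat -e cache-references,cache-misses ./your_app

# Or observe an already running process for 10 seconds.
perf stat -e cache-references,cache-misses -p <PID> -- sleep 10
```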

2. Performance Tools

After mastering the CPU performance indicators, we also need to know how to obtain these indicators, that is, the use of tools.

Do you remember which tools were used in the earlier cases? Let's review the CPU performance tools case by case.

First, the load-average case. We first used uptime to check the system's load average; once it had risen, we used mpstat and pidstat to observe the usage of each CPU and of each process's CPU, and traced the rise back to its cause: the processes started by our stress-testing tool.

Second, the context-switching case. We first used vmstat to check the system's context-switch and interrupt counts; then we used pidstat to observe each process's voluntary and involuntary context switches; finally we used pidstat with its thread option to observe per-thread context switching, and found that the root cause of the surge in switches was our benchmarking tool, sysbench.

The third case is a process with rising CPU usage. We first used top to check CPU usage at the system and process level and found that the process whose usage had risen was php-fpm; then we used perf top to observe php-fpm's call chain and finally traced the increase to the library function sqrt().
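
The call-chain inspection from this case can be reproduced roughly like this (php-fpm is the process name from the case above):

```bash
# System-wide sampling with call chains; find php-fpm in the output and expand it.
perf top -g

# Or restrict sampling to the php-fpm processes (comma-separated PID list).
perf top -g -p $(pgrep -d, php-fpm)
```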

The fourth case is rising system-wide CPU usage. We first used top to observe the increase, but neither top nor pidstat revealed a process with high CPU usage. So we re-examined top's output and started from the processes that showed low CPU usage yet were in the Running state. Finally, through perf record and perf report, we found that short-lived processes were the culprit.

In addition, for short-lived processes, I also introduced a dedicated tool, execsnoop, which monitors in real time the external commands that processes execute.
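
execsnoop comes from the bcc/BPF tool collection (there is also an older ftrace-based version in perf-tools); on Debian/Ubuntu the bcc version is usually installed as execsnoop-bpfcc. A minimal run:

```bash
# Print every newly exec'ed process in real time, with PID, parent and command line.
execsnoop            # bcc tools installed under their plain names
execsnoop-bpfcc      # typical name on Debian/Ubuntu when installed via bpfcc-tools
```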

Fifth, the case of uninterruptible and zombie processes. We first used top to observe the rising iowait and found a large number of uninterruptible and zombie processes; dstat then showed that disk reads were responsible, so we used pidstat to find the related processes. An attempt to trace their system calls with strace failed, so we finally analyzed the call chain with perf and found that the root cause was direct disk I/O.

The last one is the soft-interrupt case. With top we observed that the system's soft-interrupt CPU usage had risen; checking /proc/softirqs revealed several soft interrupts whose counters were changing rapidly; the sar command then pointed to a network-packet problem, and finally tcpdump identified the type and source of the network frames, showing that the cause was a SYN flood attack.
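
The drill-down in this case maps to commands like the following (the interface eth0 and port 80 are assumptions; adjust them to your environment):

```bash
# Watch which soft-interrupt counters change fastest (changes highlighted by -d).
watch -d cat /proc/softirqs

# Per-interface network receive rate (rxpck/s, rxkB/s).
sar -n DEV 1

# Capture the suspicious traffic to identify frame type and source.
tcpdump -i eth0 -n tcp port 80
```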

By this point you are probably dizzy: in just a few cases we have used more than a dozen CPU performance tools, each with its own applicable scenarios. How do you tell so many tools apart, and how do you choose among them in real performance analysis?

My experience is to understand them from two different dimensions and learn and apply them flexibly.

2.1 First dimension: start from the CPU performance indicator — when you look at a given indicator, know which tools can observe it

Classify and understand the performance tools according to the indicators they provide. That way, when you actually troubleshoot a performance problem, you know exactly which tools can supply the indicator you want, instead of trying tools one by one and hoping for luck.

In fact, I have used this idea many times in previous cases. For example, after using top to discover that soft interrupt CPU usage is high, the next step is to know the specific soft interrupt type. So where can I observe the operation of various soft interrupts? Of course it is the file /proc/softirqs in the proc file system.

Next, for example, if the soft interrupt type we find is network reception, then we must continue to think in the direction of network reception. What is the system's network reception like? What tools can check the network reception status? In our case, dstat is used.

You do not need to memorize all the tools, but if you understand which tools correspond to each indicator, you will use them far more efficiently and flexibly. In the original article this mapping is summarized in a table of the tools that provide each CPU performance indicator, which you can use as a "metric → tool" reference.

2.2 Second dimension: start from the tool — once a tool is installed, know which indicators it can provide

This matters a great deal in real environments, especially in production, because you often do not have permission to install new packages and must make the most of the tools already on the system — which requires knowing them well enough.

As for how each tool is used: most of them support a wealth of options. Don't worry, you do not have to memorize them all. You only need to know which tools exist and what they basically do; when you actually need one, consult its manual with the man command.

Similarly, the original article summarizes these commonly used tools in a second table that maps each tool to the indicators it provides; you can use it as a "tool → metric" reference and simply look things up when needed.

3. How to quickly analyze CPU performance bottlenecks

I believe that by this point, you are already very familiar with CPU performance indicators, and you also know what tools can be used to obtain each performance indicator.

Does that mean that every time you encounter a CPU performance problem, you have to run all the above tools and then analyze all CPU performance indicators?

You probably feel that this exhaustive approach is rather naive. But don't laugh: it is exactly what I did at first. Checking all the indicators and analyzing them together can indeed uncover the system's potential bottlenecks.

But this method is terribly inefficient. It is time-consuming and labor-intensive, and faced with such a huge set of indicators you may overlook some detail and waste the whole effort. I have suffered this way more than once.

Therefore, in actual production environments, we usually want to locate the bottleneck of the system as quickly as possible, and then optimize performance as quickly as possible, that is, we must solve performance problems quickly and accurately.

So is there any way to quickly and accurately find the bottleneck of the system? The answer is yes.

Although there are many CPU performance indicators, you must know that since they all describe the CPU performance of the system, they are not completely isolated. There is a certain correlation between many indicators. To understand the correlation of performance indicators, you need to understand how each performance indicator works. This is why when I introduce each performance indicator, I have to explain the relevant system principles. I hope you can remember this.

For example, if the user CPU usage is high, we should check the user mode of the process instead of the kernel mode. Because the user CPU usage reflects the CPU usage in user mode, while the CPU usage in kernel mode will only be reflected in the system CPU usage.

You see, with this basic understanding, we can narrow the scope of the investigation and save time and effort.

Therefore, to narrow down the scope of troubleshooting, I usually start by running the few tools that expose the most indicators: top, vmstat, and pidstat. Why these three? The diagram in the original article makes it clear.

That diagram lists the important CPU indicators provided by top, vmstat, and pidstat, and uses dotted lines to mark the correlations between them, which point to the next step of the analysis.

From it you can see that these three commands cover almost all the important CPU performance indicators. For example (a first-pass command sketch follows the list):

  • From the output of top, you can get the various CPU usage figures, as well as zombie processes and the load average.
  • From the output of vmstat, you can get the number of context switches, the number of interrupts, and the numbers of running and uninterruptible processes.
  • From the output of pidstat, you can get each process's user and system CPU usage, as well as its voluntary and involuntary context switches.
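
A first pass with these three tools might look like the following sketch (the one-second intervals are illustrative):

```bash
top              # overall CPU usage, load average, process states (watch for R, D, Z)
vmstat 1         # cs/in (context switches, interrupts), r/b (running, uninterruptible)
pidstat -u 1     # per-process user (%usr) and system (%system) CPU usage
pidstat -w 1     # per-process voluntary and involuntary context switches
```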

Moreover, many of the indicators output by these three tools are related to one another, which is what the dotted lines in the diagram represent. A few examples may make this easier to understand.

  • The first example is that the process user CPU usage output by pidstat increases, which will cause the user CPU usage output by top to increase. Therefore, when you find that there is a problem with the user CPU usage output by top, you can compare it with the output of pidstat to see whether a certain process is causing the problem.
    • After identifying the process causing the performance problem, you need to use process analysis tools to analyze the behavior of the process, such as using strace to analyze system calls, and using perf to analyze the execution of functions at all levels in the call chain.
  • In the second example, if the average load output by top increases, you can compare it with the running status output by vmstat and the number of processes in the uninterruptible state to observe which process is causing the increase in load.
    • If the number of uninterruptible processes increases, then I/O analysis needs to be done, that is, using tools such as dstat or sar to further analyze the I/O situation.
    • If the number of running processes increases, you need to go back to top and pidstat to find out what processes are running, and then use process analysis tools for further analysis.
  • As a last example, when you find that the soft-interrupt CPU usage reported by top has risen, you can check the changes of the various soft-interrupt types in /proc/softirqs to determine which one is the problem. If, say, it turns out to be network receive soft interrupts, you can continue the analysis with the network tools sar and tcpdump.

4. Performance Optimization Methodology

After we have gone through a lot of hard work and used various performance analysis methods to finally find the bottleneck that caused the performance problem, should we start optimizing immediately? Don’t worry, before taking action, you can take a look at the following three questions.

  • First of all, since we need to optimize performance, how do we judge whether it is effective? Especially after optimization, how much performance can be improved?
  • Second, performance problems are usually not independent. If multiple performance problems occur at the same time, which one should you optimize first?
  • Third, there is rarely only one way to improve performance. When several methods are available, which one do you choose? Is it always right to pick the one that maximizes the performance gain?

If you can answer these three questions easily, then you can go ahead and optimize without hesitation.

For example, in the previous case of uninterruptible processes, through performance analysis, we found that the direct I/O of a process caused iowait to be as high as 90%. Is it possible to immediately optimize by using the method of "replacing direct I/O with cached I/O"?

According to what was said above, you can first think about those three points yourself. If you're not sure, let's take a look.

  • The first question is that replacing direct I/O with cached I/O can reduce iowait from 90% to close to 0, and the performance improvement is obvious.
  • The second question is that we found no other performance issues. Direct I/O is the only performance bottleneck, so there is no need to select optimization objects.
  • The third question is that cached (buffered) I/O is the simplest optimization method available here, and it does not affect the application's functionality.

Okay, these three questions are easy to answer, so there’s no problem optimizing right away.

However, many real-life situations are not as simple as the example I gave. Performance evaluation may have multiple indicators, and performance problems may occur simultaneously. Moreover, optimizing the performance of one indicator may lead to a decrease in the performance of other indicators.

So, what should we do in the face of this complex situation?

Next, we will analyze these three issues in depth.

4.1 Evaluate the performance optimization effect

First, let’s look at the first question, how to evaluate the effect of performance optimization.

Our purpose in solving performance problems is naturally to achieve a performance improvement effect. In order to evaluate this effect, we need to quantify the performance indicators of the system, test the performance indicators before and after optimization, and use the changes in the indicators before and after to compare the performance. I call this method the "three-step" approach to performance evaluation.

  • Determine quantitative indicators of performance.
  • Measure the performance indicators before optimization.
  • Measure the performance indicators after optimization.

Let's look at the first step. There are many quantitative performance indicators, such as CPU usage, application throughput, and client request latency, and all of them can be used to evaluate performance. So which indicators should we choose?

My suggestion is not to limit yourself to a single dimension. At the very least, choose indicators from both the application dimension and the system-resource dimension. Take a web application as an example:

  • From the application dimension, we can use throughput and request latency to evaluate the application's performance.
  • From the system-resource dimension, we can use CPU usage to evaluate the system's CPU utilization.

The reason for choosing indicators from these two different dimensions is that application and system-resource indicators complement each other.

  • Good application performance is the ultimate goal and result of performance optimization; system optimization always serves the application. So application indicators are needed to evaluate the overall effect of the optimization.
  • System resource usage is the root source of application performance problems. So system-resource indicators are needed to observe and analyze where the bottleneck comes from.

As for the next two steps, it is mainly to compare the performance before and after optimization and present the effect more intuitively. If your first step is to select multiple indicators from two different dimensions, then during performance testing, you need to obtain the specific values ​​of these indicators.

Taking the web application above as an example, we can use a tool such as ab to measure its concurrent request throughput and response latency, and at the same time use performance tools such as vmstat and pidstat to observe the CPU usage of the system and of the processes. This gives us indicator values in both the application and system-resource dimensions.
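
A sketch of such a test for the web application, where the URL, concurrency, and request count are placeholders:

```bash
# Application dimension: requests per second and latency, measured with ApacheBench.
ab -c 10 -n 10000 http://192.168.0.10/

# System-resource dimension: run on the server machine while the test is in progress.
vmstat 1
pidstat -u 1
```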

However, there are two particularly important points that you need to pay attention to when performing performance testing.

  • First, make sure the performance-testing tool itself does not interfere with the application's performance. Usually, for web applications, the testing tool and the target application should run on different machines.
    • For example, in the previous Nginx case, I always emphasized the need to use two virtual machines, one of which runs the Nginx service and the other runs the tool that simulates the client, just to avoid this effect.
  • Second, avoid changes in the external environment from affecting the evaluation of performance indicators. This requires that the applications before and after optimization run on machines with the same configuration, and their external dependencies must be completely consistent.

For example, with Nginx, this means running the before-and-after tests on machines with the same configuration and driving them with the same client tool and parameters.

4.2 Multiple performance problems exist at the same time, how to choose?

Let's look at the second question. As mentioned at the beginning, performance in a system is holistic, so performance problems usually do not exist in isolation. When multiple performance problems occur at the same time, which should be optimized first?

In the performance field there is a widely quoted "80/20 rule": 80% of the problems are caused by 20% of the code, and finding that 20% lets you fix 80% of the problems. What I want to stress is that not every performance problem is worth optimizing.

My suggestion is to think before you act. Analyze all of the performance problems first, find the most important one — the one whose fix will improve performance the most — and start there. The benefit is twofold: the gain from optimization is the largest, and the remaining problems may no longer need fixing at all once the performance target is met.

The key is how to determine which performance issue is the most important. This is actually still the core problem that our performance analysis needs to solve, but the object to be analyzed here has changed from one problem to multiple problems. The idea is actually still the same.

So you can still use the method described earlier: analyze the problems one by one and find each one's bottleneck. Once all of them have been analyzed, use the cause-and-effect relationships between them to eliminate the problems that are merely consequences of others, and then optimize what remains.

If there are still several remaining problems, you have to conduct performance testing separately. After comparing different optimization effects, choose the problem that can significantly improve performance to fix. This process usually takes a lot of time. Here, I recommend two methods that can simplify this process.

First, if a system resource has hit its limit, for example CPU usage has reached 100%, then the first thing to optimize must be that resource's usage. Only after the system resource bottleneck is resolved should the other problems be considered.

Second, across different types of indicators, first optimize the problems whose indicators changed the most when the bottleneck appeared. For example, if after a bottleneck emerges the user CPU usage rises by 10% while the system CPU usage rises by 50%, optimize the system CPU usage first.

4.3 When there are multiple optimization methods, how to choose?

Next, let’s look at the third question. When multiple methods are available, which one should be chosen? Is the method that maximizes performance necessarily the best?

Under normal circumstances, of course we want to choose the method that can maximize performance, which is actually the goal of performance optimization.

But be aware that real-world considerations are not that simple. Most obviously, performance optimization is not free: it usually increases complexity, reduces the program's maintainability, and may even cause problems in other indicators while one indicator is being optimized. That is, optimizing one indicator may make another indicator's performance worse.

A very typical example is DPDK (Data Plane Development Kit) which I will talk about in the network part. DPDK is a method to optimize network processing speed. It improves network processing capabilities by bypassing the kernel network protocol stack.

However, it has a very demanding requirement: it must dedicate a CPU and a certain number of memory hugepages exclusively to packet processing, and that CPU always runs at 100% usage. So if you have very few CPU cores, the cost can outweigh the benefit.

Therefore, when choosing a performance optimization method, you have to weigh many factors. Remember: don't try to "reach the summit in one step" by solving every problem at once, and don't blindly copy optimizations from other applications without your own thinking and analysis.

5. CPU optimization

After clarifying the three most basic issues of performance optimization, let's look at how to reduce CPU usage and improve the parallel processing capability of the CPU from the perspective of applications and systems.

5.1 Application optimization

First of all, from an application perspective, the best way to reduce CPU usage is of course to eliminate all unnecessary work and retain only the core logic. For example, reducing loop levels, reducing recursion, reducing dynamic memory allocation, etc.

In addition, there are many other ways to optimize application performance. I have listed the most common ones below for you to note (a small compile-and-compare sketch follows the list).

  • Compiler optimization: Many compilers will provide optimization options. If you enable them appropriately, you can get help from the compiler during the compilation stage to improve performance. For example, gcc provides the optimization option -O2, which will automatically optimize the application code when turned on.
  • Algorithm optimization: using algorithms of lower complexity can speed up processing significantly. For example, for large data sets, O(n log n) sorting algorithms (such as quicksort or merge sort) can replace O(n^2) algorithms (such as bubble sort or insertion sort).
  • Asynchronous processing: Using asynchronous processing can avoid the program from being blocked waiting for a certain resource, thereby improving the program's concurrent processing capabilities. For example, replacing polling with event notification can avoid the problem of polling consuming CPU.
  • Multi-threading instead of multi-process: As mentioned earlier, compared to process context switching, thread context switching does not switch the process address space, so the cost of context switching can be reduced.
  • Make good use of cache: Frequently accessed data or steps in the calculation process can be cached in memory, so that they can be obtained directly from memory the next time they are used, speeding up program processing.
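
As an illustration of the compiler-optimization item, here is a sketch that builds the same (placeholder) source file with and without -O2 and compares run time:

```bash
# app.c is a placeholder for your own source file.
gcc -O0 -o app_O0 app.c
gcc -O2 -o app_O2 app.c

# Compare wall-clock run time of the two builds.
time ./app_O0
time ./app_O2
```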

5.2 System optimization

From a system perspective, to optimize the operation of the CPU, on the one hand, we must make full use of the locality of the CPU cache to accelerate cache access; on the other hand, we must control the CPU usage of the process and reduce the mutual influence between processes.

Specifically, there are many system-level CPU optimization methods. Here I also list the most common ones for easy reference (command-level sketches follow the list).

  • CPU binding: Binding a process to one or more CPUs can improve the CPU cache hit rate and reduce context switching problems caused by cross-CPU scheduling.
  • CPU exclusive: Similar to CPU binding, CPUs are further grouped and processes are allocated to them through the CPU affinity mechanism. In this way, these CPUs are exclusively occupied by the specified process. In other words, other processes are not allowed to use these CPUs.
  • Priority adjustment: use nice to adjust a process's priority; positive values lower it and negative values raise it. Appropriately lowering the priority of non-core applications and raising that of core applications ensures the core applications get the CPU first.
  • Set resource limits for processes: Use Linux cgroups to set an upper limit on the CPU usage of a process, which can prevent system resources from being exhausted due to problems with an application itself.
  • NUMA (Non-Uniform Memory Access) optimization: A processor that supports NUMA will be divided into multiple nodes, and each node has its own local memory space. NUMA optimization actually allows the CPU to access only local memory as much as possible.
  • Interrupt load balancing: Whether soft or hard interrupts, their interrupt handlers can consume a lot of CPU. By enabling the irqbalance service or configuring smp_affinity, the interrupt processing process can be automatically load balanced to multiple CPUs.
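
Command-level sketches of several of these methods follow; the PID 1234, the IRQ number 30, and the cgroup paths are placeholders, and the cgroup example assumes the v1 cpu controller (cgroup v2 uses cpu.max instead):

```bash
# CPU binding: pin process 1234 to CPUs 0 and 1.
taskset -cp 0,1 1234

# Priority adjustment: start a non-critical job with low priority, raise another.
nice -n 19 ./batch_job
renice -n -5 -p 1234        # negative values need root

# Resource limit via cgroup v1: cap the group at half of one CPU.
mkdir /sys/fs/cgroup/cpu/myapp
echo 50000  > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_period_us
echo 1234   > /sys/fs/cgroup/cpu/myapp/cgroup.procs

# NUMA optimization: keep both CPU and memory on node 0.
numactl --cpunodebind=0 --membind=0 ./app

# Interrupt load balancing: let irqbalance spread IRQs, or pin IRQ 30 to CPU 0.
systemctl start irqbalance
echo 1 > /proc/irq/30/smp_affinity      # hex CPU mask; 1 = CPU 0
```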

5.3 Avoid premature optimization

Having mastered the optimization methods above, I suspect many people will be tempted to apply them in development even before any performance bottleneck has been found.

However, I think you must have heard of Knuth's famous saying, "Premature optimization is the root of all evil." I also strongly agree with this. Premature optimization is not advisable.

Because, on the one hand, optimization will increase complexity and reduce maintainability; on the other hand, requirements are not static. Optimizations for the current situation may not adapt to rapidly changing new needs. In this way, when new requirements arise, these complex optimizations may hinder the development of new functions.

Therefore, performance optimization is best done gradually and dynamically. Instead of trying to get everything done in one step, first make sure the current performance requirements are met. Then, when performance falls short of the requirements or a bottleneck appears, use the results of performance evaluation to pick the most important problems and optimize those.

Be sure to resist the urge to "optimize CPU performance to the extreme", because the CPU is not the only performance factor; there are many other potential bottlenecks, such as memory, the network, I/O, and even architectural design.

Without comprehensive analysis and testing, simply improving a certain indicator to the extreme may not necessarily bring overall benefits.

Origin blog.csdn.net/jiaoyangwm/article/details/134504969