A roadmap for understanding and troubleshooting online cpu and load problems

How cpu usage is calculated

When we run the top command, the values it shows (mainly cpu and load) change constantly, so it is worth taking a brief look at how Linux calculates cpu usage.

cpu usage is tracked at the system, process and thread levels. System-wide cpu statistics are in /proc/stat (the screenshot below is truncated):

The numbers after cpu and cpu0 correspond to the us, sy, ni and other fields shown by top. Exactly which number maps to which field is not important here; interested readers can check the documentation online.

Per-process cpu statistics are in /proc/{pid}/stat:

Per-thread cpu statistics are in /proc/{pid}/task/{threadId}/stat:

 

Every one of these values is cumulative from system boot to the present. In practice, cpu usage is therefore computed from two samples taken a sufficiently short time apart, at t1 and t2:

  • Sum all cpu time fields at t1 to get s1
  • Sum all cpu time fields at t2 to get s2
  • s2 - s1 is the total cpu time within the interval, totalCpuTime
  • The second idle value minus the first (idle2 - idle1) is the idle time within the sampling interval, idle
  • cpu utilization = 100 * (totalCpuTime - idle) / totalCpuTime
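The sampling steps above can be sketched in a few lines of Python. This is only an illustration: the two sample lines are made up, the function names are hypothetical, and the sketch sums the first eight fields of the aggregate cpu line because guest time is already included in user time.

```python
def parse_cpu_line(line):
    """Return (total, idle) jiffies from an aggregate 'cpu ...' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]
    # Fields: user nice system idle iowait irq softirq steal guest guest_nice.
    # Only the first eight are summed; guest time is already counted in user.
    total = sum(fields[:8])
    idle = fields[3]
    return total, idle


def cpu_utilization(sample1, sample2):
    """cpu utilization in percent between two 'cpu' lines sampled at t1 and t2."""
    s1, idle1 = parse_cpu_line(sample1)
    s2, idle2 = parse_cpu_line(sample2)
    total_cpu_time = s2 - s1
    idle = idle2 - idle1
    if total_cpu_time <= 0:
        return 0.0
    return 100.0 * (total_cpu_time - idle) / total_cpu_time


# Made-up sample lines, for illustration only.
t1 = "cpu  1000 0 500 8000 100 0 0 0 0 0"
t2 = "cpu  1300 0 600 8500 200 0 0 0 0 0"
print(cpu_utilization(t1, t2))
```

On a real machine the two lines would be read from /proc/stat a second or so apart; the same arithmetic applies to the per-core cpu0, cpu1, ... lines.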

Other values such as us, sy and ni are calculated the same way. In short, a cpu value reflects cpu usage within a particular sampling window. So if cpu is high but the thread dump shows the hot thread waiting on a database query, do not be surprised: the cpu figure is a statistic over the sampling window, not an instantaneous state.

If top shows user-space cpu (us) staying high over a period of observation, it means user programs were occupying the cpu throughout that period.

 

Understanding load

As for what load means, some articles liken it to traffic crossing a bridge, which is apt and easy to understand:

A single-core processor can be pictured as a single-lane bridge. Vehicles cross in turn, and a car can only proceed once the car ahead has passed. If nobody is in front of you, you drive straight through; if there are many vehicles, you wait until they have crossed. Specific numbers then describe the traffic:

  • 0.00: there is no traffic on the bridge at all. Anything between 0.00 and 1.00 is much the same: traffic flows smoothly and vehicles cross without waiting
  • 1.00: the bridge is exactly at capacity. This is not bad yet, but traffic is slightly congested and may slow down further
  • Above 1.00: the bridge is overloaded and traffic is seriously congested. How bad? At 2.00 there is twice as much traffic as the bridge can carry, so a full extra bridge's worth of vehicles is waiting anxiously

But a metaphor is only a metaphor. It tells us that load represents how much the system is carrying, but not which tasks are counted toward the load. For exactly what is counted, see how Linux describes load in man uptime:

Roughly, the system load average is the average number of processes that are in a runnable or uninterruptible state (the original highlights this definition in red). A runnable process is using the cpu or waiting to use it; an uninterruptible process is waiting for IO, such as disk IO. The load average is shown over three intervals: 1 minute, 5 minutes and 15 minutes. Load values must be read relative to the number of cpu cores: on a single-core cpu, load = 1 means the system is fully loaded, whereas on a 4-core cpu load = 1 means the system is 75% idle.
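The relationship between load and core count is simple arithmetic. The sketch below uses a hypothetical helper, idle_fraction, to reproduce the two figures quoted above, and then reads the live values through Python's standard library where the platform supports it.

```python
import os


def idle_fraction(load, cores):
    """Fraction of core capacity left unused at a given load average."""
    return max(0.0, 1.0 - load / cores)


print(idle_fraction(1.0, 1))  # single core, load = 1: fully loaded
print(idle_fraction(1.0, 4))  # four cores, load = 1: three quarters idle

# On a live Unix system the real figures come from the standard library.
try:
    one_min, five_min, fifteen_min = os.getloadavg()
    print(one_min, five_min, fifteen_min, os.cpu_count())
except (AttributeError, OSError):
    print("load average is not available on this platform")
```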

Note in particular that load is a single system-wide figure covering all cores, which is different from the cpu values.

Another important point: although the text above keeps saying "process", what I have read indicates that the threads inside a process are counted as separate tasks. If one process spawns 1000 threads that all run at the same time, the run queue length is 1000 and the load average approaches 1000.

 

The relationship between request volume and load

I used to hold a misconception: when tens of thousands of requests arrive and the requests queued at the back cannot be processed, the load value is bound to rise. After thinking it through carefully I realized this is wrong, which is why I am writing it up specifically.

Take Redis as an example. We all know that Redis uses a single-threaded model, which means that any number of requests can arrive at once but only one command is processed at a time (image source: https://www.processon.com/view/5c2ddab0e4b0fa03ce89d14f):

A single thread picks up each command once it is ready and hands it to the event dispatcher, which runs the matching handler based on the command type. Because there is only one thread, as long as enough commands are queued to keep that thread continuously busy, the load it contributes is 1.

Looking back at the whole process, the load value really has nothing to do with the number of requests. Load is related to the number of worker threads: the main thread is a worker thread, a Timer is a worker thread, a GC thread is a worker thread. Load counts threads and processes. However many requests there are, they are ultimately handled by a limited number of threads, and how those worker threads perform determines the final load value.

For example, suppose a service has a thread pool fixed at 64 threads:

  • Normally a task takes 10ms. A thread finishes the job in 10ms and quickly returns to the pool to wait for the next task, so few threads are running or waiting for IO at any moment, and over a sampling period the load is very low
  • For a certain period, a system problem makes each task take 10s and never seem to finish, which amounts to every thread processing tasks the whole time. Over the sampling period the load then shows up as 64 (ignoring anything outside these 64 threads)
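A rough way to see why the same pool produces such different load is Little's law: the average number of busy threads is about arrival rate times task duration, capped at the pool size. The arrival rate of 100 requests per second below is an assumption made for illustration; it does not appear in the example above.

```python
def expected_busy_threads(pool_size, arrival_rate, task_seconds):
    """Average number of busy threads, capped at the pool size (Little's law)."""
    return min(float(pool_size), arrival_rate * task_seconds)


# Assumed arrival rate: 100 requests per second (hypothetical).
normal = expected_busy_threads(64, 100, 0.010)   # 10ms per task
degraded = expected_busy_threads(64, 100, 10.0)  # 10s per task
print(normal, degraded)
```

With 10ms tasks roughly one thread is busy on average, while with 10s tasks the pool saturates at 64. The request rate is identical in both cases; only the behavior of the worker threads changed.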

So, to sum up: to understand the relationship between load and request volume, it is essential to be clear about how many threads the application has and how they work.

 

Troubleshooting high load with high cpu

First, an opinion: high cpu in itself is not the problem; high load caused by high cpu is. Load is the indicator by which to judge how much the system is carrying.

Why? Take a single-core cpu. When everyday cpu sits at 20% or 30%, resources are actually being wasted, because the cpu is idle most of the time. In theory a system's cpu utilization can reach 100%, meaning the cpu is fully used for compute-intensive work such as for loops, md5 hashing, creating objects and so on. In practice that cannot happen, because an application that does no IO at all is almost impossible; there is always something like a database read or a file read. So higher cpu is not unconditionally better, and as a rule of thumb 75% is the level that should raise an alert.

Note that "raise an alert" means high cpu is not necessarily a problem but does deserve a look, especially during ordinary hours, because everyday traffic is usually modest and should not drive cpu that high. If the code is simply handling normal business, there is nothing wrong. But if the code has entered an infinite loop (the classic example is the infinite loop triggered by HashMap resizing under concurrent use in JDK 1.7), a few threads will occupy the cpu continuously and eventually drive up the load.

In a Java application, investigating high cpu is usually straightforward and follows a fairly fixed routine:

  • ps -ef | grep java to find the pid of the Java process
  • top -H -p pid to find the id of the thread using the most cpu
  • Convert the thread id from decimal to hexadecimal, e.g. 2000 = 0x7d0
  • jstack pid | grep -A 20 '0x7d0' to find the thread with the matching nid and locate the cause of the high cpu from its call stack
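The conversion and lookup in the last two steps can be sketched in Python. The thread dump below is made up and only imitates the header format that jstack prints; the function names are hypothetical.

```python
def to_nid(thread_id):
    """Convert the decimal thread id shown by top -H into jstack's nid format."""
    return hex(thread_id)


def find_thread(dump_text, thread_id, context=20):
    """Return the header line with the matching nid plus up to `context` following lines."""
    marker = "nid=" + to_nid(thread_id)
    lines = dump_text.splitlines()
    for i, line in enumerate(lines):
        # Match the whole token so that 0x7d0 does not also match 0x7d01.
        if marker in line.split():
            return lines[i:i + context + 1]
    return []


# Made-up dump, for illustration only.
SAMPLE_DUMP = '''"worker-1" #12 prio=5 os_prio=0 tid=0x00007f0a1c0b2000 nid=0x7d0 runnable [0x00007f0a0d3fe000]
   java.lang.Thread.State: RUNNABLE
        at com.example.Demo.spin(Demo.java:42)
        at com.example.Demo.run(Demo.java:17)

"worker-2" #13 prio=5 os_prio=0 tid=0x00007f0a1c0b4000 nid=0x7d1 waiting on condition [0x00007f0a0d2fd000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
'''

print(to_nid(2000))
print("\n".join(find_thread(SAMPLE_DUMP, 2000, context=3)))
```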

Many articles online stop here, but real practice does not go that way. cpu is a statistic over a period of time, while a jstack dump is only a snapshot of one instant; the two do not measure the same dimension. It is therefore entirely possible that the line numbers in the printed stack show the code sitting at places like these:

  • Network IO, which does not consume cpu
  • for (int i = 0, size = list.size(); i < size; i++) {...}
  • A call to a native method

If you follow the steps above mechanically, you will be dumbfounded in that situation, thinking for a long time without understanding why this code would cause high cpu. To guard against this, when troubleshooting for real it is advisable to run jstack at least three times, preferably five, and to analyze the relevant code in light of all the dumps together. The high cpu may be caused by a bug in the code path that runs through the stacks rather than by the few lines that one dump happens to show.
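One way to compare several dumps is to count which frame sits on top of the hot thread's stack in each of them. The sketch below assumes the usual jstack layout (a header line containing nid=..., followed by indented "at ..." frames); the dumps and helper names are hypothetical.

```python
from collections import Counter


def top_frame(dump_text, nid):
    """Return the first 'at ...' frame of the thread with this nid, or None."""
    in_thread = False
    for line in dump_text.splitlines():
        if not in_thread:
            in_thread = ("nid=" + nid) in line.split()
            continue
        stripped = line.strip()
        if not stripped:
            return None  # blank line: the thread's block ended with no frames
        if stripped.startswith("at "):
            return stripped[3:]
    return None


def count_top_frames(dumps, nid):
    """Count how often each frame is on top of the stack across several dumps."""
    frames = (top_frame(d, nid) for d in dumps)
    return Counter(f for f in frames if f is not None)


def fake_dump(frame):
    """Build a made-up single-thread dump whose top frame is `frame`."""
    return ('"worker-1" #12 prio=5 os_prio=0 tid=0x00007f0a1c0b2000 nid=0x7d0 runnable\n'
            "   java.lang.Thread.State: RUNNABLE\n"
            "        at " + frame + "\n"
            "        at com.example.Demo.run(Demo.java:17)\n")


dumps = [
    fake_dump("com.example.Demo.spin(Demo.java:42)"),
    fake_dump("java.net.SocketInputStream.read(SocketInputStream.java:129)"),
    fake_dump("com.example.Demo.spin(Demo.java:42)"),
]
counts = count_top_frames(dumps, "0x7d0")
print(counts.most_common(1))
```

A frame that recurs across most dumps is a far better lead than whatever a single snapshot happened to catch.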

High cpu has another possible cause. Suppose that on a 4-core server the total cpu shown is 100%+, and after pressing 1 in top to view each core we find that only one core is at 90%+ while the others are at about 1% (the figure below only demonstrates what top looks like after pressing 1; it is not the real scene):

In this situation the first thing to consider is whether frequent Full GC is the cause. A Full GC involves a Stop The World pause: on a multi-core server, every thread except the GC threads is suspended until the pause ends. Take the old-generation collectors as examples:

  • Serial Old collector: Stop The World for the whole collection
  • Parallel Old collector: Stop The World for the whole collection
  • CMS collector: Stops The World in two phases, initial mark and remark, in order to mark accurately the objects that need to be reclaimed, but pauses the system for far less time than the previous two

In any case, when a Stop The World pause actually happens, the GC thread keeps occupying cpu while all other threads are suspended, so naturally one cpu shows a very high us while the others show a very low us.

For Full GC problems, the usual line of investigation is:

  • ps -ef | grep java to find the pid of the Java process
  • jstat -gcutil pid 1000 1000 to print memory usage once per second, 1000 times in total; watch old generation (O) and Metaspace (M) usage and the Full GC count (FGC)
  • Once frequent Full GC is confirmed, check the GC log; the GC log path depends on each application's configuration
  • jmap -dump:format=b,file=filename pid to preserve the scene
  • Restart the application to stop the bleeding quickly and avoid a bigger production problem
  • Analyze the dump with MAT or a similar tool to find out what is filling memory and causing the Full GC
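Reading jstat output by eye works, but the check in the second step can also be sketched in Python. The sample below is made up and follows the JDK 8 column layout of jstat -gcutil; columns are located by header name because newer JDKs print additional ones. The function names are hypothetical.

```python
def full_gc_delta(jstat_output):
    """Return how many Full GCs happened between the first and last sample rows."""
    rows = [line.split() for line in jstat_output.strip().splitlines()]
    header, samples = rows[0], rows[1:]
    fgc = header.index("FGC")  # locate the column by name, not by position
    return int(samples[-1][fgc]) - int(samples[0][fgc])


def old_gen_usage(jstat_output):
    """Return the old generation usage (column O, in percent) for every sample."""
    rows = [line.split() for line in jstat_output.strip().splitlines()]
    o = rows[0].index("O")
    return [float(r[o]) for r in rows[1:]]


# Made-up output, one row per second, for illustration only.
SAMPLE = """
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  12.50  40.10  98.70  95.20  90.10    120    1.234    30   45.000   46.234
  0.00   0.00  10.00  99.10  95.20  90.10    120    1.234    31   46.500   47.734
  0.00   0.00  55.30  99.30  95.20  90.10    120    1.234    33   49.600   50.834
"""

print(full_gc_delta(SAMPLE), old_gen_usage(SAMPLE))
```

A Full GC count that climbs every second or two while O stays near 100% is the pattern to look for: collections are running but reclaiming almost nothing.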

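If Full GC is driven only by the old generation, an experienced developer can usually find the problem fairly easily; it is generally caused by a bug in the code. Full GC triggered by Metaspace is often a strange, obscure problem, frequently related to misuse of a third-party framework or to a bug in the framework itself, and it takes a lot of time to investigate.

So how does frequent Full GC ultimately change the load? I have not verified this or seen concrete data; what follows is theoretical analysis only. If every thread were idle and only the GC thread were running Full GC continuously, load would tend toward 1. But that cannot happen in practice, because if no other thread were running, what would be causing frequent Full GC? So when other threads are handling tasks and are suspended by Stop The World, their tasks go unprocessed, and the more likely outcome is that load keeps rising.

One last remark: I have been talking about Full GC, but frequent Young GC can also push load up. One case I saw involved converting Object to xml and xml to Object, where every call site did new XStream() for serialization and deserialization. Reclamation could not keep up with allocation, and the Young GC count shot up.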

 

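Troubleshooting high load with low cpu

From the section on load, we can see the factors that lead to high load:

  • Threads that are using the cpu
  • Threads that are waiting to use the cpu
  • Threads that are performing uninterruptible IO

Since cpu is not high but load is, the threads must be either doing IO or waiting to use the cpu. I have doubts about the latter, "waiting to use the cpu". Suppose a thread pool has 10 threads and tasks arrive slowly, so only 1 thread is used at a time. Then the other 9 threads are all waiting, but they clearly do not occupy system resources, so I believe they consume no cpu either, and I leave this factor out of consideration.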

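So when load is high while cpu is not, high IO is most likely the culprit: tasks keep running without finishing, and threads cannot return to the pool. Start with disk IO. Since wa is the percentage of time the cpu spends waiting on disk IO, we can look at wa to confirm whether disk IO is the cause:

If it is, print thread stacks the same way as for high cpu, look at the file IO frames and investigate the cause, for example whether multiple threads are all reading one huge local file into memory.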
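The wa figure can also be derived by hand from two samples of the aggregate cpu line in /proc/stat, where iowait is the fifth numeric field. A minimal sketch with made-up samples and a hypothetical function name:

```python
def iowait_percent(sample1, sample2):
    """Share of cpu time spent idle waiting for IO between two 'cpu' lines, in percent."""
    # Use the first eight fields: user nice system idle iowait irq softirq steal.
    a = [int(x) for x in sample1.split()[1:9]]
    b = [int(x) for x in sample2.split()[1:9]]
    total = sum(b) - sum(a)
    if total <= 0:
        return 0.0
    return 100.0 * (b[4] - a[4]) / total


# Made-up sample lines, for illustration only.
before = "cpu  1000 0 500 8000 100 0 0 0 0 0"
after = "cpu  1100 0 600 8200 700 0 0 0 0 0"
print(iowait_percent(before, after))
```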

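I believe disk IO is the cause of high load only in a minority of cases. Given how Java is typically used, an application's heavy IO is more likely to come from handling network requests, for example:

  • Fetching data from a database
  • Fetching data from Redis
  • Calling an HTTP interface to fetch data from Alipay
  • Fetching data from some service through dubbo

For this situation, I think we should first be familiar with the dependencies in the overall system architecture. For example, here is a rough sketch:

A slow call to any dependency will raise the load of our own system. When load is high, the suggested approach is:

  • Check the logs. Whether it is a call to HBase, MySql or Redis, or an interface call over http or dubbo, a call timeout or a timeout acquiring a connection from the pool usually produces an error log. As long as the system does not catch exceptions and swallow them without logging, the relevant exception can generally be found
  • For dubbo and http calls, add monitoring instrumentation that records the interface name, method arguments (with size limits), success or failure, call duration and other necessary fields. Sometimes nothing times out, but calls taking 2 or 3 seconds still raise the load, so in that case check the call durations before deciding the next step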
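The instrumentation described in the second point is language-independent. As a minimal sketch in Python, a decorator can record those fields around any call. The names monitored, call_log and get_user are hypothetical, and a real system would send the record to its logging or metrics backend instead of a list.

```python
import functools
import time

call_log = []  # stand-in for a real metrics or logging backend


def monitored(interface_name, max_arg_chars=200):
    """Record name, truncated arguments, success flag and duration of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            success = False
            try:
                result = func(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs on both success and failure, so slow failures are logged too.
                call_log.append({
                    "interface": interface_name,
                    "args": repr((args, kwargs))[:max_arg_chars],
                    "success": success,
                    "millis": (time.monotonic() - start) * 1000.0,
                })
        return wrapper
    return decorator


@monitored("UserService.getUser")
def get_user(user_id):
    # Placeholder for a real dubbo or http call.
    return {"id": user_id}


get_user(42)
print(call_log[-1])
```

Truncating the argument representation keeps large payloads from flooding the log, which is the "size limits" concern mentioned above.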

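If the steps above do not help, or the interface calls are not instrumented, fall back on the all-purpose thread dump. Print the stacks five or ten times in a row and see whether most of them point to the same interface call. For network IO, the innermost frames of the stack usually include at java.net.SocketInputStream.read(SocketInputStream.java:129)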

 

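Summary of common causes of high load in Java applications

Having said all this, here is a summary of the common and likely causes of high load:

  • Infinite loops or unreasonably heavy looping. Without a loop, a modern cpu gets through even a long stretch of code in an instant, with essentially no cost to capacity
  • Frequent Young GC
  • Frequent Full GC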
  • High disk IO
  • High network IO

When you first run into a problem in production, do not panic. Most high-load problems fall into the few categories above, and the following steps may help you organize your thinking:

  • Use top to check user cpu (us) and idle (id) to confirm whether the high load is caused by high cpu
  • If it is caused by high cpu, confirm whether gc is responsible; the jstat command plus the gc log will settle this
  • If gc is causing the high cpu, take a heap dump directly; if not, analyze thread stacks
  • If it is not caused by high cpu, check the disk io share (wa); if that is high, take thread dumps and look for heavy file io
  • If it is caused by neither high cpu nor disk io, check the call durations to each dependency; slow network calls are likely the culprit
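The checklist above is essentially a small decision tree, sketched here as a hypothetical triage function. The 75% cpu threshold is the rule of thumb mentioned earlier; the 20% wa threshold is purely an assumption for illustration and should be tuned for the system at hand.

```python
def triage(us_percent, wa_percent, gc_is_frequent,
           cpu_threshold=75.0, wa_threshold=20.0):
    """Suggest the next troubleshooting step, following the checklist above."""
    if us_percent >= cpu_threshold:
        if gc_is_frequent:
            return "heap dump"
        return "thread stacks"
    if wa_percent >= wa_threshold:
        return "thread stacks: look for file io"
    return "check dependency call durations"


print(triage(us_percent=92.0, wa_percent=1.0, gc_is_frequent=True))
print(triage(us_percent=15.0, wa_percent=2.0, gc_is_frequent=False))
```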

If all else fails, take several jstack dumps and analyze them; you may have a eureka moment and find the cause of the problem.

 

Epilogue

Theory comes first: only with the theory thought through can you stay clear-headed when a real problem hits.

Frankly, troubleshooting high cpu and high load is a very hands-on skill. I still have a long way to go myself, and many colleagues around me are more experienced at it than I am. People often ask me: my project is fairly simple and never has production problems that need troubleshooting, so what should I do? All I can say is that steady accumulation and practice are the only way. If you have no real opportunities, I suggest three approaches:

  • Write code to simulate various abnormal conditions such as Full GC, deadlocks and infinite loops, then investigate them with the tools. This may be simple, but great oaks from little acorns grow: complex cases evolve from simple ones
  • Run top, sar, iostat and similar commands on servers often, and memorize the meaning and purpose of each command's output fields
  • Go online and read how others have dealt with Full GC and high cpu; stand on the shoulders of giants, study the path your predecessors took, and write down the useful points

When a real opportunity does come, seize it. Even if it is a colleague who troubleshoots the problem, you can work out the ins and outs afterwards, and over time your ability in this area will grow naturally.


Origin www.cnblogs.com/Joy-Hu/p/11819350.html