zprofiler Tools

Repost: zprofiler's three tricks for solving a high CPU usage problem

zprofiler is Alibaba's homegrown profiler. Other articles show how useful this tool is for locating performance problems; this reposted article is a good way to learn about it.

 

Last Friday I ran into a problem where a production machine's CPU usage was too high. The problem itself was fairly simple, but locating it involved several of zprofiler's main features, so it makes a good case study for introducing the workflow of using zprofiler to troubleshoot this kind of problem.

Before you start using zprofiler, first use perf to confirm whether the bottleneck is in native code. (The following operations require root privileges, so you may need a PE to help.)
If perf is not installed on the production server, you can download the rpm package from http://yum.corp.taobao.com/taobao/6/x86_64/test/aliperf/aliperf-0.3.9-9.el6.x86_64.rpm and install it.
Use the perf top command to view the current system's hotspot functions.
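As a minimal command sketch (the rpm URL comes from the step above; the PID 12345 is an illustrative placeholder):

    # Install the perf package if it is not already present (requires root)
    rpm -ivh aliperf-0.3.9-9.el6.x86_64.rpm

    # Show the hottest functions system-wide, refreshed live
    perf top

    # Or restrict sampling to the suspect Java process (12345 is a placeholder PID)
    perf top -p 12345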

The situation shown in the figure above indicates that the hotspot is in Java code: because Java code is JIT-compiled, perf cannot see its symbols, so by default they are attributed to perf-<pid>.map.
If the hotspot is a function in libjvm.so, you can contact our team for help with further analysis. For example, if the hotspot is a JIT-related function, the problem usually lies with the code cache or JIT-related parameters; if it is a GC-related function, you can use zprofiler to analyze the GC log and then tune the GC-related parameters.
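For illustration only, these are standard HotSpot options that such tuning might touch; the size and path below are placeholders, not recommendations from the original article:

    # Enlarge the JIT code cache if it is running out (placeholder size)
    -XX:ReservedCodeCacheSize=256m

    # Produce a GC log that can later be analyzed with zprofiler
    -Xloggc:/path/to/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps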

After ruling out other possibilities and confirming that the problem is in Java code, you can take a thread dump and analyze it on zprofiler.
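The thread dump itself can be taken with standard JDK tooling before uploading it to zprofiler; a minimal sketch, with 12345 standing in for the real PID:

    # Write a thread dump of the target JVM to a file
    jstack 12345 > thread_dump.txt

    # Alternatively, send SIGQUIT so the JVM prints the dump to its console/stdout log
    kill -3 12345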
Using the "hot stacks of running threads (load)" feature under thread dump, you can see the call stacks that appear most often among running threads. As shown below:

In fact, the problematic stack is already visible here, but because a thread dump is just a snapshot, I didn't dare believe the problem had been found so quickly, so I decided to take another look with Hot Method Profiling.

Hot Method Profiling has already been covered in a dedicated article, so I won't go into detail here; just read the pinned post in the group.
The analysis results at the time are shown below:

The result is very clear: the top-ranked function accounts for 99% of the CPU usage, and its expanded call stack is exactly the same as the one seen earlier in the hot stacks. It is almost certain that the problem lies here.

However, a teammate on the product side said this was a normal call: the SQL statement hadn't been changed in a long time, and the amount of data in the database wasn't large. To get to the bottom of it, we decided to take a heap dump and see exactly what kind of data was being processed.
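The heap dump itself can be produced with standard JDK tooling; a minimal sketch, with 12345 standing in for the real PID (note that -dump:live forces a full GC first) and a placeholder destination host:

    # Dump live objects in hprof binary format
    jmap -dump:live,format=b,file=heap.hprof 12345

    # Copy the dump to the zprofiler system for analysis (destination is a placeholder)
    scp heap.hprof user@zprofiler-host:/path/to/uploads/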
After taking the heap dump, we copied it to the zprofiler system for analysis. A quick look at the "object cluster view" showed no particularly large objects.
Then I looked at the "thread overview", where the "regex match" box on the right lets you filter out the relevant threads by thread name.
Expanding a thread then shows the local objects at each level of its call stack. As shown below:

Hovering the mouse over an object shows its contents. Here you can see the SQL statement being executed and its related parameters.

The root cause turned out to be a bug in a third-party component that had not been upgraded.

Origin: www.cnblogs.com/jpfss/p/11584248.html