How to use the snapshot feature to quickly locate performance issues

We often run into this frustrating situation: a user or customer reports a problem with the platform, but after the testers set up an environment they cannot reproduce the failure, so the problem goes unsolved and we can only watch customers churn.

This happens because production environments are very complex, and sporadic bugs occur often yet are hard to capture. As microservices have become prevalent and system complexity has grown, quickly locating production faults and analyzing and resolving them in time has become an enormous challenge, one that in the past could only be handled by people. But how well and how fast people solve problems depends on experience, and sometimes requires cooperation across departments, so the cost is very high. Once key personnel leave or departments fail to cooperate, fault resolution slows down across the board.

This time I'd like to introduce an easy-to-use performance analysis tool: Application Insight (an application performance management platform, hereinafter referred to as Ai). As for what it does, read on.

 

Trace & Snapshot Features

 

01 What goes wrong when business code runs?

Running code often hits the following types of problems:

● JVM

Common examples include out-of-memory (OOM) errors, memory leaks, GC pauses, and running out of disk space.

● Database

Common examples include a busy or overloaded database server, slow execution of a single SQL statement, long waits to obtain a connection from the database connection pool, and too many connections being taken from the pool.

● External Service

Common examples include network problems when calling an external service, blocked calls between applications, and inappropriate index settings.

● Other issues

The above are the most common scenarios, but some problems are hidden deep in the program and require in-depth code analysis to find the cause:

Thread deadlocks, loop operations, transaction anomalies, JVM crashes, and so on.

 

02 What is a trace?

A trace collects information about a program at runtime so that you can query its running state, locate the parts of the code that need to be modified and optimized, speed the code up, and improve the user experience.

The industry commonly uses the following three methods:

 

● Event method

In Java, the JVMTI (JVM Tool Interface) API is used to capture events such as method calls, class loading, class unloading, and threads entering and leaving; the program's behavior is then analyzed from these events.

 

● Statistical sampling

At fixed intervals, the system is interrupted to collect the current call stack, recording which functions are being called and where they sit in the call stack. From this information, each function's CPU usage and the function call graph are derived.

 

● Code insertion

Instructions are inserted into the target program to record the start and end time of the code being run. Statistics over these records then yield the function call relationships and each function's CPU usage.
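To make the last two methods concrete, here is a minimal sketch in Python (chosen for brevity; this is not how the Ai agent is implemented). The `timed` decorator stands in for code insertion, and `sample_stacks` stands in for statistical sampling. All names here (`timed`, `sample_stacks`, `busy_work`, `call_stats`) are illustrative assumptions.

```python
import sys
import threading
import time
from collections import Counter
from functools import wraps

call_stats = {}  # function name -> [call count, total seconds]


def timed(func):
    """Code insertion: wrap func so every call records its start and end time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats = call_stats.setdefault(func.__name__, [0, 0.0])
            stats[0] += 1
            stats[1] += time.perf_counter() - start
    return wrapper


def sample_stacks(thread_id, duration=0.3, interval=0.01):
    """Statistical sampling: at fixed intervals, note which functions are on the stack."""
    hits = Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)
        seen = set()
        while frame is not None:  # walk from the innermost frame outwards
            seen.add(frame.f_code.co_name)
            frame = frame.f_back
        hits.update(seen)  # one hit per function per sample
        time.sleep(interval)
    return hits


@timed
def busy_work(seconds):
    """Hypothetical slow function used as the profiling target."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        sum(i * i for i in range(1000))


worker = threading.Thread(target=busy_work, args=(0.5,))
worker.start()
samples = sample_stacks(worker.ident)  # sample while the worker is still running
worker.join()
```

After the run, `call_stats["busy_work"]` holds an exact call count and total duration, while `samples` only estimates where time went from how often each function appeared on the stack. Sampling needs no change to the target code, which is why it suits code that is already running slowly in production.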

 

03 Traces in Ai

Ai implements tracing by combining code insertion with snapshots (statistical sampling). Traces fall into four categories: slow trace, error trace, SQL trace, and snapshot:

 

● Slow trace

A slow trace is collected when a user's transaction response time exceeds a threshold. The default is 2 seconds, but users can change it to suit their situation under "Slow transaction settings" in the Ai settings.

● Error trace

When the program throws an exception or fails outright, Ai collects an error trace directly so that the scene of the failure can be reconstructed.

● SQL trace

When slow SQL tracking is enabled on the platform and a slow SQL threshold is set, the agent records the stack information in a slow SQL trace whenever a SQL statement's execution time exceeds the threshold. The default is 0.5 s, and users can change it to suit their situation under "Database settings" in the Ai settings.

Users can also enable SQL execution plans to capture the execution plans of slow SQL statements.



● Snapshot

A snapshot works by capturing the customer's slow-running code multiple times while the call is in progress and analyzing the methods involved and the time they take, producing a snapshot trace.

Collection rule: the average response time of a web transaction exceeds 4 * apdex_T (2 seconds by default) for two consecutive minutes, and no more than one snapshot is collected per minute.

Users do not need to configure anything to use this feature, and can also adjust the threshold dynamically to suit their business.
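The collection rules above can be read as simple threshold checks. The sketch below is one possible reading, not Ai's actual logic. It assumes that slow SQL tracking is enabled, that response times arrive as one average per minute, and that "two consecutive minutes" means the current and the previous minute. The function and constant names are hypothetical.

```python
SLOW_TRANSACTION_SECONDS = 2.0  # default slow-trace threshold from the text
SLOW_SQL_SECONDS = 0.5          # default slow-SQL threshold from the text
SNAPSHOT_SECONDS = 2.0          # 4 * apdex_T, 2 seconds by default per the text


def traces_for_request(response_seconds, sql_seconds=(), failed=False):
    """Return which trace types a single request would produce."""
    kinds = []
    if response_seconds > SLOW_TRANSACTION_SECONDS:
        kinds.append("slow trace")
    if failed:
        kinds.append("error trace")
    if any(s > SLOW_SQL_SECONDS for s in sql_seconds):
        kinds.append("sql trace")
    return kinds


def snapshot_minutes(avg_response_by_minute, threshold=SNAPSHOT_SECONDS):
    """Return the indexes of the minutes in which a snapshot would be collected.

    A minute qualifies when both it and the previous minute averaged above
    the threshold; each qualifying minute yields at most one snapshot.
    """
    minutes = []
    for i in range(1, len(avg_response_by_minute)):
        previous_slow = avg_response_by_minute[i - 1] > threshold
        current_slow = avg_response_by_minute[i] > threshold
        if previous_slow and current_slow:
            minutes.append(i)
    return minutes


per_minute = [0.8, 2.5, 3.1, 2.2, 1.0, 2.4]
print(traces_for_request(2.6, sql_seconds=[0.1, 0.7]))
print(snapshot_minutes(per_minute))
```

In this example, the single slow minute at the end does not trigger a snapshot because the minute before it was below the threshold.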


 

User Case Study

 

01 Problem type: service connection refused

1) Description: users report that they cannot log in

 

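2) Investigation and analysis

a: Log in to the platform and check the status of the login interface

The login interface is rest/api/login. Select this web transaction to open its details, then use "Filter" to narrow down the results and get the failed traces.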


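b: View the error trace

On the overview page, the exception analysis shows the HTTP response code, the exception class, and the exception message. At this point the cause of the problem is already clear.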


Click "Details" to see how the program executed: it looped three times, and every iteration ended in an error.


In the error details, the detailed call stack leads us to "caused by java.net.ConnectException: connection refused".
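To illustrate what such an error trace captures, here is a small Python sketch of a call that is attempted three times and refused each time. It assumes the three iterations are retries of the same downstream call, which the article does not state. `call_with_retries` and `login_request` are hypothetical names, and the refused connection is simulated rather than made over a real network.

```python
def call_with_retries(request, attempts=3):
    """Call request() up to `attempts` times; return (result, collected errors)."""
    errors = []
    for _ in range(attempts):
        try:
            return request(), errors
        except ConnectionRefusedError as exc:
            errors.append(exc)  # keep every failure, as an error trace would
    return None, errors


def login_request():
    # Hypothetical stand-in for the call to the service that is down.
    raise ConnectionRefusedError(111, "Connection refused")


result, errors = call_with_retries(login_request)
print(len(errors), "failed attempts; last error:", errors[-1])
```

Retrying does not help when the target service is not accepting connections at all, which is consistent with the fix in this case being a server restart.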




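3) Solution

After we worked with the operations staff to restart the server responsible for the registration module, the business returned to normal.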

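4) Recommendation

Configure alarms for important business services so that an alarm fires early when the response time or the frequency of occurrence exceeds a threshold, limiting the losses.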

4.1 Trying the new feature

After logging in to the platform, you can view the alarm status.


Open the transaction details and hover the mouse to see the details of the most recent, most severe alarm in that time period.


Click the red dot to jump to the alarm records and view all the details for that transaction in that time period.


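4.2 How to configure alarms

a: Click "Alarms" on the Ai page

On the alarm page, choose alarm rules, then enter a rule name, select the type, and define when the rule is active and inactive (for example, inactive on holidays).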


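b: Select the alarm target

The system currently supports selecting by specific web transaction, by cluster, or by high-frequency web transaction entry point (cpm > 10). Here we select a specific web transaction as the alarm target.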


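c: Set the critical condition

Based on our business requirements, a critical alarm is triggered if the transaction's average response time over the past 10 minutes is greater than 2 seconds and the average response time is greater than 2 s in at least 5 of those minutes.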


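d: Set the warning condition

Based on our business requirements, a warning alarm is triggered if the transaction's average response time over the past 10 minutes is greater than 1 second and the average response time is greater than 1 s in at least 5 of those minutes.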


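The alarm rule is now configured. When a transaction triggers an alarm, you can see directly on the Ai platform whether there is an alarm and whether it is critical. If you want to be notified of alarms promptly, some further configuration is needed.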
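The critical and warning conditions configured above can be expressed as a short Python sketch. It is only an approximation of the rule as described: it assumes one average response time per minute and treats the 10-minute average as the mean of those per-minute values, which may differ from how Ai weights requests. `alarm_level` and its parameters are hypothetical names.

```python
from statistics import mean


def alarm_level(per_minute_avg, critical=2.0, warning=1.0, window=10, min_minutes=5):
    """Classify the latest `window` per-minute averages as critical, warning, or ok."""
    recent = per_minute_avg[-window:]
    if not recent:
        return "ok"

    def breached(threshold):
        # Both parts of the rule must hold: the window average is above the
        # threshold, and enough individual minutes are above it as well.
        slow_minutes = sum(1 for value in recent if value > threshold)
        return mean(recent) > threshold and slow_minutes >= min_minutes

    if breached(critical):
        return "critical"
    if breached(warning):
        return "warning"
    return "ok"


print(alarm_level([2.5] * 6 + [1.9] * 4))
print(alarm_level([1.5] * 10))
print(alarm_level([9.0] + [0.2] * 9))
```

The last call shows why the rule has two parts: a single extreme minute pushes the 10-minute average above 1 s, but with only one slow minute no alarm is raised.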

4.3 How to receive alarms

a: Create a recipient


b: Select how to receive alarms

Ai currently supports email and webhook alarms by default. By integrating with Cloud Alert (an intelligent alarm platform, CA for short), it also supports SMS, WeChat, DingTalk, and other notification methods. For more detailed alarm dispatching and alarm compression, see https://www.aiops.com/CAintroduce.html (the free edition is enough for a small operations team).

c: Associate the recipients with the alarm rules

Select the alarm behavior and the rules that trigger it.


Readers who use the Ai platform regularly may find this unfamiliar: why don't I see alarm status information on the Ai platform?

 

That's right: the deep integration of Ai with alarms is a new feature we are developing. I tried it ahead of release and it works great! The feature goes live in mid-December, so stay tuned.

 

If you are not yet a OneAPM user, click https://user.oneapm.com/pages/v2/signup to register for a free trial and experience performance optimization right away!

 

If you have any questions or suggestions while using the product, please feel free to contact us. We will be happy to help:

 

QQ: 321095806

Community: http://club.oneapm.com


Origin www.cnblogs.com/oneapm/p/11934446.html