Linux Performance Tuning Notes: Application Monitoring

Monitoring indicators

As with system monitoring, before building a monitoring system for an application you first need to decide which metrics to monitor. In particular, be clear about which metrics can be used to quickly identify application performance problems.

For monitoring system resources, the USE method is simple and effective, but that does not mean it also works for monitoring applications. For example, low CPU utilization alone does not prove that an application has no performance bottleneck: the application may still respond slowly because of lock contention, slow RPC calls, and so on.

So the core metrics for an application are no longer resource usage, but request rate, error rate, and response time. These metrics are not only directly tied to the user experience; they also reflect the overall availability and reliability of the application.

With these three golden metrics — request rate, error rate, and response time — we can quickly tell whether an application has a performance problem. But these alone are still not enough: once a problem does occur, we also want to locate the "bottleneck area" quickly. So, in my view, the following metrics are also essential when monitoring an application.
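To make the three golden metrics concrete, here is a minimal sketch that computes them from a batch of request records. The record format (`status`, `latency_ms`) is a hypothetical example for illustration, not something defined in this article:

```python
# Sketch: computing the three "golden" metrics (request count, error
# rate, response time) from a list of request records. The record
# fields "status" and "latency_ms" are illustrative assumptions.
def golden_signals(requests):
    """requests: list of dicts with 'status' (HTTP status code)
    and 'latency_ms' (response time in milliseconds)."""
    total = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    latencies = sorted(r["latency_ms"] for r in requests)
    # p99 latency via the nearest-rank method (integer arithmetic
    # avoids floating-point index surprises)
    p99 = latencies[min(total - 1, (99 * total) // 100)]
    return {
        "request_count": total,
        "error_rate": errors / total,
        "p99_latency_ms": p99,
    }
```

In a real system you would not batch-compute these by hand; a metrics client would export counters and histograms continuously. The sketch only shows what the three numbers mean.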

The first is the resource usage of the application's processes, such as the CPU, memory, disk I/O, and network each process consumes. Excessive use of system resources, leading to slow responses or an increased error count, is one of the most common performance problems.
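On Linux, per-process resource usage can be read from the `/proc` filesystem. Here is a minimal sketch that extracts a process's resident memory from `/proc/<pid>/status`-style text; the sample string is illustrative, and in practice you would read the file for a real PID:

```python
# Sketch: extracting a process's resident memory (VmRSS) from the
# text of Linux /proc/<pid>/status. The sample below is illustrative;
# in practice you would read open(f"/proc/{pid}/status").read().
def rss_kb(status_text):
    """Return the resident set size (VmRSS) in kB, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:  2048 kB"
    return None

sample = "Name:\tnginx\nVmRSS:\t  2048 kB\nThreads:\t4\n"
```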

The second is the calls between applications, such as call rate, error count, and latency. Applications do not run in isolation: if another application it depends on has performance problems, the application's own performance will suffer.

The third is the behavior of the application's internal core logic, such as the time spent in critical processing steps and the errors they produce. Since this is internal state, you normally cannot obtain detailed performance data from the outside. The application should therefore expose these metrics by design, so that the monitoring system can observe its internal behavior.
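One common way for an application to expose such internal metrics is to instrument its key functions. The sketch below uses a plain dictionary as a stand-in for a real metrics client (such as a Prometheus exporter); the function and metric names are hypothetical:

```python
# Sketch: instrumenting key internal functions so the application
# exposes its own timing data. METRICS is a stand-in for a real
# metrics client; handle_order is a hypothetical business function.
import time
from collections import defaultdict
from functools import wraps

METRICS = defaultdict(list)  # function name -> list of durations (seconds)

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            METRICS[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@timed
def handle_order(order_id):
    time.sleep(0.01)          # stand-in for real processing work
    return f"processed {order_id}"
```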

  • With the resource-usage metrics of the application's processes, you can correlate system resource bottlenecks with specific applications and quickly locate performance problems caused by resource exhaustion;

  • With the metrics on calls between applications, you can quickly trace the call chain of a request and find out which component is the real culprit;

  • And with the metrics on the application's internal core logic, you can go one step further, look inside the application, and pinpoint exactly which processing step is causing the performance problem.

Following these ideas, I believe you can build a set of metrics that describe an application's runtime performance. Feed these metrics into the monitoring system mentioned earlier (such as Prometheus + Grafana), and, just as with system monitoring, you can on the one hand report problems promptly to the responsible team through an alerting system, and on the other hand show the application's overall performance dynamically through an intuitive graphical interface.

In addition, a business system usually involves a chain of multiple services, forming a complex distributed call chain. To locate such cross-application performance bottlenecks quickly, you can also use open-source tools such as Zipkin, Jaeger, and Pinpoint to build a full-link (distributed) tracing system.
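The core idea behind these tracing systems can be sketched in a few lines: every request carries a trace ID, and each service hop creates a child span that keeps the trace ID but records its own span ID and its parent. The field names below are illustrative, not the wire format of Zipkin or Jaeger:

```python
# Sketch of trace-context propagation, the core idea of full-link
# tracing. Field names are illustrative, not any tool's real format.
import uuid

def start_trace(operation):
    """Create the root span of a new trace."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex,
            "parent_id": None, "operation": operation}

def child_span(parent, operation):
    """Create a span for a downstream call within the same trace."""
    return {"trace_id": parent["trace_id"],    # same trace as the caller
            "span_id": uuid.uuid4().hex,       # new span for this hop
            "parent_id": parent["span_id"],    # link back to the caller
            "operation": operation}
```

Because every span shares the root's trace ID, the tracing backend can reassemble all hops of one request into a single call chain and a topology graph.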

 

For example, the figure below shows an example of a call-chain trace in Jaeger.

Full-link tracing can help you quickly locate which step in a request's processing is the root of the problem. For example, from the figure above you can easily see that this problem was caused by a Redis timeout.

Besides helping you quickly locate cross-application performance problems, full-link tracing can also generate the online call topology of the system. These visual topologies are especially useful when analyzing complex systems (such as microservices).

Log Monitoring

Monitoring performance metrics lets you quickly locate where a bottleneck occurs, but metrics alone are often not enough. For example, the same interface may exhibit completely different performance problems depending on the request parameters passed in. So, besides the metrics themselves, we also need to monitor their context, and logs are the best source of that context.

For comparison,

  • Metrics are numeric measurements over specific time intervals; they are usually processed as time series and are well suited to real-time monitoring.

  • Logs are completely different: log messages are strings recorded at a point in time, and usually need to be indexed by a search engine before you can query and aggregate them.

The most classic approach to log monitoring is the ELK stack, that is, the combination of three components: Elasticsearch, Logstash, and Kibana.
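The essence of the pre-processing step is turning unstructured log lines into structured, indexable fields. Here is a minimal sketch of that idea for an Apache combined-format access log; the regex covers only the common case and is far simpler than Logstash's real grok patterns:

```python
# Sketch: parsing an Apache combined-format access-log line into
# structured fields, the kind of pre-processing Logstash performs
# before sending records to Elasticsearch. Simplified for illustration.
import re

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

def parse_access_log(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

Once logs are structured like this, a search engine can index each field separately, so you can query "all 5xx responses for /index.html in the last hour" instead of grepping raw text.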

The figure below shows a classic ELK architecture:

 

In this architecture:

  • Logstash collects logs from the various log sources, pre-processes them, and then sends the processed logs to Elasticsearch for indexing.

  • Elasticsearch indexes the logs and provides a complete full-text search engine, making it easy to retrieve the data you need from the logs.

  • Kibana handles visual log analysis, including log search, processing, and rich dashboard displays.
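As a concrete illustration of how these components are wired together, a minimal Logstash pipeline configuration might look like the following. This is a generic sketch (ports and hosts are placeholder values), using the standard Beats input, the built-in `COMBINEDAPACHELOG` grok pattern, and the Elasticsearch output:

```
input {
  beats { port => 5044 }                      # receive logs from Filebeat
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }   # parse Apache access logs
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }        # send to Elasticsearch
}
```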

The figure below is an example of a Kibana dashboard, giving an at-a-glance overview of Apache access traffic.

 

Note that Logstash, within the ELK stack, consumes a fair amount of resources. So in resource-constrained environments we often replace Logstash with the lighter-weight Fluentd (the so-called EFK stack).

Summary

Today I walked you through the basic approach to application monitoring. Application monitoring can be divided into two major parts, metric monitoring and log monitoring:

  • Metric monitoring mainly measures performance metrics over time intervals, then processes, stores, and alerts on them as time series.

  • Log monitoring provides more detailed context, usually collected, indexed, and visualized with the ELK stack.

In complex business scenarios that span multiple applications, you can also build a full-link tracing system. This lets you dynamically trace the performance of each component in a call chain and generate the call topology of the whole flow, speeding up the diagnosis of performance problems in complex applications.


Origin www.cnblogs.com/newcityboy/p/12015718.html