Linux Performance Tuning in Practice: Patterns - A Comprehensive Approach to System Monitoring (53)

I. Review of the Previous Section

In the previous sections, I introduced many performance analysis principles, approaches, and related tools. In real-world performance analysis, however, a very common situation is that a performance bottleneck clearly occurred, but by the time you log in to the server to troubleshoot, the bottleneck has already disappeared. Or the performance problem recurs from time to time, yet it is hard to find a pattern in it and hard to reproduce.

When you face such a scenario, you may find that the tools and methods introduced earlier all seem to "fail". Why? Because they are only effective while the performance problem is happening; in an after-the-fact analysis like this, they can do very little.

So what can you do? Ignore it? In the past, many applications did in fact wait until users complained about slow responses, or until the system crashed, before discovering a performance problem in the system or application. The problem was eventually found, but this approach is clearly undesirable because it seriously hurts the user experience.

To solve this, you need to build a monitoring system that tracks the health of both the system and the applications, and define a set of policies so that an alert is sent as soon as a problem occurs. A good monitoring system not only exposes problems in real time, but also uses the monitored state to automatically analyze and roughly locate the source of a bottleneck, so that the problem can be reported more precisely to the team responsible for handling it.

The core of good monitoring is a comprehensive set of quantifiable metrics covering both the system and the applications.

On the system side, monitoring should cover overall resource usage: the CPU, memory, disk and file system, network, and other system resources discussed earlier.

On the application side, monitoring should cover the internal running state of the application. This includes the overall process-level state, such as CPU and disk I/O usage, as well as finer-grained internals such as the latency of interface calls, errors during execution, and the memory used by internal objects.

Today, I'll look at how to monitor a Linux system. In the next section, I will go on to explain the approach to application monitoring.

II. The USE Method

Before you start monitoring a system, you will certainly want a simple way to describe how system resources are being used. You could of course use the various performance tools covered in this column to collect usage data for each resource. But don't forget that every resource has many performance metrics; using too many of them is time-consuming in itself, and it also makes it hard to form an overall picture of the system's health.

Here I'd like to introduce the USE (Utilization, Saturation, and Errors) method, which is designed specifically for performance monitoring. The USE method reduces the performance metrics of system resources to three categories: utilization, saturation, and errors.

  1. Utilization is the percentage of time or capacity that a resource spends serving work. 100% utilization means the capacity is exhausted or all of the time is spent serving work.
  2. Saturation indicates how busy the resource is, and is usually related to the length of the wait queue. 100% saturation means the resource cannot accept any more requests.
  3. Errors is the number of error events that occurred. The more errors there are, the more serious the problem with the system.

These three categories cover the common performance bottlenecks of system resources, so they are often used to locate resource bottlenecks quickly. Whether for hardware resources such as CPU, memory, disk and file system, and network, or for software resources such as file descriptors, connections, and connection tracking entries, the USE method can help you quickly identify which resource has become a bottleneck.
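
As a rough illustration of the first two categories for the CPU, the sketch below derives utilization from two samples of the aggregate `cpu` line in /proc/stat and approximates saturation as the 1-minute load average per CPU. The function names are hypothetical, counting iowait as idle time is an assumption of this sketch, and load average is only a coarse proxy for run-queue length.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) jiffies."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    # Assumption of this sketch: iowait is counted as idle time.
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    # guest and guest_nice are already included in user and nice,
    # so only the first eight fields are summed.
    total = sum(values[:8])
    return total - idle, total


def cpu_utilization(before, after):
    """Utilization: percentage of elapsed CPU time spent busy between two samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy1 - busy0) / elapsed


def cpu_saturation(load1, cpu_count):
    """Saturation proxy: 1-minute load average per CPU; values above 1.0 suggest queuing."""
    return load1 / cpu_count
```

On a real host, the two samples would be the first line of /proc/stat read a short interval apart, and the saturation inputs could come from `os.getloadavg()[0]` and `os.cpu_count()`.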

So what are the common performance metrics for each system resource? If you recall the principles of each resource covered earlier, the relevant metrics are not hard to work out. Here I have summarized the common performance metrics in a table, so you can consult it when needed.

Note, however, that the USE method focuses only on the core metrics that reflect resource bottlenecks; that does not mean the other metrics are unimportant. System logs, per-process resource usage, buffer usage, and other metrics also need to be monitored. They usually play a supporting role in performance analysis, whereas the USE metrics directly indicate a resource bottleneck in the system.

III. The Monitoring System

Once you have the USE method and the metrics that need to be monitored, the next step is to build a monitoring system that stores these metrics, then uses the monitored state to automatically analyze and roughly locate the source of a bottleneck, and finally, through an alerting system, reports problems promptly to the team responsible for handling them.

As you can see, a complete monitoring system usually consists of several modules: data collection, data storage, data query and processing, alerting, and visualization. Building one from scratch is therefore a substantial piece of systems engineering.

Fortunately, there are many open source monitoring tools that can be used directly, such as the popular Zabbix, Nagios, and Prometheus.

Here I will use Prometheus as an example to introduce the basic principles of these components. The basic architecture of Prometheus is shown in the figure below:

First, the data collection module. On the far left of the figure, Prometheus targets are the objects that data is collected from, and Retrieval is responsible for collecting that data. As the figure also shows, Prometheus supports two collection modes, Push and Pull.

  1. Pull mode: the server-side collection module initiates the collection. As long as the target exposes an HTTP endpoint, it can be scraped freely (this is the most commonly used collection mode).
  2. Push mode: each target actively pushes its metrics to the Push Gateway (which holds them to prevent data loss), and the server then pulls them from the Gateway (this is the most common collection mode in mobile applications).
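
To make the pull mode more concrete, here is a minimal sketch of a target that exposes metrics over HTTP in the Prometheus text exposition format, using only the Python standard library. The metric name `demo_requests_total`, its value, the `/metrics` path, and the port are illustrative assumptions; a real service would normally use an official client library, and this sketch does not escape special characters in label values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Label values are not escaped here (a simplification of this sketch).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves a hard-coded example metric on /metrics for the server to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metric(
            "demo_requests_total",
            "Requests handled (made-up value).",
            "counter",
            [({"path": "/"}, 3)],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the example endpoint until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the monitoring server would then be configured to scrape that address at a fixed interval.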

Because the objects to be monitored are usually dynamic, Prometheus also provides a service discovery mechanism that automatically discovers the objects to monitor according to preconfigured rules. This is very effective on container platforms such as Kubernetes.

Second, the data storage module. To persist monitoring data, the TSDB (time series database) module in the figure is responsible for writing the collected data to persistent storage such as a disk or SSD. A TSDB is a database designed specifically for time series data: it is indexed by time, handles large data volumes, and writes data by appending.

Third, the data query and processing module. The TSDB just mentioned not only stores data but also provides basic query and processing capabilities, which are exposed through the PromQL language. PromQL provides concise query and filtering features and supports basic data processing, and it is the foundation of both alerting and visualization.
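
As a small illustration of querying, the sketch below builds a request URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The PromQL expression assumes CPU metrics exported by node_exporter (`node_cpu_seconds_total`), which is not covered in this article, and the server address in the usage note is a placeholder.

```python
from urllib.parse import urlencode

# CPU utilization per instance over the last 5 minutes.
# Assumes node_exporter metrics are being scraped.
CPU_UTILIZATION = (
    '100 * (1 - avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])))'
)


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"
```

For example, `instant_query_url("http://localhost:9090", CPU_UTILIZATION)` yields a URL that can be fetched with any HTTP client; the response is JSON containing one sample per instance.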

Fourth, the alerting module. AlertManager, in the upper right corner of the figure, provides the alerting functions, including triggering rules based on PromQL, managing alert rule configuration, and sending alerts. Although alerts are necessary, alerting too frequently is clearly undesirable. AlertManager therefore also supports grouping, inhibition, silencing, and similar mechanisms to aggregate alerts of the same kind and reduce their number.
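
The idea behind grouping and silencing can be sketched with a toy model. This is not how AlertManager is implemented; the alert structure and the function names `group_alerts` and `drop_silenced` are hypothetical and only show why these mechanisms reduce the number of notifications.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the group_by labels.

    One notification per bucket replaces one notification per alert.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def drop_silenced(alerts, silences):
    """Remove alerts matched by any silence.

    Each silence is a dict of label -> value; an alert is silenced
    when all labels of some silence match exactly.
    """
    def is_silenced(alert):
        return any(
            all(alert["labels"].get(k) == v for k, v in silence.items())
            for silence in silences
        )

    return [alert for alert in alerts if not is_silenced(alert)]
```

With three firing alerts, two of which share the same `alertname`, grouping by `alertname` produces two notifications instead of three, and a silence on one instance removes that instance's alerts entirely.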

Last, the visualization module. The Prometheus web UI provides a simple visual interface for running PromQL queries, but the way it displays results is rather plain. Combined with Grafana, however, you can build very powerful graphical dashboards.

Having gone through these components, you should now have a fairly clear understanding of each module. Next, let's look at what these components can do together.

For example, taking the USE method just described, I can use Prometheus to collect utilization, saturation, and error metrics for the CPU, memory, disk, network, and other resources of a Linux server. Then, with Grafana and PromQL queries, these metrics can be displayed in an intuitive graphical interface.

IV. Summary

Today, I walked you through the basic approach to system monitoring.

The core of system monitoring is resource usage, covering hardware resources such as CPU, memory, disk and file system, and network, as well as software resources such as file descriptors, connections, and connection tracking entries. The USE method can be used to build the core performance metrics for these resources.

The USE method reduces the performance metrics of system resources to three categories: utilization, saturation, and errors. When any of the three is too high, the corresponding resource may have a performance bottleneck.

After establishing performance metrics based on the USE method, you also need a complete monitoring system that ties together the collection, storage, query, processing, alerting, and visualization of these metrics. You can build such a system on open source monitoring products such as Zabbix and Prometheus. This not only exposes resource bottlenecks quickly, but also lets you use historical monitoring data to trace and locate problems after the fact.

Of course, besides system monitoring, application monitoring is also essential; I will continue to break it down for you in the next lesson.

Origin www.cnblogs.com/luoahong/p/11585512.html