Detailed explanation and practice of common monitoring indicators of Node Exporter

Common monitoring indicators


In this section we look at some common node monitoring metrics, such as CPU, memory, disk, and network IO.

CPU monitoring


For a node, the CPU is the first thing to monitor: it is the core of task processing, and its state tells us a lot about the health of the system. Node CPU monitoring relies on the node_cpu_seconds_total metric, which looks like this on the metrics endpoint:

Time spent by the CPU in each mode:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 13172.76
node_cpu_seconds_total{cpu="0",mode="iowait"} 0.25
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0.01
node_cpu_seconds_total{cpu="0",mode="softirq"} 87.99
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 309.38
node_cpu_seconds_total{cpu="0",mode="user"} 79.93
node_cpu_seconds_total{cpu="1",mode="idle"} 13168.98
..............................................................

As the HELP line says, this metric counts the seconds the CPUs have spent in each mode. It is a Counter, so it only ever increases: the value is the cumulative CPU time recorded since the operating system started. The cumulative time is split into several modes, such as user time, system (kernel) time, idle time, and interrupt time. This is the same CPU information we usually view with the top command, recorded here separately for each mode.
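
To make the structure of these samples concrete, here is a minimal Python sketch that parses a few of the lines shown above and sums the cumulative seconds per mode. The parser is a simplification written for this illustration (it ignores escaped quotes, timestamps, and metrics without labels), and the function names parse_samples and seconds_by_mode are hypothetical.

```python
import re
from collections import defaultdict

# A few sample lines copied from the metrics output above.
SAMPLE = '''\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 13172.76
node_cpu_seconds_total{cpu="0",mode="iowait"} 0.25
node_cpu_seconds_total{cpu="0",mode="system"} 309.38
node_cpu_seconds_total{cpu="0",mode="user"} 79.93
node_cpu_seconds_total{cpu="1",mode="idle"} 13168.98
'''

# Simplified patterns: name{labels} value, and key="value" pairs.
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a labelled sample this sketch understands
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def seconds_by_mode(samples):
    """Sum node_cpu_seconds_total across all CPUs, grouped by mode."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == "node_cpu_seconds_total":
            totals[labels["mode"]] += value
    return dict(totals)


if __name__ == "__main__":
    for mode, seconds in sorted(seconds_by_mode(parse_samples(SAMPLE)).items()):
        print(f"{mode}: {seconds:.2f} s")
```

Each (cpu, mode) pair is its own series, which is why the queries later in this section need to aggregate across CPUs.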

Next, let's monitor the node's CPU. An ever-increasing CPU time is not very meaningful on its own. What we usually want is the node's CPU usage, which is the percentage we see with the top command.

To calculate CPU usage, we first need to be clear about what it means: CPU usage is the sum of the time spent in all CPU modes except idle, divided by the total CPU time. With this definition in mind, we can write the correct PromQL query.

Rather than summing the time of every non-idle mode, it is easier to calculate the fraction of time spent in idle mode and subtract it from 1. So we first filter for the idle mode by entering node_cpu_seconds_total{mode="idle"} in the Prometheus web UI:

To get the usage, we need to know how much time the CPUs spent in idle mode and compare it with the total. Because this is a Counter, we can use the increase function to get the change over a time window (this is essentially the difference between the first and the last sample in the window). The increase function takes a range vector as input, so here we take the data from the last 1 minute with increase(node_cpu_seconds_total{mode="idle"}[1m]):
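
The idea behind increase can be sketched in a few lines of Python. This is a deliberate simplification, not how Prometheus implements it: the real function also extrapolates to the window boundaries, which is omitted here. The function name simple_increase and the sample values are hypothetical.

```python
def simple_increase(samples):
    """Approximate the increase of a counter over a window.

    samples: list of (timestamp_seconds, value) tuples in time order.
    Returns None when there are fewer than two samples, since no
    change can be computed from a single point.
    """
    if len(samples) < 2:
        return None
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went backwards, so it was reset (for example by a
            # reboot). Everything counted since the reset is new increase.
            total += value
        else:
            total += value - previous
        previous = value
    return total


if __name__ == "__main__":
    # Hypothetical idle-seconds samples scraped every 15 s over one minute.
    window = [(0, 100.0), (15, 110.0), (30, 125.0), (45, 140.0), (60, 155.0)]
    print(simple_increase(window))  # last value minus first value
```

Without resets the result is just the last value minus the first, which matches the description above.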

The result contains one series for each CPU core. We want the time across all cores, so we aggregate them. Since what we are after is the CPU usage of each node, we aggregate by the instance label with sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance), which gives the idle CPU time of each node over the last minute:

This gives us the idle CPU time of each node within 1 minute. Next we divide it by the total CPU time (no mode filter is needed this time) with sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance):

CPU usage is then simply 1 minus this ratio, multiplied by 100: (1 - sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance) ) * 100. This is the most direct way to query CPU usage. As mentioned in the PromQL syntax we learned earlier, we normally use the rate function instead of increase; because the numerator and the denominator cover the same window, the ratio is the same either way. The final CPU usage query is therefore: (1 - sum(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(rate(node_cpu_seconds_total[1m])) by (instance) ) * 100.
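
The same arithmetic can be checked by hand. The Python sketch below takes two snapshots of the cumulative counters for one node and applies the formula "1 minus idle time over total time". The function name cpu_usage_percent and all snapshot values are hypothetical, chosen only to keep the numbers easy to follow.

```python
def cpu_usage_percent(before, after):
    """CPU usage in percent between two counter snapshots.

    before/after: dicts mapping (cpu, mode) -> cumulative seconds,
    taken at the start and end of the window for a single node.
    """
    idle_delta = 0.0
    total_delta = 0.0
    for key, end_value in after.items():
        delta = end_value - before[key]
        total_delta += delta          # time in every mode, all CPUs
        if key[1] == "idle":
            idle_delta += delta       # time in idle mode only
    if total_delta == 0:
        return 0.0                    # no time elapsed between snapshots
    return (1 - idle_delta / total_delta) * 100


if __name__ == "__main__":
    # Hypothetical two-core node, snapshots taken 60 seconds apart.
    before = {
        ("0", "idle"): 1000.0, ("0", "user"): 100.0, ("0", "system"): 50.0,
        ("1", "idle"): 1000.0, ("1", "user"): 100.0, ("1", "system"): 50.0,
    }
    after = {
        ("0", "idle"): 1045.0, ("0", "user"): 110.0, ("0", "system"): 55.0,
        ("1", "idle"): 1051.0, ("1", "user"): 106.0, ("1", "system"): 53.0,
    }
    # Idle grew by 96 s out of 120 s of total CPU time.
    print(f"{cpu_usage_percent(before, after):.1f}%")
```

In this example the two cores were idle for 96 of the 120 available CPU-seconds, so the usage comes out at about 20%.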

The result can be compared with the output of the top command (the figure below shows the node2 node), and the two are basically the same. This is how to monitor the CPU usage of a node.

Memory monitoring


Besides the CPU, node memory is probably what we care about most. We usually check the memory usage of a node with the free command:

The output of the free command shows the usage of system memory, including physical memory, swap, and kernel buffers. To monitor memory we need to understand these concepts first, so let's start with the output of the free command:

  • Mem row (the second line): memory usage
  • Swap row (the third line): swap space usage
  • total column: the total amount of physical memory and swap space in the system
  • used column: the physical memory and swap space already in use
  • free column: the physical memory and swap space that is not used at all
  • shared column: the amount of physical memory used by shared memory
  • buff/cache column: the amount of physical memory used by buffers and caches
  • available column: the amount of physical memory that can still be used by applications

Of these, the free and available columns deserve the most attention. free is the amount of physical memory that is not used at all, while available is the memory available from the application's point of view. To improve the performance of disk operations, the Linux kernel uses part of the memory to cache disk data (the buffers and the cache). From the kernel's point of view this memory is in use, but when an application needs memory and there is not enough free memory, the kernel reclaims memory from the buffers and cache to satisfy the request. So from the application's point of view, available = free + buffer + cache. Note that this is only an idealized calculation, and the actual figure can differ considerably.

To query memory usage in Prometheus, use the node_memory_* metrics. As before, we start by calculating the available memory with the PromQL query node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes.

Then divide the available memory by the total memory and subtract the result from 1 to get the usage. The query is (1- (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100, which gives the memory usage of the node as a percentage.
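
As a quick worked example of this formula, the Python sketch below plugs in made-up values. The function name memory_usage_percent and the byte counts are hypothetical; the parameters correspond to node_memory_MemTotal_bytes, node_memory_MemFree_bytes, node_memory_Buffers_bytes, and node_memory_Cached_bytes.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def memory_usage_percent(mem_total, mem_free, buffers, cached):
    """Memory usage in percent, treating free + buffers + cached as available."""
    available = mem_free + buffers + cached
    return (1 - available / mem_total) * 100


if __name__ == "__main__":
    # Hypothetical node: 8 GiB total, 1 GiB free,
    # 0.5 GiB buffers, 2.5 GiB page cache.
    usage = memory_usage_percent(
        mem_total=8 * GIB,
        mem_free=1 * GIB,
        buffers=0.5 * GIB,
        cached=2.5 * GIB,
    )
    print(f"{usage:.1f}%")
```

Here 4 GiB out of 8 GiB count as available, so the node is reported as 50% used even though only 1 GiB is strictly free.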

Of course, to look at an individual memory figure you can use the corresponding metric directly. For example, node_memory_MemTotal_bytes gives the total memory of the node.

Disk monitoring


Next comes disk monitoring. Here we care not only about disk space usage; disk IO monitoring is generally just as necessary.

Disk capacity monitoring

Disk capacity monitoring uses the node_filesystem_* metrics. As with memory, the disk space usage of a node is calculated from the total size and the available space. The available space is given by the node_filesystem_avail_bytes metric. Because the output also includes file systems we don't care about, we can filter with the fstype label to keep only the ones we want, such as ext4 or xfs:

To query disk space usage, use the query statement  (1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 :

In this way, we can get the disk space usage we care about.
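
The filter and the calculation can be mirrored in a short Python sketch. The mount points and sizes below are hypothetical, as is the function name disk_usage_percent. The sketch uses re.fullmatch because PromQL regex matchers are fully anchored, so fstype=~"ext4|xfs" matches only those exact values.

```python
import re

# Same pattern as the fstype=~"ext4|xfs" matcher in the query above.
FSTYPE_RE = re.compile(r"ext4|xfs")


def disk_usage_percent(filesystems):
    """Return {mountpoint: usage_percent} for ext4 and xfs file systems.

    filesystems: list of dicts with the keys mountpoint, fstype,
    size_bytes and avail_bytes (hypothetical field names).
    """
    usage = {}
    for fs in filesystems:
        if FSTYPE_RE.fullmatch(fs["fstype"]) is None:
            continue  # skip tmpfs, overlay and other types we don't care about
        usage[fs["mountpoint"]] = (1 - fs["avail_bytes"] / fs["size_bytes"]) * 100
    return usage


if __name__ == "__main__":
    # Hypothetical file systems on one node.
    filesystems = [
        {"mountpoint": "/", "fstype": "ext4", "size_bytes": 100e9, "avail_bytes": 25e9},
        {"mountpoint": "/run", "fstype": "tmpfs", "size_bytes": 1e9, "avail_bytes": 1e9},
        {"mountpoint": "/data", "fstype": "xfs", "size_bytes": 200e9, "avail_bytes": 150e9},
    ]
    for mountpoint, percent in disk_usage_percent(filesystems).items():
        print(f"{mountpoint}: {percent:.1f}% used")
```

The tmpfs entry is dropped by the filter, leaving one usage value per ext4 or xfs mount point.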

Disk IO monitoring

Disk IO monitoring distinguishes between read IO and write IO: read IO uses the node_disk_reads_completed_total metric, and write IO uses the node_disk_writes_completed_total metric.

Disk read IO can be queried with sum by (instance) (rate(node_disk_reads_completed_total[5m])):

You can of course aggregate by device instead; here we aggregate everything per instance.

Disk write IO can be queried with sum by (instance) (rate(node_disk_writes_completed_total[5m])):

To get the total read and write IO, add the two together: rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
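
To show what these queries compute, here is a rough Python sketch that derives per-second read and write rates from the first and last counter samples of a window and sums them per instance. It ignores counter resets and the extrapolation Prometheus applies, and the function names, device names, and values are all hypothetical.

```python
from collections import defaultdict


def per_second_rate(first, last):
    """Average per-second rate between two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def total_iops_by_instance(devices):
    """Sum read and write operations per second for each instance.

    devices: list of dicts with the keys instance, device, reads and
    writes, where reads/writes are (first_sample, last_sample) pairs.
    """
    totals = defaultdict(float)
    for dev in devices:
        reads = per_second_rate(*dev["reads"])
        writes = per_second_rate(*dev["writes"])
        totals[dev["instance"]] += reads + writes
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical samples at the start and end of a 5 minute (300 s) window.
    devices = [
        {"instance": "node1", "device": "sda",
         "reads": ((0, 1000.0), (300, 4000.0)),    # 10 reads/s
         "writes": ((0, 2000.0), (300, 8000.0))},  # 20 writes/s
        {"instance": "node1", "device": "sdb",
         "reads": ((0, 0.0), (300, 1500.0)),       # 5 reads/s
         "writes": ((0, 0.0), (300, 1500.0))},     # 5 writes/s
    ]
    print(total_iops_by_instance(devices))
```

Grouping by (instance, device) instead of instance alone would give the per-device view mentioned above.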

 

 

 

Network IO monitoring


Received (inbound) bandwidth uses the node_network_receive_bytes_total metric. Because we care most about instantaneous changes in network bandwidth, we generally use the irate function to calculate network IO, which better reflects peaks in network traffic. The query for received bandwidth is sum by(instance) (irate(node_network_receive_bytes_total{device!~"bond.*?|lo"}[5m])):

Transmitted (outbound) bandwidth uses the node_network_transmit_bytes_total metric, and the equivalent query is sum by(instance) (irate(node_network_transmit_bytes_total{device!~"bond.*?|lo"}[5m])):

We can also aggregate per network device instead, and finally convert the result to whatever unit we need.
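
The sketch below illustrates, in Python, how an irate-style calculation differs from an average over the whole window: it uses only the last two samples. It also applies the same device exclusion pattern and converts bytes to bits. This is a simplified model with hypothetical function names, device names, and sample values, not the Prometheus implementation.

```python
import re

# Same pattern as the device!~"bond.*?|lo" matcher; matching devices are skipped.
EXCLUDED_DEVICE_RE = re.compile(r"bond.*?|lo")


def instant_rate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if v1 < v0:
        v0 = 0.0  # counter reset: count from zero
    return (v1 - v0) / (t1 - t0)


def receive_bits_per_second(device_samples):
    """Sum the instantaneous receive rate over all non-excluded devices."""
    total = 0.0
    for device, samples in device_samples.items():
        if EXCLUDED_DEVICE_RE.fullmatch(device):
            continue  # skip bonded and loopback interfaces
        rate = instant_rate(samples)
        if rate is not None:
            total += rate * 8  # bytes/s -> bits/s
    return total


if __name__ == "__main__":
    # Hypothetical received-bytes samples scraped every 15 s.
    device_samples = {
        "eth0": [(0, 0.0), (15, 1_500_000.0), (30, 4_500_000.0)],
        "lo": [(0, 0.0), (15, 9_000_000.0), (30, 18_000_000.0)],
        "bond0": [(0, 0.0), (15, 1_500_000.0), (30, 4_500_000.0)],
    }
    bits = receive_bits_per_second(device_samples)
    print(f"{bits / 1e6:.1f} Mbit/s")
```

Only eth0 is counted, and only its last 15 seconds matter: 3,000,000 bytes in 15 s is 200,000 bytes/s, or 1.6 Mbit/s, whereas an average over the full 30 s would hide that burst.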

Origin blog.csdn.net/qq_34556414/article/details/123443187