Calculating CPU usage with PromQL, the Prometheus query language

How to calculate CPU usage

I read several articles about querying CPU usage with Prometheus's PromQL, but none of them was particularly thorough. With the help of an English-language article, I finally worked out how this metric is calculated.

cpu mode

A CPU runs in different modes through time-sharing. It is like letting different people take turns on the CPU: Zhang San uses it for a while, then Li Si uses it for a while. These modes can be seen with the top command and include:

  • us: CPU time used by user processes
  • sy: CPU time used by the kernel
  • ni: CPU time used by user-space processes whose priority has been changed (niced)
  • id: idle CPU time (nobody is using the CPU)
  • wa: CPU time spent waiting for I/O
  • hi: CPU time spent on hardware interrupts
  • si: CPU time spent on software interrupts
  • st: CPU time stolen from this virtual machine by the hypervisor

These times add up to the total cpu time. What does this time mean?

cpu time

The main CPU-related metric collected by node-exporter is node_cpu_seconds_total:

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 230416.36
node_cpu_seconds_total{cpu="0",mode="iowait"} 3.86
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 1.05
node_cpu_seconds_total{cpu="0",mode="softirq"} 302.24
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 3829.27
node_cpu_seconds_total{cpu="0",mode="user"} 4802.39
node_cpu_seconds_total{cpu="1",mode="idle"} 230389.47
node_cpu_seconds_total{cpu="1",mode="iowait"} 30.73
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 1.09
node_cpu_seconds_total{cpu="1",mode="softirq"} 486.8
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 3928.86
node_cpu_seconds_total{cpu="1",mode="user"} 4919.87

Each series is the time that one core has spent in one mode, in seconds. Adding up the time of every mode for a given core gives the total number of seconds since the system was booted, which is the uptime (the first column of /proc/uptime; the second column is the sum of the idle time of all cores). This explains why these values are monotonically increasing. For example, node_cpu_seconds_total{cpu="0",mode="idle"} 230416.36 means that cpu0 has been idle for 230416.36 seconds between boot and the current time. Dividing it by the uptime gives the idle rate of cpu0 since boot. If it is 5 o'clock now, checking this value at any time after 5 o'clock will give a result greater than or equal to the current value.
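As a rough check of the statement above, the following Python sketch adds up the cpu0 values from the sample output and divides the idle counter by that sum. Treating the sum of the modes as a stand-in for uptime is an assumption made for illustration, and the variable names are hypothetical.

```python
# Per-mode counters for cpu0, copied from the node-exporter sample above (seconds).
cpu0_seconds = {
    "idle": 230416.36,
    "iowait": 3.86,
    "irq": 0.0,
    "nice": 1.05,
    "softirq": 302.24,
    "steal": 0.0,
    "system": 3829.27,
    "user": 4802.39,
}

# Sum over all modes: approximately the seconds elapsed since boot for this core.
total_seconds = sum(cpu0_seconds.values())

# Fraction of that time cpu0 spent idle since boot.
idle_ratio_since_boot = cpu0_seconds["idle"] / total_seconds

print(f"total: {total_seconds:.2f} s, idle ratio since boot: {idle_ratio_since_boot:.4f}")
```

The total comes to about 239355 seconds (roughly 2.8 days), and cpu0 has been idle for a little over 96% of that time.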

Breaking down the formula for CPU usage

The formula for CPU usage is derived step by step below:

  1. Time cpu0 was idle within 5 minutes: increase(node_cpu_seconds_total{cpu="0",mode="idle"}[5m]). increase means increment. As mentioned earlier, node_cpu_seconds_total is monotonically increasing, so the result is the value at the current time minus the value 5 minutes ago, which is the CPU time spent in the idle state within those 5 minutes.
  2. Fraction of time cpu0 was idle within 5 minutes: sum(increase(node_cpu_seconds_total{cpu="0",mode="idle"}[5m])) / sum(increase(node_cpu_seconds_total{cpu="0"}[5m])). The sum() in the denominator adds up all modes, so the denominator is in effect 5 minutes = 300 seconds. Without sum(), PromQL matches series by their labels and would only divide the idle series by itself.
  3. Fraction of time all CPUs of one host were idle within 5 minutes, using sum() to add up the values of every core: sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum(increase(node_cpu_seconds_total[5m]))
  4. If Prometheus monitors several hosts, sum per host: sum by (instance)(increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum by (instance)(increase(node_cpu_seconds_total[5m]))
  5. CPU usage rate = 1 - CPU idle rate: 100 * (1 - sum by (instance)(increase(node_cpu_seconds_total{mode="idle"}[5m])) / sum by (instance)(increase(node_cpu_seconds_total[5m])))
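The arithmetic behind the steps above can be sketched in Python. The counter snapshots are hypothetical values for a two-core host taken 300 seconds apart, only three modes are shown for brevity, and the increase helper is a simplified stand-in that ignores counter resets and the extrapolation the real increase() performs.

```python
# Hypothetical counter snapshots for one host with two cores, taken 300 s apart.
# Keys are (cpu, mode); values are node_cpu_seconds_total readings in seconds.
before = {
    ("0", "idle"): 1000.0, ("0", "user"): 200.0, ("0", "system"): 100.0,
    ("1", "idle"): 1100.0, ("1", "user"): 150.0, ("1", "system"): 50.0,
}
after = {
    ("0", "idle"): 1240.0, ("0", "user"): 240.0, ("0", "system"): 120.0,
    ("1", "idle"): 1370.0, ("1", "user"): 170.0, ("1", "system"): 60.0,
}


def increase(before, after):
    """Counter delta per series: a simplified increase() with no resets."""
    return {key: after[key] - before[key] for key in after}


deltas = increase(before, after)

# Steps 3 and 4: sum the idle increase and the total increase over all cores.
idle_increase = sum(v for (cpu, mode), v in deltas.items() if mode == "idle")
total_increase = sum(deltas.values())

# Step 5: usage is one minus the idle fraction, expressed as a percentage.
cpu_usage_percent = 100 * (1 - idle_increase / total_increase)

print(f"idle: {idle_increase} s of {total_increase} s, usage: {cpu_usage_percent:.1f}%")
```

With these numbers the two cores accumulate 600 seconds of CPU time in the window, 510 of them idle, so the host's usage works out to 15%.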

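PromQL has functions that compute the per-second rate of increase, rate() and irate(). Since the per-second increase of a core's idle counter is already that core's idle fraction, the formula can be simplified to: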

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) 
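To see why averaging works here, the sketch below computes a per-core idle rate (idle seconds per second) and averages it across cores. The per-core idle increases are hypothetical, and the rate is modeled as a plain average over the whole window, which is closer to rate() than to irate().

```python
# Hypothetical idle increase per core over a 300 s window, in seconds.
window_seconds = 300.0
idle_increase_per_core = {"0": 240.0, "1": 270.0}

# Per-second idle rate of each core: idle seconds per elapsed second,
# which is already that core's idle fraction (between 0 and 1).
idle_rate_per_core = {
    cpu: delta / window_seconds for cpu, delta in idle_increase_per_core.items()
}

# avg by (instance): mean of the per-core idle fractions on one host.
avg_idle_rate = sum(idle_rate_per_core.values()) / len(idle_rate_per_core)

# 100 - idle percentage = usage percentage.
usage_percent = 100 - avg_idle_rate * 100

print(f"per-core idle rate: {idle_rate_per_core}, usage: {usage_percent:.1f}%")
```

Because every core accumulates the same amount of total CPU time per second, the mean of the per-core idle fractions equals the host-wide idle fraction, which is why no explicit denominator is needed.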

irate and rate

The official documentation defines irate() as the per-second instant rate of increase of a time series in a range, calculated from the last two data points. For a long time I could not work out which two data points these were or how they were determined. Then it dawned on me that they are simply the last two samples that were scraped, and the interval between them is scrape_interval. With irate(), as long as the range in the square brackets is longer than scrape_interval, queries made within the same scrape_interval return the same result no matter how long the range is. For example, with scrape_interval set to 30s, putting 1m, 2m, 5m, or 10m in the square brackets gives the same result, as long as the queries are made within the same 30s. The range must be longer than scrape_interval so that it contains at least two data points; making it at least twice scrape_interval is safer.

rate() calculates the average per-second rate of increase of a time series over a range. It is essentially based on the first and last data points in the range. If the range contains only two data points, its result is essentially the same as that of irate(). Its graph is smoother, which makes it better suited to alerting.
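The difference between the two functions can be illustrated with a simplified Python model. The sample values, the window helper, and the function names are hypothetical; the model ignores counter resets and the extrapolation that the real rate() applies, so it only shows which data points each function looks at.

```python
# Hypothetical (timestamp_seconds, counter_value) samples, scraped every 30 s.
samples = [(0, 100.0), (30, 103.0), (60, 106.0), (90, 109.0), (120, 130.0)]


def window(samples, now, range_seconds):
    """Samples that fall inside the range ending at `now` (simplified selector)."""
    return [s for s in samples if now - range_seconds < s[0] <= now]


def simple_rate(points):
    """Average per-second increase between the first and last point."""
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(points):
    """Per-second increase between the last two points only."""
    (t_prev, v_prev), (t_last, v_last) = points[-2], points[-1]
    return (v_last - v_prev) / (t_last - t_prev)


one_minute = window(samples, now=120, range_seconds=60)
two_minutes = window(samples, now=120, range_seconds=120)

# irate is the same for both ranges: it only sees the samples at t=90 and t=120.
print(simple_irate(one_minute), simple_irate(two_minutes))

# rate changes with the range: a longer window smooths out the final jump.
print(simple_rate(one_minute), simple_rate(two_minutes))
```

In this model the one-minute window holds only two samples, so both functions agree there, while the two-minute window gives a lower, smoother value for the rate-style calculation.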


Origin blog.csdn.net/qq_35753140/article/details/105121525