PromQL makes it easy to build monitoring visualizations. Come and find out how!

Prometheus has several key design choices: a focus on standards and ecosystem, a dynamic discovery mechanism for monitoring targets, PromQL, and so on.

PromQL is the query language of Prometheus. It is flexible and convenient, but many people don't know how to make good use of it and fail to get its full value.

PromQL is mainly used for querying time series data and doing secondary computation on it.

1 Time series data

Time series data can be understood as a matrix with time as one axis. In the example below there are three time series, each with its own values along the time axis:

^
│     . . . . . . . . . .   node_load1{host="host01",zone="bj"}
│     . . . . . . . . . .   node_load1{host="host02",zone="sh"}
│     . . . . . . . . . .   node_load1{host="host11",zone="sh"}
v
<-------- time --------->

Each point is called a sample.

1.1 Sample

1.1.1 Composition

  • Metric: the metric name plus the label set describing the characteristics of the sample
  • Timestamp: the sample's timestamp, in milliseconds
  • Value: the sample's value at that timestamp
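For instance, a single sample might look like this (the numbers are hypothetical, just to show the three parts together):

# metric:    node_load1{host="host01", zone="bj"}
# timestamp: 1661570850000   (Unix time in milliseconds)
# value:     0.35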

PromQL exists to query and compute over batches of such sample data.

2 Application scenarios

Query and secondary calculation of time series data.

The first core value of PromQL

2.1 Filtering

Filtering is done through query selectors.

Query selectors

Rendering a monitoring chart or evaluating an alert rule only ever touches a limited set of series, so the first requirement of PromQL is filtering.

Suppose I have two requirements:

  • Query the 1-minute load of all machines in Shanghai
  • Query the 1-minute load of all machines whose hostname is prefixed with host0
# = does exact matching on the zone label
node_load1{zone="sh"}

# =~ does regex matching on the host label
node_load1{host=~"host0.*"}

The filter conditions inside the curly braces match on labels. Besides = (equality) and =~ (regex match), the operators also include != (not equal) and !~ (negative regex match), as shown below.
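For example, here is a hypothetical pair of selectors that exclude the Beijing zone and exclude hosts whose name starts with host0:

# != excludes series whose zone label equals "bj"
node_load1{zone!="bj"}

# !~ excludes series whose host label matches the regex
node_load1{host!~"host0.*"}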

The metric name itself can also be written inside the curly braces. For example, to view the load1, load5, and load15 metrics of the Shanghai machines at the same time, you can apply a regex filter on the metric name via the __name__ label:

{__name__=~"node_load.*", zone="sh"}

The three PromQL expressions above are instant queries (Instant Query), and what they return is called an instant vector (Instant Vector).

An instant query returns the latest value. For example, if the query is issued at 10:00, it returns the data corresponding to 10:00. However, monitoring data is reported periodically, not at every moment; there may be no data point exactly at 10:00, in which case Prometheus looks back to 9:59, 9:58, 9:57, and so on, for the most recently reported point.

How far back should it look at most?

This is controlled by the Prometheus startup parameter --query.lookback-delta, which defaults to 5 minutes. From a monitoring point of view, it is recommended to shorten it, for example to 1 minute: --query.lookback-delta=1m.
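A minimal sketch of a startup command with this flag set (the config file path is just an example):

# shorten the lookback window from the 5m default to 1m
prometheus --config.file=prometheus.yml --query.lookback-delta=1m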

Someone used Telegraf for HTTP probing and configured an alert rule: alert only when response_code != 200 holds for 3 consecutive minutes. In reality there was only a single data point with response_code != 200, yet the alert still fired after 3 minutes.

The main reasons:

  • Telegraf's HTTP probe puts the status code in a label by default, which makes the label set unstable (this behavior is not good; it is best to drop such labels directly, or use categraf or blackbox_exporter as the collector). Usually code=200; when there is a problem, code=500. In the Prometheus ecosystem, when a label changes, it is a new time series, as illustrated below.
  • It is also related to query.lookback-delta. Although there is only one abnormal point, i.e. only one point in the code=500 series, every time the alert rule runs its query it finds that abnormal point, and it keeps doing so for 5 minutes. So the rule's condition of "abnormal for 3 consecutive minutes" is satisfied. This is why it is recommended to shorten --query.lookback-delta.
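Sketched below is what such data might look like (hypothetical metric and label names, just to show how a status code stored in a label splits the data into separate series):

# normal probes land in this series
http_probe_result{target="https://example.com", response_code="200"}
# a single failing probe creates a brand-new series, which then sits inside the lookback window
http_probe_result{target="https://example.com", response_code="500"}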

In addition to instant query, PromQL also has range query (Range Query), and the returned content is called Range Vector.

{__name__=~"node_load.*", zone="sh"}[1m]

A range query simply adds a time range, here 1 minute. An instant query returns one point per series; a range query returns multiple points. If data is collected every 10 seconds, there are 6 points in 1 minute, and all of them are returned.

The official Prometheus documentation explains, for each function, whether its parameters expect a Range Vector or an Instant Vector.

Another core value of PromQL

2.2 Calculation

There are arithmetic, comparison, logical, aggregation operators, etc.

Arithmetic operators

These are the common symbols: addition, subtraction, multiplication, division, and modulo.

# Memory availability: available memory divided by total memory,
# multiplied by 100 to present it as a percentage
mem_available{app="clickhouse"} / mem_total{app="clickhouse"} * 100

# Outbound NIC rate in the Beijing zone; the raw data is in bytes,
# network traffic is usually expressed in bits, so multiply by 8
irate(net_sent_bytes_total{zone="beijing"}[1m]) * 8

Comparison operators

Comparison operators are greater than, less than, equal to, not equal to, and so on. They are simple but important: the logic of alert rules is built on comparison operators.

mem_available{app="clickhouse"} / mem_total{app="clickhouse"} * 100 < 20

irate(net_sent_bytes_total{zone="beijing"}[1m]) * 8 / 1024 / 1024 > 700

PromQL with a comparison operator is the core of an alert rule. For example, an alert on memory availability is configured in Prometheus as follows:

groups:
- name: host
  rules:
  - alert: MemUtil
    # the PromQL used for the query
    expr: mem_available{app="clickhouse"} / mem_total{app="clickhouse"} * 100 < 20
    # an occasional dip below 20% is no big deal; alert only when every query
    # over a full minute stays below 20% -- that is what `for: 1m` means
    for: 1m
    labels:
      severity: warn
    annotations:
      summary: Mem available less than 20%, host:{{ $labels.ident }}

The alerting engine executes the query periodically according to the user's configuration:

  • If nothing is found, everything is normal: no machine's memory availability is below 20%.
  • If results are found, the alert is triggered; as many series as are returned, that many alerts fire.

Logical Operators

and, or, and unless operate between instant vectors: and takes the intersection, or the union, and unless the difference.

A usage scenario for and: disk usage. Some partitions are as large as 16 TB while others are as small as 50 GB, so alerting on usage percentage alone is unreasonable. For example, disk_used_percent{app="clickhouse"} > 70 means usage above 70%. For small disks this policy is reasonable, but for large disks 70% usage still leaves plenty of room. So we want to add a constraint: alert only when a disk smaller than 200 GB is more than 70% used. That is what and is for:

disk_used_percent{app="clickhouse"} > 70 and disk_total{app="clickhouse"}/1024/1024/1024 < 200

Vector matching

When operating on two vectors, each entry in the left vector looks for a matching element in the right vector. The matching behavior can be one-to-one, many-to-one, or one-to-many. The disk usage example just shown is a typical one-to-one case: apart from the metric name, the series on both sides carry the same labels, so the correspondence is easy to find. But sometimes we want to take an intersection with and while the labels on the two sides differ. What then?

At this point we can use the keywords on and ignoring to limit the set of labels used for matching.

mysql_slave_status_slave_sql_running == 0
and ON (instance)
mysql_slave_status_master_server_id > 0

What this PromQL wants to express is that if the MySQL instance is a slave (master_server_id>0), check the value of its slave_sql_running. If slave_sql_running==0, it means that the slave sql thread is not running.

However, the labels of the two metrics, mysql_slave_status_slave_sql_running and mysql_slave_status_master_server_id, may not be fully consistent. Fortunately both carry an instance label, and semantically, data sharing the same instance label describes the same instance. So we can use the keyword on to match only on the instance label and ignore the rest.

The opposite of on is the keyword ignoring. As the name suggests, ignoring ignores certain labels and uses the remaining labels for matching. Let's take an example from the Prometheus documentation.

## example series
method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="get", code="404"}  30
method_code:http_errors:rate5m{method="put", code="501"}  3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21
method:http_requests:rate5m{method="get"}  600
method:http_requests:rate5m{method="del"}  34
method:http_requests:rate5m{method="post"} 120

## promql
method_code:http_errors:rate5m{code="500"}
/ ignoring(code)
method:http_requests:rate5m

## result
{method="get"}  0.04            //  24 / 600
{method="post"} 0.05            //   6 / 120

The examples so far are one-to-one correspondences, which are easy to understand. The harder cases are one-to-many and many-to-one, where the keywords group_left and group_right are needed. left and right point to the side with the higher cardinality. Using the same two metrics method_code:http_errors:rate5m and method:http_requests:rate5m, here are a PromQL query with group_left and its output.

## promql
method_code:http_errors:rate5m
/ ignoring(code) group_left
method:http_requests:rate5m

## result
{method="get", code="500"}  0.04            //  24 / 600
{method="get", code="404"}  0.05            //  30 / 600
{method="post", code="500"} 0.05            //   6 / 120
{method="post", code="404"} 0.175           //  21 / 120

For example, for the entry method="get", there is only one record in the right vector but two records in the left vector, so the higher-cardinality side is the left, hence group_left.

Here is another example to illustrate a common usage of group_left and group_right. Suppose we use kube-state-metrics to collect metrics about each object in Kubernetes. Among them is a pod metric called kube_pod_labels, which copies some of the pod's information into labels and always has the value 1; it is essentially a piece of metadata.

kube_pod_labels{
[...]
  label_name="frontdoor",
  label_version="1.0.1",
  label_team="blue",
  namespace="default",
  pod="frontdoor-xxxxxxxxx-xxxxxx",
} = 1

Assume a Pod sits at the access layer and exposes many metrics related to HTTP requests. We want to count the number of 5xx requests and draw a pie chart by Pod version. There is a difficulty here: the access layer's request metrics do not have a version label, and the version information only appears in kube_pod_labels. How do we link the two?

sum(rate(http_request_count{code=~"^(?:5..)$"}[5m])) by (pod)
* on (pod) group_left(label_version)
kube_pod_labels

Let's break this PromQL into pieces. The part before the multiplication sign is the typical way of counting 5xx requests per second, grouped by the pod dimension.

Then we multiply by kube_pod_labels, whose value is always 1. Multiplying by 1 leaves the result unchanged, so it does not affect the values. But kube_pod_labels has many labels that are inconsistent with the labels of the sum's result vector, so on(pod) is used to match only on the pod label.

Finally, group_left(label_version) attaches label_version to the result vector. The higher-cardinality side is obviously the sum part, so group_left is used rather than group_right.

Aggregation operators

A single metric with many series also needs aggregation: for example, viewing the average memory availability of 100 machines, or sorting them and picking the 10 machines with the smallest values.

This is done with PromQL's built-in aggregation operators.

# average memory availability of the clickhouse machines
avg(mem_available_percent{app="clickhouse"})

# sort the clickhouse machines' memory availability and take the two smallest records
bottomk(2, mem_available_percent{app="clickhouse"})
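A couple more aggregation operators, sketched against the same hypothetical metric:

# how many clickhouse machines are reporting the metric
count(mem_available_percent{app="clickhouse"})

# the three machines with the highest availability
topk(3, mem_available_percent{app="clickhouse"})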

Group statistics

To compute the memory availability of the clickhouse and canal machines separately, use by to specify the grouping dimension (without is the opposite of by; see the sketch after the query below).

avg(mem_available_percent{app=~"clickhouse|canal"}) by (app)
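A rough sketch of the same idea with without, assuming each series also carries a host-identifying label such as ident (an assumption here): without drops the listed labels and groups by whatever remains.

# drop the ident label and aggregate over it, keeping app and any other labels
avg(mem_available_percent{app=~"clickhouse|canal"}) without (ident)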

These aggregations can be understood as vertical fitting. The memory availability of 100 machines draws 100 lines on a graph; fitting them into a single line means fitting the 100 points at each moment into 1 point. How do 100 points become 1 point? By taking the average, the maximum, and so on, which is exactly what these aggregation operators do.

Horizontal fitting

These are the <aggregation>_over_time functions. They take a range vector, because a range vector holds multiple values over a period, and the <aggregation> operates on those values.

# [2m]: take all data points of this metric from the last 2 minutes.
# With one sample every 15 seconds, 2 minutes means 8 points.
# max_over_time: take the maximum of those 8 points, i.e. fit each series horizontally.
max_over_time(target_up[2m])
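The other functions in this family work the same way; a quick sketch against the same metric:

# average of the points in the 2-minute window
avg_over_time(target_up[2m])
# smallest point in the 2-minute window
min_over_time(target_up[2m])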

3 Misunderstood functions

increase function

Literally, it computes an increment. It takes a range vector, which is a set of value+timestamp pairs. The intuitive assumption is that subtracting the first value from the last value in the time range gives the increment. It doesn't!


promql: net_bytes_recv{interface="eth0"}[1m] @ 1661570908
965304237246 @1661570850
965307953982 @1661570860
965311949925 @1661570870
965315732812 @1661570880
965319998347 @1661570890
965323899880 @1661570900

promql: increase(net_bytes_recv{interface="eth0"}[1m]) @1661570909
23595160.8

The monitoring data is reported once every 10 s, so although the two PromQL queries were issued at different times:

  • one at 1661570908
  • one at 1661570909

the underlying data is the same: the points from 1661570850 through 1661570900.

Intuitively, increase over these points should just be the last value minus the first value: 965323899880 - 965304237246 = 19662634. But the actual result is 23595160.8, which is quite different.

In fact, the increase query was issued at time 1661570909 with a range of [1m], which tells Prometheus we want the increase between 1661570849 (that is, 1661570909 - 60) and 1661570909. But the raw monitoring data has no values at 1661570849 or 1661570909, so what does Prometheus do? It can only extrapolate from the data it has: take the last value minus the first value, divide by their time difference, and multiply by 60.

(965323899880.0 - 965304237246.0) / (1661570900.0 - 1661570850.0) * 60 = 23595160.8

This is how the 1-minute increase is finally obtained, and why the result is a decimal.

rate function

increase computes the increment over the time range, with data extrapolation.

The rate function computes the per-second rate of change and also extrapolates. The increase result divided by the size of the range vector's time window equals the rate value.

rate(net_bytes_recv{interface="eth0"}[1m])
== bool
increase(net_bytes_recv{interface="eth0"}[1m])/60.0

== is followed by the bool modifier, which means a boolean value is returned: 1 if true, 0 if false. Observing the results, this expression always returns 1, i.e. the two PromQL expressions on either side of the equality are semantically the same.

The rate of change computed by rate is relatively smooth, because the first and last values in the time range are used for extrapolation, so spikes get smoothed out. If we want a more sensitive value, we can use the irate function, which is computed from only the last two values within the time range and therefore changes more sharply. Let's compare the two on the NIC inbound traffic metric.
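A minimal sketch of the comparison, reusing the net_bytes_recv metric from above:

# smoother: extrapolates from the first and last samples in the window
rate(net_bytes_recv{interface="eth0"}[1m]) * 8
# more sensitive: uses only the last two samples in the window
irate(net_bytes_recv{interface="eth0"}[1m]) * 8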

[Figure: NIC inbound traffic, irate (blue, jagged) vs rate (purple, smooth)]

The blue, more jagged line is computed by the irate function, while the smoother purple line is computed by the rate function; the contrast is quite striking.

4 Summary

PromQL core value:

  • Filtering

    Based on query selectors; queries are divided into instant queries and range queries

  • Calculation

    Arithmetic, comparison, logical, and aggregation operators, plus vector matching logic

5 FAQ

Prometheus provides a function called absent, used to alert on missing data. It is widely used, but it has big pitfalls. Here is a question to think about: if I want to alert on missing node_load1 data across 100 machines, how should I configure it? Is absent appropriate for this requirement? And what is the best use case for absent?

For the requirement of alerting on missing node_load1 data across 100 machines, the absent function is not suitable, because:

  1. The absent function monitors whether a metric selector matches no series at all (i.e. the metric does not exist), not whether data from a particular node is missing. If only one node's data is missing for a period of time, the other nodes' series still match, so absent cannot report it correctly.

  2. With multiple nodes involved, each node may fail to report monitoring data to Prometheus for various reasons. To catch this you need a separate alert per node, i.e. a query per node that checks its data is present and distinguishes each node's alert condition.

As for the best use scenario of absent: it can filter out some invalid alerts. For rare events or abnormal data points we want an alert when they appear, but when they are sparse we can be confused by a pile of "false" positives. Here absent comes in handy: it can screen out such sparse data and avoid misjudged alerts, improving alert reliability and accuracy.
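A minimal sketch of how absent is typically written, against a single well-known series (the selector is just an example):

# returns a vector with value 1 only when no series at all matches the selector
absent(node_load1{host="host01"})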
