Monitor Prometheus PromQL
^
│ . . . . . . . . . .   node_load1{host="host01",zone="bj"}
│ . . . . . . . . . .   node_load1{host="host02",zone="sh"}
│ . . . . . . . . . .   node_load1{host="host11",zone="sh"}
v
<------- time ---------->
Each point is a sample, and a sample consists of three parts:
- Metric: the metric name plus a label set that describes the characteristics of the sample
- Timestamp: a timestamp with millisecond precision
- Value: the value of the sample at that timestamp
filter
Label filtering, two example requirements:
-- Query the 1-minute load of all machines in Shanghai
node_load1{zone="sh"}
-- Query the 1-minute load of all machines whose host starts with host0
node_load1{host=~"host0.*"}
Operators for label filtering:
- = : equal to
- != : not equal to
- =~ : matches the regular expression
- !~ : does not match the regular expression
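The four matchers can be sketched in Python to make their semantics concrete. This is only an illustration, not how Prometheus is implemented: the helper names matches and select are made up here, Python's re module stands in for the RE2 engine Prometheus uses, and the three series are the ones drawn in the chart at the top. The point to notice is that regular expressions are fully anchored, so host0.* has to match the whole label value.

```python
import re

def matches(labels, name, op, value):
    """Return True if a label set satisfies a single matcher (name op value)."""
    actual = labels.get(name, "")  # a missing label is treated like an empty string
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # fully anchored match
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")

# The three series from the chart, written as label sets.
series = [
    {"__name__": "node_load1", "host": "host01", "zone": "bj"},
    {"__name__": "node_load1", "host": "host02", "zone": "sh"},
    {"__name__": "node_load1", "host": "host11", "zone": "sh"},
]

def select(matchers):
    """Return the hosts of all series that satisfy every matcher."""
    return [s["host"] for s in series if all(matches(s, *m) for m in matchers)]

shanghai = select([("zone", "=", "sh")])
host0 = select([("host", "=~", "host0.*")])
not_shanghai = select([("zone", "!=", "sh")])
not_host0 = select([("host", "!~", "host0.*")])
```

Because the metric name is itself stored as the __name__ label in this sketch, the same matchers function also covers filtering on the metric name.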
Filtering on the metric name through the __name__ label:
{__name__=~"node_load.*", zone="sh"}
The three PromQL expressions above are instant queries (Instant Query) and return instant vectors (Instant Vector)
- Each returns the latest value of every matching series at the query time
Prometheus limits how far back from the query time it looks for that latest value
- Default: 5 minutes
- It is recommended to shorten it to 1 minute with the startup flag
--query.lookback-delta=1m
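The effect of the lookback window can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function name instant_value and the sample data are hypothetical, the window boundaries are chosen as left-open and right-closed for illustration, and staleness markers are ignored entirely.

```python
def instant_value(samples, query_time, lookback=300):
    """Return the newest value inside the lookback window, or None if nothing is recent enough.

    samples is a list of (timestamp_seconds, value) pairs; lookback is in seconds.
    """
    candidates = [(ts, v) for ts, v in samples
                  if query_time - lookback < ts <= query_time]
    if not candidates:
        return None
    return max(candidates)[1]  # tuple comparison picks the latest timestamp

# Hypothetical series that stopped reporting after t=1010.
samples = [(1000, 0.5), (1010, 0.7)]

fresh = instant_value(samples, query_time=1015)                    # last sample is 5s old
with_default = instant_value(samples, query_time=1100)             # 90s old, inside 5 minutes
with_short = instant_value(samples, query_time=1100, lookback=60)  # 90s old, outside 1 minute
```

With the 5-minute default, a series that stopped reporting 90 seconds ago still appears in query results; with a 1-minute window it disappears sooner, which is the motivation for the recommendation above.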
Range Query, returns a Range Vector
- Appending a time range of 1 minute ([1m]) returns multiple points per series
- When data is collected every 10 seconds there are 6 points in 1 minute, and all of them are returned
{__name__=~"node_load.*", zone="sh"}[1m]
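As a rough illustration of the "6 points in 1 minute" statement, the sketch below selects the samples that fall inside a 1-minute window. The function name range_samples and the sample values are hypothetical, and the window is assumed to be left-open and right-closed; the exact boundary handling may differ between Prometheus versions.

```python
def range_samples(samples, query_time, window):
    """Return all (timestamp, value) pairs inside the window that ends at query_time."""
    return [(ts, v) for ts, v in samples
            if query_time - window < ts <= query_time]

# Hypothetical series scraped every 10 seconds.
samples = [(ts, 1.0) for ts in range(1661570800, 1661570901, 10)]

points = range_samples(samples, query_time=1661570908, window=60)
timestamps = [ts for ts, _ in points]
```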
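operator
arithmetic
-- Memory availability: available memory / total memory, multiplied by 100 to show a percentage
mem_available{app="clickhouse"} / mem_total{app="clickhouse"} * 100
-- Outbound NIC rate in the Beijing zone; the raw data is in bytes and network traffic is usually given in bits, so multiply by 8
irate(net_sent_bytes_total{zone="beijing"}[1m]) * 8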
comparison
Comparison operators: greater than (>), less than (<), equal to (==), not equal to (!=), plus >= and <=
- Generally used to write alerting rules
mem_available{app="clickhouse"} / mem_total{app="clickhouse"} * 100 < 20
irate(net_sent_bytes_total{zone="beijing"}[1m]) * 8 / 1024 / 1024 > 700
- expr: the PromQL expression to evaluate; each series it returns triggers its own alert
- for: 1m: the alert fires only after the condition has held continuously for 1 minute
groups:
- name: host
  rules:
  - alert: MemUtil
    expr: mem_available{app="clickhouse"} / mem_total{app="clickhouse"} * 100 < 20
    for: 1m
    labels:
      severity: warn
    annotations:
      summary: Mem available less than 20%, host:{{ $labels.ident }}
logic
There are 3 logical operators, and they work only between instant vectors
- and : intersection
- or : union
- unless : difference (elements on the left that have no match on the right)
Alert only on disks smaller than 200 GB whose utilization exceeds 70%:
disk_used_percent{app="clickhouse"} > 70
and disk_total{app="clickhouse"}/1024/1024/1024 < 200
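The set semantics of and, or and unless can be sketched with Python dictionaries keyed by label set. This is an illustrative model only: the helpers vec, op_and, op_or and op_unless and the host data are hypothetical, and the metric name is left out of the keys because these operators match on labels, not on the name.

```python
def vec(*rows):
    """Build an instant vector as {frozenset(label items): value}."""
    return {frozenset(labels.items()): value for labels, value in rows}

def op_and(left, right):
    """Keep left elements whose label set also exists on the right."""
    return {k: v for k, v in left.items() if k in right}

def op_or(left, right):
    """Keep all left elements, plus right elements with no match on the left."""
    out = dict(left)
    for k, v in right.items():
        out.setdefault(k, v)
    return out

def op_unless(left, right):
    """Keep left elements whose label set does not exist on the right."""
    return {k: v for k, v in left.items() if k not in right}

# Hypothetical results of the two comparison filters.
high_usage = vec(({"host": "a"}, 85.0), ({"host": "b"}, 75.0))   # usage > 70
small_disk = vec(({"host": "b"}, 100.0), ({"host": "c"}, 150.0)) # total < 200

both = op_and(high_usage, small_disk)
either = op_or(high_usage, small_disk)
only_high = op_unless(high_usage, small_disk)
```

Note that and keeps the value from the left-hand side, which is why the alert expression above reports the usage percentage and not the disk size.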
vector matching
For MySQL instances that are slaves (master_server_id > 0), check the value of slave_sql_running
- slave_sql_running == 0 means the slave SQL thread is not running
- on (instance) matches the two sides on the instance label only
mysql_slave_status_slave_sql_running == 0
and ON (instance)
mysql_slave_status_master_server_id > 0
## example series
method_code:http_errors:rate5m{method="get", code="500"} 24
method_code:http_errors:rate5m{method="get", code="404"} 30
method_code:http_errors:rate5m{method="put", code="501"} 3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21
method:http_requests:rate5m{method="get"} 600
method:http_requests:rate5m{method="del"} 34
method:http_requests:rate5m{method="post"} 120
## promql
method_code:http_errors:rate5m{code="500"}
/ ignoring(code)
method:http_requests:rate5m
## result
{method="get"} 0.04 // 24 / 600
{method="post"} 0.05 // 6 / 120
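The one-to-one match with ignoring(code) can be reproduced in Python from the example series above. This is a sketch under the assumption that each match key is unique on both sides; the helper names match_key and divide_ignoring are made up for the illustration.

```python
def match_key(labels, ignoring):
    """Label set used for matching: all labels except the ignored ones."""
    return frozenset((k, v) for k, v in labels.items() if k not in ignoring)

def divide_ignoring(left, right, ignoring):
    """One-to-one division; left elements with no partner on the right are dropped."""
    right_index = {match_key(labels, ignoring): value for labels, value in right}
    result = {}
    for labels, value in left:
        k = match_key(labels, ignoring)
        if k in right_index:
            result[k] = value / right_index[k]
    return result

errors = [
    ({"method": "get", "code": "500"}, 24),
    ({"method": "get", "code": "404"}, 30),
    ({"method": "put", "code": "501"}, 3),
    ({"method": "post", "code": "500"}, 6),
    ({"method": "post", "code": "404"}, 21),
]
requests = [
    ({"method": "get"}, 600),
    ({"method": "del"}, 34),
    ({"method": "post"}, 120),
]

# Equivalent of the {code="500"} selector on the left-hand side.
errors_500 = [(labels, v) for labels, v in errors if labels["code"] == "500"]
ratios = divide_ignoring(errors_500, requests, ignoring={"code"})
by_method = {dict(k)["method"]: v for k, v in ratios.items()}
```

The put and del series do not appear in the result because neither has a partner on the other side after the code label is ignored.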
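-- Many-to-one matching: attach the label_version label from kube_pod_labels to the per-pod 5xx rate
sum(rate(http_request_count{code=~"^(?:5..)$"}[5m])) by (pod)
* on (pod) group_left(label_version) kube_pod_labels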
aggregation
Aggregation operators:
-- Average memory availability of the clickhouse machines
avg(mem_available_percent{app="clickhouse"})
-- Sort the clickhouse machines by memory availability and take the two lowest
bottomk(2, mem_available_percent{app="clickhouse"})
Average memory availability of the clickhouse and canal machines, computed separately:
-- by: the dimensions to group by (the opposite is without)
avg(mem_available_percent{app=~"clickhouse|canal"}) by (app)
Aggregation over a time range:
- target_up[2m] : all data points of the metric in the last 2 minutes
- max_over_time : the maximum value of all points within the time range
max_over_time(target_up[2m])
increase
increase function: computes the increment of a counter; its argument is a range vector
- A range vector returns several value @ timestamp pairs per series
net_bytes_recv{interface="eth0"}[1m] @ 1661570908
965304237246 @1661570850
965307953982 @1661570860
965311949925 @1661570870
965315732812 @1661570880
965319998347 @1661570890
965323899880 @1661570900
increase(net_bytes_recv{interface="eth0"}[1m]) @1661570909
23595160.8
Calculation formula (simplified; the observed slope is extrapolated to the full 60-second window): (last value - first value) / (last timestamp - first timestamp) * 60
(965323899880.0−965304237246.0)÷(1661570900.0−1661570850.0)×60=23595160.8
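The arithmetic above can be checked with a short Python sketch that uses the six samples listed for net_bytes_recv. It implements only the simplified formula given here; the function name simple_increase is made up, and counter resets and the finer points of Prometheus's extrapolation are deliberately ignored.

```python
def simple_increase(samples, window_seconds):
    """(last value - first value) / (last ts - first ts) * window, with no reset handling."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first) * window_seconds

# (timestamp, value) pairs from the range vector shown above.
samples = [
    (1661570850, 965304237246),
    (1661570860, 965307953982),
    (1661570870, 965311949925),
    (1661570880, 965315732812),
    (1661570890, 965319998347),
    (1661570900, 965323899880),
]

observed_delta = samples[-1][1] - samples[0][1]  # raw growth over the 50 observed seconds
result = simple_increase(samples, 60)            # scaled up to the 60-second window
```

The raw growth between the first and last sample is 19662634 bytes over 50 seconds; scaling by 60/50 is what makes the result a non-integer even though the counter itself only takes integer values.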
rate
rate function: computes the per-second rate of change
- Equal to the increase result divided by the size of the range in seconds
rate(net_bytes_recv{interface="eth0"}[1m])
== bool increase(net_bytes_recv{interface="eth0"}[1m])/60.0
- rate function: the average rate over the whole range, so the curve is relatively smooth
- irate function: uses only the last two samples in the range, so the curve changes more sharply
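The difference between the two functions can be illustrated with the net_bytes_recv samples from the increase section. As before this is a simplified sketch: simple_rate and simple_irate are hypothetical helper names, and counter resets and extrapolation details are left out.

```python
def simple_rate(samples):
    """Average per-second slope between the first and last sample in the range."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)

def simple_irate(samples):
    """Per-second slope between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)

samples = [
    (1661570850, 965304237246),
    (1661570860, 965307953982),
    (1661570870, 965311949925),
    (1661570880, 965315732812),
    (1661570890, 965319998347),
    (1661570900, 965323899880),
]

avg_rate = simple_rate(samples)       # same as the increase result divided by 60
instant_rate = simple_irate(samples)  # depends only on the final 10-second interval
```

Here the two results are close, but irate would change sharply if a single interval carried a burst of traffic, while rate would spread that burst over the whole minute.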