Prometheus Study Notes (3): Mail Alarm Rule Configuration

Like a whale into the sea, like a bird into the forest

1. Prometheus email alarm

Prometheus's alerting architecture is divided into two parts. Alert rules are defined in Prometheus Server, which generates the alerts, and the Alertmanager component processes the alerts that Prometheus produces. Alertmanager is the unified processing center for alerts in the Prometheus system. It provides a variety of built-in third-party notification methods as well as Webhook notifications, through which users can implement more personalized extensions.

Introduction to Prometheus alerting
The alerting capability is split into two independent parts in the Prometheus architecture. By defining AlertRules in Prometheus, Prometheus periodically evaluates the alert rules, and if a trigger condition is met, it sends the alert to Alertmanager.
An alert rule in Prometheus mainly consists of the following parts:

1. Alert name: the user needs to name the alert rule, and the name should directly convey what the alert is about.
2. Alert rule: the rule itself is defined mainly by a PromQL expression (expr) together with a duration (for): how long the expression's query result must hold before the alert is triggered.

In Prometheus, a set of related alert rules can also be defined together through a Group (rule group); these definitions are managed in YAML files.

Alertmanager, as an independent component, is responsible for receiving and processing alert information from Prometheus Server (or other client programs). Alertmanager can process these alerts further: for example, when a large number of duplicate alerts arrive it can deduplicate them, and it can group alerts and route them to the correct notification receiver. Prometheus has built-in support for notification methods such as email and Slack, and also supports Webhook integration for more customized scenarios. For example, Alertmanager currently has no built-in DingTalk support, so users can integrate a DingTalk robot through a Webhook to receive alerts in DingTalk. Alertmanager also provides silencing and suppression mechanisms to optimize notification behavior.
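Since these notes are about mail alerting, here is a minimal sketch of an email receiver in alertmanager.yml; the SMTP server, credentials and addresses below are placeholders for illustration, not values taken from this setup:

global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP server (STARTTLS port)
  smtp_from: 'alert@example.com'           # placeholder sender address
  smtp_auth_username: 'alert@example.com'  # placeholder SMTP account
  smtp_auth_password: 'password'           # placeholder SMTP password
  smtp_require_tls: true                   # default; adjust according to your SMTP server

route:
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'ops@example.com'                  # placeholder recipient
    send_resolved: true                    # also send a notification when the alert resolves

With such a receiver in place, the Webhook placeholder used later in this article can be swapped for real email notifications.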

Features of Alertmanager
In addition to basic alert notification capability, Alertmanager also provides features such as grouping, suppression, and silence:
Grouping
The grouping mechanism combines related, detailed alerts into a single notification. In some cases, for example when a system goes down, a large number of alerts are triggered at the same time; grouping merges them into one notification so that you are not flooded with separate notifications and unable to quickly locate the problem.
For example, suppose a cluster runs hundreds of service instances and an alert rule is defined for each instance. If a network failure occurs, many of these instances may be unable to connect to the database, and hundreds of alerts will be sent to Alertmanager.
As a user, you probably only want a single notification that tells you which service instances are affected. In that case, alerts can be grouped by service cluster or by alert name, so that they are combined into one notification.
Grouping, the notification timing, and the receivers are all configured through Alertmanager's configuration file, as sketched below.
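A minimal sketch of the grouping-related settings on a route (the label names and timings are illustrative examples, not values from this article):

route:
  receiver: 'email'
  group_by: ['alertname', 'job']  # alerts sharing these label values are merged into one notification
  group_wait: 30s                 # wait before sending the first notification for a new group
  group_interval: 5m              # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h             # wait before re-sending a notification for alerts that are still firing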
Suppression
Suppression (inhibition) is a mechanism that, once a certain alert has been sent, stops other alerts caused by it from being sent repeatedly.
For example, when a cluster becomes unreachable an alert is triggered, and Alertmanager can be configured to ignore all other alerts related to that cluster. This avoids a flood of notifications that are unrelated to the actual root cause.
The suppression mechanism is also set in Alertmanager's configuration file (inhibit_rules, shown in the example configuration later in this article).
Silence
Silencing provides a simple mechanism to quickly mute alerts based on labels: if an incoming alert matches a silence, Alertmanager will not send a notification for it.
Silences are created on Alertmanager's web page (they can also be managed from the command line with amtool, as sketched below).
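For reference, a silence can also be created from the command line with amtool, which ships with Alertmanager; a sketch (the matcher, duration and comment are illustrative):

# silence all alerts named hostCpuUsageAlert for 2 hours
amtool silence add alertname="hostCpuUsageAlert" \
  --comment="planned maintenance" \
  --duration=2h \
  --alertmanager.url=http://127.0.0.1:9093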

2. Customize Prometheus alarm rules

Alert rules in Prometheus allow you to define alert trigger conditions based on PromQL expressions. The Prometheus backend evaluates these rules periodically, and when a trigger condition is met an alert is fired. By default, users can view the alert rules and their firing state in the Prometheus web interface. After Prometheus is associated with Alertmanager, alerts can be sent to external services such as Alertmanager and processed further there.

Define alarm rules
A typical alarm rule is as follows:

groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
      description: description info

In the alarm rule file, we can define a group of related rule settings under a group. In each group, we can define multiple alarm rules (rules). An alarm rule mainly consists of the following parts:

alert: the name of the alert rule.
expr: the alert trigger condition, a PromQL expression used to determine whether any time series currently satisfies the condition.
for: evaluation wait time, an optional parameter. The alert is sent only after the trigger condition has held for this duration; while waiting, newly generated alerts are in the pending state.
labels: custom labels, allowing the user to attach an additional set of labels to the alert.
annotations: an additional set of information, such as text describing the alert in detail. The contents of annotations are sent to Alertmanager as parameters when the alert is generated.

For Prometheus to use the defined alert rules, we need to specify the paths of a set of rule files via rule_files in the Prometheus global configuration file. After starting, Prometheus automatically scans the rules defined in the files under these paths and evaluates them to decide whether to send notifications:

rule_files:
  [ - <filepath_glob> ... ]

By default, Prometheus evaluates these alert rules every minute. To use a different evaluation period, override the default with evaluation_interval:

global:
  [ evaluation_interval: <duration> | default = 1m ]
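Besides the global setting, the evaluation interval can also be overridden per rule group via the group-level interval field; a brief sketch reusing the earlier example rule (the 30s value is illustrative):

groups:
- name: example
  interval: 30s   # this group is evaluated every 30s, overriding the global evaluation_interval
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m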

Templating
In the annotations of an alert rule, summary describes the summary of the alert and description describes it in detail; the Alertmanager UI also displays alerts based on these two values. To make alerts more readable, Prometheus supports templating in label and annotation values.

The value of any label of the current alert instance can be accessed through the $labels.<labelname> variable, and $value holds the sample value computed by the current PromQL expression.

# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}

For example, the readability of summary and description content can be optimized through templating:

groups:
- name: example
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

Viewing alert status
Users can view all the alert rules defined in the current Prometheus and their current active status through the Alerts menu in the Prometheus web interface.
At the same time, for pending or firing alarms, Prometheus will also store them in the time series ALERTS{}.
You can query alarm instances through expressions:

ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}

A sample value of 1 indicates that the current alarm is active (pending or firing). When the alarm transitions from an active state to an inactive state, the sample value is 0.
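For example, the following query (illustrative) counts the currently firing alerts per alert name:

count(ALERTS{alertstate="firing"}) by (alertname)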

Example: define host monitoring alerts
Modify the Prometheus configuration file prometheus.yml and add the following configuration:

rule_files:
  - /usr/local/prometheus/*.rules

Create an alert file alert.rules in the directory /usr/local/prometheus/ with the following content:

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[1m]))) by (instance) > 0.5
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 50% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.7
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 70% (current value: {{ $value }})"
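Before restarting Prometheus, it is worth validating the rule file syntax with promtool, which ships with Prometheus (the path matches the example above):

promtool check rules /usr/local/prometheus/alert.rules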

After restarting Prometheus, visit the Prometheus UI at http://127.0.0.1:9090/rules to view the currently loaded rule file. (I lowered the threshold here for testing; don't worry if it doesn't match the configuration file above.)
Switch to the Alerts tab http://127.0.0.1:9090/alerts to view the active status of the current alert.

At this point, we can manually increase the system's CPU usage to verify the Prometheus alerting flow. Run the following command on the host:

cat /dev/zero>/dev/null

After running the command, check the CPU usage.
When Prometheus first detects that the trigger condition is met, hostCpuUsageAlert shows up as an active alert. Since the rule specifies a 1m wait time, the alert state is initially PENDING. If the condition is still met after 1 minute, the alert actually fires and its state becomes FIRING.

3. Deploy Alertmanager
Alertmanager, like Prometheus Server, is implemented in Go and has no third-party dependencies. Generally speaking, Alertmanager can be deployed in the following ways: binary package, container, or installation from source.

Commonly used Alertmanager command-line parameters:

      -h, --help                 Show context-sensitive help (also try --help-long and --help-man).
      --config.file="alertmanager.yml"
                                 Alertmanager configuration file name.
      --storage.path="data/"     Base path for data storage.
      --data.retention=120h      How long to keep data for.
      --alerts.gc-interval=30m   Interval between alert GC.
      --web.external-url=WEB.EXTERNAL-URL
                                 The URL under which Alertmanager is externally reachable (for example, if Alertmanager
                                 is served via a reverse proxy). Used for generating relative and absolute links back to
                                 Alertmanager itself. If the URL has a path portion, it will be used to prefix all HTTP
                                 endpoints served by Alertmanager. If omitted, relevant URL components will be derived
                                 automatically.
      --web.route-prefix=WEB.ROUTE-PREFIX
                                 Prefix for the internal routes of web endpoints. Defaults to the path of --web.external-url.
      --web.listen-address=":9093"
                                 Address to listen on for the web interface and API.
      --web.get-concurrency=0    Maximum number of GET requests processed concurrently. If negative or zero, the limit
                                 is GOMAXPROCS or 8, whichever is larger.
      --web.timeout=0            Timeout for HTTP requests. If negative or zero, no timeout is set.
      --cluster.listen-address="0.0.0.0:9094"
                                 Listen address for the cluster. Set to an empty string to disable HA mode.
      --cluster.advertise-address=CLUSTER.ADVERTISE-ADDRESS
                                 Explicit address to advertise in the cluster.
      --cluster.peer=CLUSTER.PEER ...
                                 Initial peers (may be repeated).
      --cluster.peer-timeout=15s
                                 Time to wait between peers to send notifications.
      --cluster.gossip-interval=200ms
                                 Interval between sending gossip messages. Lowering this value (more frequent) propagates
                                 gossip messages across the cluster more quickly, at the expense of increased bandwidth.
      --cluster.pushpull-interval=1m0s
                                 Interval for gossip state syncs. Setting this lower (more frequent) speeds up convergence
                                 for larger clusters, at the expense of increased bandwidth usage.
      --cluster.tcp-timeout=10s  Timeout for establishing a stream connection with a remote node for a full state sync,
                                 and for stream read and write operations.
      --cluster.probe-timeout=500ms
                                 Timeout to wait for an ack from a probed node before assuming it is unhealthy. This
                                 should be set to the 99th percentile of RTT (round-trip time) on your network.
      --cluster.probe-interval=1s
                                 Interval between random node probes. Setting this lower (more frequent) causes the
                                 cluster to detect failed nodes more quickly, at the expense of increased bandwidth usage.
      --cluster.settle-timeout=1m0s
                                 Maximum time to wait for cluster connections to settle before evaluating notifications.
      --cluster.reconnect-interval=10s
                                 Interval between attempts to reconnect to lost peers.
      --cluster.reconnect-timeout=6h0m0s
                                 Length of time to attempt to reconnect to a lost peer.
      --log.level=info           Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt        Output format of log messages. One of: [logfmt, json]
      --version                  Show application version.

Install the binary package, create the data directory, and set up a systemd unit:

tar xf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/ && cd /usr/local/
ln -s /usr/local/alertmanager-0.21.0.linux-amd64/ alertmanager
mkdir -p /usr/local/alertmanager/data/
vim /etc/systemd/system/alertmanager.service
Add the following content:
[Unit]
Description=alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager-0.21.0.linux-amd64/alertmanager --config.file=/usr/local/alertmanager/alert-test.yml --storage.path=/usr/local/alertmanager/data/
Restart=on-failure

[Install]
WantedBy=multi-user.target

--config.file specifies the path of the Alertmanager configuration file.
--storage.path specifies the data storage path.

Create an Alertmanager configuration file
The extracted Alertmanager directory contains a default alertmanager.yml configuration file. Copy it and edit it as follows:

cd /usr/local/alertmanager
cp alertmanager.yml alert-test.yml
vim alert-test.yml
The content is as follows:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

The configuration of Alertmanager mainly consists of two parts: route and receivers. All alarm information will enter the routing tree from the top-level route in the configuration, and the alarm information will be sent to the corresponding receiver according to the routing rules.

A set of receivers can be defined in Alertmanager. For example, multiple receivers can be split by role (such as system operations and database administrators). A receiver can be associated with email, Slack and other ways of receiving alert information.
The current configuration file defines one receiver named web.hook, which here only forwards alerts to a local Webhook address (http://127.0.0.1:5001/), so it effectively serves as a placeholder until a real notification method (such as email) is configured.

The top-level route is defined using route in the configuration file. The route is a tree structure based on label matching rules: all alerts start at the top-level route, enter different sub-routes according to label matching rules, and are sent to the receiver set by the matching sub-route. Currently the configuration file only sets a single top-level route whose receiver is web.hook, so all alerts are sent to web.hook. A sketch of a routing tree with sub-routes follows.
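As a hedged sketch of what a routing tree with sub-routes could look like (the receiver names, label values and addresses are hypothetical, not part of this article's setup):

route:
  receiver: 'default-receiver'   # fallback receiver for alerts that match no sub-route
  group_by: ['alertname']
  routes:
  - match:
      severity: 'critical'       # critical alerts go to the on-call mailbox
    receiver: 'oncall-email'
  - match_re:
      job: 'mysql.*'             # database-related jobs go to the DBA team
    receiver: 'dba-email'

receivers:
- name: 'default-receiver'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'oncall-email'
  email_configs:
  - to: 'oncall@example.com'
- name: 'dba-email'
  email_configs:
  - to: 'dba@example.com'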

Alertmanager saves its data locally; the default storage path is data/. Therefore, you need to create the corresponding directory before starting Alertmanager.

Start Alertmanager

chown -R prometheus:prometheus /usr/local/alertmanager-0.21.0.linux-amd64/
systemctl daemon-reload
systemctl start alertmanager.service
systemctl status alertmanager.service

After Alertmanager is started, it can be accessed on port 9093, e.g. http://192.168.0.10:9093.
Default web listening port: 9093; cluster listening port: 9094. A quick command-line check is sketched below.
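As a quick sanity check (a sketch, assuming Alertmanager runs locally with the configuration file used above), the configuration can be validated with amtool and the built-in health endpoint queried with curl:

# validate the configuration file syntax
amtool check-config /usr/local/alertmanager/alert-test.yml
# Alertmanager exposes a simple health endpoint
curl http://127.0.0.1:9093/-/healthy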
The alerts received by Alertmanager can be viewed under the Alerts menu. Under the Silences menu, you can create silences through the UI.

Associating Prometheus with Alertmanager
Alerting is split into two separate parts in the Prometheus architecture: Prometheus is responsible for generating alerts, and Alertmanager is responsible for processing them after they are generated. Therefore, after Alertmanager is deployed, the Alertmanager information needs to be configured in Prometheus.

Edit the Prometheus configuration file prometheus.yml and add the following content:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Restart the Prometheus service. After success, you can check whether the alerting configuration takes effect from http://192.168.0.10:9090/config.
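Independently of Prometheus, a test alert can also be pushed straight to Alertmanager to confirm that routing and notification work; a sketch using Alertmanager's v1 alerts API (still available, though deprecated, in 0.21; the alert name is made up for the test):

curl -XPOST http://127.0.0.1:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning"}, "annotations": {"summary": "manually injected test alert"}}]'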

At this point, try again to manually increase the system CPU usage:

cat /dev/zero>/dev/null

Wait for the Prometheus alert to be triggered, then check the Alertmanager UI: the alert information received by Alertmanager is displayed there.

Origin blog.csdn.net/ZhanBiaoChina/article/details/107026815