Prometheus study notes (4): Alertmanager configuration and routing overview

Like a whale into the sea, like a bird into the forest

In Alertmanager, how alerts are handled is defined by routes (Route). A route is a tree-structured set of label-matching rules: the labels carried by an incoming alert determine which handling path it takes.

Alertmanager is responsible for the unified handling of alerts generated by Prometheus, so its configuration generally consists of the following main parts:

Global configuration (global): defines global public parameters, such as the global SMTP configuration, Slack configuration, etc.;
Templates (templates): defines the templates used for alert notifications, such as HTML templates, email templates, etc.;
Alert routing (route): determines how the current alert should be handled based on label matching;
Receivers (receivers): a receiver is an abstract concept; it can be a mailbox, WeChat, Slack, a webhook, etc. Receivers are generally used together with alert routes;
Inhibition rules (inhibit_rules): well-designed inhibition rules can reduce the amount of noisy, redundant alert notifications.

The complete configuration format is as follows:

global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ smtp_from: <tmpl_string> ] 
  [ smtp_smarthost: <string> ] 
  [ smtp_hello: <string> | default = "localhost" ]
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <secret> ]
  [ smtp_auth_identity: <string> ]
  [ smtp_auth_secret: <secret> ]
  [ smtp_require_tls: <bool> | default = true ]
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ http_config: <http_config> ]

templates:
  [ - <filepath> ... ]

route: <route>

receivers:
  - <receiver> ...

inhibit_rules:
  [ - <inhibit_rule> ... ]

The parameter to note in the global configuration is resolve_timeout. It defines how long Alertmanager waits without receiving updates for an alert before marking that alert as resolved. This value affects how quickly a recovery notification can be delivered, so users should tune it to their own scenario; the default is 5 minutes.
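
For example, a global block that extends this timeout (the 10m value is purely illustrative):

global:
  # mark an alert as resolved if no update for it arrives within 10 minutes
  resolve_timeout: 10m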

Label-based alert routing
The Alertmanager configuration defines an alert routing tree based on label-matching rules, which determines how Alertmanager handles each alert it receives:

route: <route>

Here, route defines the routing match rules for alerts and the receiver to which Alertmanager should send matched alerts. The simplest route definition is as follows:

route:
  group_by: ['alertname']
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'

As shown above, the configuration file defines only a single route, which means that every alert Prometheus sends to Alertmanager is delivered through the receiver named web.hook, defined here as a webhook address. In real scenarios, alert handling is rarely this simple: alerts of different severities may call for completely different treatment. A route can therefore define further sub-routes, each of which matches alerts by label and decides how to handle them. The complete definition of route is as follows:

[ receiver: <string> ]
[ group_by: '[' <labelname>, ... ']' ]
[ continue: <boolean> | default = false ]

match:
  [ <labelname>: <labelvalue>, ... ]

match_re:
  [ <labelname>: <regex>, ... ]

[ group_wait: <duration> | default = 30s ]
[ group_interval: <duration> | default = 5m ]
[ repeat_interval: <duration> | default = 4h ]

routes:
  [ - <route> ... ]

Route matching
Each alert enters the routing tree at the top-level route in the configuration file. Note that the top-level route must match all alerts (i.e. it must not set match or match_re), and each route can define its own receiver and matching rules. By default, after entering the top-level route, an alert traverses the child nodes until the deepest matching route is found, and the alert is sent to the receiver defined by that route. With the default continue: false, matching stops at the first child node that matches; with continue: true, the alert goes on to be matched against the subsequent sibling nodes as well, as sketched below. If an alert matches no child node at all, it is handled according to the receiver configured on the current routing node.
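
A minimal sketch of the continue behavior (the receiver names ops-pager and audit-log are hypothetical and would also have to be declared under receivers):

route:
  receiver: 'default'
  routes:
  # first matching child: with continue set to true, evaluation
  # does not stop here but proceeds to the next sibling
  - receiver: 'ops-pager'
    match:
      severity: critical
    continue: true
  # because of continue: true above, a critical alert is sent here as well
  - receiver: 'audit-log'
    match:
      severity: critical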

There are two ways to match an alert; both are illustrated below.
The first is string equality: a match rule checks whether the alert carries the label labelname and whether its value equals labelvalue.
The second is regular-expression matching: a match_re rule checks whether the value of the alert's label matches the given regular expression.
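
For instance (the service values are illustrative):

# string equality: matches only alerts carrying exactly service="mysql"
match:
  service: mysql

# regular expression: matches service="mysql" as well as service="cassandra"
match_re:
  service: mysql|cassandra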

For an alert whose notification has already been sent successfully, the repeat_interval parameter sets how long to wait before that notification is sent again while the alert keeps firing; it is annotated together with the other timers in the sketch at the end of the next section.

Alert grouping
Alertmanager can group alert notifications, combining multiple alerts into a single notification. Grouping rules are defined with group_by: alerts that share the same values for all the label names listed in group_by are merged into one notification and sent to the receiver.

Sometimes, to collect more related alerts into a single notification, a waiting time can be set with the group_wait parameter: if new alerts for the group arrive within this window, they are merged into the same initial notification to the receiver.

The group_interval configuration defines the minimum interval between two successive notifications for the same group, for example when new alerts join a group that has already been notified. All three timers are annotated in the sketch below.
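
The three timers together (the values shown are the documented defaults):

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s      # buffer the first alerts of a new group before the initial notification
  group_interval: 5m   # minimum wait before notifying about new alerts joining an already-notified group
  repeat_interval: 4h  # re-send a still-firing, already-notified group at most every 4 hours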

For example, suppose Prometheus monitors several clusters together with the applications and database services deployed in them, and we define the following routing rules to send notifications about anomalies in those clusters:

route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend

By default, all alerts are sent to the cluster administrators through default-receiver; accordingly, the root route of the Alertmanager configuration groups alerts by cluster and alert name (group_by: [cluster, alertname]).

If an alert comes from a database service such as MySQL or Cassandra, it should instead be sent to the database administrators (database-pager). A dedicated sub-route is defined for this: if the alert carries a service label whose value is mysql or cassandra, the notification is sent to database-pager. Since this sub-route defines no group_by of its own, that setting is inherited from the parent route, so database-pager receives notifications grouped by cluster and alertname.

Some alert rules originate with development teams, which mark the alerts they create by attaching a team label. A separate sub-route in the Alertmanager configuration handles this kind of notification: if an alert carries the label team with the value frontend, Alertmanager groups it by the labels product and environment instead. When an application misbehaves, the developers can then see exactly which application in which environment has the problem and quickly locate it.
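
The routing tree can be checked without firing real alerts: amtool, the CLI shipped with Alertmanager, can resolve which receiver a given label set would reach. Assuming the configuration above is saved as alertmanager.yml (the label values here are made up), the command prints the matched receiver:

$ amtool config routes test --config.file=alertmanager.yml service=mysql cluster=prod
database-pager
$ amtool config routes test --config.file=alertmanager.yml team=frontend product=shop
frontend-pager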

Sample configuration for email alerts:

$ cd /usr/local/alertmanager
$ vim alert-test.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxx' # the mailbox's authorization code
  smtp_require_tls: false
templates:
  - '/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver
receivers:
- name: 'default-receiver'
  email_configs:
  - to: '[email protected]'
    html: '{{ template "alert.html" . }}'
    headers:
      Subject: "Prometheus[WARN] test alert email"

Help document: https://yunlzheng.gitbook.io/prometheus-book/

Original article: https://blog.csdn.net/ZhanBiaoChina/article/details/107047101