Alertmanager official document translation

The original text is slightly abridged

Alarm overview

Alerting with Prometheus is divided into two parts. Alert rules in the Prometheus server send alerts to Alertmanager, which then manages those alerts: silencing, suppression (inhibition), aggregation, and sending notifications via email, on-call notification systems, and chat platforms.

The main steps for setting alarms and notifications are as follows:

  • Set up and configure Alertmanager
  • Configure Prometheus to communicate with Alertmanager
  • Create alert rules in Prometheus

ALERTMANAGER

Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver, such as email or a paging service. It also takes care of silencing and suppressing alerts.

The core concepts of Alertmanager are introduced below. Consult the configuration file to learn more about its usage.

Alarm grouping (Grouping)

Grouping combines alarms of a similar nature into a single notification. This is especially useful during a large outage, when many systems fail at once and hundreds of alarms may fire simultaneously.

**Example:** Your cluster is running hundreds of instances of a service. When a network partition occurs, half of the service instances can no longer reach the database. The alert rules in Prometheus are set to send an alert for each service instance, so hundreds of alerts are sent to Alertmanager.

As a user, you want to receive only one notification while still being able to see exactly which service instances were affected. To achieve this, you can configure Alertmanager to group alerts by cluster name and alert name so that it sends a single notification.

Alarm grouping, the timing of the grouped notifications, and the receivers of those notifications are configured in the routing tree of the configuration file.
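The grouping idea can be illustrated with a small Python sketch. It is not Alertmanager's implementation; the function name `group_alerts` and the sample label values are assumptions made for illustration. Alerts that share the same values for the chosen labels land in the same bucket, which would then produce one notification.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing labels count as empty)."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Three hypothetical alerts from the same cluster with the same alert name.
alerts = [
    {"labels": {"alertname": "DatabaseUnreachable", "cluster": "eu-1", "instance": f"svc-{i}"}}
    for i in range(3)
]

groups = group_alerts(alerts, ["cluster", "alertname"])
for key, members in groups.items():
    # One line per group, listing the affected instances.
    print(key, [a["labels"]["instance"] for a in members])
```

Grouping by cluster and alertname yields a single group here, while the individual instances remain visible inside it.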

Alarm suppression

Alarm suppression (inhibition) means that notifications for certain alarms are suppressed while certain other alarms are already firing.

**Example:** An alarm is firing that reports the entire cluster as unreachable. You can configure Alertmanager to mute all other alarms concerning that cluster while it is firing. This prevents hundreds of notifications for alarms that are unrelated to the actual problem.

Alarm suppression is configured in the configuration file of Alertmanager.

Alarm silence

Silencing mutes alarms for a given period of time. A silence is configured with matchers, just like the routing tree. Each incoming alarm is checked against the equality and regular expression matchers of the active silences. If it matches, no notification is sent for that alarm.

Alarm silence is configured on the web interface of Alertmanager.
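As a rough illustration of matcher-based silencing, the Python sketch below checks an alert's labels against a list of matchers. The dictionary layout (`name`, `value`, `is_regex`) and the function name `is_silenced` are assumptions for this example, and the regular expressions are treated as fully anchored, which is also an assumption of the sketch.

```python
import re


def is_silenced(labels, matchers):
    """Return True when every matcher of the silence matches the alert's labels."""
    for matcher in matchers:
        value = labels.get(matcher["name"], "")
        if matcher.get("is_regex"):
            # Assumption: the whole label value must match the pattern.
            if re.fullmatch(matcher["value"], value) is None:
                return False
        elif value != matcher["value"]:
            return False
    return True


# A hypothetical silence for InstanceDown alerts on database hosts.
silence = [
    {"name": "alertname", "value": "InstanceDown", "is_regex": False},
    {"name": "instance", "value": "db-.*", "is_regex": True},
]

print(is_silenced({"alertname": "InstanceDown", "instance": "db-1"}, silence))
print(is_silenced({"alertname": "InstanceDown", "instance": "web-1"}, silence))
```

Only the first alert satisfies both matchers, so only its notification would be muted.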

Client behavior

Alertmanager has special requirements for client behavior. These are only relevant to advanced use cases in which Prometheus is not used to send alerts.

High availability

Alertmanager supports the configuration of highly available clusters. It can be configured by using the --cluster-* parameters.

Do not load balance traffic between Prometheus and its Alertmanagers. Instead, give Prometheus a list of all Alertmanager instances.

Configuration

Alertmanager is configured through command line parameters and a configuration file. Command line parameters configure immutable system parameters, while the configuration file defines suppression rules, notification routes, and notification receivers.

The visual editor can assist in building the routing tree.

Run alertmanager -h to view all available command line parameters.

Alertmanager can reload its configuration file at runtime. If the new configuration is not well-formed, the changes are not applied and an error is logged. A reload is triggered by sending SIGHUP to the process or by sending an HTTP POST request to the /-/reload endpoint.
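A minimal Python sketch of the HTTP variant follows. The base URL `http://localhost:9093` is an assumption (adjust it to your deployment), and the helper names are made up for this example. The request is only built and printed here; `reload_config` would actually send it.

```python
import urllib.request


def build_reload_request(base_url):
    """Build (but do not send) the POST request that asks Alertmanager to reload."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


request = build_reload_request("http://localhost:9093")
print(request.get_method(), request.full_url)
```

Calling `reload_config("http://localhost:9093")` against a running instance should return 200 when the reload succeeds; a rejected configuration is reported in the Alertmanager log.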

Configuration file

Specify the configuration file to load with the --config.file parameter.

./alertmanager --config.file=alertmanager.yml

The file is written in YAML format and follows the scheme described below. Brackets indicate that a parameter is optional. Parameters that are not set take their default values.

The general placeholders are defined as follows:

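  • <duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
  • <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
  • <labelvalue>: a string of unicode characters
  • <filepath>: a valid path in the current working directory
  • <boolean>: a boolean that is either true or false
  • <string>: a regular string
  • <secret>: a regular string that is a secret, such as a password
  • <tmpl_string>: a string that is expanded as a template
  • <tmpl_secret>: a secret string that is expanded as a template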

Other placeholders are described separately.
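To make the <duration> placeholder concrete, here is a small Python sketch that validates a value against the regular expression quoted above and converts it to milliseconds. The function name and the unit table are assumptions for illustration; in particular, treating `w` as 7 days and `y` as 365 days is an assumption of this sketch.

```python
import re

# The <duration> pattern quoted above, with capture groups for number and unit.
DURATION_RE = re.compile(r"([0-9]+)(ms|[smhdwy])")

# Milliseconds per unit (assumption: w = 7 days, y = 365 days).
UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
    "w": 7 * 24 * 60 * 60 * 1000,
    "y": 365 * 24 * 60 * 60 * 1000,
}


def parse_duration_ms(text):
    """Return the duration in milliseconds, or raise ValueError if it is malformed."""
    match = DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return int(match.group(1)) * UNIT_MS[match.group(2)]


print(parse_duration_ms("30s"), parse_duration_ms("5m"), parse_duration_ms("4h"))
```

The three sample values are the defaults of group_wait, group_interval, and repeat_interval that appear in the route section of this document.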

The global configuration sets the parameters that take effect in the entire configuration. They are also used as default values for other configuration sections.

global:
  # Settings for email receivers
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5. 
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement. 
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # Settings for API-based receivers such as WeChat
  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
 
  # Settings for HTTP-based receivers
  # The default HTTP client configuration
  [ http_config: <http_config> ]

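  # ResolveTimeout is the default value Alertmanager uses if an alert does not
  # include an end time. Once this time has passed without the alert being updated,
  # the alert is declared resolved.
  # This has no impact on alerts from Prometheus, as they always include an end time.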
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of suppression (inhibition) rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

A routing block defines a node and its child nodes in the routing tree. If its optional configuration is not set, it will inherit the configuration of its parent node by default.

Each alarm enters the routing tree at the configured top-level route, which matches all alarms. It then traverses the child nodes. If continue is set to false on a matching child node, traversal stops after that first match. If continue is true on a matching node, the alarm continues to be matched against subsequent siblings. If an alarm does not match any child node of a node, it is handled based on the configuration parameters of the current node.
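The traversal described above can be sketched in Python as follows. This is an illustrative model, not Alertmanager's code: routes are plain dictionaries, only the receiver is inherited, regular expressions are treated as fully anchored, and the function names are assumptions. The receiver names mirror the example configuration in this document.

```python
import re


def node_matches(route, labels):
    """True when all equality and regex matchers of a route node hold for the labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve_receivers(route, labels, inherited=None):
    """Walk the tree depth-first and return the receivers that handle the alert."""
    receiver = route.get("receiver", inherited)
    found = []
    for child in route.get("routes", []):
        if node_matches(child, labels):
            found.extend(resolve_receivers(child, labels, receiver))
            if not child.get("continue", False):
                break  # continue is false: stop at the first matching child
    # No child matched: the current node handles the alert.
    return found or [receiver]


tree = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}

print(resolve_receivers(tree, {"service": "mysql", "team": "frontend"}))
print(resolve_receivers(tree, {"team": "backend"}))
```

With continue left at its default, the first alert is handled by database-pager only, even though it would also match the frontend route. Setting `"continue": True` on the first child would let it reach both receivers.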

[ receiver: <string> ]
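# The labels by which incoming alerts are grouped together. For example,
# multiple alerts with cluster=A and alertname=LatencyHigh are placed in one group.
# To aggregate by all possible labels, use '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables grouping.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# How long to wait before sending the first notification for a group of alerts.
# This allows time for an inhibiting alert to arrive or for more alerts of the
# same group to be collected. (Usually 0s to a few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts added to a
# group for which a notification has already been sent. (Usually 5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again for an alert that has
# already been sent. (Usually 3h or more.)
[ repeat_interval: <duration> | default = 4h ]

# Zero or more child routes.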
routes:
  [ - <route> ... ]

Example

# The root route with all parameters, which child routes inherit unless they override them.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match any of the child routes below stay at the root node
  # and are sent to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are sent to database-pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this child route.
  # They are grouped by product and environment rather than by cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend

<inhibit_rule>

An alarm suppression rule mutes an alarm (the target) that matches one set of matchers while an alarm (the source) exists that matches another set of matchers. The source and target alarms must have the same values for the label names listed in equal.

Semantically, a missing label and a label with an empty value are the same. Therefore, if all the label names listed in equal are missing from both the source and the target alarm, the suppression rule still applies.

To prevent an alarm from suppressing itself, an alarm that matches both the target and the source side of a rule cannot be suppressed by alarms for which the same is true (including itself). Nevertheless, it is recommended to choose the source and target matchers so that no alarm can match both sides.

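# Matchers that an alert has to fulfill to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist
# for the inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have equal values in the source and target alert
# for the inhibition to take effect.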
[ equal: '[' <labelname>, ... ']' ]
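The rule semantics can be modelled with a short Python sketch. It is a simplification under stated assumptions: only the equality matchers (target_match and source_match) are handled, alerts are plain label dictionaries, and the function names and sample labels are invented for illustration.

```python
def side_matches(labels, matchers):
    """True when every equality matcher holds; a missing label counts as empty."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """Decide whether `target` is muted by any alert in `firing` under `rule`."""
    if not side_matches(target, rule["target_match"]):
        return False
    target_is_source = side_matches(target, rule["source_match"])
    for source in firing:
        if not side_matches(source, rule["source_match"]):
            continue
        if target_is_source and side_matches(source, rule["target_match"]):
            continue  # alerts matching both sides do not inhibit each other
        # All labels listed in `equal` must agree (missing == empty).
        if all(source.get(n, "") == target.get(n, "") for n in rule.get("equal", [])):
            return True
    return False


rule = {
    "source_match": {"alertname": "ClusterDown"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}
firing = [{"alertname": "ClusterDown", "severity": "critical", "cluster": "eu-1"}]

print(is_inhibited({"alertname": "HighLatency", "severity": "warning", "cluster": "eu-1"}, firing, rule))
print(is_inhibited({"alertname": "HighLatency", "severity": "warning", "cluster": "us-1"}, firing, rule))
```

The warning in cluster eu-1 is muted because a ClusterDown alert with the same cluster label is firing; the one in us-1 is not.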

<http_config>

http_config configures the HTTP client that a receiver uses to communicate with HTTP-based API services.

# Note that the `basic_auth`, `bearer_token` and `bearer_token_file` options are mutually exclusive.

# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the `Authorization` header with the configured bearer token.
[ bearer_token: <secret> ]

# Sets the `Authorization` header with the bearer token read from the configured file.
[ bearer_token_file: <filepath> ]

# Configures the TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]

<tls_config>

tls_config configures TLS connections.

# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]

The receiver is a named configuration of one or more notification integrations.
We are not actively adding new receivers, and we recommend implementing custom notification integration through webhook receivers.

# The globally unique name of the receiver.
name: <string>

# Configurations for the various notification integrations.
email_configs:
  [ - <email_config>, ... ]
hipchat_configs:
  [ - <hipchat_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]

<email_config>

# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]

# Additional email headers as key/value pairs.
[ headers: { <string>: <tmpl_string>, ... } ]

<webhook_config>

The webhook receiver configures a generic receiver.

# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The endpoint URL to send HTTP POST requests to.
url: <string>

# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]

Alertmanager sends an HTTP POST request to the configured endpoint in JSON format:

{
  "version": "4",
  "groupKey": <string>,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string> // identifies the entity that caused the alert
    },
    ...
  ]
}
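On the receiving side, a webhook endpoint only has to parse this JSON body. The Python sketch below shows one way to do that without any HTTP server; the function name `summarize` and the sample payload values are assumptions for illustration, while the field names come from the format above.

```python
import json


def summarize(body):
    """Parse a webhook body and return a one-line summary of the alert group."""
    payload = json.loads(body)
    firing = sum(1 for alert in payload["alerts"] if alert["status"] == "firing")
    resolved = len(payload["alerts"]) - firing
    return f'{payload["receiver"]}: {payload["status"]}, {firing} firing, {resolved} resolved'


# A hypothetical payload with one firing and one resolved alert.
sample_body = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "team-x",
    "groupLabels": {"alertname": "InstanceDown"},
    "alerts": [
        {"status": "firing", "labels": {"instance": "db-1"}, "annotations": {}},
        {"status": "resolved", "labels": {"instance": "db-2"}, "annotations": {}},
    ],
})

print(summarize(sample_body))
```

In a real receiver the same function would be called with the body of each incoming POST request, and groupKey could be used to deduplicate repeated notifications for the same group.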

<wechat_config>

Send notifications via WeChat API.

# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The API key to use when talking to the WeChat API.
[ api_secret: <secret> | default = global.wechat_api_secret ]

# The WeChat API URL.
[ api_url: <string> | default = global.wechat_api_url ]

# The corp id used for authentication.
[ corp_id: <string> | default = global.wechat_api_corp_id ]

# API request data as defined by the WeChat API.
[ message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: <string> | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: <string> | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}' ]

Send alert

Disclaimer: Prometheus automatically takes care of sending the alerts generated by its configured alert rules. It is strongly recommended to configure alert rules in Prometheus based on time series data rather than implementing a client yourself.

Alertmanager has two APIs, V1 and V2, both of which listen for alerts. The alert format of V1 is described in the code snippet below. V2 is defined as an OpenAPI specification, which can be found in the Alertmanager code base. Clients are expected to re-send an alert continuously for as long as it is still active (usually at an interval of 30 seconds to 3 minutes). A client pushes a list of alerts through a POST request.

The labels of each alert are used to identify identical instances of the alert and to perform deduplication. Annotations are always set to those received most recently and do not identify an alert.

The startsAt and endsAt timestamps are optional. If startsAt is omitted, it is automatically set to the current time. endsAt is only set when the end time of the alert is known. Otherwise, it is set to a configurable timeout period counted from the time the alert was last received.

The generatorURL field is a unique back link that identifies the source of this alert in the client.

[
  {
    "labels": {
      "alertname": "<requiredAlertName>",
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>",
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
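A client-side sketch in Python might build such a list as follows. The helper name `build_alert`, the label values, and the generator URL are assumptions for illustration; the request itself (endpoint and transport) is left out on purpose. endsAt is omitted because the end time is not known, and startsAt is only included when the caller supplies it.

```python
import json
from datetime import datetime, timezone


def build_alert(alertname, labels, annotations, generator_url, starts_at=None):
    """Build one alert object in the list format shown above."""
    alert = {
        "labels": {"alertname": alertname, **labels},
        "annotations": annotations,
        "generatorURL": generator_url,
    }
    if starts_at is not None:
        # isoformat() on a timezone-aware datetime yields an RFC 3339 timestamp.
        alert["startsAt"] = starts_at.isoformat()
    return alert


alert = build_alert(
    "DiskAlmostFull",
    {"instance": "db-1", "severity": "warning"},
    {"summary": "Disk usage above 90%"},
    "http://client.example.org/checks/disk",  # hypothetical back link
    starts_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)

# The request body is a JSON list of alerts.
body = json.dumps([alert])
print(body)
```

Because the labels identify the alert, re-sending the same label set while the condition persists refreshes the existing alert instead of creating a new one.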

Notification template reference

Prometheus creates and sends alerts to Alertmanager, which sends notifications to different receivers based on the labels of the alerts. A receiver can be one of many integrations, including Slack, PagerDuty, email, or a custom integration through the generic webhook interface.

The notifications sent to receivers are constructed from templates. Alertmanager comes with default templates, and they can also be customized. To avoid confusion, note that Alertmanager templates differ from templating in Prometheus, although Prometheus templating also covers the templates in the labels/annotations of alert rules.

Alertmanager's notification templates are based on the Go language template system. Please note that some fields are defined as text, while other fields are defined as HTML, which will affect escaping.

Data structures

Data

Data is the structure passed to notification templates and webhook pushes.

Name | Type | Notes
Receiver | string | The name of the receiver the notification is sent to (slack, email, etc.).
Status | string | firing if at least one alert is firing, otherwise resolved.
Alerts | Alert | List of all alert objects in this group (see below).
GroupLabels | KV | The labels these alerts were grouped by.
CommonLabels | KV | The labels common to all of the alerts.
CommonAnnotations | KV | The set of annotations common to all of the alerts. Used for longer additional strings of information about the alert.
ExternalURL | string | Back link to the Alertmanager that sent the notification.

The Alerts type exposes functions for filtering alerts:

  • Alerts.Firing returns the list of currently firing alert objects in this group
  • Alerts.Resolved returns the list of resolved alert objects in this group

Alert

Alert holds one alert for notification templates.

Name | Type | Notes
Status | string | Whether the alert is currently firing or resolved.
Labels | KV | The set of labels attached to the alert.
Annotations | KV | The set of annotations attached to the alert.
StartsAt | time.Time | The time the alert started firing. If omitted, Alertmanager sets it to the current time.
EndsAt | time.Time | Only set when the end time of the alert is known. Otherwise, it is set to a configurable timeout period counted from the time the alert was last received.
GeneratorURL | string | A back link that identifies the source of the alert.

Key-value pairs (KV)

KV is a set of key/value string pairs used to represent labels and annotations.

type KV map[string]string

An annotation example containing two annotations:

{
  summary: "alert summary",
  description: "alert description",
}

In addition to direct access to the data (labels and annotations) stored as KV, there are methods for sorting, removing, and viewing the label set:

KV methods

Name | Arguments | Returns | Notes
SortedPairs | - | List of key/value string pairs | Returns a sorted list of key/value pairs.
Remove | []string | KV | Returns a copy of the key/value map without the given keys.
Names | - | []string | Returns the names (keys) in the label set.
Values | - | []string | Returns the values in the label set.

Functions

Note that the default functions provided by Go templating are also available.

Strings

Name | Arguments | Returns
title | string | strings.Title, capitalizes the first character of each word.
toUpper | string | strings.ToUpper, converts all characters to upper case.
toLower | string | strings.ToLower, converts all characters to lower case.
match | pattern, string | Regexp.MatchString, matches a string using a regular expression.
reReplaceAll | pattern, replacement, text | Regexp.ReplaceAllString, regular expression substitution, unanchored.
join | sep string, s []string | strings.Join, concatenates the elements of s to create a single string. The separator string sep is placed between elements in the resulting string. (Note: the argument order is reversed for easier pipelining in templates.)
safeHtml | text string | html/template.HTML, marks the string as HTML so that it is not auto-escaped.
stringSlice | ...string | Returns the passed strings as a slice of strings.

Sample notification template

The following are some examples of different alerts and the corresponding settings of the Alertmanager configuration file (alertmanager.yml). Each uses the Go template system.

Custom Slack notification

In this example, we have customized a Slack notification to send a URL to the organization's Wiki about how to handle a specific alert that has been sent.

global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    text: 'https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}'

Access annotations in common annotations

In this example, we again customize the text sent to our Slack receiver, this time accessing the summary and description stored in the CommonAnnotations of the data sent by Alertmanager.

Alert

groups:
- name: Instances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    # Prometheus templates apply here, in the annotation and label fields of the alert.
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: 'Instance {{ $labels.instance }} down'

Receiver

- name: 'team-x'
  slack_configs:
  - channel: '#alerts'
    # Alertmanager templates apply here.
    text: "<!channel> \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"

Traverse all received alarms

Finally, assuming the same alert as in the previous example, we customize the receiver to iterate over all alerts received from Alertmanager and print their respective annotation summaries and descriptions on new lines.

Receiver

- name: 'default-receiver'
  slack_configs:
  - channel: '#alerts'
    title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

Define reusable templates

Going back to the first example, we can also provide a file containing a named template, which is then loaded by Alertmanager, to avoid complex templates that span multiple lines. Create a file at /etc/alertmanager/templates/myorg.tmpl with the following content:

{{ define "slack.myorg.text" }}https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}{{ end}}

The configuration now loads the template with the given name for the "text" field, and we provide the path to the custom template file:

global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    text: '{{ template "slack.myorg.text" . }}'

templates:
- '/etc/alertmanager/templates/myorg.tmpl'

Management API

Alertmanager provides a set of management APIs to simplify automation and integration.

Health check

GET /-/healthy

This endpoint is used for Alertmanager health checks and returns 200 when healthy.

Readiness check

GET /-/ready

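This endpoint is used to check whether Alertmanager is ready to serve traffic (i.e. respond to requests) and returns 200 when it is ready.

Reload

POST /-/reload

This endpoint triggers a reload of the Alertmanager configuration file.

Another way to trigger a configuration reload is to send a SIGHUP signal to the Alertmanager process.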

Origin blog.csdn.net/qq_35753140/article/details/104550726