Slightly abridged from the original.

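Alerting overview

Alerting with Prometheus is separated into two parts. Alerting rules in the Prometheus server send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.

The main steps to set up alerting and notifications are:

Set up and configure the Alertmanager
Configure Prometheus to talk to the Alertmanager
Create alerting rules in Prometheus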

ALERTMANAGER

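The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver, such as email or a webhook. It also takes care of silencing and inhibition of alerts.

The following describes the core concepts the Alertmanager implements. Consult the configuration section to learn how to use them in more detail.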

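Grouping

Grouping categorizes alerts of similar nature into a single notification. This is especially useful during large outages, when many systems fail at once and a flood of alerts fires simultaneously.

**Example:** Hundreds or thousands of instances of a service are running in your cluster when a network partition occurs. Half of your service instances can no longer reach the database. Alerting rules in Prometheus were configured to send an alert for each service instance, so hundreds or thousands of alerts are sent to Alertmanager.

As a user, you only want a single notification that still lists exactly which service instances were affected. You can therefore configure Alertmanager to group alerts by cluster and alertname, so that it sends a single compact notification.

Grouping of alerts, the timing of the grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.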
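
The grouping idea can be sketched in a few lines of Python. This is only an illustration of bucketing alerts by the values of the group_by labels, not Alertmanager's implementation; the alert data and the helper name group_alerts are made up for the example, and treating a missing label as an empty string is an assumption.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    A missing label is treated as an empty string (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts: three service instances lose their database connection.
alerts = [
    {"labels": {"alertname": "DatabaseUnreachable", "cluster": "A", "instance": "svc-1"}},
    {"labels": {"alertname": "DatabaseUnreachable", "cluster": "A", "instance": "svc-2"}},
    {"labels": {"alertname": "DatabaseUnreachable", "cluster": "B", "instance": "svc-3"}},
]

groups = group_alerts(alerts, ["cluster", "alertname"])
for key, members in sorted(groups.items()):
    # One notification per group, listing every affected instance.
    print(key, "->", [a["labels"]["instance"] for a in members])
```

Each key of the resulting dictionary corresponds to one notification, and the alerts stored under it are the instances that notification would list.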

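Inhibition

Inhibition suppresses notifications for certain alerts if certain other alerts are already firing.

**Example:** An alert is firing that reports an entire cluster as unreachable. Alertmanager can be configured to mute all other alerts concerning this cluster while that alert is firing. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.

Inhibitions are configured in the Alertmanager configuration file.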

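Silences

Silences are a straightforward way to mute alerts for a given time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked against all the equality and regular expression matchers of an active silence. If they all match, no notification is sent for that alert.

Silences are configured in the web interface of the Alertmanager.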
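
As a rough illustration of the matcher check described above, the following Python sketch tests an alert's labels against a list of silences. The dictionary layout (name, value, is_regex) is an assumption made for this example, the start and end times of a silence are ignored, and regular expressions are modelled as matching the whole label value.

```python
import re

def matcher_matches(matcher, labels):
    """Check one matcher against an alert's labels (missing label = empty string)."""
    value = labels.get(matcher["name"], "")
    if matcher.get("is_regex", False):
        # Modelled as a full match of the label value (assumption of this sketch).
        return re.fullmatch(matcher["value"], value) is not None
    return value == matcher["value"]

def is_silenced(alert_labels, silences):
    """An alert is silenced if ALL matchers of at least one silence match it."""
    return any(
        all(matcher_matches(m, alert_labels) for m in silence["matchers"])
        for silence in silences
    )

# Hypothetical silence: mute HighLatency on every instance whose name starts with "db-".
silences = [
    {"matchers": [
        {"name": "alertname", "value": "HighLatency"},
        {"name": "instance", "value": "db-.*", "is_regex": True},
    ]},
]

print(is_silenced({"alertname": "HighLatency", "instance": "db-1"}, silences))
print(is_silenced({"alertname": "HighLatency", "instance": "web-1"}, silences))
```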

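Client behavior

The Alertmanager has special requirements for the behavior of its clients. These are only relevant for advanced use cases where Prometheus is not used to send alerts.

High availability

Alertmanager supports being configured as a cluster for high availability. This is done with the --cluster-* flags.

Do not load balance traffic between Prometheus and its Alertmanagers. Instead, point Prometheus to a list of all Alertmanagers.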

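Configuration

Alertmanager is configured via command-line flags and a configuration file. The command-line flags configure immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.

The visual editor can assist in building routing trees.

To view all available command-line flags, run alertmanager -h.

Alertmanager can reload its configuration at runtime. If the new configuration is not well-formed, the changes are not applied and an error is logged. A reload is triggered by sending a SIGHUP to the process or by sending an HTTP POST request to the /-/reload endpoint.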
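
A reload over HTTP might be prepared like this in Python. The sketch only builds the POST request; the base URL http://localhost:9093 is an assumption for a local instance, and the helper name build_reload_request is hypothetical. Sending is left as a commented-out line so the example runs without a live Alertmanager.

```python
import urllib.request

def build_reload_request(base_url):
    """Build (but do not send) a POST request to the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")

# Assumed address of a locally running Alertmanager.
request = build_reload_request("http://localhost:9093")
print(request.get_method(), request.full_url)

# To actually trigger the reload against a running instance:
# urllib.request.urlopen(request, timeout=5)
```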

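Configuration file

To specify which configuration file to load, use the --config.file flag.

./alertmanager --config.file=alertmanager.yml

The file is written in YAML format, defined by the scheme described below. Brackets indicate that a parameter is optional. Parameters that are not listed are set to their default values.

Generic placeholders are defined as follows:

<duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
<labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: a string of unicode characters
<filepath>: a valid path in the current working directory
<boolean>: a boolean that can take the values true or false
<string>: a regular string
<secret>: a regular string that is a secret, such as a password
<tmpl_string>: a string which is template-expanded before usage
<tmpl_secret>: a string which is template-expanded before usage and is a secret

The other placeholders are specified separately.

The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections.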
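
The <duration> placeholder can be checked with the regular expression quoted above. The short Python sketch below applies exactly that pattern to whole strings; the helper name is_valid_duration is made up for the example.

```python
import re

# The pattern given for the <duration> placeholder, applied to the whole string.
DURATION_RE = re.compile(r"[0-9]+(ms|[smhdwy])")

def is_valid_duration(text):
    """True if text is a single number followed by one unit (ms, s, m, h, d, w, y)."""
    return DURATION_RE.fullmatch(text) is not None

for candidate in ["30s", "5m", "100ms", "4h", "5 minutes", "m5"]:
    print(candidate, is_valid_duration(candidate))
```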

global:
  # Settings for email receivers.
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5. 
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement. 
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # Settings for API-based receivers such as Slack and WeChat.
  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
 
  # Settings for HTTP-based receivers.
  # The default HTTP client configuration
  [ http_config: <http_config> ]

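  # ResolveTimeout is the default value used by Alertmanager if the alert does
  # not include an end time. After this time passes, the alert can be declared
  # resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include an end time.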
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

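<route>

A route block defines a node in a routing tree and its children. Its optional configuration parameters are inherited from its parent node if not set.

Every alert enters the routing tree at the configured top-level route, which must match all alerts. It then traverses the child nodes. If continue is set to false, it stops after the first matching child. If continue is true on a matching node, the alert continues matching against subsequent siblings. If an alert does not match any children of a node, the alert is handled based on the configuration parameters of the current node.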

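[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
# To aggregate by all possible labels, use the special value '...' as the
# sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation, passing all alerts through as-is.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# How long to initially wait before sending a notification for a group of
# alerts. This allows time for an inhibiting alert to arrive or for more
# alerts of the same group to be collected. (Usually ~0s to a few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that are
# added to a group for which an initial notification has already been sent.
# (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent for an alert. (Usually ~3h or more.)
[ repeat_interval: <duration> | default = 4h ]

# Zero or more child routes.
routes:
  [ - <route> ... ]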

Example

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes remain at the
  # root node and are dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend

<inhibit_rule>

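An inhibition rule mutes an alert (the target) matching a set of matchers when an alert (the source) exists that matches another set of matchers. Both target and source alerts must have the same label values for the label names in the equal list.

Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in equal are missing from both the source and the target alert, the inhibition rule applies.

To prevent an alert from inhibiting itself, an alert that matches both the target and the source side of a rule cannot be inhibited by alerts for which the same is true (including itself). It is nevertheless recommended to choose target and source matchers so that alerts never match both sides.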

# Matchers that have to be fulfilled by the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
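
The rule semantics can be sketched in Python as follows. This is an illustrative model only: alerts are plain label dictionaries, the function names are made up, regular expressions are treated as matching the whole label value, and the example rule (critical alerts inhibit warnings in the same cluster) is hypothetical.

```python
import re

def side_matches(labels, rule, side):
    """Check the <side>_match and <side>_match_re matchers (side is 'source' or 'target')."""
    for name, value in rule.get(side + "_match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in rule.get(side + "_match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True

def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert inhibits the target under this rule."""
    if not side_matches(target, rule, "target"):
        return False
    target_is_also_source = side_matches(target, rule, "source")
    for source in firing_alerts:
        if not side_matches(source, rule, "source"):
            continue
        # An alert matching both sides cannot be inhibited by another such alert.
        if target_is_also_source and side_matches(source, rule, "target"):
            continue
        # Every label in 'equal' must agree; a missing label counts as empty.
        if all(source.get(n, "") == target.get(n, "") for n in rule.get("equal", [])):
            return True
    return False

# Hypothetical rule: a critical alert mutes warnings from the same cluster.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}
firing = [{"alertname": "ClusterDown", "severity": "critical", "cluster": "A"}]

print(is_inhibited({"alertname": "HighLatency", "severity": "warning", "cluster": "A"}, firing, rule))
print(is_inhibited({"alertname": "HighLatency", "severity": "warning", "cluster": "B"}, firing, rule))
```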

<http_config>

An http_config configures the HTTP client that a receiver uses to communicate with HTTP-based API services.

# Note that `basic_auth`, `bearer_token` and `bearer_token_file` options are
# mutually exclusive.

# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the `Authorization` header with the configured bearer token.
[ bearer_token: <secret> ]

# Sets the `Authorization` header with the bearer token read from the configured file.
[ bearer_token_file: <filepath> ]

# Configures the TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]

<tls_config>

A tls_config configures TLS connections.

# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false ]

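<receiver>

A receiver is a named configuration of one or more notification integrations.
New receivers are not actively being added; implementing custom notification integrations via the webhook receiver is recommended.

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.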
email_configs:
  [ - <email_config>, ... ]
hipchat_configs:
  [ - <hipchat_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]

<email_config>

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]

# Further headers as email header key/value pairs.
[ headers: { <string>: <tmpl_string>, ... } ]

<webhook_config>

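The webhook receiver allows configuring a generic receiver.

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The endpoint to send HTTP POST requests to.
url: <string>

# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]

The Alertmanager sends HTTP POST requests to the configured endpoint in the following JSON format: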

{
  "version": "4",
  "groupKey": <string>,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string> // identifies the entity that caused the alert
    },
    ...
  ]
}
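
A webhook endpoint receives this structure as the request body. The Python sketch below parses such a body and condenses it into one line; the sample values (receiver name, URLs, timestamps) and the helper name summarize are invented for illustration, and a real endpoint would read the body from the incoming HTTP request instead.

```python
import json

def summarize(payload):
    """Condense a webhook payload into a one-line summary."""
    firing = [a for a in payload["alerts"] if a["status"] == "firing"]
    resolved = [a for a in payload["alerts"] if a["status"] == "resolved"]
    name = payload["groupLabels"].get("alertname", "<no alertname>")
    return (f"[{payload['status'].upper()}] {name}: "
            f"{len(firing)} firing, {len(resolved)} resolved "
            f"(receiver={payload['receiver']})")

# Hypothetical request body following the format shown above.
body = """
{
  "version": "4",
  "groupKey": "example-group-key",
  "status": "firing",
  "receiver": "team-x",
  "groupLabels": {"alertname": "InstanceDown"},
  "commonLabels": {"alertname": "InstanceDown", "severity": "page"},
  "commonAnnotations": {},
  "externalURL": "http://alertmanager.example.org",
  "alerts": [
    {"status": "firing", "labels": {"instance": "a"}, "annotations": {},
     "startsAt": "2024-01-01T00:00:00Z", "endsAt": "0001-01-01T00:00:00Z",
     "generatorURL": "http://prometheus.example.org/graph"},
    {"status": "resolved", "labels": {"instance": "b"}, "annotations": {},
     "startsAt": "2024-01-01T00:00:00Z", "endsAt": "2024-01-01T00:10:00Z",
     "generatorURL": "http://prometheus.example.org/graph"}
  ]
}
"""

summary = summarize(json.loads(body))
print(summary)
```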

<wechat_config>

WeChat notifications are sent via the WeChat API.

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The API key to use when talking to the WeChat API.
[ api_secret: <secret> | default = global.wechat_api_secret ]

# The WeChat API URL.
[ api_url: <string> | default = global.wechat_api_url ]

# The corp id for authentication.
[ corp_id: <string> | default = global.wechat_api_corp_id ]

# API request data as defined by the WeChat API.
[ message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: <string> | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: <string> | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}' ]

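Sending alerts

Disclaimer: Prometheus automatically takes care of sending alerts generated by its configured alerting rules. It is highly recommended to configure alerting rules in Prometheus based on time series data rather than implementing a direct client.

The Alertmanager has two APIs, v1 and v2, both of which accept alerts. The v1 alert format is described in the code snippet below. The v2 format is specified as an OpenAPI specification that can be found in the Alertmanager repository. Clients are expected to continuously re-send alerts as long as they are still active (usually every 30 seconds to 3 minutes). Clients push a list of alerts via a POST request.

The labels of each alert identify identical instances of an alert and are used for deduplication. The annotations are always set to those received most recently and do not identify an alert.

Both the startsAt and endsAt timestamps are optional. If startsAt is omitted, the Alertmanager assigns the current time. endsAt is only set if the end time of an alert is known. Otherwise it is set to a configurable timeout period after the time the alert was last received.

The generatorURL field is a unique back-link that identifies the entity causing this alert in the client.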

[
  {
    "labels": {
      "alertname": "<requiredAlertName>",
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>"
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
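
For a custom client, the request body might be assembled as in this Python sketch. It only builds the JSON list in the format above; the alert name, labels, and URL are made-up sample values, build_alert is a hypothetical helper, and endsAt is deliberately left out because the end time is not known.

```python
import json
from datetime import datetime, timezone

def build_alert(alertname, labels, annotations, starts_at=None, generator_url=None):
    """Build one alert object; the optional fields are only added when provided."""
    alert = {
        "labels": {"alertname": alertname, **labels},
        "annotations": dict(annotations),
    }
    if starts_at is not None:
        # isoformat() on a UTC datetime yields an RFC 3339 timestamp.
        alert["startsAt"] = starts_at.astimezone(timezone.utc).isoformat()
    if generator_url is not None:
        alert["generatorURL"] = generator_url
    return alert

alert = build_alert(
    "DiskFull",
    labels={"instance": "node-1", "severity": "page"},
    annotations={"summary": "Disk almost full on node-1"},
    starts_at=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    generator_url="http://client.example.org/checks/disk",
)

# The API expects a list of alerts as the POST body.
body = json.dumps([alert]).encode("utf-8")
print(body.decode("utf-8"))
```

Because clients are expected to re-send active alerts, a real client would post this body repeatedly at its chosen interval for as long as the condition holds.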

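Notification template reference

Prometheus creates and sends alerts to the Alertmanager, which then sends notifications out to different receivers based on their labels. A receiver can be one of many integrations, including Slack, PagerDuty, email, or a custom integration via the generic webhook interface.

The notifications sent to receivers are constructed via templates. The Alertmanager comes with default templates, but they can also be customized. To avoid confusion, note that Alertmanager templates differ from templating in Prometheus, although Prometheus templating also covers the templating in alert rule labels/annotations.

The Alertmanager's notification templates are based on the Go templating system. Note that some fields are evaluated as text and others as HTML, which affects escaping.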

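Data structures

Data

Data is the structure passed to notification templates and webhook pushes.

Name	Type	Notes
Receiver	string	The name of the receiver the notification is sent to (slack, email, etc.).
Status	string	firing if at least one alert is firing, otherwise resolved.
Alerts	Alert	List of all alert objects in this group (see below).
GroupLabels	KV	The labels these alerts were grouped by.
CommonLabels	KV	The labels common to all of the alerts.
CommonAnnotations	KV	Set of annotations common to all of the alerts. Used for longer additional strings of information about the alert.
ExternalURL	string	Backlink to the Alertmanager that sent the notification.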

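The Alerts type exposes functions for filtering alerts:

Alerts.Firing returns a list of currently firing alert objects in this group
Alerts.Resolved returns a list of resolved alert objects in this group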

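Alert

Alert holds one alert for notification templates.

Name	Type	Notes
Status	string	Whether the alert is currently firing or resolved.
Labels	KV	The set of labels attached to the alert.
Annotations	KV	The set of annotations attached to the alert.
StartsAt	time.Time	The time the alert started firing. If omitted, the Alertmanager assigns the current time.
EndsAt	time.Time	Only set if the end time of an alert is known. Otherwise set to a configurable timeout period after the time the alert was last received.
GeneratorURL	string	A backlink that identifies the entity causing this alert.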

KV

KV is a set of key/value string pairs used to represent labels and annotations.

type KV map[string]string

Annotation example containing two annotations:

{
  summary: "alert summary",
  description: "alert description",
}

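In addition to direct access to the data (labels and annotations) stored as KV, there are methods for sorting, removing, and viewing the label sets:

KV methods

Name	Arguments	Returns	Notes
SortedPairs	-	List of key/value string pairs	Returns a sorted list of key/value pairs.
Remove	[]string	KV	Returns a copy of the key/value map without the given keys.
Names	-	[]string	Returns the names of the label names in the label set.
Values	-	[]string	Returns a list of the values in the label set.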
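
To make the behavior of these methods concrete, here is a rough Python analogue operating on a plain dictionary. It mirrors the descriptions in the table only; the function names are made up, and the plain alphabetical ordering by key is an assumption of this sketch rather than a statement about the Go implementation.

```python
def sorted_pairs(kv):
    """Return the key/value pairs sorted by key (plain alphabetical order assumed)."""
    return sorted(kv.items())

def remove(kv, keys):
    """Return a copy of kv without the given keys; kv itself is left untouched."""
    return {k: v for k, v in kv.items() if k not in keys}

def names(kv):
    """Return the keys, in sorted-pair order."""
    return [k for k, _ in sorted_pairs(kv)]

def values(kv):
    """Return the values, in sorted-pair order."""
    return [v for _, v in sorted_pairs(kv)]

annotations = {"summary": "alert summary", "description": "alert description"}

print(sorted_pairs(annotations))
print(remove(annotations, ["summary"]))
print(names(annotations), values(annotations))
```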

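Functions

Note that the default functions provided by Go templating are also available.

Strings

Name	Arguments	Returns
title	string	strings.Title, capitalises the first character of each word.
toUpper	string	strings.ToUpper, converts all characters to upper case.
toLower	string	strings.ToLower, converts all characters to lower case.
match	pattern, string	Regexp.MatchString, matches a string using a regular expression.
reReplaceAll	pattern, replacement, text	Regexp.ReplaceAllString, regular expression substitution, unanchored.
join	sep string, s []string	strings.Join, concatenates the elements of s to create a single string. The separator string sep is placed between elements in the resulting string. (Note: the argument order is inverted for easier pipelining in templates.)
safeHtml	text string	html/template.HTML, marks the string as HTML that does not require auto-escaping.
stringSlice	...string	Returns the passed strings as a slice of strings.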

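Notification template examples

The following are different examples of alerts and the corresponding Alertmanager configuration file setups (alertmanager.yml). Each uses the Go templating system.

Customizing Slack notifications

In this example we customize our Slack notification to send a URL to our organization's wiki page on how to deal with the particular alert that has been sent.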

global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    text: 'https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}'

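Accessing annotations in CommonAnnotations

In this example we again customize the text sent to our Slack receiver, this time accessing the summary and description stored in the CommonAnnotations of the data sent by the Alertmanager.

Alert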

groups:
- name: Instances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    # Prometheus templates apply here in the annotation and label fields of the alert.
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: 'Instance {{ $labels.instance }} down'

Receiver

- name: 'team-x'
  slack_configs:
  - channel: '#alerts'
    # Alertmanager templates apply here.
    text: "<!channel> \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"

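Ranging over all received alerts

Finally, assuming the same alert as in the previous example, we customize our receiver to range over all of the alerts received from the Alertmanager, printing their respective annotation summaries and descriptions on new lines.

Receiver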

- name: 'default-receiver'
  slack_configs:
  - channel: '#alerts'
    title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

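Defining reusable templates

Going back to our first example, we can also provide a file containing named templates, which the Alertmanager then loads, in order to avoid complex templates that span many lines. Create a file with the following content: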

{{ define "slack.myorg.text" }}https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}{{ end }}

The configuration now loads the template with the given name for the "text" field, and we provide the path to our custom template file:

global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    text: '{{ template "slack.myorg.text" . }}'

templates:
- '/etc/alertmanager/templates/myorg.tmpl'

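Management API

Alertmanager provides a set of management APIs to ease automation and integration.

Health check

GET /-/healthy

This endpoint is used to check Alertmanager health and returns 200 when it is healthy.

Readiness check

GET /-/ready

This endpoint is used to check whether Alertmanager is ready to serve traffic (i.e. respond to requests) and returns 200 when it is.

Reload

POST /-/reload

This endpoint triggers a reload of the Alertmanager configuration file.

An alternative way to trigger a configuration reload is to send a SIGHUP signal to the Alertmanager process.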

Translated from the official Alertmanager documentation.