1. Deploying the Alertmanager Service
1.1 Introduction to Alertmanager
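In the Prometheus ecosystem, alerts are handled by a separate component, Alertmanager. The usual workflow has three steps: first tell Prometheus where Alertmanager is located, then define alerting rules in the Prometheus configuration, and finally configure Alertmanager to process the alerts and deliver them to receivers (email, webhook, Slack, and so on).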
Download links:
https://prometheus.io/download/
https://github.com/prometheus/alertmanager/releases
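1.2 Deploying Alertmanager
Alertmanager does not have to run on the same machine as Prometheus; the two services only need to be able to reach each other over the network.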
# wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
# tar xf alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local/
# ln -s /usr/local/alertmanager-0.20.0.linux-amd64/ /usr/local/alertmanager

# Edit the Alertmanager configuration file
# cd /usr/local/alertmanager
# cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
  smtp_require_tls: false
route:
  group_by: ['alertname']   # labels used to group alerts
  group_wait: 10s           # wait 10s before the first notification for a new group, so alerts arriving in that window are sent together
  group_interval: 10s       # wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 1h       # wait before re-sending a notification for alerts that are still firing
  receiver: 'mail'          # default receiver
receivers:                  # receivers that notifications can be sent to
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'mail'
  email_configs:
  - to: '[email protected]'

# Check the configuration file
# ./amtool check-config alertmanager.yml

# Manage the Alertmanager service with systemd
# cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target

# Start the Alertmanager service
# systemctl daemon-reload
# systemctl start alertmanager
# systemctl enable alertmanager
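The configuration above defines a 'web.hook' receiver pointing at http://127.0.0.1:5001/, although the default route sends everything to 'mail'. If the route's receiver were switched to 'web.hook', something would need to listen on that port. The following is a minimal sketch of such a listener using only the Python standard library. The names summarize_payload, WebhookHandler and serve are hypothetical, and the sketch assumes the JSON layout Alertmanager uses for webhook notifications (a top-level "alerts" list whose items carry "status", "labels" and "annotations").

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_payload(payload):
    """Return one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: prints each alert and answers 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            # Reject bodies that are not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        for line in summarize_payload(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(host="127.0.0.1", port=5001):
    """Block forever, handling webhook POSTs on the configured address."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener on the address used in the 'web.hook' receiver. A real receiver would forward the alert to a chat tool or ticket system instead of printing it.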
2. Integrating Prometheus with Alertmanager
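Because Alertmanager may live on a different host, it can help to confirm from the Prometheus machine that Alertmanager is reachable before wiring the two together. The sketch below is one possible check, not part of the original procedure. It assumes Alertmanager listens on port 9093 and exposes its /-/healthy endpoint; the function names can_connect and is_healthy are hypothetical.

```python
import http.client
import socket
import urllib.request


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_healthy(host, port=9093, timeout=3.0):
    """Return True if GET /-/healthy answers with HTTP 200."""
    url = "http://{}:{}/-/healthy".format(host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors and bad replies.
        return False
```

For the single-host setup used in this article, is_healthy("127.0.0.1") would be the call to try once the service has been started.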
For details on writing alerting rules, see: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
# vim /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093      # address of Alertmanager
rule_files:
  - "rules/*.yml"           # location of the alerting rule files loaded by Prometheus

# mkdir /usr/local/prometheus/rules
# cat /usr/local/prometheus/rules/general.yml
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0           # up is 1 when the instance is healthy, 0 when it is down
    for: 3m                 # the condition must hold for 3 minutes before the alert fires
    labels:
      severity: page        # alert severity
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for 3 minutes."

# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
Checking /usr/local/prometheus/prometheus.yml
  SUCCESS: 1 rule files found

Checking /usr/local/prometheus/rules/general.yml
  SUCCESS: 1 rules found

# systemctl restart prometheus.service
After the restart, the new rule appears in the Prometheus web UI.
Stop one of the monitored services to verify the alert.
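As an alternative to stopping a real service, the notification path (routing and email) can be exercised by pushing a synthetic alert straight to Alertmanager. The sketch below assumes Alertmanager's v2 HTTP API at /api/v2/alerts on 127.0.0.1:9093. The helper names build_test_alert and post_alerts and the label values are made up for illustration. Note that this bypasses Prometheus entirely, so it does not test the alerting rule itself.

```python
import json
import urllib.request


def build_test_alert(alertname="InstanceDown",
                     instance="test-instance:9100",
                     severity="page"):
    """Build the JSON body for one synthetic alert (hypothetical values)."""
    return [{
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {
            "summary": "Instance {} down".format(instance),
        },
    }]


def post_alerts(alerts, base_url="http://127.0.0.1:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

A call such as post_alerts(build_test_alert()) should cause a notification to be sent after group_wait has elapsed. Since no end time is supplied, the alert is expected to resolve on its own once resolve_timeout passes without it being re-sent.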
The alert first enters the PENDING state and then moves to the FIRING state.
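The PENDING phase comes from the rule's for: 3m clause: the expression up == 0 has to stay true across consecutive evaluations for three minutes before the alert is promoted to FIRING. The following sketch simulates that behaviour under simplifying assumptions (one sample per evaluation, evaluations exactly 15 seconds apart). The function alert_states is illustrative only and is not Prometheus code.

```python
def alert_states(up_samples, eval_interval_s=15, for_s=180):
    """Simulate rule states for a series of `up` values.

    Each sample stands for one rule evaluation. Returns a list with
    "inactive", "pending" or "firing" for every evaluation.
    """
    states = []
    active_since = None  # time at which `up == 0` first became true
    for i, up in enumerate(up_samples):
        now = i * eval_interval_s
        if up != 0:
            # Condition is false: the alert resets, whatever state it was in.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states
```

With the settings from this article, an instance that goes down stays pending for twelve evaluations and fires on the thirteenth, 180 seconds after the condition first held. A single successful scrape in between resets the timer.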
Checking the mailbox confirms that the alert email has arrived.