Prometheus alerting
Prometheus sends alerts through Alertmanager
Steps for setting up alerting:
- Define the alert rules in Prometheus (the rule_files section)
- Configure how Alertmanager handles alerts: grouping, inhibition, and silencing
- Define Alertmanager routes that deliver alert notifications to receivers: email, WeChat, WeChat Work, and so on
Download and install Alertmanager on the monitoring server
Alertmanager can be installed on the same host as Prometheus, or deployed independently on a separate host. Here both run on the same host.
[root@localhost ~]# tar zxf alertmanager-0.19.0.linux-amd64.tar.gz
[root@localhost ~]# mv alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager
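Main Alertmanager configuration file, alertmanager.yml, with annotations:
global: # global settings
  resolve_timeout: 5m # time after which an alert is treated as resolved if it is no longer updated
route: # how alerts are routed to receivers
  group_by: ['alertname'] # group alerts by these labels
  group_wait: 10s # wait before sending the first notification for a group, so that related alerts go out together
  group_interval: 10s # interval between notifications for the same group
  repeat_interval: 10m # interval before a notification is repeated; controls how often alerts are re-sent, adjust as needed
  receiver: 'web.hook' # name of the receiver to use (mail, wechat, and so on)
receivers: # who receives the alerts
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules: # inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']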
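The web.hook receiver above posts JSON to http://127.0.0.1:5001/, so something has to listen there. As a rough illustration only, a minimal receiver built on the Python standard library might look like the sketch below. The names `summarize`, `HookHandler`, and `serve` are made up for this example; the fields read from the payload (`alerts`, `status`, `labels`) follow Alertmanager's webhook format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Return one text line per alert in an Alertmanager webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("{} {} on {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
        ))
    return lines


class HookHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: prints each received alert and answers 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(host="127.0.0.1", port=5001):
    """Listen on the address configured for the web.hook receiver (blocks)."""
    HTTPServer((host, port), HookHandler).serve_forever()
```

Calling `serve()` starts the listener; each notification from Alertmanager is then printed as one line per alert.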
Configure a 163 mailbox to receive alerts
- Configure the Alertmanager service by editing alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25' # 163 SMTP server
  smtp_from: '[email protected]' # sender address
  smtp_auth_username: '[email protected]' # account used for SMTP authentication
  smtp_auth_password: 'XXXXXXXX' # mailbox authorization code, not the login password
  smtp_require_tls: false # whether to require TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 3m # how long to wait before re-sending an alert; raise it to reduce the number of emails
  receiver: 'mail' # receiver that alerts are sent to
receivers:
- name: 'mail' # receiver definition; the name must match the receiver set in route
  email_configs:
  - to: '[email protected]' # recipient address; list multiple recipients on separate lines
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
- Check the configuration:
[root@localhost alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 0 inhibit rules
- 1 receivers
- 0 templates
- Start alertmanager
[root@localhost alertmanager]# ./alertmanager --config.file=./alertmanager.yml &
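Once Alertmanager is running, the receiver can be exercised by hand before Prometheus is connected. The sketch below assumes Alertmanager listens on its default port 9093 on the local host and uses the v2 HTTP API endpoint `/api/v2/alerts`. The alert name `TestAlert`, its labels, and the helper names are invented for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname="TestAlert", minutes=5, now=None):
    """Build a one-element alert list in the shape the v2 API accepts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": "warning"},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": now.isoformat(),
        # After endsAt the alert is treated as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts, base_url="http://127.0.0.1:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Running `post_alerts(build_test_alert())` should, after group_wait has elapsed, produce a notification through whichever receiver the route selects.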
Configure Prometheus to communicate with Alertmanager
The steps above only configure the Alertmanager service; next, configure Prometheus to communicate with Alertmanager.
- Configure the Prometheus alerting rules
Official reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[root@localhost ~]# cd /usr/local/prome/
[root@localhost prome]# vim prometheus.yml
# Enable the alerting section
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093
# Alerting rule files
rule_files:
  - "rules.yml" # relative paths are resolved against the directory of the configuration file
- Edit rules.yml
vim rules.yml
groups: # alert groups
- name: Node
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0 # up is 0 when the target is unreachable and 1 when it is healthy
    for: 5m # the alert fires only if the expression stays true for this long
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
- Check the configuration
[root@localhost prome]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 1 rule files found
Checking rules.yml
SUCCESS: 1 rules found
- Restart Prometheus
systemctl restart prometheus
After the Prometheus service restarts, the rules appear on the Alerts page of the Prometheus web interface.
Test by stopping the monitoring agent on another node
# Stop cadvisor
[root@localhost ~]# docker stop e9e9499bcf2b
e9e9499bcf2b
After a while, the alert at http://192.168.235.130:9090/alerts changes to the Firing state.
At this point the notification has been sent, and the alert email arrives in the mailbox.
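The same check can be scripted instead of watching the web page. The sketch below assumes the Prometheus address used in this walkthrough and reads the `/api/v1/alerts` endpoint of the Prometheus HTTP API. The helper names are illustrative.

```python
import json
import urllib.request


def firing_alerts(response):
    """Return (alertname, instance) pairs for alerts in the firing state."""
    alerts = response.get("data", {}).get("alerts", [])
    return [
        (a["labels"].get("alertname"), a["labels"].get("instance"))
        for a in alerts
        if a.get("state") == "firing"
    ]


def fetch_alerts(base_url="http://192.168.235.130:9090"):
    """Fetch the currently active alerts from the Prometheus HTTP API."""
    with urllib.request.urlopen(base_url + "/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)
```

`firing_alerts(fetch_alerts())` lists the alerts that have left the pending state; an empty list means nothing is firing yet.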
Alert states
- Inactive: the normal state
- Pending: the threshold has been triggered, but the alert duration (for) has not yet been met
- Firing: the threshold has been triggered and the alert duration has been met; the alert is sent to the receiver
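The three states can be pictured as a small state machine driven by rule evaluations. The sketch below is a simplified model, not the Prometheus implementation: it assumes the `for` duration is a whole number of evaluation intervals, and the function name and parameters are invented for illustration.

```python
def alert_state(samples, for_steps):
    """Return the alert state after each rule evaluation.

    samples:   list of booleans, True when the rule expression matched
               (for example `up == 0`) at that evaluation.
    for_steps: number of evaluation intervals covered by the `for` duration.
    """
    states = []
    active = 0  # consecutive evaluations in which the expression matched
    for matched in samples:
        if not matched:
            active = 0
            states.append("inactive")
            continue
        active += 1
        # Fire once the condition has held for the whole `for` window,
        # that is, for_steps intervals after the first match.
        states.append("firing" if active > for_steps else "pending")
    return states
```

With `for_steps=2`, the sequence `[False, True, True, True, False]` goes inactive, pending, pending, firing, inactive: a single failed match resets the alert to inactive.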
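- Grouping: alerts of a similar nature are combined into a single notification
route: # how alerts are routed to receivers
  group_by: ['alertname'] # group alerts by these labels
  group_wait: 10s # wait before sending the first notification for a group, so that related alerts go out together
  group_interval: 10s # interval between notifications for the same group
  repeat_interval: 10m # interval before a notification is repeated; controls how often alerts are re-sent, adjust as needed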
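The effect of group_by can be sketched in a few lines. This is only a model of the idea, with alerts represented as plain label dictionaries and a made-up function name; the real grouping also involves the timers configured above.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (label dicts) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts sharing the same values for every group_by label
        # end up in the same notification.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

With `group_by: ['alertname']`, two InstanceDown alerts from different instances land in one group and would be delivered as a single notification.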
- Inhibition: once an alert is firing, notifications for other alerts caused by it are suppressed, which eliminates redundant alerts
inhibit_rules:
  - source_match: # the higher-severity source alert
      severity: 'critical'
    target_match: # lower-severity alerts that are inhibited and not sent
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance'] # labels that must match between source and target
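The matching logic of an inhibition rule can be sketched as follows. This is a simplified model under the assumption that alerts are plain label dictionaries; the function name and parameters are invented for illustration and mirror the three keys of the rule above.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` is muted by some alert in `firing`."""
    def matches(labels, wanted):
        return all(labels.get(k) == v for k, v in wanted.items())

    # Only alerts matching target_match can be inhibited at all.
    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target or not matches(source, source_match):
            continue
        # A label missing on both sides counts as equal (both read as "").
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False
```

A warning alert is muted only when a critical alert with the same alertname, dev, and instance values is firing; a warning from a different instance is still delivered.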
- Silences: a simple mechanism for muting alerts for a given period of time
Silences are configured by creating a silence rule in the Alertmanager web interface, which is served on port 9093:
http://192.168.235.130:9093/#/silences
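A silence can also be created without the web interface, through the Alertmanager v2 HTTP API at `/api/v2/silences`. The sketch below assumes the Alertmanager address used in this walkthrough. The helper names and the example author and comment values are invented for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence body that mutes alerts whose labels equal `matchers`."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def create_silence(silence, base_url="http://192.168.235.130:9093"):
    """POST the silence to Alertmanager and return the new silence ID."""
    req = urllib.request.Request(
        base_url + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp).get("silenceID")
```

For example, `create_silence(build_silence({"alertname": "InstanceDown"}, 2, "admin", "planned maintenance"))` would mute InstanceDown notifications for two hours; the silence then shows up on the silences page.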