Prometheus + Alertmanager alert configuration, part 2

Prometheus alerting

Prometheus sends its alerts through Alertmanager.

Steps to set up alerting:

  • Define the alert rules in the files listed under rule_files in the Prometheus configuration
  • Configure how Alertmanager handles alerts: grouping, inhibition, and silencing
  • Define routes in Alertmanager that deliver alert notifications to receivers such as email or WeChat Work

Download and install Alertmanager on the monitoring server
Alertmanager can be installed on the same host as Prometheus or deployed independently on a separate host. Here both run on one host.

[root@localhost ~]# tar zxf alertmanager-0.19.0.linux-amd64.tar.gz 
[root@localhost ~]# mv alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager

The main Alertmanager configuration file, alertmanager.yml, with annotations:

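global:                # global configuration
  resolve_timeout: 5m  # an alert that is no longer reported is declared resolved after this time

route:             # routing rules for sending alerts
  group_by: ['alertname']  # group alerts by these labels
  group_wait: 10s          # wait before the first notification for a new group, so related alerts go out together
  group_interval: 10s      # wait before notifying about new alerts added to an existing group
  repeat_interval: 10m      # wait before re-sending a notification; controls alert frequency, tune as needed
  receiver: 'web.hook'     # receiver to use; must match a name under receivers (e.g. a mail or wechat receiver)
receivers:        # who the alerts are sent to
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:   # inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']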
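
The web.hook receiver above posts alerts as JSON to http://127.0.0.1:5001/. As a rough sketch of what could listen on that address, the Python script below prints one line per alert. The names summarize, AlertHandler, and serve are made up for this illustration, and the payload fields used (alerts, status, labels) are the ones Alertmanager's webhook format is assumed to carry.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Return one text line per alert found in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("[{}] {} on {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: prints each received alert and answers 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(host="127.0.0.1", port=5001):
    """Listen on the address used in webhook_configs above (blocks forever)."""
    HTTPServer((host, port), AlertHandler).serve_forever()
```

Calling serve() starts the listener; it is only meant for checking that notifications arrive, not as a production receiver.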

Configure a 163 mailbox to receive alerts

  • Modify alertmanager.yml to configure the Alertmanager service:
global:
  resolve_timeout: 5m

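  smtp_smarthost: 'smtp.163.com:25' # SMTP server of the 163 mail service
  smtp_from: '[email protected]'  # sender address
  smtp_auth_username: '[email protected]' # account used for SMTP authentication
  smtp_auth_password: 'XXXXXXXX'   # mailbox authorization code, not the login password
  smtp_require_tls: false   # whether to require TLS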

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
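  repeat_interval: 3m  # how long to wait before sending the same alert again; limits email frequency
  receiver: 'mail'    # receiver the alerts are sent to

receivers:
- name: 'mail'        # receiver definition; the name must match the receiver set in route
  email_configs: 
  - to: '[email protected]' # recipient address; for several recipients, add one entry per line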
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
  • Check the configuration:
[root@localhost alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 1 receivers
 - 0 templates
  • Start alertmanager
[root@localhost alertmanager]# ./alertmanager --config.file=./alertmanager.yml &

Configure Prometheus to communicate with Alertmanager

The steps above only configure the Alertmanager service. Next, configure Prometheus to communicate with Alertmanager.

  • Configure the Prometheus alerting rules
    Official reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[root@localhost ~]# cd /usr/local/prome/
[root@localhost prome]# vim prometheus.yml

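# Enable the following configuration section

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

# Alerting rule files
rule_files:
   - "rules.yml"   # by default, looked up in the same directory as the configuration file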
  • Edit rules.yml

vim rules.yml

groups:     # alert groups
- name: Node
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0   # up is 0 when the target is unreachable and 1 when it is healthy
    for: 5m  # hold duration: the alert fires only if the expression stays true for this long
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  • Check the configuration
[root@localhost prome]# ./promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: 1 rule files found

Checking rules.yml
  SUCCESS: 1 rules found
  • Restart Prometheus
systemctl restart prometheus

After Prometheus restarts, the rules are visible on the Alerts page of the Prometheus web UI.

Test: stop the monitoring agent on another node

# Stop cadvisor
[root@localhost ~]# docker stop e9e9499bcf2b
e9e9499bcf2b

After a while, the alert at http://192.168.235.130:9090/alerts changes to the Firing state. At this point the notification has been sent, and the alert email has arrived in the mailbox.

Alert states

  • Inactive: the normal state; the rule expression is not true
  • Pending: the threshold has been crossed, but the alert duration (for) has not yet elapsed
  • Firing: the threshold has been crossed for the full alert duration; the alert is sent to the receiver
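
The three states can be pictured as a small state machine driven by the rule's for duration. The following is a simplified sketch, not Prometheus's own code; alert_state and its arguments are names invented for this example, and each sample stands for one rule evaluation.

```python
def alert_state(samples, for_seconds):
    """Return the state after the last evaluation.

    samples: list of (timestamp_in_seconds, expression_is_true) tuples,
             in time order, one per rule evaluation.
    for_seconds: the hold duration from the rule's `for` field.
    """
    active_since = None  # when the expression first became true
    state = "inactive"
    for ts, is_true in samples:
        if not is_true:
            # Any false evaluation resets the alert to the normal state.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = ts
        # Pending until the expression has held for the full duration.
        state = "firing" if ts - active_since >= for_seconds else "pending"
    return state


# With for: 5m (300 seconds) and one evaluation per minute:
print(alert_state([(0, True), (60, True)], 300))               # pending
print(alert_state([(0, True), (60, True), (300, True)], 300))  # firing
```

A single false evaluation in between resets the timer, which is why a flapping target can stay in Pending without ever reaching Firing.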

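  • Grouping: alerts of a similar nature are combined into a single notification

route:             # routing rules for sending alerts
  group_by: ['alertname']  # group alerts by these labels
  group_wait: 10s          # wait before the first notification for a new group, so related alerts go out together
  group_interval: 10s      # wait before notifying about new alerts added to an existing group
  repeat_interval: 10m      # wait before re-sending a notification; controls alert frequency, tune as needed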
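
To make the effect of group_by concrete, here is a minimal sketch of how alerts could be bucketed by label values. It only illustrates the idea and ignores the timing settings; group_alerts is a hypothetical helper, and each alert is represented simply as a dict of its labels.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with the same values for every group_by label share a key.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "instance": "node1:9100"},
    {"alertname": "InstanceDown", "instance": "node2:9100"},
    {"alertname": "HighLoad", "instance": "node1:9100"},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

With group_by: ['alertname'], the two InstanceDown alerts end up in one group and would be delivered as a single notification.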
  • Inhibition: once an alert is firing, notifications for other alerts caused by it are suppressed, which eliminates redundant alerts
inhibit_rules:
  - source_match:  # the higher-severity source alert
      severity: 'critical'  
    target_match:   # lower-severity alerts that are inhibited and not sent
      severity: 'warning'  
    equal: ['alertname', 'dev', 'instance']  # labels that must have the same value in both alerts for inhibition to apply
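
The matching logic of such a rule can be sketched in a few lines. This is an illustration under simplifying assumptions (exact label matches only, a label missing on both sides counts as equal), and is_inhibited is a name invented for the example rather than anything Alertmanager exposes.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by some firing source alert."""
    # The rule only applies to alerts that match target_match.
    if not all(target.get(k) == v for k, v in rule["target_match"].items()):
        return False
    for source in firing_alerts:
        if not all(source.get(k) == v for k, v in rule["source_match"].items()):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
critical = {"alertname": "InstanceDown", "instance": "node1", "severity": "critical"}
warning_same = {"alertname": "InstanceDown", "instance": "node1", "severity": "warning"}
warning_other = {"alertname": "InstanceDown", "instance": "node2", "severity": "warning"}

print(is_inhibited(warning_same, [critical], rule))   # True
print(is_inhibited(warning_other, [critical], rule))  # False
```

The warning on node2 still goes out because its instance label differs from the critical alert's.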
  • Silences: a simple mechanism to mute alerts for a given period of time

Silences are created through the web interface that the Alertmanager service exposes on port 9093:

http://192.168.235.130:9093/#/silences
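
Besides the web page, a silence can probably also be created over HTTP. The sketch below assumes the Alertmanager v2 API accepts a POST to /api/v2/silences with the fields shown; build_silence and post_silence are hypothetical helper names, and the field layout should be checked against the API of the installed version.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a silence body that mutes alerts matching `matchers` for `hours`."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, silence):
    """Send the silence to Alertmanager (assumed endpoint: /api/v2/silences)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, post_silence("http://192.168.235.130:9093", build_silence({"alertname": "InstanceDown"}, 2, "admin", "maintenance")) would request a two-hour silence for the InstanceDown alert.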

Origin www.cnblogs.com/anay/p/11871018.html