AlertManager alert convergence (11)


1. Alert grouping

Grouping classifies alerts of a similar nature into a single notification. For example, "server down" and "application down" alerts can be put into one group; when multiple alerts in that group fire at the same time, they are sent together in one email to the same recipient. This prevents important information from being missed in a flood of alert emails.

Because rules written in Prometheus apply to all servers, only one alert of a given type needs to be defined. When multiple hosts trigger that alert within the same time window, a single alert email is sent to the administrator. Alert types are mainly distinguished by the alertname label.

(Screenshot of a grouped alert email omitted.)

AlertManager grouping syntax

route:
  group_by: ['alertname']   # group by label; alertname is the alert rule name, multiple labels are separated by commas
  group_wait: 10s           # how long to wait before sending the first notification for a new group, so alerts arriving in this window are sent together
  group_interval: 10s       # how long to wait before sending a notification about new alerts added to an already-firing group
  repeat_interval: 10m      # how often to re-send a notification for an alert (e.g. InstanceDown) that is still unresolved
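As a sketch of how grouping plays out in practice (the rule name and thresholds below are illustrative, not from the original post), a single Prometheus alerting rule produces one alert per instance, and because every firing instance shares the same alertname, `group_by: ['alertname']` merges them into a single notification:

```yaml
# Hypothetical Prometheus alerting rule. If ten hosts go down at once,
# ten alerts fire, all with alertname="InstanceDown"; AlertManager
# groups them into one email rather than ten.
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```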

2. Alert inhibition

Inhibition: once an alert has been sent, stop repeatedly sending other alerts that are caused by it.

Inhibition keeps operations staff from receiving a pile of emails that are all the same alert at different severity levels. With an inhibit rule we can specify that once the critical-level version of an alert fires, the warning-level version is no longer sent.

Configuration syntax:

inhibit_rules:
  - source_match:
      severity: 'critical'   # once an alert with severity=critical fires, alerts matching target_match are muted
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']   # labels that must have equal values on both alerts for the rule to apply
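To make the inhibit rule above concrete, here is a hypothetical pair of Prometheus rules (the metric name and thresholds are assumptions for illustration). Both alerts share the same alertname and instance labels, so when the critical one fires, the matching warning alert is suppressed:

```yaml
# Hypothetical rule pair: the same condition at two severities.
# When HighCPU/critical fires for an instance, the inhibit rule drops
# the HighCPU/warning alert for that same instance, because their
# alertname and instance labels (listed in `equal`) match.
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPU
        expr: node_cpu_usage > 0.95   # assumed recording-rule metric
        labels:
          severity: critical
      - alert: HighCPU
        expr: node_cpu_usage > 0.80
        labels:
          severity: warning
```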

3. Alert silencing

Silencing means that a given alert does not send notifications for a period of time, typically while the affected target is in a maintenance window.

Creating a silence rule

Open the AlertManager web UI on port 9093.

1. Click New Silence in the upper right corner


2. Fill in the silence configuration

3. Now, even when the docker alert fires, no email is sent

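Besides the web UI, silences can also be managed from the command line with `amtool`, which ships with AlertManager. The alert name, matcher values, and the local URL below are assumptions for illustration:

```shell
# Create a 2-hour silence for a hypothetical InstanceDown alert on one host
amtool silence add alertname="InstanceDown" instance="192.168.1.10:9100" \
  --alertmanager.url=http://localhost:9093 \
  --author="ops" \
  --comment="host maintenance" \
  --duration=2h

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence early by its ID
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093
```

These commands require a running AlertManager instance, so treat them as a usage sketch rather than a copy-paste recipe.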

4. How a Prometheus alert flows end to end

First, Prometheus monitors the targets. When a monitored metric crosses its threshold, the `for` duration configured in the alert rule is evaluated; if the threshold is still exceeded after that duration, the alert is pushed to AlertManager. After receiving the alert, AlertManager applies grouping, inhibition, and silencing, and finally sends it through the configured receiver to email, WeChat, or DingTalk.
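Putting the pieces together, a minimal alertmanager.yml sketch for that pipeline might look like the following. The SMTP host, credentials, and addresses are placeholders, not values from the original post:

```yaml
# Minimal alertmanager.yml sketch; SMTP details and addresses are placeholders.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'CHANGE_ME'

route:
  group_by: ['alertname']     # grouping (section 1)
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'ops-mail'

receivers:
  - name: 'ops-mail'
    email_configs:
      - to: 'ops@example.com'

inhibit_rules:                # inhibition (section 2)
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```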



Origin blog.csdn.net/weixin_44953658/article/details/113777174