AlertManager alert convergence
1. Alarm grouping
Grouping classifies alerts of a similar nature into a single notification, for example "server down" and "application down". Alerts of this kind can be placed in one group. When multiple alerts in the same group fire at the same time, they are sent together in one notification to the same email address, which prevents important information from being missed in a flood of alert emails.
A grouped alert email illustrates this well: because the rules written in Prometheus apply to all servers, only one alert rule needs to be created per failure type. When multiple hosts trigger that alert in the same time window, a single alert email is sent to the administrator. Alert types are distinguished mainly by the alertname label.
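As a concrete illustration, a single Prometheus alerting rule like the following (a minimal sketch; the rule name and threshold are illustrative) applies to every scraped target, and every host that matches it carries the same alertname label that AlertManager groups on:

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: InstanceDown        # this becomes the alertname label used for grouping
        expr: up == 0              # fires for every scraped target that is down
        for: 1m                    # the condition must hold for 1m before the alert fires
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```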
AlertManager grouping syntax
route:
  group_by: ['alertname']   # group by label; alertname is the alert rule name; separate multiple labels with commas
  group_wait: 10s           # how long to wait before sending the first notification for a new group, so other alerts arriving in this window are sent together
  group_interval: 10s       # how long to wait before sending a notification about new alerts added to a group that has already been notified
  repeat_interval: 10m      # how often to re-send an alert that is still firing, e.g. an InstanceDown alert that has not yet been resolved
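Putting these options in context, a minimal alertmanager.yml might look like the following sketch (the receiver name, SMTP settings, and email address are placeholders, not from the original configuration):

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP server
  smtp_from: 'alertmanager@example.com'    # placeholder sender

route:
  receiver: 'ops-mail'      # default receiver for all grouped alerts
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m

receivers:
  - name: 'ops-mail'
    email_configs:
      - to: 'ops@example.com'   # placeholder recipient
```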
2. Alarm suppression
Inhibition: once an alert has fired, stop repeatedly sending other alerts caused by it.
Inhibition prevents operations staff from receiving a flood of emails that are all for the same underlying alert at different severity levels. For example, we can configure that when an alert fires at the critical level, the corresponding warning-level alert is no longer sent.
Configuration syntax:
inhibit_rules:
  - source_match:
      severity: 'critical'   # once an alert with the severity=critical label is firing, alerts matching target_match are inhibited
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']   # these labels must have equal values on the source and target alerts
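For example, the following two Prometheus rules (illustrative names and metric; `disk_free_percent` is assumed here, not from the original) fire at different severities for the same condition. Because they share the same alertname and instance labels, once the critical one fires on a host, the inhibit rule above suppresses the warning one for that host:

```yaml
groups:
  - name: disk-alerts
    rules:
      - alert: DiskSpaceLow              # warning level: less than 20% free
        expr: disk_free_percent < 20
        labels:
          severity: warning
      - alert: DiskSpaceLow              # critical level: less than 5% free
        expr: disk_free_percent < 5
        labels:
          severity: critical
```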
3. Alarm silencing
Silencing means not firing notifications for a given alert for a period of time, for example while the affected systems are in a maintenance window.
Create a silence rule
Access port 9093 of AlertManager (the web UI)
1. Click New Silence in the upper right corner
2. Fill in the silence configuration (matchers, duration, creator, comment)
3. After this, even when the silenced alert (for example, the docker alert) fires, no email is sent
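Silences can also be created from the command line with amtool, AlertManager's companion CLI. A sketch follows; the matcher values, author, and URL are examples, and the commands assume an AlertManager instance is running on port 9093:

```shell
# create a 2-hour silence for the InstanceDown alert on one host
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="ops" \
  --comment="planned maintenance" \
  --duration="2h" \
  alertname="InstanceDown" instance="10.0.0.5:9100"

# list the currently active silences
amtool silence query --alertmanager.url=http://localhost:9093
```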
4. How a Prometheus alert is triggered, end to end
First, Prometheus monitors the targets. When a metric crosses the threshold in an alerting rule, Prometheus checks the rule's configured `for` duration; only if the threshold is exceeded for that whole period is the alert pushed to AlertManager. After receiving the alert, AlertManager applies grouping, inhibition, and silencing, and finally sends the alert through the configured receiver to email, WeChat, DingTalk, and so on.