How to Efficiently Perform Event Noise Reduction

In event handling we generally run into two pain points. One is that there are too many alarm events and we are constantly interrupted. The other is that important alarms get overlooked and are never followed through to closure.

Common causes of too many alerts

The most common reason is that alarm rules are configured unreasonably. For example, many rules trigger alarms that lead to no follow-up action at all. They serve only as routine notifications: nothing needs to be investigated, no loss needs to be stopped, and not even a long-term TODO is created. When there are too many such alarms, people become fatigued and can easily ignore an important alarm when it arrives. If such rules are not cleaned up, useless alarms accumulate over time.

The second common reason is that a problem in a lower layer triggers alarms for everything that depends on it. The lower the layer, the greater the impact. For example, when the underlying network has a problem, it is not unusual for tens of thousands of alarms to be sent.

The third reason is channel mismatch. Some unimportant alarms are sent through highly intrusive channels. Users may also feel that a single channel is unreliable and send the same alarm through several channels at once to make sure it is delivered. This, too, is a form of unreasonable alarm rule configuration.

The fourth reason is planned maintenance. For example, if a process takes too long to restart during a program upgrade or change, alarms may fire for related services. Likewise, if a machine is restarted and nobody remembered to mask its alarms in advance, a batch of related alarms will be generated.

Each rule should correspond to a specific runbook

A runbook is an alarm handling manual: once an alarm fires, it tells the responder what to check in detail and which actions to take. If no follow-up action is taken after an alarm fires, the alarm has little value.

Non-urgent alarms also require an action. It may not need to happen immediately, but at least a low-priority ticket should be created, or the threshold should be raised so that the alarm fires only when the problem becomes more serious. Alarms that exist only to keep people informed are not really alarms. They are a substitute for reports and inspections and should be handled as such. Send these "alarms" to a separate email group or chat group that nobody needs to watch during the day, and skim it once each morning before starting work or each evening before leaving, so as to reduce interruptions.

Each alert should be properly graded

First of all, different alarm levels should correspond to different handling logic, otherwise grading is meaningless: different notification channels, different notification scopes, different groups of people involved in handling, and different response-time requirements. If two levels have exactly the same handling logic, they can be merged into one level.
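
The grading principle above can be sketched as a small level table. This is only an illustration under assumed names: the levels, channels, scopes, and response times below are hypothetical and do not come from any particular monitoring system. The helper reports levels whose handling logic is identical and which are therefore candidates for merging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingPolicy:
    """How one alarm level is handled (all fields are illustrative)."""
    channels: tuple          # notification channels, e.g. ("phone", "sms")
    notify_scope: str        # who gets notified
    respond_within_min: int  # expected response time in minutes


# Hypothetical level table; "warning" and "info" are deliberately identical.
POLICIES = {
    "critical": HandlingPolicy(("phone", "sms"), "on-call and team lead", 5),
    "warning": HandlingPolicy(("im",), "on-call", 60),
    "info": HandlingPolicy(("im",), "on-call", 60),
}


def mergeable_levels(policies):
    """Return groups of level names that share exactly the same policy."""
    groups = {}
    for level, policy in policies.items():
        groups.setdefault(policy, []).append(level)
    return [sorted(levels) for levels in groups.values() if len(levels) > 1]


print(mergeable_levels(POLICIES))
```

With this table the helper flags "info" and "warning" as one group, which suggests that the two levels should either be merged or be given genuinely different handling.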

If there are many critical-level alarm rules, something is probably wrong: the system architecture is not robust enough, every incident requires immediate human intervention, and the system cannot heal itself. Such a system needs more operations staff, and it is hard to explain the value of that headcount to the boss. What can be done? The operations team needs to define admission rules. Before a system is handed over to the operations team, some information must be provided first:

  • Relevant contacts, so that someone can be reached promptly when there is a problem. If they cannot be reached, the R&D lead can be contacted directly.
  • Service information, such as the code repository, the system architecture, which services and system parameters it depends on, its JVM parameters, and common problems with their solutions.

Then conduct an admission review. If the system architecture has obvious problems and cannot meet the admission requirements, the operations team does not take it over. If the boss insists, the only options are to add people or to state clearly that the team is not responsible for SLAs until the architecture is adjusted. If the boss will not accept that conversation, change jobs: the boss does not understand operations, does not understand stability, and does not trust you.

Alarm rules support the configuration of effective time

Businesses differ greatly from company to company. For example, a securities company needs strong stability guarantees during trading hours, but outside trading hours it may not matter if some processes are stopped outright. A monitoring system, by contrast, receives data continuously with no peaks or valleys and must be highly available at all times. Letting each alarm rule specify the time periods in which it is effective accommodates both cases.

Repeated alarms support a maximum count and a sending frequency

Some alarms do not recover quickly and may be sent repeatedly. For example, suppose a metric is checked once a minute and an alarm is sent whenever the threshold is exceeded. At some moment the threshold is crossed, and the next check sends the first alarm. If the problem then lasts for 10 minutes, every one-minute check finds the metric still in the alarm state and may send another alarm, yet these repeats are far less important than the first one and usually need no notification at all. Setting a sending frequency solves this: for example, repeat only after 1 hour, so that a second alarm is sent only if the problem still has not recovered by then. Together with a maximum number of repeats, this greatly reduces the number of alarms.
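
A minimal sketch of this repeat control, assuming an in-memory limiter keyed by an alarm identifier. The class name, the parameters, and the default values are hypothetical; a real system would persist this state and define its own keys.

```python
from dataclasses import dataclass, field


@dataclass
class RepeatLimiter:
    """Decides whether a still-firing alarm should be notified again."""
    repeat_interval_s: int = 3600   # minimum gap between two notifications
    max_notifications: int = 3      # total notifications per incident
    _state: dict = field(default_factory=dict)  # key -> (last_sent_s, count)

    def should_notify(self, key, now_s):
        entry = self._state.get(key)
        if entry is None:
            # First time this alarm fires: always notify.
            self._state[key] = (now_s, 1)
            return True
        last_sent_s, count = entry
        if count >= self.max_notifications:
            return False
        if now_s - last_sent_s < self.repeat_interval_s:
            return False
        self._state[key] = (now_s, count + 1)
        return True

    def recovered(self, key):
        """Forget the alarm so that the next incident notifies immediately."""
        self._state.pop(key, None)


limiter = RepeatLimiter(repeat_interval_s=3600, max_notifications=2)
# The check runs every minute for 10 minutes while the alarm keeps firing.
decisions = [limiter.should_notify("cpu_high@host-1", t) for t in range(0, 600, 60)]
print(decisions.count(True))
```

In this run only the first of the ten checks produces a notification. A second one would go out at the one-hour mark, and after that the maximum count of 2 blocks further repeats until `recovered` is called.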

Alarm events support shielding configuration

Before a planned maintenance action, related alarms are usually masked (shielded) in advance so that no alarms are received during the maintenance period. Alarm masking is generally configured in one of two ways: as a one-off time period in the future, or as a recurring time period. A one-off future period covers the planned maintenance just described. A recurring period serves more specialized needs: for example, if no alarms are wanted from 1 am to 5 am every day, or on weekends, a recurring masking rule can be configured.
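
The two kinds of masking window can be sketched as follows. The class names and the sample windows are assumptions made for illustration, and the recurring window here does not handle periods that cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class OneOffMask:
    """Masks alarms during a single future period, e.g. planned maintenance."""
    start: datetime
    end: datetime

    def covers(self, now):
        return self.start <= now < self.end


@dataclass
class PeriodicMask:
    """Masks alarms during the same time of day on the given weekdays."""
    start: time
    end: time
    weekdays: frozenset = frozenset(range(7))  # 0 = Monday ... 6 = Sunday

    def covers(self, now):
        return now.weekday() in self.weekdays and self.start <= now.time() <= self.end


def is_masked(masks, now):
    """An alarm is dropped if any masking rule covers the current time."""
    return any(mask.covers(now) for mask in masks)


masks = [
    OneOffMask(datetime(2024, 1, 3, 22, 0), datetime(2024, 1, 3, 23, 0)),  # maintenance
    PeriodicMask(time(1, 0), time(5, 0)),                                  # 1 am to 5 am daily
    PeriodicMask(time.min, time.max, frozenset({5, 6})),                   # weekends
]
print(is_masked(masks, datetime(2024, 1, 3, 22, 30)))
```

A real masking rule would usually also carry a matcher (for example on host or service tags) so that it silences only the alarms related to the maintenance, not every alarm in the system.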

Alarm events support suppression configuration

A typical use of alarm suppression is to configure two policies for the same metric, with different priorities and thresholds. When the high-priority alarm fires, the low-priority alarm is suppressed so that the same problem is not reported twice.
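
As a rough sketch of this suppression logic, assume each event carries a metric name, a target, and a numeric priority where 1 is the highest. These field names and the sample thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    metric: str    # e.g. "disk_used_pct"
    target: str    # e.g. a host name
    priority: int  # 1 = highest priority
    message: str


def suppress_lower_priority(events):
    """Keep only the highest-priority event per (metric, target) pair."""
    best = {}
    for event in events:
        key = (event.metric, event.target)
        if key not in best or event.priority < best[key].priority:
            best[key] = event
    return list(best.values())


events = [
    Event("disk_used_pct", "host-1", 2, "disk usage above 80%"),
    Event("disk_used_pct", "host-1", 1, "disk usage above 95%"),
    Event("disk_used_pct", "host-2", 2, "disk usage above 80%"),
]
for kept in suppress_lower_priority(events):
    print(kept.target, kept.priority, kept.message)
```

On host-1 both thresholds are exceeded, so only the priority 1 event is kept. On host-2 only the lower threshold is exceeded, so its priority 2 event still goes out.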

Alarm event aggregation sending logic

The most effective technical means of reducing noise when sending events is aggregated sending, and its effect is immediate. The system cannot control how messy users' alarm rules are, but when many alarms fire within a short period it can aggregate them by technical means before sending.

Event aggregation is generally based on two or three dimensions, such as the alarm recipient, time, and a specified tag (for example, the product line). The aggregation tag can also be omitted so that only the recipient and time dimensions are used, which gives a higher aggregation rate.

For notification purposes, aggregating on just the recipient and time dimensions is enough. It gives the highest aggregation rate and the fewest interruptions. Several alarm messages end up mixed together in one notification, which can look somewhat messy, but users generally do not expect neatly categorized events in text messages, emails, or instant messaging software. Anyone who wants a better view can go to the web page, which has more room and lets events be grouped and viewed in any way. Once an alarm has fired, the computer will most likely have to be opened to deal with it anyway, so checking on the page comes naturally.
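
The aggregation dimensions discussed in this section can be sketched as a simple bucketing function. The event fields (`ts`, `recipient`, `tags`, `text`) and the 60-second window are assumptions for illustration. Passing `tag_key` adds the optional third dimension.

```python
from collections import defaultdict


def aggregate(events, window_s=60, tag_key=None):
    """Group event texts by recipient and time window, optionally by one tag."""
    buckets = defaultdict(list)
    for event in events:
        # Align the timestamp to the start of its time window.
        window_start = event["ts"] - event["ts"] % window_s
        key = (event["recipient"], window_start)
        if tag_key is not None:
            key += (event["tags"].get(tag_key),)
        buckets[key].append(event["text"])
    return dict(buckets)


events = [
    {"ts": 5, "recipient": "alice", "tags": {"product": "pay"}, "text": "cpu high"},
    {"ts": 30, "recipient": "alice", "tags": {"product": "search"}, "text": "disk full"},
    {"ts": 70, "recipient": "bob", "tags": {"product": "pay"}, "text": "latency high"},
]
print(len(aggregate(events)))                      # recipient + time only
print(len(aggregate(events, tag_key="product")))   # recipient + time + tag
```

With only recipient and time, the three events collapse into two notifications. Adding the product tag splits them back into three, which shows why omitting the tag gives the higher aggregation rate.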

This article is a study note for Day 14 in August. The content comes from Geek Time's "Operation and Maintenance Monitoring System Practical Notes". This course is recommended.

Origin blog.csdn.net/key_3_feng/article/details/132271150