How to ensure closed-loop processing of events

A closed loop here means the whole lifecycle of an alarm: it is issued, claimed, handled collaboratively, resolved, and then reviewed for improvements.

On-call scheduling: dedicated people for dedicated tasks

This method does not sound sophisticated, but it is very effective. When I was on shift I was anxious and afraid of being blamed, but because it was a rotation there was always hope: I only had to get through this cycle.

The person on shift is the first responsible person for that period and will put 120% of their energy into dealing with problems. Clearly, a problem is easier to drive to resolution when a specific person owns it. Everyone who is not on shift, in turn, is spared from being constantly interrupted by alarms.

Open-source scheduling systems are rare; scheduling is usually a feature of an event center. PagerDuty, for example, provides scheduling capabilities, and that helps a lot.
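
To make the idea of a rotation concrete, here is a minimal sketch of a round-robin schedule. It is not how PagerDuty or any particular event center implements scheduling; the roster names, the start date, and the `on_call_at` function are all made up for illustration.

```python
from datetime import datetime, timedelta


def on_call_at(roster, rotation_start, shift_length, when):
    """Return the roster member on shift at `when` in a simple round-robin rotation."""
    if when < rotation_start:
        raise ValueError("time is before the rotation started")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shifts_elapsed = (when - rotation_start) // shift_length
    return roster[shifts_elapsed % len(roster)]


# Hypothetical roster with weekly shifts starting on a Monday.
roster = ["alice", "bob", "carol"]
start = datetime(2023, 8, 7)
week = timedelta(weeks=1)
```

Calling `on_call_at(roster, start, week, datetime(2023, 8, 15))` falls in the second week, so it returns the second person on the roster. A real scheduler would also need overrides, swaps, and time zones.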

However attentive the on-duty person is during their shift, they will inevitably miss something sometimes, which is why an alarm escalation mechanism is needed.

Alarm escalation mechanism

Alarm escalation is a mechanism in which the system automatically notifies second-line and third-line personnel if the first responsible person fails to respond in time after receiving an alarm. There are many reasons the first-line person might not respond promptly: the phone was muted, they were asleep at night, or they stepped out without the phone. When the system finds an alarm that has neither recovered nor been claimed after a period of time, it should notify the on-duty person's lead or the second-line backup. If the second line also fails to respond for a long time, the alarm should be escalated further.

Escalation depends on a claim function: after receiving the alarm, the first-line person must tell the system through some mechanism, "I know about this alarm and have started dealing with it, so don't escalate."

Because escalation disturbs more people, it is generally enabled only for critical alarms; warning-level or informational alarms do not need it. Each team can decide where to draw this line.
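
The rules above (escalate over time, stop once the alarm is claimed or recovered, escalate only serious alarms) can be sketched roughly as follows. The delays, level names, and severity values are assumptions chosen for the example, not values from the course or from any product.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    severity: str            # assumed values: "critical", "warning", "info"
    fired_at: float          # seconds on some monotonic clock
    claimed: bool = False    # set when someone acknowledges the alarm
    recovered: bool = False  # set when the alarm clears on its own


# Hypothetical policy: (seconds after firing, who gets notified).
ESCALATION_STEPS = [(0, "first-line"), (600, "second-line"), (1800, "third-line")]


def levels_to_notify(alarm, now):
    """Return the levels that should have been paged by `now`."""
    if alarm.claimed or alarm.recovered:
        return []  # claiming or recovery stops escalation
    elapsed = now - alarm.fired_at
    levels = [who for delay, who in ESCALATION_STEPS if elapsed >= delay]
    if alarm.severity != "critical":
        return levels[:1]  # non-critical alarms never go past the first line
    return levels
```

For an unclaimed critical alarm fired at time 0, `levels_to_notify(alarm, 700)` includes the second line because the assumed 600-second threshold has passed; once `claimed` is set, the function returns an empty list.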

Alarm convergence logic

Convergence is generally done in three levels: event -> alert -> incident.

Convergence from event to alert is the first level. On its own it is not enough: the alarm information is still fairly scattered and hard to coordinate around. Converging multiple alerts into one incident (fault) makes coordination much more convenient. The difficulty is that event-to-alert convergence follows fixed logic that a program can apply automatically, whereas alert-to-incident convergence is hard to automate.
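
As an illustration of what fixed event-to-alert logic can look like, the sketch below folds repeated events from the same source into one alert and closes the alert when a recovery event arrives. The `alert_key` and `status` fields and their values are assumptions for the example; real monitoring systems define their own keys.

```python
def fold_events(events):
    """Fold a time-ordered stream of events into alerts keyed by alert_key."""
    open_alerts = {}  # alert_key -> alert that is currently firing
    alerts = []       # every alert, in order of first occurrence
    for ev in events:
        key = ev["alert_key"]
        if key not in open_alerts:
            if ev["status"] == "ok":
                continue  # recovery event with nothing open: nothing to close
            alert = {"alert_key": key, "status": "firing", "events": []}
            open_alerts[key] = alert
            alerts.append(alert)
        alert = open_alerts[key]
        alert["events"].append(ev)
        if ev["status"] == "ok":
            alert["status"] = "recovered"
            del open_alerts[key]  # the next firing event starts a new alert
    return alerts
```

Two firing events with the same key followed by an `ok` event therefore become a single recovered alert with three events attached.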

Typical convergence strategies are: 1. convergence based on time; 2. convergence based on time + tags; 3. convergence based on time + text similarity.
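
The second strategy, time + tags, might look something like the sketch below: alerts that share the same values for chosen tags and arrive within a time window of each other land in the same incident. The alert shape (`time`, `tags`), the window, and the tag keys are assumptions for illustration only.

```python
def group_alerts(alerts, window_seconds, tag_keys):
    """Group alerts into incidents by shared tag values and time proximity."""
    incidents = []
    latest = {}  # tag values -> most recent incident with those values
    for alert in sorted(alerts, key=lambda a: a["time"]):
        group = tuple(alert["tags"].get(k) for k in tag_keys)
        incident = latest.get(group)
        # Start a new incident if none exists or the gap exceeds the window.
        if incident is None or alert["time"] - incident["last_time"] > window_seconds:
            incident = {"tags": group, "alerts": [], "last_time": alert["time"]}
            incidents.append(incident)
            latest[group] = incident
        incident["alerts"].append(alert)
        incident["last_time"] = alert["time"]
    return incidents
```

A heuristic like this will sometimes merge unrelated alerts or split related ones, which is exactly why alert-to-incident convergence is hard to trust fully.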

Since automatic convergence of alerts into a fault is not dependable, it can be done manually. It is relatively easy to pick out the key alarms related to a fault; once those key alarms are associated with the fault, subsequent coordination can be based on the fault. Collaboration here means two things: synchronizing information and handling the problem together, and jointly reviewing the fault and managing follow-up items.

Fault coordination

First, not every alarm needs to be escalated to fault coordination. Generally, if the on-duty person can deal with the alarm directly and it does not affect other teams' services, there is no need to notify other teams or to upgrade it to a fault; coordinating at the alarm level within the team is enough. If the on-duty person and their team cannot handle the alarm alone, it needs to be upgraded to a fault, and people from other teams are brought in to deal with it together.

When multiple teams work on a fault together, people from different teams find different clues, and these need to be shared with everyone involved in a timely manner. Adding comments under the fault lets others see them right away. After the loss is stopped, everyone reviews the fault along its timeline and produces a series of follow-up items. The fault management module therefore needs follow-up item management, or at least good integration with a task management system.
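
One possible shape for such a fault record is sketched below: comments form the timeline, the moment the loss is stopped is recorded, and follow-up items hang off the fault. The class and field names are invented for this example and do not come from any specific fault management tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FollowUp:
    title: str
    owner: str
    done: bool = False


@dataclass
class Incident:
    title: str
    opened_at: float
    mitigated_at: Optional[float] = None  # when the loss was stopped
    timeline: List[Tuple[float, str, str]] = field(default_factory=list)
    follow_ups: List[FollowUp] = field(default_factory=list)

    def comment(self, at, author, text):
        """Add a clue to the timeline, keeping entries in time order."""
        self.timeline.append((at, author, text))
        self.timeline.sort(key=lambda entry: entry[0])

    def time_to_stop_loss(self):
        """Seconds from opening to mitigation, or None if still ongoing."""
        if self.mitigated_at is None:
            return None
        return self.mitigated_at - self.opened_at

    def open_follow_ups(self):
        return [item for item in self.follow_ups if not item.done]
```

Keeping the timeline sorted means the review can simply read it top to bottom, and `open_follow_ups()` gives the list that still needs chasing after the review.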

With such a fault coordination mechanism, faults are much more likely to be dealt with. Later, operational statistics can be used to compute each team's average time to stop the loss and to publish red and black lists (honor and shame rankings), which gives everyone more motivation to deal with faults. Of course, no matter how motivated people are, they are not as fast as machines. If some alarms can be wired directly to automatic handling logic, the closed-loop rate of events will undoubtedly rise considerably.

Automatic alarm handling

Many monitoring systems can be configured with webhooks that call back an HTTP interface when an alarm is triggered. This connects the alarm to automation logic so that it can be handled unattended. For example, if a service in one data center goes down, the webhook logic can call the traffic-switching interface to move traffic away from that service and stop the loss.
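
A webhook receiver of this kind can be sketched with nothing but the Python standard library. The payload fields (`status`, `rule`, `service`), the rule name, and the `switch_traffic` placeholder are assumptions; a real monitoring system defines its own payload, and a real handler would call an actual traffic-switching interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def switch_traffic(alarm):
    # Placeholder: a real handler would call the traffic-switching interface here.
    return f"would switch traffic away from {alarm.get('service', 'unknown')}"


# Hypothetical mapping from alarm rule name to automation.
ACTIONS = {"service_down": switch_traffic}


def handle_alarm(alarm, actions=ACTIONS):
    """Pick the automation for an alarm; only firing alarms trigger an action."""
    if alarm.get("status") != "firing":
        return "ignored"
    action = actions.get(alarm.get("rule"))
    return action(alarm) if action else "no automation configured"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alarm = json.loads(self.rfile.read(length) or b"{}")
            if not isinstance(alarm, dict):
                raise ValueError("payload must be a JSON object")
        except ValueError:  # also covers malformed JSON
            self.send_response(400)
            self.end_headers()
            return
        body = json.dumps({"result": handle_alarm(alarm)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the receiver; point the monitoring system's webhook at this port."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Keeping the decision in `handle_alarm`, separate from the HTTP plumbing, makes the automation easy to test without starting a server. Anything that changes production traffic should also have authentication and a safety switch, which this sketch leaves out.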

Automatic handling does not necessarily have to make the alarm self-heal. Sometimes just using this mechanism to capture the scene is very valuable. For example, when a process dies, I want to know the machine's state at that moment, such as resource usage and system log output. Having the alarm automatically run a script that captures this on-site information is much more efficient than logging in to the machine by hand after receiving the alarm.

 

This article is a study note for Day 15 of August. The content comes from Geek Time's "Operation and Maintenance Monitoring System Practical Notes", a course I recommend.

Origin blog.csdn.net/key_3_feng/article/details/132301224