How to run a fault review

First, we need to be clear about the purpose of a fault review: to learn from failures, identify our technical and management deficiencies, and keep improving. Nobody wants failures to happen, but learning from them is the best way to improve the capabilities of the team and its members, so we should view failures dialectically rather than only as a loss.

At the same time, do not turn the review into an exercise in accountability or punishment. That would seriously damage team atmosphere and employee motivation.

Technical support still plays a key role in the review process, with the following responsibilities.

  • Call the review meeting. Send the fault information in advance to everyone who took part in handling the fault, prepare the questions to be discussed during the review, and decide case by case whether to invite business staff;
  • Run the meeting. Coordinate and steer the discussion so that the meeting stays under control;
  • Assign responsibility for the fault. Act as a kind of "judge", applying the fault criteria introduced earlier;
  • Define follow-up improvement actions and their owners, enter them into the system, and track them regularly.
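
The last responsibility, tracking improvement actions, can be sketched in a few lines of Python. This is only an illustration: the `ActionItem` fields, the owner names, and the sample items are hypothetical and do not come from the course or from any particular tracking system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One follow-up improvement agreed in the review (hypothetical fields)."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


# Hypothetical sample data for illustration only.
items = [
    ActionItem("Add alert on checkout error rate", "alice", date(2023, 4, 20)),
    ActionItem("Write rate-limiting plan for search", "bob", date(2023, 5, 5)),
    ActionItem("Enable gray release for gateway", "carol", date(2023, 4, 10), done=True),
]

for item in overdue(items, today=date(2023, 5, 1)):
    print(f"OVERDUE {item.due} {item.owner}: {item.description}")
```

Running a check like this on a schedule is one simple way to make "track them regularly" concrete.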

A review meeting usually has the following key stages.

1. Brief review of the fault. Briefly describe when the fault occurred, its impact, the recovery time, and the main handler or team.

2. Review of the troubleshooting timeline. During troubleshooting, technical support keeps a brief record of the handling process: the time of each operation, who performed it, and its result, and even the communication and collaboration in between, such as who was called at what time and how long they took to come online. This record must be objective and factual. After the business has recovered, it is sent to the handlers for verification and additions. The timeline is critical because it reproduces the whole fault-handling process fairly faithfully.

3. Discussion of the timeline. After going through the timeline, we raise questions about problems in the process. This puts some pressure on the main handlers, so we must focus on the issue, not the person. We usually question the steps that took too long or were handled unreasonably. For example: Why did monitoring not catch the problem, so that it only surfaced through user complaints? Why did it take so long from the failure until someone came online to respond? Why was there no contingency plan for rate limiting, degradation, or switchover in this scenario? Why did the plan fail to take effect? Why was there no gray release and verification? By discussing these questions and details, we identify the obvious deficiencies and record the improvement points.

4. Determine the root cause of the fault. Based on the detailed discussion, determine the root cause of the fault, then discuss improvement measures that address it. This stage and the previous one usually involve a lot of discussion and even debate. The role of technical support is to keep the meeting under control and keep the discussion on the facts. The discussion must not be allowed to degenerate into mutual accusation and criticism. As soon as there is any sign of this, technical support must step in and give a warning.

5. Fault grading and responsibility. Once the root cause is determined, the fault can be graded and responsibility assigned, based on the root cause and the previously confirmed scope of impact. This again relies on the fault criteria introduced earlier. When responsibility is determined, however, only the responsible team and the relevant handlers are present, and the result is communicated within that small group, mainly out of consideration for the feelings of the person responsible. If there is no objection, a fault completion report is produced; if there is an objection, it can be escalated to the supervisor and, if necessary, all the way up to the head of the technical team (CTO or technical VP).

6. Issue the fault completion report. The report contains the detailed fault information, such as the time of occurrence, scope of impact, timeline, root cause, responsible team, and follow-up improvement measures, together with the common problems and suggestions drawn from this fault. Its main purpose is to keep information transparent and to serve as a warning, so that other teams can check for similar gaps and avoid making the same mistakes.
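
The timeline recorded in stage 2 lends itself to a small data structure, and the "which step took too long" question from stage 3 then becomes a simple computation. The sketch below is one possible shape, not a prescribed format: the `TimelineEntry` fields, the `slowest_gaps` helper, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One recorded step in fault handling (hypothetical fields)."""
    at: datetime
    who: str
    what: str


def slowest_gaps(entries, top=2):
    """Return the `top` largest gaps between consecutive entries
    as (minutes, earlier_entry, later_entry) tuples, largest first."""
    ordered = sorted(entries, key=lambda e: e.at)
    gaps = [
        ((later.at - earlier.at).total_seconds() / 60, earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Hypothetical sample timeline for illustration only.
timeline = [
    TimelineEntry(datetime(2023, 4, 14, 10, 0), "support", "First user complaint received"),
    TimelineEntry(datetime(2023, 4, 14, 10, 25), "bob", "On-call engineer online"),
    TimelineEntry(datetime(2023, 4, 14, 10, 30), "bob", "Traffic switched to standby"),
    TimelineEntry(datetime(2023, 4, 14, 10, 50), "support", "Service confirmed recovered"),
]

for minutes, earlier, later in slowest_gaps(timeline):
    print(f"{minutes:.0f} min between '{earlier.what}' and '{later.what}'")
```

The largest gaps are only a starting point for the discussion; the reasons behind them still have to come from the people who handled the fault.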

In addition to routine fault response and fault review, we also periodically summarize the fault cases from a given period, for example quarterly, semi-annually, and annually. This makes it easier to spot common problems, so that the R&D team can plan its stability work accordingly.
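
As a rough illustration of such a periodic summary, the sketch below groups faults by quarter and counts root-cause categories. The category labels and the sample records are hypothetical; a real team would use whatever categories its own fault criteria define.

```python
from collections import Counter
from datetime import date


def quarter(day: date) -> str:
    """Label a date with its calendar quarter, e.g. '2023-Q1'."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def summarize(faults):
    """Count root-cause categories per quarter.

    `faults` is an iterable of (date, root_cause_category) pairs.
    """
    by_quarter = {}
    for day, cause in faults:
        by_quarter.setdefault(quarter(day), Counter())[cause] += 1
    return by_quarter


# Hypothetical fault records for illustration only.
faults = [
    (date(2023, 1, 12), "no gray release"),
    (date(2023, 2, 3), "missing alert"),
    (date(2023, 3, 20), "no gray release"),
    (date(2023, 4, 14), "insufficient capacity"),
]

for label, causes in sorted(summarize(faults).items()):
    top_cause, count = causes.most_common(1)[0]
    print(f"{label}: {sum(causes.values())} faults, most common cause: {top_cause} ({count})")
```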

This article is a study note for Day 14 in April. The content comes from "Zhao Cheng's Operation and Maintenance System Management Course" on Geek Time. I recommend this course.

Origin blog.csdn.net/key_3_feng/article/details/130162778