Fault Grading and Responsibility Determination

The first step in fault management is to understand faults. Only by facing faults squarely can we find a more reasonable way to deal with them.

This involves two tasks: the first is to track the handling of online faults and organize fault reviews; the second is to formulate the standards for grading faults and determining responsibility, and to hold the authority to grade faults and assign responsibility.

Therefore, one key is a clear fault grading standard. The standard mainly measures the degree of a fault's impact, so that all stakeholders can judge and evaluate against one unified reference.

In practice, the parties are affected by a fault differently and understand its impact differently, so the following two kinds of dispute often appear during the review.

1. Technical support judges the fault to be serious, but the responsible party thinks it is no big deal and that the level should not be set so high;

2. Technical support thinks the impact of the fault is small, but the affected party thinks it is very serious and that the level should not be set so low.

For this reason, fault levels are divided into five levels from P0 to P4, where P0 is the highest and P4 is the lowest. For e-commerce, the level is measured mainly by money-related indicators such as a decline in transactions, a decline in payments, and lost advertising revenue. For other services, such as user IM, the main approach is to distinguish the type of service and formulate grading standards that fit its characteristics. Two examples are as follows.

Example of transaction link failure rating criteria:

Example of User IM Fault Rating Criteria:

Technical support discusses the grading standards in detail, one by one, with each business R&D team. The standards start from business impact and combine factors such as the scope and the duration of the impact. In this way, even if a dispute arises later, there is a standard to refer to. The standards may not cover some kinds of impact or special cases, and in those cases technical support can use its discretion based on its own experience. The standards are also revised and improved every quarter or half year.
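To illustrate how scope and duration of impact can be combined into one level, here is a minimal Python sketch. The rule table, the thresholds, and the names `Rule`, `RULES`, and `grade_fault` are all hypothetical and are not taken from the course; a real standard would use the indicators agreed with each business team.

```python
from dataclasses import dataclass
from enum import IntEnum


class FaultLevel(IntEnum):
    """P0 is the most severe level, P4 the least severe."""
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


@dataclass(frozen=True)
class Rule:
    level: FaultLevel
    min_drop_pct: float    # hypothetical: minimum drop in transactions, in percent
    min_duration_min: int  # hypothetical: minimum duration of the drop, in minutes


# Hypothetical thresholds, ordered from most to least severe.
RULES = [
    Rule(FaultLevel.P0, 50.0, 30),
    Rule(FaultLevel.P1, 30.0, 20),
    Rule(FaultLevel.P2, 10.0, 10),
    Rule(FaultLevel.P3, 5.0, 5),
]


def grade_fault(drop_pct: float, duration_min: int) -> FaultLevel:
    """Return the first (most severe) level whose two thresholds are both met."""
    for rule in RULES:
        if drop_pct >= rule.min_drop_pct and duration_min >= rule.min_duration_min:
            return rule.level
    # Anything below every threshold falls into the lowest level.
    return FaultLevel.P4
```

Under these assumed thresholds, a large drop that lasts only a short time is graded lower than the same drop lasting longer, which is the point of combining the two factors. Cases the table does not cover would still be left to the discretion of technical support.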

Different fault levels call for different response strategies. Generally speaking, for faults of level P2 and above, all relevant responsible persons need to come online immediately to handle the fault and restore the business in time. For P3 or P4 faults, the requirements are relaxed appropriately. Throughout the process, technical support first gives a basic judgment and then convenes a temporary emergency team to handle the fault.
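The response rule above can be sketched as a small lookup. This is only an illustration: the function names and the "immediate" and "relaxed" labels are assumptions, and the text does not specify what a relaxed response looks like in detail.

```python
# Levels ordered from most severe to least severe, as in the text.
LEVELS = ("P0", "P1", "P2", "P3", "P4")


def requires_immediate_response(level: str) -> bool:
    """True for P2 and above (P0, P1, P2), False for P3 and P4."""
    if level not in LEVELS:
        raise ValueError(f"unknown fault level: {level!r}")
    return LEVELS.index(level) <= LEVELS.index("P2")


def response_mode(level: str) -> str:
    """Hypothetical labels: 'immediate' means all responsible persons come online now."""
    return "immediate" if requires_immediate_response(level) else "relaxed"
```

For example, `response_mode("P1")` gives `"immediate"`, while `response_mode("P4")` gives `"relaxed"`.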

The fault grading standard is used to determine the fault level, so that the parties involved do not get bogged down in arguing about the level. Responsibility determination, in turn, is mainly about identifying the responsible party, and it also needs clear standards. These standards serve two main purposes.

1. Avoid buck-passing. For example, I think it is your responsibility and you think it is mine, and everyone ends up arguing or even slandering and attacking one another.

2. Face up to the problem and take it seriously. The aim is not punishment. The responsible party or team must face the problem, find its own shortcomings, and, as the main owner of the improvement, implement or drive the improvement measures.

For determining responsibility, the following dimensions can serve as a reference.

1. Change execution

For example, if the changing party fails to notify the affected party in time or fails to assess the change sufficiently in advance, the responsibility lies with the changing party. Likewise, if the actual impact of the operation greatly exceeds expectations, leaving the affected party unprepared and causing a fault, the responsibility lies with the changing party.

2. Service dependency

For example, if an interface is called privately, or the way it is called does not conform to the agreed rules, the responsibility lies with the caller; if the service provider gives no clear example or explanation and this causes problems for the caller, the responsibility lies with the service provider, and so on.

3. Third-party liability

For example, for an IDC power failure in the machine room, a server failure, or a carrier network failure, if the fault is indeed caused by force majeure, the responsibility lies with the third party; but if the fault results from problems with our own redundancy or fault contingency plan, the responsibility lies with the application owner.
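The three dimensions above can be written down as an explicit rule table, so that a review looks up the agreed answer instead of debating it. This is a minimal sketch: the keys, the labels, and the function `responsible_party` are hypothetical names chosen for illustration, and the entries only restate the examples given above.

```python
from typing import Optional

# (dimension, circumstance) -> responsible party, restating the examples in the text.
# All key names are hypothetical labels, not an established vocabulary.
RESPONSIBILITY_RULES = {
    ("change", "affected_party_not_notified"): "changing party",
    ("change", "insufficient_assessment"): "changing party",
    ("change", "impact_exceeded_expectations"): "changing party",
    ("dependency", "private_interface_call"): "caller",
    ("dependency", "call_violates_agreed_rules"): "caller",
    ("dependency", "no_clear_example_or_explanation"): "service provider",
    ("third_party", "force_majeure"): "third party",
    ("third_party", "own_redundancy_or_plan_problem"): "application owner",
}


def responsible_party(dimension: str, circumstance: str) -> Optional[str]:
    """Look up the agreed responsible party; None means the rules do not cover the case."""
    return RESPONSIBILITY_RULES.get((dimension, circumstance))
```

A lookup that returns `None` signals a case the agreed rules do not yet cover, which is a natural prompt to discuss it with the parties and extend the table.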

With principles like these, friction during fault reviews can be effectively reduced. Because each company's business form and characteristics differ, the specific content may differ too, and the standards above may not apply in full, so treat them only as a reference example. If responsibility determination is a frequent source of trouble in your daily work, it is recommended to clarify the rules as soon as possible and reach agreement with all parties, so as to minimize buck-passing.

This article is a study note for Day 12 in April. The content comes from the Geek Time course "Zhao Cheng's Operation and Maintenance System Management Course", which is recommended.

Origin blog.csdn.net/key_3_feng/article/details/130117787