Background

Online failures are an inevitable part of running and growing a technical system, and each one is a chance to learn and gain experience. However, not every team or engineer handles failures in a rational, systematic way. Drawing on day-to-day work and personal experience, I have compiled a checklist in the style of the Three Character Classic, with correct ✅ and incorrect ❌ cases, that the team can use to quickly troubleshoot problems or suspected problems. The checklist helps you avoid mistakes and missed steps when working under pressure. By mastering it, you can keep the situation under control and avoid losses caused by negligence. Let us stay calm in the face of failures, investigate in an orderly manner, and keep improving our technical and problem-solving skills.

Three Character Classic

Note: The items below are not in strict order. Adjust them to the actual situation, and run several in parallel where possible. For example, if you already know roughly which part of the call chain is at fault, stop the bleeding first. If you do not, report the problem, divide up the work, roughly locate the faulty part of the chain, and then stop the bleeding quickly.

Every measure taken during fault handling serves one top priority: restoring the business. Applying a quick fix that stops the bleeding takes precedence over everything else, including finding the cause of the fault.

Don't panic, report first

Organize the meeting first, and clarify the division of labor

Description, not conclusion

Stop the bleeding first, then locate the cause

Look at the monitoring, look at the logs

Find the patterns, test first

Look at the input, look at the output

Preserve the scene, give feedback in real time

Don't panic, report first

1. When dealing with an emergency (an online problem or a suspected problem), first report it to the team.

2. Draw on the strength of the team: by pooling ideas, you can find the root of the problem faster and take the right measures to solve it.

✅ Positive example:

Suppose a team member encounters an online problem reported by the business side or by R&D from another team. He first escalates the problem to the TL or architect in the group, then shares the details with team members via an online meeting or instant messaging. Team members quickly join the discussion and work together to analyze the cause and possible solutions. Through this joint effort, the team finds the root of the problem and solves it, and the teamwork makes the whole process faster.

❌ Counterexample:

Suppose a team member encounters an online problem reported by the business side or by R&D from another team. Instead of escalating it to the group, he tries to solve it himself. As a result, the problem is not resolved in time; it gets worse and increases the online losses.

Organize the meeting first, and clarify the division of labor

1. The fault commander (the problem finder or the team leader/architect) has the authority to call in the relevant business, product, development, or other necessary resources and to quickly organize a meeting.

2. The fault commander defines each role's responsibilities and the division of labor, which makes fault handling more efficient.

✅ Positive example:

The fault commander (the problem finder or the TL/architect) calls in the relevant business, product, development, and other team members and quickly organizes a meeting. In the meeting, the commander clarifies the roles and responsibilities of each member, such as product staff, developers, and testers, so that the division of labor is clear. Everyone knows which tasks they own, which makes troubleshooting more efficient. Working together, the team can quickly find the root cause and take the measures needed to solve it.

❌ Counterexample:

The fault commander (the problem finder or the TL/architect) does not call in the relevant business, product, development, and other team members in time. He tries to solve the problem alone but lacks the necessary business knowledge and experience, so the problem is not solved in time. Meanwhile, other team members do not give it enough attention because they do not know how serious it is. The fault commander then has to spend more time and energy coordinating resources, and in the end the problem may not be solved effectively, which affects the progress of the whole project.

Description, not conclusion

1. Have the problem finder describe the observed phenomenon (time, scope of impact, degree of impact) rather than a conclusion. A conclusion may be wrong, and a wrong conclusion sends everyone's investigation in the wrong direction.

2. When describing the phenomenon, avoid mixing in your own judgments and opinions, because they can bias the direction of the investigation. Staying objective and rational makes for better analysis.

3. Watch your own thinking as well, and do not let your mind become a racetrack for other people's ideas. You are welcome to offer opinions and suggestions during the discussion, but make sure they rest on facts and evidence, not on subjective assumptions.

✅ Positive example:

Suppose a software development team encounters a performance problem, and the problem finder describes the following phenomenon:

Time: Between 9:00 am and 10:00 am on August 18, 2023, users noticed obvious performance degradation.
Scope of impact: The main functional modules, such as user login and data query, are affected.
Degree of impact: User request response times increased significantly, and some users could not complete their operations.

In this example, the problem finder provides specific, observable information, so team members can quickly narrow down the problem and take measures to optimize and fix it.

❌ Counterexample:

Suppose a software development team encounters a performance problem, and the problem finder gives only his own judgment:

Time: Between 9:00 am and 10:00 am on August 18, 2023.
Scope of impact: It may be a problem with the database connection pool.
Degree of impact: Very serious; most users cannot use the system normally.

In this example, the problem finder offers a judgment and a conclusion instead of specific observations. Team members are left unsure what the actual symptoms are and cannot pinpoint the problem. And because the conclusion may be wrong, it can mislead the team into wasting time and energy on the wrong line of investigation.
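
One way to keep observations and guesses apart is to give the report a fixed shape. The following is a minimal sketch, not a prescribed format: the `IncidentReport` class and its field names are hypothetical, and the sample values are taken from the example above. Facts go into required fields, and any guess goes into a separate, optional field that is clearly labeled as unverified.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentReport:
    """Observed facts first; field names are illustrative, not a standard."""
    start: datetime
    end: datetime
    scope: str            # which functions or users are affected
    impact: str           # what users actually experience
    hypothesis: str = ""  # optional guess, kept apart from the facts

    def summary(self) -> str:
        lines = [
            f"Time: {self.start:%Y-%m-%d %H:%M} to {self.end:%H:%M}",
            f"Scope: {self.scope}",
            f"Impact: {self.impact}",
        ]
        if self.hypothesis:
            # A guess is allowed, but it is always labeled as unverified.
            lines.append(f"Unverified hypothesis: {self.hypothesis}")
        return "\n".join(lines)


report = IncidentReport(
    start=datetime(2023, 8, 18, 9, 0),
    end=datetime(2023, 8, 18, 10, 0),
    scope="user login and data query",
    impact="response times rose sharply; some users could not complete operations",
    hypothesis="database connection pool exhaustion",
)
print(report.summary())
```

With this shape, the counterexample above would be hard to write: "it may be the connection pool" has nowhere to go except the hypothesis field.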

Stop the bleeding first, then locate the cause

1. Of all the measures taken during fault handling, restoring the business has the highest priority. Applying a fix that stops the bleeding takes precedence over finding the cause of the fault.

2. Stop the bleeding quickly: use switches, rollbacks, and similar means to restore system functions quickly and avoid service interruption.

3. Prepare emergency plans in advance: cover disaster recovery strategies for key businesses, failover procedures, and so on, so that a plan can be activated quickly when a failure occurs and its impact on the business is reduced.

4. Do not focus too much on finding the cause while the fault is still in progress. Finding the cause is the key to fixing the problem for good, but in an emergency the priority is to stop the bleeding and restore the business. Only once the business is running stably do we have the time and energy to analyze the cause in depth and solve the problem at its root.
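
As an illustration of the switch approach in item 2, here is a minimal sketch. It assumes new logic is released behind a named switch; the `Switches` class is an in-memory stand-in for whatever configuration center a team actually uses, and `handle_order` and the switch name are hypothetical.

```python
class Switches:
    """In-memory stand-in for a configuration center (hypothetical)."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_on(self, name, default=False):
        # Unknown switches fall back to the default, i.e. the old path.
        return self._flags.get(name, default)


switches = Switches()


def handle_order(order_id):
    # Route to the new code path only while its switch is on.
    if switches.is_on("new_order_flow"):
        return f"new-flow:{order_id}"
    return f"old-flow:{order_id}"


switches.set("new_order_flow", True)
print(handle_order(42))                # served by the new path
switches.set("new_order_flow", False)  # stop the bleeding without a deploy
print(handle_order(42))                # served by the old, known-good path
```

The point of the design is that turning the switch off is a configuration change, so recovery does not have to wait for a build and release.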

✅ Positive example:

Case 1: Suppose a software development team encounters a system failure, and the problem finder immediately takes measures to stop the bleeding and restore system functions. Because the team has prepared detailed emergency plans in its daily work, it can activate a plan as soon as such a failure occurs and limit the impact on the business. The team treats business recovery as the highest priority, which keeps the system stable and available, and its emergency plans leave it prepared for future failures.
Case 2: During an online release, errors start to appear, and everything was normal before the release? Then do not deliberate: roll back first, and troubleshoot after service returns to normal.
Case 3: The application has been running stably for a long time, and suddenly processes start to exit? A memory leak is a likely cause. Restart some of the machines and observe whether the bleeding stops.

❌ Counterexample:

Suppose a software development team encounters a system failure, and the problem finder spends a lot of time looking for the cause while neglecting to stop the bleeding and restore the business. The team has also not prepared a detailed emergency plan in its daily work, so handling is chaotic when the failure occurs and system functions cannot be restored quickly. Because the team focused on the cause instead of recovery, fault handling is inefficient, the system is not restored in time, and the business suffers heavily.

Look at the monitoring, look at the logs

1. Collect and analyze information such as UMP performance metrics, Logbook error logs, and MDC system operating status to judge the problem more accurately.

2. Communicate with the relevant teams to analyze the cause together and work out a solution. Stay calm and patient throughout, and follow troubleshooting best practices so that the problem is resolved promptly and effectively.
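
When a log platform is not at hand, a quick count of errors per minute over raw log text is often enough to see when a problem started. The sketch below does not use any UMP, Logbook, or MDC API; it assumes a hypothetical plain-text log format of `date time LEVEL message`, and the sample lines are made up.

```python
from collections import Counter

# Made-up sample in an assumed "date time LEVEL message" format.
SAMPLE_LOG = """\
2023-08-18 09:01:12 INFO order created
2023-08-18 09:01:40 ERROR timeout calling inventory
2023-08-18 09:02:03 ERROR timeout calling inventory
2023-08-18 09:02:15 ERROR timeout calling inventory
2023-08-18 09:02:48 WARN slow query
"""


def errors_per_minute(log_text):
    """Count ERROR lines, grouped by the minute in which they were logged."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[2] == "ERROR":
            minute = f"{parts[0]} {parts[1][:5]}"  # keep HH:MM, drop seconds
            counts[minute] += 1
    return counts


for minute, count in sorted(errors_per_minute(SAMPLE_LOG).items()):
    print(minute, count)
```

Lines that do not match the assumed format are skipped rather than raising, which is usually what you want when scanning logs in a hurry.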

✅ Positive example:

Suppose the system in a production environment suddenly suffers performance degradation. After receiving the problem report, the on-call engineer first collects system status and performance metrics through the monitoring tools, then analyzes possible causes based on that information while communicating and collaborating with the relevant teams. Finally, a solution is worked out and implemented, the performance degradation is resolved, and the production environment returns to normal.

❌ Counterexample:

Suppose the system in a production environment suddenly fails and the business cannot run normally. After receiving the problem report, the on-call engineer neither locates the general direction first nor collects information through the monitoring tools for analysis. Instead, he tries to fix the problem directly. Without accurate information and analysis, the root cause is hard to find, so the problem is not resolved effectively and the business continues to be affected.

Find the patterns, test first

1. Use the monitoring tools, such as UMP (traffic, tp99, availability) and log analysis, to find patterns and to understand how the system behaved before and after the release. For example, compare today's log data with yesterday's and last week's to see whether similar problems appeared then. Also watch for changes in UMP traffic to determine whether the traffic itself is abnormal.

2. If similar suspicious signs were already present earlier, the problem may be unrelated to today's release, and we need to keep digging for the root cause.

3. Test and verify each suspicious point in priority order, so that the most critical ones are handled first, before they affect the normal operation of the system.

4. If the problem persists after a test, adjust the plan promptly and test again. This helps us find the source of the problem faster and take the right steps to fix it.

5. Keep communicating and collaborating throughout the investigation. If you run into a difficult or uncertain situation, talk to other team members and solve it together.
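
The comparison in items 1 and 2 can be sketched as a small helper. This is only an illustration under stated assumptions: the metric is one where higher is worse (such as tp99 latency), baselines are positive, the 50% threshold is arbitrary, and the numbers are made up. The function names are hypothetical and not part of any monitoring tool.

```python
def pct_change(current, baseline):
    """Percentage change of current relative to a positive baseline."""
    return (current - baseline) / baseline * 100.0


def classify(today, yesterday, last_week, threshold_pct=50.0):
    """Label a metric as a new regression or a pre-existing condition."""
    worse_than_yesterday = pct_change(today, yesterday) > threshold_pct
    worse_than_last_week = pct_change(today, last_week) > threshold_pct
    if worse_than_yesterday and worse_than_last_week:
        return "new"       # differs from both baselines: suspect today's changes
    return "pre-existing"  # at least one baseline already looked similar


# tp99 latency in milliseconds for the same hour on three days (made-up numbers)
print(classify(today=820, yesterday=210, last_week=205))
print(classify(today=820, yesterday=790, last_week=805))
```

A "pre-existing" result does not mean the system is healthy; it only suggests that today's release is a less likely suspect and the search should widen.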

✅ Positive example:

Suppose the system in a production environment suddenly suffers performance degradation. The on-call engineer collects UMP and log data through the monitoring tools and finds that similar problems also occurred yesterday and last week, and that UMP traffic has not changed significantly. Based on this, the engineer makes a preliminary judgment that the problem is unrelated to today's release and has some other cause. Suspicious points are then tested and verified in priority order, and the cause turns out to be an incorrectly set configuration parameter. After the configuration is adjusted and retested, the problem is solved and system performance returns to normal.

❌ Counterexample:

Suppose the system in a production environment suddenly fails and the business cannot run normally. The on-call engineer tries to fix the problem directly, without detailed monitoring and analysis. Lacking accurate information, he cannot solve the problem effectively, and a wrong operation may even cause a more serious mistake. In this case, the on-call engineer needs to adjust the plan promptly and test again to make sure the problem is handled correctly.

Look at the input, look at the output

1. First, confirm which input and output parameters need to be compared, such as request parameters and response data, so that you know exactly what you are comparing. During the comparison, look for differences in parameter values. If you find discrepancies, analyze the possible causes further, such as parameter passing errors or interface logic problems.

2. If the problem turns out to be caused by the interface logic, try rolling back N machines to the previous version and then test whether the interface works normally again. If the problem persists, further troubleshooting and code fixes may be required.

3. Based on the comparison results and the investigation, summarize the lessons learned and propose improvements so that similar problems do not happen again.
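
Comparing a known-good request or response with a failing one by eye is error-prone under pressure. The following is a minimal sketch of a field-by-field comparison, assuming both payloads have already been parsed into nested dictionaries with string keys. The function name and the sample payloads are hypothetical.

```python
def diff_fields(expected, actual, prefix=""):
    """List the fields that differ between two nested dicts with string keys."""
    diffs = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}{key}"
        if key not in actual:
            diffs.append(f"{path}: missing in actual")
        elif key not in expected:
            diffs.append(f"{path}: unexpected in actual")
        elif isinstance(expected[key], dict) and isinstance(actual[key], dict):
            # Recurse into nested objects and report dotted paths.
            diffs.extend(diff_fields(expected[key], actual[key], path + "."))
        elif expected[key] != actual[key]:
            diffs.append(f"{path}: {expected[key]!r} != {actual[key]!r}")
    return diffs


good_request = {"sku": "A100", "qty": 2, "address": {"city": "Beijing", "zip": "100000"}}
bad_request = {"sku": "A100", "qty": "2", "address": {"city": "Beijing"}}

for line in diff_fields(good_request, bad_request):
    print(line)
```

Using `!r` in the output makes type differences visible, such as the integer `2` versus the string `'2'`, which are easy to miss in a plain printout.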

✅ Positive example:

Suppose the system in a production environment suddenly suffers performance degradation. By rolling back one machine, or by diverting traffic for a comparison, the operations engineer finds that the input parameters do not match expectations, so the interface cannot process requests correctly. Careful investigation shows that a configuration parameter error is the cause. After it is fixed, system performance returns to normal and the business runs normally.

❌ Counterexample:

Suppose the system in a production environment suddenly fails and the business cannot run normally. The operations engineer tries to fix the problem directly, without any comparison or troubleshooting. Lacking accurate information and analysis, he cannot solve the problem effectively, and a wrong operation may even cause a more serious mistake. In this case, the operations engineer needs to adjust the plan promptly and test again to make sure the problem is handled correctly.

Preserve the scene, give feedback in real time

1. While troubleshooting, preserve the scene of the fault (for example, do not restart every machine; keep one machine in its faulty state).

2. Keep a detailed record of the measures taken and the solutions tried.

3. No progress is still progress, and it should be reported promptly too.
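
Item 1 can be turned into a simple restart plan: set one machine aside for analysis and restart the rest in batches. The sketch below only computes the plan; it does not remove traffic or restart anything. The host names, the batch size, and the choice of the first host as the preserved one are all assumptions for illustration.

```python
def plan_restart(hosts, batch_size=2):
    """Keep one host untouched for analysis; split the rest into batches."""
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to keep one and restart the rest")
    preserved, rest = hosts[0], hosts[1:]
    batches = [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return preserved, batches


hosts = ["app-01", "app-02", "app-03", "app-04", "app-05", "app-06"]
preserved, batches = plan_restart(hosts)
print("preserve (remove from traffic, do not restart):", preserved)
for number, batch in enumerate(batches, start=1):
    print(f"restart batch {number}:", batch)
```

Restarting in batches also gives a chance to observe after each batch whether the bleeding has stopped before touching the next one.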

✅ Positive example:

Suppose the system in a production environment suddenly suffers performance degradation. The operations engineer on duty sets one machine aside, removes it from traffic, and restarts the other machines in batches. On the preserved machine, the engineer dumps application snapshots (thread stacks and heap memory dumps are the commonly used types) to analyze the cause.

❌ Counterexample:

Suppose the system in a production environment suddenly suffers performance degradation. The operations engineer on duty restarts all the machines at once so that the bleeding stops and service recovers quickly. However, because no machine was kept in its faulty state, the scene is lost and the cause can no longer be analyzed.

If you find that anything here is wrong, or you have a better checklist, please feel free to correct me and get in touch so that I can add to it. Thank you for your feedback and help. I will fix any errors as soon as possible.

Author: JD Logistics Feng Zhiwen

Source: JD Cloud Developer Community, from Yuanqishuo Tech. Please indicate the source when reprinting.

[Stability] Have you learned the three-character scripture that reveals the team's quick troubleshooting? | JD Logistics Technical Team
