Essential skills for technical people: problem-solving methodology—troubleshooting

What is troubleshooting?

Troubleshooting is the process of finding the root cause of a problem and correcting it. The goal is to return the device or system to its normal working state.

Many systems, especially IT systems and some power and communication systems, run 24/7. When a fault occurs, operations and maintenance staff are expected to locate it quickly and then fix it quickly and economically. For example, a malfunction in a hospital system that supports surgery can threaten a patient's life if it is not resolved quickly. Troubleshooting is therefore a very important skill and a core technical requirement for operations and maintenance staff.

Troubleshooting is needed not only at work but also in everyday life. Some time ago, I was playing Honor of Kings with a friend and ran into a problem: every evening around 8 or 9 o'clock, the network quality deteriorated and the game became very laggy. It bothered me a lot. With the instinct of an operations engineer, or of any technical person, I wondered what was wrong with the network and how to fix it. So I went through a troubleshooting process. I tested all the wireless networks at home and the China Unicom broadband line, and tried to optimize the configuration of the wireless router. In the end, I determined that the 2.4 GHz channels used by our home and our neighbors were too crowded and the interference was too severe: during the evening peak, everyone is online and the networks interfere with each other. I then switched the router to the 5 GHz band, the world became quiet, and I could play in peace.

General approach to problem solving

Afterwards I wondered: is there a scientific, standardized process that, followed step by step, can solve any fault or problem? Problems are diverse, and so are the ways of solving them, and specific processes can be developed for specific scenarios and problems. In practice, some people work in system administration (SA), some in networking, and some as database administrators (DBA), and each specialty has troubleshooting methods tied to its own field and problem scenarios.

For common problems, are there common solutions and steps that can be followed?

One relatively general method is summarized by the author of the book "Troubleshooting and Maintaining Cisco IP Networks". He divides the entire troubleshooting process into 7 steps, from defining the problem, to collecting clues and information, to analyzing, hypothesizing, and eliminating possibilities, and finally to solving the problem.

When troubleshooting a complex system or a complex problem, we can follow this process to abstract and define the problem, and then solve it step by step.
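The hypothesize-and-eliminate loop at the core of such a process can be sketched in a few lines of Python. This is only a minimal illustration, not the book's method: the `Hypothesis` class, the `troubleshoot` function, the observation values, and the thresholds are all hypothetical, loosely modeled on the home Wi-Fi story above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    """A candidate cause plus a test that returns True if it explains the fault."""
    description: str
    test: Callable[[], bool]


def troubleshoot(problem: str, hypotheses: List[Hypothesis]) -> Optional[str]:
    """Test candidate causes in order and return the first one that is confirmed."""
    print(f"problem: {problem}")
    for hypothesis in hypotheses:
        if hypothesis.test():
            print(f"confirmed: {hypothesis.description}")
            return hypothesis.description
        print(f"eliminated: {hypothesis.description}")
    # Nothing confirmed: collect more information and form new hypotheses.
    return None


# Hypothetical measurements collected during the information-gathering step.
observations = {"wired_latency_ms": 12, "networks_on_channel": 14}

# Hypothetical candidate causes; the thresholds are illustrative assumptions.
candidates = [
    Hypothesis("broadband link is congested",
               lambda: observations["wired_latency_ms"] > 100),
    Hypothesis("Wi-Fi channel is crowded",
               lambda: observations["networks_on_channel"] > 5),
]

cause = troubleshoot("game lags every evening", candidates)
```

Ordering the hypotheses from cheapest to most expensive to test is a common choice, since an early confirmation saves the remaining checks.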

Specific strategies and techniques

Beyond this standard process, we also run into problems that are relatively simple or more intuitive, where a few specific strategies and tips let us troubleshoot more quickly.

Check the prerequisites

We often run into questions such as: why does the TV not respond when the switch is turned on? Why won't the computer turn on? Most likely the power cable is unplugged or there is a power outage. More generally, every system needs certain prerequisites in order to operate. When a system or service behaves abnormally, go back and work out what it depends on and what its prerequisites are, then check whether a condition that used to be satisfied is no longer met and has caused the failure.

For example, if a motorcycle stops while you are riding it, is it out of gas? For mature, well-commercialized products such as the iPhone, the user manual clearly states the conditions for normal operation and the conditions to avoid, such as very high or very low temperatures.

When operating self-developed systems, however, the documentation is often incomplete, so the prerequisites have to be worked out from the abnormal behavior or the problem itself. It also helps to consult the developers or designers: after some in-depth discussion you can identify the system's prerequisites and use them as clues for troubleshooting. This is the first, very basic troubleshooting method. Everyone has solved problems like this, and most problems turn out to have very common causes that experience and intuition can help find.
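As a rough sketch of this idea, prerequisites can be written down as named checks and run together whenever something looks wrong. The function name, the `state` values, and the individual checks below are hypothetical examples, not taken from any real system.

```python
from typing import Callable, Dict, List


def failed_prerequisites(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every prerequisite check and return the names of those that fail."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot even run is treated as unmet
        if not ok:
            failed.append(name)
    return failed


# Hypothetical snapshot of a server's environment.
state = {"power_on": False, "disk_free_gb": 20, "dependency_reachable": True}

# Hypothetical prerequisites, collected from documentation or from the developers.
checks = {
    "power is on": lambda: state["power_on"],
    "disk has free space": lambda: state["disk_free_gb"] > 1,
    "upstream dependency is reachable": lambda: state["dependency_reachable"],
}

print(failed_prerequisites(checks))
```

Keeping such a list next to the system also makes up, in part, for incomplete documentation: each newly discovered prerequisite becomes one more entry.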

Reduce to the minimal system

Let's move on to the next problem-solving strategy. Has anyone here built a computer? A computer has many components: CPU, memory, power supply, chassis, monitor, optical drive, mouse, audio, network card, and so on. When we build one, we do not need to install everything at once. We usually install the power supply, motherboard, CPU, and memory first, and then check whether the system works. If it powers on, the most important parts of the system are OK. From a troubleshooting perspective, then, when locating a fault you can reduce a complex system with many functions and components to its most basic form. Once that minimal system tests OK, add the other components back one by one, so that you find and solve the problem with far less effort.
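The same procedure can be sketched in code: start from the minimal set, verify it, then add one component at a time until the test fails. The function name and the `boots` test below are hypothetical stand-ins for whatever "does the system work?" means in a given case.

```python
from typing import Callable, List, Optional


def first_breaking_component(
    core: List[str],
    extras: List[str],
    works: Callable[[List[str]], bool],
) -> Optional[str]:
    """Start from the minimal system and add the extras one by one.

    Returns the first extra whose addition makes the system fail, or None
    if everything works. Raises ValueError if the minimal system itself fails.
    """
    installed = list(core)
    if not works(installed):
        raise ValueError("the minimal system itself fails; check the core components")
    for component in extras:
        installed.append(component)
        if not works(installed):
            return component
    return None


def boots(components: List[str]) -> bool:
    """Hypothetical test: in this example, the network card is the faulty part."""
    return "network card" not in components


core = ["power supply", "motherboard", "CPU", "memory"]
extras = ["monitor", "optical drive", "network card", "audio"]

print(first_breaking_component(core, extras, boots))
```

This sketch stops at the first failure and assumes a single faulty component; if several parts may be bad, remove the one that was found and repeat the procedure.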

Restore default state/restart

Another scenario is similar to the first one: after running for a long time, a system stops working normally. How do you fix it? Restart it. My previous employer had an unwritten rule that important systems should be checked before holidays. Any system that had not been restarted for a certain number of days would get a planned restart, to avoid abnormal states caused by long-term operation.

A restart returns the system to its initial state and thereby clears the fault, which makes it a very powerful method. Of course, unintended consequences need to be considered before restarting, such as a startup failure that makes things worse. Besides rebooting, you can also reinstall or rebuild the system from a copy of the default or a known working system.
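The pre-holiday check described above amounts to comparing each system's uptime against a threshold. Here is a minimal sketch, assuming the last boot times are already known; the host names, dates, and the 90-day threshold are made up for illustration, since the actual number of days is not specified.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def due_for_restart(
    last_boot: Dict[str, datetime],
    now: datetime,
    max_uptime: timedelta,
) -> List[str]:
    """Return the systems whose uptime exceeds max_uptime, longest-running first."""
    overdue = [name for name, booted in last_boot.items() if now - booted > max_uptime]
    return sorted(overdue, key=lambda name: last_boot[name])


# Hypothetical inventory of important systems and their last boot times.
last_boot = {
    "db-1": datetime(2023, 6, 1),
    "web-1": datetime(2023, 12, 20),
    "cache-1": datetime(2023, 3, 15),
}

# The 90-day threshold is an assumption chosen only for this example.
print(due_for_restart(last_boot, now=datetime(2024, 1, 1), max_uptime=timedelta(days=90)))
```

The function only produces a list of candidates. Deciding when and whether to restart each one, and what to do if it fails to come back up, remains a human decision.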

Replace one, and only one, component at a time

Once analysis has narrowed the fault down to a certain subsystem or a few modules, is there a way to pinpoint it quickly? Try replacing one of the parts and testing again. With this method you can locate the exact fault point step by step and then fix it, which also gives you valuable experience for similar problems in the future. The important rule is to replace only one component at a time. If another component needs to be replaced after a test, first restore the previous change to its original state. Otherwise, the changes themselves may introduce new problems that interfere with the diagnosis.
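The "one change at a time, then restore" rule can be expressed as a short loop. In this sketch the part names, the spares, and the `works` test are hypothetical; the function only locates the faulty slot and always puts the original part back before moving on.

```python
from typing import Callable, Dict, Optional


def find_faulty_by_swap(
    system: Dict[str, str],
    spares: Dict[str, str],
    works: Callable[[Dict[str, str]], bool],
) -> Optional[str]:
    """Swap in one known-good spare at a time and return the slot that fixes the fault."""
    for slot, spare in spares.items():
        original = system[slot]
        system[slot] = spare      # change exactly one thing
        fixed = works(system)
        system[slot] = original   # restore before trying anything else
        if fixed:
            return slot
    return None


# Hypothetical system in which the memory module "dimm-A" is the faulty part.
FAULTY_PARTS = {"dimm-A"}


def works(parts: Dict[str, str]) -> bool:
    """Hypothetical test: the system works when no faulty part is installed."""
    return FAULTY_PARTS.isdisjoint(parts.values())


system = {"disk": "disk-A", "memory": "dimm-A", "nic": "nic-A"}
spares = {"disk": "disk-B", "memory": "dimm-B", "nic": "nic-B"}

print(find_faulty_by_swap(system, spares, works))
```

Because every swap is undone, the system is in its original configuration after the search, and the actual repair can be made as a separate, deliberate change.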

Final thoughts

Troubleshooting is both a science and an art. Beyond the strategies above, you can also try to reproduce the problem, change the startup or configuration order, and so on. In practice, choose the strategy that best fits your time, resources, scenario, and constraints. Happy troubleshooting!

About the author

Teng Chuanyong is a cloud architect at Meituan. He previously worked on system and service operations and maintenance at Baidu and eBay, covering basic service operations, large-scale system deployment and optimization, virtualization, and more. He joined Meituan in 2012, where he is responsible for operations and maintenance, focusing mainly on basic service operations, data center and network construction, and building and operating the cloud computing environment.

Origin blog.csdn.net/yaxuan88521/article/details/134225352