An old hand's "four moves" for mastering operations and maintenance

Zhe Ye 360 ​​Cloud Computing

A word from the hostess:

After posting so many "technical articles", I feel it is time to post a "theory essay". As in martial arts, you need both technique and inner strength. This time we invited an operations and maintenance engineer with seven years of experience to write for the Hulk official account. The style is still mainly practical, and the focus is on the knowledge and experience accumulated in this role. First up is a short piece on how to handle online failures quickly and effectively. I hope you find it useful.
PS: A wealth of front-line technology in a variety of formats can be found in "HULK first-line technology talk". Please follow!

Preface

Every operations and maintenance engineer runs into sudden, urgent failures. With a core failure, every second counts. After all, you may be serving hundreds of millions of users, and every second of downtime may cost as much as one or two one-bedroom apartments in Wudaokou.
As an operations and maintenance engineer, besides staying calm even when Mount Tai collapses in front of you, you must master a problem-solving methodology and apply it unswervingly. Only then can you stay composed when a failure occurs and solve the core problem.

Preface

From a philosophical point of view, a doctor examines a person for illness, and when a failure occurs we are likewise examining a machine for illness. I therefore sum this methodology up with the four diagnostic methods of traditional Chinese medicine: "look, listen, ask, and feel the pulse."

Look

"One who knows by looking observes the five colors to know the illness."

The key here is to gather macro-level information, capture the surface symptoms of the failure, and form an overall judgment.

"On the battlefield, keep your eyes on every side and your ears in every direction." Usually you need the state of the network, the domain name system, load balancing, the web tier, the back-end databases, the caches, and the various other systems that support the business logic. Do not seize on a single piece of information, ignore everything else, and tunnel into analyzing it.

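The idea of surveying every layer before analyzing any one of them can be sketched as follows. This is a minimal illustration, not the author's tooling: the probe functions and their return values are hypothetical stand-ins for real checks against DNS, the load balancer, the database, and so on. The point is that one failing probe never stops the others from reporting.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical probes: each returns a short status string or raises on failure.
def check_dns() -> str:
    return "ok"

def check_load_balancer() -> str:
    return "ok"

def check_web() -> str:
    raise RuntimeError("5xx rate above threshold")

def check_database() -> str:
    raise TimeoutError("replica lag")

def check_cache() -> str:
    return "ok"

PROBES: Dict[str, Callable[[], str]] = {
    "dns": check_dns,
    "load_balancer": check_load_balancer,
    "web": check_web,
    "database": check_database,
    "cache": check_cache,
}

def survey(probes: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every probe concurrently and record every result, even after failures."""
    def run(probe: Callable[[], str]) -> str:
        try:
            return probe()
        except Exception as exc:  # a failed probe is a finding, not a crash
            return f"FAIL: {exc}"

    with ThreadPoolExecutor(max_workers=len(probes)) as pool:
        futures = {name: pool.submit(run, probe) for name, probe in probes.items()}
        return {name: future.result() for name, future in futures.items()}

if __name__ == "__main__":
    for layer, status in survey(PROBES).items():
        print(f"{layer:15} {status}")
```

Seeing the web and database failures side by side is what allows an overall judgment; stopping at the first failing layer would hide the second one.
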
Listen

"One who knows by listening hears the five tones to distinguish the illness."

Machines and systems are not people; they cannot point to where it hurts. We need to let them tell us their own symptoms, and that requires a feature-rich, easy-to-use, and stable monitoring and alerting system.

Real-time alerts tell us at the earliest moment that a machine or system has a problem. Detailed alert data gives strong support to the handling that follows. A layered monitoring and alerting mechanism lets operations engineers strip away interference layer by layer and find the problem promptly.
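
One way a layered alerting mechanism can strip away interference is to suppress alerts on components whose dependencies are also alerting, so that only the likely root remains. The sketch below assumes a hypothetical dependency map (`DEPENDS_ON`) and is not taken from any particular monitoring product.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["load_balancer", "database", "cache"],
    "load_balancer": ["network"],
    "database": ["network"],
    "cache": ["network"],
    "network": [],
}

def root_alerts(firing: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Keep only alerts whose direct or transitive dependencies are not firing."""
    def has_firing_dependency(component: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(component, []):
            if dep in seen:
                continue  # already examined on this walk
            seen.add(dep)
            if dep in firing or has_firing_dependency(dep, seen):
                return True
        return False

    return {c for c in firing if not has_firing_dependency(c, set())}

if __name__ == "__main__":
    # Three alerts fire, but only the network alert has no firing dependency.
    print(root_alerts({"web", "database", "network"}, DEPENDS_ON))
```
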
The highest level of "listening" is a failure early-warning system. The ideal scenario is to train an early-warning model on the system's business model, growth, and other baseline data, so that we are notified before a failure occurs.
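
As a much simpler stand-in for such a trained model, the sketch below fits a straight line to hourly disk-usage samples and estimates how many hours remain before the disk fills. The function names, the hourly sampling interval, and the 24-hour warning horizon are all assumptions made for illustration.

```python
from typing import Optional, Sequence

def hours_until_full(usage_pct: Sequence[float], limit: float = 100.0) -> Optional[float]:
    """Least-squares line through hourly samples, extrapolated to `limit`.

    Returns the hours remaining after the last sample, or None when usage
    is flat or shrinking (or there are too few samples to fit a line).
    """
    n = len(usage_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    hour_at_limit = (limit - intercept) / slope
    return max(0.0, hour_at_limit - (n - 1))

def should_warn(usage_pct: Sequence[float], horizon_hours: float = 24.0) -> bool:
    """Warn when the projected time to full falls within the horizon."""
    eta = hours_until_full(usage_pct)
    return eta is not None and eta <= horizon_hours

if __name__ == "__main__":
    samples = [70, 72, 74, 76, 78]  # made-up readings, two points per hour
    print(hours_until_full(samples), should_warn(samples))
```
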

Ask

"One who knows by asking inquires which of the five flavors the patient craves, to learn where the illness arose and where it lies."

Once we have the status of each system and have received the alerts, we have enough data about the failure to make a preliminary judgment.

To firm up that judgment, we need to trace the identified problems in detail. As the saying goes, "without asking, knowledge does not grow."

For example, ask the development team whether a feature has just been launched; ask the operations colleagues whether the configuration was changed or the architecture adjusted; ask the network team whether we are under attack; even ask product and operations whether a promotion is currently running.

If the problem was caused by a change, can it be solved by a quick rollback? If not, the development and operations teams need to work out a response strategy together to restore service as soon as possible.
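
If the answers to these questions are recorded as a change log, the questioning can be partly automated: list every change applied shortly before the failure began and note which ones can be rolled back. The `Change` record, its fields, and the one-hour window below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Change:
    team: str            # who made the change (dev, ops, network, product...)
    description: str
    applied_at: datetime
    can_roll_back: bool

def suspects(changes: List[Change], failure_start: datetime,
             window: timedelta = timedelta(hours=1)) -> List[Change]:
    """Changes applied within `window` before the failure, newest first."""
    recent = [c for c in changes
              if failure_start - window <= c.applied_at <= failure_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)

if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)  # made-up failure time
    log = [
        Change("dev", "launch feature X", start - timedelta(minutes=10), True),
        Change("ops", "change web server config", start - timedelta(minutes=40), True),
        Change("ops", "migrate cache cluster", start - timedelta(hours=5), False),
    ]
    for c in suspects(log, start):
        action = "try rollback" if c.can_roll_back else "needs a response plan"
        print(f"{c.applied_at:%H:%M} {c.team:4} {c.description}: {action}")
```
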

Feel the pulse

"One who knows by feeling the pulse examines the cunkou and observes deficiency and excess, to know the illness and where it lies."

After the three steps of looking, listening, and asking, most macro-level failures have essentially been located, and solutions may even have been found. If the failure is caused by micro-level system factors, such as OOM, disk damage, port conflicts, deadlocks, or even frequent restarts, then we need to master the operating system's common profiling, tracing, and OS tuning tools to "feel the pulse".

This is where the value of operations and maintenance engineers shows most clearly. We must be proficient with command-line tools such as {io,mp,vm}stat, sar, {l,s,d}trace, and oprofile, and resolve the details by analyzing their output and logs.
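
As a small example of analyzing tool output, the sketch below parses text in the layout that Linux `vmstat` commonly prints and flags a few symptoms. The sample numbers are made up, the column layout can differ between vmstat versions, and the thresholds (iowait of 30 percent, four CPUs) are assumptions rather than recommendations.

```python
from typing import Dict, List

# Made-up sample in the usual Linux vmstat layout; real output may differ.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 524288    0    0     5    12  210  430  3  1 95  1  0
 9  4 204800  20480   1024  65536  350  420  8000  9000 1800 5200 20 15 10 55  0
"""

def parse_vmstat(text: str) -> List[Dict[str, int]]:
    """Turn vmstat text into one dict per sample row, keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[1].split()  # line 0 is the group banner, line 1 the columns
    return [dict(zip(header, map(int, ln.split()))) for ln in lines[2:]]

def findings(row: Dict[str, int], cpus: int = 4) -> List[str]:
    """Flag common symptoms in a single vmstat sample (assumed thresholds)."""
    notes = []
    if row["si"] > 0 or row["so"] > 0:
        notes.append("swapping (si/so > 0): possible memory pressure")
    if row["wa"] >= 30:
        notes.append("high iowait: disk may be the bottleneck")
    if row["r"] > cpus:
        notes.append("run queue longer than CPU count: CPU contention")
    return notes

if __name__ == "__main__":
    for i, row in enumerate(parse_vmstat(SAMPLE_VMSTAT)):
        print(i, findings(row) or ["nothing unusual"])
```
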

Summary

Beyond the methods above, always double-check your critical operations with another engineer so that you do not cause a failure on top of a failure. At each stage, report progress regularly to the relevant core teams and to your own manager so that everyone is working from the same information. After each recovery, calm down and review how the failure was handled to improve your experience and efficiency, and, based on the cause of the failure, consider how to optimize the current system design.

Imagine the moment you press Enter and bring everything back under control: perhaps you are met by the encouraging gaze of your bosses, and by applause, grateful or admiring, from the once-incredulous development team gathered around you.

Origin blog.51cto.com/15127564/2668177