Operations and maintenance: incident postmortems (Case Study)

A post-incident review, known in English as a Case Study, is necessary after a failure. How far to take it depends on the severity of the failure: not every failure can get a Case Study unless staff and time are plentiful.

Writing documents is a skill in its own right. Most engineers have weak or average documentation skills, but in practice most document types have a template, and filling in the template's sections covers most of the content.

Failures should be recorded. Every company should have a wiki, and these postmortems should be written up there; you can learn a lot from them. A Case Study takes a lot of time, but it is still necessary for mid-level and major failures.

The whole postmortem routine is described below:


 

Fault description

       The business raised an xxx status code alarm. Three cloud hosts running MySQL went down; the root cause was traced to downtime of the physical host they were located on.

Failure Replay

  1. 16:00 Failure starts
  2. 16:02 Status code xxx alarm detected
  3. 16:03 Ops checks the alarms: the web machines are normal, but down alarms were also received for three database machines
  4. 16:06 xxxxx
  5. 16:11 Cloud vendor reports that the single physical machine hosting the 3 abnormal cloud hosts has gone down; the vendor's operations colleagues are handling it as an emergency
  6. 16:14 Cloud vendor reports that the physical machine is starting up
  7. 16:22 Jinshan reports that the machine started successfully and that a live migration was performed
  8. 16:23 Cloud hosts are up and the databases are started; the alarms recover (the 5xx status code alarm recovers)
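
As a rough illustration, the intervals in a replay like this can be computed from the timestamps rather than by hand. The sketch below is not part of the original write-up: it uses a subset of the entries above, and the names `minutes_between`, `time_to_detect` and `time_to_recover` are hypothetical. It assumes all times fall on the same day.

```python
from datetime import datetime

# A subset of the replay entries above: (time, event). All on the same day.
timeline = [
    ("16:00", "failure starts"),
    ("16:02", "status code alarm detected"),
    ("16:11", "cloud vendor reports the physical machine is down"),
    ("16:23", "cloud hosts up, alarms recover"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as 'HH:MM'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from the start of the failure to the first alarm.
time_to_detect = minutes_between(timeline[0][0], timeline[1][0])
# Time from the start of the failure to recovery.
time_to_recover = minutes_between(timeline[0][0], timeline[-1][0])

print(f"time to detect: {time_to_detect} min")
print(f"time to recover: {time_to_recover} min")
```

Keeping the replay as structured data like this makes it easy to compare detection and recovery times across several postmortems on the wiki.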

Root cause

    A physical failure of the host machine took down the multiple cloud hosts running on it at the same time.

Impact scope

     1. Fault time: 06/16 16:00 ~ 06/16 16:23 (23 min of downtime)
     2. Affected services: xxxx
     3. Loss ratio: 11.35%
           Total error count: 66312
           Total request count: 584472
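
The loss ratio above is the total error count divided by the total request count. A minimal sketch of that arithmetic follows; the function name `loss_ratio_percent` is hypothetical, and rounding to two decimal places is an assumption chosen to match the figure quoted above.

```python
# Figures taken from the impact scope above.
total_errors = 66312
total_requests = 584472


def loss_ratio_percent(errors: int, requests: int) -> float:
    """Return errors as a percentage of requests, rounded to two decimals."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return round(errors / requests * 100, 2)


ratio = loss_ratio_percent(total_errors, total_requests)
print(f"Loss ratio: {ratio}%")
```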

Follow-up improvements

  1.  Spread the cloud hosts out so that they are distributed across different physical hosts.
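
One way to check for this kind of risk is to look for cloud hosts of the same service that share a physical host. The sketch below is only an illustration: the placement data, the host names and the function `colocated_groups` are all hypothetical, and in practice the mapping would have to come from whatever inventory the cloud vendor exposes.

```python
from collections import defaultdict

# Hypothetical placement data: cloud host name -> physical host it runs on.
placement = {
    "mysql-1": "phys-a",
    "mysql-2": "phys-a",
    "mysql-3": "phys-a",
    "web-1": "phys-b",
}


def colocated_groups(placement: dict, prefix: str) -> dict:
    """Return {physical host: [cloud hosts]} for every physical host that
    runs more than one cloud host whose name starts with prefix."""
    by_physical = defaultdict(list)
    for cloud_host, physical_host in placement.items():
        if cloud_host.startswith(prefix):
            by_physical[physical_host].append(cloud_host)
    return {
        physical: sorted(hosts)
        for physical, hosts in by_physical.items()
        if len(hosts) > 1
    }


# Any non-empty result means one physical failure can take down several
# hosts of the same service at once.
risky = colocated_groups(placement, "mysql-")
print(risky)
```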

The above is a simple postmortem template. The first step is to reconstruct the entire failure along the timeline, from beginning to end. The second is to identify the problem (the root cause). The third is to work out specific improvements and optimizations so that similar failures do not recur.

 


Origin www.cnblogs.com/topicjie/p/11111805.html