Production system data loss recovery case

Today's post is different from my usual content: it is a record from production practice, which I have only lightly edited and revised.

Teacher Liu recorded a real case to summarize and reflect on his troubleshooting process. The article not only lays out his reasoning but also clearly documents, step by step, how the problem was solved. Read it carefully and you will find that his diagnostic thinking is clear and his repair methods are rigorous. You will also notice one of his favorite habits: going to GitHub to find tools and putting them to use, which reflects both the strengths of open source software and the spirit of the open source community. Now, let's read Teacher Liu's original text.

Author: Liu Baozhen

Position: Architect at a technology subsidiary of a large asset management company, with many years of experience planning and designing large private clouds. Familiar with software development processes, and currently absorbed in researching DDD-based and agile software development models. I have an in-depth understanding of distributed architecture, and I hope to exchange experience in software architecture and cloud computing architecture with friends. I would also like to thank Mr. Xu for reviewing and revising this article.

Recovery after loss of production system data

1. Background and general ideas

On February 25, 2020, WeChat Moments was flooded with reposts about Weimeng suffering a major system failure, with its core production data still unrestored after 36 hours. This reminded me of a case I handled two weeks earlier, in which a developer accidentally deleted production data. During the recovery I summarized the troubleshooting process and sorted out what I learned, hoping it serves as a warning for fellow operations and maintenance engineers.

At 23:00 on the 13th, I received a WeChat message: could I help recover some lost data?

The system environment information is as follows:

Operating system: RHEL7.5

Workflow platform: the open source Activiti.

Business application: calls Activiti to generate the application's process data.

Database used by the workflow platform: MySQL 5.7 Community Edition, one master with two replicas.

At 23:05, I began to work on the data-loss failure.

First, I confirmed a general approach to the problem:

 1. Find out who performed what operation, and at what point in time.

 2. How large is the impact of this operation on the system, and does it affect other systems? Confirm whether the operation reflects normal business behavior.

 3. Confirm the time window of the affected records in the database logs.

 4. Replay the entire failure in the simulation environment.

 5. Draw up a technical recovery plan and verify it in the simulation environment.

 6. Verify in the simulation environment that the application works normally after the data is restored.

 7. Back up the production data, then apply the recovery plan to production.

 8. Run green-light (smoke) tests in production; once everything checks out, the recovery is complete.

Since restoring production data is a major data change, it must be reported to leadership for approval, and a complete data rollback plan is required.

2. Data recovery process and technical analysis

It took 5 minutes to clarify how to handle the problem; the next step was the concrete data recovery. Along the way, two difficulties had to be solved.

  1. Confirm the start and end points of the binlog range to be restored.

  2. Based on that binlog range, confirm the data recovery plan, including whether other, unrelated data changes within the same window need to be excluded.

Solve the first problem first:

  1. Ask the developer. The developer said that at around 20:20 in the evening, a REST interface was called on the Activiti (hereinafter "workflow") platform to delete a process template, which cascaded and deleted all process instances under that template. Five in-flight processes under the template had not yet finished.

  2. Based on the developer's description, log in to the workflow platform's database, check which binlog file covers roughly 20:20, and back up binlog file No. 11.

  3. Copy the binlog to a development server and parse it with mysqlbinlog. The parsing command is: mysqlbinlog -v --base64-output=decode-rows --skip-gtids=true --start-datetime='2020-02-13 20:10:00' --stop-datetime='2020-02-13 21:30:00' -d {$DBNAME} mysql-bin.000011 >> aa.log (the database name is redacted).

  4. Observing the parsed SQL, no large batch of delete operations appeared around 20:20, so the developer's account could not be taken at face value. This is the first principle of fault diagnosis: you can neither believe everything people tell you nor dismiss it entirely; treat their statement as a hypothesis and go find evidence for or against it.

  5. Continuing through the parsed binlog, a large number of delete and update operations began to appear at 20:30, and I started to suspect this was the problem window.

  6. Summarize the SQL in this window: which tables were touched, what kinds of operations were run against them, and which business IDs were affected. Then confirm with colleagues on the workflow platform whether deleting a workflow template would change exactly these tables. They confirmed that this matched the deletion flow, and hope of recovering the data was born!
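The per-table summary in this step can be scripted. Below is a minimal sketch against a tiny inline sample; the table names follow Activiti's ACT_RU_* convention but the sample rows and the `wf` schema name are made up for illustration. In the real case, the input would be the aa.log produced in step 3.

```shell
# Tiny stand-in sample of `mysqlbinlog -v --base64-output=decode-rows` output
cat > /tmp/aa_sample.log <<'EOF'
### DELETE FROM `wf`.`ACT_RU_TASK`
### WHERE
###   @1=1001
### UPDATE `wf`.`ACT_RU_EXECUTION`
### WHERE
###   @1=2001
### SET
###   @1=2001
### DELETE FROM `wf`.`ACT_RU_TASK`
### WHERE
###   @1=1002
EOF

# Count row operations per table to get a quick damage summary
awk '/^### (DELETE FROM|UPDATE|INSERT INTO)/ {print $2, $NF}' /tmp/aa_sample.log \
  | sort | uniq -c | tee /tmp/aa_summary.txt
```

Pointing the same awk pipeline at the real aa.log gives the operation-type-by-table inventory that the workflow colleagues were asked to confirm.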

  7. From past experience, I remembered an open source project on GitHub, binlog2sql, which can translate binlog events into SQL statements or into their reverse statements. I immediately felt this problem should be "easy" to solve.

  8. With that in mind, install the binlog2sql tool in the simulation environment. It is a Python program, so the Python environment and the required third-party libraries must be installed first. For usage, see: https://github.com/danfengcao/binlog2sql. Thanks again to Mr. Cao, the author of the tool.
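The invocation looks roughly like the sketch below, following the tool's documentation. The host, credentials, database name, and time window here are placeholders, not the real (redacted) environment:

```shell
# Install binlog2sql and its Python dependencies
git clone https://github.com/danfengcao/binlog2sql.git
cd binlog2sql
pip install -r requirements.txt

# Parse the suspect window and print the SQL that was actually executed
python binlog2sql/binlog2sql.py -h 127.0.0.1 -P 3306 -u repl_user -p 'xxxx' \
    -d workflow_db \
    --start-file='mysql-bin.000011' \
    --start-datetime='2020-02-13 20:25:00' \
    --stop-datetime='2020-02-13 20:40:00' > executed.sql

# The same window with --flashback emits the reverse SQL instead
python binlog2sql/binlog2sql.py --flashback -h 127.0.0.1 -P 3306 -u repl_user -p 'xxxx' \
    -d workflow_db \
    --start-file='mysql-bin.000011' \
    --start-datetime='2020-02-13 20:25:00' \
    --stop-datetime='2020-02-13 20:40:00' > rollback.sql
```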

  9. In the simulation environment, restore the affected production instance, and point the workflow application's JDBC URL at the newly restored instance.

The steps above solved the first problem; next came the second.

  1. The production failure has now been reproduced in the simulation environment, and the binlog-to-SQL tool is installed there.

  2. Use binlog2sql to extract the SQL that was executed by mistake and have the workflow platform team confirm it. At the same time, have them confirm that no other application was writing to the database during this window.

  3. The workflow colleagues confirmed that all of the SQL came from the misoperation. The specific verification method was as follows:

(a) Create a workflow template in the simulation environment.

(b) Create several test cases on this template.

(c) Delete the workflow template through the interface and observe the SQL the application generates, to confirm that the SQL I provided matched.

The workflow platform also confirmed that no other application operated on the database during the problem window. Victory felt within sight; it seemed the problem could be solved easily.

  4. Generate the reverse SQL with binlog2sql and apply it to the simulation environment, and the problem should be solved. But on careful inspection of the reverse SQL file, parts of it were garbled. Checking the table containing the garbled field revealed its definition.

(Screenshot: the table definition)

One of the columns in the table is a longblob, and the generated INSERT SQL for it cannot be executed. How to deal with this problem?
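To see why a longblob defeats a text-based reverse SQL file, here is a minimal shell illustration. The six bytes are made up for the demo; the point is that arbitrary binary data is not valid UTF-8 text:

```shell
# Six made-up bytes standing in for longblob content (0xAC 0xED 0x00 0x05 0xFF 0xFE)
printf '\254\355\000\005\377\376' > /tmp/blob.bin

# Forcing the bytes through a text (UTF-8) pipeline, as a generated .sql file does,
# fails: arbitrary binary is not valid text, so the INSERT cannot round-trip
if iconv -f UTF-8 -t UTF-8 /tmp/blob.bin > /dev/null 2>&1; then
    echo "text-safe"
else
    echo "not text-safe"
fi

# The bytes themselves are intact on disk; only the text representation breaks
od -An -tx1 /tmp/blob.bin
```

This is why the recovery has to stay at the binary binlog level rather than go through generated SQL text.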

  1. The problem hit a deadlock here. Just when it seemed solvable, one table's data could not be re-inserted through SQL. I asked the workflow colleagues whether this table mattered and got the reply: without the data in this table, the system cannot run.

  2. Think about it from another angle. Since the SQL is generated from the binary binlog, one could instead generate a reverse binary binlog and apply that binlog to the database directly, sidestepping the problem entirely.

  3. With this idea, I went back to GitHub to look for a project, and sure enough there is one: https://github.com/Meituan-Dianping/MyFlash. Many thanks again to Meituan-Dianping for open-sourcing MyFlash.

  4. Use MyFlash to generate a reverse binary file and apply it to the database. The workflow platform was tested in the simulation environment, and the data was perfectly restored.
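For reference, the MyFlash invocation looks roughly like the sketch below, with flags taken from the project's documentation. The database name, host, and time window are placeholders for the real (redacted) environment:

```shell
# Build MyFlash once, per its README (requires gcc and the MySQL client headers)
gcc -w `mysql_config --include` source/binlogParseGlib.c -o binary/flashback

# Generate a reverse binlog covering only the bad window
./binary/flashback \
    --databaseNames="workflow_db" \
    --binlogFileNames=mysql-bin.000011 \
    --start-datetime="2020-02-13 20:25:00" \
    --stop-datetime="2020-02-13 20:40:00"

# MyFlash writes binlog_output_base.flashback; replaying it undoes the changes,
# longblob columns included, because everything stays at the binary binlog level
mysqlbinlog --skip-gtids binlog_output_base.flashback | mysql -h 127.0.0.1 -u root -p
```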

3. Reflection on the problem

With the analysis above, the problem itself was solved fairly easily. But I asked myself a few questions:

  1. Why not restore the database from a backup?

This system's data is backed up, with a full backup every day. The reason a backup restore could not be used is that the workflow platform hosts the process engines of many applications. A point-in-time restore would roll back those other applications' data as well, causing them to lose part of their data.

  2. Why not table-level data recovery?

Because the workflow platform is an open source platform, the relationships in its data model are particularly strong. Restoring individual tables would easily violate the data constraints between them.

Reflection:

  1. Why did the production environment lose data in the first place?

The developers bypassed the simulation environment during the release and went straight to production; they were not rigorous about the release process. Although a management process exists, it was not followed well in practice.

  2. The technical capability of the R&D personnel: the developers were not familiar with Activiti, nor with the procedure for modifying a process template. Improving their technical capabilities must be put on the agenda.

4. Follow-up improvements

In light of the above analysis, some supporting measures need to be put in place to improve the release process.

  1. Automate the release process: application code is released automatically, with human involvement avoided as much as possible.

  2. Standardize the application release process: all scripts and deployment steps for a new application must be verified before it can go live.

This concludes the full text.

Origin blog.51cto.com/15080016/2642085