Lecture 86: Enterprise-level data backup and recovery case using MySQLDump and Binlog logs

1. Enterprise-level data backup and recovery case description

Case background:

An Internet company runs MySQL 5.7.35 on CentOS 7.5. The total data volume is about 100 GB, and the daily data increment is about 10 MB.

Data backup strategy:

Use mysqldump to back up the entire database at 0:00 every night, and also back up the Binlog logs.
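Below is a minimal sketch of how such a schedule could be implemented with cron. The script name, credentials, and the step of copying the Binlog files into the backup directory are illustrative assumptions, not details from the original case.

# /etc/crontab entry (hypothetical): run the full backup at 00:00 every night
00 0 * * * root /usr/local/bin/mysql_full_backup.sh

#!/bin/bash
# /usr/local/bin/mysql_full_backup.sh -- a sketch; adjust paths and credentials to your environment
BACKUP_DIR=/data/backup
DAY=$(date +%F)
# full logical backup of all databases, with the Binlog position recorded in the dump
mysqldump -uroot -p123456 -A -R --triggers -E --master-data=2 --single-transaction \
    > ${BACKUP_DIR}/all_db_bak-${DAY}.sql
# also keep a copy of the current Binlog files, as the strategy requires
mkdir -p ${BACKUP_DIR}/binlog-${DAY}
cp -a /data/mysql/mysql-bin.* ${BACKUP_DIR}/binlog-${DAY}/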

Fault description:

At 3 p.m. on a Wednesday, for some reason all of the data in the database was damaged, and the platform could no longer be used normally.

Troubleshooting process:

  • 1) First, publish a platform maintenance announcement, suspend the service, and have everyone focus on repairing the data.
  • 2) Then assess the extent of the data damage:
    • Is all of the data lost? If so, it is recommended to restore the data directly in the online production database.
    • Is only part of the data lost? Losing only some table data is usually caused by a mis-operation by operations or development staff. In that case, export the affected tables' data from the backup and the Binlog, perform the recovery in the pre-release environment's database, have test colleagues verify it, and only then apply it to the online environment.
  • 3) After confirming that all data has been lost, restore Wednesday's early-morning full backup directly in the online database, bringing the data back to its state as of early Wednesday morning.
  • 4) After the full recovery is complete, extract the Binlog covering early morning up to the time of the failure and replay it to recover the remaining data.
  • 5) Have test colleagues verify the consistency of the data.
  • 6) Once everything checks out, resume online service.

Processing result:

After about 30 to 40 minutes of processing, the platform recovered.

Let's start simulating this case.

2. The first step: full data backup in the early morning of Wednesday

Automatic data backup on Wednesday morning:

[root@mysql ~]# mysqldump -uroot -p123456 -A -R --triggers -E --master-data=2 --single-transaction > /data/backup/all_db_bak-`date +%F`.sql
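For reference, the options used in this command mean the following (a quick summary added here for clarity):

# -A                    dump all databases
# -R                    include stored procedures and functions
# --triggers            include triggers
# -E                    include events
# --master-data=2       write the Binlog file name and Position into the dump as a comment
# --single-transaction  take a consistent InnoDB snapshot without locking the tables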

After arriving at work on Wednesday, check the status of the backup.

1. Check that the backup file exists
[root@mysql ~]# ll /data/backup/all_db_bak-`date +%F`.sql
-rw-r--r-- 1 root root 884223 Jul  2 00:45 /data/backup/all_db_bak-2022-07-02.sql

2. Check the contents of the backup
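The dump is a plain SQL file, so its contents can be checked directly with standard tools (a quick example using the backup created above):

[root@mysql ~]# head -n 30 /data/backup/all_db_bak-`date +%F`.sql      # file header, including the Binlog status lines
[root@mysql ~]# grep -E 'CHANGE MASTER|GTID_PURGED' /data/backup/all_db_bak-`date +%F`.sql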

3. Record the Binlog status information from the early-morning backup file (the Position number and GTID number at the point the backup started)
# Make this a daily habit: note down the Binlog status at the backup's start point. If a data problem occurs later that day, before the next backup, you can quickly locate the corresponding Binlog position.
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000012', MASTER_LOG_POS=14417;
SET @@GLOBAL.GTID_PURGED='e0a2c0cc-f835-11ec-8a3c-005056b791aa:1-66';

3. The second step: simulate the business operations between the Wednesday early-morning backup and 3 p.m.

Simulate the business operations that occurred after the backup was completed in the early morning of Wednesday and before the database exception occurred at 3 p.m.

CREATE TABLE xscjb (
	xh INT COMMENT '学号',
	xm VARCHAR ( 20 ) COMMENT '姓名',
	ywcj INT COMMENT '语文成绩',
	sxcj INT COMMENT '数学成绩',
	yycj INT COMMENT '英语成绩'
) COMMENT '学生成绩表';
    
insert into xscjb VALUES (1, '小明', 45, 75, 93 );
insert into xscjb VALUES (2, '小红' , 47, 56, 25);
insert into xscjb VALUES (3, '小兰', 82, 91, 89);
insert into xscjb VALUES (4, '小黄', 88, 75, 66);
insert into xscjb VALUES (5, '小李', 93, 96, 91);
insert into xscjb VALUES (6, '小江', 97, 67, 65);
insert into xscjb VALUES (7, '小王', 75, 58, 32);

update xscjb set ywcj = '100' where xh = '7';
update xscjb set sxcj = '77' where xm = '小兰';
update xscjb set yycj = '99' where xm = '小王';

delete from xscjb where xh = '4';

The state of the xscjb business table just before the database failure:


4. The third step: simulate database file damage and data loss that make the platform unusable

Simulate the database files being damaged and the data being lost, making the platform unusable.

Simply drop the platform's database and remove its files on disk:
mysql> drop database db_1;
[root@mysql ~]# rm -rf /data/mysql/db_1/

Simulation result: the platform library (db_1) is completely damaged and gone, but the MySQL instance itself still runs. (In this simulation we deliberately do not delete every file under the data directory, because the Binlog files live in the same path. If all database files were destroyed, the instance would have to be reinitialized during the repair and the Binlog would be overwritten as well. So we only simulate the platform library's files being damaged and lost.)

Note: if you only delete a database's files on disk, you can no longer drop that database interactively; the MySQL instance would have to be restarted before it can be dropped. So when simulating the damage, drop the database first and then delete its files on disk.

5. The fourth step: publish the service-suspension announcement and have everyone enter the data recovery stage

It is now 3 o'clock in the afternoon and the database has suddenly failed: the platform library's files are all damaged, making the platform inaccessible.

The service-suspension announcement has been issued, and the troubleshooting and resolution phase begins.

After some analysis, it was determined that all of the database files belonging to the platform were damaged and all data in the database was lost, not just some table data. We need to enter the data repair stage immediately, and the time required is unknown!

6. The fifth step: the emergency data recovery process

6.1. Restore the full backup data of the platform library in the early hours of today

1. Locate this morning's data backup file.
[root@mysql ~]# ll /data/backup/all_db_bak-2022-07-02.sql 
-rw-r--r-- 1 root root 884223 Jul  2 00:45 /data/backup/all_db_bak-2022-07-02.sql

2. Restore the early-morning backup directly in the online production database. (Setting sql_log_bin=0 first keeps the restore itself from being written to the Binlog.)
mysql> set sql_log_bin=0;
mysql> source /data/backup/all_db_bak-2022-07-02.sql 

At this point the platform library has been restored from the full backup, but the full backup only contains data from before this morning; the data written between this morning and now has not yet been recovered.

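A quick way to confirm the result at this point is simply to check that the platform library exists again (db_1 is the library that was dropped earlier):

mysql> show databases like 'db_1';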

6.2. Restore the un-backed-up data (early morning to now) from the Binlog

We have restored the platform library from the full backup to its state as of early this morning, but the data written between then and the failure was lost before it could be backed up. Now let's recover that un-backed-up data from the Binlog.

Next we need to extract from the Binlog everything generated between early this morning and the present. The most troublesome part of extracting from the Binlog is finding the start point and the end point. The end point is easy: the platform library is already down, so the last GTID transaction number in the Binlog is the end point.

Because we added the --master-data=2 option when running mysqldump, the backup file records the Binlog state at the moment the backup started: the name of the Binlog file in use, the Position number of the latest event, and the most recent GTID number. With these three pieces of information, finding the start point is no longer difficult, and we know exactly which Binlog file to extract from.

The GTID number is the easiest to work with, so we will extract the Binlog data by GTID number.

1) Determine the range of Binlog log GTID numbers to be intercepted

The following two lines are the Binlog status recorded in the backup file. From them we can tell that the Binlog file in use is mysql-bin.000012, the Position number of the latest event is 14417, and the latest GTID number is 66.

The recorded GTID set has the format e0a2c0cc-f835-11ec-8a3c-005056b791aa:1-66, which tells us that this backup file contains everything generated in the GTID range 1-66.

-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000012', MASTER_LOG_POS=14417;
SET @@GLOBAL.GTID_PURGED='e0a2c0cc-f835-11ec-8a3c-005056b791aa:1-66';

We have just restored the data from the full backup, which means everything in the GTID range 1-66 has been restored, while everything from GTID 67 up to the moment of the crash has not.

Next, let's look at the Binlog event information to confirm that GTID 67 really is new data, and to find the latest GTID number at the time of the crash.

At GTID number 67 we see that a new table was created. That is the new business activity, so this is indeed where the extraction should start; the record in the backup file is accurate.


The start point is GTID 67. For the end point, simply jump to the end of the Binlog: the last GTID number there is the last transaction before the crash, and it is 78.

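If you want to confirm these GTID numbers yourself, SHOW BINLOG EVENTS lists every event in the file (the file name comes from the Binlog status recorded in the backup):

mysql> show binlog events in 'mysql-bin.000012';

In the output, the Gtid rows whose Info column contains :67 (the CREATE TABLE of xscjb) and :78 (the last transaction before the crash) mark the start and end of the range to extract.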

2) Intercept the Binlog logs generated from early morning to the current time

In the previous step we determined that the GTID start point is 67 and the end point is 78. Now let's extract this part of the Binlog. Note the --skip-gtids option: these GTIDs were already executed on this server before the crash, so if the extracted log kept its GTID information, replaying it would be silently skipped as duplicate transactions; stripping the GTID events lets the statements actually run.

[root@mysql ~]# mysqlbinlog --skip-gtids --include-gtids='e0a2c0cc-f835-11ec-8a3c-005056b791aa:67-78' /data/mysql/mysql-bin.000012 > /data/backup/sjbkjd-binlog.sql
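Before replaying, it is worth a quick sanity check on the extracted file. What you see inside depends on the binlog_format in use: DDL such as the CREATE TABLE is always plain SQL, while with ROW format the row changes appear as base64-encoded BINLOG blocks rather than readable statements.

[root@mysql ~]# ls -lh /data/backup/sjbkjd-binlog.sql
[root@mysql ~]# grep -ci 'create table' /data/backup/sjbkjd-binlog.sql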

3) Replay the extracted Binlog to recover the data from early morning up to the crash

mysql> set sql_log_bin=0;
mysql> source /data/backup/sjbkjd-binlog.sql
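One small follow-up not shown in the original steps: sql_log_bin was turned off for this session before the restore, so turn it back on (or simply close the session) before handing the instance back to the application:

mysql> set sql_log_bin=1;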

7. The sixth step: verify the accuracy of the data

The full backup has been restored, and the data from early morning up to the crash has also been recovered from the Binlog. Next, test colleagues check the accuracy of the data.

mysql> select * from db_1.xscjb;
+------+--------+------+------+------+
| xh   | xm     | ywcj | sxcj | yycj |
+------+--------+------+------+------+
|    1 | 小明   |   45 |   75 |   93 |
|    2 | 小红   |   47 |   56 |   25 |
|    3 | 小兰   |   82 |   77 |   89 |
|    5 | 小李   |   93 |   96 |   91 |
|    6 | 小江   |   97 |   67 |   65 |
|    7 | 小王   |  100 |   58 |   99 |
+------+--------+------+------+------+
6 rows in set (0.00 sec)

The data is accurate and all restored successfully.

8. The seventh step: the platform goes back online

At this point all the data has been restored and the platform can be used normally, so the announcement goes out that the service is back online.
