How to quickly repair a single table when bad data breaks MySQL master-slave replication

Scenario description:

The data in table t on the slave is inconsistent with the master, which causes a replication error. The database as a whole is large, so rebuilding the entire slave would be slow. How can this one table be recovered on its own?
It is commonly believed that repairing a single table is impossible, because afterwards the tables would be in inconsistent states relative to each other.
The sections below walk through the problems you run into when restoring a single-table backup to the slave, and how to solve them.

1. Demonstration environment

Hardware: two Dell R620 physical servers on the same intranet
master: 192.168.1.220
slave: 192.168.1.217
OS: CentOS 7.8 x86_64, minimal installation, with iptables and SELinux turned off
Software: MySQL 5.7.27 binary package

Plan:
1. Configure GTID-based MySQL master-slave replication in advance
2. Create simulated test data and simulate a failure scenario
3. Repair MySQL master-slave replication
4. Use pt-table-checksum to verify that the repaired master and slave data are consistent

2. Configure master-slave replication

The MySQL installation process itself is not covered here; look it up if needed.

To configure a new slave for an existing master, remember to add the --set-gtid-purged=ON parameter when taking the mysqldump backup.

Knowledge supplement on --set-gtid-purged (possible values: AUTO, ON, OFF):

1. For routine daily backups, add --set-gtid-purged=OFF; this also suppresses the GTID warning during backup:
[root@localhost ~]# mysqldump -uroot -p'dXdjVF#(y3lt' --set-gtid-purged=OFF --single-transaction -A -B | gzip > 2020-09-17.sql.gz
2. For the backup used to build a master-slave pair, do not use OFF; use --set-gtid-purged=ON instead:
[root@localhost ~]# mysqldump -uroot -p'dXdjVF#(y3lt' --set-gtid-purged=ON --single-transaction -A -B --master-data=2 | gzip > 2020-09-17.sql.gz

In short: keep the GTIDs in the dump (ON) when building replication, and turn them off (OFF) for day-to-day backups.

The specific steps for configuring GTID-based master-slave replication are as follows:

Master library:

 grant replication slave on *.* to rep@'192.168.1.217' identified by 'JuwoSdk21TbUser'; flush privileges;
 mysqldump -uroot -p'dXdjVF#(y3lt' --set-gtid-purged=ON --single-transaction -A -B --master-data=2 |gzip > 2020-09-20.sql.gz 

Slave library operation:

 [root@mysql02 ~]# gzip -d 2020-09-20.sql.gz
 [root@mysql02 ~]# mysql < 2020-09-20.sql
 mysql>  change master to master_host='192.168.1.220',master_user='rep',master_password='JuwoSdk21TbUser',MASTER_AUTO_POSITION = 1;start slave;show slave status\G
ERROR 29 (HY000): File '/data1/mysql/3306/relaylog/relay-bin.index' not found (Errcode: 2 - No such file or directory)
ERROR 29 (HY000): File '/data1/mysql/3306/relaylog/relay-bin.index' not found (Errcode: 2 - No such file or directory)
Empty set (0.00 sec)

The cause: the slave's my.cnf specifies a relay-log storage path that does not exist on the slave server, so the command errors out. Create the directory, change its ownership to the mysql user, and then run CHANGE MASTER again:

mkdir -p /data1/mysql/3306/relaylog/
cd /data1/mysql/3306/
chown -R mysql.mysql relaylog
mysql> change master to master_host='192.168.1.220',master_user='rep',master_password='JuwoSdk21TbUser',MASTER_AUTO_POSITION = 1;start slave;show slave status\G

The master-slave replication configuration is complete.

3. Prepare test data and simulate failure

On the master, create a demo table together with an event (timer) that writes a row into the test table every second. This keeps data changing continuously for the master-slave failure-recovery demonstration that follows:

 CREATE TABLE `test_event` (
   `id` int(8) NOT NULL AUTO_INCREMENT,
   `username` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
   `password` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
   `create_time` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
   PRIMARY KEY (`id`)  # primary key ID
 ) ENGINE=innodb AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Create an event that starts one minute from now and then inserts one row per second:

delimiter $$
create event event_2 
on schedule every 1 second STARTS   CURRENT_TIMESTAMP + INTERVAL 1 MINUTE
COMMENT 'xiaowu create'
do 
    BEGIN
           insert into test_event(username,password,create_time) values("李四","tomcat",now());
    END $$
delimiter ;
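
Events only fire when the event scheduler is enabled, so it is worth checking that on the master first. A quick check and enablement (standard MySQL 5.7 commands; dbtest01 is the database the demo tables live in, as the later steps show):

-- Check whether the scheduler is running; OFF means the event above never fires
SHOW VARIABLES LIKE 'event_scheduler';

-- Enable it at runtime (add event_scheduler=ON to my.cnf to make it permanent)
SET GLOBAL event_scheduler = ON;

-- Confirm the event exists and is enabled
SHOW EVENTS FROM dbtest01;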

In the same way, create another test table, txt, with its own event writing rows regularly (the checksum output later shows this table lives in the test01 database); a sketch follows below.
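
The post does not show txt's definition, so this is only a minimal sketch of what the table and its event might look like; the column layout is an assumption for illustration:

use test01;

CREATE TABLE `txt` (
  `id` int(8) NOT NULL AUTO_INCREMENT,
  `username` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
  `create_time` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=innodb DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

delimiter $$
create event event_txt
on schedule every 1 second STARTS CURRENT_TIMESTAMP + INTERVAL 1 MINUTE
COMMENT 'demo writer for table txt'
do
    BEGIN
        insert into txt(username,create_time) values("张三",now());
    END $$
delimiter ;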

Simulate a failure by running the following directly on the slave:


 insert into test_event(username,password,create_time) values("李四","tomcat",now());
 insert into test_event(username,password,create_time) values("李四","tomcat",now());
 delete from txt where id=200;

Then delete the record with id=200 on the master library.
Master operation: delete from txt where id=200;

At this point, checking the replication status on the slave shows that replication has stopped:

[root@mysql02 ~]#  mysql -e "show slave status\G"|grep -A 1 'Last_SQL_Errno'
               Last_SQL_Errno: 1062
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '8a9fb9a3-f579-11ea-830d-90b11c12779c:42083' at master log mysql-bin.000001, end_log_pos 18053730. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

4. Failure recovery

Scenario 1

A replication error has occurred and has not been worked around by skipping transactions, replication filters, or similar methods. The master keeps taking writes while the slave sits stopped at the error (assume the slave's executed GTID set is 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42083).
Naive repair steps:
back up table test_event on the master (assume the backup snapshot's GTID set is 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262);
restore it to the slave;
start replication.
The problem: replication resumes at 8a9fb9a3-f579-11ea-830d-90b11c12779c:42084, so the slave's copy of test_event is now ahead of every other table.
Among the transactions 8a9fb9a3-f579-11ea-830d-90b11c12779c:42084-42262, any transaction that modifies test_event will break replication again, for example with a primary-key conflict or a missing record (and within 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42083, the transaction that caused the original error must itself have modified the table).
Solution: while replaying 8a9fb9a3-f579-11ea-830d-90b11c12779c:42084-42262, skip only the transactions that modify table test_event.
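
Before starting the repair it helps to confirm, on the slave, exactly which transactions have been applied and which one failed. A quick way to look, using fields available in MySQL 5.7's performance_schema:

-- GTID sets already applied on this slave
SELECT @@GLOBAL.GTID_EXECUTED;

-- The failing transaction and its error, per applier worker
SELECT WORKER_ID, LAST_SEEN_TRANSACTION, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
  FROM performance_schema.replication_applier_status_by_worker\G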

The correct repair steps:

  1. Back up table test_event on the master (the backup snapshot's GTID set is 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262) and restore it to the slave;
  2. Set a replication filter that ignores table test_event:
    CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('dbtest01.test_event');
  3. Start replication and stop it once playback reaches 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262 (at that point all tables on the slave are in the same, consistent state):
    START SLAVE UNTIL SQL_AFTER_GTIDS = '8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262';
  4. Remove the replication filter and resume normal replication.
    Note: the backup is taken with mysqldump --single-transaction --master-data=2 so that the GTID set of the backup snapshot is recorded.

The detailed steps are as follows:

A. On the master, dump the table test_event that caused replication to stop:

[root@localhost ~]# mysqldump -uroot -p'dXdjVF#(y3lt'  --single-transaction dbtest01 test_event --master-data=2 |gzip >$(date +%F).test_event.sql.gz
mysqldump: [Warning] Using a password on the command line interface can be insecure.
Warning: A partial dump from a server that has GTIDs will by default include the GTIDs of all transactions, even those that changed suppressed parts of the database. If you don't want to restore GTIDs, pass --set-gtid-purged=OFF. To make a complete dump, pass --all-databases --triggers --routines --events. 

B. Get the GTID set of the single-table backup snapshot, which here is 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262:

[root@mysql02 ~]# gzip -d 2020-09-17.test_event.sql.gz
[root@mysql02 ~]# grep -A6 'GLOBAL.GTID_PURGED' 2020-09-17.test_event.sql 
SET @@GLOBAL.GTID_PURGED='8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262';
--
-- Position to start replication or point-in-time recovery from
--
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=18130552;

C. Restore this table to the slave. Because GTID_EXECUTED on the slave is not empty, importing table test_event fails with the error below.
On the slave:

[root@mysql02 ~]#  mysql dbtest01 < 2020-09-17.test_event.sql 
ERROR 1840 (HY000) at line 24: @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.

 mysql> select  @@GLOBAL.GTID_EXECUTED;
+----------------------------------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                                                                 |
+----------------------------------------------------------------------------------------+
| 5ec577a4-f401-11ea-bf6d-14187756553d:1-2,
8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42082 |
+----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show master status\G
*************************** 1. row ***************************
             File: mysql-bin.000001
         Position: 368620
     Binlog_Do_DB: 
 Binlog_Ignore_DB: 
Executed_Gtid_Set: 5ec577a4-f401-11ea-bf6d-14187756553d:1-2,
8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42082
1 row in set (0.00 sec)

The fix is to log in to the slave and run:
mysql> reset master;
This clears the GTID_EXECUTED value of the server it is run on (here, the slave).
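
A quick sanity check right after reset master, before re-importing the dump (the value should now come back empty, otherwise the import will hit ERROR 1840 again):

mysql> select @@GLOBAL.GTID_EXECUTED;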

[root@mysql02 ~]#  mysql dbtest01 < 2020-09-17.test_event.sql 

mysql> show master status\G
*************************** 1. row ***************************
             File: mysql-bin.000001
         Position: 154
     Binlog_Do_DB: 
 Binlog_Ignore_DB: 
Executed_Gtid_Set: 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262
1 row in set (0.00 sec)

D. Enable the replication filter online:

mysql> CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('dbtest01.test_event');
Query OK, 0 rows affected (0.00 sec)

[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'dbtest01.test_event'
  Replicate_Wild_Ignore_Table: dbtest01.test_event

E. Start replication and stop it when playback reaches 8a9fb9a3-f579-11ea-830d-90b11c12779c:42262 (at this point the data in all tables on the slave is in the same, consistent state):

mysql> START SLAVE UNTIL SQL_AFTER_GTIDS ='8a9fb9a3-f579-11ea-830d-90b11c12779c:42262';
Query OK, 0 rows affected, 1 warning (0.03 sec)

mysql> 

Although the SQL thread is not running at this point, replication no longer reports an error:

[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'Last_SQL_Error|Slave_IO|Slave_SQL'
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
               Last_SQL_Error: 
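
Before dropping the filter you can double-check that the applier really has reached the target transaction. One option in MySQL 5.7 is WAIT_FOR_EXECUTED_GTID_SET, which blocks until the given GTID set has been applied locally or the timeout (in seconds) expires; at this point it should return immediately:

-- Waits until 1-42262 is in the local executed set, for at most 60 seconds
SELECT WAIT_FOR_EXECUTED_GTID_SET('8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262', 60);

-- Cross-check against the slave's own executed GTID set
SELECT @@GLOBAL.GTID_EXECUTED;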

F. Turn off replication filtering online:

mysql> CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ();
Query OK, 0 rows affected (0.00 sec)

mysql> 
[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'dbtest01.test_event|IO_Running|SQL_Running'
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
      Slave_SQL_Running_State: 

G. Start the slave replication SQL thread:


mysql> start slave sql_thread;
Query OK, 0 rows affected (0.04 sec)

Master-slave replication has recovered:

[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'IO_Running|SQL_Running'
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

Note again: the backup must be taken with mysqldump --single-transaction --master-data=2 so that the GTID set corresponding to the backup snapshot is recorded.

5. Verify master-slave data consistency

Verify with pt-table-checksum. For installation and usage details, see the following post:
https://blog.51cto.com/wujianwei/2409523


[root@localhost bin]# time /usr/local/percona-toolkit/bin/pt-table-checksum h=192.168.1.220,u=ptsum,p='ptchecksums',P=3306 --ignore-databases sys,mysql  --truncate-replicate-table  --replicate=percona.ptchecksums --no-check-binlog-format --nocheck-replication-filters --recursion-method="processlist"   2>&1 | tee 2020-09-18-pt-checksum.log

Checking if all tables can be checksummed ...
Starting checksum ...
            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
09-18T07:49:09      0      0     9739          0       4       0   0.747 dbtest01.hlz_ad
09-18T07:49:10      0      0    64143          0       4       0   0.968 dbtest01.hlz_ad_step
09-18T07:49:16      0      0   741424          0      10       0   6.014 dbtest01.hlz_bubble
09-18T07:49:18      0      0   499991          0       5       0   1.610 dbtest01.test01
09-18T07:49:25      0      0  3532986          0      13       0   7.802 dbtest01.test02
09-18T07:49:26      0      0   126863          0       1       0   0.976 dbtest01.test_event
09-18T07:49:27      0      1    30294          0       1       0   0.582 test01.txt

real    1m22.725s
user    0m0.387s
sys 0m0.078s

The output shows that table test01.txt on the master is inconsistent with test01.txt on the slave (DIFFS = 1).

The reason: during the failure simulation, delete from txt where id=200; was executed directly on the slave, leaving the slave's txt table one record short of the master's txt table.

Repair data:

[root@localhost bin]# /usr/local/percona-toolkit/bin/pt-table-sync h=192.168.1.220,u=ptsum,p=ptchecksums,P=3306 --databases=test01 --tables=test01.txt  --replicate=percona.ptchecksums  --charset=utf8  --transaction --execute
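
If you prefer to review the changes before applying them, the same pt-table-sync invocation with --print instead of --execute only prints the statements it would run (both are standard pt-table-sync options):

# Dry run: show the sync statements without changing any data
/usr/local/percona-toolkit/bin/pt-table-sync h=192.168.1.220,u=ptsum,p=ptchecksums,P=3306 --databases=test01 --tables=test01.txt --replicate=percona.ptchecksums --charset=utf8 --transaction --print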

Check again; the data is now consistent:


[root@localhost bin]# time /usr/local/percona-toolkit/bin/pt-table-checksum h=192.168.1.220,u=ptsum,p='ptchecksums',P=3306 --ignore-databases sys,mysql  --truncate-replicate-table  --replicate=percona.ptchecksums --no-check-binlog-format --nocheck-replication-filters --recursion-method="processlist"   2>&1 | tee 2020-09-18-pt-checksum.log
Checking if all tables can be checksummed ...
Starting checksum ...
            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
09-18T09:48:10      0      0     9739          0       4       0   0.784 dbtest01.hlz_ad
09-18T09:48:11      0      0    64143          0       4       0   0.995 dbtest01.hlz_ad_step
09-18T09:48:16      0      0   741424          0       9       0   4.224 dbtest01.hlz_bubble
09-18T09:48:17      0      0   499991          0       5       0   1.470 dbtest01.test01
09-18T09:48:24      0      0  3532986          0      13       0   6.403 dbtest01.test02
09-18T09:48:24      0      0   133999          0       1       0   0.894 dbtest01.test_event
09-18T09:48:25      0      0    37431          0       1       0   0.511 test01.txt

real    0m15.676s
user    0m0.359s
sys 0m0.055s

6. An additional test case shared by netizens

Scenario 2 is attached below; it is essentially the same as Scenario 1, so it is not demonstrated in full here and is left to interested readers.
The scenario and the recovery method, briefly:

A replication error occurred and was worked around with methods such as skipping transactions or replication filtering, so both the master and the slave keep taking updates.
Incorrect repair steps:
back up table t on the master (assume the backup snapshot's GTID set is aaaa:1-10000);
stop replication on the slave; the slave's GTID set is aaaa:1-20000;
restore table t to the slave;
start replication.

Reason analysis:
Replication resumes at aaaa:20001, so transactions aaaa:10001-20000 are never replayed on the slave. If any of them modified table t, the slave loses that part of the data, because the restored copy of t only reflects the master's state as of aaaa:10000.

Solution: from the start of the backup until replication is restarted, lock table t so that no transaction in aaaa:10001-20000 modifies it.

Correct repair steps (a sketch follows the list):

1. Take a read lock on table t on the master;
2. Back up table t on the master;
3. Stop replication on the slave and restore table t there;
4. Start replication;
5. Release the lock on table t.
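
A minimal sketch of that sequence, assuming the table is test01.txt from the demo above and reusing the dump/restore style of the earlier steps (--set-gtid-purged=OFF is used because in this scenario the slave keeps its own non-empty GTID state, which would otherwise trigger the ERROR 1840 shown earlier):

-- Session 1 on the master: lock the table and keep this session open until the end
LOCK TABLES test01.txt READ;

-- Session 2 on the master (shell): dump only this table
-- mysqldump -uroot -p'dXdjVF#(y3lt' --single-transaction --set-gtid-purged=OFF test01 txt | gzip > txt.sql.gz

-- On the slave: stop replication and restore the table
STOP SLAVE;
-- (shell) gzip -d txt.sql.gz && mysql test01 < txt.sql

-- On the slave: resume replication
START SLAVE;

-- Session 1 on the master: release the lock only after replication is running again
UNLOCK TABLES;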

Suggested alternative: if you do not want to lock table t at all, simply stop replication on the slave first and then recover using the Scenario 1 procedure; that avoids locking the table.

Origin: https://blog.51cto.com/wujianwei/2535067