Common master-slave replication error handling cases

1. Principle of master-slave replication

For MySQL master-slave replication, we first need to know the following points:

1. Necessary conditions for enabling master-slave replication

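1) The master has log-bin enabled. The binlog records every change made on the master and is the core log of master-slave replication.
2) The master and the slave have different server-id values. server-id uniquely identifies a database server; if the master and the slave share the same server-id, the replication relationship cannot be established.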

2. The general process of master-slave replication

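1) When two databases are set up as master and slave, the slave starts an IO thread and an SQL thread, and the master starts a corresponding dump thread.
2) Every change made on the master is recorded in the binlog.
3) When the master changes, its dump thread notifies the slave's IO thread. Based on the binlog file and position, the IO thread writes the corresponding events into the slave's relay log.
4) The slave's SQL thread parses the relay log and replays it, which keeps the slave's data consistent with the master's.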

2. Basic investigation methods

1. Replication status information

mysql> show slave status\G

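Fields to focus on:

Master_Log_File: the master binlog file the IO thread has read up to
Read_Master_Log_Pos: the position in that binlog file the IO thread has read up to
Relay_Log_File: the relay log file the SQL thread is currently applying
Relay_Log_Pos: the position in that relay log file the SQL thread has applied up to
Relay_Master_Log_File: the master binlog file that contains the event the SQL thread is currently applying
Exec_Master_Log_Pos: the position in that master binlog file of the event the SQL thread is currently applying
Slave_IO_Running: IO thread status
Slave_SQL_Running: SQL thread status
Last_SQL_Errno: error code of the error that interrupted replication
Last_SQL_Error: detailed message of the replication error
Seconds_Behind_Master: a rough indication of replication lag
Master_UUID: the master's UUID
Retrieved_Gtid_Set: the GTID set the IO thread has received so far
Executed_Gtid_Set: the GTID set the SQL thread has applied so far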
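When the same fields have to be checked on many slaves, it can help to turn the text printed by show slave status\G into a dictionary first. The following is only a minimal sketch in Python: the function names parse_slave_status and replication_problems are our own, and the parser is a heuristic that treats any line of the form "Name: value" as a field and everything else as a continuation of the previous field.

```python
import re

# A field line looks like "   Slave_IO_Running: Yes" (hypothetical helper, not a MySQL API)
FIELD = re.compile(r"^\s*([A-Za-z_]+):\s?(.*)$")


def parse_slave_status(text):
    """Turn the text printed by SHOW SLAVE STATUS (vertical format) into a dict of strings."""
    status, last_key = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if (not stripped or stripped.startswith("***")
                or re.match(r"\d+ rows? in set", stripped)):
            continue  # blank line, row separator, or result footer
        match = FIELD.match(line)
        if match:
            last_key = match.group(1)
            status[last_key] = match.group(2).strip()
        elif last_key is not None:
            # Continuation line, e.g. the second UUID of a multi-line GTID set
            status[last_key] += "\n" + stripped
    return status


def replication_problems(status):
    """List the obvious problems: stopped threads and a non-zero SQL error code."""
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    if status.get("Last_SQL_Errno", "0") != "0":
        problems.append("SQL error %s" % status["Last_SQL_Errno"])
    return problems
```

An empty list from replication_problems only means that both threads report Yes and no SQL error is recorded; it says nothing about lag, so Seconds_Behind_Master still has to be checked separately.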

2. Database error log

mysql> show variables like 'log_error';
+---------------+------------------------------+
| Variable_name | Value                        |
+---------------+------------------------------+
| log_error     | /data/mysql57/logs/error.log |            // path of the error log, looked up from inside the database
+---------------+------------------------------+
1 row in set (0.00 sec)

3. System log

# dmesg -T

3. Common error cases

3.1 Last_Errno - 1032

1. Specific error

root@mysql57 17:17:  [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: binlog.000006
          Read_Master_Log_Pos: 230
               Relay_Log_File: relaylog.000002
                Relay_Log_Pos: 2024
        Relay_Master_Log_File: binlog.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 1905
              Relay_Log_Space: 3518
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table db1.t1; Unknown error 1032, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000005, end_log_pos 2126
             Master_Server_Id: 33061
                  Master_UUID: 1d24c492-83eb-11ea-86cd-000c2913f5b2
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 1d24c492-83eb-11ea-86cd-000c2913f5b2:5-12
            Executed_Gtid_Set: 1d24c492-83eb-11ea-86cd-000c2913f5b2:1-11,
346cd00d-8ea9-41b7-8fea-67a6a483470f:1-24,
66c8918d-83eb-11ea-9d7d-000c29242ae2:1-2
                Auto_Position: 1
1 row in set (0.00 sec)

2. Troubleshooting

Last_SQL_Errno and Last_SQL_Error show that replication stopped because an UPDATE could not be applied. We first need to find the exact UPDATE statement, using the binlog file and position reported in the replication status and the error message, and then check in the database why applying it failed.

1) Locate the problem SQL

Based on the information in the replication error, there are two ways to locate the problem SQL. The first is to look in the master's binlog directory, using Relay_Master_Log_File, Exec_Master_Log_Pos, and the end_log_pos given in Last_SQL_Error. The second is to look in the slave's relay log directory, using Relay_Log_File and Relay_Log_Pos.

# Find the problem SQL on the master
# /usr/local/mysql57/bin/mysqlbinlog -vv binlog.000005 --start-position=1905 --stop-position=2126
### UPDATE `db1`.`t1`
### WHERE
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2=NULL /* VARSTRING(80) meta=80 nullable=1 is_null=1 */
###   @3=NULL /* INT meta=0 nullable=1 is_null=1 */
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='cc' /* VARSTRING(80) meta=80 nullable=1 is_null=0 */
###   @3=23 /* INT meta=0 nullable=1 is_null=0 */

2) Check the data of the table named in the error on the slave. The row targeted by the UPDATE does not exist on the slave at all, which is why the update fails. Errors of this kind are generally caused by master-slave data inconsistency: the master's binlog event reaches the slave, but no matching row is found when the slave applies it.

root@mysql57 23:00:  [db1]> select * from `db1`.`t1`;
+----+------+------+
| id | name | age  |
+----+------+------+
|  2 | aa   |   12 |
|  3 | bb   |   14 |
|  4 | aa   |   12 |
+----+------+------+
3 rows in set (0.00 sec)
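The master-side lookup in step 1) takes its start position from Exec_Master_Log_Pos and its stop position from the end_log_pos inside Last_SQL_Error. As a rough sketch, assuming the error text has the same shape as in this case, the mysqlbinlog argument list can be assembled like this (mysqlbinlog_args is a hypothetical helper name):

```python
import re


def mysqlbinlog_args(relay_master_log_file, exec_master_log_pos, last_sql_error):
    """Build the mysqlbinlog argument list used to print the failing event.

    The start position is Exec_Master_Log_Pos; the stop position is the
    end_log_pos found in Last_SQL_Error, when the message contains one.
    """
    args = [
        "mysqlbinlog",
        "-vv",
        relay_master_log_file,
        "--start-position=%d" % int(exec_master_log_pos),
    ]
    match = re.search(r"end_log_pos (\d+)", last_sql_error)
    if match:
        args.append("--stop-position=%s" % match.group(1))
    return args
```

The list can be handed to subprocess.run on the master, from the directory that holds the binlog files. If the error message carries no end_log_pos, the sketch simply omits --stop-position, and the output then runs to the end of the file.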

3. Problem solving

When replication is interrupted by inconsistent master-slave data, we decide from the specific error whether skipping the failed transaction is enough, or whether the affected rows must also be corrected by hand afterwards. In this case we need to skip the failed transaction and then manually correct the row, so that the master and the slave are consistent again.

mysql> stop slave;
mysql> set gtid_next = '1d24c492-83eb-11ea-86cd-000c2913f5b2:12';
mysql> begin;commit;
mysql> set gtid_next = 'AUTOMATIC';

# Restart replication and check its status
mysql> start slave;
mysql> show slave status\G


# Correct the data on the slave (sql_log_bin=0 keeps the fix out of the slave's own binlog)
mysql> set sql_log_bin=0;
mysql> insert into db1.t1 values(1,'cc',23);
mysql> set sql_log_bin=1;
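The GTID given to gtid_next in the skip step is the first transaction from the master's UUID that is not yet in Executed_Gtid_Set (1-11 executed, so 12 is skipped). The sketch below derives it in Python. It assumes a single-threaded SQL thread, so that the failed transaction really is the first missing one; the helper names parse_gtid_set and next_gtid_to_skip are our own.

```python
def parse_gtid_set(gtid_set):
    """Return {uuid: [(start, end), ...]} for a GTID set string."""
    result = {}
    for part in gtid_set.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = []
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))  # "7" means 7-7
        result[uuid] = ranges
    return result


def next_gtid_to_skip(master_uuid, executed_gtid_set):
    """First transaction number of master_uuid missing from the executed set.

    Assumption: the SQL thread applies transactions in order, so the first
    gap is the transaction that failed.
    """
    ranges = sorted(parse_gtid_set(executed_gtid_set).get(master_uuid, []))
    expected = 1
    for start, end in ranges:
        if start > expected:
            break  # found a gap before this interval
        expected = max(expected, end + 1)
    return "%s:%d" % (master_uuid, expected)
```

It is still worth comparing the result with Retrieved_Gtid_Set and the failing event in the binlog before skipping anything, because a skipped transaction is lost on the slave.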

3.2 Last_Errno - 1396

1. Specific error

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: mysql-bin.000055
          Read_Master_Log_Pos: 541873929
               Relay_Log_File: relaylog.000177
                Relay_Log_Pos: 719899849
        Relay_Master_Log_File: mysql-bin.000053
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 719899639
              Relay_Log_Space: 45383924486
               Last_SQL_Errno: 1396
               Last_SQL_Error: Error 'Operation CREATE USER failed for 'brc_acrm'@'%'' on query. Default database: ''. Query: 'CREATE USER 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7''
                  Master_UUID: 42a444a1-c303-11e8-952b-005056901fca
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 42a444a1-c303-11e8-952b-005056901fca:1-177094
            Executed_Gtid_Set: 42a444a1-c303-11e8-952b-005056901fca:1-172612,
d8ea4112-c303-11e8-952f-00505690f262:1-12
                Auto_Position: 1
1 row in set (0.00 sec)

2. Troubleshooting

Last_SQL_Errno and Last_SQL_Error show that the failed operation is a CREATE USER statement. Given that, we look in the mysql.user table for clues.

The query result below shows that the master's binlog delivered a CREATE USER operation for brc_acrm@%, but that user already existed on the slave, so CREATE USER failed. This pins down the cause of the replication interruption.

mysql> select user,host from mysql.user where user='brc_acrm';
+----------+-----------+
| user     | host      |
+----------+-----------+
| brc_acrm | %         |
+----------+-----------+
1 row in set (0.00 sec)

3. Problem solving

1) Comparing the user's grants on the master and on the slave shows that they are identical. Our guess is that a developer used a privileged (SUPER) account to create the user by hand on both the master and the slave.

Master:
mysql> show grants for 'brc_acrm'@'%';
+---------------------------------------------------------------------------------------------------------+
| Grants for brc_acrm@%                                                                                   |
+---------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7' |
+---------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)

Slave:
mysql> show grants for 'brc_acrm'@'%';
+---------------------------------------------------------------------------------------------------------+
| Grants for brc_acrm@%                                                                                   |
+---------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7' |
+---------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

2) Given this, we can skip the current GTID transaction to resume replication. The steps are:

# Stop replication
mysql> stop slave;

# Skip the failed transaction by committing an empty transaction under its GTID
mysql> set gtid_next = '42a444a1-c303-11e8-952b-005056901fca:172613';
mysql> begin;commit;
mysql> set gtid_next = 'AUTOMATIC';

# Restart replication and check its status
mysql> start slave;
mysql> show slave status\G

3.3 Last_Errno - 1418

1. Specific error

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
              Master_Log_File: mysql-bin.000068
          Read_Master_Log_Pos: 230
               Relay_Log_File: relay-log.000178
                Relay_Log_Pos: 16477419
        Relay_Master_Log_File: mysql-bin.000064
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 240368941
              Relay_Log_Space: 18473673            
               Last_SQL_Errno: 1418
               Last_SQL_Error: Error 'This function has none of DETERMINISTIC, NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want to use the less safe log_bin_trust_function_creators variable)' on query. Default database: 'electric'. Query: 'CREATE DEFINER=`ydan-user`@`%` FUNCTION `GetAccess`(id varchar(50)) RETURNS varchar(50) CHARSET utf8
BEGIN
	declare num int;
  declare result varchar(10);
  select count(*) into num from ou_userrole where userid=id and ROLEID like 'leader%'  ;
  if num>0 then set result='access';
	else set result ='no';
  end if;
  return result;
END'
             Master_Server_Id: 85857738
                  Master_UUID: d3c62cfa-7e24-11ea-8e2e-faf8ce492e00
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: d3c62cfa-7e24-11ea-8e2e-faf8ce492e00:1-1982
            Executed_Gtid_Set: d3bf96b2-7e24-11ea-8fb1-fa38f0896800:1-10708449,
d3c62cfa-7e24-11ea-8e2e-faf8ce492e00:1-688
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

2. Troubleshooting

Last_SQL_Errno and Last_SQL_Error show that replication stopped on a CREATE FUNCTION statement. Because log-bin is enabled on both the master and the slave, a stored function whose behavior is not declared is a potential risk to data consistency. When it is unclear whether a function can affect data consistency, MySQL blocks its creation by default.

mysql> show variables like 'log_bin_trust_function_creators';
+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| log_bin_trust_function_creators | OFF   |         // OFF by default: a slave with log-bin enabled refuses to create functions that do not declare one of the characteristics below
+---------------------------------+-------+
1 row in set (0.01 sec)

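# Characteristics that can be declared when creating a function
DETERMINISTIC       // always returns the same result for the same input parameters
NO SQL              // contains no SQL statements, so it cannot affect data consistency
READS SQL DATA      // only reads data, so it cannot affect data consistency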
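To spot such statements before they reach a slave, a rough text check on the CREATE FUNCTION header is often enough. The following is only a heuristic sketch in Python, not how MySQL itself decides. It looks at the text before the first BEGIN and ignores NOT DETERMINISTIC, which does not satisfy the requirement; declares_safe_characteristic is a hypothetical helper name.

```python
import re

# The three characteristics named in the 1418 error message
SAFE = re.compile(r"\b(DETERMINISTIC|NO\s+SQL|READS\s+SQL\s+DATA)\b", re.IGNORECASE)


def declares_safe_characteristic(create_function_sql):
    """Heuristic: does the function header declare DETERMINISTIC, NO SQL or READS SQL DATA?"""
    # Only the header counts, i.e. everything before the first BEGIN
    header = re.split(r"\bBEGIN\b", create_function_sql, maxsplit=1, flags=re.IGNORECASE)[0]
    # NOT DETERMINISTIC is the default and does not satisfy the check
    header = re.sub(r"\bNOT\s+DETERMINISTIC\b", "", header, flags=re.IGNORECASE)
    return SAFE.search(header) is not None
```

For the GetAccess function in the error above the check returns False, which matches the 1418 error: the header declares none of the three characteristics. For a function without a BEGIN ... END body the whole statement is scanned, so the result is less reliable there.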

3. Problem solving

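mysql> stop slave;
mysql> set sql_log_bin=0;
mysql> set global log_bin_trust_function_creators=TRUE;
mysql> (manually execute the CREATE FUNCTION statement shown in Last_SQL_Error)
mysql> set global log_bin_trust_function_creators=0;
mysql> set sql_log_bin=1;
# Replication is still stopped at this point: skip the failed GTID transaction as in 3.1, then run start slave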

3.4 Last_Errno - 1594

1. Specific error

1) MySQL replication information

mysql> show slave status\G
*************************** 1. row ***************************
              Master_Log_File: mysql-bin.000400
          Read_Master_Log_Pos: 363471322
               Relay_Log_File: relay-log.000236
                Relay_Log_Pos: 197156629
        Relay_Master_Log_File: mysql-bin.000397
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 197156408
              Relay_Log_Space: 1936343145
               Last_SQL_Errno: 1594
               Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 19670500
                  Master_UUID: 5d4178f0-d6eb-11e9-bf91-0242f977238f
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 5d4178f0-d6eb-11e9-bf91-0242f977238f:398658257-399871744:399871746-496665236,
5e07d619-d6eb-11e9-beff-0242e77c7e67:145857044-146322581
            Executed_Gtid_Set: 012ef0fc-4ae3-11e9-8406-28a6db6245e4:1-31634397,
5d4178f0-d6eb-11e9-bf91-0242f977238f:1-491760413,
5e07d619-d6eb-11e9-beff-0242e77c7e67:1-146297997,
e807c5a5-4ae2-11e9-845f-2c55d3e93d14:1-50267012
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

2) Error log

2020-07-27T02:56:05.036042Z 2 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 100, event_type: 31
2020-07-27T02:56:05.036069Z 2 [ERROR] Error reading relay log event for channel '': slave SQL thread aborted because of I/O error
2020-07-27T02:56:05.036096Z 2 [ERROR] Slave SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (y
ou can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in
 the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error
_code: 1594
2020-07-27T02:56:05.036109Z 2 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000397' 
position 197156408.
2020-07-29T15:43:29.114678Z 74970 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using 
the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-07-29T15:43:29.686895Z 74970 [ERROR] mysqld: Binary logging not possible. Message: Either disk is full or file system is read only while rotating the binlog. Aborting the server.
15:43:29 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

2. Troubleshooting

1) The replication status shows that replication stopped because the SQL thread could not parse the relay log, so the relay log may be corrupted. In this case we can use the mysqlbinlog tool, together with the error position from the replication status, to analyze the relay log and confirm the error.

2) Continue with the error log. It indicates that the relay log was probably damaged by an I/O error.

3) Typical causes of relay log corruption include a database restart, network jitter, and running out of disk space so that the log cannot be written.

3. Problem solving

When the relay log cannot be parsed, we need to point replication at the master again with CHANGE MASTER TO. In non-GTID mode, run CHANGE MASTER TO with the binlog file and position given in the error log. In GTID mode, specifying MASTER_AUTO_POSITION=1 is enough. The steps are:

mysql> stop slave;
mysql> CHANGE MASTER TO
  MASTER_HOST='172.26.44.1',
  MASTER_USER='rds_repl',
  MASTER_PASSWORD='xxx',
  MASTER_PORT=3044,
  MASTER_AUTO_POSITION=1;
mysql> start slave;
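The two variants differ only in the last options of CHANGE MASTER TO. As an illustration, the sketch below reads the stop position from the "We stopped at log ..." line of the error log and builds the statement for either mode. The helper names stopped_at and change_master_sql are our own, and the values are pasted into the statement without any escaping, so this is for trusted input only.

```python
import re


def stopped_at(error_log_text):
    """Return (binlog_file, position) from the 'We stopped at log ...' message, or None."""
    match = re.search(r"We stopped at log '([^']+)'\s+position (\d+)", error_log_text)
    return (match.group(1), int(match.group(2))) if match else None


def change_master_sql(host, port, user, password, use_gtid, log_file=None, log_pos=None):
    """Build a CHANGE MASTER TO statement for GTID or file/position based replication."""
    options = [
        "MASTER_HOST='%s'" % host,
        "MASTER_USER='%s'" % user,
        "MASTER_PASSWORD='%s'" % password,
        "MASTER_PORT=%d" % int(port),
    ]
    if use_gtid:
        # GTID mode: the slave works out by itself which transactions are missing
        options.append("MASTER_AUTO_POSITION=1")
    else:
        if log_file is None or log_pos is None:
            raise ValueError("non-GTID mode needs a binlog file and position")
        options.append("MASTER_LOG_FILE='%s'" % log_file)
        options.append("MASTER_LOG_POS=%d" % int(log_pos))
    return "CHANGE MASTER TO\n  " + ",\n  ".join(options) + ";"
```

In the non-GTID case the file and position from the error log are the same values that show slave status reports as Relay_Master_Log_File and Exec_Master_Log_Pos, that is, the point up to which the SQL thread had applied the master's binlog.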

3.5 Last_Errno - 1872

1. Specific error

1) MySQL master-slave replication error

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
              Master_Log_File: mysql-bin.000402
          Read_Master_Log_Pos: 311570312
               Relay_Log_File: relay-log.000124
                Relay_Log_Pos: 311570536
        Relay_Master_Log_File: mysql-bin.000402
             Slave_IO_Running: No
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 311570251
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1872
               Last_SQL_Error: Slave failed to initialize relay log info structure from the repository
                  Master_UUID: 5d4178f0-d6eb-11e9-bf91-0242f977238f
             Master_Info_File: mysql.slave_master_info
           Retrieved_Gtid_Set: 5d4178f0-d6eb-11e9-bf91-0242f977238f:448797409-499521040,
5e07d619-d6eb-11e9-beff-0242e77c7e67:146080492-146337012
            Executed_Gtid_Set: 012ef0fc-4ae3-11e9-8406-28a6db6245e4:1-31634397,
5d4178f0-d6eb-11e9-bf91-0242f977238f:1-499521040,
5e07d619-d6eb-11e9-beff-0242e77c7e67:1-146337012,
d00323ab-8ce0-11e9-99bc-0242065ece34:1-32,
e807c5a5-4ae2-11e9-845f-2c55d3e93d14:1-50267012
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

2) Error log

2020-07-31T03:51:03.357467Z 0 [ERROR] Failed to open the relay log '/data/mysql/binlog/relay-log.000124' (relay_log_pos 311570536).
2020-07-31T03:51:03.357484Z 0 [ERROR] Could not find target log file mentioned in relay log info in the index file '/data/mysql/binlog/relay-log.index' during relay log initialization.
2020-07-31T03:51:03.409170Z 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2020-07-31T03:51:03.409211Z 0 [ERROR] Failed to create or recover replication info repositories.
2020-07-31T03:51:03.409247Z 0 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T03:51:03.409259Z 0 [ERROR] mysqld: Slave failed to initialize relay log info structure from the repository
2020-07-31T03:51:03.409267Z 0 [ERROR] Failed to start slave threads for channel ''
2020-07-31T04:55:43.905528Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T04:55:50.286848Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T04:57:22.006869Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872

3) System log

Jul 31 11:51:31 dbinstance3 dockerd: time="2020-07-31T11:51:31.757931233+08:00" level=info msg="Container 53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681 failed to exit within 10 seconds of signal 15 - using the force"
Jul 31 11:51:34 dbinstance3 containerd: time="2020-07-31T11:51:34.110864629+08:00" level=info msg="shim reaped" id=53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681
Jul 31 11:51:34 dbinstance3 dockerd: time="2020-07-31T11:51:34.120638146+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 31 11:51:34 dbinstance3 containerd: time="2020-07-31T11:51:34.464111797+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681/shim.sock" debug=false pid=23184

2. Troubleshooting

1) Last_SQL_Error and the error log show that replication stopped because relay-log.000124 could not be opened, and that the subsequent attempt to reinitialize the relay log from relay-log.index failed as well.

2) Following that error, we go into the database's datadir and check the relay-log.index file. The relay log numbering in the file has restarted from 1, and relay-log.000124 is missing.

3) In the system log we found a likely related message: the Docker container failed to exit (the database runs in a Docker container). Based on this, we checked the container start time of the database instance with docker ps and found that it coincides with the failure time. However, Docker did not output any further logs, so it is hard to investigate what caused the relay log to be lost.
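The check in step 2) boils down to asking whether the relay log named in the error log is still listed in relay-log.index, which holds one relay log path per line. A small sketch of that comparison in Python (relay_log_in_index is a hypothetical helper; it compares file names only and ignores directories):

```python
import os


def relay_log_in_index(index_text, relay_log_path):
    """True if the relay log's file name appears in the contents of the index file."""
    wanted = os.path.basename(relay_log_path)
    listed = [
        os.path.basename(line.strip())
        for line in index_text.splitlines()
        if line.strip()
    ]
    return wanted in listed
```

Here index_text is the content of /data/mysql/binlog/relay-log.index read as text, and relay_log_path is the file from the "Failed to open the relay log" message. A result of False corresponds to what we saw in this case: the index no longer lists the file the slave expects.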

3. Problem solving

When the relay log is lost entirely and relay log initialization fails, we again need to point replication at the master with CHANGE MASTER TO. In non-GTID mode, specify the binlog file and position; in GTID mode, specify MASTER_AUTO_POSITION=1.

mysql> reset slave;
mysql> CHANGE MASTER TO
  MASTER_HOST='172.26.44.1',
  MASTER_USER='rds_repl',
  MASTER_PASSWORD='xxx',
  MASTER_PORT=3044,
  MASTER_AUTO_POSITION=1;
mysql> start slave;

Origin blog.csdn.net/weixin_37692493/article/details/107827003