A record of a MySQL master-slave replication delay caused by an MDL lock: Waiting for dependent transaction to commit, Waiting for table metadata lock

Preface

In production, a common setup for better performance is for the master to handle writes while the slave serves reads only, i.e. read-write separation. This article records an unusual master-slave replication delay I ran into in such a setup.

1. Master-slave replication delay

We received a master-slave replication delay alarm.

Alarm content (sensitive information masked):

             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13754354
              Relay_Log_Space: 14690374
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: xxxx
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 53153306
                  Master_UUID: xxxxxx-3c7f-11e8-969a-005056a16d70
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: xxxxxxx-3c7f-11e8-969a-005056a16d70:153908122-204096262
            Executed_Gtid_Set: xxxxxxx-3c7f-11e8-969a-005056a16d70:1-204095398

After receiving the alarm, I immediately logged in to the slave. The output of show slave status there matched the alarm.
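Seconds_Behind_Master only gives a time estimate. Comparing Retrieved_Gtid_Set with Executed_Gtid_Set shows roughly how many transactions are sitting in the relay log waiting to be applied. Below is a minimal Python sketch that assumes the show slave status \G output has already been captured as text. The function names are hypothetical helpers, not part of any MySQL tool. The GTID parsing only handles the simple single-line `uuid:start-end` form seen in the alarm.

```python
def parse_slave_status(text):
    """Turn the 'Key: value' lines of vertical SHOW SLAVE STATUS output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only: GTID set values contain colons too.
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def highest_gtid(gtid_set):
    """Map each source UUID to the highest transaction number in a GTID set.

    Assumes members like 'uuid:start-end[:start-end...]' separated by commas,
    all on one line (MySQL may wrap long multi-UUID sets across lines).
    """
    result = {}
    for member in gtid_set.split(","):
        member = member.strip()
        if not member:
            continue
        uuid, *intervals = member.split(":")
        high = 0
        for interval in intervals:
            start, _, end = interval.partition("-")
            high = max(high, int(end or start))
        result[uuid] = high
    return result


def pending_transactions(status):
    """Approximate count of received-but-not-applied transactions per UUID.

    Assumes the executed set has no gaps below its highest number, which is
    only approximately true while a multi-threaded applier is running.
    """
    retrieved = highest_gtid(status.get("Retrieved_Gtid_Set", ""))
    executed = highest_gtid(status.get("Executed_Gtid_Set", ""))
    return {uuid: high - executed.get(uuid, 0) for uuid, high in retrieved.items()}


if __name__ == "__main__":
    sample = """
          Slave_SQL_Running_State: Waiting for dependent transaction to commit
               Retrieved_Gtid_Set: xxxxxxx-3c7f-11e8-969a-005056a16d70:153908122-204096262
                Executed_Gtid_Set: xxxxxxx-3c7f-11e8-969a-005056a16d70:1-204095398
    """
    status = parse_slave_status(sample)
    print(status["Slave_SQL_Running_State"])  # Waiting for dependent transaction to commit
    print(pending_transactions(status))       # {'xxxxxxx-3c7f-11e8-969a-005056a16d70': 864}
```

For the alarm above, this gives 204096262 - 204095398 = 864 transactions received but not yet applied.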

2. Master-slave replication delay analysis

The alarm clearly shows that replication was lagging. The slave server's CPU usage was also fairly high at the time, so I ran show processlist to see what the slave was executing.


Many select statements had been running for a long time (statements like these are a real trap). I suspected they were the cause, so I looked again and noticed a detail.
One of the threads was the replication applier. It was in the state Waiting for table metadata lock while executing a create index statement, which means an MDL (metadata lock) wait had occurred. The index had been created on the master. When the slave replayed that create index, the long-running select statements blocked it, and this delayed replication. The wait is also consistent with the Waiting for dependent transaction to commit state in the alarm: with multi-threaded replication, transactions that come after the blocked DDL wait for it to commit.

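PS: What is an MDL (metadata lock)?
MySQL 5.5 introduced the MDL (metadata lock) to keep DDL and DML operations consistent. A statement that accesses a table takes a shared metadata lock on it. That lock is held until the transaction ends, or until the statement ends under autocommit. DDL needs an exclusive metadata lock, so it must wait until every shared lock is released. While the DDL waits, later requests for the same table queue behind it. For details, please refer to a blog post I reprinted: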
https://blog.csdn.net/Tah_001/article/details/107931747
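The queueing behaviour explains why one long select can stall the replicated DDL. To illustrate it, here is a deliberately simplified Python model. It is not MySQL's implementation. It has only two modes: "shared" stands for a select, and "exclusive" stands for the moment the DDL needs exclusive access. Requests are granted strictly in arrival order. MySQL's real MDL subsystem has more lock types and its own priority rules. The class name SimpleMDL and the session names are hypothetical.

```python
from collections import deque


class SimpleMDL:
    """Toy metadata-lock queue for a single table (simplified model).

    'shared' locks are compatible with each other; 'exclusive' conflicts with
    everything. Requests are granted in arrival order, so once an exclusive
    request is waiting, later shared requests wait behind it as well.
    """

    def __init__(self):
        self.granted = {}       # session -> mode currently held
        self.waiting = deque()  # (session, mode) pairs in arrival order

    def _compatible(self, mode):
        if mode == "exclusive":
            return not self.granted
        return all(m == "shared" for m in self.granted.values())

    def _grant_waiters(self):
        # Grant only from the head of the queue and stop at the first conflict.
        while self.waiting and self._compatible(self.waiting[0][1]):
            session, mode = self.waiting.popleft()
            self.granted[session] = mode

    def request(self, session, mode):
        """Queue a lock request; return True if it was granted immediately."""
        self.waiting.append((session, mode))
        self._grant_waiters()
        return session in self.granted

    def release(self, session):
        """Release a held lock (statement or transaction ended, or was killed)."""
        self.granted.pop(session, None)
        self._grant_waiters()


if __name__ == "__main__":
    mdl = SimpleMDL()
    print(mdl.request("long_select", "shared"))        # True
    print(mdl.request("replayed_ddl", "exclusive"))    # False: waits for long_select
    print(mdl.request("new_select", "shared"))         # False: queued behind the DDL
    mdl.release("long_select")                         # e.g. the select is killed
    print(mdl.granted)                                 # {'replayed_ddl': 'exclusive'}
```

Ending the long select is what lets the DDL proceed, which matches the fix described below.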

3. Problem resolution

After locating the cause, I checked with the people responsible for the business system and then killed the slow select statements shown above. Once they were gone, their metadata locks were released. The replayed create index could then acquire its lock, and replication returned to normal.
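Choosing which sessions to kill can be partly scripted. This Python sketch assumes the rows of show processlist have already been fetched as dicts with the standard Id, User, Command, Time, State and Info columns. The function names, the table name t_order and the 60-second threshold are all hypothetical, and the table match is a heuristic that only inspects query text. Where the performance_schema.metadata_locks table is available and its MDL instrument is enabled, it shows the actual lock holders and is more reliable. The sketch emits plain KILL rather than KILL QUERY. If a select runs inside an open transaction, its metadata lock lasts until the transaction ends, so ending only the statement would not release it. The statements are only printed for review, never executed.

```python
import re


def find_blocking_selects(processlist, table, min_seconds=60):
    """Return Ids of user SELECTs mentioning `table` that run >= min_seconds.

    Heuristic: a case-insensitive whole-word match on the query text. It can
    miss access through views and can flag queries that only mention the name.
    """
    pattern = re.compile(r"\b" + re.escape(table) + r"\b", re.IGNORECASE)
    blockers = []
    for row in processlist:
        info = (row.get("Info") or "").lstrip()
        if row.get("User") == "system user":
            continue  # replication threads: never kill these
        if row.get("State") == "Waiting for table metadata lock":
            continue  # these sessions are victims of the queue, not holders
        if row.get("Command") != "Query" or not info.lower().startswith("select"):
            continue
        if row.get("Time", 0) >= min_seconds and pattern.search(info):
            blockers.append(row["Id"])
    return blockers


def kill_statements(ids):
    """Build KILL statements for manual review; nothing is executed here."""
    return [f"KILL {i};" for i in ids]


if __name__ == "__main__":
    rows = [  # hypothetical sample rows
        {"Id": 11, "User": "system user", "Command": "Connect", "Time": 900,
         "State": "Waiting for table metadata lock",
         "Info": "create index idx_b on t_order (b)"},
        {"Id": 42, "User": "app", "Command": "Query", "Time": 1800,
         "State": "Sending data", "Info": "SELECT * FROM t_order WHERE c = 1"},
        {"Id": 44, "User": "app", "Command": "Query", "Time": 1200,
         "State": "Sending data", "Info": "select * from t_order_history"},
    ]
    print(kill_statements(find_blocking_selects(rows, "t_order")))  # ['KILL 42;']
```

Always confirm the list with the business side before running anything, as was done here.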

4. Summary

Common causes of master-slave replication delay:

1. High network latency leaves the slave's I/O thread waiting for binlog events.
2. The slave's I/O hardware is weaker than the master's, so its I/O capacity is insufficient.
3. A large transaction executed on the master causes a spike in delay.
4. Multi-threaded (parallel) replication is not enabled on the slave, so applying events with a single SQL thread becomes the bottleneck.
5. Sessions on the slave hold metadata locks that replicated statements must wait for (the case in this article).
6. Updates hit tables without a primary key. With row-based replication, the slave may need a full table scan for each changed row.

Improvements:
1. Identify the business system's slow queries and have them optimized. Read-only nodes must not be used for testing statements.
2. Creating an index is a DDL operation. Use a Percona Toolkit (pt) tool such as pt-online-schema-change, or create it during off-peak business hours.

Corrections and better approaches are welcome!


Source: blog.csdn.net/Tah_001/article/details/107930504