Fault analysis | A regular jump in MySQL master-slave replication delay

Author: Li Bin

A member of the Aikesheng DBA team, responsible for day-to-day issue handling on projects and for troubleshooting the company's platform. Has rather a lot of hobbies: guitar, travel, playing games...

Source of this article: original contribution

*Produced by the Aikesheng open source community. Original content may not be used without authorization; to reprint it, please contact the editor and credit the source.


As a DBA, you have almost certainly dealt with master-slave replication delay. I recently ran into an interesting delay problem in production and would like to share it here.

First, look at the monitoring graph of MySQL replication delay:

You may have noticed that this curve differs from a typical delay curve in two ways:

1. The delay rises and falls almost vertically: it changes abruptly rather than growing gradually.

2. The peak of the delay curve is the same every time.

From the monitoring configuration we know that the delay check reads the value of Seconds_Behind_Master. Since the changes are so regular, we can use a simple loop to capture Seconds_Behind_Master and see whether it matches the monitored curve.

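# Judging from the curve, the delay changes several times within five minutes,
# so sample Seconds_Behind_Master once per second for 5 minutes (300 samples).
for i in {1..300}
do
    echo "$(date) $(mysql -S /data/mysql/3306/data/mysqld.sock -uxxx -pxxx -e 'show slave status\G' | grep Seconds_Behind_Master)" >> /tmp/second.log
    sleep 1
done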

Checking /tmp/second.log shows that the value of Seconds_Behind_Master does indeed jump repeatedly between 0 and 71.
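To confirm the pattern without reading 300 lines by eye, the log can be reduced to a count per distinct value. The following is a minimal sketch; the function name `summarize_sbm` is made up for illustration, and it assumes every log line ends with `Seconds_Behind_Master: <value>` as produced by the loop above.

```shell
# Print one line per distinct Seconds_Behind_Master value found in the log,
# followed by how many times it was sampled, sorted numerically by value.
# Assumption: the value is the last whitespace-separated field of each line.
summarize_sbm() {
    local logfile="$1"
    awk '/Seconds_Behind_Master:/ { count[$NF]++ }
         END { for (v in count) print v, count[v] }' "$logfile" | sort -n
}

# Usage (not executed here):
#   summarize_sbm /tmp/second.log
```

If the delay really only flips between two states, the output should contain very few distinct values rather than a smooth range of them.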

Why does this happen? Judging from experience, a delay like this is very unlikely to be caused by load on the database, because the curve is too regular. Looking at it from other angles, we finally caught a key piece of information while comparing the clocks of the master and slave servers: the slave's clock differs from the master's by roughly 71 seconds, which matches the peak value of 71 in the Seconds_Behind_Master jumps.
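One way to compare the two clocks is to read the epoch time on each server and subtract. This is only a sketch: the function name `clock_offset` and the host names are hypothetical, passwordless ssh to both servers is assumed, and because the two reads happen one after the other, the result can be off by about a second.

```shell
# Print (slave clock - master clock) in whole seconds.
# Arguments: $1 = master host, $2 = slave host (both hypothetical names).
# Assumption: `ssh <host> date +%s` works without a password prompt.
clock_offset() {
    local master_host="$1" slave_host="$2"
    local master_epoch slave_epoch
    master_epoch=$(ssh "$master_host" date +%s) || return 1
    slave_epoch=$(ssh "$slave_host" date +%s) || return 1
    echo $(( slave_epoch - master_epoch ))
}

# Usage (not executed here):
#   clock_offset master-db slave-db
```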

Some may ask: doesn't Seconds_Behind_Master automatically subtract the clock difference when it is calculated? It does. According to the official documentation, the calculation works even when the master and slave clocks differ, but with one very important premise: the difference, which is computed when the IO thread starts, must "not change" from then on.

Therefore, one possible cause of a large delay jump is that, after the IO thread started, the slave's clock was corrected by NTP or some other means. The stored difference is then stale, and Seconds_Behind_Master is calculated incorrectly.
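The arithmetic can be illustrated with a simplified model. It follows the description in the MySQL manual rather than the server's actual code, and the function and argument names are illustrative only. It assumes the slave is applying an event at the moment of measurement; when the slave has caught up and is idle, it reports 0, which would explain the jumps back to 0.

```shell
# Simplified model of Seconds_Behind_Master while an event is being applied.
# All arguments are epoch seconds:
#   $1 slave_now  : current time on the slave
#   $2 event_ts   : timestamp the master wrote into the event
#   $3 clock_diff : (slave clock - master clock), measured once when the
#                   IO thread started and never refreshed afterwards
seconds_behind_master() {
    local slave_now="$1" event_ts="$2" clock_diff="$3"
    local lag=$(( slave_now - event_ts - clock_diff ))
    # In this model a negative result is reported as 0.
    if [ "$lag" -lt 0 ]; then
        lag=0
    fi
    echo "$lag"
}

# Clocks agreed when the IO thread started (clock_diff = 0), then the slave's
# clock moved 71 s ahead. An event stamped 1000 by the master and applied
# immediately is seen at slave time 1071:
seconds_behind_master 1071 1000 0    # prints 71

# After the IO thread is restarted, the difference is re-measured as 71:
seconds_behind_master 1071 1000 71   # prints 0
```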

So how do we solve it? A simple fix is to restart the IO thread on the slave so that it recalculates the difference between the server clocks. However, with this approach the delay jumps may reappear the next time the clocks change. The better solution is to first correct the time on all servers in the cluster and then restart the IO thread once the clocks agree. Before correcting server time, a few points need attention:
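Restarting only the IO thread could look like the sketch below. The wrapper name `restart_io_thread` is hypothetical, the socket path and the `-uxxx -pxxx` placeholders are the ones used in the sampling loop earlier, and the statements are the MySQL 5.7 forms (newer versions also accept `STOP REPLICA IO_THREAD` / `START REPLICA IO_THREAD`).

```shell
# Restart the replication IO thread on the slave so that it re-measures
# the clock difference with the master. The SQL thread is left running.
# Argument: $1 = socket path (defaults to the path used earlier).
# -uxxx / -pxxx are placeholders for real credentials.
restart_io_thread() {
    local sock="${1:-/data/mysql/3306/data/mysqld.sock}"
    mysql -S "$sock" -uxxx -pxxx -e 'STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;'
}

# Usage (not executed here):
#   restart_io_thread
```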

First, does the business use functions that read the system time? One possible scenario is someone logging in to the database server directly to import SQL scripts. In that case, adjusting the server time is risky and needs to be evaluated by the business side.

Second, when correcting the time, is the master's clock being set back or set ahead? Normally, setting the clock back to an earlier time (for example, 00:03 corrected to 23:58) has a greater business impact than setting it ahead. For example, a row is inserted with a create_time of 00:03, and then the clock is corrected to 23:58. If the row is updated before the clock reaches 00:03 again, its update_time will be earlier than its create_time.

Third, if the time difference is large, you can correct it gradually in several steps: limit the size of each correction instead of jumping to the correct time in a single operation. This minimizes the impact on the business.
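As a rough illustration of correcting in several steps, the sketch below only plans the steps; it does not touch the system clock. The function name `plan_correction_steps` and the step size are hypothetical, and how each step is actually applied (for example by an NTP daemon that adjusts the clock gradually) is outside its scope.

```shell
# Split a total clock correction (in seconds, may be negative) into steps
# whose absolute size never exceeds max_step. Prints one step per line.
#   $1 total    : total correction needed
#   $2 max_step : largest correction allowed in a single step (must be > 0)
plan_correction_steps() {
    local total="$1" max_step="$2"
    local sign=1 remaining step
    [ "$max_step" -gt 0 ] || return 1
    if [ "$total" -lt 0 ]; then
        sign=-1
        total=$(( 0 - total ))
    fi
    remaining="$total"
    while [ "$remaining" -gt 0 ]; do
        if [ "$remaining" -gt "$max_step" ]; then
            step="$max_step"
        else
            step="$remaining"
        fi
        echo $(( sign * step ))
        remaining=$(( remaining - step ))
    done
}

# A 71-second correction in steps of at most 20 seconds:
plan_correction_steps 71 20   # prints 20, 20, 20, 11 (one per line)
```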

One suggestion: if the business logic depends heavily on time fields, a reliable approach is to stop the application's connections or set the database to read-only first, and then correct the time and restart the IO thread.

Reference: https://dev.mysql.com/doc/refman/5.7/en/show-slave-status.html

Origin blog.csdn.net/ActionTech/article/details/130222130