A production incident caused by high MySQL master-slave delay

Yesterday a colleague ran into a production incident. Some users placed orders successfully but could not find the order details afterwards. Tracing the problem to the order table showed that the status of some orders had never been updated.
Troubleshooting:
The backend logic of the order service had not been changed recently. The alarm logs showed a number of null pointer exceptions between 4:00 pm and 5:00 pm, located at the following code:
[Image: screenshot of the code that threw the exception]
Obviously, the order DTO queried here was empty, which caused the null pointer exception. The normal order flow is: the user creates an order, the service inserts an order into the local DB, the service calls the RPC interface of the order center, the order center creates the order and returns the order information, and the service then queries the local order table and updates the row. So here is the question. Only a few tens of milliseconds pass between inserting the order into the local table and receiving the response from the order center. Why would a row that was successfully inserted into a MySQL table not be found a few tens of milliseconds later?
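The failure mode can be reproduced with a small simulation. This is only a sketch: `Master`, `LaggingReplica`, and `place_order` are hypothetical names, the clock is passed in as plain numbers instead of real time, and the replica simply hides a row until a fixed lag has elapsed.

```python
class LaggingReplica:
    """Toy replica: a row becomes visible only after `lag_seconds` have passed."""

    def __init__(self, lag_seconds):
        self.lag_seconds = lag_seconds
        self._rows = {}  # order_id -> (row, time the master committed it)

    def replicate(self, order_id, row, committed_at):
        self._rows[order_id] = (row, committed_at)

    def select(self, order_id, now):
        entry = self._rows.get(order_id)
        if entry is None:
            return None
        row, committed_at = entry
        # The row stays invisible until the replication lag has elapsed.
        return row if now - committed_at >= self.lag_seconds else None


class Master:
    """Toy master: writes are visible here immediately."""

    def __init__(self, replica):
        self.rows = {}
        self.replica = replica

    def insert(self, order_id, row, now):
        self.rows[order_id] = row
        self.replica.replicate(order_id, row, committed_at=now)

    def select(self, order_id):
        return self.rows.get(order_id)


def place_order(master, replica, order_id, rpc_latency, now=0.0):
    master.insert(order_id, {"status": "CREATED"}, now)  # write goes to the master
    now += rpc_latency                                    # order center RPC returns
    return replica.select(order_id, now)                  # read goes to the replica


replica = LaggingReplica(lag_seconds=130.0)
master = Master(replica)
order = place_order(master, replica, "o1", rpc_latency=0.05)
print(order)  # None: dereferencing this is the null pointer from the logs
```

With a lag far below the RPC latency the same call returns the row, which is why the bug only showed up while the delay was abnormally high.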
The obvious suspect here is MySQL read/write splitting. I checked the configuration of the database in question: it uses a master-slave setup, where insert and update operations go to the master by default and queries go to the slave by default. In other words, the data on the master had not been synchronized to the slave within about 50 ms.
We know that master-slave delay is normally about 50~100 us. We then checked the monitoring of the MySQL cluster and found that during the failure the cluster TPS was relatively high, and the master-slave delay peaked at 130 s.
[Images: monitoring charts of the MySQL cluster during the failure]
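The delay can also be read from the slave itself: `SHOW SLAVE STATUS` reports it in the `Seconds_Behind_Master` column (newer MySQL versions expose `Seconds_Behind_Source` through `SHOW REPLICA STATUS`). The sketch below assumes the status row has already been fetched as a dict by some client; the function names and the one-second threshold are illustrative, not part of our monitoring.

```python
def replication_lag_seconds(status_row):
    """Return the lag in whole seconds from a status row, or None if unknown."""
    for column in ("Seconds_Behind_Master", "Seconds_Behind_Source"):
        if column in status_row:
            value = status_row[column]
            return None if value is None else int(value)
    return None


def replica_is_fresh(status_row, max_lag_seconds=1):
    lag = replication_lag_seconds(status_row)
    # NULL means replication is not running, so treat the replica as stale.
    return lag is not None and lag <= max_lag_seconds


print(replica_is_fresh({"Seconds_Behind_Master": 0}))    # True
print(replica_is_fresh({"Seconds_Behind_Master": 130}))  # False
```

The column has one-second resolution, so it is good for spotting a delay like the 130 s peak but cannot confirm that a row written 50 ms ago has reached the slave.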
MySQL's master-slave synchronization is asynchronous and single-threaded, so when the master handles a large number of writes, the replication threads on the slave cannot keep up and the slave falls behind. In addition, our database clusters are synchronized at cluster granularity, so a delay caused by one database affects every other database in the same cluster. That is what happened here: during the failure, a colleague ran a large number of SQL operations against another database in the same cluster. The TPS rose to 5000 or more, while the recommended TPS for the cluster is at most 3000. The master took too many writes, the slave could not keep up with synchronization, and the data on the slave became inconsistent with the master.
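A rough back-of-the-envelope model shows how quickly the delay grows once writes outpace the slave. It assumes, purely for illustration, that the slave can apply about 3000 transactions per second (the text only gives 3000 as the recommended cluster TPS, not a measured apply rate) and that both rates are constant; `estimated_lag_seconds` is a hypothetical helper.

```python
def estimated_lag_seconds(write_tps, apply_tps, burst_seconds):
    """Estimate how far the slave is behind at the end of a write burst."""
    if write_tps <= 0 or apply_tps <= 0:
        raise ValueError("rates must be positive")
    if write_tps <= apply_tps:
        return 0.0  # the slave keeps up, no backlog builds
    # Number of transactions the slave has applied by the end of the burst.
    applied = apply_tps * burst_seconds
    # Time at which the master committed the transaction being applied now.
    committed_at = applied / write_tps
    return burst_seconds - committed_at


print(estimated_lag_seconds(5000, 3000, 100))  # 40.0
```

Under these assumptions the slave falls 0.4 s further behind for every second the burst lasts, so a burst of a few minutes is enough to produce a delay on the order of the 130 s peak we observed.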
Existing problems:
(1) The synchronization started with a relatively small amount of data, and the volume was then gradually increased. While increasing it, we only paid attention to the load on the service provider and ignored the load on the DB.
(2) For business logic that needs to read data back shortly after writing it, it is recommended to force the query SQL to the master database, so that master-slave delay is not a problem.
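Forcing a read to the master can be sketched as a small routing layer. Everything here is hypothetical: `OrderRepository` and `force_master` are illustrative names, and plain dicts stand in for the master and slave connections. The fallback to the master when the slave returns nothing is an extra safeguard that goes beyond the recommendation above.

```python
class OrderRepository:
    """Routes reads to the slave unless the caller asks for the master."""

    def __init__(self, master, slave):
        # Any objects with a get(order_id) method work; dicts are used below.
        self.master = master
        self.slave = slave

    def find(self, order_id, force_master=False):
        if force_master:
            # Read-after-write path: the master always has the latest data.
            return self.master.get(order_id)
        row = self.slave.get(order_id)
        # Fall back to the master if the slave has not caught up yet.
        return row if row is not None else self.master.get(order_id)


master = {"o1": {"status": "CREATED"}}
slave = {}  # the slave is lagging and has not received the row yet
repo = OrderRepository(master, slave)
print(repo.find("o1", force_master=True))  # {'status': 'CREATED'}
```

Only the reads that immediately follow a write need `force_master=True`; sending every query to the master would remove the benefit of read/write splitting and add to the load that caused the delay in the first place.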
