Big data notes (to be continued)

mysql

caching technology

Common solutions to the database/cache double-write consistency problem

  1. Common solutions
    Normally, the main purpose of using a cache is to improve query performance, and in most cases the cache is used like this:
    When a request arrives, the application first checks whether the data is in the cache; if so, it is returned directly. On a cache miss, the database is queried: if the database has the data, the queried data is put into the cache and then returned; if not, an empty result is returned. This is a very common way of using a cache, and at first glance there seems to be no problem.
    But an important detail is overlooked: if a piece of data in the database is updated right after being placed in the cache, how should the cache be kept up to date? Can we simply not update the cache? Of course not. If the cache is not updated, then until the cached entry expires, user requests may keep reading the old value from the cache instead of the latest value from the database. Isn't that exactly a data inconsistency problem?
    So how do we keep the cache up to date? There are currently four options: write the cache first, then the database; write the database first, then the cache; delete the cache first, then write the database; write the database first, then delete the cache. Let's discuss these four options in detail.
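The read path above can be sketched in a few lines. This is a minimal simulation: the cache and database are plain dicts, and the names (`read_user`, `db`, `cache`) are illustrative, not from the article.

```python
db = {"user:1": {"name": "alice"}}   # simulated database
cache = {}                            # simulated cache

def read_user(key):
    # 1. Check the cache first; on a hit, return immediately.
    if key in cache:
        return cache[key]
    # 2. On a miss, query the database.
    value = db.get(key)
    # 3. If the database has the row, put it in the cache for later reads.
    if value is not None:
        cache[key] = value
    # 4. If neither side has it, return None (the "empty" result).
    return value
```

After the first `read_user("user:1")`, the entry sits in the cache, and this is precisely the point where a later database update leaves the cache stale.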

  2. Write the cache first, then write the database
    For updating the cache, the first approach many people think of is to update the cache directly (write the cache) during the write operation, which seems the most straightforward. The question then is: in a write operation, should the cache or the database be written first? Let's start with writing the cache first and then the database, because it has the most serious problems.

Consider a user's write operation: just after the cache is written, a sudden network problem causes the write to the database to fail.

The result is that the cache holds the latest data but the database does not: the cached entry has become dirty data. If a user's query happens to read this entry, it gets data that does not exist in the database at all, which is a serious problem. The main purpose of a cache is to keep a temporary copy of database data in memory so that later queries are faster; if a piece of data does not exist in the database at all, what is the point of caching such "fake data"? Therefore, writing the cache first and then the database is not advisable and is rarely used in practice.
3. Write the database first, then the cache
Since the above option does not work, let's look at writing the database first and then the cache. This option is used by some people in low-concurrency code (I would guess).
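The "fake data" failure mode can be made concrete with a small simulation: the database write throws after the cache write has already succeeded. All names here are illustrative, and the forced `IOError` stands in for the network failure described above.

```python
db, cache = {}, {}

def write_cache_first(key, value, db_write_fails=False):
    cache[key] = value                  # step 1: cache write succeeds
    if db_write_fails:
        raise IOError("network error")  # step 2: database write fails
    db[key] = value

try:
    write_cache_first("user:1", {"name": "bob"}, db_write_fails=True)
except IOError:
    pass

# The cache now holds "fake data" that the database never stored.
assert "user:1" in cache and "user:1" not in db
```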
For a user's write operation, writing the database first and then the cache avoids the previous "fake data" problem. But it brings new problems. What are they?
3.1 The cache write fails
If the database write and the cache write are placed in the same transaction, then when the cache write fails, the data already written to the database can be rolled back.
If concurrency is low and the performance requirements on the interface are not high, this works. In a high-concurrency business scenario, however, both the database write and the cache write are remote operations, and to prevent the deadlock problems caused by large transactions it is generally recommended not to place them in the same transaction. That means that if the database write succeeds but the cache write fails, the data written to the database is not rolled back. We then end up with new data in the database but old data in the cache, and the two sides are inconsistent.
3.2 Problems under high concurrency
Suppose that in a high-concurrency scenario there are two write requests for the same data from the same user, a and b, and they reach the business system at the same time. Request a carries the old data, while request b carries the new data, as follows:
Request a arrives first and has just finished writing the database, but due to a network hiccup it stalls before writing the cache. Meanwhile request b arrives, writes the database, and then successfully writes the cache. Only then does request a's stall end, and it writes the cache too. Clearly, in this sequence the new data that request b put in the cache is overwritten by request a's old data. In other words: in a high-concurrency scenario, if multiple threads perform "write the database first, then write the cache" at the same time, the database may end up with the new value while the cache holds an old value, leaving the two sides inconsistent.
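The race just described can be replayed with explicit step ordering instead of real threads, which keeps the outcome deterministic. Request a carries the old value, request b the new one; the dicts and key are illustrative.

```python
db, cache = {}, {}

# request a: writes the database, then stalls before writing the cache
db["k"] = "old"
# request b: runs to completion while a is stalled
db["k"] = "new"
cache["k"] = "new"
# request a resumes and writes its (stale) value into the cache
cache["k"] = "old"

assert db["k"] == "new" and cache["k"] == "old"   # inconsistent
```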
3.3 Waste of system resources
Another big problem with this option is that every write operation writes the cache immediately after writing the database, which wastes system resources. Why? Imagine that what is written to the cache is not a simple piece of data but the final result of a very complex computation; then every cache write must go through that computation, wasting CPU and memory. There are also special business scenarios with many writes and few reads; there, performing a cache write on every write operation costs more than it gains. It is clear that in high-concurrency scenarios, writing the database first and then the cache has many problems and is not recommended. If you are already using it, check whether you have run into these pitfalls.
4. Delete the cache first, then write the database
From the above we know that directly updating the cache causes many problems. So why not change our thinking: instead of updating the cache, delete it? Deleting the cache also comes in two variants: delete the cache first and then write the database, or write the database first and then delete the cache. Let's look at the first one: in the user's write operation, the cache entry is deleted first, and then the database is written. This option can work, but it has a similar problem.
4.1 Problems under high concurrency
Suppose that in a high-concurrency scenario, for the same data from the same user, there is a read request c and a write request d (an update), and they reach the business system at the same time:
Request d arrives first and deletes the cache, but due to a network hiccup it stalls before writing the database. Meanwhile request c arrives: it checks the cache, finds no data, then queries the database and finds data, but it is the old value, and writes that old value into the cache. At this point request d's stall ends and it writes the new value to the database. In this sequence, request d's new value never reaches the cache, which request c has filled with the old value, so the cache and the database are again inconsistent. So, can the inconsistency in this scenario be solved?
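The c/d interleaving above, replayed as explicit steps (a simulation with illustrative names, not real concurrency):

```python
db, cache = {"k": "old"}, {"k": "old"}

# request d (write): deletes the cache, then stalls before the DB write
cache.pop("k", None)
# request c (read): cache miss, reads the old value from the DB, caches it
cache["k"] = db["k"]
# request d resumes and writes the new value to the database
db["k"] = "new"

assert db["k"] == "new" and cache["k"] == "old"   # inconsistent again
```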
4.2 Cache double delete
In the scenario above there is one read request and one write request. After the write request deletes the cache, the read request may write the old value it queried from the database back into the cache. Some will say this is easy to handle: after request d finishes writing the database, just delete the cache one more time.
This is what we call cache double delete: delete the cache once before writing the database, and delete it again after writing the database. A key point of this scheme is that the second delete is not performed immediately, but after a certain time interval.
Let's replay the inconsistency between the concurrent read request and write request: request d arrives first and deletes the cache, but due to a network hiccup it stalls before writing the database. Request c arrives, misses the cache, queries the old value from the database, and writes it into the cache. Then request d's stall ends and it writes the new value to the database. After a period of time, say 500ms, request d deletes the cache again. This indeed solves the inconsistency.
So why must the second delete wait? Because request c writes the old value into the cache only after request d has written the new value to the database. If request d deletes too quickly, the cache may already be deleted before request c sets the old value there, making the delete meaningless; the delete must happen after request c's cache update so that the stale value is actually removed in time. The interval therefore ensures that if request c, or any request like it, sets an old value in the cache, it will eventually be deleted by request d.
One question remains: what if the second cache delete fails? Let's leave that in suspense for now; it is covered in detail later.
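A sketch of delayed double delete over the same interleaving. The 500ms interval from the text is shortened to 10ms here so the example runs quickly; all names are illustrative.

```python
import time

db, cache = {"k": "old"}, {"k": "old"}

cache.pop("k", None)      # d: first delete, before writing the database
cache["k"] = db["k"]      # c: cache miss, re-populates with the old value
db["k"] = "new"           # d: database write
time.sleep(0.01)          # d: wait an interval (500ms in the text)
cache.pop("k", None)      # d: delayed second delete removes c's stale entry

assert db["k"] == "new" and "k" not in cache   # consistent: next read re-caches "new"
```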
5. Write the database first, then delete the cache
As we saw above, with "delete the cache first, then write the database", the cache and the database can still become inconsistent under concurrency. So our hope rests on the final option. Next, let's focus on writing the database first and then deleting the cache.
In a high-concurrency scenario with one read request and one write request, the update flow is: request e writes the database first, but stalls due to a network hiccup before it can delete the cache; request f queries the cache, finds data there, and returns it directly; then request e deletes the cache. In this flow, request f reads the old data only once, and the old data is then promptly deleted by request e, so there is no big problem. But what if the read request comes first? Request f queries the cache, finds data, and returns it; request e then writes the database and deletes the cache. That doesn't seem to be a problem either. Right. What we really have to fear is the situation where the cache entry itself has expired, as follows:
The cache entry reaches its expiration time and expires automatically. Request f queries the cache, finds nothing, and reads the old value from the database, but it stalls due to a network hiccup before it can update the cache. Request e then writes the database and deletes the cache. Finally request f writes the old value into the cache. Now the cache and the database are inconsistent again.
This situation is rare, however, because it requires two conditions to hold at once: the cache entry happens to expire automatically, and request f's "read the old value from the database and update the cache" takes longer than request e's "write the database and delete the cache". We all know that querying the database is generally faster than writing to it, let alone writing plus deleting the cache afterwards, so in most cases a write request takes longer than a read. The probability of both conditions holding at once is therefore very small.
The recommendation is to use "write the database first, then delete the cache". Although it cannot avoid data inconsistency 100% of the time, the probability of the problem is the smallest of all the options. But in this scheme, what do we do if deleting the cache fails?
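The recommended write path is short enough to sketch directly. Note that the cache is deleted, not updated; the next read misses, fetches the new value from the database, and re-populates the cache. Names are illustrative.

```python
db, cache = {"k": "old"}, {"k": "old"}

def write_then_delete(key, value):
    db[key] = value           # 1. update the database first
    cache.pop(key, None)      # 2. then delete (not update) the cache

write_then_delete("k", "new")

assert db["k"] == "new" and "k" not in cache
```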
6. What if the cache delete fails?
In fact, the "write the database first, then delete the cache" option shares a risk with cache double delete: if the cache delete fails, the cache and the database become inconsistent. So what should you do when the delete fails? Answer: add a retry mechanism. In the interface, if the database update succeeds but the cache delete fails, you can retry immediately up to three times; if any attempt succeeds, return success directly. If all three attempts fail, record the failure (for example in a retry table) for subsequent processing.
Of course, retrying synchronously inside the interface may hurt interface performance when concurrency is high, so you may need to switch to asynchronous retry. There are several ways to retry asynchronously: start a dedicated thread for each retry (not recommended, since under high concurrency too many threads may be created and cause an OOM); hand the retry task to a thread pool (but some data may be lost if the server restarts); write the retry data to a table and retry with a scheduled task such as elastic-job; write the retry request to message middleware such as mq and handle it in an mq consumer; or subscribe to MySQL's binlog and, in the subscriber, delete the corresponding cache whenever an update is observed.
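The synchronous "retry up to three times" idea can be sketched as a small helper. The helper, the `IOError` failure mode, and the flaky stub are all illustrative; the caller is expected to record the failure for asynchronous handling when `False` comes back.

```python
def delete_cache_with_retry(delete_fn, max_attempts=3):
    # Try the cache delete up to max_attempts times; return True on the
    # first success, False if every attempt fails (the caller then records
    # the failure, e.g. in a retry table, for later processing).
    for _ in range(max_attempts):
        try:
            delete_fn()
            return True
        except IOError:
            continue
    return False

calls = {"n": 0}

def flaky_delete():
    # Fails twice, then succeeds, simulating a transient cache outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("cache unavailable")

assert delete_cache_with_retry(flaky_delete) is True
assert calls["n"] == 3
```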
7. Retrying with a scheduled task
The concrete scheme for retrying with a scheduled task is as follows: when a user operation finishes writing the database but fails to delete the cache, the user data is written into a retry table.
The scheduled task then reads the user data from the retry table asynchronously. The retry table needs a retry-count field, initialized to 0. The cache delete is then retried up to 5 times, with the field incremented by 1 on each attempt. If any attempt succeeds, success is returned. If it still fails after 5 retries, a failure status is recorded in the retry table, to await further processing.
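One pass of that scheduled task, with the retry-count field described above, might look like the following sketch. The table is simulated as a list of dicts; a real job (e.g. one driven by elastic-job) would read rows from MySQL instead. All names and statuses are illustrative.

```python
retry_table = [{"key": "user:1", "retry_count": 0, "status": "pending"}]

def run_scheduled_pass(delete_fn, max_retries=5):
    for row in retry_table:
        if row["status"] != "pending":
            continue
        while row["retry_count"] < max_retries:
            row["retry_count"] += 1        # field value +1 per attempt
            try:
                delete_fn(row["key"])
                row["status"] = "done"     # any success ends the retries
                break
            except IOError:
                continue
        else:
            row["status"] = "failed"       # record failure for follow-up

def always_fail(key):
    raise IOError("cache still unreachable")

run_scheduled_pass(always_fail)
assert retry_table[0] == {"key": "user:1", "retry_count": 5, "status": "failed"}
```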
For the scheduled task in high-concurrency scenarios, elastic-job is recommended. Compared with schedulers such as xxl-job, it can process in shards, which improves throughput, and the interval of each shard can be set to 1, 2, 3, 5, 7 seconds, and so on. If you are interested in scheduled tasks, you can read my other article "Learning these 10 types of scheduled tasks, I am a little confused", which surveys the most mainstream schedulers. The drawback of scheduled-task retry is that its real-time performance is not great, so this option is unsuitable for business scenarios with particularly strict real-time requirements, though it is fine for ordinary ones. Its big advantage is that the data is persisted in the database, so no data is lost.
8. mq
In high-concurrency business scenarios, mq (message queue) is an essential technology: it provides asynchronous decoupling as well as peak shaving, which matters a great deal for system stability. Readers interested in mq can see my other article "The Rude Things About MQ". The mq producer, after producing a message, sends it to the mq server under a specified topic; the mq consumer subscribes to that topic, reads the message data, and performs the business logic. The concrete retry scheme using mq is as follows:
When a user operation finishes writing the database but fails to delete the cache, an mq message is produced and sent to the mq server. The mq consumer reads the message and retries the cache delete up to 5 times; if any attempt succeeds, success is returned. If it still fails after 5 retries, the message is written to a dead letter queue. RocketMQ is recommended here: it supports the retry mechanism and dead letter queues by default, is convenient to use, and also supports ordered messages, delayed messages, transactional messages, and other business scenarios.
Of course, in this scheme the cache delete can be made fully asynchronous: the user's write operation need not delete the cache immediately after writing the database; instead it simply sends the mq message, and the mq consumer takes full responsibility for deleting the cache. Since mq latency is quite low, this improved variant is also a good choice.
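The consumer side of that flow, modeled minimally: up to 5 delivery attempts, then the message is parked in a dead letter queue. Real brokers such as RocketMQ perform the re-delivery and DLQ routing themselves; this sketch only models the logic, and all names are illustrative.

```python
dead_letter_queue = []

def consume(message, delete_fn, max_attempts=5):
    for _ in range(max_attempts):
        try:
            delete_fn(message["key"])
            return True                   # cache deleted, message consumed
        except IOError:
            continue
    dead_letter_queue.append(message)     # exhausted: park for manual handling
    return False

def working_delete(key):
    pass                                  # simulated successful cache delete

def broken_delete(key):
    raise IOError("cache down")           # simulated persistent failure

assert consume({"key": "user:1"}, working_delete) is True
assert consume({"key": "user:2"}, broken_delete) is False
assert dead_letter_queue == [{"key": "user:2"}]
```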
9. binlog
As discussed above, whether with a scheduled task or with mq, the retry mechanism is intrusive to the business code. In the scheduled-task scheme, extra logic must be added to the business code to write the data into the retry table when the cache delete fails; in the mq scheme, the business code must send the mq message to the mq server on failure. There is in fact a more elegant implementation: listen to the binlog, for example with middleware such as canal. The concrete scheme is as follows:
After the business interface writes the database, it simply returns success and does nothing else. The MySQL server automatically writes the change into the binlog; the binlog subscriber picks up the changed data and deletes the cache. This scheme does simplify the flow: the business interface only needs to care about database operations, and the cache deletion is done inside the binlog subscriber. But if the cache is deleted only once, as in the basic scheme, that single delete may fail. How do we solve this? Answer: by adding the retry mechanism discussed earlier. If the cache delete fails, either write to the retry table and retry with a scheduled task, or write to mq and let mq retry automatically. The mq automatic retry mechanism is recommended here.
If the cache delete fails in the binlog subscriber, an mq message is sent to the mq server, and the mq consumer automatically retries up to 5 times. If any attempt succeeds, success is returned directly. If it still fails after 5 retries, the message is automatically placed in the dead letter queue, and manual intervention may be needed later.
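The binlog flow can be modeled minimally as follows: the business code only writes the database, and a subscriber (canal-like, simulated here as a list of change events plus a callback) sees the change and deletes the cache. Everything here is a simulation; real binlog subscription is done through middleware such as canal, not application code.

```python
db, cache, events = {}, {"k": "old"}, []

def business_write(key, value):
    db[key] = value                                 # business code: DB write only
    events.append({"type": "update", "key": key})   # binlog record (simulated)

def binlog_subscriber():
    # Drain the simulated binlog and delete the cache for each update.
    while events:
        ev = events.pop(0)
        if ev["type"] == "update":
            cache.pop(ev["key"], None)

business_write("k", "new")
binlog_subscriber()

assert db["k"] == "new" and "k" not in cache
```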


Origin blog.csdn.net/yangzex/article/details/134986730