Database and cache double-write consistency

Foreword

Double-write consistency between the database and the cache (such as Redis) is a well-known problem, independent of any particular programming language, and it becomes especially serious in high-concurrency scenarios.

I can tell you with confidence that you are very likely to run into this problem, whether in interviews or at work, so it is well worth discussing.

In today's article I will go from shallow to deep, walking through the common solutions to the database/cache double-write consistency problem, the pitfalls hiding in each of them, and which solution is best.

1. Common solutions

Usually, the main purpose of using a cache is to improve query performance. In most cases, we use the cache like this:
1. When a request comes in, first check whether the data is in the cache; if so, return it directly.
2. If the data is not in the cache, query the database.
3. If the database has the data, put the queried data into the cache, then return it.
4. If the database does not have the data either, return empty directly.
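
For concreteness, here is a minimal sketch of this read path in Java. The `Cache`, `UserDao`, and `User` types are hypothetical stand-ins for your cache client (for example, a Redis client) and your data-access layer, not any specific library:

```java
// Hypothetical stand-ins for a cache client (e.g. Redis) and a data-access layer.
class User {
    private long id;
    long getId() { return id; }
    // other fields omitted
}

interface Cache {
    User get(String key);
    void set(String key, User value, int ttlSeconds);
    void del(String key);
}

interface UserDao {
    User findById(long id);
    void update(User user);
}

class UserService {
    private final Cache cache;
    private final UserDao db;

    UserService(Cache cache, UserDao db) { this.cache = cache; this.db = db; }

    // Cache-aside read path, matching steps 1-4 above.
    User getUser(long userId) {
        String key = "user:" + userId;
        User cached = cache.get(key);        // 1. check the cache
        if (cached != null) return cached;   //    hit: return it directly
        User user = db.findById(userId);     // 2. miss: query the database
        if (user != null) {
            cache.set(key, user, 300);       // 3. found: cache it (300s TTL), then return
        }
        return user;                         // 4. not in the database either: return null
    }
}
```

Later sketches in this article reuse the same hypothetical `cache` and `db` objects.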

This is a very common way to use a cache, and at first glance there seems to be nothing wrong with it.
But it overlooks a very important detail: if a piece of data in the database is updated right after it has been put into the cache, how should the cache be updated?

Can't we just not update the cache?

Answer: of course not. If the cache is not updated, then for quite a long time (up to the cache's expiration time) user requests may read the old value from the cache rather than the latest value in the database. Isn't that exactly a data-inconsistency problem?

So, how do we update the cache?

Currently, there are four solutions:
1. Write the cache first, then write the database.
2. Write the database first, then write the cache.
3. Delete the cache first, then write the database.
4. Write the database first, then delete the cache.

Next, we describe these four options in detail.

2. Write the cache first, then write the database

When thinking about how to update the cache, the first idea many people have is to write the cache directly during the write operation, which seems the most straightforward.
So here comes the question: in a write operation, should we write the cache first, or the database first?
Let's start with writing the cache first and then the database, because it has the most serious problems.
Suppose that during some user's write operation, the cache has just been written when a network exception suddenly occurs, causing the write to the database to fail.

As a result, the cache holds the latest data but the database does not, so the cached value has become dirty data. If one of that user's query requests happens to read it at this moment, there is a problem, because the data does not exist in the database at all, and that is a serious issue.
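
A minimal sketch of this ordering, reusing the hypothetical `cache` and `db` from the earlier sketch, makes the failure window obvious:

```java
// Write the cache first, then the database: if the database write fails,
// the cache is left holding data the database never saw ("dirty data").
void updateUser(User user) {
    String key = "user:" + user.getId();
    cache.set(key, user, 300);   // the cache now holds the new value...
    db.update(user);             // ...but if this throws, the database was never
                                 // updated, and reads will serve a value that
                                 // exists nowhere in the database
}
```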

We all know that the main purpose of a cache is to hold database data temporarily in memory, so that subsequent queries are faster.

But if a piece of data doesn't exist in the database at all, what's the point of caching such fake data?

Therefore, writing the cache first and then the database is not advisable, and it is rarely used in practice.

3. Write to the database first, then write to the cache

Since the scheme above does not work, let's talk about writing the database first and then writing the cache. This scheme is, I suspect, used in some low-concurrency systems.

For a user's write operation, writing the database first and then the cache avoids the earlier "fake data" problem. But it creates new problems.

What's the problem?

3.1 The cache write fails

If writing the database and writing the cache are placed in the same transaction, then when the cache write fails, the data already written to the database can be rolled back.
For a system with relatively low concurrency and loose requirements on interface performance, you can get away with this.
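
A sketch of this approach, assuming a Spring-managed service where the cache write throws a RuntimeException on failure so that the surrounding transaction rolls back (the `Cache` and `UserDao` types are the hypothetical ones from earlier):

```java
import org.springframework.transaction.annotation.Transactional;

class UserWriteService {
    private final Cache cache;
    private final UserDao db;

    UserWriteService(Cache cache, UserDao db) { this.cache = cache; this.db = db; }

    // If cache.set(...) throws a RuntimeException, Spring rolls back the
    // database update, keeping both sides consistent, at the cost of
    // holding the transaction open across a remote cache call.
    @Transactional
    public void updateUser(User user) {
        db.update(user);
        cache.set("user:" + user.getId(), user, 300);
    }
}
```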

However, in a high-concurrency business scenario, both the database write and the cache write are remote operations. To prevent the problems caused by large transactions, such as long lock holding and deadlocks, it is generally recommended not to put the database write and the cache write in the same transaction.

That is, in this scheme, if the database write succeeds but the cache write fails, the data written to the database will not be rolled back.

The result: new data in the database, old data in the cache, and the two sides inconsistent.

3.2 Problems under high concurrency

Assume a high-concurrency scenario: for the same piece of data of the same user, two write requests, a and b, reach the business system at almost the same time.
Request a carries an older value and request b carries a newer value. The interleaving goes like this:
1. Request a arrives first and has just written the database when, due to a network hiccup, it stalls for a moment before it can write the cache.
2. Meanwhile request b arrives and writes the database.
3. Request b then writes the cache successfully.
4. Request a's stall ends, and it writes the cache as well.
Clearly, in this process request b's new data in the cache ends up overwritten by request a's old data.

In other words: in a high-concurrency scenario, if multiple threads perform "write the database first, then write the cache" at the same time, the database may end up holding the new value while the cache holds the old one, leaving the two sides inconsistent.

3.3 Waste of system resources

Another big problem with this scheme is that the cache is rewritten immediately after every write operation, which wastes system resources.

Why?

Imagine that what gets written to the cache is not a simple value but the final result of a very complex calculation. Then every single cache write incurs that very complex calculation. Isn't that a waste of system resources, especially CPU and memory?

There are also some special business scenarios that are write-heavy and read-light. In such scenarios, if every write operation also has to write the cache once, the loss outweighs the gain.

Clearly, in high-concurrency scenarios the "write the database first, then write the cache" scheme has many problems and is not recommended.

If you are already using it, hurry up and check whether you have stepped into any of these pits.

4. Delete the cache first, then write to the database

From the above, we know that directly updating the cache causes many problems.

So why not change our thinking: instead of updating the cache, delete it?

Deleting the cache likewise comes in two schemes:
1. Delete the cache first, then write the database.
2. Write the database first, then delete the cache.

Let's look at the first one: delete the cache first, then write the database.
To put it bluntly, the user's write operation performs the cache deletion first and then writes the database. This scheme might work, but it suffers from the same kind of problem.

4.1 Problems under high concurrency

Assume a high-concurrency scenario: for the same piece of data of the same user, a read request c and a write request d (an update operation) reach the business system at about the same time:
1. Request d arrives first and deletes the cache, but due to a network hiccup it stalls for a moment before it can write the database.
2. Meanwhile request c arrives. It checks the cache first, finds no data, then queries the database and gets data, but it is the old value.
3. Request c writes that old value from the database into the cache.
4. Request d's stall ends, and it writes the new value into the database.

In this process, request d's new value reaches only the database, while request c has refilled the cache with the old value; again the cache and the database end up inconsistent.

So, can the data inconsistency problem in this scenario be solved?

4.2 Cache double deletion

In the scenario above, with one read request and one write request: after the write request deletes the cache, the read request may write the old value it just queried from the database back into the cache.

Some people say this is easy to handle: just have request d delete the cache once more after writing the database.
This is what we call cache double delete: delete the cache once before writing the database, and delete it again after writing the database.
A critical point of this scheme: the second deletion must not happen immediately, but only after a certain delay.
Let's replay how one read request and one write request produce inconsistent data under high concurrency:

1. Request d arrives first and deletes the cache, but due to a network hiccup it stalls for a moment before it can write the database.
2. Meanwhile request c arrives, finds nothing in the cache, queries the database, and gets the old value.
3. Request c writes the old value from the database into the cache.
4. Request d's stall ends, and it writes the new value into the database.
5. After a delay, for example 500ms, request d deletes the cache again.

This does indeed solve the cache-inconsistency problem.
So why must the second deletion wait for a while?
After request d's stall ends and it has written the new value into the database, request c writes the old value from the database into the cache.
If request d deletes the cache too quickly at that point, the deletion may happen before request c has refilled the cache with the old value, making the deletion meaningless. The cache must be deleted after request c's cache write, so that the stale value is actually removed in time.
Hence the delay on request d's second deletion: it ensures that request c, or any similar request that set a stale value in the cache, has finished, so that the stale value is eventually deleted by request d.
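
Here is a minimal sketch of the double delete, reusing the hypothetical `cache` and `db` from earlier. The 500ms delay is an illustrative value; in practice it must exceed the time a concurrent read takes to query the database and refill the cache:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DoubleDeleteService {
    private final Cache cache;
    private final UserDao db;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DoubleDeleteService(Cache cache, UserDao db) { this.cache = cache; this.db = db; }

    void updateUser(User user) {
        String key = "user:" + user.getId();
        cache.del(key);      // first delete, before writing the database
        db.update(user);
        // second delete, after a delay long enough for any concurrent read
        // to have finished refilling the cache with the stale value
        scheduler.schedule(() -> cache.del(key), 500, TimeUnit.MILLISECONDS);
    }
}
```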

Next, another question: what should we do if the second cache deletion fails?

Let's leave that in suspense for now; I will come back to it in detail later.

5. Write to the database first, then delete the cache

As we saw above, "delete the cache first, then write the database" can also leave the cache and the database inconsistent under concurrency.

So we can only pin our hopes on the final scheme.

Next, let's focus on writing the database first and then deleting the cache.

In a high-concurrency scenario with one read request and one write request, the flow is as follows:
1. Request e writes the database first, but due to a network hiccup it stalls for a moment before it can delete the cache.
2. Request f queries the cache, finds data there, and returns it directly.
3. Request e deletes the cache.

In this process, only request f reads the old data once, and the old data is then promptly deleted by request e, which doesn't look like a big problem.
But what if the read request arrives first?
1. Request f queries the cache, finds data there, and returns it directly.
2. Request e writes the database first.
3. Request e deletes the cache.

Does this look okay?

Answer: Yes.

What I am afraid of is the following situation: the cached entry expiring on its own. Like this:
1. The cached entry reaches its expiration time and is automatically evicted.
2. Request f queries the cache, finds nothing, and queries the database, getting the old value; but due to network problems it does not update the cache right away.
3. Request e writes the database first and then deletes the cache.
4. Request f writes the old value into the cache.

Now the data in the cache and the database are inconsistent again.

But this situation is fairly rare, because it requires both of the following at the same time:
1. The cached entry happens to have just expired automatically.
2. Request f's read of the old value plus its cache update takes longer than request e's database write plus cache deletion.

We all know that querying a database is generally faster than writing to one, let alone writing to the database and then also deleting the cache. So in most cases a write request takes longer than a read request.

Clearly, the probability that a system satisfies both conditions at once is very small.

That is why I recommend the scheme of writing the database first and then deleting the cache. Although it cannot avoid data inconsistency 100% of the time, the probability of the problem occurring is the smallest among all the schemes.
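
The recommended write path is then just two steps. A minimal sketch, again with the hypothetical `cache` and `db`:

```java
// Recommended scheme: write the database first, then delete the cache.
// The next read will miss the cache and reload the fresh value from the database.
void updateUser(User user) {
    db.update(user);
    cache.del("user:" + user.getId());
}
```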

But in this scenario, what if deleting the cache fails?

6. What if deleting the cache fails?

In fact, "write the database first, then delete the cache" shares a risk point with the cache double-delete scheme: if the cache deletion fails, the cache and the database end up inconsistent.

So, what should we do when deleting the cache fails?

Answer: add a retry mechanism.

In the interface, if the database update succeeds but the cache deletion fails, retry immediately, up to 3 times. If any attempt succeeds, return success directly. If all 3 attempts fail, write a record to the database (a retry table) for later processing.
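
A minimal sketch of this synchronous retry, with a hypothetical `saveToRetryTable` helper standing in for the "write to the database for later processing" step:

```java
void updateUser(User user) {
    String key = "user:" + user.getId();
    db.update(user);
    for (int attempt = 1; attempt <= 3; attempt++) {
        try {
            cache.del(key);
            return;            // any successful attempt ends the retries
        } catch (Exception e) {
            // deletion failed; fall through and try again
        }
    }
    saveToRetryTable(key);     // hypothetical helper: record the key in a retry
                               // table for later processing (see below)
}
```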

Of course, if you retry synchronously inside the interface, interface performance may suffer slightly when concurrency is high.

In that case, you need to switch to asynchronous retries.

There are many ways to retry asynchronously, for example:

1. Start a separate thread for each retry, dedicated to retrying. However, in a high-concurrency scenario this may create too many threads and cause OOM problems, so it is not recommended.

2. Hand the retry tasks over to a thread pool; but if the server restarts, in-flight retry data may be lost.

3. Write the retry data to a table, and retry it with a scheduled-task framework such as elastic-job.

4. Write the retry request to message middleware such as MQ, and handle it in an MQ consumer.

5. Subscribe to MySQL's binlog; in the subscriber, whenever a data-update event is seen, delete the corresponding cache entry.

7. Scheduled tasks

The concrete scheme for retrying via scheduled tasks is as follows:

1. When a user operation finishes writing the database but fails to delete the cache, write the affected data into a retry table.

2. In a scheduled task, asynchronously read the user data in the retry table. The retry table needs a retry-count field with an initial value of 0. The task retries the cache deletion up to 5 times, adding 1 to this field on each attempt. If any attempt succeeds, the row is marked done. If it still fails after 5 retries, record a failure status in the retry table and wait for further (manual) processing. A sketch follows after this list.

3. For the scheduled task in high-concurrency scenarios, elastic-job is recommended. Compared with frameworks such as xxl-job, it supports sharded processing, which improves throughput; the retry intervals can be set to, say, 1, 2, 3, 5, 7 seconds, and so on.
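
Here is the promised sketch of step 2, assuming a Spring `@Scheduled` job and a hypothetical `RetryDao` over the retry table (all names are illustrative):

```java
import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;

class CacheRetryJob {
    private final Cache cache;        // hypothetical cache client, as before
    private final RetryDao retryDao;  // hypothetical DAO over the retry table

    CacheRetryJob(Cache cache, RetryDao retryDao) {
        this.cache = cache;
        this.retryDao = retryDao;
    }

    @Scheduled(fixedDelay = 5000)     // poll the retry table every 5 seconds
    public void retryCacheDeletes() {
        List<CacheRetry> pending = retryDao.findPending(100);   // a batch of unfinished rows
        for (CacheRetry row : pending) {
            try {
                cache.del(row.getKey());
                retryDao.markSuccess(row.getId());              // done: mark (or delete) the row
            } catch (Exception e) {
                if (row.getRetryCount() + 1 >= 5) {
                    retryDao.markFailed(row.getId());           // give up after 5 attempts;
                                                                // wait for manual follow-up
                } else {
                    retryDao.incrementRetryCount(row.getId());  // try again on the next run
                }
            }
        }
    }
}
```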

If you are interested in scheduled tasks, you can read my other article, "Learning these 10 types of timed tasks, I am a little bit drifting", which covers the most mainstream scheduled-task frameworks today.

Retrying via scheduled tasks has a drawback: its real-time behavior is limited. For business scenarios with particularly high real-time requirements this scheme is unsuitable, but for ordinary scenarios it works fine.

It also has a great advantage: the retry data is stored in the database, so it will not be lost.

8. MQ

In high-concurrency business scenarios, MQ (a message queue) is one of the essential technologies. It provides asynchronous decoupling as well as peak shaving, and contributes a great deal to system stability.

Friends who are interested in MQ can read my other article, "Those Broken Things about MQ".

The MQ producer sends each message it produces to the MQ server under a specified topic. The MQ consumer then subscribes to that topic, reads the message data, and performs its business-logic processing.

The concrete scheme for retrying via MQ is as follows:

1. When a user operation finishes writing the database but fails to delete the cache, produce an MQ message and send it to the MQ server.

2. An MQ consumer reads the message and retries the cache deletion up to 5 times. If any attempt succeeds, it returns success. If it still fails after 5 retries, the message is written to a dead-letter queue.

3. RocketMQ is recommended as the MQ: its retry mechanism and dead-letter queue are supported by default, it is very convenient to use, and it also supports ordered messages, delayed messages, transactional messages, and other business scenarios.
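
As an illustration, here is a condensed sketch of such a consumer using RocketMQ's push-consumer API. Returning `RECONSUME_LATER` asks the broker to redeliver the message later; once the maximum number of retries is exceeded, RocketMQ moves the message to the dead-letter queue. The topic, group, and address are made-up values:

```java
import java.nio.charset.StandardCharsets;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class CacheDeleteConsumer {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("cache-delete-group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("CACHE_DELETE_TOPIC", "*");
        consumer.setMaxReconsumeTimes(5);   // after 5 failed deliveries the broker moves
                                            // the message to the dead-letter queue
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            for (MessageExt msg : msgs) {
                String key = new String(msg.getBody(), StandardCharsets.UTF_8);
                try {
                    deleteFromCache(key);   // hypothetical cache-deletion call
                } catch (Exception e) {
                    // report failure so the broker redelivers the message later
                    return ConsumeConcurrentlyStatus.RECONSUME_LATER;
                }
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }

    private static void deleteFromCache(String key) { /* e.g. a Redis DEL on this key */ }
}
```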

Of course, in this scheme the cache deletion can be made fully asynchronous: the user's write operation need not delete the cache immediately after writing the database at all. It just sends the MQ message to the MQ server, and the MQ consumer alone takes on the job of deleting the cache.

Because MQ offers fairly good real-time behavior, this improved scheme is also a good choice.

9. binlog

As discussed earlier, whether we use scheduled tasks or MQ (message queues), the retry mechanism is somewhat intrusive to the business code.

With scheduled tasks, extra logic must be added to the business code: when the cache deletion fails, the data has to be written into the retry table.

With MQ, when the cache deletion fails, the business code has to send an MQ message to the MQ server.

In fact, there is a more elegant implementation: listen to the binlog, for example with middleware such as canal.

The specific plan is as follows:
1. The business interface just writes the database and returns success; it does not concern itself with anything else.
2. The MySQL server automatically writes the changed data into the binlog.
3. A binlog subscriber picks up the changed data and deletes the corresponding cache entry.

This scheme really does simplify the process: the business interface only needs to care about the database operation, while the cache deletion is handled in the binlog subscriber.
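
For reference, here is a condensed sketch of such a subscriber using canal's Java client. The address, destination, table filter, and key-derivation logic are all placeholders for your own setup:

```java
import java.net.InetSocketAddress;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;

public class BinlogCacheInvalidator {
    public static void main(String[] args) throws Exception {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe("mydb\\.user");  // watch the table that backs the cache
        while (true) {
            Message message = connector.getWithoutAck(100);  // fetch a batch of binlog entries
            if (message.getId() == -1 || message.getEntries().isEmpty()) {
                continue;                                    // nothing new yet
            }
            for (CanalEntry.Entry entry : message.getEntries()) {
                if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) continue;
                CanalEntry.RowChange rowChange =
                        CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                if (rowChange.getEventType() == CanalEntry.EventType.UPDATE
                        || rowChange.getEventType() == CanalEntry.EventType.DELETE) {
                    for (CanalEntry.RowData rowData : rowChange.getRowDatasList()) {
                        deleteFromCache(rowData);  // derive the cache key from the row's
                                                   // primary key and delete it (hypothetical)
                    }
                }
            }
            connector.ack(message.getId());      // confirm the batch has been processed
        }
    }

    private static void deleteFromCache(CanalEntry.RowData rowData) { /* e.g. a Redis DEL */ }
}
```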

However, if you follow the scheme above and delete the cache only once, the deletion may still fail.

How do we solve this problem?

Answer: by adding the retry mechanisms discussed earlier. If the cache deletion fails, either write to the retry table and retry via the scheduled task, or write to MQ and let MQ retry automatically.

Here, the MQ automatic retry mechanism is recommended.
If the cache deletion fails in the binlog subscriber, send an MQ message to the MQ server and let the MQ consumer retry automatically, up to 5 times. If any attempt succeeds, return success directly. If it still fails after 5 retries, the message is automatically placed in the dead-letter queue, and manual intervention may be needed later.
