How to ensure database and cache double-write consistency

The problem of double-write consistency between the database and the cache (such as Redis) is a common one that has nothing to do with any particular development language, and it becomes especially serious in high-concurrency scenarios.
Let's discuss it together.

Common solutions

Usually, the main purpose of using a cache is to improve query performance. Most of the time, we use the cache like this:
(figure: the common read path: check the cache first, fall back to the database on a miss)

Simply put: check the cache first. If there is no data in the cache, query the database and write the result found in the database back into the cache. If there is no data in the database either, return empty.
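As a minimal sketch in Java, the read path might look like the following, assuming a Jedis client, an assumed local Redis address, and a hypothetical `UserDao` for the database query:

```java
import redis.clients.jedis.Jedis;

public class UserQueryService {

    interface UserDao { String findById(String userId); } // hypothetical DB access

    private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address
    private final UserDao userDao;

    public UserQueryService(UserDao userDao) {
        this.userDao = userDao;
    }

    public String getUser(String userId) {
        String cacheKey = "user:" + userId;
        // 1. Check the cache first.
        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached;
        }
        // 2. Cache miss: query the database.
        String dbValue = userDao.findById(userId);
        if (dbValue == null) {
            return null; // 3. Not in the database either: return empty.
        }
        // 4. Sync the database result back into the cache, with a TTL.
        jedis.setex(cacheKey, 300, dbValue);
        return dbValue;
    }
}
```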
This way of using the cache looks reasonable on the surface, but it ignores an important scenario:

If a piece of data in the database is updated right after being put into the cache, how should the cache be updated?

Can't we simply not update the cache?

The answer is definitely no, otherwise this article would not exist.
If the cache is never updated and we rely only on its expiration time, users may read stale data for a long period. For example, if a piece of data is put into the cache and then immediately updated in the database, every read until the cache expires will return the old value.

So, how do we update the cache?
There are currently 4 options:

  • Write the cache first, then write the database
  • Write the database first, then write the cache
  • Delete the cache first, then write the database
  • Write the database first, then delete the cache

Let's see whether each of these schemes actually works.

1. Write the cache first, then write the database

We know that the main reason the cache is fast is that it has no disk I/O bottleneck: it reads and writes memory directly, so operations are very fast. But precisely because the data lives only in memory, if the cache server goes down, the data written to it is lost. There is another failure mode: the cache write succeeds, and then, right after it, a network exception causes the database write to fail. Now the data exists in the cache but not in the database, and the cached data has become dirty data.
We must keep in mind that the main purpose of a cache is temporary storage: database data is staged in memory to speed up subsequent queries. If a piece of data does not exist in the database at all, caching it is arguably meaningless. Therefore this scheme is not advisable.

2. Write the database first, then write the cache

This scheme avoids the fake-data problem above; fake data means data that exists in the cache but not in the database. But new problems arise.

2.1 Writing the cache fails

If the database write and the cache write are placed in the same transaction, then when the cache write fails we can roll back the data already written to the database.
This is acceptable, but only in business scenarios with low concurrency, where interface performance is not critical.
In a high-concurrency scenario, however, both the database write and the cache write are remote operations. To avoid the lock contention and deadlock problems caused by large transactions, it is generally recommended not to put the database write and the cache write in the same transaction.
Without a shared transaction, the database write may succeed while the cache write fails, and the data already written to the database will not be rolled back. The result: the database holds the new data, the cache holds the old data, and the two sides are inconsistent.
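A minimal Spring-style sketch of both variants, assuming Spring Data Redis and a hypothetical `UserDao` (the annotations and the template are standard Spring; the DAO is an assumption):

```java
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class UserWriteService {

    interface UserDao { void update(String userId, String value); } // hypothetical DAO

    private final UserDao userDao;
    private final StringRedisTemplate redisTemplate;

    public UserWriteService(UserDao userDao, StringRedisTemplate redisTemplate) {
        this.userDao = userDao;
        this.redisTemplate = redisTemplate;
    }

    // Variant A: both writes in one transaction. If the cache write throws,
    // the database write rolls back. But the transaction now spans a remote
    // Redis call and holds database locks longer, so this is only acceptable
    // at low concurrency.
    @Transactional
    public void updateUserInTx(String userId, String value) {
        userDao.update(userId, value);
        redisTemplate.opsForValue().set("user:" + userId, value);
    }

    // Variant B: no shared transaction. If the cache write fails, the database
    // keeps the new value while the cache keeps the old one: inconsistency.
    public void updateUser(String userId, String value) {
        userDao.update(userId, value);
        try {
            redisTemplate.opsForValue().set("user:" + userId, value);
        } catch (Exception e) {
            // The cache is now stale; a retry mechanism is needed (see section 4).
        }
    }
}
```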

2.2 High Concurrency Scenarios

Suppose that in a high-concurrency scenario, two write requests for the same piece of data of the same user, a and b, hit the business system at the same time, where request a carries the old value and request b carries the new value, as shown in the figure below:
(figure: two concurrent write requests, a and b, racing on database and cache writes)

Let's walk through the process shown above:

  • 1. Request a arrives first and has just finished writing the database. But due to a network hiccup, it stalls before it can write the cache.
  • 2. Request b then arrives and writes the database.
  • 3. Next, request b successfully writes the cache.
  • 4. A moment later, request a recovers from the stall and writes the cache.

This process ends in a very bad state: the database holds the new value, while the cache holds the old value.

Moreover, this method can seriously waste system resources.
Why?

The value written to the cache may not be a simple piece of data, but the final result of a very complex computation. In that case, every single write has to redo that expensive computation just to refresh the cache, which is a serious waste of system resources!

This is especially true when the business scenario is write-heavy and read-light: every write operation triggers a cache write that may never be read, which is a net loss.

3. Delete the cache instead of updating it

From the two schemes above, we can see that directly updating the cache brings many problems.
So let's change our way of thinking: instead of updating the cache, delete it.
Deleting the cache also comes in two orderings:

  • 1. Delete the cache first, then write the database
  • 2. Write the database first, then delete the cache

3.1 Delete the cache first, then write the database

The general process is as follows:
(figure: delete the cache first, then write the database)

Is there any problem with this flow?
Let's analyze it together, starting with high concurrency.

3.1.1 Problems under high concurrency

Suppose that in a high-concurrency scenario, for the same piece of data of the same user, there is a read request c and a write request d (an update operation), and they hit the business system at the same time, as shown below:
(figure: concurrent read request c and write request d with delete-then-write ordering)

The process above is as follows:

  • Request d arrives first and deletes the cache. But due to a network hiccup, it stalls before it can write the database.
  • Request c then arrives: it checks the cache, finds no data, queries the database, and gets the old value.
  • Request c writes that old value from the database into the cache.
  • Request d then recovers from the stall and writes the new value to the database.

In this process, the new value written by request d never makes it into the cache; instead, request c writes the old value back, which again leaves the cache and the database inconsistent.

So, can the data inconsistency problem in this scenario be solved?

3.1.2 Cache double deletion

There is a very simple solution for the scenario above, and the idea is straightforward:
after request d has successfully written the database, we delete the cache one more time.
(figure: delayed double deletion of the cache)

This is what we call cache double deletion:

delete the cache once before writing the database, and delete it again after writing the database.

A critical detail of this scheme: the second deletion is not performed immediately, but only after a certain time interval.

With the double-deletion scheme in place, let's revisit the high-concurrency scenario:

  • 1. Request d arrives first and deletes the cache. But due to a network hiccup, it stalls before it can write the database.
  • 2. Request c then arrives: it checks the cache, finds no data, queries the database, and gets the old value.
  • 3. Request c writes that old value from the database into the cache.
  • 4. Request d then recovers from the stall and writes the new value to the database.
  • 5. After a period of time, for example 500ms, request d deletes the cache again.

Seen this way, the inconsistency problem is indeed solved. But why do we have to wait a while before the second deletion?
Because after request d recovers from its stall and writes the new value to the database, request c writes the old value from the database into the cache.
If request d deletes too quickly, the cache is already deleted before request c writes the old value back, and that deletion accomplishes nothing. Remember why we delete the cache a second time: request c pollutes the cache with the old value from the database, and we need to remove that old value. So the second deletion must happen after request c's cache write; only then is the old value removed in time. If we delete too early, the old value is written afterwards and survives.
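A minimal sketch of delayed double deletion, assuming a Jedis connection pool and a hypothetical `UserDao`; the 500ms delay is an empirical value that should exceed one read-and-repopulate round trip:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class DoubleDeleteService {

    interface UserDao { void update(String userId, String value); } // hypothetical DAO

    private final JedisPool jedisPool = new JedisPool("localhost", 6379); // assumed address
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private final UserDao userDao;

    public DoubleDeleteService(UserDao userDao) {
        this.userDao = userDao;
    }

    public void updateUser(String userId, String value) {
        String cacheKey = "user:" + userId;
        // 1. First deletion, before the database write.
        try (Jedis jedis = jedisPool.getResource()) {
            jedis.del(cacheKey);
        }
        // 2. Write the database.
        userDao.update(userId, value);
        // 3. Second deletion, delayed so that it lands *after* any concurrent
        //    reader has written the old value back into the cache.
        scheduler.schedule(() -> {
            try (Jedis jedis = jedisPool.getResource()) {
                jedis.del(cacheKey);
            }
        }, 500, TimeUnit.MILLISECONDS);
    }
}
```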

One problem is solved, but another appears: what if the second cache deletion fails?
Since the next scheme also runs into this problem, I'll pull it out and explain the solutions to cache deletion failure separately.

3.2 Write the database first, then delete the cache

As we saw above, deleting the cache first and then writing the database can, under concurrency, leave the cache and the database inconsistent.
So next, let's focus on the scheme of writing the database first and then deleting the cache.
In a high-concurrency scenario, with one read request and one write request, the update flow is as follows:

  • Request e writes the database first. Due to a network hiccup, it stalls before it can delete the cache.
  • Request f queries the cache, finds data there, and returns it directly.
  • Request e recovers and deletes the cache.

In this flow, request f reads the old data exactly once, and then request e deletes it promptly. That doesn't look like a big problem. But what if the read request comes first?

  • Request f queries the cache, finds data there, and returns it directly.
  • Request e writes the database.
  • Request e deletes the cache.

Is there a problem in this case?
No problem at all!!!
But don't forget: if the cache has an expiration time, a cached entry can become invalid on its own.
(figure: the race when the cache entry has just expired)

The flow above is roughly as follows:

  • The cache entry reaches its expiration time and is automatically invalidated.
  • Request f queries the cache, finds no data there, and queries the old value from the database; but due to a network hiccup, it stalls before it can update the cache.
  • Request e writes the database and then deletes the cache.
  • Request f recovers and writes the old value into the cache.

Of course, the probability of this happening is quite low, because it requires both of the following to hold simultaneously:

  • The cache entry expires at exactly the wrong moment.
  • Request f's database read plus cache update takes longer than request e's database write plus cache deletion.

Querying the database is generally faster than writing to it, and request e additionally has to delete the cache after its write. So in most cases, the write request takes longer than the read request, and the race above is rare.

Let's make an interim summary:

It is recommended to use the scheme of writing the database first and then deleting the cache. Although it cannot avoid data inconsistency 100% of the time, the probability of inconsistency is the smallest compared with the other schemes.
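For reference, a minimal sketch of this recommended write path, again assuming a Jedis client and a hypothetical `UserDao`:

```java
import redis.clients.jedis.Jedis;

public class CacheAsideWriteService {

    interface UserDao { void update(String userId, String value); } // hypothetical DAO

    private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address
    private final UserDao userDao;

    public CacheAsideWriteService(UserDao userDao) {
        this.userDao = userDao;
    }

    public void updateUser(String userId, String value) {
        // 1. Write the database first.
        userDao.update(userId, value);
        // 2. Then delete the cache; the next read repopulates it from the database.
        try {
            jedis.del("user:" + userId);
        } catch (Exception e) {
            // Deletion can fail; section 4 covers the retry mechanisms.
        }
    }
}
```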

However, this scheme still has to deal with the case where deleting the cache fails. Let's look at that next.

4. What should I do if the cache deletion fails?

If the cache deletion fails, the cache and the database will again become inconsistent.
To solve this, we add a retry mechanism.
In the interface, if the database update succeeds but the cache deletion fails, we can immediately retry up to 3 times. If any attempt succeeds, return success directly. If all 3 attempts fail, record the failure, for example in a database table, for subsequent processing. A sketch of this synchronous retry follows below.
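A minimal sketch of the synchronous retry, assuming a Jedis client; the persistence step at the end is hypothetical:

```java
import redis.clients.jedis.Jedis;

public class RetryingCacheDeleter {

    private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address

    public boolean deleteCacheWithRetry(String cacheKey) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                jedis.del(cacheKey);
                return true; // any successful attempt means we are done
            } catch (Exception e) {
                // log and fall through to the next attempt
            }
        }
        recordFailureForLaterProcessing(cacheKey);
        return false;
    }

    private void recordFailureForLaterProcessing(String cacheKey) {
        // Hypothetical: persist the key (retry table, MQ message, ...) for the
        // asynchronous follow-up described in sections 4.1 and 4.2.
    }
}
```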

Of course, retrying synchronously inside the interface may hurt interface performance when concurrency is high. That's no big deal: the retry can be made asynchronous instead.
There are many ways to retry asynchronously:

  • 1. Create a separate thread for each retry, dedicated to the retry work. But in a high-concurrency scenario this may create too many threads and cause an OOM, so it is not recommended.
  • 2. Hand the retry task to a thread pool. But if the server restarts, queued retry tasks may be lost.
  • 3. Write the retry data to a database table, and then retry with a scheduled-task framework such as elastic-job.
  • 4. Write the retry request to message middleware such as MQ, and process it in an MQ consumer.
  • 5. Subscribe to MySQL's binlog; in the subscriber, whenever a data update is observed, delete the corresponding cache.

4.1 Scheduled tasks

We create a retry table with a field recording the number of retries, initialized to 0, and we configure a maximum number of retries. A scheduled task asynchronously reads the rows of the retry table and executes the cache deletion, incrementing the retry count by 1 on each attempt. If any attempt succeeds, the row is marked successful. If it still fails after 5 attempts, we record a failure status in the retry table and leave it for further processing.
For scheduled tasks in high-concurrency scenarios, elastic-job is recommended: compared with frameworks such as xxl-job, it can process the table in shards, improving throughput. The retry intervals can be set incrementally, for example 1, 2, 3, 5, 7 seconds, and so on. A sketch of such a job follows below.
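A minimal sketch of the retry job, written as a plain Spring scheduled task for brevity (elastic-job would additionally shard the rows across instances); the DAO, cache client, and record types are hypothetical stand-ins:

```java
import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CacheDeleteRetryJob {

    private static final int MAX_RETRY = 5;

    // All three types below are hypothetical stand-ins for your own code.
    interface RetryRecord { long getId(); String getCacheKey(); int getRetryCount(); }
    interface RetryDao {
        List<RetryRecord> findPending(int maxRetry);
        void markSuccess(long id);
        void incrementRetryCount(long id);
        void markFailed(long id);
    }
    interface CacheClient { void delete(String key); }

    private final RetryDao retryDao;
    private final CacheClient cache;

    public CacheDeleteRetryJob(RetryDao retryDao, CacheClient cache) {
        this.retryDao = retryDao;
        this.cache = cache;
    }

    @Scheduled(fixedDelay = 1000)
    public void retryPendingDeletes() {
        for (RetryRecord record : retryDao.findPending(MAX_RETRY)) {
            try {
                cache.delete(record.getCacheKey());
                retryDao.markSuccess(record.getId()); // any success ends the retries
            } catch (Exception e) {
                retryDao.incrementRetryCount(record.getId());
                if (record.getRetryCount() + 1 >= MAX_RETRY) {
                    retryDao.markFailed(record.getId()); // leave for manual handling
                }
            }
        }
    }
}
```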

One drawback of retrying with scheduled tasks is limited real-time responsiveness. This solution is not suitable for business scenarios with particularly high real-time requirements, but for ordinary scenarios it works fine.

But it has a great advantage: the retry data is persisted in the database, so it will not be lost.

4.2 MQ

In high-concurrency business scenarios, MQ (message queue) is one of the essential technologies. It can decouple components asynchronously and also shave peaks and fill valleys, which is very meaningful for system stability.

The MQ producer sends its messages under a specified topic to the MQ server. MQ consumers subscribe to that topic, read the message data, and run their business-logic processing.

The concrete scheme of retrying via MQ is as follows:
(figure: retrying cache deletion through MQ)

When a user operation has written the database but failed to delete the cache, produce an MQ message and send it to the MQ server.
The MQ consumer reads the message and retries the cache deletion up to 5 times. If any attempt succeeds, it returns success; if all 5 attempts fail, the message is written to the dead letter queue.
Of course, in this scheme the cache deletion can be made fully asynchronous: the user's write operation does not delete the cache at all after writing the database; it simply sends an MQ message, and the MQ consumer alone is responsible for deleting the cache. A sketch of the consumer side follows below.
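A minimal sketch of the consumer side, with hypothetical cache and dead-letter interfaces; substitute the client API of your MQ (RocketMQ, RabbitMQ, ...) accordingly:

```java
public class CacheDeleteConsumer {

    private static final int MAX_RETRY = 5;

    // Both types below are hypothetical stand-ins for your own code.
    interface CacheClient { void delete(String key); }
    interface DeadLetterQueue { void publish(String cacheKey); }

    private final CacheClient cache;
    private final DeadLetterQueue deadLetters;

    public CacheDeleteConsumer(CacheClient cache, DeadLetterQueue deadLetters) {
        this.cache = cache;
        this.deadLetters = deadLetters;
    }

    // Called by the MQ framework for each "cache delete" message.
    public void onMessage(String cacheKey) {
        for (int attempt = 1; attempt <= MAX_RETRY; attempt++) {
            try {
                cache.delete(cacheKey);
                return; // any success: the message is done
            } catch (Exception e) {
                // fall through and retry
            }
        }
        // 5 failures: hand the message to the dead letter queue for manual handling.
        deadLetters.publish(cacheKey);
    }
}
```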

Because MQ's real-time performance is relatively high, this improved scheme is also a good choice.

4.3 binlog

Both of the retry schemes above are somewhat invasive:

  • With scheduled tasks, extra logic must be added to the business code: if the cache deletion fails, the data has to be written into the retry table.
  • With MQ, if the cache deletion fails, the business code has to send an MQ message to the MQ server.

In fact, there is a more elegant implementation: monitor the binlog, for example with middleware such as canal.
The concrete scheme is as follows:
(figure: deleting the cache from a binlog subscriber)

  • The business interface writes the database and then simply returns success, caring about nothing else.
  • The MySQL server automatically writes the changed data into the binlog.
  • The binlog subscriber picks up the changed data and deletes the corresponding cache.

The business interface in this scheme is genuinely simplified: it only needs to care about database operations, while cache deletion is handled in the binlog subscriber, as sketched below.
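A sketch of such a subscriber based on canal's Java client; the canal server address, destination, table filter, and cache-key scheme are all assumptions for illustration:

```java
import java.net.InetSocketAddress;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;
import redis.clients.jedis.Jedis;

public class BinlogCacheInvalidator {

    public static void main(String[] args) throws Exception {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", ""); // assumed
        Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address
        connector.connect();
        connector.subscribe("mydb\\.user"); // assumed database.table filter

        while (true) {
            Message message = connector.getWithoutAck(100); // fetch up to 100 entries
            if (message.getId() == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(1000); // nothing new, poll again
                continue;
            }
            for (CanalEntry.Entry entry : message.getEntries()) {
                if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) {
                    continue;
                }
                CanalEntry.RowChange rowChange =
                        CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                if (rowChange.getEventType() == CanalEntry.EventType.UPDATE) {
                    for (CanalEntry.RowData rowData : rowChange.getRowDatasList()) {
                        // Delete the cache entry keyed by the row's primary key.
                        rowData.getAfterColumnsList().stream()
                                .filter(CanalEntry.Column::getIsKey)
                                .findFirst()
                                .ifPresent(col -> jedis.del("user:" + col.getValue()));
                    }
                }
            }
            connector.ack(message.getId()); // confirm the batch has been processed
        }
    }
}
```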
However, deletion can still fail here, and the binlog-driven deletion only runs once per change, so we still need a retry mechanism based on scheduled tasks or MQ.
In the binlog subscriber, if the cache deletion fails, send an MQ message to the MQ server and retry up to 5 times in the MQ consumer. If any attempt succeeds, return success directly; if it still fails after 5 retries, the message is automatically put into the dead letter queue, and manual intervention may be needed later.
(figure: the binlog subscriber combined with MQ retry and a dead letter queue)

Origin: blog.csdn.net/zhiyikeji/article/details/123940351