Cache data consistency analysis

Caching is a low-cost way to improve system performance, and developers have embraced it from day one.

But just as a release that fixes old problems can easily introduce new ones, caching has brought a headache of its own from the start: the problem of data consistency between the cache and the database has troubled developers ever since.

The approaches commonly used in the industry today:

Query scenario: check the cache first; on a miss, query the DB

Update scenario: update the DB first, then delete the cache

Insert scenario: insert into the DB first, then populate the cache
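A minimal sketch of the query path, using Python with the `redis` client; the key scheme and the `db_query` helper are assumptions for illustration, not from the original:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def db_query(user_id):
    # Hypothetical stand-in for a real database read (SELECT ... WHERE id = ?).
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    """Query path: check the cache first; on a miss, fall back to the DB."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit
    row = db_query(user_id)                 # cache miss: read the database
    r.set(key, json.dumps(row), ex=3600)    # rebuild the cache with a TTL
    return row
```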

Next, let's look at cache updates:

Should we update the DB first or the cache first? Should we update the cache or delete it? Under ordinary load you can do more or less whatever you like, but once you face a high-concurrency scenario, the choice deserves careful consideration.

1. Update the database first and then update the cache

Thread A: update database (t = 1s) --> update cache (t = 10s)

Thread B: update database (t = 3s) --> update cache (t = 5s)

This interleaving is easy to hit under concurrency: the threads' operations complete in different orders, so thread B's cache write gets overwritten by thread A's. The database ends up holding thread B's new value while the cache holds thread A's old value, and the cache stays dirty until the entry expires (assuming an expiration time was set).
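The race can be replayed deterministically; here is a toy sketch in which plain dicts stand in for the database and the cache, following the timeline above:

```python
db, cache = {}, {}

db["x"] = "A"       # t = 1s:  thread A updates the database
db["x"] = "B"       # t = 3s:  thread B updates the database
cache["x"] = "B"    # t = 5s:  thread B updates the cache
cache["x"] = "A"    # t = 10s: thread A updates the cache, clobbering B's write

assert db["x"] == "B" and cache["x"] == "A"   # DB has the new value, cache is dirty
```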

2. Update the cache first and then update the database

Thread A: update cache (t = 1s) --> update database (t = 10s)

Thread B: update cache (t = 3s) --> update database (t = 5s)

This is the mirror image of the previous case: the cache ends up with thread B's new value, while the database holds thread A's old value.

The reason the first two approaches misbehave under concurrency is, at bottom, that updating the cache and updating the database are two separate operations, and under concurrency we have no way to control their relative order: the thread that starts first does not necessarily finish first.

What if we simplify things: on an update, only write the database and delete the cache at the same time; the next query then misses the cache and rebuilds it. Would that solve the problem?

Based on this, the latter two solutions came into being.

3. Delete the cache first and then update the database

With this approach we are pleasantly surprised to find that the concurrency problem that plagued us earlier is indeed gone: both threads only modify the database, and no matter which modifies it first, the later write wins.

But now consider another scenario: two concurrent operations, one an update and one a query. The update deletes the cache; the query misses the cache, reads the old data from the database, and puts it back into the cache; only then does the update write the database. The cache is therefore left holding old data, i.e. dirty data. Obviously, this is not what we want.
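Replayed as a toy sketch in the same style as before (dicts stand in for the database and the cache):

```python
db, cache = {"x": "old"}, {"x": "old"}

cache.pop("x", None)     # update thread: step 1, delete the cache
stale = db["x"]          # query thread: cache miss, reads "old" from the database
cache["x"] = stale       # query thread: refills the cache with the old value
db["x"] = "new"          # update thread: step 2, finally updates the database

assert db["x"] == "new" and cache["x"] == "old"   # the cache is dirty again
```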

Delayed double deletion

Out of this problem grew the delayed double deletion technique:

1. Delete the cache

2. Update the database

3. Sleep for a while

4. Delete the cache again

The sleep is added mainly to ensure that while request A is sleeping, a concurrent request B has time to finish reading the data from the database and writing it back into the missing cache entry; when request A finishes sleeping, it deletes that (stale) cache entry.

Request A's sleep time therefore needs to exceed the time request B spends reading from the database plus writing to the cache.

However, the right sleep duration is something of a mystery and hard to estimate, so this scheme can only ensure consistency as far as possible; in extreme cases the cache can still become inconsistent.
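A minimal sketch of the four steps, assuming a Redis cache and a hypothetical `db_update` helper; the 500 ms figure is purely illustrative, since sizing the sleep is exactly the hard part:

```python
import time
import redis

r = redis.Redis()

def db_update(user_id, new_row):
    # Hypothetical stand-in for a real database write (UPDATE ... WHERE id = ?).
    pass

def update_with_double_delete(user_id, new_row):
    key = f"user:{user_id}"
    r.delete(key)                  # 1. delete the cache
    db_update(user_id, new_row)    # 2. update the database
    time.sleep(0.5)                # 3. sleep longer than a concurrent reader's
                                   #    "read DB + write cache" round trip
    r.delete(key)                  # 4. delete again to evict any stale refill
```

In practice the second delete is often handed to a background thread or a delay queue so the caller is not blocked for the whole sleep.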

For these reasons, this solution is still not recommended.

4. Update the database first and then delete the cache (cache aside)

This approach takes scheme 3 and swaps the order of the two steps. Re-checking the earlier scenario under this scheme: a query and an update run concurrently. We update the database first; at that moment the cache is still valid, so the concurrent query does not yet see the updated data. But the update invalidates the cache immediately afterwards, so subsequent queries pull the fresh data from the database, rather than forever fetching old data as in scheme 3.

This is in fact the standard design pattern for cache use: cache aside. It is also the strategy used in Facebook's paper "Scaling Memcache at Facebook".
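The cache-aside write path then looks like this, a minimal sketch with the same hypothetical `db_update` helper as above (the read path is the query sketch from the beginning of the article):

```python
import redis

r = redis.Redis()

def db_update(user_id, new_row):
    # Hypothetical database write, as in the earlier sketches.
    pass

def update_user(user_id, new_row):
    db_update(user_id, new_row)     # 1. update the database first
    r.delete(f"user:{user_id}")     # 2. then delete (invalidate) the cache;
                                    #    the next read misses and rebuilds it
```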

So, is this the perfect, foolproof strategy? Not quite. Look at one more scenario: a read misses the cache and goes to fetch the data from the database; meanwhile a write arrives, writes the database, and invalidates the cache; the earlier read then completes and puts its now-old data into the cache, producing dirty data. (Because the write is much slower than the read, the stale entry can persist for a while.)

In theory this can happen, but in practice the probability is very low, because it requires a read that misses the cache concurrently with a write. Database writes are much slower than reads and typically involve locking, so the read would have to enter the database before the write and yet update the cache after it. The probability of all these conditions lining up is simply not high.

So you either guarantee consistency via 2PC or a consensus protocol like Paxos, or you work to drive down the probability of dirty data under concurrency. Facebook took the probability-reduction route, because 2PC is too slow and Paxos too complex. And in any case, set an expiration time on cache entries, so that even inconsistent data expires after a while and is refreshed with consistent data.

5. Operation failures

The scenarios above, complex as they are, still assume the ideal case: every database and cache operation succeeds. In real production, operations can fail because of network jitter, a service going offline, and so on.

For example: the application wants to update the value of data X from 1 to 2. It successfully updates the database, but the subsequent delete of X's cache entry fails. The cached value of X in Redis is still 1, and the database and the cache are now inconsistent.

If a subsequent request then accesses data X, it queries Redis first; because the cache entry was never deleted, it hits and reads the old value 1.

In fact, no matter whether you operate on the database or the cache first, a data consistency problem arises whenever the second operation fails.

Now that the cause of the problem is known, how do we solve it? There are two methods:

1. Retry mechanism;

2. Subscribe to the MySQL binlog and then operate on the cache.

Retry mechanism:

We can introduce a message queue: put the data involved in the second operation (the cache delete) onto the queue, and let a consumer carry it out.

  • If the application fails to delete the cache, it can re-read the message from the queue and retry the delete; this is the retry mechanism. Of course, if the delete still fails after a certain number of retries, we need to report an error to the business layer.

  • If the cache delete succeeds, the message must be removed from the queue so it is not processed again; otherwise, keep retrying.

The sketch below illustrates the retry flow.
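This is a hedged sketch, assuming Redis as the cache and using Python's in-process `queue.Queue` as a stand-in for a real message queue (Kafka, RocketMQ, ...); the retry limit and the error reporting are illustrative, not prescriptive:

```python
import queue
import redis

r = redis.Redis()
mq = queue.Queue()      # stand-in for a real message queue
MAX_RETRIES = 3

def consume_cache_deletes():
    """Consumer: pop delete-cache messages and retry failed deletes."""
    while not mq.empty():
        msg = mq.get()
        try:
            r.delete(msg["key"])     # the second operation: delete the cache
        except redis.RedisError:
            msg["retries"] += 1
            if msg["retries"] > MAX_RETRIES:
                print("report to business layer:", msg["key"])   # give up
            else:
                mq.put(msg)          # requeue and retry later
        finally:
            mq.task_done()           # this attempt is finished either way

# Producer side: after updating the DB, enqueue the cache delete.
mq.put({"key": "user:42", "retries": 0})
consume_cache_deletes()
```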

Summary:
1. cache aside is not a panacea

Although cache aside can be called the best practice for cache use, it also lowers the cache hit rate (deleting the cache on every update naturally makes hits rarer), so it is better suited to scenarios where hit-rate requirements are not especially strict. If a higher hit rate is required, you still need to fall back to updating the cache after updating the database.

2. Solutions to cache data inconsistency

Introducing distributed locks

Acquire the lock before updating the cache; if it is already held, block and wait for the holding thread to release it before attempting the update. This serializes updates, at some cost to concurrent performance.
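A minimal sketch of such a lock on top of Redis's SET with NX/EX; the key names and TTL are illustrative, and in production the unlock check would be a Lua script so it is atomic:

```python
import time
import uuid
import redis

r = redis.Redis()

def update_cache_with_lock(key, new_value, lock_ttl=10):
    lock_key = f"lock:{key}"
    token = str(uuid.uuid4())
    # SET NX EX: take the lock only if nobody holds it; the TTL guards
    # against a holder that crashes without releasing.
    while not r.set(lock_key, token, nx=True, ex=lock_ttl):
        time.sleep(0.05)             # lock is held; wait and retry
    try:
        r.set(key, new_value)        # the cache update we wanted to serialize
    finally:
        if r.get(lock_key) == token.encode():   # release only our own lock
            r.delete(lock_key)
```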

Set a shorter cache expiration time

Setting a shorter cache expiration time can make the data inconsistency problem exist for a shorter time and have a relatively small impact on the business. But at the same time, this actually reduces the cache hit rate, which brings us back to the previous problem...

So, to sum up: there is no eternally best solution, only the right choice of solution for a given business scenario.
