[Cache] Cache penetration, cache breakdown, cache avalanche and their solutions

User data is generally stored in a database, and database data ultimately lives on disk. Of all the hardware in a computer, disk has just about the slowest read and write speeds.

If every user request goes straight to the database, the database is easily overwhelmed as request volume grows. To keep users from hitting the database directly, Redis is commonly used as a cache layer.

Because Redis is an in-memory database, caching database data in Redis is equivalent to caching it in memory. Memory reads and writes are several orders of magnitude faster than disk, which greatly improves system performance.
Introducing a cache layer, however, brings three classic cache problems: cache avalanche, cache breakdown, and cache penetration.

These three problems also come up very frequently in interviews. We need to understand not only how they occur, but also how to solve them.

Cache penetration

With a cache avalanche or cache breakdown, the data the application wants to access still exists in the database; once the cache is repopulated with it, the pressure on the database subsides. Cache penetration is different.

When the data a user requests is in neither the cache nor the database, the request misses the cache and the database lookup also comes up empty, so the cache can never be populated to serve subsequent requests. When a large number of such requests arrive, the pressure on the database rises sharply. This is the cache penetration problem.
There are generally two situations in which cache penetration occurs:

  • Operational mistakes: the data is deleted by accident from both the cache and the database, so neither holds it;
  • Malicious attacks: an attacker deliberately floods the system with reads for data that does not exist.

There are four common solutions for cache penetration:

  • The first: restrict illegal requests;
  • The second: cache null or default values;
  • The third: use a Bloom filter to quickly determine whether the data exists, avoiding a database query just to check existence;
  • The fourth: blacklist abusive users.

The first solution: restrict illegal requests

Cache penetration often comes from a flood of malicious requests for non-existent data. At the API entry point, therefore, we should judge whether the request parameters are reasonable, whether they contain illegal values, and whether the requested fields exist. If a request is judged malicious, return an error immediately, so it never reaches the cache or the database.
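
A minimal sketch of such entry-point validation, assuming a user lookup keyed by a numeric ID; the class, the ID bound, and the downstream lookup are all hypothetical:

```java
public class UserApi {
    // Assumed upper bound on IDs the business has ever issued (illustrative).
    private static final long MAX_USER_ID = 100_000_000L;

    public String getUser(String idParam) {
        // Reject malformed parameters before touching the cache or the database.
        if (idParam == null || !idParam.matches("\\d{1,10}")) {
            return "error: invalid id";
        }
        long id = Long.parseLong(idParam);
        // Reject IDs outside the range the business can legitimately hold.
        if (id <= 0 || id > MAX_USER_ID) {
            return "error: invalid id";
        }
        return queryCacheThenDb(id); // only plausible requests reach this point
    }

    private String queryCacheThenDb(long id) {
        return "..."; // placeholder for the normal cache-then-database lookup
    }
}
```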

The second solution: cache null or default values

When we discover cache penetration in a live system, we can store a null or default value in the cache for the queried data. Subsequent requests then read that null or default value from the cache and return it to the application, without querying the database again.
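
A minimal sketch of null-value caching, assuming the Jedis client; the key layout, the sentinel string, and the TTLs are illustrative choices. The short TTL on the sentinel matters: if the data is created later, the placeholder soon ages out.

```java
import redis.clients.jedis.Jedis;

public class NullValueCache {
    private static final String NULL_PLACEHOLDER = "__NULL__"; // sentinel for "not in DB"

    public String getUser(Jedis jedis, String id) {
        String key = "user:" + id;           // key layout is illustrative
        String cached = jedis.get(key);
        if (cached != null) {
            // Either real data or the null sentinel; both avoid hitting the DB.
            return NULL_PLACEHOLDER.equals(cached) ? null : cached;
        }
        String fromDb = queryDb(id);         // placeholder for the real DB query
        if (fromDb == null) {
            // Cache the miss with a short TTL so a later insert becomes visible.
            jedis.setex(key, 60, NULL_PLACEHOLDER);
            return null;
        }
        jedis.setex(key, 3600, fromDb);
        return fromDb;
    }

    private String queryDb(String id) { return null; } // stand-in for a SELECT
}
```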

The third solution: use a Bloom filter to quickly determine whether the data exists, avoiding a database query just to check existence

We can mark data in a Bloom filter as it is written to the database. Then, when a user request arrives and the business thread confirms the cache is empty, it can query the Bloom filter to quickly determine whether the data exists, with no need to query the database just to find out.

Even when cache penetration occurs, the bulk of requests then only query Redis and the Bloom filter rather than the database, so the database keeps operating normally. Redis itself also supports Bloom filters, via the RedisBloom module.

So the question is: how does a Bloom filter work? Let me introduce it next.

A Bloom filter consists of two parts: a bit array initialized to all zeros and N hash functions. When data is written to the database, we mark it in the Bloom filter, so that the next time we need to know whether some data is in the database, we only have to query the Bloom filter. If the data is not marked there, it is not in the database.

The Bloom filter marks a piece of data in three steps:

  • First, run the data through each of the N hash functions, producing N hash values;
  • Second, take each of the N hash values modulo the length of the bit array to get each value's position in the array;
  • Third, set the bit at each of those positions to 1.

For example, suppose a Bloom filter has a bit array of length 8 and 3 hash functions.
After data x is written to the database, marking x in the Bloom filter means computing 3 hash values for it with the 3 hash functions and taking each value modulo 8. Suppose the results are 1, 4, and 6: the bits at positions 1, 4, and 6 of the array are set to 1. When the application later wants to know whether x is in the database, the Bloom filter only has to check whether positions 1, 4, and 6 are all 1; if any one of them is 0, x is considered not to be in the database.

Because a Bloom filter performs lookups through hash functions, its efficiency comes with the possibility of hash collisions: data x and data y may both map to positions 1, 4, and 6, even though the database contains no y at all. In other words, false positives are possible.

So when a Bloom filter query says the data exists, it does not necessarily exist in the database; but when the query says the data does not exist, it is definitely not in the database.
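
Here is a minimal in-process sketch of the mechanism in Java; the bit-array length, seeds, and hash function are simplistic stand-ins. In practice you would more likely use the RedisBloom module (BF.ADD / BF.EXISTS) or a library implementation such as Guava's BloomFilter.

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;   // the bit array, initially all zeros
    private final int size;      // length of the bit array
    private final int[] seeds;   // one seed per hash function

    public SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.seeds = new int[hashCount];
        for (int i = 0; i < hashCount; i++) seeds[i] = 31 * (i + 1); // arbitrary seeds
    }

    // Step 1: hash with each function; step 2: mod by the array length;
    // step 3: set the resulting positions to 1.
    public void add(String value) {
        for (int seed : seeds) {
            bits.set(hash(value, seed));
        }
    }

    // If any position is 0 the value is definitely absent; all 1s only means
    // "possibly present", since hash collisions cause false positives.
    public boolean mightContain(String value) {
        for (int seed : seeds) {
            if (!bits.get(hash(value, seed))) return false;
        }
        return true;
    }

    private int hash(String value, int seed) {
        int h = 0;
        for (char c : value.toCharArray()) h = h * seed + c;
        return Math.floorMod(h, size); // keep the index inside the bit array
    }
}
```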

The fourth solution: blacklist abusive users

When anomalies appear, monitor the accessed objects and data in real time and analyze user behavior; for deliberate attackers, crawlers, and probers, restrict the specific users involved.

Of course, such anomalies may be caused by cache penetration or by something else entirely, so take whatever measure fits the actual situation.

Cache breakdown

A business usually has some data that is accessed very frequently, such as the items in a flash sale. Such frequently accessed data is called hot data.

If a hot piece of data expires from the cache at the same moment that a large number of requests arrive for it, none of those requests can be served from the cache and they all go straight to the database, which is easily overwhelmed by the high concurrency. This is the cache breakdown problem.

Cache breakdown is clearly very similar to cache avalanche; you can think of it as a subset of cache avalanche.

Cache breakdown can be handled with two approaches (both are covered in more detail in the cache avalanche section below):

  • The mutual-exclusion lock scheme: ensure that only one business thread rebuilds the cache at a time. A request that fails to acquire the mutex either waits for the lock to be released and re-reads the cache, or returns a null or default value. Within a single process this can be done with synchronized or a Lock; in a distributed environment, use a distributed lock (see the lock sketch in the mutex subsection below).
  • Do not set an expiration time on hot data; have a background task update the cache asynchronously, or have the background thread refresh the cache and reset the expiration time shortly before the hot data is due to expire.

Cache avalanche

To keep cached data consistent with the database, we usually set an expiration time on data in Redis. When cached data has expired and a user request misses the cache, the business system rebuilds the cache: it reads the database and writes the data back to Redis, so that subsequent requests hit the cache directly.
Then, if a large amount of cached data expires (becomes invalid) at the same time, or Redis goes down, a flood of user requests that Redis cannot serve all falls directly on the database. The sudden spike in pressure can, in severe cases, take the database down, setting off a chain reaction that crashes the entire system. This is the cache avalanche problem.
As you can see, there are two reasons for cache avalanche:

  • A large amount of data expires at the same time;
  • Redis downtime;

Different triggers require different coping strategies.

A lot of data expires at the same time

For the cache avalanche problem caused by a large amount of data expiring at the same time, common solutions include the following:

  • Set expiration times evenly;
  • Mutex;
  • Dual-key strategy;
  • Background cache update;

  1. Set expiration times evenly

If we must set expiration times on cached data, we should avoid giving a large batch of data the same expiration time. When writing to the cache, we can add a random offset to each key's expiration time (for example, a random 1 to 5 minutes), so that keys expire spread out rather than all at once.
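
A sketch of the jitter idea, assuming the Jedis client; the one-hour base TTL and the 1-to-5-minute jitter range are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class JitteredExpiry {
    // Base TTL plus a random 1-5 minute offset, so keys written together
    // do not all expire at the same instant.
    public void cache(Jedis jedis, String key, String value) {
        int baseTtl = 3600; // one hour, illustrative
        int jitter = ThreadLocalRandom.current().nextInt(60, 301); // 60-300 seconds
        jedis.setex(key, baseTtl + jitter, value);
    }
}
```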

  2. Mutex

Consider using a queue or a lock to ensure that the cache is written by a single thread, at some cost in concurrency. When a business thread handling a request finds the data missing from Redis, it acquires a mutex, which guarantees that only one request at a time rebuilds the cache (reads the database, then writes the data to Redis), and it releases the lock once the rebuild is done. A request that fails to acquire the mutex either waits for the lock to be released and re-reads the cache, or returns a null or default value.

When implementing the mutex, it is best to set a timeout on the lock. Otherwise, if the first request takes the lock and then blocks because of some accident, never releasing it, every other request stays stuck waiting for the lock and the whole system stops responding.
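
A sketch of the rebuild-under-lock pattern using Redis's SET with NX and EX as a distributed lock, assuming the Jedis client; the lock timeout, cache TTL, and retry policy are illustrative. The token comparison on release prevents one request from deleting a lock that has since passed to another:

```java
import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class CacheRebuildLock {
    public String get(Jedis jedis, String key) {
        String value = jedis.get(key);
        if (value != null) return value;

        String lockKey = "lock:" + key;
        String token = UUID.randomUUID().toString();
        // SET NX EX: acquire the lock atomically, with a timeout so a crashed
        // holder cannot block everyone forever.
        if ("OK".equals(jedis.set(lockKey, token, SetParams.setParams().nx().ex(10)))) {
            try {
                value = queryDb(key);          // rebuild from the database
                jedis.setex(key, 3600, value); // repopulate the cache
            } finally {
                // Release only if we still own the lock: compare token, then delete.
                String script =
                    "if redis.call('get', KEYS[1]) == ARGV[1] then " +
                    "return redis.call('del', KEYS[1]) else return 0 end";
                jedis.eval(script, Collections.singletonList(lockKey),
                           Collections.singletonList(token));
            }
            return value;
        }
        // Lost the race: wait briefly for the winner, then re-read the cache
        // (a null or default value could be returned here instead).
        try { Thread.sleep(50); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return jedis.get(key);
    }

    private String queryDb(String key) { return "..."; } // stand-in for the real query
}
```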

  3. Dual-key strategy

We can store the cached data under two keys: a primary key that has an expiration time, and a backup key that does not. They are simply different keys holding the same value, so the backup acts as a copy of the cached data.

When the business thread cannot read the data under the primary key, it returns the data under the backup key instead. When the cache is updated, both the primary and the backup keys are written at the same time.
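
A sketch of the dual-key read and write paths, assuming the Jedis client; the primary:/backup: prefixes are just an illustrative naming scheme:

```java
import redis.clients.jedis.Jedis;

public class DualKeyCache {
    // The primary key carries a TTL; the backup never expires and acts as a copy.
    public void put(Jedis jedis, String key, String value, int ttlSeconds) {
        jedis.setex("primary:" + key, ttlSeconds, value);
        jedis.set("backup:" + key, value);
    }

    public String get(Jedis jedis, String key) {
        String value = jedis.get("primary:" + key);
        if (value != null) return value;
        // Primary expired: serve the stale-but-present backup copy.
        return jedis.get("backup:" + key);
    }
}
```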

  4. Background cache update

The business thread is no longer responsible for updating the cache, and the cached data is given no expiration time. Hot data can be treated as never expiring, with a background thread updating the cache asynchronously on a schedule. This suits scenarios that do not require strict cache consistency.
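
A sketch of such a background refresher built on a ScheduledExecutorService; the hot-key list, the 30-second interval, and the database lookup are placeholders, and a real system would draw connections from a pool rather than share one Jedis instance across threads:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import redis.clients.jedis.Jedis;

public class BackgroundRefresher {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Refresh the hot keys every 30 seconds; the keys carry no TTL, so business
    // threads never see an expiry-driven miss.
    public void start(Jedis jedis) {
        scheduler.scheduleAtFixedRate(() -> {
            for (String key : hotKeys()) {
                jedis.set(key, queryDb(key)); // no expiration set, on purpose
            }
        }, 0, 30, TimeUnit.SECONDS);
    }

    private Iterable<String> hotKeys() { return java.util.List.of("product:1"); } // illustrative
    private String queryDb(String key) { return "..."; } // stand-in for the real query
}
```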

In practice, giving cached data no expiration time does not mean it stays in memory forever: when system memory runs low, some cached data is evicted. Between the eviction and the next scheduled background update, any business thread that fails to read the cache gets a null value, and from the business's point of view the data is lost.

There are two ways to solve the above problem.

In the first, the background thread is responsible not only for updating the cache on schedule but also for frequently checking whether the cached entries are still present. If an entry is found to be gone (likely evicted under memory pressure), the thread immediately reads the data from the database and writes it back to the cache.

The check interval in this approach must not be too long: while the entry is missing, users get null values instead of real data, so the interval should ideally be on the order of milliseconds. Even then there is always some gap, and the user experience is mediocre.

In the second, when a business thread finds that cached data is missing (has been evicted), it sends a message through a message queue to notify the background thread to update the cache. On receiving the message, the background thread first checks whether the key exists in the cache; if it does, no update is needed; if it does not, the thread reads the database and loads the data into the cache. Compared with the first approach, this one updates the cache more promptly and gives a better user experience.

When a service first launches, it is better to load data into the cache ahead of time rather than waiting for user traffic to trigger cache building. This is what is known as cache warm-up, and the background update mechanism is well suited to that job too.

Redis downtime

For the cache avalanche problem caused by Redis failure and downtime, the common solutions are as follows:

  • A service circuit breaker or request rate limiting;
  • A highly reliable Redis cache cluster;

  1. Service circuit breaker or request rate limiting

When a Redis outage causes a cache avalanche, we can trip a service circuit breaker: suspend the business application's access to the cache service and return errors directly, rather than letting the traffic through to the database. This relieves the pressure on the database and keeps it running normally; once Redis recovers, business applications are allowed to access the cache service again.

A circuit breaker protects the database's normal operation, but while access to the cache service is suspended, none of the business can serve requests normally.

To reduce the impact on the business, we can use request rate limiting instead: let only a small number of requests through to the database for processing and reject the rest outright at the entrance. Once Redis is back to normal and the cache has been warmed up, the rate limit is lifted.
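
As one simple way to cap what reaches the database, here is a sketch of a concurrency limiter at the service entrance; the permit count is an assumed capacity, and a production system would more likely use a dedicated rate-limiting or circuit-breaking library:

```java
import java.util.concurrent.Semaphore;

public class DbRateLimiter {
    // Allow only a small number of concurrent requests through to the database
    // while Redis is down; everything else is rejected at the entrance.
    private final Semaphore permits = new Semaphore(50); // assumed capacity

    public String handle(String key) {
        if (!permits.tryAcquire()) {
            return "error: service busy, try again later"; // fail fast
        }
        try {
            return queryDb(key);
        } finally {
            permits.release();
        }
    }

    private String queryDb(String key) { return "..."; } // stand-in for the real query
}
```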

  2. Build a highly reliable Redis cache cluster

Circuit breaking, rate limiting, and degradation are ways of responding to a cache avalanche once it happens. A better idea is prevention: build a highly reliable Redis cache cluster with master and slave nodes.

If the Redis master node fails and goes down, a slave node can be switched in as the new master and keep providing cache service, avoiding the cache avalanche that a Redis outage would otherwise cause.

Summary

A cache layer faces three classic problems: cache avalanche, cache breakdown, and cache penetration.

Cache avalanche and cache breakdown are both caused chiefly by data being absent from the cache, which sends a large volume of requests to the database; the sudden surge in database pressure can easily set off a chain reaction and crash the system. But once the data is loaded back into the cache, applications again read it quickly from the cache, stop touching the database, and the database's load drops instantly. That is why the solutions for avalanche and breakdown are similar.

Cache penetration, by contrast, is caused by data that exists in neither the cache nor the database, so its solutions differ from those for avalanche and breakdown.

To recap, here is how the three problems differ and how to counter each:

  • Cache avalanche: a large amount of data expires at once, or Redis goes down. Counter it by spreading out expiration times, using a mutex, keeping dual keys, and updating the cache in the background; for outages, add a circuit breaker or rate limiting and build a highly reliable Redis cluster.
  • Cache breakdown: a single piece of hot data expires under heavy concurrent access. Counter it with a mutex, or by not expiring hot data and refreshing it from the background.
  • Cache penetration: the data exists in neither the cache nor the database. Counter it by rejecting illegal requests, caching null or default values, using a Bloom filter, and blacklisting abusive users.
