Redis: Cache Avalanche, Breakdown, and Penetration

When it comes to Redis, I'm sure cache avalanche, penetration, and breakdown are familiar topics from interviews or day-to-day development. Even if you have never run into them yourself, you must have heard of them. What is the difference between the three? How do we prevent them? Let's invite our next victim.

Interview begins

A middle-aged man with a big belly, wearing a plaid shirt and carrying a Mac covered in scratches, walks up to you. Looking at his thinning hair, you think: this must be a top architect! But you have read your books, and you hold your head high.

Young man, I see Redis is written on your resume, so let's get straight to the point and ask a few classic big questions. Do you understand cache avalanche?

Hello, handsome and charming interviewer. Yes, I do. These days, things like an e-commerce homepage and other hot data are usually cached. Generally, the cache is refreshed by a scheduled task, or updated after a cache miss. Refreshing by scheduled task has a problem.

A simple example: suppose all homepage keys have an expiration time of 12 hours and are refreshed at 12 noon. At midnight I run a flash sale and a huge number of users flood in. Say there are 6,000 requests per second; the cache could have absorbed 5,000 of those per second, but at that moment every key in the cache has expired. So all 6,000 requests per second fall onto the database, and the database inevitably cannot handle it and starts throwing alarms. In a real incident the database may simply hang and stop responding. At that point, without a dedicated plan for this kind of failure, the anxious DBA restarts the database, only to have it immediately killed again by the incoming traffic. That is cache avalanche as I understand it.

I deliberately reviewed the projects I have worked on: no matter how large the QPS, I never allow that much traffic to hit the DB directly. Without slow SQL, and with large databases and tables properly sharded, the DB might still survive heavy traffic, but the gap compared with Redis is still big.

When the cache fails across a large area all at once, at that moment Redis might as well not exist, and letting that level of traffic hit the database directly is close to catastrophic. If the library that goes down is a user-service library, then almost every interface that depends on it starts throwing errors. Without a circuit breaker or a similar strategy, the whole system basically goes down in an instant. No matter how many times you restart, the fresh traffic kills it again, and by the time you manage a clean restart, the users have long since gone to bed, having lost all confidence in your product. What a junk product, they'll say.

The interviewer strokes his hair: hmm, not bad. So how do you deal with this situation?

 

Cache avalanche is easy to handle: when writing data into Redis in batches, just add a random offset to each key's expiration time. That way the keys are guaranteed not to expire en masse at the same moment, and I believe Redis can withstand the remaining traffic.

setRedis(key, value, time + Math.random() * 10000);
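As a slightly fuller sketch of the same idea (the class and method names here are my own, not from the original; a real client would then issue something like SETEX with this TTL):

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomTtl {
    // Base TTL plus a random jitter, so keys written in the same batch
    // do not all expire in the same second and avalanche onto the DB.
    public static int ttlWithJitter(int baseSeconds, int maxJitterSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextInt(maxJitterSeconds + 1);
    }

    public static void main(String[] args) {
        // 12-hour base TTL with up to 10 minutes of jitter.
        int ttl = ttlWithJitter(12 * 3600, 600);
        System.out.println("ttl = " + ttl + " seconds");
    }
}
```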

If Redis is deployed as a cluster, evenly distributing the hot data across the different Redis nodes can also prevent everything from expiring at once. In production, though, I usually map a single service to a single Redis shard for ease of management, which leaves the risk of simultaneous failure in place; so randomized expiration times remain a good strategy.

Alternatively, set the hotspot data to never expire and refresh the cache only when there is an update operation (for example, when operations staff update a homepage product, refresh the cache right then, setting no expiration time). The data on an e-commerce homepage can use this approach too; it's a safe bet.
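A minimal sketch of that "never expire, refresh on update" idea, with in-memory maps standing in for the database and for Redis (the class and method names are mine, purely for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WriteThroughCache {
    // Stand-ins for the real database and for Redis.
    static final Map<String, String> db = new ConcurrentHashMap<>();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Update path: write the DB, then refresh the cache in the same operation.
    // The cached entry is written with no expiration time at all.
    public static void updateProduct(String id, String detail) {
        db.put(id, detail);
        cache.put("product:" + id, detail);
    }

    // Read path: hot keys are always served straight from the cache.
    public static String getProduct(String id) {
        return cache.get("product:" + id);
    }
}
```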

Do you know about cache penetration and breakdown? Can you explain how they differ from an avalanche?

Sure. Let me talk about cache penetration first. Cache penetration refers to requests for data that exists neither in the cache nor in the database, with users continuously issuing such requests. For instance, our database ids auto-increment from 1, but someone keeps requesting id -1, or an id so large it cannot possibly exist. In this case the user is very likely an attacker, and the attack puts excessive pressure on the database and can seriously damage it.

A small single-machine system can basically be killed with nothing more than Postman, like the Alibaba Cloud instance I bought for myself.

Like this: if you don't validate parameters, and every database id is greater than 0, I can keep hitting you with a parameter less than 0. Every request bypasses Redis and goes straight to the database, the database finds nothing each time, and at a peak of high concurrency it easily collapses.

As for cache breakdown, it is somewhat similar to cache avalanche, but a little different. A cache avalanche is a large area of cache entries expiring at once and crashing the DB. Cache breakdown, by contrast, concerns a single key: a hot key is constantly carrying heavy concurrency, with all of that traffic focused on this one point. The instant the key expires, the sustained concurrency punches through the cache and hits the database directly, like drilling a hole in an otherwise intact bucket.

The interviewer shows a relieved look. So how do you solve these?

For cache penetration, I add validation at the interface layer: user authentication checks, parameter checks, and an immediate error code returned for illegal parameters.

For example: do basic validation on id, and directly intercept anything with id <= 0.
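That interception can be a one-line guard at the very top of the interface layer (a sketch; the class name is my own):

```java
public class ParamGuard {
    // Interface-layer sanity check: our ids auto-increment from 1,
    // so anything <= 0 can be rejected before it ever reaches
    // the cache or the database.
    public static boolean isValidId(long id) {
        return id > 0;
    }
}
```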

One thing I want to mention here: when writing programs we must keep a mindset of "distrust", that is, don't trust any caller. For example, if you expose an API that takes a few parameters, then as the callee you should consider and validate every possible parameter condition, because you don't trust whoever is calling you, and you don't know what they will pass in.

A simple example: your interface is a paged query, but you don't limit the page size. What if the caller asks for Integer.MAX_VALUE rows in one go? A single request takes you several seconds, a little concurrency on top of that, and you're down. If it's a company colleague calling, you can just track them down, no big deal; but what if it's a hacker or a competitor? I don't need to spell out what happens if they hammer your interface on Double Eleven. This is something my previous lead told me, and I think everyone should keep it in mind.
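Defending against that is just a clamp on the caller-supplied page size (the default and the cap below are placeholder values I chose, not from the original):

```java
public class PageGuard {
    static final int DEFAULT_PAGE_SIZE = 10; // assumed default
    static final int MAX_PAGE_SIZE = 100;    // assumed cap

    // Clamp the caller-supplied page size so that a request for
    // Integer.MAX_VALUE rows cannot turn into a multi-second query.
    public static int clampPageSize(int requested) {
        if (requested <= 0) {
            return DEFAULT_PAGE_SIZE;
        }
        return Math.min(requested, MAX_PAGE_SIZE);
    }
}
```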

For data that is found neither in the cache nor in the database, you can also write a placeholder value for the corresponding key, such as null, "not found", or "retry later" (ask the product team, or pick whatever suits the scenario), and give it a short expiration time, say 30 seconds (setting it too long would break normal use once the data genuinely appears).
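A self-contained sketch of that null-placeholder caching (a ConcurrentHashMap stands in for Redis here; with real Redis you would SETEX the placeholder with a ~30-second TTL; all names are mine):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class NullCaching {
    // Sentinel meaning "we already checked: the database has nothing".
    static final String NULL_PLACEHOLDER = "__NULL__";
    // In-memory stand-in for Redis; real code would SETEX with a short TTL.
    static final Map<String, String> cache = new ConcurrentHashMap<>();
    static int dbHits = 0; // counts how often the DB is actually touched

    static String slowDbLookup(String key) {
        dbHits++;
        return null; // pretend the row does not exist
    }

    public static Optional<String> get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            // Placeholder hit: answer "not found" without touching the DB.
            return NULL_PLACEHOLDER.equals(cached) ? Optional.empty() : Optional.of(cached);
        }
        String fromDb = slowDbLookup(key);
        cache.put(key, fromDb == null ? NULL_PLACEHOLDER : fromDb);
        return Optional.ofNullable(fromDb);
    }
}
```

The second and later requests for the same missing id hit the placeholder and never reach the database, which is exactly what blunts a repeated-id attack.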

This prevents an attacker from brute-forcing the same id over and over; and remember, a normal user would never issue that many requests in a single second. I also recall that Nginx at the gateway layer has a configuration item that lets operations block any single IP whose requests per second exceed a threshold.
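For illustration, that Nginx-layer per-IP rate limit looks roughly like this (the zone name and the numbers are placeholders, not a recommendation):

```nginx
# Track clients by IP; allow roughly 10 requests/second per IP.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location /api/ {
        # Permit short bursts of up to 20 extra requests;
        # anything beyond that is rejected (503 by default).
        limit_req zone=perip burst=20 nodelay;
    }
}
```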

Do you have any other way?

Yes, I remember Redis is often paired with a Bloom filter, which is very effective at preventing penetration. Its principle is simple: it uses an efficient data structure and algorithm to quickly determine whether your key exists in the database at all. If it cannot exist, return immediately; if it might exist, query the DB, refresh the KV pair in the cache, and then return.
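A toy Bloom filter to show the mechanics (this is my own minimal sketch, not a production implementation; real systems would use something like Guava's BloomFilter or the RedisBloom module):

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k bit positions from two base hashes
    // (the Kirsch-Mitzenmacher double-hashing trick).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(position(key, i));
        }
    }

    // "false" is definitive: the key was never added, so skip the DB.
    // "true" may be a false positive, so the DB is still checked.
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The key property for penetration defense is that "not in the filter" is a guaranteed miss, so requests for ids that were never inserted get turned away without ever touching the database.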

Some friends then ask: what if hackers launch the attack from many IPs at the same time? I honestly haven't fully thought that one through, but an ordinary-level hacker doesn't control that many zombie machines, and even a normal Redis cluster can resist that level of traffic; I doubt they would bother with a small company anyway. Once the system's high availability is in place, a cluster is quite capable.

 

For cache breakdown, set the hotspot data to never expire, or add a mutex lock.

Thoughtful as I am, I have of course prepared some code for you.
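Here is a minimal, self-contained sketch of the mutex approach (in-memory maps stand in for Redis and the DB, and a JVM-local ReentrantLock stands in for what would be a distributed lock, such as Redis SET NX, in a multi-instance deployment; all names are mine):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class MutexRebuild {
    // Stand-ins for Redis and the database.
    static final Map<String, String> cache = new ConcurrentHashMap<>();
    static final ReentrantLock lock = new ReentrantLock();
    static int dbHits = 0; // counts how often the DB is actually queried

    static String loadFromDb(String key) {
        dbHits++;
        return "value-for-" + key;
    }

    public static String get(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value; // cache hit: the common, fast path
        }
        if (lock.tryLock()) {
            try {
                // Double-check after winning the lock: another thread
                // may have rebuilt the cache while we were waiting.
                value = cache.get(key);
                if (value == null) {
                    value = loadFromDb(key);
                    cache.put(key, value);
                }
            } finally {
                lock.unlock();
            }
            return value;
        } else {
            // Losers back off briefly instead of stampeding the DB,
            // then retry and (usually) hit the freshly rebuilt cache.
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return get(key);
        }
    }
}
```

Only the thread that wins the lock rebuilds the expired hot key; everyone else waits a moment and reads the rebuilt value, so the database sees one query instead of thousands.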

End of interview

 

My thanks to the Bilibili uploader Third Prince Ao Bing; do give him a follow, he's a great expert. --> Excerpted from Ao Bing, Third Prince of Bilibili


Origin: blog.csdn.net/cyberHerman/article/details/105007106