Cache penetration and solutions (practical guide)

A cache is a protective layer added to relieve pressure on the database. When the data we need cannot be found in the cache, we have to query the database. If an attacker exploits this by repeatedly requesting data that exists in neither the cache nor the database, the cache no longer serves its purpose: every such request falls straight through to the database, which can exhaust its connections or otherwise overload it.

There are two common solutions for cache penetration:

Solution 1: For keys that do not exist in the database, also store a default null value in the cache. To avoid wasting cache space, these entries are generally given a shorter expiration time.

Solution 2: Put a filtering rule in front of the lookup, such as a Bloom filter.
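Solution 1 can be sketched roughly as follows. This is only an illustration under assumptions: the cache is a plain dict, the database is a dict stand-in, and the names `NullCachingStore`, `ttl`, and `null_ttl` are hypothetical, not taken from any particular cache product.

```python
import time

# Sentinel stored in the cache for keys the database does not have.
_MISSING = object()


class NullCachingStore:
    """Hypothetical read-through cache that also caches "not found" results."""

    def __init__(self, database, ttl=300.0, null_ttl=30.0, clock=time.monotonic):
        self.database = database   # dict standing in for the real database
        self.ttl = ttl             # lifetime of normal entries, in seconds
        self.null_ttl = null_ttl   # shorter lifetime for "not found" entries
        self.clock = clock
        self.cache = {}            # key -> (value, expires_at)
        self.db_queries = 0        # counts how often the database is hit

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > self.clock():
            value = entry[0]
            return None if value is _MISSING else value

        # Cache miss or expired entry: fall through to the database.
        self.db_queries += 1
        value = self.database.get(key, _MISSING)
        lifetime = self.null_ttl if value is _MISSING else self.ttl
        self.cache[key] = (value, self.clock() + lifetime)
        return None if value is _MISSING else value


store = NullCachingStore({"user:1": "zhangsan"})
store.get("user:999")  # not in the database: queried once, null result cached
store.get("user:999")  # answered from the cache, the database is not queried
```

The second request for the same missing key never reaches the database, but a request for a different missing key still does.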

Solution 1 is relatively simple, but it is also easy to defeat. For example, if the attacker analyzes the data format and never repeats a request, asking each time for a different key that the database does not have, then solution 1 is effectively useless. Solution 2 is more robust by comparison, so the rest of this article mainly explains how solution 2 is implemented.

The design idea of solution 2 is to check each request against the filtering rule before the database is queried. If the filter shows that the data does not exist, the database query is skipped, which reduces the load on the database.
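The request flow described above might look like the sketch below. The function `lookup` and its parameters `might_exist`, `cache`, and `query_database` are hypothetical names chosen for illustration. In the example, the filter is replaced by an exact membership test on a dict, so it only shows where the check sits, not how a real filter behaves.

```python
def lookup(key, might_exist, cache, query_database):
    """Return the value for key, consulting the filter before cache and database."""
    # 1. Filter first: a "definitely not present" answer ends the request here.
    if not might_exist(key):
        return None
    # 2. Then the cache.
    if key in cache:
        return cache[key]
    # 3. Only now the database; cache whatever it returns.
    value = query_database(key)
    if value is not None:
        cache[key] = value
    return value


# Stand-ins for illustration only.
database = {"user:1": "zhangsan"}
db_calls = []


def query_database(key):
    db_calls.append(key)
    return database.get(key)


cache = {}
lookup("user:404", database.__contains__, cache, query_database)  # rejected by the filter
lookup("user:1", database.__contains__, cache, query_database)    # reaches the database once
```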

The mainstream carrier for the filtering rule in solution 2 is the Bloom filter. A Bloom filter is a probabilistic data structure with efficient insertion and lookup that can tell you that something "definitely does not exist, or may exist".

Compared with traditional data structures such as List, Set, and Map, a Bloom filter is a bit array, so it is more efficient and takes up less space. The drawback is that the result it returns is probabilistic rather than exact.
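To give a feel for the space involved, the sketch below applies the standard textbook sizing formulas for a Bloom filter: for n items and a target false-positive rate p, the bit count is about -n·ln(p) / (ln 2)² and the number of hash functions is about (bits / n)·ln 2. The helper name `bloom_size` is hypothetical, and a real library may round or size its filter differently.

```python
import math


def bloom_size(capacity, error_rate):
    """Theoretical bit count and hash-function count for a Bloom filter."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(1_000_000, 0.01)
print(f"{bits} bits (about {bits / 8 / 1024 / 1024:.2f} MiB), {hashes} hash functions")
```

For one million items at a 1% error rate this comes to a little under 10 bits per item, roughly 1.2 MB in total, regardless of how long the stored keys are.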

To map a value into the Bloom filter, we run it through several different hash functions to produce several hash values, and set the bit at each resulting position to 1. For example, for the value "zhangsan", three different hash functions produce the hash values 1, 4, and 7, so bits 1, 4, and 7 are set to 1.

Now we store another value, "lisi". If the hash functions return 4, 5, and 8, then bits 1, 4, 5, 7, and 8 are now set to 1.

When we want to determine whether the Bloom filter has recorded a given piece of data, the filter first hashes it in the same way. For example, the hash functions return 2, 5, and 8 for "wangwu". Bit 2 is 0, which means no value has been mapped to that bit, so we can say with certainty that "wangwu" does not exist.

But we also notice that bit 4 was set twice, because the hash functions returned that position for both "zhangsan" and "lisi". As the Bloom filter stores more data, the probability of such overlaps keeps growing. So when we check a value and find that all three of its bit positions are set, this only shows that the value may be in the filter. It is not certain, because those bits may have been set by other data. This is why, as mentioned above, a Bloom filter can only say that something "definitely does not exist, or may exist".

As for why three different hash functions are used: a value is reported as possibly present only when all three positions are set, and if even one of them is 0, the value is definitely not in the filter. This reduces the error probability caused by hash collisions (two different values producing the same hash value).
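The mechanism can be illustrated with a minimal from-scratch sketch. It is not how any particular library is implemented: the class name `TinyBloomFilter` is hypothetical, and the three hash functions are simulated by salting SHA-256 with a seed, so the actual bit positions will differ from the 1, 4, 7 example above.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter: a list of bits plus several seeded hash functions."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Simulate num_hashes different hash functions by salting with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 bit means "definitely absent"; all 1s means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloomFilter()
bloom.add("zhangsan")
bloom.add("lisi")
print("zhangsan" in bloom)  # always True: added items are never reported absent
print("wangwu" in bloom)    # usually False, but a false positive is possible
```

Bits are only ever set, never cleared, which is why an item that was added can never be reported as absent.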

Packaged Bloom filter libraries are available in many languages. Let's take the Python package `pybloomfiltermmap3` as an example to demonstrate the code:

import pybloomfilter

# Create the filter (capacity, error rate, backing file).
# The lower the error rate, the larger the file.
bloom_filter = pybloomfilter.BloomFilter(1000000, 0.01, 'words.bloom')

# Add data
bloom_filter.update(('bj', 'sh', 'gz'))

# Check whether a value is contained
if 'bj' in bloom_filter:
    print('contains')
else:
    print('does not contain')

Origin blog.csdn.net/JACK_SUJAVA/article/details/109206301