A cache is a protective layer in front of the database that relieves its load. When the data we need is not in the cache, the request falls through to the database. If an attacker deliberately and repeatedly requests data that is not in the cache, the cache no longer serves its purpose: all of those requests land on the database at once, which can cause database connection failures. This problem is known as cache penetration.
There are two common solutions for cache penetration:
Solution 1: For data that does not exist in the database, store a default null value in the cache as well. To avoid wasting cache space, these entries are usually given a shorter expiration time.
Solution 2: Set up filtering rules, for example with a Bloom filter.
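Solution 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionaries stand in for a real cache and database, and the names `get`, `query_database`, `NORMAL_TTL` and `NULL_TTL`, as well as the TTL values, are hypothetical and do not come from any particular library.

```python
import time

# Hypothetical in-memory stand-ins for the real database and cache.
DATABASE = {"user:1": "zhangsan"}
cache = {}                      # key -> (value, expires_at)
MISSING = object()              # sentinel: "the database has no such record"

NORMAL_TTL = 300                # seconds; assumed value for real data
NULL_TTL = 30                   # shorter lifetime for cached misses; assumed value

stats = {"db_queries": 0}       # counts how often the database is hit


def query_database(key):
    stats["db_queries"] += 1
    return DATABASE.get(key)


def get(key, now=None):
    """Return the value for key, caching misses with a short TTL."""
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is not None and entry[1] > now:
        value = entry[0]
        return None if value is MISSING else value

    value = query_database(key)
    if value is None:
        # Cache the miss so repeated requests for this key skip the database.
        cache[key] = (MISSING, now + NULL_TTL)
        return None
    cache[key] = (value, now + NORMAL_TTL)
    return value
```

Repeated requests for the same missing key reach the database only once until the short TTL expires. A request for a different missing key still reaches the database every time.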
Solution 1 is relatively simple, but it is also easy to defeat. For example, an attacker who has analyzed the data format can request a different nonexistent key each time instead of repeating the same one. Every request then misses the cached null values, and Solution 1 is effectively useless. Solution 2 is more robust by comparison, and the rest of this article mainly explains how it is implemented.
The idea behind Solution 2 is to check each request against the filtering rules before querying the database. If the filter shows that the data does not exist, the database query is skipped, which reduces the load on the database.
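The request flow described above might look like the sketch below. To keep it short, a plain Python `set` stands in for the filter, and `known_keys`, `lookup` and the dictionaries are hypothetical names used only for illustration. Checking the filter before the cache is also just one possible ordering.

```python
# Hypothetical stand-ins: a set as the filter, dicts as the cache and database.
DATABASE = {"user:1": "zhangsan", "user:2": "lisi"}
known_keys = set(DATABASE)      # filter built from the keys that really exist
cache = {}
stats = {"db_queries": 0}       # counts how often the database is hit


def lookup(key):
    """Reject keys the filter has never seen before touching the database."""
    if key not in known_keys:
        return None             # filtered out: no cache or database access
    if key in cache:
        return cache[key]
    stats["db_queries"] += 1
    value = DATABASE.get(key)
    if value is not None:
        cache[key] = value
    return value
```

With this in place, a flood of requests for keys that were never stored is answered by the filter alone.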
The most common way to implement the filtering rules in Solution 2 today is a Bloom filter. A Bloom filter is a probabilistic data structure with efficient insertion and lookup. It can tell you that something "definitely does not exist or may exist".
Compared with traditional data structures such as List, Set, and Map, a Bloom filter is backed by a bit array, so it is more efficient and takes up less space. The drawback is that its answers are probabilistic rather than exact.
To map a value into the Bloom filter, we run it through several different hash functions to produce several hash values, and set the bit at the position given by each hash value to 1. For example, for the value "zhangsan", three different hash functions produce the hash values 1, 4, and 7, so bits 1, 4, and 7 are set to 1.
Now we store another value, "lisi". If the hash functions return 4, 5, and 8, those bits are set to 1 as well, and the bit array now has bits 1, 4, 5, 7, and 8 set.
When we want to determine whether the Bloom filter has recorded a given value, the filter first applies the same hash functions to that value. For example, suppose the hash functions return 2, 5, and 8 for "wangwu". Bit 2 is 0, which means no value has been mapped to that bit, so we can say with certainty that "wangwu" does not exist.
At the same time, notice that bit 4 was set twice, because the hash functions for both "zhangsan" and "lisi" returned that position. As more data is stored in the Bloom filter, such overlaps become more and more likely. So when we check a value and find that all three of its bits are set, this only shows that the value may be in the filter. It is not certain, because those bits may have been set by other values. This is why, as mentioned above, a Bloom filter can only say that something "definitely does not exist or may exist". As for why three different hash functions are used: as long as any one of the three bits is not set, the value is definitely not in the filter. This reduces the error rate caused by hash collisions (two different values producing the same hash value).
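The bit-setting and lookup steps described above can be sketched in a few lines of plain Python. This is a teaching sketch, not how any particular library is implemented: the class name `SimpleBloomFilter`, the default sizes, and the choice of deriving the hash functions from seeded SHA-256 digests are all assumptions made for illustration.

```python
import hashlib


class SimpleBloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes            # must stay below 256 here
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes different bit positions by prefixing a seed byte.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False: definitely absent. True: possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = SimpleBloomFilter()
bloom.add("zhangsan")
bloom.add("lisi")
print("zhangsan" in bloom)   # always True: added items are never reported absent
print("wangwu" in bloom)     # almost certainly False, but a false positive is possible
```

Note that bits are only ever set, never cleared, which is why a value that has been added can never be reported as absent.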
Bloom filter libraries are available in many languages. Let's take the Python package `pybloomfiltermmap3` as an example to demonstrate the code:
import pybloomfilter

# Create the filter (capacity, error rate, backing file). The lower the error rate, the larger the file.
bloom = pybloomfilter.BloomFilter(1000000, 0.01, 'words.bloom')
# Add data
bloom.update(('bj', 'sh', 'gz'))
# Check whether the filter contains a value
if 'bj' in bloom:
    print('contains')
else:
    print('does not contain')
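To see why a lower error rate means a larger file, the textbook sizing formulas for a Bloom filter can be evaluated directly. The helper `bloom_parameters` below is a hypothetical name, and the formulas are the standard theoretical optimum. The library may round or lay out its file differently, so treat the numbers as estimates only.

```python
import math


def bloom_parameters(capacity, error_rate):
    """Estimate the bit count and hash count for a target false-positive rate.

    Uses the standard formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    num_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes


for rate in (0.01, 0.001):
    bits, hashes = bloom_parameters(1000000, rate)
    print(rate, bits, bits // 8, hashes)
```

For a capacity of 1,000,000 and an error rate of 0.01, this comes to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions. Tightening the error rate to 0.001 increases the size by about half.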