Bloom filters and cuckoo filters in Redis

It is well known that I/O has always been a bottleneck in computing, and many frameworks, techniques, and even hardware designs exist to reduce I/O operations. Today's topic is filters. Let's start with a scenario:

Our business backend relies on a database. When a request queries some piece of information, we usually check the cache first and return the result if it is there. If it is not, we query the database. This creates a problem: when requests ask for data that does not exist in the database at all, the database still has to perform the unnecessary I/O for each of them. If there are many such requests, most of the database's I/O is spent answering meaningless queries. So how can these requests be blocked? This is where filters come in:

Bloom filter

The general idea of a Bloom filter is this: when a request arrives, first check whether the requested data might exist. If it might, forward the request to the database; if not, return immediately. How is this done?
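The request flow described so far can be sketched as follows. This is only an illustration: `lookup`, `might_exist`, and `query_db` are hypothetical names, and a plain Python set stands in for the filter, so unlike a real Bloom filter it never misjudges.

```python
def lookup(key, cache, might_exist, query_db):
    """Return the value for key, touching the database only when needed."""
    if key in cache:
        return cache[key]
    if not might_exist(key):
        # The filter says the key was never stored: skip the database.
        return None
    value = query_db(key)
    if value is not None:
        cache[key] = value
    return value


database = {"user:1": "Alice"}
db_calls = []  # records every key that actually reaches the database


def query_db(key):
    db_calls.append(key)
    return database.get(key)


known_keys = set(database)  # exact stand-in for the filter (assumption)
cache = {}

print(lookup("user:1", cache, known_keys.__contains__, query_db))    # Alice
print(lookup("user:999", cache, known_keys.__contains__, query_db))  # None
print(db_calls)  # ['user:1']
```

The request for the missing key never reaches `query_db`, which is exactly the I/O the filter is meant to save.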

As shown in the figure, a bitmap is used for recording, and all of its bits start at 0. When a piece of data is stored, three hash functions compute three hash values, and the corresponding bitmap positions are set to 1. In the figure, positions 1, 3, and 6 of the bitmap are set to 1. When a request arrives later, its hash values are computed with the same three hash functions. If it is the same data, it must map to bits 1, 3, and 6 again, so we can judge that this data has been stored before. If any one of the three mapped positions does not match, for example if the new data maps to bits 1, 3, and 7 and bit 7 is 0, then the data has never been added to the database, so the request returns directly.
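A minimal sketch of this structure in Python might look like the following. The bitmap size of 64 and the way the three positions are derived (one SHA-256 hash salted with an index) are assumptions made for illustration; the article does not prescribe particular hash functions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # every bit starts at 0

    def _positions(self, item):
        # Derive k positions by salting one hash with the index i
        # (an illustrative choice, not the only possible scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # A single 0 bit proves the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True: all of its bits were set
```

A `True` answer only means "possibly stored", while a `False` answer is definite.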

The problem with Bloom filters

From the description above, you may have noticed that the Bloom filter has some problems:

First, the Bloom filter can misjudge (produce false positives):

Consider this scenario: storing data packet 1 sets bits 1, 3, and 6 of the bitmap to 1, and storing data packet 2 sets bits 3, 6, and 7 to 1. Now a request arrives for data packet 3, which has never been stored; after three hashes it maps to bits 1, 6, and 7. This data has not been stored before, but because data packets 1 and 2 already set those bits to 1, the filter reports it as present and request 3 is also passed to the database. Such misjudgments become more frequent as more data is stored.
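The misjudgment can be reproduced with a few lines of Python. To mirror the example exactly, the bit positions are hard-coded rather than computed by real hash functions; the names `store` and `might_contain` are illustrative only.

```python
# Bit positions are fixed by hand to match the example in the text.
positions = {
    "packet1": (1, 3, 6),
    "packet2": (3, 6, 7),
    "packet3": (1, 6, 7),  # never stored
}

bitmap = [0] * 8


def store(name):
    for pos in positions[name]:
        bitmap[pos] = 1


def might_contain(name):
    return all(bitmap[pos] for pos in positions[name])


store("packet1")
store("packet2")
false_positive = might_contain("packet3")
print(false_positive)  # True, although packet3 was never stored
```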

Second, the Bloom filter cannot delete data. Deletion runs into two difficulties:

First, because misjudgment is possible, the filter cannot be sure whether the data really exists in the database, as with data packet 3.

Second, clearing the bitmap flags for one data packet may affect other data packets. In the example above, deleting data packet 1 means setting bits 1, 3, and 6 of the bitmap to 0. When a request for data packet 2 then arrives, the filter reports that it does not exist, because bits 3 and 6 have been set to 0.

Enhanced Bloom Filter

To address the problems of the Bloom filter described above, an enhanced version, the Counting Bloom Filter, was introduced. Its idea is to replace the Bloom filter's bitmap with an array of counters. Each time an inserted element maps to a position, that counter is incremented by 1; when an element is deleted, the counter is decremented by 1. This avoids having to recompute the hashes of all remaining data packets (that is, rebuild the filter) after a deletion, as an ordinary Bloom filter would require, but it still cannot avoid misjudgment.
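A counting variant can be sketched as follows, again assuming salted SHA-256 for the three positions and 64 counters. The guard in `remove` is an added safety choice for this sketch, not something the article specifies.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose bits are replaced by small counters."""

    def __init__(self, num_slots=64, num_hashes=3):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counters = [0] * num_slots

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def might_contain(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item):
        # Only meaningful for items that were really added: removing a
        # false positive would corrupt the counters of other items.
        if not self.might_contain(item):
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter()
cbf.add("packet1")
cbf.add("packet2")
cbf.remove("packet1")
print(cbf.might_contain("packet2"))  # True: its counters are still above 0
```

Because shared positions are counted rather than flagged, deleting one packet no longer erases the evidence of another.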

Cuckoo filter

The cuckoo filter was proposed to solve the problem that the Bloom filter cannot delete elements. Compared with the cuckoo filter, the Bloom filter has the following disadvantages: weaker query performance, lower space efficiency, no support for the reverse operation (deletion), and no support for counting.

Query performance is weaker because the Bloom filter uses multiple hash functions to probe several different positions in the bitmap. These positions are far apart in memory, which leads to a low CPU cache line hit rate.

Space efficiency is lower because, at the same false positive rate, the cuckoo filter uses space significantly more efficiently than the Bloom filter and can save more than 40% of the space. However, the Bloom filter does not require the bitmap length to be a power of 2, whereas the cuckoo filter does. From this point of view, the Bloom filter appears more flexible in how its space can be sized.

The lack of a reverse delete operation really is the Bloom filter's weak spot. In a dynamic system, elements are always coming and going. A Bloom filter is like an imprint: whatever comes leaves a mark, and the mark cannot be cleaned up even after the element is gone. For example, your system may hold only 10 million elements now, while hundreds of millions of elements have passed through it in total. The Bloom filter is helpless here; it keeps the imprints of the departed elements forever. Over time the filter becomes more and more crowded, until one day you find that its false positive rate is too high and you have to rebuild it.

The cuckoo filter paper claims to solve this problem and to support the reverse delete operation effectively. It uses this as an important selling point to tempt you to give up the Bloom filter and switch to the cuckoo filter.

Why the name Cuckoo?

There is a Chinese idiom, "the dove occupies the magpie's nest", and the cuckoo behaves the same way. The cuckoo never builds its own nest. It lays its eggs in other birds' nests and leaves them to do the hatching. Because the cuckoo chick is relatively large, once it hatches it pushes the foster mother's other chicks (or eggs) out of the nest, where they fall from a height and die young.

Cuckoo Hash

The simplest cuckoo hash structure is a one-dimensional array with two hash functions that map a new element to two positions in the array. If either of the two positions is empty, the element is placed there directly. If both positions are occupied, it has to "occupy the magpie's nest": kick out one of the occupants at random and take that position itself. In the formulas below, l is the length of the array.

p1 = hash1(x) % l
p2 = hash2(x) % l

Unlike the cuckoo, cuckoo hashing helps these victims (the displaced eggs) find another nest. Because every element has two possible positions, it can be inserted as long as either one is free. So the unlucky egg that was kicked out checks whether its other position is free; if it is, the egg moves there and everyone is happy. But what if that position is also taken? Then it "occupies the magpie's nest" in turn and passes the role of victim on to someone else. The new victim repeats the process until every egg has found a nest.

Problems with cuckoo hashing

There is a problem, though: if the array is too crowded, insertion may kick hundreds of times without stopping, which seriously hurts insertion efficiency. Cuckoo hashing therefore sets a threshold. When the number of consecutive kicks exceeds it, the array is considered nearly full and must be expanded, with all elements relocated.
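The insertion procedure with kicking, a kick threshold, and expansion can be sketched like this. The class name `CuckooHashSet`, the salted SHA-256 hashes, the default threshold of 50 kicks, and the choice to double the array on expansion are all assumptions for illustration.

```python
import hashlib
import random


def _hash(item, salt, size):
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % size


class CuckooHashSet:
    def __init__(self, size=16, max_kicks=50):
        self.size = size
        self.max_kicks = max_kicks
        self.slots = [None] * size

    def _positions(self, item):
        return _hash(item, 1, self.size), _hash(item, 2, self.size)

    def contains(self, item):
        return any(self.slots[p] == item for p in self._positions(item))

    def insert(self, item):
        if self.contains(item):
            return
        p1, p2 = self._positions(item)
        for p in (p1, p2):
            if self.slots[p] is None:
                self.slots[p] = item
                return
        # Both nests are taken: evict a random occupant and rehome it.
        pos = random.choice((p1, p2))
        for _ in range(self.max_kicks):
            item, self.slots[pos] = self.slots[pos], item
            # The victim moves to its other candidate position.
            a, b = self._positions(item)
            pos = b if pos == a else a
            if self.slots[pos] is None:
                self.slots[pos] = item
                return
        # Threshold exceeded: treat the array as full and expand it.
        self._grow(item)

    def _grow(self, pending):
        items = [x for x in self.slots if x is not None] + [pending]
        self.size *= 2
        self.slots = [None] * self.size
        for x in items:  # relocate every element in the larger array
            self.insert(x)


table = CuckooHashSet(size=8, max_kicks=10)
words = [f"key{i}" for i in range(30)]
for w in words:
    table.insert(w)
print(all(table.contains(w) for w in words))  # True
```

Thirty keys cannot fit into eight slots, so this usage is guaranteed to go through the expansion path at least once.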

There is another problem: kicks can run in a cycle. For example, suppose two different elements hash to exactly the same two positions. That is fine so far, since each can take one position. But then a third element arrives whose two positions are the same as theirs. Clearly the kicking will now loop forever. That said, three different elements landing on the same two positions after both hashes is not very likely, unless your hash algorithm is very poor.

Cuckoo hashing treats such a cycle as a sign that the array is too crowded and needs to be expanded (even though it actually is not).
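The cycle is easy to reproduce when the candidate positions are fixed by hand. In this sketch the positions are hard-coded instead of hashed, and `insert_with_kicks` is a hypothetical helper that reports how many kicks an insertion needed, or `None` when the threshold is hit.

```python
def insert_with_kicks(slots, positions, item, max_kicks=10):
    """Return the number of kicks used, or None if the threshold is hit."""
    p1, p2 = positions[item]
    for p in (p1, p2):
        if slots[p] is None:
            slots[p] = item
            return 0
    pos = p1
    for kicks in range(1, max_kicks + 1):
        item, slots[pos] = slots[pos], item  # evict the occupant
        a, b = positions[item]
        pos = b if pos == a else a           # victim tries its other nest
        if slots[pos] is None:
            slots[pos] = item
            return kicks
    # Threshold reached; the element still in hand has no nest, which is
    # why a real table must expand and relocate at this point.
    return None


# Three different elements share the same two candidate positions.
positions = {"a": (2, 5), "b": (2, 5), "c": (2, 5)}
slots = [None] * 8
results = [insert_with_kicks(slots, positions, x) for x in "abc"]
print(results)  # [0, 0, None]
```

Six of the eight slots are empty when the third insertion gives up, which shows why the "array is too crowded" conclusion is not actually true in this case.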

Optimization

The cuckoo hashing described above has low average space utilization, only about 50%. Once that level is reached, the number of consecutive kicks quickly exceeds the threshold. A hash structure like that has limited value, so it needs to be improved.

One improvement is to add hash functions so that each element has three or four candidate nests instead of only two. This greatly reduces the probability of collision and raises space utilization to about 95%.

Another improvement is to attach multiple seats to each position of the array. Even if two elements hash to the same position, there is no need to "occupy the magpie's nest" right away, because the newcomer can take any free seat. A kick is needed only when all of the seats are occupied. This clearly reduces the number of kicks as well. Space utilization with this approach is only about 85%, but queries are very efficient: the seats at one position are contiguous in memory, so the CPU cache is used effectively.

Therefore, a more efficient solution combines the two improvements, for example 4 hash functions with 2 seats at each position. This yields both time efficiency and space efficiency. Such a combination can even raise space utilization to 99%, which is remarkable.

Cuckoo filter

The cuckoo filter has the same structure as the cuckoo hash: a one-dimensional array. The difference is that the cuckoo hash stores the entire element, while the cuckoo filter stores only the element's fingerprint (a few bits, similar to a Bloom filter). The filter sacrifices accuracy for space efficiency here. Precisely because only fingerprints are stored, there is a false positive rate, just as with the Bloom filter.

First of all, the cuckoo filter still uses only two hash functions, but each position can hold multiple seats. The choice of these two hash functions is special, because the filter stores only fingerprints. When a fingerprint is kicked out of its position, its other (dual) position has to be computed, and with ordinary hashing that computation requires the element itself. Recall the earlier formulas for the hash positions:

fp = fingerprint(x)
p1 = hash1(x) % l
p2 = hash2(x) % l

We know p1 and the fingerprint of x, but there is no way to calculate p2 directly.

Special hash function

The ingenuity of the cuckoo filter lies in a specially designed hash function that lets p2 be computed directly from p1 and the element's fingerprint, without the complete element x.

fp = fingerprint(x)
p1 = hash(x)
p2 = p1 ^ hash(fp)  // XOR

The formula above shows that once we know fp and p1, we can calculate p2 directly. Likewise, if we know p2 and fp, we can calculate p1 directly. This is the duality.

p1 = p2 ^ hash(fp)

So we never need to know whether the current position is p1 or p2; XORing the current position with hash(fp) gives the dual position. As long as hash(fp) != 0, we also have p1 != p2, so an element can never kick itself and cause an infinite loop.
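The duality can be checked with a short Python sketch. The helper names `h`, `fingerprint`, and `alt_position`, the use of SHA-256, and the 8-bit fingerprint are assumptions for illustration; reduction to the array length is left out here.

```python
import hashlib


def h(data: bytes) -> int:
    """64-bit hash built from SHA-256 (an illustrative choice)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def fingerprint(item: str) -> int:
    # 8-bit fingerprint in the range 1..255.
    return h(b"fp:" + item.encode()) % 255 + 1


def alt_position(pos: int, fp: int) -> int:
    # Works in both directions: p1 -> p2 and p2 -> p1.
    return pos ^ h(bytes([fp]))


x = "user:42"
fp = fingerprint(x)
p1 = h(x.encode())
p2 = alt_position(p1, fp)
print(alt_position(p2, fp) == p1)  # True: XOR is its own inverse
```

Only `fp` and the current position are needed to find the other position, so the element `x` itself never has to be stored.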

You may ask why the hash function here does not take the modulus of the array length. It actually does, but the cuckoo filter requires the array length to be a power of 2, say 2^n, so taking the modulus is equivalent to keeping the lowest n bits of the hash value. When doing the XOR, simply ignore all bits except the lowest n. The lowest n bits of the computed position p are the final dual position.
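Putting the pieces together, a small cuckoo filter might be sketched as below. This is an illustration under stated assumptions, not the paper's or Redis's implementation: the 8-bit fingerprint, 4 seats per position, 64 positions, the limit of 500 kicks, and SHA-256 as the hash are all choices made for the example.

```python
import hashlib
import random


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CuckooFilter:
    def __init__(self, num_buckets=64, seats=4, max_kicks=500):
        # The length must be a power of 2 so that "% num_buckets"
        # is the same as "& (num_buckets - 1)".
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.mask = num_buckets - 1
        self.seats = seats
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: str) -> int:
        return _h(b"fp:" + item.encode()) % 255 + 1  # 8 bits, never 0

    def _index(self, item: str) -> int:
        return _h(item.encode()) & self.mask

    def _alt(self, pos: int, fp: int) -> int:
        # XOR with hash(fp), then keep only the lowest n bits.
        return (pos ^ _h(bytes([fp]))) & self.mask

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        p1 = self._index(item)
        p2 = self._alt(p1, fp)
        for p in (p1, p2):
            if len(self.buckets[p]) < self.seats:
                self.buckets[p].append(fp)
                return True
        pos = random.choice((p1, p2))
        for _ in range(self.max_kicks):
            victim = random.randrange(self.seats)
            fp, self.buckets[pos][victim] = self.buckets[pos][victim], fp
            pos = self._alt(pos, fp)  # needs only the fingerprint
            if len(self.buckets[pos]) < self.seats:
                self.buckets[pos].append(fp)
                return True
        # Considered full. The fingerprint still in hand is dropped here;
        # a production filter would have to keep it and rebuild.
        return False

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        p1 = self._index(item)
        return fp in self.buckets[p1] or fp in self.buckets[self._alt(p1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        p1 = self._index(item)
        for p in (p1, self._alt(p1, fp)):
            if fp in self.buckets[p]:
                self.buckets[p].remove(fp)
                return True
        return False


cf = CuckooFilter()
cf.insert("user:42")
print(cf.contains("user:42"))  # True
cf.delete("user:42")
print(cf.contains("user:42"))  # False: the fingerprint was removed
```

As with the counting Bloom filter, `delete` should only be called for items that were actually inserted, because a different item with the same fingerprint and position would otherwise lose its entry.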

Origin blog.csdn.net/qq_45635347/article/details/131443934