Redis' distributed bloom filter

Problem

Lao Gu starts with a frequently asked interview question: there are 5 billion existing phone numbers, and you are given another 100,000 phone numbers. How do you quickly and accurately determine whether each of those numbers already exists?

The problem can be stated more precisely: 5 billion phone numbers are stored in a database, and we need to quickly and accurately determine whether each of the 100,000 provided phone numbers is among them.

Some of you may already have a solution in mind.

In actual projects, we also run into similar problems, such as spam filtering and duplicate-URL detection in web crawlers. In essence, they all come down to judging whether an item exists in a very large set.

How can this be solved? With the Bloom filter, which is what we introduce today. Read on.

Bloom filter

A Bloom filter is a data structure similar to a set, but it is not fully accurate. When it reports that an element exists, the element may not actually exist; when it reports that an element does not exist, the element definitely does not exist. In other words, there is a certain probability of misjudgment.

Misjudgment only happens for elements that were never added to the filter. Elements that have been added are never misjudged.

Features: efficient insertion and querying, low space usage, and results that are probabilistic rather than certain.

Bloom filter principle

The Bloom filter was proposed by Burton Bloom in 1970. It uses a small amount of space to solve problems like the ones mentioned above.

The implementation relies on a very long binary array (also called a bit vector). When adding data, multiple hash functions are applied to the key, and each one produces an index value (that is, an index into the binary array).


In the above figure, the bottom layer is a very long binary array, the middle layer is the set of hash functions, and the top layer is the data.

Each piece of data is run through the hash functions to get its index values, and the corresponding positions in the binary array are set to 1. With three hash functions, three positions are set to 1 for each piece of data. When querying, if all of the positions for a key are 1, the data is considered to exist; if any of them is 0, the data definitely does not exist.
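The principle above can be sketched in a few lines of Python. This is a toy illustration only: the class name `BloomFilter`, the method names, and the choice of salted SHA-256 as a stand-in for "multiple hash functions" are assumptions made for this sketch, not how Redis implements its filter.

```python
import hashlib


class BloomFilter:
    """A toy in-memory Bloom filter: one bit array plus k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte holds 8 bits, so round the byte count up.
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, key: str):
        # Simulate k different hash functions by salting SHA-256 with the
        # function number (an illustrative choice, not what Redis uses).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        # Set the bit at every index produced by the hash functions.
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: str) -> bool:
        # True only if every bit for this key is 1; a single 0 bit proves absence.
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("13800000000")
print(bf.might_contain("13800000000"))  # True: added elements are always found
```

A key that was never added usually returns False, but it can return True if other keys happen to have set all of its bits. That collision is where the misjudgment comes from.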

Bloom filter error

Space occupation

There is a simple formula for the space a Bloom filter occupies, although its derivation is more involved. A Bloom filter has two input parameters: the expected number of elements n and the error rate f. From these, the formula yields two outputs: the bit array length L (that is, the storage size in bits) and the optimal number of hash functions k.

k = 0.7 * (L/n)
f = 0.6185^(L/n)

When the actual number of elements exceeds the expected number n, the error rate rises.

For most readers, it is enough to know that errors can occur; there is no need to work through the calculation by hand.
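As a rough worked example, the two formulas can be inverted to estimate L and k for the 5 billion phone numbers from the opening question. The function name `bloom_parameters` and the 1% error rate are assumptions chosen for illustration.

```python
import math


def bloom_parameters(n: int, f: float) -> tuple[int, int]:
    """Return (L, k): bit-array length and hash count for n elements at error rate f."""
    # Invert f = 0.6185^(L/n) to get the number of bits needed per element.
    bits_per_element = math.log(f) / math.log(0.6185)
    L = math.ceil(n * bits_per_element)
    # k = 0.7 * (L/n), rounded to a whole number of hash functions.
    k = max(1, round(0.7 * L / n))
    return L, k


L, k = bloom_parameters(n=5_000_000_000, f=0.01)
print(f"bits: {L}, about {L / 8 / 1024**3:.1f} GiB, hash functions: {k}")
```

Under these assumptions the filter needs a little under 10 bits per element, which puts 5 billion numbers in the range of several gigabytes.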

The basic use of Redis Bloom filter

In Redis, the Bloom filter has two basic commands:

  • bf.add : Adds an element to the Bloom filter, similar to the sadd command for sets. bf.add can only add one element at a time; to add several elements at once, use the bf.madd command.

  • bf.exists : Determines whether an element is in the filter, similar to the sismember command for sets. bf.exists can only query one element at a time; to query several elements at once, use the bf.mexists command.
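The four commands can be wrapped in small Python helpers like the ones below. This is a sketch under assumptions: `client` is taken to be a redis-py style object that exposes `execute_command`, connected to a Redis server with the Bloom filter module loaded, and the helper names are made up for illustration.

```python
def add_number(client, key, number):
    """bf.add: add one element; True if it was newly added."""
    return bool(client.execute_command("BF.ADD", key, number))


def has_number(client, key, number):
    """bf.exists: True means "probably present", False means "definitely absent"."""
    return bool(client.execute_command("BF.EXISTS", key, number))


def add_numbers(client, key, numbers):
    """bf.madd: add several elements in one call; one bool per element."""
    replies = client.execute_command("BF.MADD", key, *numbers)
    return [bool(r) for r in replies]


def check_numbers(client, key, numbers):
    """bf.mexists: query several elements in one call; maps element -> bool."""
    replies = client.execute_command("BF.MEXISTS", key, *numbers)
    return dict(zip(numbers, (bool(r) for r in replies)))
```

With redis-py, `client` could be created as `redis.Redis(host="localhost", port=6379)`; the host and port here are placeholders.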

Advanced use of Bloom filters

A Bloom filter used without any prior setup is one with default parameters, created automatically the first time we run the bf.add command. Redis also provides Bloom filters with custom parameters. To keep misjudgment as low as possible, the parameters must be set sensibly.

Before adding elements with bf.add, use the bf.reserve command to create a custom Bloom filter. The bf.reserve command takes three parameters:

  • key : The key name of the filter.

  • error_rate : Expected error rate. The lower the expected error rate, the more space is needed.

  • capacity : Initial capacity. When the actual number of elements exceeds this initial capacity, the false positive rate increases.

such as:
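A call with these three parameters might look like the following sketch. The helper name `create_filter`, the key `phone_filter`, and the values 0.001 and 1,000,000 are made up for illustration, and `client` is assumed to be a redis-py style object exposing `execute_command`.

```python
def create_filter(client, key, error_rate, capacity):
    """Create a Bloom filter with custom parameters via bf.reserve."""
    # Argument order on the wire is: key, error_rate, capacity.
    return client.execute_command("BF.RESERVE", key, error_rate, capacity)
```

Calling `create_filter(client, "phone_filter", 0.001, 1_000_000)` sends the equivalent of `BF.RESERVE phone_filter 0.001 1000000`.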

If the key already exists, bf.reserve reports an error. If you do not create the filter with bf.reserve and instead rely on the one Redis creates automatically, the default error_rate is 0.01 and the default capacity is 100.

The smaller the error_rate, the more storage space the Bloom filter needs. For scenarios that do not require high precision, error_rate can be set slightly larger. If capacity is set too large, storage space is wasted; if it is set too small, accuracy suffers. You should therefore estimate the number of elements as accurately as possible before use, and add some headroom in case the actual number of elements turns out to be much higher than the configured value. In short, both error_rate and capacity need to be set to appropriate values.

Bloom filter application

Solve the problem of cache penetration

Under normal circumstances, a query first checks the cache and goes to the database only if the data is not in the cache. When the data does not exist in the database either, every query for it ends up hitting the database; this is cache penetration. The danger is that a large number of requests for data that does not exist puts pressure on the database and can even bring it down.

A Bloom filter can solve this problem: store the keys of all existing data in the Bloom filter. When a request arrives, first check the Bloom filter. If the key is not there, return immediately; if it is, query the cache and then, on a cache miss, the database.
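The lookup order described above can be sketched as follows. Everything here is a stand-in chosen for illustration: the function name `lookup`, a plain dict for the cache, a plain dict for the database, and a set's membership test in place of a real Bloom filter check.

```python
def lookup(key, bloom_might_contain, cache, database):
    """Return the value for key, or None, touching the database as little as possible."""
    # 1. The filter says "definitely absent": return without touching cache or database.
    if not bloom_might_contain(key):
        return None
    # 2. Probably present: try the cache first.
    if key in cache:
        return cache[key]
    # 3. Cache miss: fall back to the database (a false positive yields None here).
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


known_keys = {"user:1"}  # stand-in for the keys stored in the Bloom filter
cache = {}
database = {"user:1": "Alice"}

print(lookup("user:1", known_keys.__contains__, cache, database))    # Alice
print(lookup("user:999", known_keys.__contains__, cache, database))  # None
```

Because the filter can return a false positive, a small fraction of requests for missing keys still reach the database, but the bulk of them are rejected at the first step.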

Blacklist verification

If an item is found in the blacklist, a specific action is taken. Take spam identification as an example: any message whose sender address is on the blacklist is marked as spam. If the blacklist holds hundreds of millions of entries, storing it directly consumes a great deal of space. A Bloom filter is a better solution: put all blacklist entries into the Bloom filter, and when an email arrives, simply check whether its sender address is in the filter.

Summary

Today, Lao Gu walked everyone through the principle and application scenarios of the Redis Bloom filter. I hope it helps. Thank you!

Origin blog.csdn.net/EnjoyEDU/article/details/107886078