Redis study notes: Bloom filters

What is a Bloom filter?

Bloom filters became available with Redis 4.0, where they are loaded into the server as a module (plugin) rather than built into the core. Simply put, a Bloom filter is an imprecise set structure: when you use its contains operation to test whether an object exists, the answer may be a false positive. When the Bloom filter says a value exists, the value may in fact not exist; when it says a value does not exist, the value definitely does not exist.

Simple use of Bloom filter

127.0.0.1:6379> bf.add calvinBloom 1
(integer) 1
127.0.0.1:6379> bf.add calvinBloom 2
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 1
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 2
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 3
(integer) 0

The following Python script adds half of a set of random user names to a filter, then queries the other half to count false positives:

# This code is adapted from the book "Redis Deep Adventure" (redis深度历险)
import random
import string

import redis

client = redis.StrictRedis()


def random_string(n):
    """Return a random string of n lowercase letters."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))


# Generate unique random user names and split them into two halves.
users = list(set(random_string(64) for _ in range(10000)))
print("total users", len(users))

half = len(users) // 2
users_train = users[:half]  # added to the filter
users_test = users[half:]   # never added, so any hit is a false positive

client.delete("codehole")
for user in users_train:
    client.execute_command("bf.add", "codehole", user)
print("all train")

falses = 0
for user in users_test:
    # Query only: bf.exists returning 1 for an unseen user is a false positive.
    ret = client.execute_command("bf.exists", "codehole", user)
    if ret == 1:
        falses += 1

print(falses, len(users_test))

Judging from the output of the script above, the false positive rate is close to 2%, which is still somewhat high. If bf.reserve is not called first, bf.add creates the filter with default parameters. Redis also supports Bloom filters with custom parameters: create the filter explicitly with bf.reserve before the first bf.add. bf.reserve takes three parameters: key, error_rate (the target false positive rate), and initial_size (the estimated number of elements).

Notes:

  1. bf.reserve reports an error if the key already exists.
  2. If initial_size is too large, space is wasted; if it is too small, accuracy suffers. Estimate the number of elements as well as you can before use, and leave some headroom in case the actual count ends up well above the estimate.
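The two notes above can be folded into a small helper. This is only a sketch under stated assumptions: the name reserve_with_headroom and the 20% default margin are invented for illustration and are not from the book, and client is assumed to be a redis-py client connected to a server that has the Bloom filter module loaded.

```python
import math


def reserve_with_headroom(client, key, error_rate, expected, headroom=0.2):
    """Create a Bloom filter sized for `expected` elements plus spare room.

    `headroom` is a hypothetical safety margin (0.2 means 20% extra).
    Returns the padded size that was passed to bf.reserve.
    """
    initial_size = expected + math.ceil(expected * headroom)
    # bf.reserve fails if the key already exists, so drop any old filter first.
    client.delete(key)
    client.execute_command("bf.reserve", key, error_rate, initial_size)
    return initial_size
```

In the test script earlier, the call client.delete("codehole") could be swapped for something like reserve_with_headroom(client, "codehole", 0.001, len(users_train)) to see how a lower error_rate changes the count of false positives.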

Bloom filter principle

In Redis, each Bloom filter is a data structure made up of a large bit array and several different hash functions (h1, h2, h3). To add a key, each hash function hashes the key to an integer, and that integer is reduced modulo the array length to give a position (when the length is a power of two, this is the same as (length - 1) & hash value). The bit at each of these positions is set to 1. To query a key, compute the same positions as for add and check whether every one of them is 1. If any bit is 0, the key definitely does not exist. If all bits are 1, the key probably exists, but those bits may have been set by other keys, which is where false positives come from.
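To make the add and query steps concrete, here is a toy Bloom filter in pure Python. It is a sketch for understanding only, not how the Redis module is implemented: the class name SimpleBloomFilter is made up, and instead of independent hash functions it derives all positions from one SHA-256 digest (double hashing), which is an assumption chosen to keep the example short.

```python
import hashlib


class SimpleBloomFilter:
    """Toy Bloom filter: a bit array plus num_hashes derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive the positions from two base hashes (double hashing);
        # this stands in for the separate h1, h2, h3 described in the text.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # A single 0 bit proves absence; all 1 bits only suggest presence.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = SimpleBloomFilter(num_bits=10000, num_hashes=7)
for i in range(100):
    bf.add("user%d" % i)

print(all("user%d" % i in bf for i in range(100)))  # True: no false negatives
```

Keys that were added are always reported as present. A key that was never added is usually reported as absent, but it can be reported as present when other keys happen to have set all of its bits.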

Space occupancy estimate

  n: expected number of elements
  f: error rate
  m: length of the bit array
  k: optimal number of hash functions

k = 0.7*(m/n)

f = 0.6185^(m/n)

From these formulas, the longer the bit array is relative to the number of elements (the larger m/n), the lower the error rate, but the larger the optimal number of hash functions, which makes each add and each lookup slower.
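The formulas can also be turned around to size a filter from a target error rate. The sketch below assumes that the constants 0.7 and 0.6185 are rounded values of ln 2 and 2^(-ln 2), which gives m/n = -ln(f) / (ln 2)^2 and k = (m/n) * ln 2. The function name bloom_parameters is made up for illustration.

```python
import math


def bloom_parameters(n, f):
    """Return (m, k) for n expected elements and target error rate f."""
    # Solve f = 0.6185^(m/n) for m/n, using 0.6185 ~= exp(-(ln 2)^2).
    bits_per_element = -math.log(f) / (math.log(2) ** 2)
    m = math.ceil(n * bits_per_element)
    # k = 0.7 * (m/n), with 0.7 ~= ln 2, rounded to a whole number of hashes.
    k = max(1, round(bits_per_element * math.log(2)))
    return m, k


for f in (0.1, 0.01, 0.001):
    m, k = bloom_parameters(10000, f)
    # Print the error rate, bit count, hash count, and size in KiB.
    print(f, m, k, round(m / 8 / 1024, 1), "KiB")
```

For an error rate of 1%, this works out to a little under 10 bits per element and 7 hash functions; each further tenfold reduction in the error rate costs roughly 4.8 more bits per element.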

I have taken these conclusions directly from the book Redis Deep Adventure (redis深度历险); for the derivation, see the recommended reading below.

Recommended reading

Bloom filter and its mathematical derivation

Bloom filter calculator (best used together with the derivation recommended above)

Use cases of Bloom filters for common interview questions (massive data)

Origin blog.csdn.net/lin_keys/article/details/105959545