[Data Structure and Algorithm] Bloom Filter

1. What is a Bloom filter

The Bloom filter was proposed by Burton Bloom in 1970. It consists of a long bit vector and a set of hash functions that map elements onto it, and it is used to test whether an element belongs to a set. Its advantage is that it is far more space-efficient and faster to query than general-purpose structures; its disadvantages are a certain false positive rate and the difficulty of deleting elements.

The paragraph above is a fairly complete description of a Bloom filter. If it is still hard to picture, think of the Bloom filter as a set: we add elements with add and check whether an element is present with contains. Since this article covers the Bloom filter together with Redis, the Set data structure in Redis makes a convenient analogy, and the Bloom filter commands in Redis are very similar to the Set commands (discussed later).

Before studying the Bloom filter, it is worth going over its advantages and disadvantages, so we know what we are getting and what we are giving up!

  • Advantages of Bloom filters:

    • Low time complexity: adding and querying an element are both O(k), where k is the number of hash functions (usually small)
    • Good confidentiality: the Bloom filter does not store the elements themselves
    • Small storage footprint: if some false positives are acceptable, the Bloom filter saves a great deal of space compared with structures such as a Set
  • Disadvantages of Bloom filters:

    • There is a certain false positive rate, but it can be reduced by adjusting the parameters
    • Can't get the element itself
    • It is difficult to remove elements

2. Usage scenarios of the Bloom filter

A Bloom filter tells us that something "definitely does not exist or may exist". In other words, if the Bloom filter says an element does not exist, it definitely does not exist; if it says the element exists, the element may still be absent (a false positive, discussed later). This property makes many interesting things possible:

  • Solving the Redis cache penetration problem (a frequent interview topic)
  • Mail filtering: using a Bloom filter as the mail blacklist
  • Crawler URL filtering: URLs that have already been crawled are not crawled again
  • Recommendation de-duplication: news items that were already recommended are not recommended again (similar to Douyin, where videos you have already swiped past do not show up again as you keep scrolling)
  • Databases such as HBase, RocksDB, and LevelDB have built-in Bloom filters to check whether data exists, which reduces the database's I/O requests

3. The principle of Bloom filter

3.1 Data structure

A Bloom filter is essentially a long bit vector plus a set of hash functions. Taking the implementation in Redis as an example, its underlying structure is a large bit array (binary array) plus multiple unbiased hash functions.
A large bit array (binary array):
Every position in the array holds either 0 or 1, and all positions start out as 0.

Multiple unbiased hash functions:
An unbiased hash function is one that distributes the hash values of elements uniformly, so that the computed array subscripts are spread evenly across the bit array.

In a simple Bloom filter, each added element (k1 and k2, for instance) is passed through every unbiased hash function (a, b, and c, for instance), and each result selects one position in the underlying binary array.

3.2 Space calculation

Before elements are added, the Bloom filter's space has to be initialized, that is, the binary array mentioned above, and the number of unbiased hash functions has to be determined. The Bloom filter takes two parameters: the expected number of elements n and the acceptable error rate f. From these two parameters it calculates the size l of the binary array and the number k of unbiased hash functions.
The relationship between them is fairly simple:

  • The lower the error rate, the longer the bit array and the more space it occupies
  • The lower the error rate, the more unbiased hash functions and the longer the computation time

A free online Bloom filter calculator is available at:

https://krisives.github.io/bloom-calculator/
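
The relationship between n, f, l, and k can also be sketched in Java with the commonly used textbook formulas l = -n * ln(f) / (ln 2)^2 and k = (l / n) * ln 2. This is only a sketch: the class and method names are made up for illustration, and a particular implementation (including the one in Redis) may round or size things differently.

```java
/** Sizing sketch: derives the bit-array length l and hash count k from n and f. */
final class BloomSizing {

    /** Bit-array length l = -n * ln(f) / (ln 2)^2, truncated to a whole number of bits. */
    static long optimalNumBits(long n, double f) {
        return (long) (-n * Math.log(f) / (Math.log(2) * Math.log(2)));
    }

    /** Hash-function count k = (l / n) * ln 2, rounded to the nearest integer, at least 1. */
    static int optimalNumHashFunctions(long n, long l) {
        return Math.max(1, (int) Math.round((double) l / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000_000L;   // expected number of elements
        double f = 0.01;        // acceptable error rate
        long l = optimalNumBits(n, f);
        int k = optimalNumHashFunctions(n, l);
        System.out.println("l = " + l + " bits, k = " + k);
    }
}
```

For n = 10,000,000 and f = 0.01 these formulas give about 95.85 million bits (roughly 11.4 MiB) and k = 7, which agrees with the Guava figures quoted in section 6.5.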

3.3 Adding elements

To add an element to the Bloom filter, the key is run through the k unbiased hash functions to obtain k hash values, each hash value is taken modulo the array length to obtain an array subscript, and the value at each of those subscripts is set to 1:

  • Compute k hash values with the k unbiased hash functions
  • Take each hash value modulo the array length to get an array index
  • Set the value at each computed array index to 1

For example, take key = Liziba and k = 3 unbiased hash functions, hash1, hash2, and hash3. The three hash functions yield three array subscripts, and the values at those three positions are set to 1.
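
The three steps above can be sketched as follows. The seeded FNV-1a-style hash used here is only a stand-in chosen for illustration (it is not the hash function Redis uses), and the class name, the helper names, and the 64-bit array length are likewise assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Sketch of the "add" path: k hash values -> k array indexes -> set those bits to 1. */
final class BloomAddSketch {

    /** Hypothetical helper: a seeded FNV-1a-style hash standing in for the i-th hash function. */
    static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    /** Computes the k array indexes for a key: hash_i(key) mod length. */
    static int[] indexes(String key, int k, int length) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // floorMod keeps the index non-negative even when the hash value is negative
            result[i] = Math.floorMod(hash(data, i), length);
        }
        return result;
    }

    public static void main(String[] args) {
        int length = 64;                    // tiny array, for illustration only
        BitSet bits = new BitSet(length);
        int[] indexes = indexes("Liziba", 3, length);
        for (int index : indexes) {
            bits.set(index);                // set the value at each computed index to 1
        }
        System.out.println("indexes  = " + Arrays.toString(indexes));
        System.out.println("bits set = " + bits);
    }
}
```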

3.4 Query elements

The main use of the Bloom filter is to decide that something definitely does not exist or may exist, and that is exactly what a query returns. The query process is as follows:

  • Compute k hash values with the k unbiased hash functions
  • Take each hash value modulo the array length to get an array index
  • Check whether the values at all of those indexes are 1. If they are all 1, the element exists (possibly a false positive); if any of them is 0, the element definitely does not exist

False positives are easy to understand. No matter how good the hash functions are, hash collisions cannot be avoided completely, so different elements can end up with the same array indexes after the modulo step. For example, suppose the hash values of Liziba and Liziqi both map to array index 1, but only Liziba has actually been added. If we now ask whether Liziqi is present, we get a false positive! False positives are therefore the biggest shortcoming of the Bloom filter, and they are easy to understand once you know how the existence check works.
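
A minimal in-memory sketch of the query path is shown below. It is not the Redis implementation: the class name, the seeded stand-in hash, and the sizes are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal in-memory Bloom filter sketch showing the query path. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int length;   // bit-array length
    private final int k;        // number of hash functions

    SimpleBloomFilter(int length, int k) {
        this.bits = new BitSet(length);
        this.length = length;
        this.k = k;
    }

    /** Hypothetical seeded hash standing in for the i-th unbiased hash function. */
    private int index(String key, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, length);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(index(key, i));
        }
    }

    /** false: definitely never added; true: possibly added (may be a false positive). */
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) {
                return false;   // a single 0 proves the key was never added
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 10, 3);
        filter.add("Liziba");
        System.out.println(filter.mightContain("Liziba"));   // always true: no false negatives
        System.out.println(filter.mightContain("Liziqi"));   // usually false, but true is possible
    }
}
```

Shrinking the array or adding many more keys makes it more likely that an absent key finds all of its positions already set to 1, which is the false positive described above.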

3.5 Modifying elements


Not supported. A Bloom filter does not store the elements themselves, so there is nothing to modify.

3.6 Deleting elements

Bloom filters do not support deleting elements well, although some Bloom filter variants do. The reason is easy to see: because of hash collisions, one position in the array may be shared by several elements, so clearing it to delete one element could make the filter report that other elements are absent.
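
One well-known variant that does support deletion is the counting Bloom filter, which replaces each bit with a small counter. The sketch below only illustrates the idea; the class name and the stand-in hash are assumptions, and it is not how the Redis module is implemented.

```java
import java.nio.charset.StandardCharsets;

/** Counting Bloom filter sketch: counters instead of bits, so removal becomes possible. */
final class CountingBloomFilter {
    private final int[] counters;
    private final int k;   // number of hash functions

    CountingBloomFilter(int length, int k) {
        this.counters = new int[length];
        this.k = k;
    }

    /** Hypothetical seeded hash standing in for the i-th unbiased hash function. */
    private int index(String key, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, counters.length);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) {
            counters[index(key, i)]++;
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (counters[index(key, i)] == 0) {
                return false;
            }
        }
        return true;
    }

    /**
     * Decrements the key's counters. Only safe for keys that were really added:
     * removing a false positive would corrupt the counters of other keys.
     */
    boolean remove(String key) {
        if (!mightContain(key)) {
            return false;
        }
        for (int i = 0; i < k; i++) {
            counters[index(key, i)]--;
        }
        return true;
    }

    public static void main(String[] args) {
        CountingBloomFilter filter = new CountingBloomFilter(256, 3);
        filter.add("Liziba");
        filter.add("Liziqi");
        filter.remove("Liziba");
        System.out.println(filter.mightContain("Liziqi"));   // still true after removing Liziba
    }
}
```

The price is space: each position needs several bits for its counter instead of a single bit.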

4. Integrating the Bloom filter into Redis

4.1 Version requirements

  • The recommended version is 6.x, and the minimum version is 4.x. You can check the version with the following command:
redis-server -v

  • For the plug-in, most articles on the Internet recommend v1.1.1. When this article was written, v2.2.6 was already a released version. Choose whichever you prefer (if you do not need the newer features, there is no need to upgrade!)

v1.1.1

https://github.com/RedisLabsModules/rebloom/archive/v1.1.1.tar.gz

v2.2.6

https://github.com/RedisLabsModules/rebloom/archive/v2.2.6.tar.gz

4.2 Installation & Compilation

All of the following steps are carried out in one chosen directory; pick a suitable common directory for installing and managing software.

4.2.1 Download the plug-in archive

wget https://github.com/RedisLabsModules/rebloom/archive/v2.2.6.tar.gz

4.2.2 Decompression

tar -zxvf v2.2.6.tar.gz

4.2.3 Compiling plugins

cd RedisBloom-2.2.6/
make

After the compilation succeeds, you can find the redisbloom.so file in the RedisBloom-2.2.6 directory.

4.3 Redis integration

4.3.1 Redis configuration file modification

  • Add the path of the redisbloom.so file to the redis.conf configuration file with a loadmodule line
  • In a cluster, the path of the redisbloom.so file must be added to every node's configuration file
  • Redis must be restarted after the change
loadmodule /usr/local/soft/RedisBloom-2.2.6/redisbloom.so

redis.conf already contains a preset loadmodule configuration item, so we can edit it in place, which also makes later changes easier.

  • Remember to restart Redis after saving and exiting!

4.3.2 Testing whether the integration succeeded

The main Bloom filter commands in Redis are as follows:

  • bf.add adds an element
  • bf.exists determines whether an element exists
  • bf.madd adds multiple elements
  • bf.mexists determines whether multiple elements exist

Connect with a client and try the commands. If they work, the integration succeeded.

If you get (error) ERR unknown command instead, check the following:

  • SHUTDOWN the Redis instance, restart the instance, and test again
  • Check whether the configuration file is configured with the correct redisbloom.so file address
  • Check if the version of Redis is too low

5. Using the Bloom filter commands in Redis

5.1 bf.add

bf.add adds a single element and returns 1 if the element was added successfully

127.0.0.1:6379> bf.add name liziba
(integer) 1

5.2 bf.madd

bf.madd means adding multiple elements

127.0.0.1:6379> bf.madd name liziqi lizijiu lizishi
1) (integer) 1
2) (integer) 1
3) (integer) 1

5.3 bf.exists

bf.exists checks whether an element exists; it returns 1 if the element exists and 0 if it does not

127.0.0.1:6379> bf.exists name liziba
(integer) 1

5.4 bf.mexists

bf.mexists checks whether multiple elements exist; for each element it returns 1 if the element exists and 0 if it does not

127.0.0.1:6379> bf.mexists name liziqi lizijiu liziliu
1) (integer) 1
2) (integer) 1
3) (integer) 0

6. Using a Bloom filter in Java local memory

There are many ways to use a Bloom filter, and plenty of experienced developers have written their own. Here I use the Bloom filter from the Google Guava package, which is held in local memory.

6.1 Introducing pom dependencies

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>29.0-jre</version>
</dependency>

6.2 Writing test code

package com.lizba.bf;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

/**
 * <p>
 *        Bloom filter test code
 * </p>
 *
 */
public class BloomFilterTest {

    /** Expected number of insertions */
    private static final int expectedInsertions = 10000000;
    /** False positive probability */
    private static final double fpp = 0.01;
    /** Bloom filter */
    private static final BloomFilter<Integer> bloomFilter =
            BloomFilter.create(Funnels.integerFunnel(), expectedInsertions, fpp);

    public static void main(String[] args) {
        // Insert 10 million values: 0 .. expectedInsertions - 1
        for (int i = 0; i < expectedInsertions; i++) {
            bloomFilter.put(i);
        }

        // Probe 10 million values that were never inserted to measure the false positive rate
        int count = 0;
        for (int i = expectedInsertions; i < expectedInsertions * 2; i++) {
            if (bloomFilter.mightContain(i)) {
                count++;
            }
        }
        System.out.println("Total false positives: " + count);
    }

}

6.3 Test results

There were 100075 false positives, which is about 0.01 of the 10 million values probed and very close to the fpp = 0.01 we set.
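
The measured rate can be compared with the usual theoretical estimate p = (1 - e^(-k * n / l))^k. The sketch below is only an illustration: the class and method names are made up, and the bit count and hash count plugged in are the numBits and numHashFunctions values that section 6.5 reports for expectedInsertions = 10000000 and fpp = 0.01.

```java
/** Sketch: theoretical false positive rate p = (1 - e^(-k * n / l))^k. */
final class BloomFalsePositiveRate {

    /**
     * @param n number of inserted elements
     * @param l bit-array length
     * @param k number of hash functions
     */
    static double expectedFpp(long n, long l, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / l), k);
    }

    public static void main(String[] args) {
        double theoretical = expectedFpp(10_000_000L, 95_850_583L, 7);
        double observed = 100_075 / 10_000_000.0;   // false positives / values probed
        System.out.printf("theoretical = %.5f, observed = %.5f%n", theoretical, observed);
    }
}
```

The estimate comes out at roughly 0.01, in line with the 100075 false positives observed.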

6.4 Parameter description

In the BloomFilter source code in the Guava package, constructing a BloomFilter object takes four parameters:

  • Funnel funnel: the data type, specified through the Funnels class
  • long expectedInsertions: the number of values expected to be inserted
  • double fpp: the false positive probability
  • BloomFilter.Strategy strategy: the hashing strategy

6.5 fpp and expectedInsertions

  • When expectedInsertions = 10000000 and fpp = 0.01, the bit array size is numBits = 95850583 and the number of hash functions is numHashFunctions = 7

  • When expectedInsertions = 10000000 and fpp = 0.03, the bit array size is numBits = 72984408 and the number of hash functions is numHashFunctions = 5

  • When expectedInsertions = 100000 and fpp = 0.03, the bit array size is numBits = 729844 and the number of hash functions is numHashFunctions = 5

Based on the three tests above, the following conclusions can be drawn:

  • When the expected number of insertions stays the same, the smaller the false positive rate fpp, the larger the bit array and the more hash functions there are.
  • When the false positive rate stays the same, the larger the expected number of insertions, the larger the bit array, while the number of hash functions does not change (note that this conclusion only applies to the algorithm in Guava's Bloom filter implementation, not necessarily to every algorithm; I ran many tests, and numHashFunctions really does not change when fpp is the same!)

7. Using a Bloom filter with Redis from Java

Cache penetration is a common Redis interview question. Some solutions cache empty objects, but the better solution is a Bloom filter: we use it to check whether an element exists, which avoids querying the cache and the database for data that does not exist in either. In the code below, that check is simply a call to bloomFilter.contains(xxx). What I demonstrate here is again the false positive rate!
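
The read path described above can be sketched without any Redis dependency. Everything in it is hypothetical: the class name, the Predicate standing in for bloomFilter.contains(...), the in-memory map standing in for the cache, and the Function standing in for the database query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch of a read path that uses a Bloom filter check to stop lookups for keys that cannot exist. */
final class CachePenetrationGuard {
    private final Predicate<String> bloomFilter;                 // stand-in for bloomFilter.contains(...)
    private final Map<String, String> cache = new HashMap<>();   // stand-in for the Redis cache
    private final Function<String, String> database;             // stand-in for the database query
    private int databaseQueries = 0;

    CachePenetrationGuard(Predicate<String> bloomFilter, Function<String, String> database) {
        this.bloomFilter = bloomFilter;
        this.database = database;
    }

    String get(String key) {
        if (!bloomFilter.test(key)) {
            return null;                 // definitely absent: skip the cache and the database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        databaseQueries++;               // reached only for keys that may exist
        String value = database.apply(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    int databaseQueries() {
        return databaseQueries;
    }

    public static void main(String[] args) {
        CachePenetrationGuard guard = new CachePenetrationGuard(
                key -> key.equals("user:1"),                      // pretend only user:1 was added
                key -> key.equals("user:1") ? "Liziba" : null);
        System.out.println(guard.get("user:999") + ", database queries: " + guard.databaseQueries());
        System.out.println(guard.get("user:1") + ", database queries: " + guard.databaseQueries());
    }
}
```

A false positive only costs one unnecessary lookup, so the filter never causes existing data to be missed.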

7.1 Introducing pom dependencies

<dependency>
  <groupId>org.redisson</groupId>
  <artifactId>redisson-spring-boot-starter</artifactId>
  <version>3.16.0</version>
</dependency>

7.2 Writing test code

package com.lizba.bf;

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

/**
 * <p>
 *      Using a Bloom filter with Redis from Java to prevent cache penetration
 * </p>
 *
 */
public class RedisBloomFilterTest {

    /** Expected number of insertions */
    private static final int expectedInsertions = 10000;
    /** False positive probability */
    private static final double fpp = 0.01;

    public static void main(String[] args) {
        // Redis connection configuration, no password
        Config config = new Config();
        config.useSingleServer().setAddress("redis://192.168.211.108:6379");
        // config.useSingleServer().setPassword("123456");

        // Initialize the Bloom filter
        RedissonClient client = Redisson.create(config);
        RBloomFilter<Object> bloomFilter = client.getBloomFilter("user");
        bloomFilter.tryInit(expectedInsertions, fpp);

        // Add elements to the Bloom filter
        for (int i = 0; i < expectedInsertions; i++) {
            bloomFilter.add(i);
        }

        // Count false positives among values that were never added
        int count = 0;
        for (int i = expectedInsertions; i < expectedInsertions * 2; i++) {
            if (bloomFilter.contains(i)) {
                count++;
            }
        }
        System.out.println("Number of false positives: " + count);

        // Release the Redisson client so the JVM can exit
        client.shutdown();
    }

}

Origin blog.csdn.net/u011397981/article/details/130690257