Bloom filter in Redis

What is a Bloom filter?

The Bloom filter was proposed by Burton Howard Bloom in 1970.

It consists of a very long bit vector and a set of hash functions that map each element to positions in that vector.

A Bloom filter can be used to test whether an element is in a set.

Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

Basic concepts

To judge whether an element exists in a set, the usual approach is to store all the elements and then compare against them. Linked lists, trees, and similar data structures all work this way, but as the number of elements grows, the space required grows with it and lookups become slower and slower.

However, there is another approach, borrowed from the hash table: a hash function can map an element to a position in a bit array. To check whether the element is in the set, we only need to see whether that position is 1. This is the basic idea of the Bloom filter.
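The idea of mapping an element to positions in a bit array can be sketched in a few lines of plain Java. This is only an illustration, not the RedisBloom implementation: the class name SimpleBloomFilter, the FNV-1a style hash, and the double-hashing scheme used to derive several positions are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal in-memory Bloom filter sketch (hypothetical; not how RedisBloom is implemented). */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;    // m: length of the bit array
    private final int numHashes;  // k: positions set per element

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a style 32-bit hash; the seed is mixed into the offset basis. */
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    /** i-th bit position of an item, derived as h1 + i * h2 (double hashing). */
    private int position(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9) | 1; // odd, so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Sets the k bits that belong to the item. */
    void add(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(data, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20, 7);
        for (int i = 0; i < 100000; i++) {
            filter.add("javaboy-" + i);
        }
        // Always true: the item was added, and a Bloom filter has no false negatives.
        System.out.println(filter.mightContain("javaboy-1"));
        // Usually false, but a false positive is possible.
        System.out.println(filter.mightContain("javaboy-999999"));
    }
}
```

Because add only ever sets bits, an element that was added can never be reported as absent; an element that was never added is reported as present only if other elements happen to have set all of its positions.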

A Bloom filter is well suited to deduplication. It uses far less space than caching every element, but that saving comes with an inherent disadvantage: the result is not perfectly precise.

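We can use the bf.exists command of the Bloom filter to judge whether a certain value exists. This judgment is not perfectly accurate: if it reports that a value does not exist, the value definitely does not exist, but if it reports that a value exists, the value may in fact not exist.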

Bloom Filter usage scenarios

Bloom filters can reduce disk I/O and network requests. As soon as the filter reports that a value definitely does not exist, the request can return immediately without any further lookups.

  • Cache penetration: use the Bloom filter to check whether the data exists, and return immediately if it does not
  • Mass data deduplication

Disadvantages of Bloom Filter

  • False positives: the false-positive rate can be tuned to whatever the business can tolerate.
  • Difficulty in deletion: an element placed in the Bloom filter is mapped to several positions in the bit array, which are set to 1. When deleting, these bits cannot simply be reset to 0, because other elements may share them and their results would be affected.
  • Improper use of Bloom filters may produce large values and increase the risk of blocking Redis. In production, a bulky Bloom filter should be split. (General idea: split it into multiple small bit sets, and make sure all hash functions of a key fall on the same bit set rather than being scattered across different ones.)
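One way to picture the splitting idea from the last point is to route each item to a single small filter before adding or checking it. The sketch below is hypothetical: the class name BloomShardRouter, the "baseKey:shard" naming, and the choice of CRC32 are assumptions for illustration. Since both bf.add and bf.exists would then be issued against the returned key, every hash position of an item lands in the same bit set.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper that picks which small Bloom filter owns an item. */
final class BloomShardRouter {

    /** Returns the Redis key of the sub-filter for this item, e.g. "users:5". */
    static String shardKey(String baseKey, String item, int shardCount) {
        CRC32 crc = new CRC32();
        crc.update(item.getBytes(StandardCharsets.UTF_8));
        long shard = crc.getValue() % shardCount; // getValue() is never negative
        return baseKey + ":" + shard;
    }

    public static void main(String[] args) {
        // The same item always maps to the same sub-filter key,
        // so it is added to and looked up in one bit set only.
        String key = shardKey("users", "javaboy-42", 16);
        System.out.println(key);
        System.out.println(key.equals(shardKey("users", "javaboy-42", 16)));
    }
}
```

The shard count has to stay fixed once data has been written, because changing it would send existing items to a different sub-filter on lookup.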

How to use Bloom Filter?

Bloom Filter installation

Compile and install

# Install git
yum install git
# Fetch the Bloom filter module
git clone https://github.com/RedisBloom/RedisBloom.git
# Enter the cloned directory
cd RedisBloom/
# Compile
make
# Start Redis with the module loaded
redis-server redis.conf --loadmodule /opt/redis-5.0.7/RedisBloom/redisbloom.so

After the installation is complete, enter redis-cli and execute the bf.add command to test whether the installation is successful.

Because passing this long parameter on every start is tedious, we can instead set it in the redis.conf configuration file.

################################## MODULES #####################################

# Load modules at startup. If the server is not able to load modules
# it will abort. It is possible to use multiple loadmodule directives.
#
# loadmodule /path/to/my_module.so
# loadmodule /path/to/other_module.so
loadmodule /opt/myredis/RedisBloom/redisbloom.so

The last line is the one we need to add. Just point it at the directory where your own redisbloom.so file is located; next time, Redis can be started directly with
redis-server redis.conf

Basic usage

Commonly used commands are divided into two categories: adding elements and determining whether an element exists.

  • bf.add / bf.madd: add one value / add values in batch
  • bf.exists / bf.mexists: check whether one value exists / check values in batch

How Jedis uses Bloom Filter

First add dependencies:

<dependency>    
	<groupId>com.redislabs</groupId>    
	<artifactId>jrebloom</artifactId>    
	<version>1.2.0</version> 
</dependency>

Then use jedis to test:

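import io.rebloom.client.Client;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.JedisPool;

public class BloomFilter {
    public static void main(String[] args) {
        GenericObjectPoolConfig config = new GenericObjectPoolConfig();
        config.setMaxIdle(30);
        config.setMaxTotal(200);
        config.setMaxWaitMillis(30000);
        config.setTestOnBorrow(true);
        // Connection pool: host, port, timeout in milliseconds, password
        JedisPool jedisPool = new JedisPool(config, "192.168.253.100", 6379, 30000, "javaboy");
        Client client = new Client(jedisPool);
        // 1. Store the data
        for (int i = 0; i < 100000; i++) {
            client.add("name", "javaboy-" + i);
        }
        // 2. Check whether a value exists. "javaboy-999999" was never added,
        //    so this normally prints false; a false positive is possible.
        boolean exists = client.exists("name", "javaboy-999999");
        System.out.println(exists);
    }
}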

By default, the error rate of the Bloom filter is 0.01 and the default capacity is 100 elements. Both parameters can be configured.

They can be configured with the bf.reserve command. (Note that name must be a key that does not exist yet; bf.reserve returns an error if the key already exists.)

BF.RESERVE name 0.0001 100000

The first parameter is the key and the second is the error rate; the lower the error rate, the more space is used. The third parameter is the estimated number of elements to store. When the actual number exceeds the estimate, the error rate increases.
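To get a feel for how the error rate drives the space used, the textbook sizing formulas for a Bloom filter can be evaluated directly: m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions, for n elements and error rate p. The sketch below is an estimate under those standard formulas; the class name BloomSizing is hypothetical, and the memory RedisBloom actually allocates may differ.

```java
/** Textbook Bloom filter sizing; an estimate, not RedisBloom's exact allocation. */
final class BloomSizing {

    /** Number of bits m for n expected elements and target error rate p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions k for n elements stored in m bits. */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 100000; // estimated quantity, as in the BF.RESERVE example
        for (double p : new double[] {0.01, 0.0001}) {
            long m = optimalBits(n, p);
            System.out.printf("p=%s -> about %d KiB, k=%d%n",
                    p, m / 8 / 1024, optimalHashes(n, m));
        }
    }
}
```

Under these formulas, lowering the error rate from 0.01 to 0.0001 for the same 100000 elements roughly doubles the number of bits, which matches the rule that a lower error rate costs more space.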

Cache breakdown scenario

For example, in a high-concurrency scenario, multiple threads query the same resource at the same time. If the resource is not in the cache, all of these threads query the database, which puts great pressure on the database and makes the cache pointless.

For example, suppose there are 100 million user records. Querying the database directly for every request is extremely inefficient and puts heavy pressure on the database, so requests are first served from Redis (hot, frequently accessed users are stored in Redis). Only when the user is not found in Redis does the request go to the database.

Now suppose there are malicious requests that carry many non-existent users. Each request first checks Redis; because the user is not in Redis, it falls through to the database. If tens of millions of such requests suddenly hit our database, the database will be overwhelmed, causing an avalanche effect.

To solve this problem, we can store the 100 million users in a Bloom filter. When a request arrives, first check the Bloom filter: if the user may exist, go on to query the database; otherwise reject the request directly. With an error rate of 0.01, for example, the Bloom filter rejects about 99% of requests for non-existent users.
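The request flow described above can be sketched with in-memory stand-ins, so that it runs without Redis or a database. Everything here is hypothetical: PenetrationGuard, its fields, and the ordering (Bloom filter, then cache, then database) are one possible arrangement. In a real setup the predicate would be a bf.exists call and the map would be Redis.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical lookup path: Bloom filter first, then cache, then database. */
final class PenetrationGuard {
    private final Predicate<String> mightExist;       // stand-in for bf.exists
    private final Function<String, String> database;  // stand-in for a DB query; null if absent
    private final Map<String, String> cache = new HashMap<>(); // stand-in for Redis
    int databaseHits = 0;                              // counts queries that reach the database

    PenetrationGuard(Predicate<String> mightExist, Function<String, String> database) {
        this.mightExist = mightExist;
        this.database = database;
    }

    Optional<String> findUser(String userId) {
        // 1. The filter says "definitely absent": reject without touching cache or DB.
        if (!mightExist.test(userId)) {
            return Optional.empty();
        }
        // 2. Possibly present: try the cache.
        String cached = cache.get(userId);
        if (cached != null) {
            return Optional.of(cached);
        }
        // 3. Cache miss: query the database (a false positive also ends up here).
        databaseHits++;
        String row = database.apply(userId);
        if (row != null) {
            cache.put(userId, row);
        }
        return Optional.ofNullable(row);
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("u1", "Alice");
        // An exact key check stands in for the Bloom filter here, so there are no false positives.
        PenetrationGuard guard = new PenetrationGuard(users::containsKey, users::get);
        System.out.println(guard.findUser("no-such-user").isPresent()); // false
        System.out.println(guard.databaseHits);                          // 0
        System.out.println(guard.findUser("u1").orElse("not found"));   // Alice
    }
}
```

Requests for users the filter has never seen return before step 2, which is what keeps a flood of non-existent IDs away from the database.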

Origin blog.csdn.net/qq_43647359/article/details/105848980