See through the Redis series (4): Bloom filter in detail

Redis series articles:

See through the Redis series (1): Redis installation under Linux

See through the Redis series (2): Detailed usage of the six Redis data types

See through the Redis series (3): Redis pipelines, publish/subscribe, transactions, and expiration time in detail

See through the Redis series (4): Bloom filter in detail

See through the Redis series (5): RDB and AOF persistence in detail

See through the Redis series (6): Master-slave replication in detail

See through the Redis series (7): The sentinel mechanism in detail

See through the Redis series (8): Clusters in detail

See through the Redis series (9): The Redis proxies twemproxy and predixy in detail

See through the Redis series (10): The Redis memory model in detail

See through the Redis series (11): The Jedis and Lettuce clients in detail


This post mainly introduces how to implement Bloom filters with Redis. Before getting to the filter itself, we first look at why you would want to use one.

Bloom filter application scenarios

  • Solve the problem of cache penetration

Normally, a query checks the cache first and falls back to the database only when the cache has no entry. If the requested data does not exist in the database either, nothing is ever cached, so every such query reaches the database. This is cache penetration. When a large number of requests ask for data that does not exist, they put pressure on the database and can even bring it down.

A Bloom filter can solve cache penetration: store the keys of all existing data in the Bloom filter. When a request arrives, first check whether its key is in the filter. If it is not, return immediately; if it is, query the cache and then, if necessary, the database.

  • Blacklist verification

The idea is to check whether an item is on a blacklist and, if it is, perform a specific action. For example, to identify spam, any message whose sender address is on the blacklist is treated as spam. If the blacklist holds hundreds of millions of entries, storing it directly consumes a lot of space. A Bloom filter is a better fit: put every blacklisted address into the filter, and when an email arrives, just check whether its sender address is in the filter.
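The cache penetration defence described above (check the filter first, then the cache, then the database) can be sketched as follows. This is only an illustration under stated assumptions: the class name PenetrationGuard and all of its fields are hypothetical, a HashSet stands in for the Bloom filter, and plain HashMaps stand in for the cache and the database so that the example runs without Redis.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class PenetrationGuard {
    // Stand-ins: a Set plays the Bloom filter, Maps play the cache and the database.
    final Set<String> filter = new HashSet<>();
    final Map<String, String> cache = new HashMap<>();
    final Map<String, String> database = new HashMap<>();
    int databaseHits = 0;   // counts how often the "database" is actually queried

    void store(String key, String value) {
        database.put(key, value);
        filter.add(key);    // every existing key is registered in the filter
    }

    Optional<String> lookup(String key) {
        if (!filter.contains(key)) {
            return Optional.empty();   // definitely absent: skip cache and database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        databaseHits++;
        String value = database.get(key);
        if (value != null) {
            cache.put(key, value);     // populate the cache for later requests
        }
        return Optional.ofNullable(value);
    }

    public static void main(String[] args) {
        PenetrationGuard guard = new PenetrationGuard();
        guard.store("user:1", "bobo");
        System.out.println(guard.lookup("user:404").isPresent()); // missing key, database untouched
        System.out.println(guard.lookup("user:1").get());
        System.out.println(guard.databaseHits);
    }
}
```

With a real Bloom filter in place of the HashSet, a small fraction of missing keys would still pass the first check and reach the database, but the bulk of the requests for nonexistent data would be rejected up front.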

**Scenario 1:** There is an existing set of 1 billion numbers, and 100,000 new numbers arrive. How can we quickly and accurately determine whether each of these 100,000 numbers is in the set of 1 billion?

Solution 1: Store the 1 billion numbers in a database and query it. The results are accurate, but the queries are slow.

Solution 2: Keep the 1 billion numbers in memory, for example in a Redis cache. The memory needed is about 1 billion * 8 bytes = 8 GB. In-memory lookups are both accurate and fast, but spending roughly 8 GB of memory on this is wasteful.

**Scenario 2:** On a shopping website, a customer types a product name into the search bar. The site first needs to determine whether that product exists in the database, and only runs the database query if it does.

So for a large data set like these, how can we determine quickly, and with high accuracy, whether an item is in the set without using much memory? This is the problem the Bloom filter was designed for.
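To put a rough number on the memory saving for Scenario 1, the sketch below compares the 8 GB of raw storage with the size of a Bloom filter. It relies on the standard textbook sizing formulas m = -n * ln(p) / (ln 2)^2 for the number of bits and k = (m / n) * ln 2 for the number of hash functions, which are not given in this article. The 1% false-positive rate and the class name BloomSizing are assumptions made for illustration.

```java
class BloomSizing {
    // Bits needed for n elements at false-positive rate p: m = -n * ln(p) / (ln 2)^2.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions that minimizes false positives: k = (m / n) * ln 2.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L;   // 1 billion numbers, as in Scenario 1
        double p = 0.01;           // assumed acceptable false-positive rate
        long rawBytes = n * 8;     // 8 bytes per number, as in Solution 2
        long bits = optimalBits(n, p);
        System.out.printf("raw storage:  %.2f GB%n", rawBytes / 1e9);
        System.out.printf("bloom filter: %.2f GB, %d hash functions%n",
                bits / 8 / 1e9, optimalHashes(n, bits));
    }
}
```

Under these assumptions the filter needs roughly 1.2 GB instead of 8 GB, at the price of about one false positive per hundred lookups of absent numbers.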

Introduction to Bloom Filter

With these questions in mind, let's look at what a Bloom filter actually is.

Bloom filter: a data structure built on a long binary vector, which can be viewed as an array of bits, together with a set of hash functions. Each position stores either 0 or 1, and every position starts at 0.


1. Add data

We said above that a Bloom filter can be viewed as an array of bits, so how do we add an item to it?

To add an element key to the Bloom filter, we run it through several hash functions. Each hash function yields a position in the array, and the bit at each of those positions is set to 1.

For example, if hash1(key)=1, the bit at index 1 (the second cell, since the array is indexed from 0) is changed from 0 to 1; if hash2(key)=7, the bit at index 7 (the eighth cell) is set to 1; and so on for the remaining hash functions.


2. Determine whether data exists

Now that we know how to add an item to the Bloom filter, how do we judge whether a given item is in it?

It is simple: run the item through the same hash functions and check whether the bit at every resulting position is 1. If any of those bits is 0, the item is definitely not in the Bloom filter.

Conversely, if every one of those bits is 1, can we be sure that the item is in the Bloom filter?

The answer is no. Different items can hash to the same positions, so the bits in question may have been set to 1 by other items.

The conclusion: a Bloom filter can tell you that an item is definitely not present, but it cannot tell you that an item is definitely present.
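The add and lookup steps above can be sketched with java.util.BitSet. This is a minimal illustration rather than a production implementation: the class name SimpleBloomFilter is hypothetical, and the seeded FNV-1a style hash is just one convenient way to derive several hash functions; any set of reasonably independent hashes would do.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;        // number of bits in the array
    private final int hashCount;   // number of hash functions

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Hypothetical seeded hash (FNV-1a style); maps a key to an index in [0, size).
    private int position(String key, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return Math.floorMod(h, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));   // set the bit chosen by each hash function
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;   // a 0 bit proves the key was never added
            }
        }
        return true;            // all bits are 1: probably present, but not certain
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("bobo");
        System.out.println(filter.mightContain("bobo")); // prints "true"
    }
}
```

Shrinking the filter makes the false-positive case easy to see: with new SimpleBloomFilter(1, 1) every key maps to the single bit, so after one add, mightContain returns true for any key at all.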

3. Advantages and disadvantages of bloom filter

Advantages: the bit array takes up very little memory, and both insertion and lookup are fast.

Disadvantages: the false-positive rate rises as more data is added; the filter can never confirm that an item is definitely present; and, importantly, items cannot be deleted, because clearing a bit might also erase another item that shares it.
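How quickly the false-positive rate grows can be estimated with the usual approximation p ≈ (1 - e^(-k*n/m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted items. The formula is standard but is not taken from this article, and the values of m and k below, as well as the class name FalsePositiveRate, are arbitrary choices for illustration.

```java
class FalsePositiveRate {
    // Approximate false-positive probability for m bits, k hash functions, n inserted items.
    static double estimate(long m, int k, long n) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 10_000;   // assumed bit-array size
        int k = 7;         // assumed number of hash functions
        // The estimate climbs steadily as more items are inserted into the same array.
        for (long n : new long[] {500, 1_000, 2_000, 4_000}) {
            System.out.printf("n=%d -> %.4f%n", n, estimate(m, k, n));
        }
    }
}
```

The practical consequence is that a filter should be sized for the number of items it is expected to hold; once that number is exceeded, the error rate climbs well above the target.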

Redis implements Bloom filter

In Redis, a Bloom filter can be implemented on top of bitmaps.

bitmap

We know that computers use binary bits as the basic unit of underlying storage, and one byte is equal to 8 bits.

For example, the string "big" consists of three characters whose ASCII codes are 98, 105, and 103. In binary they are stored as follows:

b = 01100010, i = 01101001, g = 01100111

In Redis, the bitmap commands operate on the individual bits of a string like the one above.

Set a bit

setbit key offset value
127.0.0.1:6379> set k1 big
OK
127.0.0.1:6379> setbit k1 7 1
(integer) 0
127.0.0.1:6379> get k1
"cig"
127.0.0.1:6379> 

The binary representation of "b" is 0110 0010. Setting bit 7 (counting from 0, starting at the leftmost bit) to 1 gives 0110 0011, which is the character "c", so the string "big" becomes "cig".
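The same bit arithmetic can be reproduced in plain Java, which makes the bit numbering explicit: offset 0 is the most significant bit of the first byte. The class SetBitSketch is a hypothetical helper for illustration only; unlike Redis, it does not grow the string when the offset lies past the end.

```java
import java.nio.charset.StandardCharsets;

class SetBitSketch {
    // Mimics Redis bit numbering: offset 0 is the most significant bit of byte 0.
    static byte[] setBit(byte[] data, int offset, boolean value) {
        byte[] copy = data.clone();
        int mask = 0x80 >> (offset % 8);   // select the bit inside the target byte
        if (value) {
            copy[offset / 8] |= mask;
        } else {
            copy[offset / 8] &= ~mask;
        }
        return copy;
    }

    public static void main(String[] args) {
        byte[] big = "big".getBytes(StandardCharsets.US_ASCII);
        byte[] result = setBit(big, 7, true);
        System.out.println(new String(result, StandardCharsets.US_ASCII)); // prints "cig"
    }
}
```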

Count the bits set to 1 in a specified range

bitcount key [start end]

If no range is given, all bits set to 1 in the string are counted.

Note: start and end are byte offsets, not bit offsets.

127.0.0.1:6379> set k1 big
OK
127.0.0.1:6379> bitcount k1
(integer) 12
127.0.0.1:6379> bitcount k1 0 0
(integer) 3
127.0.0.1:6379> bitcount k1 0 1
(integer) 7
127.0.0.1:6379> 
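The counts above can be checked by hand: "b" (01100010) has 3 bits set, "i" (01101001) has 4, and "g" (01100111) has 5. The following sketch reproduces the three results in Java. BitCountSketch is a hypothetical helper that only handles non-negative byte indexes, whereas the Redis command also accepts negative ones.

```java
import java.nio.charset.StandardCharsets;

class BitCountSketch {
    // Counts the 1 bits in bytes start..end (inclusive), like "bitcount key start end"
    // with non-negative byte indexes.
    static int bitCount(byte[] data, int start, int end) {
        int count = 0;
        for (int i = start; i <= end && i < data.length; i++) {
            count += Integer.bitCount(data[i] & 0xFF);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] big = "big".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bitCount(big, 0, big.length - 1)); // 12
        System.out.println(bitCount(big, 0, 0));              // 3
        System.out.println(bitCount(big, 0, 1));              // 7
    }
}
```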

Redis install Bloom filter module

1. Visit the GitHub repository and download the module source code

https://github.com/RedisBloom/RedisBloom

Use git clone directly or download the zip

git clone https://github.com/RedisBloom/RedisBloom.git

2. Execute make to compile the dynamic library

cd RedisBloom
make

After the execution is complete, a redisbloom.so dynamic library will be generated

3. Start redis to load the dynamic library

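# I prefer to put the library in the Redis installation directory; this step is optional
sudo cp redisbloom.so /opt/redis/
# Stop the running Redis process first (replace pid with the actual process ID)
sudo kill -9 pid
# Start Redis with the module loaded
redis-server --loadmodule /opt/redis/redisbloom.so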

When the server starts, its log output shows that the module has been loaded.

You can then connect with the redis-cli client and test it.

Redis uses Bloom filters

1. Commonly used commands

bf.add add element

bf.exists query whether the element exists

bf.madd add multiple elements at once

bf.mexists query whether multiple elements exist at once

127.0.0.1:6379> bf.add k1 1
(integer) 1
127.0.0.1:6379> bf.add k1 2
(integer) 1
127.0.0.1:6379> bf.exists k1 1
(integer) 1
127.0.0.1:6379> bf.exists k1 5
(integer) 0
127.0.0.1:6379>

2. Bloom filter accuracy rate

Two values in Redis determine the accuracy of the Bloom filter:

error_rate: the false-positive rate the Bloom filter is allowed to have. The lower the value, the larger the filter's bit array and the more space it occupies.

initial_size: the number of elements the Bloom filter is expected to store (the RedisBloom documentation calls this parameter capacity). When the number of stored elements exceeds this value, the accuracy of the filter decreases.

Redis provides a command to set these two values:

bf.reserve test 0.01 100 

The first value is the name of the filter.

The second value is the value of error_rate.

The third value is the value of initial_size.

Note that if you want custom parameters, you must create the filter explicitly with bf.reserve before the first bf.add. If the key already exists, bf.reserve reports an error. Keep in mind that the lower the error rate, the more space is required. If bf.reserve is not used, the filter is created on the first bf.add with the default error_rate of 0.01 and the default initial_size of 100.

3. Use in the project

3.1: JReBloom client

Import package

<dependency>
        <groupId>com.redislabs</groupId>
        <artifactId>jrebloom</artifactId>
        <version>1.0.2</version>
</dependency>

The JAR contains only three classes, and its support for connection methods and data types is limited.

Code:

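// Assumes JReBloom's client class: import io.rebloom.client.Client;
// Arguments: host, port, connection timeout, connection pool size.
Client client = new Client(redisProperties.getHost(), redisProperties.getPort(), 10000, 100);
// Add "123" to the filter stored under the key "bobo", then test membership.
client.add("bobo", "123");
boolean bo = client.exists("bobo", "123");
System.out.println(bo);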

3.2: BloomFilter in Guava

Google's Guava library provides a BloomFilter class. It lives in the application's own JVM memory rather than in Redis, so it is not shared between processes.

Import package

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>22.0</version>
</dependency>

Code:

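import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Expected number of insertions and target false-positive probability.
private static final int size = 1000000;
private static final BloomFilter<String> bloomFilter =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), size, 0.0001);

public void test2() {
    String bo = "bobo";
    bloomFilter.put(bo);
    // An element that has been put is always reported as possibly present.
    System.out.println(bloomFilter.mightContain(bo));
}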

Origin blog.csdn.net/u013277209/article/details/112376005