Redis series articles:
Thorough Redis series (1): Redis installation under Linux
Thorough Redis series (2): Detailed usage of the six Redis data types
Thorough Redis series (4): Bloom filter in detail
Thorough Redis series (5): Detailed introduction to RDB and AOF persistence
Thorough Redis series (6): Detailed introduction to master-slave replication
Thorough Redis series (7): Detailed introduction to the sentinel mechanism
Thorough Redis series (8): Detailed introduction to clusters
Thorough Redis series (9): Detailed introduction to the Redis proxies twemproxy and predixy
Thorough Redis series (10): Detailed introduction to the Redis memory model
Thorough Redis series (11): Detailed introduction to the Jedis and Lettuce clients
This post explains how to implement a Bloom filter with Redis. Before getting to the filter itself, let's look at why you would want to use one.
Bloom filter application scenarios
- Solve the problem of cache penetration
Normally, a request first checks the cache and falls back to the database only when the cache has no entry. If the requested data does not exist in the database either, nothing is ever cached, so every such request reaches the database. This is cache penetration. When a large number of requests ask for data that does not exist, they put heavy pressure on the database and can even bring it down.
A Bloom filter can solve this problem: store the keys of all existing data in the Bloom filter. When a request arrives, first check whether the key is in the Bloom filter. If it is not, return immediately; if it is, query the cache and then the database as usual.
- Blacklist verification
A blacklist check performs a specific action whenever an item is found on the list. For example, in spam detection, any mail whose sender address is on the blacklist is treated as spam. If the blacklist holds hundreds of millions of entries, storing it directly consumes a lot of space. A Bloom filter is a better solution: put every blacklisted address into the filter, and when an email arrives, just check whether its sender address is in the filter.
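The cache-penetration lookup order described above (Bloom filter, then cache, then database) can be sketched as follows. This is only an illustration under stated assumptions: the class `PenetrationGuard` and its methods are hypothetical names, the cache and database are plain in-memory maps, and a `HashSet` stands in for the Bloom filter (unlike a real Bloom filter, it never reports a false positive).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the lookup order: Bloom filter -> cache -> database.
// All three stores are in-memory stand-ins, not real Redis or database clients.
final class PenetrationGuard {
    // Stand-in for the Bloom filter; a real filter may also return true for absent keys.
    private final Set<String> filter = new HashSet<>();
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database = new HashMap<>();
    int databaseQueries = 0; // counts how often the database is actually hit

    // Store a record and register its key in the filter.
    void save(String key, String value) {
        database.put(key, value);
        filter.add(key);
    }

    // Returns the value, or null when the key does not exist.
    String lookup(String key) {
        if (!filter.contains(key)) {
            return null; // definitely absent: skip both cache and database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        databaseQueries++;
        String value = database.get(key);
        if (value != null) {
            cache.put(key, value); // populate the cache for later requests
        }
        return value;
    }
}
```

With this arrangement, a request for a key that was never saved returns before touching the cache or the database, so repeated requests for missing data no longer reach the database.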
**Scenario 1:** We have an existing set of 1 billion numbers and are given 100,000 numbers. How can we quickly and accurately determine whether each of these 100,000 numbers is in the set of 1 billion?
Solution 1: Store the 1 billion numbers in a database and query it. Accuracy is good, but queries are slow.
Solution 2: Keep the 1 billion numbers in memory, for example in a Redis cache. The memory needed is 1 billion * 8 bytes = 8 GB. In-memory queries are both accurate and fast, but spending about 8 GB of memory on this is wasteful.
**Scenario 2:** On a shopping website, a customer types a product into the search bar. We first need to determine whether the product exists in the database, and run the database query only if it does.
For large data sets like these, how can we quickly determine whether an item belongs to the set without using much memory? This is the problem the Bloom filter was designed to solve.
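To put the 8 GB figure from Scenario 1 in perspective, the sketch below compares it with the size of a Bloom filter for the same 1 billion elements. It assumes the textbook sizing formula m = -n * ln(p) / (ln 2)^2 and a 1% false-positive rate; both are assumptions made for this illustration, and the class name `MemoryEstimate` is hypothetical.

```java
// Rough memory comparison for Scenario 1 (textbook Bloom filter sizing).
final class MemoryEstimate {
    // Bytes needed to hold n raw values of bytesPerValue bytes each.
    static long rawBytes(long n, int bytesPerValue) {
        return n * bytesPerValue;
    }

    // Bits for an optimally sized Bloom filter: m = -n * ln(p) / (ln 2)^2.
    static long bloomBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L;
        double gb = 1e9; // decimal gigabytes, as in the text
        System.out.printf("raw values:   %.2f GB%n", rawBytes(n, 8) / gb);
        System.out.printf("bloom filter: %.2f GB%n", bloomBits(n, 0.01) / 8 / gb);
    }
}
```

Under these assumptions the filter needs roughly 9.6 bits per element, or about 1.2 GB in total, at the price of a small chance of false positives.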
Introduction to Bloom Filter
With these questions in mind, let's look at what a Bloom filter actually is.
Bloom filter: a data structure made up of a long binary vector, which can be viewed as an array of bits. Each position stores either 0 or 1, and every position is initially 0.
As follows:
1. Add data
A Bloom filter can be thought of as a container, so how do we add an item to it?
As shown in the figure below, to add an element key we run it through several hash functions. Each function yields a position in the array, and we set the bit at that position to 1.
For example, in the figure below hash1(key)=1, so the bit at index 1 (the second cell, since the array is indexed from 0) changes from 0 to 1; hash2(key)=7, so the eighth cell is set to 1; and so on for the remaining hash functions.
2. Determine whether the data exists
Now that we know how to add an item, how do we check whether a given item is in the Bloom filter?
It is simple: run the item through the same hash functions and check whether the bits at all the resulting positions are 1. If any of them is not 1, the item is definitely not in the Bloom filter.
Conversely, if every one of those positions is 1, can we be sure that the item is in the Bloom filter?
The answer is no. Different items can hash to the same positions, so a position may have been set to 1 by other items.
The conclusion: a Bloom filter can tell us that an item definitely does not exist, but it cannot tell us that an item definitely exists.
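The add and query steps above can be sketched with a `java.util.BitSet`. This is a minimal illustration, not how Redis or any particular library implements it: the class `SimpleBloomFilter` is hypothetical, and the way the hash functions are derived (FNV-1a combined with `String.hashCode` through double hashing) is an assumption chosen only to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal in-memory Bloom filter; the hashing scheme is an illustrative choice.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the array
    private final int hashCount;  // number of hash functions

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th position for a key, derived from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = 0x811C9DC5; // FNV-1a over the UTF-8 bytes
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = (h1 ^ (b & 0xFF)) * 0x01000193;
        }
        int h2 = key.hashCode() | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // Set every position computed for the key to 1.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: definitely not added; true: possibly added (false positives can happen).
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

An added key always yields `true` from `mightContain`, because `add` set exactly the bits that the query checks. A key that was never added usually yields `false`, but can yield `true` when other keys happen to have set all of its positions.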
3. Advantages and disadvantages of bloom filter
Advantages: the bit array takes up very little memory, and both insertion and lookup are fast.
Disadvantages: the false-positive rate rises as more data is added; the filter cannot confirm that an item definitely exists; and, importantly, items cannot be deleted, because clearing a bit could also affect other items that share it.
Redis implements Bloom filter
In Redis, a Bloom filter can be implemented on top of bitmaps.
bitmap
Computers use the binary bit as the basic unit of storage, and one byte equals 8 bits.
For example, the string "big" consists of three characters whose ASCII codes are 98, 105, and 103. Their binary representations are 01100010, 01101001, and 01100111.
In Redis, Bitmaps provide a set of commands for manipulating the individual bits of a string like the one above.
Set a bit
setbit key offset value
127.0.0.1:6379> set k1 big
OK
127.0.0.1:6379> setbit k1 7 1
(integer) 0
127.0.0.1:6379> get k1
"cig"
127.0.0.1:6379>
The binary representation of "b" is 0110 0010. Setting bit 7 (counting from 0) to 1 gives 0110 0011, which is the character "c", so the string "big" becomes "cig". The reply (integer) 0 is the previous value of that bit.
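The bit arithmetic behind this setbit example can be reproduced locally. The sketch below emulates the command on a plain byte array, with offset 0 being the most significant bit of the first byte. `SetBitDemo` is a hypothetical helper and does not talk to Redis; unlike the real command, it does not grow the array when the offset lies beyond its end.

```java
import java.nio.charset.StandardCharsets;

// Emulates Redis SETBIT on a local byte array; offset 0 is the most significant bit of byte 0.
final class SetBitDemo {
    // Sets the bit at `offset` to `value` (0 or 1) and returns the previous bit.
    static int setBit(byte[] bytes, int offset, int value) {
        int index = offset / 8;            // which byte
        int mask = 1 << (7 - offset % 8);  // which bit inside that byte
        int previous = (bytes[index] & mask) != 0 ? 1 : 0;
        if (value == 1) {
            bytes[index] |= mask;
        } else {
            bytes[index] &= ~mask;
        }
        return previous;
    }

    public static void main(String[] args) {
        byte[] bytes = "big".getBytes(StandardCharsets.US_ASCII);
        int previous = setBit(bytes, 7, 1);
        System.out.println(previous);                                     // 0
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // cig
    }
}
```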
Count the bits set to 1 in a specified range
bitcount key [start end]
If no range is given, all bits set to 1 in the value are counted.
Note: start and end are byte indexes, not bit offsets.
127.0.0.1:6379> set k1 big
OK
127.0.0.1:6379> bitcount k1
(integer) 12
127.0.0.1:6379> bitcount k1 0 0
(integer) 3
127.0.0.1:6379> bitcount k1 0 1
(integer) 7
127.0.0.1:6379>
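The three counts above can be checked by hand: "b" (01100010) has 3 bits set, "i" (01101001) has 4, and "g" (01100111) has 5. The following sketch emulates the byte-based range locally. `BitCountDemo` is a hypothetical helper, and it does not handle the negative indexes that the real command accepts.

```java
import java.nio.charset.StandardCharsets;

// Emulates Redis BITCOUNT with byte-based, inclusive start/end indexes.
final class BitCountDemo {
    // Counts the 1 bits in bytes[start..end]; indexes past the end are ignored.
    static int bitCount(byte[] bytes, int start, int end) {
        int count = 0;
        for (int i = start; i <= end && i < bytes.length; i++) {
            count += Integer.bitCount(bytes[i] & 0xFF);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] big = "big".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bitCount(big, 0, big.length - 1)); // 12
        System.out.println(bitCount(big, 0, 0));              // 3
        System.out.println(bitCount(big, 0, 1));              // 7
    }
}
```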
Installing the Bloom filter module in Redis
1. Visit the GitHub address and download the module source code
https://github.com/RedisBloom/RedisBloom
Use git clone directly or download the zip
git clone https://github.com/RedisBloom/RedisBloom.git
2. Execute make to compile the dynamic library
cd RedisBloom
make
After the execution is complete, a redisbloom.so dynamic library will be generated
3. Start redis to load the dynamic library
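# I usually put the library in the Redis installation directory; this step is optional
sudo cp redisbloom.so /opt/redis/
# Stop the running Redis process first (replace pid with the Redis process ID)
sudo kill -9 pid
# Start Redis with the dynamic library loaded
redis-server --loadmodule /opt/redis/redisbloom.so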
Output like that in the following figure indicates that the module has been loaded:
You can then connect with the redis-cli client and test it.
Using Bloom filters in Redis
1. Commonly used commands
- bf.add: add an element
- bf.exists: check whether an element exists
- bf.madd: add multiple elements at once
- bf.mexists: check whether multiple elements exist at once
127.0.0.1:6379> bf.add k1 1
(integer) 1
127.0.0.1:6379> bf.add k1 2
(integer) 1
127.0.0.1:6379> bf.exists k1 1
(integer) 1
127.0.0.1:6379> bf.exists k1 5
(integer) 0
127.0.0.1:6379>
2. Bloom filter accuracy rate
Two values in Redis determine the accuracy of the Bloom filter:
error_rate: the error rate the Bloom filter is allowed to have. The lower the value, the larger the filter's bit array and the more space it occupies.
initial_size: the number of elements the Bloom filter is expected to store. When the number of elements actually stored exceeds this value, the accuracy of the filter decreases.
Redis provides a command to set these two values:
bf.reserve test 0.01 100
The first argument is the name of the filter.
The second argument is the error_rate.
The third argument is the initial_size.
Note that to use custom values, the filter must be created explicitly with bf.reserve before the first bf.add. If the key already exists, bf.reserve reports an error. The lower the error rate, the more space is required. If bf.reserve is not used, the filter is created on the first add with a default error_rate of 0.01 and a default initial_size of 100.
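To see why exceeding initial_size hurts accuracy, the sketch below estimates the false-positive rate of a filter sized for the values in the bf.reserve example (error rate 0.01, 100 elements) as more elements are inserted. It uses the textbook Bloom filter formulas, which is an assumption: the sizes RedisBloom allocates internally may differ, and the class name `BloomAccuracy` is hypothetical.

```java
// Textbook Bloom filter estimates; RedisBloom's internal sizing may differ.
final class BloomAccuracy {
    // Optimal bit count for n elements at false-positive rate p.
    static long bits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions for m bits and n elements.
    static int hashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Expected false-positive rate after inserting `inserted` elements.
    static double falsePositiveRate(long m, int k, long inserted) {
        return Math.pow(1 - Math.exp(-(double) k * inserted / m), k);
    }

    public static void main(String[] args) {
        long capacity = 100;      // third argument of bf.reserve
        double errorRate = 0.01;  // second argument of bf.reserve
        long m = bits(capacity, errorRate);
        int k = hashes(m, capacity);
        for (long n : new long[] {50, 100, 200, 400}) {
            System.out.printf("n=%d -> estimated error rate %.4f%n",
                    n, falsePositiveRate(m, k, n));
        }
    }
}
```

Under these formulas the filter needs 959 bits and 7 hash functions. The estimated error rate stays close to 1% up to 100 elements and rises quickly once the element count goes past that.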
3. Use in the project
3.1: JReBloom
Import package
<dependency>
    <groupId>com.redislabs</groupId>
    <artifactId>jrebloom</artifactId>
    <version>1.0.2</version>
</dependency>
The JAR contains only three classes, and its support for connection options and data types is limited.
Code:
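// Arguments: host, port, timeout in milliseconds, connection pool size
Client client = new Client(redisProperties.getHost(), redisProperties.getPort(), 10000, 100);
// Add the value "123" to the filter stored under the key "bobo"
client.add("bobo", "123");
// Check whether "123" may be in the filter
boolean bo = client.exists("bobo", "123");
System.out.println(bo);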
3.2: BloomFilter in Guava
Google's Guava library provides a BloomFilter class. It keeps the filter in the application server's own memory rather than in Redis.
Import package
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>22.0</version>
</dependency>
Code:
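import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

private static final int size = 1000000;
// Sized for `size` expected insertions with a 0.01% false-positive rate
private static BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), size, 0.0001);

public void test2() {
    String bo = "bobo";
    bloomFilter.put(bo);
    // Prints true: an element that was put is always reported as possibly present
    System.out.println(bloomFilter.mightContain(bo));
}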