[Java project] Using a Bloom filter to solve cache penetration, and solving the Bloom filter's deletion problem

What is the cache penetration problem

When a user's request hits data that exists in Redis or the database, a result is returned normally. But if the request queries data that exists in neither Redis nor the database, every such meaningless query passes straight through Redis to the database; a flood of them drives up the load on the database and can bring it down.

Cache penetration, then, is a user accessing data that exists in neither the database nor Redis. For example, if ids use an auto-increment strategy, negative ids cannot exist; if an attacker queries with negative ids, those requests pass straight through Redis to the database, the load on the database spikes, and the database can go down. (This is usually malicious behavior.)

So how to solve the cache penetration problem?

  • Cache empty objects: whenever such an unqueryable id arrives, cache it in Redis with a null value; the next query for the same id then returns the empty object directly.
    The advantage is that this is simple to implement and easy to maintain; the disadvantage is that it wastes memory.
  • Block the IP of the malicious requests directly, although the attacker may keep changing IPs
  • Validate the legality of request parameters
  • Use a Bloom filter. The advantages are low memory usage and no redundant keys; the disadvantages are that it is harder to implement and can produce false positives
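As a minimal sketch of the first option, the cache-empty-objects flow can look like the following. All names here are illustrative, and a ConcurrentHashMap stands in for both Redis and the database; a real implementation would also set a short TTL on the cached null marker.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "cache empty objects" strategy.
class NullCacheDemo {
    private static final String NULL_MARKER = "<null>";  // sentinel for "no such row"
    private final Map<String, String> cache = new ConcurrentHashMap<>();    // stands in for Redis
    private final Map<String, String> database = new ConcurrentHashMap<>(); // stands in for the DB

    NullCacheDemo() {
        database.put("user:1", "Alice");
    }

    public String get(String id) {
        String cached = cache.get(id);
        if (cached != null) {
            // Hit: either real data or a previously cached empty object.
            return NULL_MARKER.equals(cached) ? null : cached;
        }
        String fromDb = database.get(id);
        // Cache the miss too, so repeated queries for a bogus id
        // (e.g. a negative id) never reach the database again.
        cache.put(id, fromDb == null ? NULL_MARKER : fromDb);
        return fromDb;
    }

    public boolean isCached(String id) {
        return cache.containsKey(id);
    }
}
```

After the first miss for a bogus id, the null marker is served from the cache, which is exactly the memory cost the bullet above calls a disadvantage.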

What is a bloom filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a long binary vector plus a series of random mapping (hash) functions.

A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general-purpose algorithms; its disadvantages are a certain false-positive rate and difficulty of deletion. The cause is that different items may be hashed by the same hash function to the same value, so a bit set to 1 by one item also appears set for another, and a misjudgment occurs.
In other words, a Bloom filter is not exact: a "might contain" answer can be wrong, but a "does not contain" answer never is.

The principle of the Bloom filter: when an element is added to the set, it is mapped by k hash functions to k points in a bit array, and those bits are set to 1. To query, we just check whether all of those bits are 1 to know (approximately) whether the element is in the set: if any of them is 0, the element is definitely not in the set; if all are 1, the element is probably in the set.

Put simply: prepare a bit array of length `length` with all elements initialized to 0; apply k hash functions to the element and take each result modulo `length` to obtain k positions; then set the bits at those positions to 1.

As the Bloom filter stores more elements, more and more bits are set to 1. Even if an element x was never inserted, every bit it maps to may already have been set to 1 by other values; to the Bloom filter's mechanism, x then appears to exist. That is to say, the Bloom filter has a certain false-positive rate.

From the above, the space complexity of a Bloom filter is O(length), and the time complexity of both query and insertion is O(k), where k is the number of hash functions.
Another characteristic is that a Bloom filter does not support deletion: clearing a bit may affect other data that also maps to that bit. As a result, the longer a Bloom filter is used, the larger its error rate becomes, and we need to consider a way to deal with this.

Implementation of Bloom filter

From the above, the implementation of a Bloom filter relies on a binary vector and multiple hash functions. Its accuracy is therefore influenced by the randomness of the hash functions and the size of the binary vector.
The most commonly used Bloom filter implementations today are Guava's and Redisson's.
With Guava, the usage is as follows.

1. Add the Maven dependency

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.0.1-jre</version>
</dependency>


2. Create the Bloom filter

BloomFilter<Integer> filter = BloomFilter.create(
  // A Funnel is an interface that turns an object of an arbitrary type
  // into a byte stream for the Bloom filter's hash computation.
  Funnels.integerFunnel(),
  10000,  // expected number of inserted entries
  0.001   // desired false-positive rate
);
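A short usage sketch for the filter created above (this assumes the Guava dependency is on the classpath): `put` records an element and `mightContain` queries it; `mightContain` can return a false positive, but never a false negative.

```java
filter.put(42);                              // record an element
boolean seen = filter.mightContain(42);      // always true: no false negatives
boolean other = filter.mightContain(7);      // almost certainly false at this fill level
```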

For Redisson, the usage is as follows.

1. Add the Maven dependency

<dependency>
   <groupId>org.redisson</groupId>
   <artifactId>redisson</artifactId>
   <version>3.16.1</version>
</dependency>
2. Configure the Redisson client

@Configuration
public class RedissonConfig {

    @Bean
    public RedissonClient redissonClient() {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379");
        return Redisson.create(config);
    }
}
3. Initialize the filter

RBloomFilter<Long> bloomFilter = redissonClient.
                                      getBloomFilter("myBloomFilter");
// 10000 is the expected number of inserted elements, 0.001 the false-positive rate
bloomFilter.tryInit(10000, 0.001);
// insert four elements
bloomFilter.add(1L);
bloomFilter.add(2L);
bloomFilter.add(3L);
bloomFilter.add(4L);
4. Check whether data exists

public boolean mightContain(Long id) {
    return bloomFilter.contains(id);
}
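Putting the pieces together, the read path with a Bloom filter in front of the cache can be sketched like this. All names are illustrative: a HashSet stands in for the Bloom filter (so, unlike a real filter, it has no false positives) and HashMaps stand in for Redis and the database.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the query path once a Bloom filter guards the cache.
class QueryFlow {
    private final Set<Long> bloomFilter = new HashSet<>();
    private final Map<Long, String> redis = new HashMap<>();
    private final Map<Long, String> database = new HashMap<>();

    QueryFlow() {
        database.put(1L, "row-1");
        bloomFilter.add(1L);   // all existing ids are preloaded into the filter
    }

    public String query(Long id) {
        // 1. Reject ids the filter has never seen: they cannot be in the DB.
        if (!bloomFilter.contains(id)) {
            return null;               // request never reaches Redis or the DB
        }
        // 2. Filter says "probably present": try the cache, then the DB.
        String cached = redis.get(id);
        if (cached != null) {
            return cached;
        }
        String fromDb = database.get(id);
        if (fromDb != null) {
            redis.put(id, fromDb);     // warm the cache for next time
        }
        return fromDb;
    }
}
```

A malicious negative id fails the filter check in step 1, which is exactly how the Bloom filter stops cache penetration.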

In fact, Java itself also provides the building blocks for a Bloom filter: java.util.BitSet supports bit operations, so we can combine BitSet with hash functions of our own to implement one.
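For example, a minimal (untuned) BitSet-based filter might look like the following. The k hash functions are simulated with one crudely seeded hash, which is an illustrative shortcut rather than production-quality hashing, and the sizes are not derived from a target false-positive rate.

```java
import java.util.BitSet;

// Minimal Bloom filter over java.util.BitSet.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int length;
    private final int k;

    SimpleBloomFilter(int length, int k) {
        this.bits = new BitSet(length);
        this.length = length;
        this.k = k;
    }

    private int position(Object element, int seed) {
        // Crude seeded hash: XOR with a scrambled seed, then mix high bits down.
        int h = element.hashCode() ^ (seed * 0x9E3779B9);
        h ^= h >>> 16;
        return Math.floorMod(h, length);          // take the remainder of length
    }

    public void add(Object element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));       // set the k mapped bits to 1
        }
    }

    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;                     // any 0 bit => definitely absent
            }
        }
        return true;                              // all bits 1 => probably present
    }
}
```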

Solve the problem of not supporting deletion

As mentioned above, the longer a Bloom filter is in use, the larger its query error becomes. Is there any way around this?

The first, simple approach is to replace the Bloom filter outright.
We can write a scheduled task that, after a certain interval, creates a new Bloom filter, queries the full data set from the database and maps it into the new filter, and then updates the pointer to the old filter. This method is simple to implement.
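That swap can be sketched as follows, with a HashSet standing in for the rebuilt Bloom filter and an AtomicReference as the pointer that gets updated; in practice a scheduled task would call rebuild() periodically. Names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the rebuild-and-swap approach to Bloom filter "deletion".
class SwappableFilter {
    private final AtomicReference<Set<Long>> current =
            new AtomicReference<>(new HashSet<>());

    // Rebuild a fresh filter from the full set of ids in the database,
    // then atomically swap the pointer; readers never see a half-built filter.
    public void rebuild(Iterable<Long> allIdsInDatabase) {
        Set<Long> fresh = new HashSet<>();
        for (Long id : allIdsInDatabase) {
            fresh.add(id);
        }
        current.set(fresh);
    }

    public boolean mightContain(Long id) {
        return current.get().contains(id);
    }
}
```

Deleted rows simply never make it into the next rebuild, so the accumulated error is reset on every swap.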

The other approach is counting. We know the Bloom filter's underlying binary vector holds only the two values 0 and 1, so when multiple elements map to the same position, setting that bit back to 0 on deletion causes larger misjudgments. If we instead replace the binary vector with byte-typed slots that support counting, the value at each position is no longer restricted to 0 or 1: we increment the counter when mapping an element to a position and decrement it on deletion, and we can still judge that the data is absent whenever a counter is 0.
The big disadvantages are increased space waste and reduced efficiency; concurrent modification also has to be considered.
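A counting variant can be sketched by swapping the bit array for byte counters. This is illustrative only: a real implementation must also guard against counter overflow (saturating at 255) and make the increments/decrements thread-safe, neither of which is handled here.

```java
// Counting Bloom filter sketch: each slot is a byte counter instead of a bit,
// so delete() can undo what add() did.
class CountingBloomFilter {
    private final byte[] counters;
    private final int k;

    CountingBloomFilter(int length, int k) {
        this.counters = new byte[length];
        this.k = k;
    }

    private int position(Object element, int seed) {
        // Same crude seeded hash idea as a bit-based filter would use.
        int h = element.hashCode() ^ (seed * 0x9E3779B9);
        return Math.floorMod(h, counters.length);
    }

    public void add(Object element) {
        for (int i = 0; i < k; i++) {
            counters[position(element, i)]++;   // increment instead of set-to-1
        }
    }

    public void delete(Object element) {
        for (int i = 0; i < k; i++) {
            counters[position(element, i)]--;   // decrement on removal
        }
    }

    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (counters[position(element, i)] == 0) {
                return false;                   // any zero counter => absent
            }
        }
        return true;
    }
}
```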


Origin blog.csdn.net/Zhangsama1/article/details/131590267