A detailed guide to using Bloom filters in Redis


1. Introduction to Bloom filter

1. What is a Bloom filter

The Bloom filter was proposed by Bloom in 1970. It is essentially a long binary vector plus a series of random mapping functions, and it is mainly used to determine whether an element is in a set.

Many business scenarios require checking whether an element belongs to a certain collection. The usual approach is to store every element of the collection and answer the question by comparison; linked lists, trees, and hash tables all work this way. But as the number of elements grows, the storage we need grows linearly and eventually becomes a bottleneck. Lookups get slower as well: the lookup time complexities of these three structures are O(n), O(log n), and O(1) respectively.

This is the problem the Bloom filter was designed to solve.

2. Implementation principle of Bloom filter

To check whether an element is in a collection, the first method that comes to mind is to store all the data and then search it. This works when the amount of data is relatively small, but when the collection holds many elements, lookups become slower and slower.

A bitmap can help: to know whether a number is in the set, just check whether the corresponding bit is 1. A Bloom filter can be regarded as an extension of the bitmap that maps each element through several hash functions, which lowers the chance that a single hash collision produces a wrong answer.

The algorithm is as follows:

A Bloom filter consists of a fixed-size binary vector (bitmap) and a series of mapping functions.

  1. In the initial state, every bit of a bit array of length m is set to 0;
  2. When an element is added to the set, K mapping functions map it to K positions in the bitmap, and those positions are set to 1;
  3. To query whether an element exists, we only need to check whether those K positions are all 1; the answer tells us, with high probability, whether it is in the set.
  • If any of these positions is 0, the queried element is definitely not in the set;
  • If all of them are 1, the queried element very likely exists.
    Why "likely" rather than "certainly"? Because each mapping function is a hash function, and hash functions have collisions.
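
The three steps above can be sketched in plain Java. This is a minimal in-memory illustration, not how Redisson implements its Bloom filter; the class name `SimpleBloomFilter`, the FNV-1a style hashing with two different starting values, and the double-hashing trick for deriving the K positions are all assumptions chosen to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal in-memory Bloom filter sketch (hypothetical class, for illustration only). */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // length of the bit array
    private final int k; // number of mapping (hash) functions

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m); // step 1: all bits start at 0
    }

    /** Step 2: map the element to k positions and set them to 1. */
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    /** Step 3: false means definitely absent, true means probably present. */
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    // i-th position via double hashing: (h1 + i * h2) mod m.
    private int position(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // forced odd so the stride is never 0
        return Math.floorMod(h1 + i * h2, m);
    }

    // FNV-1a style hash over the bytes, starting from the given value.
    private static int fnv1a(byte[] data, int start) {
        int hash = start;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}
```

An element that was added is always reported as present, because its K bits stay set; an element that was never added is usually, but not always, reported as absent.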

3. False positive rate

The false positive (misjudgment) rate here is the probability that the Bloom filter reports a key as present when it is actually absent.

A Bloom filter misjudges because several inputs can hash to the same bit position, so a set bit cannot be traced back to the input that set it. The root cause of a false positive is therefore that the same bit is mapped to, and set to 1 by, more than one input.

Hash functions have collisions. A Bloom filter uses several hash calculations to reduce the probability of a collision, but it cannot avoid collisions completely. As a result, the bits produced by hashing one object can overlap with the bits of other objects. If every bit of a new object was already set to 1 when other objects were stored, a false positive occurs.

This sharing is also why deletion is a problem for Bloom filters: a bit is not owned by a single element, and several elements may well share it. If we simply clear the bit, other elements are affected.

Example of a Bloom filter false positive:
1. Initial state: every bit in the Bloom filter's bit array is 0;
2. Objects X and Y are added to the Bloom filter; after several hash calculations, bits 1, 2, 4, 5, and 7 of the bit array are set to 1;
3. To check whether object Z exists, the filter hashes Z several times and gets bits 4, 5, and 7. Those bits were already set to 1 by objects X and Y, so the filter wrongly reports that Z exists.
4. In the extreme case where every bit in the array is 1, the Bloom filter is useless: it reports every element as present.
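
The example above can be replayed with a `java.util.BitSet`. This is a sketch under one stated assumption: the article only gives the union of bits set by X and Y (1, 2, 4, 5, 7) and the bits of Z (4, 5, 7), so the split of positions between X and Y below is made up for illustration. The last step also shows why naive deletion is unsafe when bits are shared.

```java
import java.util.BitSet;

final class SharedBitsDemo {
    // Only the union {1, 2, 4, 5, 7} and Z's positions come from the example;
    // which object sets which bit is an assumption.
    static final int[] X = {1, 4, 5};
    static final int[] Y = {2, 5, 7};
    static final int[] Z = {4, 5, 7};

    static void add(BitSet bits, int[] positions) {
        for (int p : positions) {
            bits.set(p);
        }
    }

    static boolean mightContain(BitSet bits, int[] positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        add(bits, X);
        add(bits, Y);
        System.out.println("bits set: " + bits); // {1, 2, 4, 5, 7}

        // Z was never added, yet all of its bits are 1: a false positive.
        System.out.println("Z reported present: " + mightContain(bits, Z)); // true

        // Naively deleting X clears bit 5, which Y also uses.
        for (int p : X) {
            bits.clear(p);
        }
        System.out.println("Y reported present: " + mightContain(bits, Y)); // false
    }
}
```

After the naive deletion, Y is reported as absent even though it was added and never removed, which is exactly the kind of answer a Bloom filter must never give.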

Characteristics:

  • When the filter says an element exists, it may or may not actually exist; when the filter says an element does not exist, it definitely does not.
  • Elements can be added to a Bloom filter but not deleted, because clearing an element's bits may also clear bits shared with other elements, which would then be wrongly reported as absent.

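How to reduce the false positive rate?

  1. Choose a suitable number of hash functions. More hash functions make it less likely that an absent element collides on every position, but each insert also sets more bits, so too many hash functions fill the bitmap faster and raise the rate again;
  2. Increase the size of the bitmap so that a smaller share of its bits ends up set to 1.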
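
Both knobs can be put into numbers with the standard Bloom filter approximations: for n inserted elements, m bits, and k hash functions, the false positive rate is roughly p = (1 - e^(-kn/m))^k, the bitmap size for a target rate is m = -n ln p / (ln 2)^2, and the matching number of hash functions is k = (m/n) ln 2. The helper class below is a hypothetical sketch of those formulas (the name `BloomSizing` is not from the article or from any library).

```java
final class BloomSizing {
    /** Bits needed for n elements at target false positive rate p: m = -n ln p / (ln 2)^2. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Hash function count for n elements in m bits: k = (m / n) ln 2, at least 1. */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Approximate false positive rate: p = (1 - e^(-kn/m))^k. */
    static double falsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 10_000L; // expected insertions
        double p = 0.01;  // target false positive rate
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.println("m = " + m + " bits, k = " + k);
        System.out.println("estimated rate = " + falsePositiveRate(n, m, k));
    }
}
```

For 10,000 expected elements and a 1% target, this gives 95,851 bits (roughly 12 KB) and 7 hash functions. Using far more than 7 hash functions with the same bitmap would push the estimated rate up again, not down.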

4. Bloom filter usage scenarios

Typical applications of Bloom filters are:

  • Avoiding database lookups for data that does not exist: Google Bigtable, HBase, Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for rows or columns that do not exist. Avoiding costly disk seeks can greatly improve the performance of database queries.
  • Checking whether a user has already seen a certain video or article, as in Douyin or Toutiao. Some false positives are unavoidable (an unseen item may occasionally be treated as seen), but the user is never shown duplicate content.
  • Cache outage and cache breakdown scenarios. A service normally checks the cache first, returns the result on a hit, and queries the db on a miss. A wave of requests for cold data then causes a large number of cache misses that all fall through to the db and can trigger an avalanche effect. Here the Bloom filter can act as an index in front of the cache: only keys that are in the Bloom filter are looked up in the cache, and on a cache miss the request goes through to the db. Keys that are not in the Bloom filter are answered immediately without querying anything.
  • Web interceptors. Identical requests are intercepted to prevent repeated (replay) attacks: on a user's first request the request parameters are put into the Bloom filter, and on later requests the interceptor first checks whether the parameters hit the Bloom filter. This can also improve the cache hit rate.
  • The Squid web proxy cache server uses Bloom filters in cache digests. Google Chrome uses Bloom filters to speed up Safe Browsing.

In general, Bloom filters are used for duplicate checks on large data sets where a certain error rate is acceptable; the most typical use is solving the cache penetration problem.
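
The cache penetration defence described above can be sketched without any Redis dependency. Everything in this block is a stand-in chosen for illustration: `GuardedLookup` is a hypothetical class, the `LongPredicate` plays the role of the Bloom filter's membership check, and two in-memory maps play the roles of the cache and the database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongPredicate;

final class GuardedLookup {
    private final LongPredicate mightExist;             // stand-in for the Bloom filter check
    private final Map<Long, String> cache = new HashMap<>();
    private final Map<Long, String> database;           // stand-in for the real table
    private int databaseQueries = 0;

    GuardedLookup(LongPredicate mightExist, Map<Long, String> database) {
        this.mightExist = mightExist;
        this.database = database;
    }

    String get(long id) {
        // 1. The filter says "definitely absent": answer without touching cache or db.
        if (!mightExist.test(id)) {
            return null;
        }
        // 2. Probably present: try the cache first.
        String cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        // 3. Cache miss: query the db and cache only real rows.
        databaseQueries++;
        String row = database.get(id);
        if (row != null) {
            cache.put(id, row);
        }
        return row;
    }

    int databaseQueries() {
        return databaseQueries;
    }
}
```

With this guard in place, a flood of requests for ids that were never stored costs one filter check each and never reaches the database; only the filter's occasional false positives still do.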

5. Comparison between hash table and Bloom filter

A hash table can also be used to determine whether an element is in a set, but a Bloom filter needs only about 1/8 to 1/4 of the space of a hash table to solve the same problem.
A hash table stores the actual elements of the collection, whereas a Bloom filter only sets bits in a binary array according to the results of several hash calculations on each element and does not store the objects themselves.

A Bloom filter can insert elements but cannot delete existing ones. The more elements the set holds, the higher the false positive rate, but there are never false negatives.

2. Bloom filter in practice with Redis

"What is learned on paper always feels shallow; to truly understand something, you have to do it yourself." Next, let's see how to use a Bloom filter to avoid the cache penetration problem when querying order information.

1. Add the Redisson dependency

  <dependency>
      <groupId>org.redisson</groupId>
      <artifactId>redisson-spring-boot-starter</artifactId>
      <version>3.16.7</version>
  </dependency>

2. Create the order table

CREATE TABLE `tb_order` (
  `id` bigint NOT NULL AUTO_INCREMENT COMMENT 'Order ID',
  `order_desc` varchar(50) NOT NULL COMMENT 'Order description',
  `user_id` bigint NOT NULL COMMENT 'User ID',
  `product_id` bigint NOT NULL COMMENT 'Product ID',
  `product_num` int NOT NULL COMMENT 'Product quantity',
  `total_account` decimal(10,2) NOT NULL COMMENT 'Order amount',
  `create_time` datetime NOT NULL COMMENT 'Creation time',
  PRIMARY KEY (`id`),
  KEY `ik_user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=51 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

3. Configure Redis

Add the Redis connection properties:

spring.redis.host=192.168.206.129
spring.redis.port=6379
spring.redis.password=123456

Configure the RedisTemplate, mainly to set Jackson as the serialization strategy:

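import com.fasterxml.jackson.annotation.JsonAutoDetect;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.annotation.PropertyAccessor;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.jsontype.impl.LaissezFaireSubTypeValidator;
import org.redisson.spring.data.connection.RedissonConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.serializer.Jackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.StringRedisSerializer;

@Configuration
public class RedisConfig {

    @Bean
    public RedisTemplate<String, Object> redisTemplate(RedissonConnectionFactory redisConnectionFactory) {
        // Set up JSON serialization for values
        Jackson2JsonRedisSerializer<Object> jackson2JsonRedisSerializer =
                new Jackson2JsonRedisSerializer<>(Object.class);
        ObjectMapper om = new ObjectMapper();
        om.setVisibility(PropertyAccessor.ALL, JsonAutoDetect.Visibility.ANY);
        // Required for deserialization: without the type information, converting
        // the JSON read from Redis back into an entity fails
        om.activateDefaultTyping(LaissezFaireSubTypeValidator.instance,
                ObjectMapper.DefaultTyping.NON_FINAL,
                JsonTypeInfo.As.WRAPPER_ARRAY);
        jackson2JsonRedisSerializer.setObjectMapper(om);
        // Configure the RedisTemplate
        RedisTemplate<String, Object> redisTemplate = new RedisTemplate<>();
        redisTemplate.setConnectionFactory(redisConnectionFactory);
        StringRedisSerializer stringSerializer = new StringRedisSerializer();
        // Key serialization
        redisTemplate.setKeySerializer(stringSerializer);
        // Value serialization
        redisTemplate.setValueSerializer(jackson2JsonRedisSerializer);
        return redisTemplate;
    }

}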

4. Configure the BloomFilter

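import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Bloom filter configuration
 */
@Configuration
public class BloomFilterConfig {

    @Autowired
    private RedissonClient redissonClient;

    /**
     * Creates the Bloom filter for order ids
     * @return the initialized order id Bloom filter
     */
    @Bean
    public RBloomFilter<Long> orderBloomFilter() {
        // Filter name
        String filterName = "orderBloomFilter";
        // Expected number of insertions
        long expectedInsertions = 10000L;
        // Target false positive probability
        double falseProbability = 0.01;
        RBloomFilter<Long> bloomFilter = redissonClient.getBloomFilter(filterName);
        // Returns false (and changes nothing) if the filter was already initialized
        bloomFilter.tryInit(expectedInsertions, falseProbability);
        return bloomFilter;
    }
}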

5. Create order

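The BloomFilter in Redisson has 2 core methods:

  • bloomFilter.add(orderId): adds the id to the Bloom filter
  • bloomFilter.contains(orderId): checks whether the id may exist

import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3
import lombok.extern.slf4j.Slf4j;
import org.redisson.api.RBloomFilter;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
// TbOrder, TbOrderMapper and OrderService are this project's own classes

@Slf4j
@Service
public class OrderServiceImpl implements OrderService {

    @Resource
    private RBloomFilter<Long> orderBloomFilter;

    @Resource
    private TbOrderMapper tbOrderMapper;

    @Resource
    private RedisTemplate<String, Object> redisTemplate;

    @Override
    public void createOrder(TbOrder tbOrder) {
        // 1. Create the order
        tbOrderMapper.insert(tbOrder);

        // 2. Save the order id to the Bloom filter
        log.info("Adding order id to the Bloom filter: {}", tbOrder.getId());
        orderBloomFilter.add(tbOrder.getId());
    }

    @Override
    public TbOrder get(Long orderId) {
        TbOrder tbOrder = null;
        // 1. Ask the Bloom filter whether the order id may exist
        if (orderBloomFilter.contains(orderId)) {
            log.info("Bloom filter reports that order id {} exists", orderId);
            String key = "order:" + orderId;
            // 2. Query the cache first
            Object object = redisTemplate.opsForValue().get(key);
            if (object != null) {
                log.info("Cache hit");
                tbOrder = (TbOrder) object;
            } else {
                // 3. Not in the cache, so query the database
                log.info("Cache miss, querying the database");
                tbOrder = tbOrderMapper.selectById(orderId);
                // On a Bloom filter false positive the row does not exist; do not cache null
                if (tbOrder != null) {
                    redisTemplate.opsForValue().set(key, tbOrder);
                }
            }
        } else {
            log.info("Order id {} does not exist, skipping the query", orderId);
        }
        return tbOrder;
    }
}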

6. Unit testing

    @Test
    public void testCreateOrder() {
        for (int i = 0; i < 50; i++) {
            TbOrder tbOrder = new TbOrder();
            tbOrder.setOrderDesc("Test order " + (i + 1));
            tbOrder.setUserId(1958L);
            tbOrder.setProductId(102589L);
            tbOrder.setProductNum(5);
            tbOrder.setTotalAccount(new BigDecimal("300"));
            tbOrder.setCreateTime(new Date());
            orderService.createOrder(tbOrder);
        }
    }

    @Test
    public void testGetOrder() {
        TbOrder tbOrder = orderService.get(25L);
        // Log the object itself: calling toString() would throw if the order was not found
        log.info("Query result: {}", tbOrder);
    }

Summary

The principle of the Bloom filter is actually very simple: a bitmap plus multiple hashes. Its main advantage is that it can quickly determine, using very little space, whether an object exists among large-scale data. Its disadvantage is that false positives are possible, although false negatives are not: an object that exists is always reported as existing, while an object that does not exist has a small probability of being reported as existing. Deleting objects is not supported, because clearing bits shared with other objects would make those objects appear absent. The most typical use is solving the cache penetration problem.

Origin blog.csdn.net/w1014074794/article/details/129750865