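Bloom Filters: An Introduction and Basic Usage

1. What is a Bloom filter?

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a long bit vector and a series of random mapping (hash) functions, and it is used to test whether an element belongs to a set. Its strengths are space efficiency and query time, both far better than those of general-purpose structures that store the elements themselves. Its weaknesses are a certain false positive rate and the difficulty of deleting elements.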

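2. Use cases for Bloom filters

URL deduplication in web crawlers, to avoid crawling the same URL twice;
Anti-spam: deciding whether an address is on a list of billions of spam addresses (the same applies to spam SMS);
Cache penetration: put the keys that actually exist into the Bloom filter, so that when an attacker requests keys that do not exist the request is rejected immediately, which keeps both the cache and the database from being overwhelmed;
Excluding non-members from a prize-draw program.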
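The cache penetration case can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name CachePenetrationGuard, the in-memory maps standing in for the cache and the database, and the Predicate standing in for the Bloom filter lookup are all hypothetical and do not come from the original article.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Sketch of a read path that consults a membership filter before touching cache or DB. */
final class CachePenetrationGuard {
    private final Predicate<String> mightExist;           // stand-in for a Bloom filter lookup
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database;           // stand-in for the real database
    private int databaseReads = 0;

    CachePenetrationGuard(Predicate<String> mightExist, Map<String, String> database) {
        this.mightExist = mightExist;
        this.database = database;
    }

    Optional<String> get(String key) {
        // Filter says "definitely absent": answer immediately, no cache or DB access.
        if (!mightExist.test(key)) {
            return Optional.empty();
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Cache miss on a key that probably exists: read the database once and cache the value.
        databaseReads++;
        String value = database.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return Optional.ofNullable(value);
    }

    int databaseReads() {
        return databaseReads;
    }
}
```

With a real Bloom filter behind the predicate, an occasional false positive simply falls through to the database and comes back empty, so correctness is unaffected and only a small fraction of bogus requests reaches the database.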

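3. How a Bloom filter is designed

A Bloom filter is essentially a bit array: an array in which each element occupies a single bit and can only be 0 or 1. Besides the bit array, a Bloom filter has K hash functions. When an element is added to the filter, the following steps are performed:

Apply the K hash functions to the element's value to obtain K hash values;
For each hash value, set the bit at the corresponding index of the bit array to 1.

For example, suppose a set contains three elements x, y and z, each mapped by three hash functions to certain positions in the bit sequence. To decide whether w is in the set, we map it with the same three hash functions. If the bits found are not all 1, w is not in the set.
However large the array is, its capacity is finite. As more elements are inserted, more positions in the bit array are set to 1. This leads to the following situation: when an element that was never added is hashed by the same rules, the positions it maps to may all have been set to 1 already by other elements. So an element that is not in the Bloom filter can be misjudged as present. This is the Bloom filter's main defect. Conversely, if the Bloom filter reports that an element is absent, the element is definitely absent. In short: when a Bloom filter says an element is present, it may be wrong; when it says an element is absent, it is certainly absent.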
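The add and lookup steps described above can be sketched in a few lines of Java. This is a minimal illustration, not Guava's or Redis's implementation: the class name SimpleBloomFilter is hypothetical, and deriving the K indexes from two FNV-1a style hashes (double hashing) is an assumption chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** A minimal Bloom filter sketch: one bit array plus k derived hash functions. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;       // m: number of bits in the array
    private final int hashCount;  // k: number of hash functions

    SimpleBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets the k bits that the element maps to. */
    void add(String element) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(element, i));
        }
    }

    /** Returns false if the element is definitely absent, true if it is possibly present. */
    boolean mightContain(String element) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    /** i-th index, derived from two base hashes as h1 + i * h2 (mod m). */
    private int index(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x9E3779B9) | 1; // forced odd so it is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    /** FNV-1a style hash with a configurable starting value. */
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}
```

After adding x, y and z, mightContain returns true for each of them without exception, while a lookup for w returns false as soon as one of its bits is still 0.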

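Question: Why does a Bloom filter use multiple hash functions?
Answer: The problem every hash faces is collisions. Assume the hash function is good: if the bit array has m bits and we want to keep the collision rate down to, say, 1%, the table can hold only about m/100 elements. That is clearly not space-efficient. The remedy is simple: use several hash functions. If any one of them says the element is not in the set, it is definitely not. If all of them say it is, they may still all be wrong, but intuitively the probability of that is fairly low.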
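The intuition in this answer can be made quantitative with the standard approximation for a Bloom filter with m bits, n inserted elements and k hash functions: p ≈ (1 - e^(-kn/m))^k. The sketch below only evaluates that formula and the usual sizing rules; the class and method names are hypothetical and not part of any library.

```java
/** Standard Bloom filter sizing formulas (approximations). */
final class BloomFilterMath {

    /** Approximate false positive probability: (1 - e^(-k*n/m))^k. */
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    /** Number of bits needed for n elements at target rate p: m = -n * ln(p) / (ln 2)^2. */
    static long requiredBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Hash count that minimizes the false positive rate: k = (m / n) * ln 2. */
    static int optimalHashCount(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}
```

With a single hash function (k = 1), a 1% rate needs roughly 100 bits per element, which is the m/100 figure above. With the optimal number of hash functions, the same 1% rate needs only about 9.6 bits per element and 7 hash functions.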

4. Google's Bloom filter

Google's Guava library provides a Bloom filter implementation, BloomFilter. The following example shows its basic usage.

(1) Add the dependency

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>21.0</version>
</dependency>

(2) Prepare the test script

DROP TABLE IF EXISTS `sys_user`;
CREATE TABLE `sys_user` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `user_name` varchar(11) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'user name',
  `image` varchar(11) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'user avatar',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

INSERT INTO `sys_user` VALUES (1, 'calvin', 'xxx');
INSERT INTO `sys_user` VALUES (2, 'bobb', 'yyy');
INSERT INTO `sys_user` VALUES (3, 'liming', 'zzz');

(3) BloomFilterService

package com.calvin.service;

import com.calvin.entity.SysUser;
import com.calvin.mapper.SysUserMapper;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.List;
import javax.annotation.PostConstruct;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.util.CollectionUtils;

/**
 * @Title BloomFilterService
 * @Description Google (Guava) Bloom filter
 * @author calvin
 * @date: 2019/12/19 11:37 PM 
 */
@Service
public class BloomFilterService {

    @Autowired
    private SysUserMapper sysUserMapper;

    private BloomFilter<Integer> bf;

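    /**
     * Runs once at application startup, after dependency injection.
     */
    @PostConstruct
    public void initBloomFilter() {
        List<SysUser> sysUsers = sysUserMapper.selectAll();
        int expected = CollectionUtils.isEmpty(sysUsers) ? 0 : sysUsers.size();
        // Always create the filter (default false positive probability: 3%),
        // so userIdExist never dereferences null when the table is empty
        bf = BloomFilter.create(Funnels.integerFunnel(), Math.max(expected, 1));
        if (expected > 0) {
            // Put every user id from the database into the filter, held in JVM memory
            sysUsers.forEach(user -> bf.put(user.getId()));
        }
    }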

    /**
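     * Checks whether a user id may be present in the Bloom filter.
     * @param userId the user id to look up
     * @return false if the id is definitely absent, true if it is probably present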
     */
    public boolean userIdExist(Integer userId) {
        return bf.mightContain(userId);
    }
}

(4) BloomFilterController

package com.calvin.controller;

import com.calvin.service.BloomFilterService;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import javax.annotation.Resource;

/**
 * @Title GoogleBloomFilterController
 * @Description
 * @author calvin
 * @date: 2019/12/20 11:25 AM
 */
@RestController
@RequestMapping("google")
public class GoogleBloomFilterController {

    @Resource
    private BloomFilterService bloomFilterService;

    /**
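     * Uses the Google Bloom filter to check whether an element exists.
     * @param id the user id to look up
     * @return false if the id is definitely absent, true if it is probably present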
     */
    @RequestMapping("/bloom/idExists")
    public boolean ifExists(int id) {
        return bloomFilterService.userIdExist(id);
    }
}

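(5) Test

Enter in the browser: http://localhost:8082/google/bloom/idExists?id=1

The response is true, and the same holds for id=2 and id=3, which shows that ids 1, 2 and 3 are all in the Bloom filter.

Then enter: http://localhost:8082/google/bloom/idExists?id=4

id=4 is not in the Bloom filter, so the response is false.

Compared against the user ids that exist in the database, the test results match expectations. That is the basic usage of Google's Bloom filter.

As the example shows, Google's Bloom filter has the following drawbacks:

It lives in JVM memory;
It is lost on restart and has to be rebuilt;
Being local memory, it cannot be shared across nodes in a distributed deployment;
It is not suited to very large data volumes.

So next we turn to the Redis Bloom filter.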

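5. Redis Bloom filter

Compared with Google's Bloom filter, the Redis Bloom filter has these advantages:

Scalable Bloom filter: once a filter reaches its capacity, a new filter is created on top of it;
No loss on restart and no scheduled-task maintenance cost: a filter built with Google's implementation must be re-initialized after every startup.

Drawback:

It requires network I/O, so it is slower than Google's in-process Bloom filter.

To use a Bloom filter in Redis, you first need to install the rebloom module.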

## Download rebloom from GitHub
git clone https://github.com/RedisLabsModules/rebloom

## Build the module
cd rebloom
make

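After the build, a redisbloom.so file appears in the directory.

Next, add the module to the Redis configuration file.

loadmodule /Users/calvin/Documents/Work/install/rebloom/redisbloom.so

Start Redis with that configuration file:

./redis-server ../redis.conf

On startup you can see that the rebloom module has been loaded together with Redis.

Redis Bloom filter commands:

bf.add: adds one element to a Bloom filter; to add several at once, use bf.madd
bf.exists: checks whether one element is in the filter; to check several at once, use bf.mexists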

127.0.0.1:6379> bf.add calvinBloom 111
(integer) 1
127.0.0.1:6379> bf.add calvinBloom 222
(integer) 1
127.0.0.1:6379> bf.add calvinBloom 333
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 111
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 222
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 333
(integer) 1
127.0.0.1:6379> bf.exists calvinBloom 444
(integer) 0

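As shown, the Bloom filter is named calvinBloom and the elements added to it are 111, 222 and 333.

Element 444 is not in the filter, so the command returns 0.

6. Integrating Spring Boot with the Redis Bloom filter via Lua scripts

(1) Write the Lua scripts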

bloomFilterAdd.lua

local bloomName = KEYS[1]
local value = KEYS[2]

-- Add the element to the Bloom filter (BF.ADD creates the filter if it does not exist)
local result_1 = redis.call('BF.ADD', bloomName, value)
return result_1

bloomFilterExist.lua

local bloomName = KEYS[1]
local value = KEYS[2]

-- Check whether the given Bloom filter contains the given element
local result_1 = redis.call('BF.EXISTS', bloomName, value)
return result_1

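(2) Load the Lua scripts with redisTemplate

    /**
     * Adds an element to a Bloom filter, creating the filter if needed.
     * @param filterName name of the filter
     * @param value element to add
     * @return true if the element was newly added
     */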
    public Boolean bloomFilterAdd(String filterName, int value) {
        DefaultRedisScript<Boolean> bloomAdd = new DefaultRedisScript<>();
        bloomAdd.setScriptSource(new ResourceScriptSource(new ClassPathResource("bloomFilterAdd.lua")));
        bloomAdd.setResultType(Boolean.class);
        List<Object> keyList = new ArrayList<>();
        keyList.add(filterName);
        keyList.add(value + "");
        Boolean result = (Boolean) redisTemplate.execute(bloomAdd, keyList);
        return result;
    }

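    /**
     * Checks whether the given Bloom filter contains the given element.
     * @param filterName name of the filter
     * @param value element to look up
     * @return false if the element is definitely absent, true if it is probably present
     */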
    public Boolean bloomFilterExists(String filterName, int value) {
        DefaultRedisScript<Boolean> bloomExists = new DefaultRedisScript<>();
        bloomExists.setScriptSource(new ResourceScriptSource(new ClassPathResource("bloomFilterExist.lua")));
        bloomExists.setResultType(Boolean.class);
        List<Object> keyList = new ArrayList<>();
        keyList.add(filterName);
        keyList.add(value + "");
        Boolean result = (Boolean) redisTemplate.execute(bloomExists, keyList);
        return result;
    }

(3) RedisBloomFilterController

package com.calvin.controller;

import com.calvin.service.RedisService;
import javax.annotation.Resource;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

/**
 * @Title RedisBloomFilterController
 * @Description
 * @author calvin
 * @date: 2019/12/20 11:40 AM
 */
@RestController
@RequestMapping("redis")
public class RedisBloomFilterController {

    @Resource
    private RedisService redisService;

    private static final String BLOOM_FILTER_NAME = "redisBloom";

    /**
     * Adds an id to the Bloom filter, creating the filter if needed.
     * @param id the id to add
     * @return true if the id was newly added
     */
    @RequestMapping("/bloom/add")
    public boolean redisidAdd(int id) {
        return redisService.bloomFilterAdd(BLOOM_FILTER_NAME, id);
    }

    /**
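     * Checks whether the Bloom filter contains the given id.
     * @param id the id to look up
     * @return false if the id is definitely absent, true if it is probably present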
     */
    @RequestMapping("/bloom/idExists")
    public boolean redisidExists(int id) {
        return redisService.bloomFilterExists(BLOOM_FILTER_NAME, id);
    }
}

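(4) Test

Enter in the browser: http://localhost:8082/redis/bloom/add?id=1

In the same way, add id=2 and id=3 to the Redis Bloom filter.

Next, test whether the elements are in the filter.

Enter in the browser: http://localhost:8082/redis/bloom/idExists?id=1

id=1 returns true, as do id=2 and id=3, while id=4 returns false, so the filter's answers are correct.

We can also verify the result from the Redis client console.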

127.0.0.1:6379> bf.exists redisBloom 1
(integer) 1
127.0.0.1:6379> bf.exists redisBloom 2
(integer) 1
127.0.0.1:6379> bf.exists redisBloom 3
(integer) 1
127.0.0.1:6379> bf.exists redisBloom 4
(integer) 0
