SpringBoot (18) --- Bulk-inserting data into a Redis Bloom filter with a Lua script

Bulk-inserting data into a Bloom filter with a Lua script

I previously wrote a blog post on how Bloom filters work: Algorithms (3) --- Bloom filter principle

In actual development, one step that comes up again and again is checking whether a given key already exists.

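This post is divided into three parts:

1. A performance comparison of several ways to check whether a key exists.
2. Implementing a Bloom filter with Redis, bulk-inserting data into it, and checking whether a key exists.
3. A summary of the above.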

I. Performance comparison

The performance test compares the following methods:

1. List's contains method

2. Map's containsKey method

3. The Google (Guava) Bloom filter's mightContain method

Preparation

When the SpringBoot project starts, it stores 5 million 32-character strings in each of a List, a Map, and a Google Bloom filter.

1. Demo code

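import static java.util.concurrent.TimeUnit.MILLISECONDS;

import com.google.common.base.Stopwatch;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.List;
import java.util.Map;
import java.util.UUID;
// Spring Boot 2.x; use jakarta.annotation.PostConstruct on Spring Boot 3
import javax.annotation.PostConstruct;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
public class PerformanceController {

    /**
     * Number of entries to store: 5 million
     */
    public static final int SIZE = 5000000;
    /**
     * List holding the data
     */
    public static List<String> list = Lists.newArrayListWithCapacity(SIZE);
    /**
     * Map holding the data
     */
    public static Map<String, Integer> map = Maps.newHashMapWithExpectedSize(SIZE);
    /**
     * Guava Bloom filter (default false positive probability of 3%)
     */
    BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), SIZE);
    /**
     * Entries used for the existence checks
     */
    public static List<String> exist = Lists.newArrayList();

    /**
     * Initialize the data
     */
    @PostConstruct
    public void insertData() {
        for (int i = 0; i < SIZE; i++) {
            String data = UUID.randomUUID().toString();
            data = data.replace("-", "");
            // 1. Store in the list
            list.add(data);
            // 2. Store in the map
            map.put(data, 0);
            // 3. Store in the local Bloom filter
            bloomFilter.put(data);
            // Check data: keep 5 of the 5 million entries in this list
            if (i % 1000000 == 0) {
                exist.add(data);
            }
        }
    }

    /**
     * 1. Time taken to check whether a value exists in the list
     */
    @RequestMapping("/list")
    public void existsList() {
        // A local Stopwatch per request avoids sharing mutable state between threads
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (String s : exist) {
            if (list.contains(s)) {
                log.info("list contains the data=============data {}", s);
            }
        }
        stopwatch.stop();
        log.info("list test, time to check whether the elements exist: {}", stopwatch.elapsed(MILLISECONDS));
    }

    /**
     * 2. Time taken to check whether a key exists in the map
     */
    @RequestMapping("/map")
    public void existsMap() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (String s : exist) {
            if (map.containsKey(s)) {
                log.info("map contains the data=============data {}", s);
            }
        }
        stopwatch.stop();
        log.info("map test, time to check whether the elements exist: {}", stopwatch.elapsed(MILLISECONDS));
    }

    /**
     * 3. Time taken to check whether a value exists in the Guava Bloom filter
     */
    @RequestMapping("/bloom")
    public void existsBloom() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (String s : exist) {
            if (bloomFilter.mightContain(s)) {
                log.info("guava Bloom filter contains the data=============data {}", s);
            }
        }
        stopwatch.stop();
        log.info("bloom test, time to check whether the elements exist: {}", stopwatch.elapsed(MILLISECONDS));
    }
}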

2. Test output

Test results

Each method here actually runs the existence check five times. Counting a single check, with 5 million entries, each a 32-character String, the rough conclusions are:

1. List's contains method takes about 80 milliseconds.
2. Map's containsKey method takes less than 1 millisecond.
3. The Google Bloom filter's mightContain method takes less than 1 millisecond.

Summary

There is no need to explain why Map is more efficient than List; what I did not expect is that both are this fast. I also tested 1 million entries, and finding a key by traversing the list took less than 1 millisecond as well. This shows that in actual development, if the data volume is not large, it makes almost no difference which one you use.
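To see why mightContain stays fast regardless of how many entries were inserted, here is a minimal sketch of a Bloom filter in plain Java. It is not Guava's implementation: the class name SimpleBloomFilter, the double-hashing scheme, and the FNV-1a helper are all assumptions chosen for illustration. A lookup only reads a fixed number of bits, so its cost does not grow with the data volume.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch for illustration; not Guava's implementation. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Sets the bits for the value; returns true if at least one bit changed. */
    boolean put(String value) {
        boolean changed = false;
        for (int i = 0; i < numHashes; i++) {
            int index = indexFor(value, i);
            if (!bits.get(index)) {
                bits.set(index);
                changed = true;
            }
        }
        return changed;
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(indexFor(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit index from two base hashes.
    private int indexFor(String value, int i) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a, used here only as a second hash function.
    private static int fnv1a(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 5);
        filter.put("11111");
        // An inserted value is always reported as possibly present.
        System.out.println(filter.mightContain("11111"));
        // A value that was never inserted is usually, but not always, rejected.
        System.out.println(filter.mightContain("00000"));
    }
}
```

The trade-off is visible in mightContain: a false answer is exact, while a true answer can be a false positive when other values happen to have set the same bits.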

3. Memory usage analysis

Judging by the performance results above, the Google Bloom filter has no real advantage. If the data volume is small, the collections above solve the problem and there is no need to consider a Bloom filter. But if the data volume is huge, at the level of tens of millions or even hundreds of millions of entries, collections are not an option. The problem is not that the lookup efficiency is unacceptable, but that the memory footprint is.

Let's calculate how much memory 5 million 32-byte keys need when stored in a List.

5,000,000 * 32 = 160,000,000 bytes ≈ 152.6 MB

This counts only the raw character data; the real JVM footprint is higher because every String carries object overhead. A single collection taking this much memory is clearly unacceptable.

Now let's calculate the memory a Bloom filter needs.

  • Let the bit array size be m, the number of samples n, and the error rate p.

  • From the problem, n = 5 million and p = 3% (the Google Bloom filter's default is 3%, and it can be changed).

    Calculated by the formula:

m = -n ln(p) / (ln 2)^2 ≈ 36.5 million bits ≈ 4.35 MB

That is much easier to accept.
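The sizing arithmetic in this section can be reproduced with a few lines of Java. This is a sketch under stated assumptions: the class name BloomSizing and its method names are made up for illustration, the formulas are the standard optimal Bloom filter sizing formulas (m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions), and 1 MB is taken as 1024 × 1024 bytes.

```java
/** Back-of-the-envelope sizing using the standard Bloom filter formulas. */
class BloomSizing {

    /** Optimal number of bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2, at least 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Converts bytes to megabytes, with 1 MB = 1024 * 1024 bytes. */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long n = 5_000_000L;   // number of keys
        double p = 0.03;       // target false positive probability

        // Raw character data of 5 million 32-byte keys, without object overhead.
        long rawBytes = n * 32;
        System.out.printf("raw keys: %.1f MB%n", toMegabytes(rawBytes));

        // Bit array size and hash count for the Bloom filter.
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.printf("bloom filter: %d bits (%.2f MB), k = %d%n",
                m, toMegabytes(m / 8), k);
    }
}
```

Lowering p makes the filter larger: every tenfold reduction of the error rate adds roughly 4.8 bits per key.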

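However, the Google Bloom filter also has significant drawbacks:

1. Every time the project starts, the data has to be loaded into the Google Bloom filter again, which consumes extra resources.
2. In a distributed cluster deployment, every cluster node has to store a copy of the same data in its own Bloom filter.
3. As the data volume grows, the Bloom filter also takes up a fairly large share of JVM memory, which is not reasonable either.

So a better solution is to use Redis as a Bloom filter shared by the whole distributed cluster.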


II. Redis Bloom filter

1. Setting up the Redis server

If you do not use Docker, you need to deploy Redis on the server and then separately install rebloom, the plug-in that adds Bloom filter support to Redis.

If you use Docker, deployment is very simple; just run the following commands:

  docker pull redislabs/rebloom # pull the image
  docker run -p 6379:6379 redislabs/rebloom # run the container

That completes the installation.

2. Bulk insert with a Lua script

I will not paste the full SpringBoot code here; the project's GitHub address is attached at the end of the article. Here we only go through what the script means:

bloomFilter-inster.lua

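-- the values to insert arrive as KEYS; the filter name is ARGV[1]
local values = KEYS
local bloomName = ARGV[1]
local result_1
-- ipairs walks the table in order, with k starting at 1
for k,v in ipairs(values) do
 result_1 = redis.call('BF.ADD',bloomName,v)
end
-- only the result of the last BF.ADD is returned
return result_1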

1) Parameter description

Both KEYS and ARGV[1] have to be passed in from the Java code. redisTemplate provides a method for this:

execute(RedisScript<T> script, List<K> keys, Object... args)
  • script is the entity that wraps the bulk-insert Lua script.
  • keys corresponds to KEYS in the script.
  • ARGV[1] corresponds to the first variable argument; if several variable arguments are passed, read them as ARGV[2] and so on.
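The calling side of this parameter mapping can be sketched as follows. To keep the snippet compilable without Spring, ScriptExecutor is a hypothetical interface that only mirrors the shape of execute(script, keys, args); it is not part of Spring Data Redis, and in the real project that role would be played by redisTemplate. Splitting the values into batches is also my assumption, not something the original project necessarily does.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the calling side; ScriptExecutor is a hypothetical stand-in. */
class BloomBulkInserter {

    /** Hypothetical abstraction with the same shape as execute(script, keys, args). */
    interface ScriptExecutor {
        Object execute(String script, List<String> keys, Object... args);
    }

    // Same logic as the bulk-insert Lua script shown earlier.
    static final String INSERT_SCRIPT =
            "local values = KEYS\n"
          + "local bloomName = ARGV[1]\n"
          + "local result_1\n"
          + "for k,v in ipairs(values) do\n"
          + " result_1 = redis.call('BF.ADD',bloomName,v)\n"
          + "end\n"
          + "return result_1";

    /** Sends the values in batches; returns the number of script calls made. */
    static int addAll(ScriptExecutor executor, String bloomName,
                      List<String> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        for (int from = 0; from < values.size(); from += batchSize) {
            int to = Math.min(from + batchSize, values.size());
            List<String> keys = new ArrayList<>(values.subList(from, to));
            // keys becomes KEYS in the script, bloomName becomes ARGV[1]
            executor.execute(INSERT_SCRIPT, keys, bloomName);
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Fake executor that only prints what would be sent to Redis.
        ScriptExecutor printing = (script, keys, argv) -> {
            System.out.println("KEYS=" + keys + " ARGV[1]=" + argv[0]);
            return 1L;
        };
        addAll(printing, "isMember", List.of("11111", "22222", "33333"), 2);
    }
}
```

Batching keeps each script call short, which matters because Redis runs a Lua script to completion before serving other clients.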

2) Traversal

Lua has two ways to traverse a table, ipairs and pairs, and they do differ. I will not go into the details here; see the blog post listed in the references below.

Note that traversal in Lua differs slightly from Java: in Java indexes start at 0, whereas in the Lua script k starts at 1.

3) Insert command

BF.ADD is the command that inserts data into the Bloom filter; it returns true when the insert succeeds.

3. Lua script to check whether an element exists in the Bloom filter

bloomFilter-exist.lua

local bloomName = KEYS[1]
local value = KEYS[2]
-- check whether the value may exist in the Bloom filter
local result_1 = redis.call('BF.EXISTS', bloomName, value)
return result_1

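Here we can clearly see that KEYS[1] corresponds to position get(0) of the keys collection, which is why we say that Lua indexing starts at 1.

BF.EXISTS is the command that checks whether the data exists in the Bloom filter; it returns true if the data exists.

4. Test

Let's test whether it works.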
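The index shift between the Java keys list and Lua's KEYS table can be illustrated with a tiny helper. The class LuaKeyIndexing and the method keysAt are hypothetical names introduced only for this example; the key values are the ones used in this post.

```java
import java.util.List;

/** Illustrates how a Java keys list lines up with Lua's KEYS table. */
class LuaKeyIndexing {

    /** Returns the element a Lua script would read as KEYS[luaIndex] (1-based). */
    static String keysAt(List<String> keys, int luaIndex) {
        if (luaIndex < 1 || luaIndex > keys.size()) {
            // In Lua this lookup would simply yield nil.
            throw new IndexOutOfBoundsException("KEYS[" + luaIndex + "] is nil in Lua");
        }
        return keys.get(luaIndex - 1);
    }

    public static void main(String[] args) {
        // Order matters: the exist script reads the filter name first, then the value.
        List<String> keys = List.of("isMember", "11111");
        System.out.println("KEYS[1] = " + keysAt(keys, 1));
        System.out.println("KEYS[2] = " + keysAt(keys, 2));
    }
}
```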

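import com.google.common.collect.Lists;
import java.util.List;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
public class RedisBloomFilterController {

    // RedisService is the project's own service class (see the GitHub repository)
    @Autowired
    private RedisService redisService;
    public static final String FILTER_NAME = "isMember";

    /**
     * Save data to the Redis Bloom filter
     */
    @RequestMapping("/save-redis-bloom")
    public Object saveReidsBloom() {
        // Insert the data into the Bloom filter
        List<String> exist = Lists.newArrayList("11111", "22222");
        Object object = redisService.addsLuaBloomFilter(FILTER_NAME, exist);
        log.info("save succeeded?====object:{}", object);
        return object;
    }

    /**
     * Check whether the data exists in the Redis Bloom filter
     */
    @RequestMapping("/exists-redis-bloom")
    public void existsReidsBloom() {
        // Logged when the data does not exist
        if (!redisService.existsLuabloomFilter(FILTER_NAME, "00000")) {
            log.info("redis Bloom filter does not contain the data=============data {}", "00000");
        }
        // Logged when the data exists
        if (redisService.existsLuabloomFilter(FILTER_NAME, "11111")) {
            log.info("redis Bloom filter contains the data=============data {}", "11111");
        }
    }
}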

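First call the insert endpoint to insert two entries; a return value of true means the insert succeeded. For the same value, the first insert returns success and the second returns false, which shows that inserting the same value again fails.

Then call the query endpoint. Both log lines should be printed here, because the check for "00000" above is negated with an extra !.

Let's look at the final result.

It matches our expectations, which shows that the Redis Bloom filter works all the way from deployment to the SpringBoot integration.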


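III. Summary

Here is my personal summary of the whole topic. The main question is which of the approaches above to consider, and in which environment, when checking whether an element exists.

1. The data volume is small, and no error is allowed.

Then either List or Map will do. List checks for an element by traversing the collection, so it performs worse than Map, but as the test above showed, with 1 million entries both the List traversal and the Map lookup take less than 1 millisecond. It hardly matters which one you pick; there is no point worrying about a difference of a fraction of a millisecond.

2. The data volume is small, and some error is allowed.

Then consider the Google Bloom filter. Query efficiency is about the same, but the key point is that it reduces memory overhead.

3. The data volume is large, and no error is allowed.

With a large volume, choosing Map just to speed up existence checks does not seem right to me either, because the larger the data volume, the more memory it takes. I would rather recommend storing the data in Redis's map data structure, which greatly reduces JVM memory overhead and avoids reloading the data into a collection on every restart.

4. The data volume is large, and some error is allowed.

For a monolithic application where the memory needed is acceptable, consider the Google Bloom filter, because its queries are faster than Redis's. After all, it has no network I/O overhead.

For a distributed cluster architecture, or when the data volume is very large, consider the Redis Bloom filter. After all, it does not need to store the data on every node, and it takes up no JVM memory.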
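The four cases above can be condensed into a small decision helper. This is only a sketch of the reasoning in this summary: the class ExistenceCheckChooser, the enum constants, and the three boolean inputs are hypothetical names, and what counts as a "large" data volume is left to the caller.

```java
/** Encodes the decision rules from the summary; all names are illustrative. */
class ExistenceCheckChooser {

    enum Choice { LIST_OR_MAP, GUAVA_BLOOM_FILTER, REDIS_MAP, REDIS_BLOOM_FILTER }

    /**
     * @param largeData    true if the data volume is large
     * @param errorAllowed true if false positives are acceptable
     * @param distributed  true for a distributed cluster or a very large data volume
     */
    static Choice choose(boolean largeData, boolean errorAllowed, boolean distributed) {
        if (!largeData) {
            // Cases 1 and 2: small data volume.
            return errorAllowed ? Choice.GUAVA_BLOOM_FILTER : Choice.LIST_OR_MAP;
        }
        if (!errorAllowed) {
            // Case 3: large and exact, keep the data out of the JVM.
            return Choice.REDIS_MAP;
        }
        // Case 4: large and approximate.
        return distributed ? Choice.REDIS_BLOOM_FILTER : Choice.GUAVA_BLOOM_FILTER;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, false, false));
        System.out.println(choose(true, true, true));
    }
}
```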

GitHub address: https://github.com/yudiandemingzi/spring-boot-redis-lua


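References

1. Redis Lua official documentation

2. Redis Lua documentation (Chinese translation)

3. The difference between ipairs and pairs when traversing a table with Lua's generic for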





Origin www.cnblogs.com/qdhxhz/p/11259078.html