Implementing a Bloom filter in Java

What is a Bloom filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long bit array plus a series of hash functions, and it is used to test whether an element is in a set.
Its advantage is that its space efficiency and query time are far better than those of general-purpose data structures. Its disadvantages are a certain false positive rate and the difficulty of deleting elements.


Scenario

Suppose we have 1 billion mobile phone numbers and need to determine whether a given number is in the list.

Can MySQL do it?

When the amount of data is small, we can store all of it in MySQL and query the database each time to check whether a number exists. Once the data grows past tens of millions of rows, however, these lookups become slow and put a heavy load on the database.

Can a HashSet do it?

We can put the data into a HashSet and rely on its built-in deduplication; a query only needs to call the contains method. But a HashSet lives in memory, so if the amount of data is too large the JVM will run out of memory (OutOfMemoryError).

Bloom filter characteristics

Insertion and lookup are efficient and use little space, but the result is probabilistic.
When the filter reports that an element exists, it may not actually exist. When it reports that an element does not exist, the element is definitely absent.
Elements can be added to a Bloom filter but must not be deleted: clearing an element's bits may also clear bits shared with other elements, so elements that are still present would be reported as absent.
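
To make the deletion problem concrete, here is a toy sketch with an 8-bit array and two hand-picked hash functions. The class name, the `naiveRemove` method, and the hash functions are assumptions made up for this illustration; they only work for non-negative integers and are not a real-world design.

```java
import java.util.BitSet;

public class NaiveDeleteDemo {
    private static final int SIZE = 8; // deliberately tiny so the bits can be traced by hand
    private final BitSet bits = new BitSet(SIZE);

    // Toy hash functions for non-negative ints (illustrative only).
    private static int h1(int x) { return x % SIZE; }
    private static int h2(int x) { return (x / SIZE) % SIZE; }

    public void add(int x) {
        bits.set(h1(x));
        bits.set(h2(x));
    }

    public boolean mightContain(int x) {
        return bits.get(h1(x)) && bits.get(h2(x));
    }

    /** Naive "delete": clears the element's bits even if other elements share them. */
    public void naiveRemove(int x) {
        bits.clear(h1(x));
        bits.clear(h2(x));
    }

    public static void main(String[] args) {
        NaiveDeleteDemo filter = new NaiveDeleteDemo();
        filter.add(3);          // sets bits 3 and 0
        filter.add(8);          // sets bits 0 and 1
        filter.naiveRemove(3);  // clears bits 3 and 0, but 8 still needs bit 0
        System.out.println(filter.mightContain(8)); // false, although 8 was never removed
    }
}
```

Because 3 and 8 share bit 0, removing 3 makes the filter wrongly report 8 as absent, which breaks the guarantee that a "does not exist" answer is always correct.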

Bloom filter principle

A Bloom filter is essentially a bit array. To insert an element, a series of hash functions map it to several array indexes, and the bit at each of those indexes is set to 1. To query, the same hash functions are applied to obtain the indexes and the corresponding bits are read. If all of them are 1, the element may exist; if any of them is 0, the element definitely does not exist.

Why is there an error rate

A Bloom filter works on hashes of the data, and no matter which algorithm is used, two different pieces of data can produce the same hash. This is what we usually call a hash collision.
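
The error rate can be traded against memory. As a rough sketch, the standard textbook formulas for an optimally configured Bloom filter are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the expected number of elements and p is the target false positive rate. The class and method names below are hypothetical, and the 1% target is an assumption chosen for illustration.

```java
public class BloomSizing {

    /** Bits needed for n expected elements at a target false positive rate p. */
    public static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions that minimizes false positives for m bits and n elements. */
    public static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L; // the 1 billion phone numbers from the scenario
        double p = 0.01;         // assumed target: 1% false positives
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.printf("bits=%d (about %.2f GiB), hashes=%d%n",
                m, m / 8.0 / (1L << 30), k);
    }
}
```

For the 1 billion phone numbers at a 1% error rate this works out to roughly 9.6 bits per element, about 1.1 GiB in total, with 7 hash functions.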

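First, insert one piece of data: 好好学技术 (learn technology well).

Then insert a second piece of data.

Now suppose we query a piece of data whose hash indexes have all already been set to 1. The Bloom filter will then conclude that it exists, even if it was never inserted.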
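
The situation described above can be reproduced with a toy sketch: an 8-bit array and two hand-picked hash functions for non-negative integers. The class name and the hash functions are assumptions chosen so that the bits are easy to trace by hand; real filters use far larger arrays and better hashes.

```java
import java.util.BitSet;

public class ToyBloomFalsePositive {
    private static final int SIZE = 8; // deliberately tiny so collisions are easy to see
    private final BitSet bits = new BitSet(SIZE);

    // Toy hash functions for non-negative ints (illustrative only).
    private static int h1(int x) { return x % SIZE; }
    private static int h2(int x) { return (x / SIZE) % SIZE; }

    public void add(int x) {
        bits.set(h1(x));
        bits.set(h2(x));
    }

    public boolean mightContain(int x) {
        return bits.get(h1(x)) && bits.get(h2(x));
    }

    public static void main(String[] args) {
        ToyBloomFalsePositive filter = new ToyBloomFalsePositive();
        filter.add(3);   // sets bits 3 and 0
        filter.add(16);  // sets bits 0 and 2
        System.out.println(filter.mightContain(3));  // true: really inserted
        System.out.println(filter.mightContain(19)); // true: never inserted, but bits 3 and 2 are already set
        System.out.println(filter.mightContain(5));  // false: bit 5 is 0, so 5 was definitely never added
    }
}
```

The value 19 is reported as present only because its two bits were set by two different elements, which is exactly how a false positive arises.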

Common use cases

Cache penetration
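
Cache penetration refers to requests for keys that exist in neither the cache nor the database, so every such request falls through to the database. The sketch below shows one way a filter could guard the lookup path. Everything in it is an assumption for illustration: in-memory maps stand in for the cache and the database, and a `Predicate<String>` stands in for the Bloom filter's membership check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

public class CachePenetrationGuard {
    private final Predicate<String> mightContain;   // stand-in for a Bloom filter lookup
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database;     // stand-in for the real database
    private int databaseHits = 0;

    public CachePenetrationGuard(Predicate<String> mightContain, Map<String, String> database) {
        this.mightContain = mightContain;
        this.database = database;
    }

    public Optional<String> get(String key) {
        if (!mightContain.test(key)) {
            return Optional.empty();                // definitely absent: skip cache and database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        databaseHits++;
        String value = database.get(key);           // may still be null on a false positive
        if (value != null) {
            cache.put(key, value);
        }
        return Optional.ofNullable(value);
    }

    public int databaseHits() {
        return databaseHits;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("user:1", "Alice");
        // A HashSet keeps this sketch dependency-free; in practice this would be
        // a Bloom filter populated with all valid keys.
        Set<String> validKeys = new HashSet<>(db.keySet());
        CachePenetrationGuard guard = new CachePenetrationGuard(validKeys::contains, db);

        guard.get("user:1");    // filter passes, cache miss, one database hit
        guard.get("user:1");    // served from the cache
        guard.get("user:999");  // rejected by the filter, database untouched
        System.out.println(guard.databaseHits()); // 1
    }
}
```

A false positive only costs one extra database lookup, while a "does not exist" answer lets the request return immediately.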

Implementing a Bloom filter in Java

package com.fandf.test.redis;

import java.util.BitSet;

/**
 * Bloom filter implemented in plain Java
 *
 * @author fandongfeng
 */
public class MyBloomFilter {

    /**
     * Size of the bit array: 2 << 24 = 2^25 bits (4 MiB)
     */
    private static final int DEFAULT_SIZE = 2 << 24;

    /**
     * Seeds used to create several different hash functions
     */
    private static final int[] SEEDS = new int[]{4, 8, 16, 32, 64, 128, 256};

    /**
     * The bit array; each element can only be 0 or 1
     */
    private final BitSet bits = new BitSet(DEFAULT_SIZE);

    /**
     * Array of hash functions
     */
    private final MyHash[] myHashes = new MyHash[SEEDS.length];

    /**
     * Initialize the array of hash function objects, each with a different seed
     */
    public MyBloomFilter() {
        // Create one hash function per seed
        for (int i = 0; i < SEEDS.length; i++) {
            myHashes[i] = new MyHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    /**
     * Add an element to the bit array
     */
    public void add(Object value) {
        for (MyHash myHash : myHashes) {
            bits.set(myHash.hash(value), true);
        }
    }

    /**
     * Check whether the given element may exist in the bit array
     */
    public boolean contains(Object value) {
        for (MyHash myHash : myHashes) {
            // A single 0 bit proves the element was never added
            if (!bits.get(myHash.hash(value))) {
                return false;
            }
        }
        return true;
    }

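    /**
     * Custom hash function, parameterized by a seed
     */
    private static class MyHash {
        private final int cap;
        private final int seed;

        MyHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /**
         * Compute the bit index for obj. The result is always in [0, cap);
         * cap must be a power of two so that (cap - 1) works as a bit mask.
         */
        int hash(Object obj) {
            if (obj == null) {
                return 0;
            }
            int h = obj.hashCode();
            h ^= (h >>> 16); // fold the high bits into the low bits
            return (cap - 1) & (seed * h);
        }
    }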

    public static void main(String[] args) {
        String str = "好好学技术";
        MyBloomFilter myBloomFilter = new MyBloomFilter();
        System.out.println("Does str exist: " + myBloomFilter.contains(str)); // false: nothing added yet
        myBloomFilter.add(str);
        System.out.println("Does str exist: " + myBloomFilter.contains(str)); // true
    }


}

Implementing a Bloom filter with Guava

Add the dependency

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.1-jre</version>
</dependency>

package com.fandf.test.redis;

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

/**
 * @author fandongfeng
 */
public class GuavaBloomFilter {

    public static void main(String[] args) {
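        // Expect about 100,000 insertions with a target false positive rate of 1%
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 100000, 0.01);
        bloomFilter.put("好好学技术");
        // Never inserted: almost certainly false, though a false positive is possible
        System.out.println(bloomFilter.mightContain("不好好学技术"));
        // Inserted: always true
        System.out.println(bloomFilter.mightContain("好好学技术"));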
    }
}

Implementing a Bloom filter with Hutool

Add the dependency

<dependency>
    <groupId>cn.hutool</groupId>
    <artifactId>hutool-all</artifactId>
    <version>5.8.3</version>
</dependency>

package com.fandf.test.redis;

import cn.hutool.bloomfilter.BitMapBloomFilter;
import cn.hutool.bloomfilter.BloomFilterUtil;

/**
 * @author fandongfeng
 */
public class HutoolBloomFilter {
    public static void main(String[] args) {
        BitMapBloomFilter bloomFilter = BloomFilterUtil.createBitMap(1000);
        bloomFilter.add("好好学技术");
        System.out.println(bloomFilter.contains("不好好学技术"));
        System.out.println(bloomFilter.contains("好好学技术"));
    }

}

Implementing a Bloom filter with Redisson

Add the dependency

<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.20.0</version>
</dependency>

package com.fandf.test.redis;
 
import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
 
/**
 * Bloom filter implemented with Redisson
 * @author fandongfeng
 */
public class RedissonBloomFilter {
 
    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        // Create the Redisson client
        RedissonClient redisson = Redisson.create(config);
 
        RBloomFilter<String> bloomFilter = redisson.getBloomFilter("name");
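        // Initialize the Bloom filter: 100,000,000 expected elements, 1% false positive rate
        bloomFilter.tryInit(100000000L, 0.01);
        bloomFilter.add("好好学技术");

        System.out.println(bloomFilter.contains("不好好学技术"));
        System.out.println(bloomFilter.contains("好好学技术"));

        // Release the client's connections and threads so the JVM can exit
        redisson.shutdown();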
    }
}