What is a Bloom filter? Let's take a look!

Bloom filter: just look at the name, it's a filter! Everyone knows what filters are: sieves, gauze, and other tools for straining out large particles. A filter removes the stuff we don't want and leaves behind what we do. Remember that mineral water ad boasting its water goes through more than twenty rounds of filtration? Talk about bragging! By that logic, even filtering sand counts as a layer of filtering. (Haha.)

A few days ago, while reading about Redis, I came across a structure called BitMap. When I finished, I thought: good grief, isn't this the very root of Bloom filters? So I quietly drew an arrow in my mind pointing straight at the Bloom filter.

Come on, let's start with the Baidu encyclopedia entry (and pad the word count a little):

The Bloom filter, proposed by Burton Bloom in 1970, is essentially a very long binary vector plus a series of random mapping (hash) functions. It can be used to test whether an element is in a set. Its advantage is that both its space efficiency and its query time far exceed ordinary algorithms; its disadvantages are a certain false positive rate and difficulty with deletion.

If that entry explanation isn't very clear to some of you, let's put it in plain language.

A Bloom filter, put simply, is for filtering. How? First I prepare a big range of slots, say from 0 to 100 million, each holding a true/false value that defaults to false. Now suppose requests arrive carrying a parameter, say an id. To prepare, I take every id in the database and hash it several times (say 3 times), producing 3 hash values, and set the slot at each of those 3 positions to true. If the database holds 300,000 records, that's 900,000 hash computations, each flipping its slot to true. Of course, some slots will be hit more than once. (All of this is done ahead of time, at the latest before any request arrives, hehe.)

When a request comes in, I hash its parameter the same 3 times and look up the corresponding slots. If all 3 are true, you may pass. If even one is false, then sorry: your id is obviously not in the database, so off you go, no service for you!

Of course, since hashes can collide, an occasional lucky piece of data that is not in the database may still find all 3 of its slots set to true, and it slips right through the filter!

So a Bloom filter has this characteristic: elements that exist will definitely pass, and elements that don't exist might still pass.
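That behavior is easy to see in a minimal toy version (plain Java; a boolean array stands in for the bit vector, and the three hash functions are simulated by salting the input; a sketch for illustration, not production code):

```java
// A minimal toy Bloom filter: a boolean array stands in for the bit vector,
// and three "hash functions" are simulated by salting the input string.
public class ToyBloomFilter {
    private final boolean[] bits = new boolean[1_000_000];

    private int hash(String value, int seed) {
        return Math.abs((value + seed).hashCode() % bits.length);
    }

    public void put(String value) {
        for (int seed = 0; seed < 3; seed++) {
            bits[hash(value, seed)] = true;   // flip the three mapped slots to true
        }
    }

    public boolean mightContain(String value) {
        for (int seed = 0; seed < 3; seed++) {
            if (!bits[hash(value, seed)]) {
                return false;                 // any false slot means definitely absent
            }
        }
        return true;                          // all three true: probably present
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter();
        filter.put("id-42");
        System.out.println(filter.mightContain("id-42"));        // true: stored ids always pass
        System.out.println(filter.mightContain("never-stored")); // very likely false, but not guaranteed
    }
}
```

Note the asymmetry in `mightContain`: a single false slot is conclusive proof of absence, but all-true slots are only probable presence.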

Please don't nitpick at this point. The filter's main job is to absorb load and attacks. Even if a few fish slip through the net, once the code handles them the damage to the server or DB is negligible; that small amount of leakage is acceptable.

Let's look at a simple diagram:

Schematic diagram of Bloom filter

As the figure shows, the bit segment is prepared in advance: the data is hashed ahead of time to set the true/false values in the segment. When a request arrives, its hashes decide whether it gets filtered out.

Okay, the model is clear. How do we implement it?

Create an array? Hey, what a coincidence: doesn't this slot segment look exactly like an array? Write true and false into it and the job is done, right?

Yes, that implements the function. And you can tell from the diagram that this segment has to be very large, otherwise it fills up in no time: everything becomes true, and then what's the point of filtering? Let's not even exaggerate, call it tens of millions of entries. Are you sure you want to create an array or list with tens of millions of elements? Never mind whether it filters attacks well; an object array that size is already plenty for the server to choke on! Come, let's do the math. The smallest an object can be in Java is 16 bytes; multiply by ten million and you get 160 million bytes, roughly 150 MB.
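Working that multiplication through (assuming the commonly cited 16-byte minimum object size on a 64-bit JVM):

```java
// Back-of-the-envelope memory cost of ten million boxed objects.
public class MemoryEstimate {
    public static void main(String[] args) {
        long objects = 10_000_000L;   // ten million entries
        long bytesPerObject = 16L;    // assumed minimum object size on a typical 64-bit JVM
        long totalBytes = objects * bytesPerObject;
        System.out.println(totalBytes);                          // 160000000
        System.out.println(totalBytes / (1024 * 1024) + " MB");  // 152 MB
    }
}
```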

Then, for every request, you have to run multiple lookups and checks against that huge structure before anything gets filtered.

Never mind whether it relieves pressure on the server: stack up a few more Bloom filters like that and congratulations, game over! The server goes straight down, and then who needs a filter?

Kill the server outright, and who needs a filter?!

       Bloom filter: "Arrays are definitely not usable. You don't need arrays to kill in this life! We are also a disciplined filter!"

If not an array, then what? The Baidu entry already said it: a very long binary vector.

Binary touches on the low-level workings of the computer, so let me explain a little here.

Computer programs, whether written in high-level languages like Java, C, or C++, eventually pass through assembly and then machine language and end up as the two digits 0 and 1. The computer recognizes only those two digits. What about if..else..? No such luxury down there!

Early mainframes actually read punched cards, but after generations of upgrades, today's computers surpass their ancestors beyond recognition in both storage and computing power.

Okay, enough rambling.

Everyone is familiar with storage, right? Even without knowing the low-level details, you deal with it constantly: hundreds of GB of movies on some hard drive, a phone burning through dozens of MB of data, and so on. All of it is ultimately the transmission of the two digits 1 and 0.

The familiar units are TB (1024 GB), GB (1024 MB), MB (1024 KB), KB (1024 B), and B (byte). So is the byte the smallest unit?

NO! The byte is still some distance away from 1 and 0. That last stretch is the bit: 1 byte = 8 bits, and the bit is where a single 1 or 0 actually lives. So 1 byte occupies 8 bits, 1 KB is 1024 × 8 = 8192 bits, and 1 MB is 8,388,608 bits. If each bit represents one number, 1 MB can represent 8,388,608 of them (millions!), and with 1 and 0 standing for true and false, doesn't that mean 1 MB can cover a slot segment of more than 8 million entries? Better still, you can address any slot directly and read its 0 or 1 in O(1) time. Isn't this tailor-made for Bloom filters? Oh no, it's the other way around: the Bloom filter was designed precisely to exploit this storage mechanism. Let's look at a schematic diagram:

Schematic diagram of Bloom filter storage
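That bit-per-slot storage is exactly what the JDK's `java.util.BitSet` (backed by a `long[]`) provides. A rough sketch of the arithmetic and the O(1) set/get:

```java
import java.util.BitSet;

// Demonstrates that 1 MB of bits gives over 8 million true/false slots,
// each settable and readable in O(1).
public class BitStorageDemo {
    public static void main(String[] args) {
        // 1 MB worth of bits: 1024 * 1024 * 8 = 8,388,608 slots
        int slots = 1024 * 1024 * 8;
        BitSet bits = new BitSet(slots);
        System.out.println(slots);                   // 8388608

        bits.set(1_234_567);                         // mark one "id" as present, O(1)
        System.out.println(bits.get(1_234_567));     // true
        System.out.println(bits.get(7_654_321));     // false: that slot was never set

        // BitSet packs the flags into a long[], so the backing store is about
        // slots / 8 bytes, i.e. roughly 1 MB for 8 million slots.
        System.out.println(bits.size() / 8 + " bytes");
    }
}
```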

Look, roughly 10-20 MB of space is enough to run the whole Bloom filter. Isn't that sweet? Who needs your array! The Bloom filter is simple to understand, but a bit of a hassle to implement from scratch. Don't worry, though: Java's ecosystem (Google's Guava library) has it packaged up, hehe. Let's take a look first.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.charset.Charset;
import java.util.UUID;

public class BloomTest {

    // Number of entries to store. The bit segment itself is sized internally
    // from this count and the error rate below, so you don't manage it yourself.
    private static int dataAmount = 500000;

    // The false positive rate: a Bloom filter always lets some nonexistent
    // data through, and this is the probability of that happening.
    // Here: one in a thousand.
    private static double rate = 0.001;

    public static void main(String[] args) {

        // I originally wanted Integer, but real-world keys are rarely plain numbers;
        // strings are more general. It all becomes a hash anyway, so it hardly matters.
        BloomFilter<String> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()), dataAmount, rate);

        // Load the data first
        for (int i = 0; i < dataAmount; i++) {
            String uuid = UUID.randomUUID().toString();
            bloomFilter.put(uuid);
        }
        // Data is in place; now intercept. Throw 50000 fresh values at it.
        int number = 0;
        for (int i = 0; i < 50000; i++) {
            String uuid = UUID.randomUUID().toString();
            if (bloomFilter.mightContain(uuid)) {
                number++;
            }
        }
        System.out.println("False positives out of 50000: " + number
                + "\nAs a fraction: " + new BigDecimal(number)
                        .divide(new BigDecimal(50000), 5, RoundingMode.HALF_UP));
    }
}

Run this main method yourself; the fraction comes out close to 0.001, give or take. Here is the result of one of my runs:

False positives out of 50000: 51
As a fraction: 0.00102

Process finished with exit code 0
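Incidentally, the dataAmount and rate you pass in determine the filter's size. The standard Bloom filter sizing formulas (general theory, not anything Guava-specific) give the optimal bit count m and hash count k:

```java
// Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2.
public class BloomSizing {
    public static void main(String[] args) {
        int n = 500000;        // expected insertions (dataAmount above)
        double p = 0.001;      // desired false positive rate

        // Optimal number of bits in the vector
        double m = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        // Optimal number of hash functions
        double k = (m / n) * Math.log(2);

        // Roughly 7.19 million bits, i.e. under 1 MB, for half a million entries
        System.out.println(Math.round(m) + " bits, about " + Math.round(m / 8 / 1024) + " KB");
        System.out.println(Math.round(k) + " hash functions");
    }
}
```

So half a million entries at a 0.1% error rate need well under 1 MB of bits and about 10 hashes, which is why the filter stays so small.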

By the way, remember to add the dependency when copying the code: the maven dependency for Google's BloomFilter (pick whatever version suits you):

       <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
        </dependency>

How about it? Isn't this Bloom filter convenient to use? So easy? Really exciting? Hey, but it's not what most people actually use.

Because this is a Bloom filter for a single server. Does anyone deploy services standalone these days? Are you kidding me? In a distributed environment, keeping a separate Bloom filter on every machine... does memory grow on trees? Are you some kind of tycoon? No, thank you.

Redis: "Distributed environment? I'm familiar with it, come, come, let me come, hehe!".

So Redis takes the stage! Some folks implement the Bloom filter on top of Redis themselves, but the Java-Redis integration already has it packaged. For simplicity I skipped bean injection, wrote a plain test class, initialized Redisson in the constructor, and used its Bloom filter directly.

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.UUID;

public class RedissonBloomTest {

    // Number of entries to store. The backing bit segment is created by the
    // filter itself from this count and the error rate below.
    private static int dataAmount = 1000000;

    // The false positive rate: a Bloom filter always lets some nonexistent
    // data through, and this is the probability of that happening.
    // Here: one in a thousand.
    private static double rate = 0.001;

    // Client service. In Spring you would create a bean and inject it; here,
    // for simplicity, Redisson is initialized in the constructor.
    RedissonClient redisson;

    public static void main(String[] args) {
        // Get the RedissonClient service
        RedissonClient redissonClient = new RedissonBloomTest().getRedisson();
        // Get (create) the Bloom filter
        RBloomFilter<String> redisBloomFilter = redissonClient.getBloomFilter("RedisBloomFilter");
        // Initialize the Bloom filter
        redisBloomFilter.tryInit(dataAmount, rate);
        // Same flow as the Guava version, copied over

        // Load the data first
        for (int i = 0; i < dataAmount; i++) {
            String uuid = UUID.randomUUID().toString();
            // Inserting over the network is much slower; use less data when testing.
            // In production, load it well ahead of time or you're asking for an incident.
            redisBloomFilter.add(uuid);
        }
        // Data is in place; now intercept. Throw 100000 fresh values at it.
        int number = 0;
        for (int i = 0; i < 100000; i++) {
            String uuid = UUID.randomUUID().toString();
            if (redisBloomFilter.contains(uuid)) {
                number++;
            }
        }
        System.out.println("False positives out of 100000: " + number
                + "\nAs a fraction: " + new BigDecimal(number)
                        .divide(new BigDecimal(100000), 5, RoundingMode.HALF_UP));

        redissonClient.shutdown();
    }

    static Config config = new Config();

    static {
        config.useSingleServer()
                .setAddress("redis://127.0.0.1:6379");
    }

    public RedissonBloomTest() {
        redisson = Redisson.create(config);
    }

    public static Config getConfig() {
        return config;
    }

    public RedissonClient getRedisson() {
        return redisson;
    }
}

This is the result after one of my runs:

False positives out of 100000: 1847
As a fraction: 0.01847

This result is noticeably off from the configured error rate. I checked the filter's size and compared it with Google's at one million entries: the sizes are almost the same, both in the tens of millions of bits, yet the falseProbability behaves differently. (One suspicion: the filter persists in Redis between runs, and tryInit won't re-create one that already exists, so every rerun pours another million fresh UUIDs into the same filter and pushes it past capacity. If anyone knows for certain, please enlighten me!)

That problem will have to wait for another day. For now, let's look at the strengths and weaknesses of Bloom filters:

Pros: simple, convenient, good at filtering large volumes of data

Cons: the data has to be organized and loaded in advance, and individual entries can't be deleted. If the underlying data changes (especially deletions), the whole filter must be rebuilt, which is very inconvenient.

And the typical application scenarios of Bloom filters:

1. Cache penetration (a perennial interview question): use a Bloom filter to reject nonexistent keys outright

2. Deduplication over big data, e.g. deduplicating URLs in a crawler system

3. Spam filtering also commonly uses Bloom filters. Thanks to the filter, a normal email occasionally lands in the spam folder: that's a false positive, and the probability is very low.
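Scenario 1 is worth a quick sketch: the filter sits in front of the cache and DB, so ids it rejects never touch either. In this runnable mock-up, plain maps stand in for Redis and the database, and a HashSet stands in for the Bloom filter (all names here are illustrative, not a real API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of a cache-penetration guard. The maps and set are stand-ins:
// in real code they'd be Redis, the database, and an actual Bloom filter.
public class CachePenetrationGuard {
    static Map<String, String> cache = new HashMap<>();     // stand-in for Redis
    static Map<String, String> database = new HashMap<>();  // stand-in for the DB
    static Set<String> bloom = new HashSet<>();             // stand-in for the Bloom filter

    static String getUser(String id) {
        if (!bloom.contains(id)) {
            System.out.println("rejected by filter: " + id);
            return null;                          // definitely absent: never touches cache or DB
        }
        String hit = cache.get(id);
        if (hit != null) return hit;              // cache hit
        String row = database.get(id);            // filter passed: query the DB
        if (row != null) cache.put(id, row);      // warm the cache for next time
        return row;
    }

    public static void main(String[] args) {
        database.put("42", "Alice");
        bloom.add("42");                          // filter is pre-loaded with all DB ids
        System.out.println(getUser("42"));        // Alice
        System.out.println(getUser("attack-id")); // rejected by filter, then null
    }
}
```

Requests for made-up ids stop at the first check, so a flood of them never reaches the DB.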

 

Phew, that was close: it's the end of the month and this is the only post I've managed.

no sacrifice, no victory~

If you think what I've written is passable, give it a thumbs up~

Origin blog.csdn.net/zsah2011/article/details/115300961