Bloom filter (BloomFilter) - Introduction (Part 1)

The Bloom filter was proposed by Burton Howard Bloom in 1970. It is typically used as a first-pass filter that cheaply discards most of the data, so that later and more expensive processing has less to handle. A BloomFilter works like a large coarse sieve for sand: pour the sand in and it quickly removes most of the unwanted material, but it cannot guarantee that everything left is the fine sand we want. Because its holes are coarse, such a sieve is cheap to make, so even a very large one that handles huge volumes of sand is affordable. A small fine sieve then processes what remains and keeps only the sand that meets the size requirement. A fine sieve is expensive to make, but because it only needs to be small, the cost is acceptable. That is the idea behind Bloom filters.

1. Basic Concepts

  • Description
    A BloomFilter is a probabilistic data structure. For now, think of it as a "set" (that is not quite accurate ^_^; a later post will explain how it is designed) that supports efficient insertion and membership queries. Its main job is to answer whether an element is in this "set", but it can only answer "definitely not present" or "possibly present". In other words, a BloomFilter may report that an absent element is present (a false positive), but it never reports that a present element is absent.
    BloomFilter answers    Actual state
    not present            not present
    present                not present, or present
  • Features
    • A data structure based on probability
    • Used to test whether an element is in a set, with some false positives (the rate is configurable and can be made very low); commonly used as a first-pass filter for large data sets
    • Insertion and query both take constant time (a query only checks whether an element exists)
    • Elements cannot be deleted
    • Low memory footprint (much smaller than a HashSet; the size can be estimated with Bloom-Calculator)
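
The "definitely not present or possibly present" behavior can be illustrated with a small hand-written filter. This is only a teaching sketch, not Guava's implementation: the class name SimpleBloomFilter, the FNV-1a hash, and the double-hashing scheme are all assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical teaching sketch of a Bloom filter; not how Guava implements it. */
final class SimpleBloomFilter {

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Sets the bit positions derived from the element. */
    void put(String element) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(element, i));
        }
    }

    /** Returns false only if the element was definitely never added. */
    boolean mightContain(String element) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(element, i))) {
                return false; // one unset bit proves the element is absent
            }
        }
        return true; // all bits set: present, or a collision with other elements
    }

    // Double hashing (an assumption of this sketch): position_i = h1 + i * h2 (mod numBits).
    private int index(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force odd so successive positions differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 7);
        for (int i = 0; i < 1000; i++) {
            filter.put("jack" + i);
        }
        // An inserted element is always reported as possibly present.
        System.out.println("jack10: " + filter.mightContain("jack10"));
        // An element that was never inserted is usually, but not always, reported absent.
        System.out.println("rose: " + filter.mightContain("rose"));
    }
}
```

Because several elements may share the same bits, clearing bits to remove one element could also erase others, which is why a plain Bloom filter cannot delete elements.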

2. BloomFilter and HashSet performance comparison

  • Query time as the number of stored elements grows (1 million queries per run)

    Elements     100,000  1,000,000  5,000,000  10,000,000  20,000,000  30,000,000  50,000,000  100,000,000  200,000,000  500,000,000
    BloomFilter  285ms    288ms      186ms      182ms       197ms       198ms       188ms       189ms        191ms        173ms
    HashSet      22ms     30ms       21ms       17ms        23ms        -           -           -            -            -
  • Total insertion time as the number of inserted elements grows

    Elements     100,000  1,000,000  5,000,000  10,000,000  20,000,000  30,000,000  50,000,000  100,000,000  200,000,000  500,000,000
    BloomFilter  96ms     518ms      2807ms     5888ms      13147ms     20713ms     35675ms     73075ms      152333ms     380256ms
    HashSet      21ms     121ms      2971ms     6632ms      22494ms     OOM         OOM         OOM          OOM          OOM

    Note: measured on my PC, with the BloomFilter false-positive rate set to 1%. OOM means the run failed with a memory overflow.

  • Analysis
    For queries, HashSet query time stays almost unchanged as the container grows, while BloomFilter query time improves slightly and then stabilizes. For inserts, HashSet is faster than BloomFilter at first, but as the amount of inserted data grows it is overtaken by BloomFilter and falls far behind, and its memory usage eventually becomes so large that it causes a memory overflow.
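
The memory side of this comparison can be estimated with the standard Bloom filter sizing formulas. The sketch below is an assumption-laden illustration: the class name BloomSizing and its method names are hypothetical, and the bits a real library actually allocates may differ slightly from these textbook values.

```java
/** Hypothetical helper applying the textbook Bloom filter sizing formulas. */
final class BloomSizing {

    /** Optimal number of bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2, at least 1. */
    static int optimalNumHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        double fpp = 0.01; // the 1% false-positive rate used in the measurements
        long[] sizes = {100_000L, 1_000_000L, 10_000_000L, 500_000_000L};
        for (long n : sizes) {
            long m = optimalNumBits(n, fpp);
            double mib = m / 8.0 / 1024 / 1024;
            System.out.printf("n=%d -> %d bits (%.1f MiB), k=%d%n",
                    n, m, mib, optimalNumHashes(n, m));
        }
    }
}
```

At a 1% false-positive rate this works out to about 9.6 bits per element regardless of how large each element is, whereas a HashSet must keep every element object plus per-entry bookkeeping in memory.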

3. BloomFilter usage example

  • Add the dependency (pom.xml)
<!-- This example uses the BloomFilter provided by Google Guava, imported as a Maven dependency -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0-jre</version>
</dependency>
  • Usage example
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;

/**
 * Description: Bloom filter example
 *
 * @author ALion
 * @version 2019/8/17 1:52
 */
public class BloomFilterDemo {

    public static void main(String[] args) {
        // The Funnel tells the BloomFilter which fields of Person to use when computing its fingerprint
        Funnel<Person> personFunnel = new Funnel<Person>() {
            @Override
            public void funnel(Person person, PrimitiveSink into) {
                into.putInt(person.id)
                    .putString(person.firstName, Charsets.UTF_8);
            }
        };

        int count = 100 * 10000; // 1,000,000 elements

        // arg1: the Funnel
        // arg2: the expected number of elements the BloomFilter will hold
        // arg3: the desired false-positive rate of the BloomFilter
        BloomFilter<Person> friends = BloomFilter.create(personFunnel, count, 0.01);

        // Add data to the BloomFilter
        for (int i = 0; i < count; i++) {
            friends.put(new Person(i, "jack" + i));
        }

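        // Test the result
        // person1 (10, "jack") was never inserted: normally false, though a false positive is possible
        Person person1 = new Person(10, "jack");
        boolean exist1 = friends.mightContain(person1);
        System.out.println("person1: exist = " + exist1);

        // person2 (10, "jack10") was inserted above: always true
        Person person2 = new Person(10, "jack10");
        boolean exist2 = friends.mightContain(person2);
        System.out.println("person2: exist = " + exist2);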

    }

    static class Person {

        int id;

        String firstName;

        public Person(int id, String firstName) {
            this.id = id;
            this.firstName = firstName;
        }

    }

}

Origin blog.csdn.net/alionsss/article/details/99687080