Redis advanced features: filtering hundreds of millions of records with the Bloom filter

First, an introduction to the Bloom filter

Last time we learned how to use HyperLogLog to estimate the size of very large data sets. It is very valuable and covers many statistical needs that do not require high precision. But if we want to know whether a particular value is already inside a HyperLogLog structure, it cannot help: it provides pfadd and pfcount (plus pfmerge), but nothing like a contains method.

Take a concrete scenario: you are scrolling through Douyin.

 

Have you ever been shown the same recommendation twice? With so much content being recommended to so many users, how does the system make sure that a user is never recommended a video they have already watched? In other words, how does Douyin deduplicate the content it pushes?

You might think the server records every user's complete viewing history, and that when the recommendation system picks short videos it screens them against that history, filtering out the ones already recorded. The problem is that when there are a huge number of users and each of them has watched a lot of short videos, can the recommendation system's deduplication keep up in terms of performance?

In fact, if the history is stored in a relational database, deduplication means running exists queries against the database constantly, and under high concurrency the database will struggle to withstand the pressure.

 

You might also think of a cache, but with so many users and so much history, caching all of it would waste a great deal of space (your boss may look at the bill, and then look at you). This storage also grows linearly over time: even if the cache holds up for a month, how much longer can it last? Yet without a cache, performance cannot keep up. So what are we supposed to do?

The Bloom filter is an advanced data structure designed to solve exactly this kind of deduplication problem. Like HyperLogLog, it is slightly imprecise and has a certain probability of false positives, but it can deduplicate while saving 90% or more of the space, which makes the trade-off very worthwhile.

What is a Bloom filter

The Bloom filter was proposed by Bloom in 1970. It is actually a very long binary vector plus a series of random mapping functions (detailed below). You can think of it simply as a not-quite-precise set structure: when you use its contains method to check whether an object exists, it may misjudge. However, the Bloom filter is not wildly imprecise either. As long as its parameters are set reasonably, its accuracy can be controlled well enough that only a small probability of false positives remains.

When the Bloom filter says a value exists, the value may not actually exist; when it says a value does not exist, the value definitely does not exist. By analogy: when it says it does not recognize you, it really does not know you; but when it says it knows you, that may be because you look like another friend it knows (the faces are somewhat similar), so it misjudges you as someone it has met.

Bloom filter usage scenarios

Based on the characteristics above, a Bloom filter is a good fit for roughly the following scenarios:

  • Checking existence in big data: this is the deduplication use described above. If your server has enough memory, a HashMap may be a good solution, since in theory lookups run in O(1) time; but once the data volume grows large, a Bloom filter is the practical choice.

  • Solving cache penetration: we often put hot data, such as product details, in Redis as a cache. A request usually queries the cache first and reads the database only on a miss, which is the simplest and most common way to improve performance. But if requests ask for data that does not exist at all, the cache can never contain it, so a large number of such requests hit the database directly. This is cache penetration, and a Bloom filter can be used to solve it as well.

  • Crawler filtering, mail filtering and similar systems: you may have noticed that some legitimate mail ends up in the spam folder; this can be the result of a Bloom filter false positive.
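The cache penetration case can be sketched in a few lines of Java. This is only a sketch under assumptions: the class name PenetrationGuard, the HashMaps standing in for the cache and the database, the 65,536-bit filter size and the hash seeds are all invented for illustration, and the filter itself is a minimal stand-in (how it works is covered in the next section).

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch only: HashMaps stand in for the real cache and database.
class PenetrationGuard {
    private static final int BITS = 1 << 16;            // assumed filter size
    private static final int[] SEEDS = {17, 31, 131};   // assumed hash seeds
    private final BitSet filter = new BitSet(BITS);
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database = new HashMap<>();
    private int dbHits = 0;

    private static int position(String key, int seed) {
        int h = key.hashCode() * seed;
        return Math.floorMod(h ^ (h >>> 16), BITS);
    }

    // false means the key was definitely never saved; true means "probably saved".
    boolean mightExist(String key) {
        for (int seed : SEEDS) {
            if (!filter.get(position(key, seed))) {
                return false;
            }
        }
        return true;
    }

    // Every key written to the database is also registered in the filter.
    void save(String key, String value) {
        database.put(key, value);
        for (int seed : SEEDS) {
            filter.set(position(key, seed));
        }
    }

    // Keys rejected by the filter never reach the cache or the database.
    Optional<String> find(String key) {
        if (!mightExist(key)) {
            return Optional.empty();
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        dbHits++;                                  // only now do we touch the database
        String value = database.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return Optional.ofNullable(value);
    }

    int dbHits() {
        return dbHits;
    }
}
```

A request for a key that was never saved is normally turned away by mightExist before it costs a database query; only the occasional false positive slips through, and that query simply comes back empty.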

Second, how the Bloom filter works

A Bloom filter is essentially a bit vector (or bit list) of length m, that is, a list containing only 0 and 1 bit values, with every bit initially set to 0. For illustration, picture a reasonably long bit vector in which every position is still 0.

When we add a key to the Bloom filter, several hash functions are applied to it. Each one produces an integer index value, which is then taken modulo the length of the bit vector to obtain a position, and each hash function yields a different position. Setting the bits at all of these positions to 1 completes the add operation. For example, suppose we add the key wmyskxz.

To query whether a key exists, the Bloom filter does the same work as add: it runs the key through the same hash functions and checks whether the corresponding positions are all 1. As soon as one of them is 0, the key definitely does not exist in the filter. If all of them are 1, that does not prove the key exists, only that it very likely does, because those bits may have been set to 1 by other keys.
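To make the add and query steps above concrete, here is a small trace that prints the bit vector before and after adding wmyskxz. It is a minimal sketch, not a production filter: the class name TinyBloomTrace, the 16-bit length and the three seeds are assumptions chosen only to keep the output readable.

```java
import java.util.BitSet;

// Illustrative only: a 16-bit vector and three seeded hashes (seeds are arbitrary).
class TinyBloomTrace {
    static final int M = 16;                  // length m of the bit vector
    static final int[] SEEDS = {7, 11, 13};   // one seed per hash function
    final BitSet bits = new BitSet(M);

    // Position chosen by the hash function with the given seed, always in [0, M).
    static int position(String key, int seed) {
        int h = key.hashCode() * seed;
        h ^= (h >>> 16);
        return Math.floorMod(h, M);
    }

    // add: set the bit at every position the hash functions point to.
    void add(String key) {
        for (int seed : SEEDS) {
            bits.set(position(key, seed));
        }
    }

    // query: a single 0 bit proves absence; all 1 bits only mean "probably present".
    boolean mightContain(String key) {
        for (int seed : SEEDS) {
            if (!bits.get(position(key, seed))) {
                return false;
            }
        }
        return true;
    }

    // Renders the vector as a string of 0s and 1s, position 0 first.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < M; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TinyBloomTrace filter = new TinyBloomTrace();
        System.out.println(filter.render());                 // all zeros before any add
        filter.add("wmyskxz");
        System.out.println(filter.render());                 // at most three bits are now 1
        System.out.println(filter.mightContain("wmyskxz"));  // true: added keys are never missed
    }
}
```

With such a short vector, it does not take many keys before an unrelated key finds all of its positions already set, which is exactly the false positive case.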

For example, after adding some data, suppose we query a key that does not exist:

Here positions 1, 3 and 5 are 1 only because wmyskxz was added earlier, so the filter misjudges this key as present. Fortunately, there is a formula for predicting a Bloom filter's false positive rate. It is somewhat involved, and interested readers can study it on their own; for everyday use it is enough to remember the following points:

  • Do not let the actual number of elements grow far beyond the number the filter was initialized for;

  • When the actual number of elements exceeds the initialized number, the Bloom filter should be rebuilt: allocate a new filter with a larger size, then add all the historical elements to it in a batch.
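For readers who do want to see the formula, the usual textbook approximation is p ≈ (1 - e^(-kn/m))^k, where n is the number of elements added, m is the number of bits and k is the number of hash functions. The sketch below evaluates it to show why overfilling hurts. The class name FalsePositiveRate and the sample numbers (9,600,000 bits and 7 hash functions, roughly what a 1% target for one million elements calls for) are assumptions for illustration.

```java
// Textbook approximation of the Bloom filter false positive rate:
//   p ≈ (1 - e^(-k*n/m))^k
// n = elements inserted, m = bits in the vector, k = number of hash functions.
class FalsePositiveRate {
    static double estimate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 9_600_000L;  // assumed size: about 9.6 bits per element for 1,000,000 elements
        int k = 7;            // assumed number of hash functions
        // Keep m and k fixed and watch the rate climb as the filter is overfilled.
        for (long n : new long[]{500_000L, 1_000_000L, 2_000_000L, 4_000_000L}) {
            System.out.printf("n=%d -> p=%.4f%n", n, estimate(n, m, k));
        }
    }
}
```

At the designed load the estimate sits near 1%, and doubling the number of elements pushes it well above 10%, which is why an overfilled filter should be rebuilt at a larger size.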

Third, using the Bloom filter

The Bloom filter officially provided by Redis only made its debut after Redis 4.0 introduced the plug-in (module) mechanism. The Bloom filter is loaded into Redis Server as a plug-in and gives Redis powerful Bloom-based deduplication. Let's try out the Bloom filter on Redis 4.0; to skip the tedious installation process, we will simply use Docker.

> docker pull redislabs/rebloom # pull the image
> docker run -p6379:6379 redislabs/rebloom # run the container
> redis-cli # connect to the Redis server in the container

If these three commands run without problems, you can start trying out the Bloom filter.

  • Of course, if you would rather not use Docker, you can check that your local Redis version qualifies and then install the plug-in yourself; see: https://blog.csdn.net/u013030276/article/details/88350641

Basic usage of the Bloom filter

The Bloom filter has two basic commands: bf.add adds an element, and bf.exists checks whether an element exists. They are used in the same way as sadd and sismember for sets. Note that bf.add adds only one element at a time; to add several at once, use the bf.madd command. Likewise, to check several elements at once, use the bf.mexists command.

127.0.0.1:6379> bf.add codehole user1
(integer) 1
127.0.0.1:6379> bf.add codehole user2
(integer) 1
127.0.0.1:6379> bf.add codehole user3
(integer) 1
127.0.0.1:6379> bf.exists codehole user1
(integer) 1
127.0.0.1:6379> bf.exists codehole user2
(integer) 1
127.0.0.1:6379> bf.exists codehole user3
(integer) 1
127.0.0.1:6379> bf.exists codehole user4
(integer) 0
127.0.0.1:6379> bf.madd codehole user4 user5 user6
1) (integer) 1
2) (integer) 1
3) (integer) 1
127.0.0.1:6379> bf.mexists codehole user4 user5 user6 user7
1) (integer) 1
2) (integer) 1
3) (integer) 1
4) (integer) 0

The Bloom filter used above has the default parameters; it was created automatically the first time we called add. Redis also lets us customize the parameters: we only need to create the filter explicitly with the bf.reserve command before the first add. If the key already exists, bf.reserve returns an error.

bf.reserve takes three parameters: key, error_rate and initial_size:

  • error_rate: the lower it is, the more space is needed. For use cases that do not need much precision, setting it slightly higher does not matter. In the push system described above, for example, it only means a small portion of content is wrongly filtered out, and the overall viewing experience is not greatly affected;

  • initial_size: the number of elements the filter is expected to hold. When the actual number exceeds this value, the false positive rate rises, so set a sufficiently large value in advance to avoid that.

If bf.reserve is not used, the default error_rate is 0.01 and the default initial_size is 100.
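To get a feel for how error_rate trades against space, the standard sizing formulas can be evaluated directly: m = -n ln(p) / (ln 2)^2 bits for n elements at false positive rate p, and k = (m / n) ln 2 hash functions. The sketch below is an assumption-laden illustration: BloomSizing is a made-up class name, and the Redis plug-in may allocate memory differently from this textbook calculation.

```java
// Textbook Bloom filter sizing; not necessarily what the Redis plug-in allocates.
class BloomSizing {
    // Bits needed for n elements at false positive rate p: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions that minimizes p for given n and m: k = (m / n) * ln 2
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 100;  // same as the default initial_size mentioned above
        for (double p : new double[]{0.1, 0.01, 0.001}) {
            long m = optimalBits(n, p);
            System.out.printf("error_rate=%.3f -> %d bits, %d hash functions%n",
                    p, m, optimalHashes(n, m));
        }
    }
}
```

Each tenfold reduction in error_rate costs roughly 4.8 extra bits per element, so the space grows with the logarithm of the precision rather than linearly.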

Fourth, implementing the Bloom filter in code

A simple simulated implementation of our own

Based on the theory above, we can easily write a simple simulated Bloom filter data structure ourselves:

public static class BloomFilter {

    private byte[] data;

    public BloomFilter(int initSize) {
        this.data = new byte[initSize * 2]; // allocate initSize * 2 slots by default
    }

    public void add(int key) {
        int location1 = Math.abs(hash1(key) % data.length);
        int location2 = Math.abs(hash2(key) % data.length);
        int location3 = Math.abs(hash3(key) % data.length);

        data[location1] = data[location2] = data[location3] = 1;
    }

    public boolean contains(int key) {
        int location1 = Math.abs(hash1(key) % data.length);
        int location2 = Math.abs(hash2(key) % data.length);
        int location3 = Math.abs(hash3(key) % data.length);

        return data[location1] * data[location2] * data[location3] == 1;
    }

    private int hash1(Integer key) {
        return key.hashCode();
    }

    private int hash2(Integer key) {
        int hashCode = key.hashCode();
        return hashCode ^ (hashCode >>> 3);
    }

    private int hash3(Integer key) {
        int hashCode = key.hashCode();
        return hashCode ^ (hashCode >>> 16);
    }
}

This is very simple: internally it only maintains a data array of type byte. In practice a byte still takes up a whole byte for each flag, and this could be optimized by using individual bits instead; bytes are used here only to keep the simulation convenient. I also created three different hash functions, borrowing the hash perturbation idea from HashMap: one uses the key's own hash directly, and the other two XOR the hash with itself shifted right by different numbers of bits. Finally, the class provides the basic add and contains methods.
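The bit-level optimization mentioned above can be sketched as follows: instead of one byte per flag, 64 flags are packed into each long. PackedBits is a hypothetical name used only for this illustration; in practice java.util.BitSet already provides the same idea ready-made.

```java
// Sketch: the same 0/1 flags stored one per bit in a long[] instead of one per byte.
class PackedBits {
    private final long[] words;

    PackedBits(int size) {
        this.words = new long[(size + 63) >>> 6];  // 64 flags per long, rounded up
    }

    // index >>> 6 selects the long, index & 63 selects the bit inside it.
    void set(int index) {
        words[index >>> 6] |= 1L << (index & 63);
    }

    boolean get(int index) {
        return (words[index >>> 6] & (1L << (index & 63))) != 0;
    }

    int bytesUsed() {
        return words.length * Long.BYTES;
    }

    public static void main(String[] args) {
        PackedBits bits = new PackedBits(2_000_000);
        bits.set(5);
        System.out.println(bits.get(5));       // true
        System.out.println(bits.get(6));       // false
        System.out.println(bits.bytesUsed());  // 250000, versus 2000000 for a byte[] of the same length
    }
}
```

For the two million slots used in the simulation, this cuts the memory for the flags to one eighth.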

Let's run a simple test to see how this Bloom filter behaves:

// Requires imports: java.util.LinkedList, java.util.Random, java.util.concurrent.atomic.AtomicInteger
public static void main(String[] args) {
    Random random = new Random();
    // Assume we have 1 million data items
    int size = 1_000_000;
    // Keep every value that really exists in a separate data structure
    LinkedList<Integer> existentNumbers = new LinkedList<>();
    BloomFilter bloomFilter = new BloomFilter(size);

    for (int i = 0; i < size; i++) {
        int randomKey = random.nextInt();
        existentNumbers.add(randomKey);
        bloomFilter.add(randomKey);
    }

    // Verify that every existing number is reported as present
    AtomicInteger count = new AtomicInteger();
    AtomicInteger finalCount = count;
    existentNumbers.forEach(number -> {
        if (bloomFilter.contains(number)) {
            finalCount.incrementAndGet();
        }
    });
    System.out.printf("Actual data count: %d, count judged present: %d \n", size, count.get());

    // Check 10 numbers that do not exist
    count = new AtomicInteger();
    while (count.get() < 10) {
        int key = random.nextInt();
        if (existentNumbers.contains(key)) {
            continue;
        } else {
            // This number is guaranteed not to exist
            System.out.println(bloomFilter.contains(key));
            count.incrementAndGet();
        }
    }
}

The output is as follows:

Actual data count: 1000000, count judged present: 1000000
false
true
false
true
true
true
false
false
true
false

This is what was said earlier: when the Bloom filter says a value exists, the value may not exist; when it says a value does not exist, it definitely does not exist; and there is a certain rate of false positives. In this run, 5 of the 10 absent numbers were misjudged as present, which is not surprising given how small the array is relative to the data and how crude the three hash functions are.

A hand-written reference implementation

Of course, the version above is quite crude, although the main idea is sound. Here is a better hand-written version for reference:

import java.util.BitSet;

public class MyBloomFilter {

    /**
     * Size of the bit array (2^25 bits); must be a power of two for the index mask in SimpleHash
     */
    private static final int DEFAULT_SIZE = 2 << 24;
    /**
     * Seeds from which 6 different hash functions are created
     */
    private static final int[] SEEDS = new int[]{3, 13, 46, 71, 91, 134};

    /**
     * The bit array. Each element can only be 0 or 1
     */
    private BitSet bits = new BitSet(DEFAULT_SIZE);

    /**
     * Array of the objects that wrap the hash functions
     */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];

    /**
     * Initializes the array of hash objects; each object uses a different hash function
     */
    public MyBloomFilter() {
        // Initialize several different hash functions
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    /**
     * Adds an element to the bit array
     */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    /**
     * Checks whether the given element is present in the bit array
     */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /**
     * Static nested class used for the hash operation
     */
    public static class SimpleHash {

        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /**
         * Computes the hash value, an index in [0, cap)
         */
        public int hash(Object value) {
            int h;
            // Mix the hash code with the seed, then mask it into range (cap is a power of two)
            return (value == null) ? 0 : (cap - 1) & (seed * ((h = value.hashCode()) ^ (h >>> 16)));
        }

    }
}

Using the Bloom filter that ships with Google's open-source Guava

The main point of writing one ourselves is to understand how the Bloom filter works. The Bloom filter implemented in Guava is a fairly authoritative one, so in real projects we do not need to implement a Bloom filter by hand.

First, we need to add the Guava dependency to the project:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0-jre</version>
</dependency>

Actual usage looks like this:

We create a Bloom filter that stores at most 1500 integers and tolerates a false positive probability of one percent (0.01):

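// Requires imports: com.google.common.hash.BloomFilter, com.google.common.hash.Funnels
// Create the Bloom filter object
BloomFilter<Integer> filter = BloomFilter.create(
        Funnels.integerFunnel(),
        1500,
        0.01);
// Check whether the given elements are present
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));
// Add the elements to the Bloom filter
filter.put(1);
filter.put(2);
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));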

In our example, when the mightContain() method returns true, the element is very probably in the filter (an element that was never added is wrongly reported as present with a probability of about 1%); when it returns false, we can be 100% sure that the element is not in the filter.

The Bloom filter implementation provided by Guava is very good (if you want to know more, read its source code), but it has one major drawback: it can only be used on a single machine (and expanding its capacity is not easy), whereas today's Internet systems are generally distributed. To solve this problem, we need to use the Bloom filter in Redis.

Source: I do not have three heart


Origin blog.csdn.net/ailiandeziwei/article/details/104850563