[Reprint] The Bloom filter in detail: principle, use cases, and caveats


Today I ran into a business case: a team running a Redis cluster with large Values was using it as a Bloom filter. During the discussion I got looked down on a bit; the gist was, "You don't even understand how a Bloom filter works, and you want to optimize it?" Being exposed as technically weak stings, but I can't blame anyone: I really had only heard of Bloom filters before and never understood them, so I'm taking this opportunity to fill in the gap.

Before getting into the main text, here is a saying I came across that I think puts it well:

Data structures are nothing different. They are like the bookshelves of your application where you can organize your data. Different data structures will give you different facility and benefits. To properly use the power and accessibility of the data structures you need to know the trade-offs of using one.

The point is that different data structures suit different scenarios and come with different trade-offs; to apply them properly you need to weigh your own requirements carefully. The Bloom filter is a perfect embodiment of this idea.

What is a Bloom filter

Essentially, a Bloom filter is a data structure, and a rather clever probabilistic one at that (a probabilistic data structure). It supports efficient insertion and lookup, and it can tell you that an element "definitely does not exist, or may exist."

Compared with traditional data structures such as List, Set, and Map, it is more space- and time-efficient, but the drawback is that its answers are probabilistic rather than exact.

The principle

The problem with HashMap

Before explaining how a Bloom filter works, think about how you would normally check whether an element exists. Many people will answer HashMap: map the value to a HashMap key and you get an answer in O(1) time, which is extremely efficient. But HashMap has its drawbacks, such as high storage overhead. Taking the load factor into account, the space usually cannot be fully used, and once you have a large number of values, say hundreds of millions, the memory occupied by the HashMap becomes very substantial.

For example, this becomes a problem when your data set lives on a remote server and the local service has to check inputs against a data set that is far too large to read into memory in one go to build a HashMap.

Bloom filter data structure

A Bloom filter is a bit vector, or bit array, which looks like this:

[Figure: an empty bit array, with every bit initialized to 0]

If we want to map a value into the Bloom filter, we use several different hash functions to generate several hash values, and for each hash value we set the bit at the position it points to. For example, if the value "baidu" is hashed by three different hash functions into the values 1, 4, and 7, the figure above becomes:

[Figure: the bit array with bits 1, 4, and 7 set to 1 after adding "baidu"]

Now let's add another value, "tencent". If the hash functions return 3, 4, and 8, the picture becomes:

[Figure: the bit array with bits 1, 3, 4, 7, and 8 set to 1 after adding "tencent"]

Note that bit 4 is set by hash functions of both values, so it is covered twice. Now suppose we query whether the value "dianping" exists and the three hash functions return 1, 5, and 8. We find that bit 5 is 0, meaning no value has ever been mapped to that position, so we can say with certainty that "dianping" does not exist. And when we query the value "baidu", the hash functions are bound to return 1, 4, and 7 again, and we find that all three bits are 1. Can we therefore conclude that "baidu" exists? The answer is no: we can only say that "baidu" may exist.
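To make the walk-through above concrete, here is a minimal sketch in Python. The bit-array size of 16 and the way the three "different" hash functions are derived (by salting a single SHA-256 hash) are my own illustrative choices, not details from the figures above.

```python
import hashlib

class SimpleBloomFilter:
    def __init__(self, size=16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size                      # the bit array, all zeros initially

    def _positions(self, value):
        # Simulate num_hashes different hash functions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1                      # set the bit at each hashed position

    def might_contain(self, value):
        # Any 0 bit means the value was definitely never added;
        # all 1s only means it *may* have been added (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(value))

bf = SimpleBloomFilter()
bf.add("baidu")
bf.add("tencent")
print(bf.might_contain("baidu"))     # True: values that were added are always found
print(bf.might_contain("dianping"))  # usually False; a false positive is possible
```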

Why is that? The answer is simple: as more and more values are added, more and more bits get set to 1. Even if a value such as "taobao" has never been stored, if the three bit positions returned by its hash functions happen to have all been set to 1 by other values, the program will still report that "taobao" exists.

Does it support deletion?

So far we know that a Bloom filter supports add and isExist operations. Can it support delete? The answer is no. For example, bit 4 in the figure above is shared by two values; once you delete one of them, say "tencent", and reset that bit to 0, the next check for the other value, say "baidu", will return false even though you never deleted it.

How can this be solved? The answer is counting deletion: each slot stores a counter instead of a single bit, which increases the memory footprint. Adding a value increments the counter at each corresponding index, deleting decrements it, and an existence check tests whether every counter is greater than 0.
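A rough sketch of that counting variant, under the same illustrative assumptions about size and hash derivation as the earlier example:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size=16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size                  # counters instead of single bits

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.counters[pos] += 1                 # increment instead of setting to 1

    def remove(self, value):
        # Decrement only; the caller must be sure the value was actually added,
        # otherwise other entries' counters get corrupted.
        for pos in self._positions(value):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, value):
        return all(self.counters[pos] > 0 for pos in self._positions(value))
```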

How to choose the number of hash functions and the length of the Bloom filter

Obviously, if the Bloom filter is too short, all of its bits quickly become 1, every query returns "may exist", and the filter no longer filters anything. The length of the Bloom filter directly affects the false positive rate: the longer the filter, the lower the false positive rate.

The number of hash functions also needs to be balanced: the more hash functions, the faster the bits are set to 1 and the less efficient the Bloom filter becomes; but with too few hash functions, the false positive rate rises.

The commonly used formulas for choosing these parameters are:

m = - (n * ln p) / (ln 2)^2
k = (m / n) * ln 2

Here k is the number of hash functions, m is the length of the Bloom filter (in bits), n is the number of elements to insert, and p is the false positive rate.
As for how to derive these formulas, the article I published on Zhihu covers it; take a look if you're interested, otherwise just remember the formulas above.
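As a small helper, the formulas can be evaluated directly; the function name and the sample numbers below are mine, not from the article:

```python
import math

def optimal_parameters(n: int, p: float):
    """Return (m, k) for n expected insertions and false positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)   # bit-array length
    k = (m / n) * math.log(2)                   # number of hash functions
    return math.ceil(m), round(k)

# For example: 1 million elements at a 1% false positive rate
m, k = optimal_parameters(1_000_000, 0.01)
print(m, k)   # about 9.6 million bits (~1.2 MB) and 7 hash functions
```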

Best practices

A common use case is to put a Bloom filter in front of expensive disk I/O or network requests: once we know a value definitely does not exist, we can skip the subsequent expensive query altogether.

Also, since you are using a Bloom filter to speed up lookups and existence checks, a slow hash function is not a good choice; fast non-cryptographic hashes such as MurmurHash or FNV are recommended.
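A hedged sketch of that "check the filter before the expensive lookup" pattern; bloom and query_database are hypothetical placeholders for whatever filter object and backend query you actually have:

```python
def get_record(record_id, bloom, query_database):
    # Consult the Bloom filter first; only fall through to the expensive
    # backend query when the filter says the record may exist.
    if not bloom.might_contain(record_id):
        return None                     # definitely absent: skip the costly lookup
    return query_database(record_id)    # may exist: do the real query
```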

Splitting large Values

Because Redis supports the setbit and getbit commands and delivers high performance as a pure in-memory store, it can naturally be used as a Bloom filter. However, careless use of a Bloom filter can easily produce a huge Value, which increases the risk of blocking Redis, so in a production environment it is recommended to split very large Bloom filters.

There are many ways to split, but the essence is this: do not scatter the bits of a hashed key across many small bitmaps on multiple nodes. Instead, after splitting the filter into multiple small bitmaps, make sure that all of the hash functions for any given key land in the same small bitmap.
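Here is a hedged sketch of that splitting strategy using the redis-py client; the shard count, key naming scheme, and hash derivation are assumptions made for illustration, not part of the original article:

```python
import hashlib
import redis

r = redis.Redis()                       # assumes a Redis instance on localhost:6379
NUM_SHARDS = 16                         # number of small bitmaps
BITS_PER_SHARD = 8 * 1024 * 1024        # e.g. 1 MB per bitmap
NUM_HASHES = 7

def _shard_key(value: str) -> str:
    # Pick one shard per value, so all of its bits live in a single small bitmap.
    shard = int(hashlib.md5(value.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"bloom:shard:{shard}"

def _offsets(value: str):
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
        yield int(digest, 16) % BITS_PER_SHARD

def add(value: str):
    key = _shard_key(value)
    for offset in _offsets(value):
        r.setbit(key, offset, 1)        # all bits of this value go to the same shard

def might_contain(value: str) -> bool:
    key = _shard_key(value)
    return all(r.getbit(key, offset) for offset in _offsets(value))
```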

Origin www.cnblogs.com/jinanxiaolaohu/p/11844970.html