[Repost] Bloom Filter

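Source for this part of the notes: https://www.youtube.com/watch?v=v7AzUcZ4XA4

If the filter says an element is absent, it is definitely absent; if the filter says an element is present, it is only possibly present.

That wording is a bit tangled. In other words, a Bloom filter leans toward answering "present", and that is where its error comes from: it may report an element that is not in the set as being in it.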

Each element is mapped by a function to positions in a binary array. To insert an element, the bits at its mapped positions are set to 1. To query an element, it is mapped in the same way; if at least one of the corresponding bits in the array is not set, the element is not in the set.
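The insert and query steps above can be sketched as follows. This is a minimal illustration rather than a production implementation: the class name BloomFilter, its method names, and the use of salted SHA-256 to stand in for k independent hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: str) -> list[int]:
        # Assumption: simulate k hash functions by salting SHA-256 with an index.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means only "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.add("apple")
print(bf.might_contain("apple"))  # True: an inserted element is never reported absent
```

A byte per bit wastes memory; a real filter would pack eight bits into each byte, but the logic is the same.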

A more complete example:

[Figure: example in which B is misjudged as present]

The error is on B: B does not actually exist, yet the filter judges it to be present. That is the misjudgment: an element that does not exist is, in some cases, judged to exist.

Because of this misjudgment rate it is called a Filter: its filtering effect is not 100%.

Use Cases:

Bitcoin network:

PS: I personally am not convinced by blockchain; see the well-reasoned explanation by an expert at https://www.zhihu.com/question/43572793. It is used here only as an example.

For reference, the meaning of each node type:

[Figure: node types and their meanings]

SPV node: quickly determines whether a transaction record exists. If the filter says no, the record definitely does not exist, which improves efficiency.

If the filter judges the record to be present, go to the corresponding block and search for it.


Source: https://zhuanlan.zhihu.com/p/43263751

Bloom filter data structure

A Bloom filter is a bit vector, or bit array, that looks like this:

[Figure: the bit array]

To map a value into the Bloom filter, we pass it through several different hash functions to produce several hash values, and set to 1 the bit at the position each hash value points to. For example, if three different hash functions map the value "baidu" to 1, 4 and 7, the figure above becomes:

[Figure: bit array with bits 1, 4 and 7 set to 1]

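OK, now let us store another value, "tencent". If the hash functions return 3, 4 and 8, the figure becomes:

[Figure: bit array with bits 1, 3, 4, 7 and 8 set to 1]

There is nothing mysterious here: we have only built several hash functions and one binary array. Each hash function maps the object passed in to an integer.

Note that bit 4 was returned by the hash functions for both values, so it was simply overwritten. Now suppose we want to check whether the value "dianping" exists, and the hash functions return 1, 5 and 8. We find that bit 5 is 0, which means no value has ever been mapped to that bit, so we can say with certainty that "dianping" does not exist. When we check whether "baidu" exists, the hash functions necessarily return 1, 4 and 7, and we find that all three bits are 1. Can we then say that "baidu" exists? No: we can only say that "baidu" may exist.

Why? The answer is simple: as more and more values are added, more and more bits are set to 1. A value such as "taobao" may never have been stored, but if the three bits returned by its hash functions happen to have been set to 1 by other values, the program will still judge that "taobao" exists.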
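The walkthrough above can be replayed in a few lines of Python. The bit positions for "baidu", "tencent" and "dianping" are the ones given in the text; the array length of 10 and the positions 1, 3 and 8 for "taobao" are assumptions chosen only to produce a false positive.

```python
# Bit positions per value. The first three come from the text;
# the entry for "taobao" is hypothetical, chosen to collide with set bits.
POSITIONS = {
    "baidu": (1, 4, 7),
    "tencent": (3, 4, 8),
    "dianping": (1, 5, 8),
    "taobao": (1, 3, 8),
}

bits = [0] * 10  # assumed array length


def add(value: str) -> None:
    for pos in POSITIONS[value]:
        bits[pos] = 1


def might_contain(value: str) -> bool:
    return all(bits[pos] == 1 for pos in POSITIONS[value])


add("baidu")
add("tencent")

print(bits)                       # [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
print(might_contain("dianping"))  # False: bit 5 is 0, so definitely absent
print(might_contain("baidu"))     # True: possibly present (and it was stored)
print(might_contain("taobao"))    # True: a false positive, it was never stored
```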

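Is deletion supported?

So far we know that a Bloom filter supports the add and isExist operations. What about delete? The answer is no. For example, bit 4 in the figure above is shared by two values. If you delete one of them, say "tencent", and reset its bits to 0, the next check for the other value, say "baidu", returns false immediately, even though you never deleted it.

A Bloom filter by itself does not support deletion, because deletion would violate the principle that "if it says absent, it is absent".

How can this be solved? The answer is counting-based deletion. Counting requires storing a number in each slot instead of a single bit, which increases memory use. Adding a value then increments the counter at each corresponding index, deleting decrements it, and an existence check tests whether every counter is greater than 0.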
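A counting variant along these lines might look like the sketch below. The class name CountingBloomFilter, the salted SHA-256 hashing, and the guard in remove are assumptions for illustration; the increment, decrement and greater-than-zero check follow the description above.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant that stores a counter per slot so values can be removed."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts = [0] * m  # a number per slot instead of a single bit

    def _positions(self, item: str) -> list[int]:
        # Assumption: salted SHA-256 stands in for k independent hash functions.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counts[pos] += 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> None:
        # Only safe for items that were really added; this guard catches the
        # obvious misuse but cannot detect removal of a false positive.
        if not self.might_contain(item):
            raise KeyError(item)
        for pos in self._positions(item):
            self.counts[pos] -= 1


cbf = CountingBloomFilter(m=32, k=3)
cbf.add("baidu")
cbf.add("tencent")
cbf.remove("tencent")
print(cbf.might_contain("baidu"))  # True: shared slots were decremented, not cleared
```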

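How to choose the number of hash functions and the length of the Bloom filter

Clearly, if the Bloom filter is too small, all of its bits will soon be 1, and every query will return "may exist", so it no longer filters anything. The length of the Bloom filter directly affects the false positive rate: the longer the filter, the lower the false positive rate.

The number of hash functions is also a trade-off. The more hash functions there are, the faster bits are set to 1 and the less efficient the filter becomes; but with too few, the false positive rate goes up.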

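[Figure: formulas for choosing k and m]

Here k is the number of hash functions, m is the length of the Bloom filter, n is the number of inserted elements, and p is the false positive rate.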
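The figure with the formulas is not preserved here. As an assumption about what it showed, the widely used sizing formulas are m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. The sketch below evaluates them; the function name optimal_size and the example values of n and p are hypothetical.

```python
import math


def optimal_size(n: int, p: float) -> tuple[int, int]:
    """Return (m, k) for n expected elements and a target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = optimal_size(n=1_000_000, p=0.01)
print(m, k)  # roughly 9.59 million bits (about 1.2 MB) and k = 7
```

In other words, a 1% false positive rate costs a little under 10 bits per element, regardless of how large the elements themselves are.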

How to calculate the false positive rate of a Bloom filter:

Source: https://www.zhihu.com/question/38573286

Error rate

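Assume that each hash function selects and sets a position in the bit array with equal probability, and that the positions computed by the individual hash functions are mutually independent. Let m be the size of the bit array and k the number of hash functions.

  • The probability that one particular bit is not set by a single hash operation during the insertion of an element is:

$$1 - \frac{1}{m}$$

  • The probability that the bit is still not set to "1" after all k hash operations is:

$$\left(1 - \frac{1}{m}\right)^{k}$$

  • If we have inserted n elements, the probability that the bit is still "0" is:

$$\left(1 - \frac{1}{m}\right)^{kn}$$

  • The probability that the bit is "1" is:

$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$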

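Now test whether an element is in the set. An element is reported as present when all k of its positions are "1", as set by the procedure above. The algorithm may therefore wrongly report an element that is not in the set as being in it (a false positive). The probability of this is given by:

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$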

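How do we minimize the error rate? For given m and n, it is smallest when $k = \frac{m}{n}\ln 2$ (this follows from setting the derivative to zero). The relationship is shown in the figures below:

[Figures: relationship between the error rate and the filter parameters]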
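A quick numerical check of the error rate can stand in for the missing plots. The sketch below compares the exact expression (1 - (1 - 1/m)^(kn))^k with the approximation (1 - e^(-kn/m))^k, then scans integer k for the minimum; the function names and the sizes m = 10,000 and n = 1,000 are assumptions picked for the example.

```python
import math


def fp_exact(m: int, n: int, k: int) -> float:
    """False positive probability using the exact per-bit expression."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def fp_approx(m: int, n: int, k: int) -> float:
    """False positive probability using the exponential approximation."""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000  # assumed example sizes, so m / n = 10
best_k = min(range(1, 21), key=lambda k: fp_exact(m, n, k))

print(best_k)  # 7, the integer nearest (m / n) ln 2, which is about 6.93
print(fp_exact(m, n, best_k), fp_approx(m, n, best_k))
```

With these sizes the exact and approximate values agree closely, and the error rate at k = 6 and k = 8 is only slightly worse than at k = 7, so the minimum is fairly flat.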

Best practices

A common use case is using a Bloom filter to reduce disk I/O or network requests: once we know a value definitely does not exist, we can skip the expensive follow-up query.

In addition, since the point of a Bloom filter is to speed up lookups and existence checks, a slow hash function is a poor choice; fast hashes such as MurmurHash and FNV are recommended.
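For a feel of how cheap such a hash is, here is a pure-Python sketch of 64-bit FNV-1a. Deriving k positions from a single hash value by double hashing is a separate technique added here as an assumption, not something the text prescribes, and the helper name bloom_positions is hypothetical.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def bloom_positions(item: str, m: int, k: int) -> list[int]:
    # Assumption: split one 64-bit hash into two halves and combine them
    # (double hashing) instead of running k separate hash functions.
    h = fnv1a_64(item.encode("utf-8"))
    h1 = h & 0xFFFFFFFF
    h2 = (h >> 32) | 1  # forced odd so the step is never zero
    return [(h1 + i * h2) % m for i in range(k)]


print(hex(fnv1a_64(b"a")))  # 0xaf63dc4c8601ec8c
print(bloom_positions("baidu", m=1024, k=3))
```

The inner loop is one XOR and one multiplication per byte, which is why hashes of this family suit a structure whose whole purpose is to be faster than the lookup it guards.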

Splitting large values

Redis supports the SETBIT and GETBIT operations and, being purely in-memory, performs well, so it is a natural fit for implementing a Bloom filter. However, careless use of a Bloom filter can easily produce a large value, which increases the risk of blocking Redis, so in a production environment it is recommended to split a bulky Bloom filter.

There are many ways to split, but the essential point is not to scatter the bits produced by Hash(Key) across several small bitmaps on several nodes. Instead, split the filter into several small bitmaps and make all the hash functions for a given key fall on the same small bitmap.
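The splitting rule can be sketched without a Redis server by letting plain bytearrays stand in for the small bitmaps. The shard count, shard size, key names such as bloom:0, and the salted SHA-256 hashing are all assumptions for illustration; in a real deployment each shard would be its own Redis key written with SETBIT and read with GETBIT.

```python
import hashlib

NUM_SHARDS = 4        # assumed number of small bitmaps
BITS_PER_SHARD = 256  # assumed size of each small bitmap
K = 3                 # assumed number of hash functions

# Each entry stands in for one small Redis bitmap stored under its own key.
shards = {f"bloom:{i}": bytearray(BITS_PER_SHARD) for i in range(NUM_SHARDS)}


def _hash(salt: str, key: str) -> int:
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def locate(key: str) -> tuple[str, list[int]]:
    """Pick one shard for the key, then compute all K bit offsets inside it."""
    shard_name = f"bloom:{_hash('shard', key) % NUM_SHARDS}"
    offsets = [_hash(str(i), key) % BITS_PER_SHARD for i in range(K)]
    return shard_name, offsets


def add(key: str) -> None:
    shard_name, offsets = locate(key)
    for offset in offsets:
        shards[shard_name][offset] = 1  # would be: SETBIT shard_name offset 1


def might_contain(key: str) -> bool:
    shard_name, offsets = locate(key)
    return all(shards[shard_name][offset] for offset in offsets)  # would be: GETBIT


add("baidu")
print(might_contain("baidu"))  # True
```

Because every bit for a key lives in one shard, an add or a lookup touches a single small value on a single node rather than fanning out across the cluster.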

Origin www.cnblogs.com/jiading/p/11717344.html