Background
In the article [link], we read through the BitSet source code in Java to understand the BitMap algorithm, but one very serious drawback was left unresolved: the awkward problem of hash collisions.

What does that mean? To recap: in BitMap we simply initialize a long array and then use individual bits to indicate whether a piece of data is present or absent, so a hash collision problem is unavoidable.

Let's draw a diagram to review the BitMap algorithm.

As shown above, with a single hash function f1, the bits for data A and D are 1, so they are judged to be present; but B and C point to the same position, so a hash collision arises very easily.

In other words: a bit of 0 means the data definitely does not exist, while a bit of 1 can only mean the data may exist.
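As a refresher, the BitMap idea can be sketched in a few lines of Java. This is a simplified stand-in for the BitSet discussed in the earlier article, not its actual source; the class name is made up here:

```java
// Minimal BitMap sketch: a long[] where each bit marks presence/absence.
public class SimpleBitMap {
    private final long[] words;

    public SimpleBitMap(int nbits) {
        // 64 bits per long, rounded up
        words = new long[(nbits + 63) / 64];
    }

    public void set(int bit) {
        words[bit / 64] |= 1L << (bit % 64);
    }

    public boolean get(int bit) {
        return (words[bit / 64] & (1L << (bit % 64))) != 0;
    }
}
```

If two different pieces of data hash to the same bit index, `get` cannot tell them apart; that is exactly the collision problem described above.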
About the BF algorithm

Every drawback has a solution, and one such solution is the algorithm introduced here: the Bloom filter, referred to below as the BF algorithm.

As before, let's first draw a diagram to get an intuitive view of the BF algorithm.

As the figure shows, the four pieces of data A, B, C, and D are each hashed twice, by the hash functions f1 and f2, with each result pointing to a bit; only when the bits pointed to by both f1 and f2 are 1 is the data marked as present.
- Example 1: for A and B in the figure above, the bits computed by f2 happen to be the same, but the bits computed by f1 differ, so A and B are judged to be different data.
- Example 2: for B and C in the figure above, both bits computed by f1 and f2 are the same, so B and C are judged to be the same data.

Summary: the BF algorithm reduces the hash collisions of the BitMap algorithm to some extent, but in the end it only reduces them; it cannot avoid them completely, as Example 2 above shows.
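The two-hash scheme in the figure can be sketched as follows. Note that f1 and f2 here are hypothetical stand-ins derived from `String.hashCode()`, purely for illustration, not the functions in the figure:

```java
// Bloom filter sketch with two hash functions, as in the figure above.
public class TwoHashBloom {
    private final boolean[] bits;

    public TwoHashBloom(int size) {
        bits = new boolean[size];
    }

    // Hypothetical hash functions for illustration only.
    private int f1(String s) {
        return Math.floorMod(s.hashCode(), bits.length);
    }

    private int f2(String s) {
        return Math.floorMod(s.hashCode() * 31 + 7, bits.length);
    }

    public void add(String s) {
        bits[f1(s)] = true;
        bits[f2(s)] = true;
    }

    // "true" only means "possibly present" (see Example 2); "false" is definite.
    public boolean mightContain(String s) {
        return bits[f1(s)] && bits[f2(s)];
    }
}
```

An element is reported present only when both of its bits are set, which is why A and B in Example 1 are told apart even though one of their bits coincides.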
Optimizing the BF algorithm

From the figure above it is easy to see that the number of hash functions there is 2. What if we computed 3 hashes, or 4, or even more? Admittedly this can further reduce data collisions, but too many hash functions backfires.

So before introducing the optimizations, let's first summarize the weaknesses of the BF algorithm, since every optimization targets some weakness:

- False positive rate (it may help to think of it as the probability of a hash collision): although the BF algorithm reduces the probability of hash collisions to some extent compared with BitMap, some misjudgment remains, so it is not suitable for scenarios that demand high precision.
- Element deletion: because one bit may be shared by several pieces of data, we cannot simply clear the bits under an arbitrary element; otherwise other data may be misjudged.

So let's tackle these two points one by one:
On the false positive rate

The factors that affect how well a BF performs are easy to guess: one is computing the most suitable bit-array size (m) and number of hash functions (k) from the total number of inserted elements (n); the other is choosing the optimal false positive rate (denoted P(error)).
For example, if we set P to 0.01, the optimal m should be roughly 10 times n, and k should be roughly 7.
The detailed derivation is given below.
On the need to delete elements

Because the bit a piece of data maps to may also be shared by other data, a BF cannot delete bit data. What if such a need arises? It can be met with a Counting Bloom Filter; the rough idea is to replace the bit array with a counter array.

What does that mean? In short, each position in the original BF no longer stores a simple 0 or 1 but the total number of elements mapped to it. For example, if two pieces of data both hash to the same position, that position is marked 2, meaning two pieces of data; when one of them is deleted, the 2 at that position is simply adjusted down to 1, so the correctness of the other data is no longer affected.
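The Counting Bloom Filter idea can be sketched like this. It is a simplified illustration with int counters and hypothetical hash functions, not a production implementation:

```java
// Counting Bloom Filter sketch: counters instead of bits, so remove() works.
public class CountingBloom {
    private final int[] counters;
    private final int k; // number of hash functions

    public CountingBloom(int size, int k) {
        counters = new int[size];
        this.k = k;
    }

    // Hypothetical i-th hash function, derived from hashCode for illustration.
    private int hash(String s, int i) {
        return Math.floorMod(s.hashCode() * (i * 2 + 1) + i, counters.length);
    }

    public void add(String s) {
        for (int i = 0; i < k; i++) counters[hash(s, i)]++;
    }

    public void remove(String s) {
        for (int i = 0; i < k; i++) {
            int idx = hash(s, i);
            if (counters[idx] > 0) counters[idx]--;
        }
    }

    public boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (counters[hash(s, i)] == 0) return false;
        }
        return true;
    }
}
```

Decrementing a counter removes one element's contribution without disturbing other elements that share the same position, which is exactly the point made above.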
Application scenarios

Although the BF algorithm has certain drawbacks (mainly the false positive rate), its advantages stand out even more, so its range of applications is wide.

For example, in a web crawler with a large number of URLs, we can use the BF algorithm to check whether each URL has already been processed by our crawler.

Another example is the spam filtering strategy of a mail service: spam is massive, so we cannot use a complete hash map to mark every spam address; the BF algorithm can do the marking instead, saving space.

In addition, the BF algorithm has implementations in many open source frameworks, for example:
- Elasticsearch: org.elasticsearch.common.util.BloomFilter
- Guava: com.google.common.hash.BloomFilter
- Hadoop: org.apache.hadoop.util.bloom.BloomFilter (implemented on top of BitSet)
The math behind the BF algorithm

Deriving and computing the false positive probability
Assume the hash functions in the Bloom filter satisfy the simple uniform hashing assumption: each element is hashed into any of the m slots with equal probability, independently of where other elements land. With m bits, the probability that a particular bit is not set to 1 when a single element is inserted by a single hash function is:

$$1 - \frac{1}{m}$$

Then the probability that none of the k hash functions sets it is:

$$\left(1 - \frac{1}{m}\right)^{k}$$

If n elements have been inserted, the probability that this bit has still not been set is:

$$\left(1 - \frac{1}{m}\right)^{kn}$$

So the probability that this bit has been set is:

$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$

Now consider the query phase: if all k bits corresponding to the queried element are set to 1, the element is judged to be in the set. The probability of misjudging an element is therefore:

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$$

Since $\lim_{m \to \infty}\left(1 - \frac{1}{m}\right)^{-m} = e$, and $\frac{1}{m}$ tends to 0 when m is large, we get:

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$
From this expression we can see that increasing m or decreasing n both lower the false positive rate, which matches intuition. Now, for given m and n, compute the value of k that minimizes the false positive rate. Write the false positive rate as a function of k:

$$f(k) = \left(1 - e^{-kn/m}\right)^{k}$$

Let $b = e^{n/m}$; then this simplifies to:

$$f(k) = \left(1 - b^{-k}\right)^{k}$$

Taking logarithms on both sides:

$$\ln f(k) = k \ln\left(1 - b^{-k}\right)$$

Differentiating both sides with respect to k:

$$\frac{1}{f(k)} \cdot f'(k) = \ln\left(1 - b^{-k}\right) + k \cdot \frac{b^{-k} \ln b}{1 - b^{-k}}$$

Now find the minimum. Setting the derivative to 0 gives:

$$\left(1 - b^{-k}\right) \ln\left(1 - b^{-k}\right) = -k\, b^{-k} \ln b = b^{-k} \ln b^{-k}$$

By the symmetry of the two sides, this holds when $1 - b^{-k} = b^{-k}$, i.e. when $b^{-k} = \frac{1}{2}$.

Therefore, when

$$k = \ln 2 \cdot \frac{m}{n}$$

the false positive rate is minimal, and at that point it equals:

$$\left(\frac{1}{2}\right)^{k} = \left(\frac{1}{2}\right)^{\ln 2 \cdot \frac{m}{n}} \approx 0.6185^{\,m/n}$$

It follows that to keep the false positive rate ≤ 1/2 we need $k \ge 1$, i.e.:

$$m \ge \log_2 e \cdot n \approx 1.44\, n$$
This shows that, to keep the false positive rate fixed, the number of bits m of the Bloom filter must grow linearly with the number n of elements to be added.
Designing and applying a Bloom filter

In an application, the user first determines the number of elements to add, n, and the desired error rate, P. These are the only two parameters a complete Bloom filter design requires from the user; all remaining parameters are computed by the system, after which the Bloom filter is built.

The system first computes the required memory size, m bits:

$$m = -\frac{n \ln P}{(\ln 2)^2}$$

Then it derives the number of hash functions from m and n:

$$k = \ln 2 \cdot \frac{m}{n}$$

All required parameters are now ready; the n elements are added to the Bloom filter, and queries can then run.
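The two sizing formulas can be turned into a small helper. This is a sketch under the formulas above; the class and method names are made up here:

```java
// Sketch: compute Bloom filter parameters from n and P using the formulas above.
public class BloomSizing {
    // m = -n * ln(P) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = ln 2 * m / n
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round(Math.log(2) * m / n));
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        double p = 0.01;
        long m = optimalBits(n, p);
        System.out.println("m = " + m + " bits ("
                + (double) m / n + " bits/element), k = " + optimalHashes(n, m));
    }
}
```

For n = 1,000,000 and P = 0.01 this yields roughly 9.6 bits per element and k = 7, matching the verification below.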
By the formula above, when k is optimal:

$$P = \left(\frac{1}{2}\right)^{k} = \left(\frac{1}{2}\right)^{\ln 2 \cdot \frac{m}{n}} \quad \Longrightarrow \quad \frac{m}{n} = \log_2 e \cdot \log_2 \frac{1}{P} \approx 1.44 \log_2 \frac{1}{P}$$

This verifies that when P = 1%, each stored element requires about 9.6 bits:

$$\frac{m}{n} = 1.44 \log_2 100 \approx 9.6$$

And every time we want the false positive rate to drop by another factor of 10, each stored element needs about 4.8 more bits:

$$1.44 \log_2 10 \approx 4.8$$

Note in particular that these 9.6 bits per element include not only the bits set to 1 but also a number of bits left at 0; the number of bits set to 1 per element is only:

$$k = \ln 2 \cdot \frac{m}{n} \approx 0.7 \cdot \frac{m}{n}$$

As for why this makes P(error) minimal, note that $k = \ln 2 \cdot \frac{m}{n}$ implies:

$$e^{-kn/m} = \frac{1}{2}$$

This is exactly the probability that a given bit is still 0 after all n elements have been inserted. So, to keep the error rate minimal, the space usage of the Bloom filter (the fraction of bits set to 1) should be 50%.
A code implementation of the BF algorithm

The code is fairly long, so this article does not show it in full; see a more complete code demo at [link].

Below is a small part of it, a block that computes the false positive rate from the actual size of the filter container and the number of inserted elements:
```java
public double getFalsePositiveProbability(double numberOfElements) {
    // (1 - e^(-k * n / m)) ^ k
    return Math.pow(1 - Math.exp(-k * numberOfElements / (double) bitSetSize), k);
}
```
Reproduced from: https://www.jianshu.com/p/8cbf81846db7