BloomFilter - a powerful tool for large-scale data processing

Reference: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html

 

A Bloom Filter is a fast, probabilistic membership test built on multiple hash functions, proposed by Burton Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set, and where 100% accuracy is not strictly required.

 

1. Examples  

  To show why Bloom Filters are worth having, consider an example:

  Suppose you want to write a web crawler. Because web pages link to each other in intricate ways, a spider crawling across them is likely to run into "loops". To avoid loops, you need to know which URLs the spider has already visited. Given a URL, how do you know whether the spider has already visited it? After a little thought, several options come to mind:

  1. Save the visited URLs in a database.

  2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

  3. Run each URL through a one-way hash such as MD5 or SHA-1, and save the digest in a HashSet or database.

  4. Bit-Map method. Create a BitSet and map each URL to a single bit through a hash function.

  Methods 1 to 3 store every visited URL (or its digest) in full, whereas method 4 only sets one bit per URL.
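
  Method 4 can be sketched roughly as follows. This is only an illustration, not code from the original article: the class name SingleHashBitMap, the method names, and the use of String.hashCode() as the single hash function are all assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical sketch of method 4: one hash function, one bit per URL.
class SingleHashBitMap {
    private final int size;      // number of bits, chosen by the caller
    private final BitSet bits;

    SingleHashBitMap(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Map the URL to a single bit position in [0, size).
    // String.hashCode() stands in for "a hash function" here (an assumption).
    private int position(String url) {
        return Math.floorMod(url.hashCode(), size);
    }

    void markVisited(String url) {
        bits.set(position(url));
    }

    // True if the bit is set: the URL was visited,
    // or it collides with another URL that was.
    boolean maybeVisited(String url) {
        return bits.get(position(url));
    }
}
```

  Note that maybeVisited can return true for a URL that was never marked, whenever two URLs hash to the same bit.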

 

  All of these methods work perfectly well when the amount of data is small, but problems appear once the data becomes very large.

  Disadvantage of method 1: when the amount of data becomes very large, queries against a relational database become very slow. And isn't issuing a database query for every incoming URL excessive?

  Disadvantage of method 2: it consumes too much memory. The more URLs there are, the more memory is occupied. Even with only 100 million URLs at 50 characters each, about 5 GB of memory is needed.

  Method 3: because an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, method 3 uses several times less memory than method 2.

  Method 4 consumes relatively little memory, but its disadvantage is that the collision probability of a single hash function is too high. Remember the various ways of resolving hash table conflicts you learned in data structures class? To bring the probability of conflict down to 1%, the BitSet has to be about 100 times as long as the number of URLs.
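
  The figures above can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate under simplifying assumptions: one byte per character, and no allowance for object headers, hash table overhead, or database storage. The class and method names are made up for the example.

```java
// Rough memory estimates for methods 2, 3 and 4, using the figures from the text.
class MemoryEstimate {
    static final long URLS = 100_000_000L;   // 100 million URLs

    // Method 2: raw characters only, assuming 1 byte per character.
    static long rawUrlBytes(int charsPerUrl) {
        return URLS * charsPerUrl;
    }

    // Method 3: one fixed-size digest per URL (128 bits for MD5, 160 for SHA-1).
    static long digestBytes(int digestBits) {
        return URLS * digestBits / 8;
    }

    // Method 4: a bit-map with bitsPerUrl bits for every URL.
    static long bitMapBytes(int bitsPerUrl) {
        return URLS * bitsPerUrl / 8;
    }

    public static void main(String[] args) {
        System.out.println(rawUrlBytes(50));    // 5000000000 (about 5 GB)
        System.out.println(digestBytes(128));   // 1600000000 (MD5)
        System.out.println(digestBytes(160));   // 2000000000 (SHA-1)
        System.out.println(bitMapBytes(100));   // 1250000000 (100 bits per URL)
    }
}
```

  Under these assumptions, even the 1%-conflict bit-map needs more than a gigabyte, which is what motivates the approach in the next section.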

 

  In essence, the approaches above all ignore an important implicit condition: a small probability of error is acceptable, and the answer does not have to be 100% accurate! In other words, if a few URLs that the spider has not actually visited are misjudged as visited, the cost is very small - at worst, a few pages fewer get crawled.

 

2. Algorithm of Bloom Filter 

 

  Enough preamble; let's introduce the protagonist of this article - the Bloom Filter. In fact, the idea of method 4 above is very close to a Bloom Filter. The fatal disadvantage of method 4 is its high probability of conflict. To reduce the probability of conflict, a Bloom Filter uses multiple hash functions instead of one.

    The Bloom Filter algorithm is as follows:

    Create an m-bit BitSet, initialize all bits to 0, and choose k different hash functions. The result of hashing the string str with the i-th hash function is denoted h(i, str), and h(i, str) ranges from 0 to m-1.

 

(1)  Adding a string

  Each string is processed as follows, starting with how the string str is "recorded" into the BitSet:

  For the string str, compute h(1, str), h(2, str) ... h(k, str). Then set bits h(1, str), h(2, str) ... h(k, str) of the BitSet to 1.

  Figure 1. Adding a string to a Bloom Filter

  Simple, right? This maps the string str to k bits in the BitSet.
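
  The step above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the class name BloomAddSketch and the seeded polynomial hash used for h(i, str) are assumptions, chosen only so that h(i, str) falls in the range 0 to m-1.

```java
import java.util.BitSet;

// Hypothetical sketch of the "add" step: set bits h(1,str) ... h(k,str).
class BloomAddSketch {
    final int m;          // number of bits in the BitSet
    final int k;          // number of hash functions
    final BitSet bits;

    BloomAddSketch(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // h(i, str): an assumed seeded polynomial hash, reduced to the range 0..m-1.
    int h(int i, String str) {
        int seed = 31 + 2 * i;            // a different odd multiplier for each i
        int result = 0;
        for (int c = 0; c < str.length(); c++) {
            result = seed * result + str.charAt(c);
        }
        return Math.floorMod(result, m);  // floorMod keeps the index non-negative
    }

    // Record str by setting its k bits to 1.
    void add(String str) {
        for (int i = 1; i <= k; i++) {
            bits.set(h(i, str));
        }
    }
}
```

  After add(str), between 1 and k bits are set for that string, since two hash functions may happen to land on the same bit.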

 

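(2)  Checking whether a string exists

  The following is the process of checking whether the string str has been recorded in the BitSet:

  For the string str, compute h(1, str), h(2, str) ... h(k, str). Then check whether bits h(1, str), h(2, str) ... h(k, str) of the BitSet are 1. If any one of them is not 1, str has definitely never been recorded. If all of them are 1, the string str is "considered" to exist.

  If the bits corresponding to a string are not all 1, the string has definitely not been recorded by the Bloom Filter. (This is obvious: had the string been recorded, all of its bits would have been set to 1.)

  But if the bits corresponding to a string are all 1, we cannot be 100% sure that the string was recorded by the Bloom Filter, because all of its bits may happen to have been set by other strings. This kind of misclassification is called a false positive.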
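
  A small, hand-built case makes the false positive concrete. In the sketch below the bit positions are picked by hand instead of being computed by real hash functions, purely for illustration; the class name FalsePositiveDemo and the positions themselves are assumptions.

```java
import java.util.BitSet;

// Hypothetical illustration of a false positive, using hand-picked bit
// positions in place of real hash functions (k = 2, m = 8).
class FalsePositiveDemo {
    static void add(BitSet bits, int[] positions) {
        for (int p : positions) {
            bits.set(p);
        }
    }

    // Returns false as soon as one bit is 0 (definitely not recorded);
    // returns true only if every bit is 1 (possibly recorded).
    static boolean mightContain(BitSet bits, int[] positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        int[] a = {1, 4};   // assumed positions for string "a"
        int[] b = {4, 7};   // assumed positions for string "b"
        int[] c = {1, 7};   // string "c" is never added
        int[] d = {2, 4};   // string "d" is never added
        add(bits, a);
        add(bits, b);
        System.out.println(mightContain(bits, a)); // true
        System.out.println(mightContain(bits, c)); // true: a false positive
        System.out.println(mightContain(bits, d)); // false: bit 2 is still 0
    }
}
```

  Here "c" was never added, yet both of its bits were set by "a" and "b", so the check cannot tell it apart from a recorded string.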

 

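(3)  Deleting a string

  Once a string has been added it cannot be deleted, because deleting it would affect other strings. If deletion is really needed, you can use a Counting Bloom Filter (CBF), a variant of the basic Bloom Filter. A CBF replaces each bit of the basic Bloom Filter with a counter, which makes it possible to delete strings.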
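
  The idea behind a CBF can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the class name CountingBloomSketch, the seeds, the polynomial hash, and the use of full int counters (real CBFs usually use much smaller counters) are all choices made for the example.

```java
// Hypothetical sketch of a Counting Bloom Filter: each bit becomes a counter.
class CountingBloomSketch {
    private final int[] counters;
    private final int[] seeds = {5, 7, 11, 13};   // assumed seeds, one per hash function

    CountingBloomSketch(int m) {
        counters = new int[m];
    }

    // Assumed seeded polynomial hash, reduced to the range 0..m-1.
    private int h(int seed, String str) {
        int result = 0;
        for (int i = 0; i < str.length(); i++) {
            result = seed * result + str.charAt(i);
        }
        return Math.floorMod(result, counters.length);
    }

    void add(String str) {
        for (int s : seeds) {
            counters[h(s, str)]++;
        }
    }

    boolean mightContain(String str) {
        for (int s : seeds) {
            if (counters[h(s, str)] == 0) {
                return false;
            }
        }
        return true;
    }

    // Should only be called for strings that were actually added;
    // removing a false positive would corrupt other strings' counters.
    void remove(String str) {
        if (!mightContain(str)) {
            return;
        }
        for (int s : seeds) {
            counters[h(s, str)]--;
        }
    }
}
```

  Because removal only decrements counters, a position shared with another string stays above zero, so that other string is still reported as present.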

 

  A Bloom Filter differs from a single-hash Bit-Map in that it uses k hash functions, so each string corresponds to k bits. This lowers the probability of conflict.

 

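3. Choosing Bloom Filter parameters

   (1) Choosing the hash functions

     The choice of hash function should have a large impact on performance. A good hash function maps strings to each bit with approximately equal probability. Picking k different hash functions is troublesome; a simple approach is to pick one hash function and feed it k different parameters.

   (2) Choosing the size of the bit array

     For the relationship among the number of hash functions k, the bit array size m, and the number of added strings n, see reference [1]. It shows that for given m and n, the error probability is smallest when k = ln(2) * m/n.

     The same reference also gives the error probability for specific values of k, m, and n. For example, according to reference [1], with k = 10 hash functions and a bit array size m equal to 20 times the number of strings n, the probability of a false positive is 0.0000889, which is basically good enough for a web crawler.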
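
     These numbers can be reproduced with the commonly used approximation for the false positive probability, (1 - e^(-k*n/m))^k. The sketch below assumes that approximation; the class name BloomMath and its method names are made up for the example.

```java
// Sketch of the usual Bloom Filter estimates, assuming the approximation
// p = (1 - e^(-k*n/m))^k for the false positive probability.
class BloomMath {
    // The k that minimizes the error probability for given m and n: ln(2) * m / n.
    static double optimalK(double m, double n) {
        return Math.log(2) * m / n;
    }

    // Approximate false positive probability for k hash functions,
    // m bits, and n added strings.
    static double falsePositiveRate(int k, double m, double n) {
        return Math.pow(1 - Math.exp(-k * n / m), k);
    }

    public static void main(String[] args) {
        double n = 1_000_000;   // number of strings (only the ratio m/n matters)
        double m = 20 * n;      // 20 bits per string, as in the example above
        System.out.println(optimalK(m, n));              // about 13.86
        System.out.println(falsePositiveRate(10, m, n)); // about 0.0000889
    }
}
```

     With m/n = 20 the optimum is close to 14 hash functions, so k = 10 is slightly below it and trades a little accuracy for fewer hash computations per lookup.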

 

4. Bloom Filter implementation

    A simple Java implementation of a Bloom Filter is given below:

 

import java.util.BitSet;

public class BloomFilter
{
    /* The BitSet initially allocates 2^25 bits */
    private static final int DEFAULT_SIZE = 1 << 25;
    /* Seeds for the different hash functions; these should generally be primes */
    private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /* Hash function objects */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter()
    {
        for (int i = 0; i < seeds.length; i++)
        {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Mark the string in bits
    public void add(String value)
    {
        for (SimpleHash f : func)
        {
            bits.set(f.hash(value), true);
        }
    }

    // Check whether the string has already been marked in bits
    public boolean contains(String value)
    {
        if (value == null)
        {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func)
        {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* Hash function class */
    public static class SimpleHash
    {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed)
        {
            this.cap = cap;
            this.seed = seed;
        }

        // Hash function: a simple weighted-sum hash
        public int hash(String value)
        {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++)
            {
                result = seed * result + value.charAt(i);
            }
            // cap is a power of two, so masking with cap - 1 yields an index in 0..cap-1
            return (cap - 1) & result;
        }
    }
}

 

 

 

References:

[1] Pei Cao. Bloom Filters - the math.

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

[2] Wikipedia. Bloom filter.

http://en.wikipedia.org/wiki/Bloom_filter
