Bloom Filter

Bloom Filter is a space-efficient probabilistic data structure for fast set-membership queries, proposed by Burton Howard Bloom in 1970. It consists of a long binary vector and K hash functions, each of which maps an element to one bit position in the vector.

A common application of Bloom Filter is to determine whether an element belongs to a set A. Suppose the binary vector has M bits in total, K hash functions are used, and set A contains N elements, where M is usually much larger than K and N:

[1] First, every element of A is hashed in advance with the K hash functions, and each of the K bits it maps to is set to 1. For example, consider a Bloom Filter with M=16 and K=2. Set A contains only a and b, and the hash values of a and b are (3, 6) and (10, 3) respectively, so we mark bits 3, 6, and 10 as true.

[2] Then, to check an element, we compute its K hash positions. If any of these bits is not true, the element is definitely not in set A. If all K bits are true, we report that the element is in the set, but this answer can be wrong with some probability. A false positive occurs when an element is not in the set, yet all of its hash positions have been set to true by other elements, so the filter still reports it as present. For example, set A in the previous step contains only a and b, but if the hash positions of another element d all fall among bits 3, 6, and 10, the filter will report that d is also in A.
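The two steps above can be replayed on the M=16, K=2 example. The sketch below hard-codes the bit positions instead of computing real hashes. The positions for a and b come from the text, while the positions for d and the extra element c are assumed here purely for illustration.

```python
M = 16  # number of bits in the vector

# Bit positions per element (K = 2). "a" and "b" follow the example in the
# text; "d" and "c" are hypothetical values chosen for this illustration.
POSITIONS = {
    "a": (3, 6),
    "b": (10, 3),
    "d": (6, 10),  # assumed: both bits are already set by a and b
    "c": (3, 7),   # assumed: bit 7 is never set
}

def build_filter(members):
    """Step [1]: set every mapped bit of every member to True."""
    bits = [False] * M
    for element in members:
        for pos in POSITIONS[element]:
            bits[pos] = True
    return bits

def might_contain(bits, element):
    """Step [2]: True only if all mapped bits are set (may be a false positive)."""
    return all(bits[pos] for pos in POSITIONS[element])

bits = build_filter(["a", "b"])
set_bits = [i for i, bit in enumerate(bits) if bit]

print(set_bits)                  # [3, 6, 10]
print(might_contain(bits, "a"))  # True  (a real member)
print(might_contain(bits, "c"))  # False (definitely not in A)
print(might_contain(bits, "d"))  # True  (false positive)
```

With these assumed positions, c is rejected because bit 7 was never set, while d is wrongly accepted because bits 6 and 10 were set by a and b.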

The advantage of Bloom Filter is that its space efficiency and query time are far better than those of general-purpose set structures, but this efficiency comes at a price. When testing whether an element belongs to a set, it never reports an element of the set as absent, but it may report an element outside the set as present. Let's take a look at how large this false positive rate is.

  1. Consider any particular bit of the vector. After one element is hashed K times, the probability that this bit is still 0 is $(1 - 1/M)^K$;
  2. After all N elements of set A have been hashed, the probability that this bit is 1 is $1 - (1 - 1/M)^{NK}$.
  3. To judge whether an element belongs to the set, we check whether its K mapped bits are all 1. For an element not in the set, the probability that all K bits are 1 (treating the bits as independent), that is, the false positive rate P, is $P = [1 - (1 - 1/M)^{NK}]^K$. Since $(1 + 1/x)^x \to e$ as $x \to \infty$, we have $(1 - 1/M)^{NK} \approx e^{-KN/M}$ for large M, so the false positive rate is $P = [1 - (1 - 1/M)^{NK}]^K \approx (1 - e^{-KN/M})^K$.
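To get a feel for how close the exponential approximation is, the following sketch evaluates both expressions from step 3 for one set of sizes. The function names and the values of m, n, and k are assumptions picked for illustration.

```python
import math

def false_positive_exact(m, n, k):
    """P = [1 - (1 - 1/m)^(n*k)]^k, the expression derived in step 3."""
    return (1.0 - (1.0 - 1.0 / m) ** (n * k)) ** k

def false_positive_approx(m, n, k):
    """P ~ (1 - e^(-k*n/m))^k, the large-m approximation."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Hypothetical sizes: 10,000 bits, 1,000 elements, 7 hash functions.
m, n, k = 10_000, 1_000, 7
exact = false_positive_exact(m, n, k)
approx = false_positive_approx(m, n, k)

print(f"exact  = {exact:.6f}")
print(f"approx = {approx:.6f}")
```

For these sizes both values come out a little above 0.8%, and they differ only at around the sixth decimal place.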

So for given M and N, how many hash functions should be chosen to minimize the false positive rate P of a query? It can be proved mathematically that P is minimized when $K = \ln 2 \cdot M/N \approx 0.7\,M/N$, where the minimum false positive rate is $P = 0.5^K \approx 0.6185^{M/N}$. In practice this minimum is often not reached exactly, because K must be an integer.
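Because K has to be an integer, one simple approach is to evaluate P at the two integers on either side of ln2*M/N and keep the better one. The sketch below does that using the approximate formula for P. The helper names and the sizes m and n are assumptions for illustration.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive rate (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Real-valued optimum K = ln(2) * m / n."""
    return math.log(2) * m / n

def best_integer_k(m, n):
    """Pick whichever neighbouring integer of the real optimum gives lower P."""
    k_real = optimal_k(m, n)
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}
    return min(candidates, key=lambda k: false_positive_rate(m, n, k))

# Hypothetical sizes: 10 bits per stored element.
m, n = 10_000, 1_000
k_real = optimal_k(m, n)
k_int = best_integer_k(m, n)
p_bound = 0.5 ** k_real                    # theoretical minimum, 0.5^K
p_int = false_positive_rate(m, n, k_int)   # what an integer K achieves

print(f"real optimum K = {k_real:.3f}")
print(f"integer K      = {k_int}")
print(f"minimum P      = {p_bound:.6f}")
print(f"P at integer K = {p_int:.6f}")
```

With M/N = 10 the real optimum is about 6.93, so the integer choice is 7, and the achieved rate is only slightly above the theoretical minimum.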

 

The false positive rates for different values of M/N and K are detailed here.

A Python implementation of Bloom Filter is detailed here.
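Separately from the implementation referenced above, the structure described in this article can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. The class name, its methods, and the choice to derive the K positions from a single SHA-256 digest by double hashing (rather than using K independent hash functions) are all assumptions made here.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch for string items."""

    def __init__(self, m, k):
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m                            # number of bits (M)
        self.k = k                            # number of hash positions (K)
        self.bits = bytearray((m + 7) // 8)   # packed bit vector, all zeros

    def _positions(self, item):
        # Assumption: simulate K hash functions with double hashing,
        # position_i = (h1 + i * h2) mod m, using two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        """Set the K bits that the item maps to."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """False means definitely absent; True means probably present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

bf = BloomFilter(m=1024, k=7)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("apple" in bf)   # True: added items are never reported as absent
print("durian" in bf)  # False unless a false positive occurs
```

Items that were added always test as present. An item that was never added tests as absent unless all of its K bits happen to be set, which is the false positive case analysed earlier.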
