Bloom filter concept and principle

 

Preface

In the era of big data and cloud computing, we often run into problems like these: how can we efficiently tell whether a user has already visited a website's homepage (which may receive hundreds of millions of visits per day), or count the site's PV (page views) and UV (unique visitors)? The most straightforward idea is to store every visitor and compare each new visit against the stored collection. But whether that collection is kept in memory or in a database, it puts heavy pressure on the server. Is there a way to tolerate a certain error rate and still track and count traffic efficiently, in both computational and space complexity? The Bloom filter introduced below fits this scenario (note: cardinality estimation methods can also be used for PV and UV statistics).

Introduction to Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure proposed by Bloom in 1970. It uses a bit array to represent a set compactly and to test whether an element belongs to that set. A Bloom filter can make Type I errors (false positives) but never Type II errors (false negatives), so it has 100% recall. In other words, a Bloom filter can say with certainty that an element is not in the set, but can only say that an element may be in the set. Bloom filters are therefore unsuitable for "zero error" applications. In applications that can tolerate a low error rate, a Bloom filter trades a small number of errors for large savings in storage space. Elements can be added to a standard Bloom filter but not removed from it (enhanced variants support removal). As the number of elements in the filter grows, so does the probability of a false positive.

Algorithm Description

An empty Bloom filter is a bit array of m bits, all initialized to 0. Each element is mapped by k different hash functions to k positions in the bit array, and each hash function should spread elements uniformly at random over the m positions. k is much smaller than m. The values of k and m are chosen according to the target false positive rate and the number of elements to be stored.


Figure: an example Bloom filter representing the set S = {x, y, z}. The colored arrows show the positions in the bit array to which each element is mapped by the k = 3 hash functions. The element w is not in S, because at least one of the k positions it hashes to holds a 0.

To add an element x to the set S: apply the k hash functions to x to obtain k positions in the bit array, then set the bits at those k positions to 1.

To test whether an element x is in the set S: apply the k hash functions to x to obtain k positions. If any of the k bits is 0, then x is definitely not in the set, because inserting x would have set all k of those bits to 1. If all k bits are 1, there are two possibilities. Case 1: x is in the set. Case 2: x is not in the set, and the k bits were set to 1 by the insertion of other elements (this is the cause of a false positive). A simple Bloom filter cannot distinguish between the two cases; more advanced variants address this.
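The add and test operations can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the class name BloomFilter is made up for this example, and the k hash functions are simulated by prefixing the element with a seed 0..k-1 before hashing with SHA-256, which is an assumption of this sketch.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions per element."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: k "different" hash functions are simulated by
        # prefixing the item with a seed 0..k-1 before hashing with SHA-256.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the k bits that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for element in ("x", "y", "z"):
    bf.add(element)

print("x" in bf)  # True: an added element is always reported as present
print("w" in bf)  # usually False, but a false positive is possible
```

Both operations touch exactly k bits, regardless of how many elements have already been added.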

Designing k independent hash functions can be a lot of work, but good hash functions are the key to a low false positive rate. A good hash function should have a wide output with as little correlation as possible between its bit fields, so that the k hash values spread over as many positions as possible. One option is to pass k distinct values (0, 1, ..., k - 1) to a hash function that takes an initial value, or to append them to the key. For large m or k, the independence among the hash functions can be relaxed with a negligible increase in the false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Dillinger and Manolios analyze how deriving the k indices from repeated use of the same hash function affects the false positive rate.
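One common way to avoid designing k separate functions is to derive all k positions from two base hash values, as (h1 + i * h2) mod m for i = 0, ..., k - 1, in the spirit of the Kirsch & Mitzenmacher work cited above. The sketch below is only an illustration: the function name k_positions and the choice of SHA-256 sliced into two 64-bit fields are assumptions of this example.

```python
import hashlib


def k_positions(item: str, k: int, m: int) -> list[int]:
    """Derive k bit positions from two base hashes: (h1 + i * h2) mod m."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Slice one wide hash output into two 64-bit fields.
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never 0
    return [(h1 + i * h2) % m for i in range(k)]


print(k_positions("x", k=3, m=1024))
```

Only one hash computation is needed per element, however large k is.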

A simple Bloom filter cannot remove an element x from the set S, because false negatives are not allowed. The element hashes to k positions, and although setting those k bits to 0 would remove it, doing so would at the same time remove every other element that hashes to any of those positions. There is no way to tell whether clearing the bits affects other elements that have been added to the set, so setting the k positions to 0 would introduce Type II errors (false negatives).
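A tiny example makes the problem concrete. The hash positions below are hypothetical and hand-picked (not produced by a real hash function) so that x and y share one position in a 16-bit array.

```python
# Hypothetical, hand-picked hash positions (k = 3) in a 16-bit array,
# chosen so that x and y share position 4.
positions = {"x": [1, 4, 7], "y": [4, 9, 12]}
bits = [0] * 16


def add(item):
    for pos in positions[item]:
        bits[pos] = 1


def might_contain(item):
    return all(bits[pos] == 1 for pos in positions[item])


add("x")
add("y")

# Naive "removal" of x: clear its k bits.
for pos in positions["x"]:
    bits[pos] = 0

print(might_contain("x"))  # False: x is gone, as intended
print(might_contain("y"))  # False: y was never removed, a false negative
```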

Time complexity and space complexity

When false positives are acceptable, a Bloom filter needs far less storage than other set representations (balanced binary trees, tries, hash tables, arrays, linked lists). Adding an element and checking whether an element is in the set both take O(k) time, independent of the number of elements already stored. In practice, a sparse hash table can still have a lower average access time than a Bloom filter. With the optimal number of hash functions, a Bloom filter with false positive rate p needs about 1.44 log2(1/p) bits per element.

Error rate estimation

A Bloom filter has a certain error rate (the false positive rate) when testing whether an element belongs to the set it represents. This section estimates that rate. Before estimating, we assume that kn < m (k is the number of hash functions, n the number of elements in the set, and m the length of the bit array), that the hash functions are independent of each other, and that each hash function selects a position in the bit array uniformly at random.

For a bit array of length m, the probability that a given bit is not set to 1 by one hash function during the insertion of an element is:

$$1 - \frac{1}{m}$$

After all k hash functions have been applied, the probability that the bit has still not been set to 1 is:

$$\left(1 - \frac{1}{m}\right)^{k}$$

If n elements are inserted, the probability that the bit has still not been set to 1 is:

$$\left(1 - \frac{1}{m}\right)^{kn}$$

So the probability that the bit has been set to 1 is:

$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$

Now test an element that is not in the set. The k hash functions map it to k positions in the bit array, and the probability that all of these positions hold a 1 is the false positive rate:

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$

The approximation uses the limit

$$\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = \frac{1}{e}$$

This calculation is not strictly correct, because it assumes that the events of the individual bits being set are independent of one another. However, as m and n increase, the estimate gets closer to the true false positive rate.

Mitzenmacher and Upfal showed that the expected false positive rate is the same even without the independence assumption.
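The estimate can be checked numerically. The sketch below compares the expression (1 - (1 - 1/m)^(kn))^k with its approximation (1 - e^(-kn/m))^k, and with a small simulation. The function names and the parameters m = 10000, n = 1000, k = 7 are chosen only for illustration, and the simulation assumes ideal hashing by drawing positions from a seeded random number generator.

```python
import math
import random


def fp_exact(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(kn))^k, under the independence assumption."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def fp_approx(m: int, n: int, k: int) -> float:
    """(1 - e^(-kn/m))^k, the large-m approximation."""
    return (1.0 - math.exp(-k * n / m)) ** k


def fp_simulated(m: int, n: int, k: int,
                 trials: int = 20000, seed: int = 0) -> float:
    """Insert n elements with ideal random hashing, then probe non-members."""
    rng = random.Random(seed)
    bits = [False] * m
    for _ in range(n * k):  # each insertion sets k random positions
        bits[rng.randrange(m)] = True
    hits = sum(all(bits[rng.randrange(m)] for _ in range(k))
               for _ in range(trials))
    return hits / trials


m, n, k = 10000, 1000, 7
print(fp_exact(m, n, k))      # about 0.0082
print(fp_approx(m, n, k))     # about 0.0082
print(fp_simulated(m, n, k))  # should land close to the two values above
```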

The optimal number of hash functions

A Bloom filter maps a set onto a bit array, so the question is how many hash functions give the lowest error rate. Two effects pull in opposite directions: with more hash functions, a query for an element that is not in the set checks more positions and so has more chances of hitting a 0; with fewer hash functions, fewer bits are set on each insertion, so the bit array contains more 0s. To find the optimal number of hash functions, we start from the error rate formula of the previous section.

The false positive rate is

$$f = \left(1 - e^{-kn/m}\right)^{k}$$

Taking the natural logarithm of both sides,

$$\ln f = k \ln\left(1 - e^{-kn/m}\right)$$

Let g = k ln(1 - e^(-kn/m)). Then f = e^g, so f is minimized exactly when g is minimized. Writing p = e^(-nk/m), we have k = -(m/n) ln p, and g can be rewritten as

$$g = -\frac{m}{n} \ln(p) \ln(1 - p)$$

By symmetry, g takes its minimum value when p = 1/2, that is, when k = ln2*(m/n). In this case the minimum error rate is f = (1/2)^k ≈ (0.6185)^(m/n). Since p is the probability that a given bit is still 0, p = 1/2 corresponds to half of the bits being 0 and half being 1. In other words, to keep the error rate low, it is best to leave half of the bit array empty.
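As a quick numerical check of k = ln2*(m/n), the sketch below scans integer values of k and compares the best one with the formula. The function names and the example sizes m = 10000, n = 1000 are illustrative assumptions.

```python
import math


def fp_rate(m: int, n: int, k: int) -> float:
    """False positive rate (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """k = ln2 * (m / n); in practice it is rounded to an integer."""
    return math.log(2) * m / n


m, n = 10000, 1000
k_star = optimal_k(m, n)
best_k = min(range(1, 21), key=lambda k: fp_rate(m, n, k))

print(k_star)                  # about 6.93
print(best_k)                  # 7, the best integer k in 1..20
print(fp_rate(m, n, best_k))   # about 0.0082
print(0.6185 ** (m / n))       # about 0.0082, the predicted minimum
```

Because k must be an integer, the rate actually achieved is slightly above the theoretical minimum.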

The size of the bit array

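Given n (the number of elements in the set) and a target error rate p (here p denotes the false positive rate achieved with the optimal number of hash functions k), the size m of the bit array can be calculated. Substituting the optimal k = ln2*(m/n) into the error rate formula gives

$$p = \left(1 - e^{-\left(\frac{m}{n}\ln 2\right)\frac{n}{m}}\right)^{\frac{m}{n}\ln 2}$$

which simplifies to

$$\ln p = -\frac{m}{n} (\ln 2)^{2}$$

and therefore

$$m = -\frac{n \ln p}{(\ln 2)^{2}}$$

This means that, with an error rate p, a Bloom filter of this length m can hold n elements (the calculation assumes n, m → ∞).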
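In practice the sizing is done in this direction: pick n and the acceptable error rate p, then compute m = -n ln p / (ln 2)^2 and k = ln2*(m/n). A minimal sketch follows; the function name bloom_parameters and the example values n = 1,000,000 and p = 0.01 are assumptions for illustration.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): bit-array length and hash count for n elements at rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(math.log(2) * m / n))
    return m, k


m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k)           # roughly 9.6 million bits (about 1.2 MB) and k = 7
print(m / 1_000_000)  # bits per element, about 9.59
```

For comparison, storing one million 64-bit identifiers directly would take 8 MB before any container overhead.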

Estimation of the number of elements in a bloom filter

Swamidass & Baldi (2007) gave a method for estimating the number of elements in a Bloom filter (see the paper for the detailed proof):

$$n^{*} = -\frac{m}{k} \ln\left(1 - \frac{X}{m}\right)$$

Here n* is the estimated number of elements in the Bloom filter, m is the length of the bit array, k is the number of hash functions, and X is the number of bits set to 1 in the bit array.
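The estimate n* = -(m/k) ln(1 - X/m) is straightforward to compute once the number of 1 bits is known. The function name estimate_count and the example filter below are hypothetical.

```python
import math


def estimate_count(m: int, k: int, x: int) -> float:
    """n* = -(m / k) * ln(1 - X / m), where X is the number of 1 bits."""
    if x >= m:
        raise ValueError("all bits are set; the estimate is unbounded")
    return -(m / k) * math.log(1.0 - x / m)


# Hypothetical filter: 10,000 bits, 7 hash functions, 5,000 bits set.
print(estimate_count(m=10000, k=7, x=5000))  # about 990
```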

Union and Intersection of Bloom Filters

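Bloom filters can be used to estimate the sizes of the union and intersection of two sets. For two Bloom filters of the same length m that use the same k hash functions, the numbers of elements in the sets A and B are estimated as:

$$n(A^{*}) = -\frac{m}{k} \ln\left(1 - \frac{X(A)}{m}\right)$$

$$n(B^{*}) = -\frac{m}{k} \ln\left(1 - \frac{X(B)}{m}\right)$$

where X(A) and X(B) are the numbers of bits set to 1 in the two filters. The number of elements in the union of A and B is estimated as:

$$n(A^{*} \cup B^{*}) = -\frac{m}{k} \ln\left(1 - \frac{X(A \cup B)}{m}\right)$$

where X(A ∪ B) is the number of bits set to 1 in either of the two filters. So the number of elements in the intersection of A* and B* is:

$$n(A^{*} \cap B^{*}) = n(A^{*}) + n(B^{*}) - n(A^{*} \cup B^{*})$$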

Features of Bloom Filters

A Bloom filter can hold as many elements as you like (at the cost of a higher false positive rate): elements can always be added without errors such as running out of memory;

The union and intersection of two sets can be computed easily with bitwise OR and AND operations on their Bloom filters (which must have the same length and hash functions), although this also affects the accuracy of the resulting filter.
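The bitwise operations can be sketched with a Python int serving as the bit array. The parameters M and K, the seeded SHA-256 hashing, and the example sets are assumptions of this sketch. The size estimates use n* = -(m/k) ln(1 - X/m), with X the number of 1 bits, and the intersection size is obtained by inclusion-exclusion.

```python
import hashlib
import math

M, K = 4096, 3  # assumed parameters, shared by both filters


def positions(item: str):
    # Assumption: seeds 0..K-1 prefixed to the item stand in for K hash functions.
    for seed in range(K):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M


def make_filter(items) -> int:
    bits = 0  # a Python int used as an M-bit array
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def estimate(bits: int) -> float:
    x = bin(bits).count("1")  # number of 1 bits
    return -(M / K) * math.log(1.0 - x / M)


a = make_filter(f"user-{i}" for i in range(0, 300))    # A = users 0..299
b = make_filter(f"user-{i}" for i in range(200, 500))  # B = users 200..499

union_bits = a | b      # identical to a filter built from A ∪ B
intersect_bits = a & b  # contains every bit of a filter built from A ∩ B

n_union = estimate(union_bits)
n_intersection = estimate(a) + estimate(b) - n_union
print(round(n_union), round(n_intersection))  # true sizes are 500 and 100
```

The OR result is exactly the filter that would be obtained by inserting every element of both sets. The AND result may have extra bits set compared with a filter built directly from the intersection, which is where the loss of accuracy comes from.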

Examples

Google Bigtable, Apache HBase and Apache Cassandra use Bloom filters to determine whether a row or column exists, which reduces disk access and improves database performance;

Bitcoin has used Bloom filters to speed up wallet synchronization.

Summary

In computing, we often trade time for space or space for time, sacrificing one aspect of performance to improve another. The Bloom filter introduces a third dimension alongside time and space: the error rate. As described above, a Bloom filter cannot tell with certainty whether an element is in the set (cardinality estimation follows a similar design). In exchange for accepting an error rate, it saves a great deal of storage space.

Since Burton Bloom proposed it in 1970, the Bloom filter has been widely used in spell checking and database systems. Over the past decade or two, with the spread and growth of networking, the Bloom filter has found new life in the networking field, and variants and new applications keep emerging. As network applications continue to develop, new variants and applications can be expected to keep appearing, and the Bloom filter will see still wider use.
