Bloom filter

Posing the problem

Before we dive into Bloom filters, consider a question: given a huge data set (millions, or hundreds of millions of elements...), how do you determine whether an element is in the set? Or, equivalently, how do you determine that an element is *not* in the set?

Thinking about this question, the first idea that comes to mind is probably a hash table. This works when the data set is small, and a distributed hash table can be used when the data set is huge. But at very large scale, especially in applications that only need to decide whether an element belongs to the set, we can borrow the idea behind the hash table and use a Bloom filter instead. Since we only care about membership, not the element's value, it is enough to map each element to a bit that marks its presence. The following sections describe the Bloom filter data structure in detail.

Bloom filter data structure

A Bloom filter is a space-efficient probabilistic data structure for testing set membership. It achieves enormous compression of the data set at the cost of a prescribed false positive rate. A Bloom filter is an array of m bits, all initialized to 0, together with a set of k independent hash functions. To add an element, it is hashed by each of the k functions, and for each of the k outputs the corresponding bit of the array is set to 1. A query uses the same hash functions: if all k accessed bits are set to 1, the element is probably in the set. Removing elements is not supported except by discarding the filter and rebuilding it from scratch; this issue is discussed later.

A Bloom filter is thus the combination of an underlying bit array and a group of hash functions working together. Depending on the required false positive rate, one, two, or three (commonly three) hash functions may be chosen. Unlike a hash table, to save space each slot of the underlying array is a single bit: 1 means some element maps there, 0 means none does. The array length, the number of hash functions, and the quality of the hash functions all affect the false positive rate, so a suitable array size and suitable hash functions should be chosen according to the size of the data set. Because this depends on the particular problem, it is hard to say which configuration is best. The following walks through how a Bloom filter operates.
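As a minimal sketch of the structure just described (the article does not specify concrete hash functions, so the k functions here are simulated by salting a single SHA-256 hash, which is only one convenient choice):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: an m-bit array plus k hash functions."""

    def __init__(self, m, k):
        self.m = m              # number of bits in the underlying array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # all bits start at 0

    def _indexes(self, item):
        # Derive k indexes by salting one hash with i -- a stand-in
        # for k independent hash functions.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1  # set every mapped bit to 1

    def might_contain(self, item):
        # Any mapped bit 0 -> definitely absent; all 1 -> probably present.
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(m=18, k=3)
for element in ("x", "y", "z"):
    bf.add(element)

print(bf.might_contain("x"))   # True: inserted elements are never missed
print(bf.might_contain("w"))   # usually False; may be a false positive
```

Note that `might_contain` can never return `False` for an inserted element, which is exactly the "no false negatives" guarantee discussed below.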

[Figure: three elements x, y, z mapped into an 18-bit Bloom filter by three hash functions]

Data set {x, y, z}; underlying array of length m = 18, all bits initialized to 0; number of hash functions k = 3, denoted H_1, H_2, H_3.

First, each element of the data set is mapped into the underlying array by the three different hash functions, and the corresponding array positions are set to 1. You can see that hash collisions occur; the Bloom filter makes no attempt to resolve them.

To test whether an element, e.g. w, is in the data set, compute its three hash values in turn and look up the mapped positions in the array. If any mapped value is 0, the element is not in the data set; if all three mapped values are 1, the element is in the data set with high probability, but not with certainty (because of hash collisions). In the figure, one of w's hashes maps to a 0, so w is definitely not in the data set.

As you can see, the way a Bloom filter works is very simple.

Issues related to Bloom filters

Why can't a Bloom filter be 100% certain that an element is in the data set?

To be precise, a Bloom filter can be 100% certain that an element is *not* in the data set, but it can only say with high probability that an element *is* in the set.

Because of hash collisions, a bit set to 1 in the underlying array may have been set by a different, colliding element, even though the element being looked up was never inserted. This is why a Bloom filter has a false positive rate.

Can elements be deleted from a Bloom filter?

As you can see, a Bloom filter supports only insertion and lookup, with no deletion. Why?

Because of hash collisions, two elements may map to the same bits. If one of them were deleted by setting its corresponding bits back to 0, the other element would effectively be deleted as well, so no deletion operation is provided. What if we deleted an element from the data set but left the underlying array unchanged? Logically this is possible and causes no errors, but it would considerably increase the false positive rate: with repeated deletions and insertions, the array eventually fills up with 1s, and the Bloom filter loses its ability to quickly decide whether an element is in the data set. If deletions are rare, this can still be made to work by periodically rebuilding the Bloom filter to keep the false positive rate within range (though when the data set is very large, repeated reconstruction is very costly).

Of course, real applications may well need deletion; what then? One possible idea is to improve on the plain Bloom filter with something like reference counting: each slot of the underlying array is no longer a boolean but an integer, incremented every time an insertion maps to it and decremented on deletion. This alone is still not enough and raises other issues; the specific details are left to interested readers and not explored more deeply here.
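The reference-counting idea can be sketched as a counting Bloom filter. This is a sketch under assumptions, not the article's design: the salted-SHA-256 indexing and the parameters m = 64, k = 3 are illustrative, and the caller must only remove elements that were actually inserted (one of the "other issues" mentioned above).

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter variant whose slots are counters, enabling deletion."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.counts = [0] * m       # integer counters instead of single bits

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.counts[idx] += 1   # colliding elements now stack up

    def remove(self, item):
        # Removing an item that was never added corrupts the counters --
        # the caller is trusted not to do that.
        for idx in self._indexes(item):
            self.counts[idx] -= 1

    def might_contain(self, item):
        return all(self.counts[idx] > 0 for idx in self._indexes(item))

cbf = CountingBloomFilter()
cbf.add("x")
cbf.add("y")
cbf.remove("x")
print(cbf.might_contain("x"))  # typically False after removal
print(cbf.might_contain("y"))  # True: y's counters are untouched
```

The counters cost several bits per slot instead of one, which is part of why this variant trades away some of the plain filter's compression.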

Others have proposed the Cuckoo filter; see the paper Cuckoo Filter: Practically Better Than Bloom.

Calculating the false positive rate

The false positive rate depends on several factors: the length of the array, the number of elements, the hash functions themselves, and the number of hash functions. For how to calculate it, see the Bloom filter reference below; we will not dwell on it here.

Reference: Bloom filter

Besides the theoretical calculation, a program can measure the change in the false positive rate in practice: if needed, keep statistics and compute number of false positives / total number of queries.
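For reference, the standard approximation (assuming independent, uniformly distributed hash functions, as in the cited reference) is p ≈ (1 − e^(−kn/m))^k for n inserted elements, m bits, and k hash functions. A quick sketch comparing that formula with the empirical "false positives / total queries" measurement, using illustrative salted-SHA-256 hashing:

```python
import hashlib
import math

def theoretical_fpr(n, m, k):
    """Standard approximation p = (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

def indexes(item, m, k):
    for i in range(k):
        h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
        yield int(h, 16) % m

def measured_fpr(n, m, k, trials=10_000):
    """Insert n elements, then query `trials` elements known to be absent."""
    bits = [0] * m
    for x in range(n):
        for idx in indexes(f"in-{x}", m, k):
            bits[idx] = 1
    false_positives = sum(
        all(bits[idx] for idx in indexes(f"out-{q}", m, k))
        for q in range(trials)
    )
    return false_positives / trials   # false positives / total queries

n, m, k = 1000, 10_000, 3
print(f"theory:   {theoretical_fpr(n, m, k):.4f}")   # about 0.0174
print(f"measured: {measured_fpr(n, m, k):.4f}")      # close to theory
```

With n = 1000 elements in m = 10,000 bits and k = 3, both numbers come out below 2%, illustrating how little space is needed for a modest false positive rate.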

Bloom filter applications

After the discussion above, we can answer the question at the beginning of the article: simply map the entire original data set into the underlying array of a Bloom filter. The space overhead is much smaller than that of a hash table or similar structure, because the Bloom filter does not store the original data set, only the bits marking whether data exists. To test an element, compute its hash values and check the corresponding mapped bits.

The figure below shows a Bloom filter used in front of a KV storage system to improve query response speed.

[Figure: a Bloom filter placed in front of a KV storage system]

As the figure shows, for a query like key1 there is no need to access the KV storage system at all; the filter can immediately return the result "data does not exist". For a request like key2, we continue on to the KV storage system and return its result. A query like key3 is a false positive, which rarely happens. If in practice a large fraction of queries are for data not actually stored in the system, this approach can significantly reduce accesses to the KV storage system and improve response efficiency.
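This pattern can be sketched as follows. Everything here is illustrative rather than from the article: the KV store is just a Python dict, the `kv_put`/`kv_get` names and the `backend_hits` counter are made up, and the filter parameters are arbitrary.

```python
import hashlib

M, K = 1 << 16, 3         # filter size and hash count (illustrative)
bits = [0] * M
kv_store = {}             # stands in for a real KV storage system
backend_hits = 0          # counts lookups that actually reach the store

def indexes(key):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
        yield int(h, 16) % M

def kv_put(key, value):
    kv_store[key] = value
    for idx in indexes(key):
        bits[idx] = 1     # register the key in the Bloom filter

def kv_get(key):
    global backend_hits
    # Filter says "definitely absent": skip the store (the key1 case).
    if not all(bits[idx] for idx in indexes(key)):
        return None
    # Filter says "probably present": query the store (the key2 case).
    # A miss at this point would be a false positive (the key3 case).
    backend_hits += 1
    return kv_store.get(key)

kv_put("key2", "value2")
print(kv_get("key2"))   # "value2" -- passes the filter, found in the store
print(kv_get("key1"))   # None -- almost certainly rejected by the filter
print(backend_hits)     # the key1 lookup (almost certainly) never hit the store
```

In a real deployment the same structure applies: the filter sits in memory in front of the slow storage system, and only queries that pass it pay the cost of a backend round trip.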



Origin: blog.csdn.net/s_lisheng/article/details/95328431