Bloom filter

Reprinted link: https://github.com/Snailclimb/JavaGuide/blob/master/docs/dataStructures-algorithms/data-structure/bloom-filter.md

Here we introduce the Bloom filter in several parts:

  1. What is the Bloom filter?
  2. Principle of the Bloom filter.
  3. Bloom filter usage scenarios.
  4. Implementing a Bloom filter by hand in Java.
  5. Using the Bloom filter that ships with Google's open source Guava.
  6. The Bloom filter in Redis.

1. What is the Bloom filter?

First, we need to understand the concept of the Bloom filter.

The Bloom filter was proposed in 1970 by a man named Bloom. We can think of it as a data structure made up of two parts: a binary vector (or bit array) and a series of random mapping functions (hash functions). Compared with the List, Map, Set and other data structures we commonly use, it takes up less space and is more efficient, but the drawback is that the results it returns are probabilistic rather than exact. In theory, the more elements are added to the set, the greater the likelihood of false positives. In addition, data stored in a Bloom filter is not easy to delete.

 

Each element of the bit array occupies only 1 bit, and each element can only be 0 or 1. A bit array with 1,000,000 (100w) elements therefore takes up only 1,000,000 bit / 8 = 125,000 bytes = 125,000 / 1024 KB ≈ 122 KB of space.
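The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name `BitArraySize` and the helper `bytesForBits` are made up for illustration, and `java.util.BitSet` is shown as one possible way to hold such a bit array, not necessarily the one a real Bloom filter library uses.

```java
import java.util.BitSet;

class BitArraySize {
    // Bytes needed to hold the given number of bits, rounded up to whole bytes.
    static long bytesForBits(long bits) {
        return (bits + 7) / 8;
    }

    public static void main(String[] args) {
        long bits = 1_000_000L;
        long bytes = bytesForBits(bits);   // 1,000,000 / 8
        double kb = bytes / 1024.0;        // roughly 122
        System.out.println(bytes + " bytes, about " + kb + " KB");

        // A BitSet stores one bit per position; every position starts as 0 (false).
        BitSet bitArray = new BitSet((int) bits);
        bitArray.set(42);
        System.out.println(bitArray.get(42) + " " + bitArray.get(43));
    }
}
```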

Summary: a man named Bloom proposed a data structure for checking whether an element is in a given large set. This data structure is efficient and performs very well, but its drawbacks are a certain false positive rate and the difficulty of deleting elements. Moreover, in theory, the more elements are added to the set, the greater the likelihood of false positives.

2. Principle of the Bloom filter

When an element is added to the Bloom filter, the following steps are carried out:

  1. The hash functions of the Bloom filter are applied to the element's value to obtain hash values (there are as many hash values as there are hash functions).
  2. Based on the hash values obtained, the corresponding positions in the bit array are set to 1.

When we need to determine whether an element exists in the Bloom filter, the following steps are carried out:

  1. The same hash calculations are performed again on the given element;
  2. Each of the resulting positions in the bit array is then checked. If all of the values are 1, the element is considered to be in the Bloom filter; if any value is not 1, the element is not in the Bloom filter.
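The add and lookup steps above can be sketched in Java roughly as follows. This is a minimal illustration under assumptions, not a production implementation: the class name `SimpleBloomFilter`, the method names `add` and `mightContain`, the seed values, and the simple seeded polynomial hash are all chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final int size;      // number of bits in the bit array
    private final int[] seeds;   // one seed per hash function (assumed values)
    private final BitSet bits;   // every position starts as 0

    SimpleBloomFilter(int size, int[] seeds) {
        this.size = size;
        this.seeds = seeds.clone();
        this.bits = new BitSet(size);
    }

    // Illustrative seeded hash: a polynomial hash over the UTF-8 bytes,
    // mapped to a position in [0, size).
    private int position(String value, int seed) {
        int h = seed;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h = h * seed + (b & 0xff);
        }
        return Math.floorMod(h, size);
    }

    // Adding: compute one position per hash function and set each to 1.
    void add(String value) {
        for (int seed : seeds) {
            bits.set(position(value, seed));
        }
    }

    // Lookup: recompute the same positions; any 0 means "definitely not added",
    // all 1 means "possibly added" (false positives can occur).
    boolean mightContain(String value) {
        for (int seed : seeds) {
            if (!bits.get(position(value, seed))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter =
                new SimpleBloomFilter(1 << 20, new int[] {3, 13, 47, 71, 91, 137});
        System.out.println(filter.mightContain("hello")); // false: nothing has been added yet
        filter.add("hello");
        System.out.println(filter.mightContain("hello")); // true: added elements are always found
    }
}
```

Note that there is no `remove` method: clearing a bit could also erase evidence of other elements that hash to the same position, which is why deletion is difficult.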

Here is a simple example:

 

As shown in the figure, when a string is to be added to the Bloom filter, several different hash functions first generate hash values for it, and the corresponding positions in the bit array are then set to 1 (when the bit array is initialized, all positions are 0). When the same string is stored a second time, the corresponding positions have already been set to 1, so it is easy to tell that this value already exists (which makes deduplication very convenient).

If we need to determine whether a string is in the Bloom filter, we only need to run the same hash calculations on it again and check each of the resulting positions in the bit array. If all of the values are 1, the string is considered to be in the Bloom filter; if any value is not 1, the string is not in the Bloom filter.

Different strings may hash to the same positions. In that case we can increase the size of the bit array appropriately or adjust our hash functions.
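To get a feel for how the bit array size affects collisions, the sketch below uses the commonly cited approximation for the false positive rate, p ≈ (1 - e^(-k·n/m))^k, where m is the number of bits, n the number of elements added, and k the number of hash functions. The approximation is not stated in the text above and assumes well-distributed, independent hash functions; the class and method names and the sample numbers are made up for illustration.

```java
class FalsePositiveEstimate {
    // Approximate false positive rate for m bits, n added elements, k hash functions.
    static double estimate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 100_000L;  // elements added (sample value)
        int k = 7;          // hash functions (sample value)
        // Doubling the bit array lowers the estimated rate for the same n and k.
        System.out.println("m = 1,000,000 bits: " + estimate(1_000_000L, n, k));
        System.out.println("m = 2,000,000 bits: " + estimate(2_000_000L, n, k));
        // Adding more elements to the same bit array raises it.
        System.out.println("n = 200,000 elements: " + estimate(1_000_000L, 200_000L, k));
    }
}
```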

In summary, we can conclude: when the Bloom filter says an element exists, there is a small probability that it is wrong (a false positive). When the Bloom filter says an element does not exist, the element definitely does not exist.

3. Bloom filter usage scenarios

  1. Determining whether given data exists: for example, checking whether a number is in a very large set of numbers (very large meaning more than 500 million!), preventing cache penetration (checking whether the requested data is valid so that requests do not bypass the cache and go straight to the database), spam filtering, blacklists, and so on.
  2. Deduplication: for example, when crawling a given URL, checking it against the URLs that have already been crawled.
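As a rough illustration of the deduplication scenario, the sketch below skips URLs that a small Bloom filter reports as already seen. Everything here is assumed for the example: the class name `UrlDeduplicator`, the bit array size, the number of hash positions, and the way the positions are derived from two simple string hashes. Because of false positives, a crawler built this way may occasionally skip a URL it has never visited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class UrlDeduplicator {
    private static final int SIZE = 1 << 20;  // number of bits (assumed)
    private static final int HASHES = 4;      // positions per URL (assumed)
    private final BitSet bits = new BitSet(SIZE);

    // Derives the i-th position by combining two simple string hashes.
    private int position(String url, int i) {
        int h1 = url.hashCode();
        int h2 = 0;
        for (int j = 0; j < url.length(); j++) {
            h2 = h2 * 131 + url.charAt(j);
        }
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Returns true if the URL was possibly seen before.
    // Otherwise records the URL in the filter and returns false.
    boolean seenBefore(String url) {
        boolean allSet = true;
        for (int i = 0; i < HASHES; i++) {
            int p = position(url, i);
            if (!bits.get(p)) {
                allSet = false;
                bits.set(p);
            }
        }
        return allSet;
    }

    // Keeps only the URLs that the filter has not reported as seen.
    static List<String> dedupe(List<String> urls) {
        UrlDeduplicator filter = new UrlDeduplicator();
        List<String> fresh = new ArrayList<>();
        for (String url : urls) {
            if (!filter.seenBefore(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://example.com/a",
                "https://example.com/b",
                "https://example.com/a");
        System.out.println(dedupe(urls));
    }
}
```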

In short: a Bloom filter can be used to deduplicate very large amounts of data. When the Bloom filter says an element exists, there is a small probability that it is wrong; when the Bloom filter says an element does not exist, the element definitely does not exist.


Origin blog.csdn.net/Eden_blue/article/details/104294729