Bloom filter (BloomFilter) - Principle (II)

1. Introduction to the HashSet and HashMap data structures

  • For comparison, we first look at the common collection classes that provide the same functionality. Because HashSet is generally implemented as a wrapper around a HashMap whose values are all the same placeholder Object, we can look at HashMap directly.
  • Its structure is as follows:
    [Figure: HashMap structure]
  • Structure description
    1. First, an array of length n is constructed.
    2. When an object is stored, its HashCode is computed first, and HashCode % n determines the array index at which the object is stored.
    3. However, different objects may produce the same HashCode % n, so several elements may need to be stored at the same array index.
    4. In that case a linked list or a red-black tree at that index holds the multiple objects, and equals is used to determine whether two objects are the same.
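
The four steps above can be sketched as a tiny chained hash table. This is only an illustration under simplifying assumptions: the class name TinyHashMap and its methods are hypothetical, it never resizes, and it always uses a linked list per bucket, whereas the real java.util.HashMap is considerably more elaborate.

```java
import java.util.LinkedList;

// Hypothetical, simplified chained hash table illustrating steps 1-4.
class TinyHashMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    TinyHashMap(int n) {
        // Step 1: an array of length n.
        buckets = new LinkedList[n];
    }

    // Step 2: hashCode % n picks the array index
    // (floorMod keeps the index non-negative for negative hash codes).
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        // Steps 3-4: colliding keys share one bucket; equals tells them apart.
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        buckets[i].add(new Entry<>(key, value));
    }

    V get(K key) {
        LinkedList<Entry<K, V>> bucket = buckets[indexFor(key)];
        if (bucket == null) {
            return null;
        }
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

The strings "Aa" and "BB" have the same hashCode in Java, so they always land in the same bucket here, and only the equals check keeps their values separate.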

2. BloomFilter structure

  • First, it is a bit array of length n, for example n close to $2^{31}$ (the largest non-negative int is $2^{31} - 1$), with every position initially set to 0.
    [Figure: bloomfilter_1]
  • For each object to be stored, compute its fingerprint information (for example, a fingerprint made up of 8 int values: 352, 1001, 5087, 28721, 72737, 201098, 9749111, 25010128; a fingerprint with a different number of values, such as 5, also works, and the choice affects the false-positive rate). Computing it relies on the HashCode described above for HashMap.
    [Figure: bloomfilter_2]
  • When an element is added, the bits at the 8 array indices given by its fingerprint are set to 1.
    [Figure: bloomfilter_3]
  • Compute the fingerprint of every object to be inserted in turn and set the bits at those indices to 1 (if an index repeats, the bit simply stays 1). The result is a BloomFilter.
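
The construction described above might be sketched as follows. The class name SimpleBloomFilter is hypothetical, and the way the 8 indices are derived from hashCode (h1 + i * h2) is an assumption made for this sketch; the article does not say which hash functions it has in mind.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter: a bit array of length n, 8 positions per element.
class SimpleBloomFilter {
    private static final int K = 8;  // number of fingerprint values, as in the text
    private final int n;             // length of the bit array
    private final BitSet bits;       // every bit starts at 0

    SimpleBloomFilter(int n) {
        this.n = n;
        this.bits = new BitSet(n);
    }

    // Derives K array indices from hashCode. The mixing scheme below
    // (h1 + i * h2) is an assumption of this sketch, not the article's method.
    int[] fingerprint(Object o) {
        int h1 = o.hashCode();
        int h2 = ((h1 >>> 16) ^ (h1 * 0x9E3779B9)) | 1; // forced odd, so never zero
        int[] positions = new int[K];
        for (int i = 0; i < K; i++) {
            positions[i] = Math.floorMod(h1 + i * h2, n);
        }
        return positions;
    }

    // Sets the bit at each fingerprint position; a bit that is already 1 stays 1.
    void add(Object o) {
        for (int pos : fingerprint(o)) {
            bits.set(pos);
        }
    }

    boolean isSet(int index) {
        return bits.get(index);
    }

    // Number of bits currently set to 1.
    int setBitCount() {
        return bits.cardinality();
    }
}
```

Adding the same object twice leaves the bit array unchanged, which matches the remark that repeated indices are simply set to 1 again.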

3. Answers to some common questions about BloomFilter

  • How does a BloomFilter query an element? How can we tell whether an object exists in the BloomFilter?
    1. First, compute the object's fingerprint, 8 int values, for example: 10, 46, 12, 234, 4564, 122, 1211234, 227341
    2. Then check whether the bits at those 8 indices in the BloomFilter are all 1. As soon as one of them is not 1, stop immediately and return false; if all of them are 1, return true.
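
The two query steps can be sketched as a single loop with an early exit. The helper class BloomQuery is hypothetical, and the fingerprint is assumed to have been computed already and passed in as an int array.

```java
import java.util.BitSet;

// Hypothetical helper showing the query step on an already computed fingerprint.
class BloomQuery {
    // Returns false as soon as one fingerprint position is 0; true only if all are 1.
    static boolean mightContain(BitSet bits, int[] fingerprint) {
        for (int pos : fingerprint) {
            if (!bits.get(pos)) {
                return false; // at least one bit is 0: the object was never added
            }
        }
        return true; // all bits are 1: the object is possibly present
    }
}
```

With the example fingerprint above, the call returns true once the bits at 10, 46, 12, 234, 4564, 122, 1211234 and 227341 have all been set, and false if any one of them is still 0.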
  • Why does a BloomFilter have false positives?
    • Looking at the query procedure above, we can see that the fingerprints of two objects may partially overlap. As more and more objects are added, it becomes possible that an object which was never added to the BloomFilter has all of its positions set to 1 by other objects, so it is judged present and true is returned. This is the source of false positives: for a bit array of fixed capacity, the more elements are added, the higher the false-positive rate.
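
To get a feel for how the false-positive rate grows, the sketch below evaluates the commonly used approximation p = (1 - e^(-k*m/n))^k, where n is the length of the bit array, m the number of inserted elements, and k the number of positions per fingerprint. The formula assumes the k positions behave like independent, uniformly distributed hashes; the class and method names are hypothetical.

```java
// Hypothetical helper that estimates the false-positive rate of a Bloom filter.
class FalsePositiveEstimate {
    // Approximation (1 - e^(-k*m/n))^k for n bits, m inserted elements, k positions each.
    static double rate(long nBits, long inserted, int k) {
        double exponent = -((double) k * inserted) / nBits;
        double fillFraction = 1.0 - Math.exp(exponent); // expected share of bits set to 1
        return Math.pow(fillFraction, k);
    }
}
```

For a fixed nBits and k, the returned value only goes up as inserted grows, which is the behaviour described above.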
  • Why can't a BloomFilter remove elements?
    • As explained above, the fingerprints of two objects may partially overlap. If the bits at the indices of one object's fingerprint are cleared (set to 0), bits that also belong to other objects' fingerprints may be cleared too, and those objects would then wrongly be reported as absent. That is why bits cannot simply be cleared.
    • You might say: use an int array instead. Each time an object is added, increment the counter at each index of its fingerprint; each time an object is deleted, decrement those counters; an object exists if the counters at all of its fingerprint indices are greater than 0. However, if none of the 8 counters of a deleted object drops to 0, the object will still be judged present. In addition, an int array takes far more space than a bit array (in Java an int occupies 4 bytes, and each byte is 8 bits, so it is 32 times as large). Even with a byte (1 byte, 8 bits) per counter, the array is eight times the size of the bit array, and a Java byte can only count up to 127; in practice, with a very large amount of data, some index is likely to be incremented more than 127 times.
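
The int-array idea and its weakness can be sketched like this. The class CountingBloomSketch is hypothetical, and fingerprints are passed in as precomputed index arrays (shortened to 3 indices in the usage note below to keep the example readable).

```java
// Hypothetical counting variant: one int counter per position instead of one bit.
class CountingBloomSketch {
    private final int[] counters; // 32 bits per position instead of 1

    CountingBloomSketch(int n) {
        counters = new int[n];
    }

    // Adding an object increments the counter at each fingerprint index.
    void add(int[] fingerprint) {
        for (int pos : fingerprint) {
            counters[pos]++;
        }
    }

    // Deleting decrements the same counters.
    // This is only meaningful if the object really was added before.
    void remove(int[] fingerprint) {
        for (int pos : fingerprint) {
            counters[pos]--;
        }
    }

    // An object is judged present if every one of its counters is greater than 0.
    boolean mightContain(int[] fingerprint) {
        for (int pos : fingerprint) {
            if (counters[pos] <= 0) {
                return false;
            }
        }
        return true;
    }
}
```

Suppose three objects with fingerprints {3, 7, 11}, {7, 20, 25} and {3, 7, 20} are added and the third is then removed. Its counters at 3, 7 and 20 are still kept above 0 by the other two objects, so mightContain still returns true for it, exactly the situation described above.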

Origin blog.csdn.net/alionsss/article/details/99769147