04 Bloom filter (BloomFilter)

What is a Bloom filter

  • The Bloom filter was proposed by Burton Howard Bloom in 1970.
  • It consists of a very long bit array plus a series of hash functions that map each element to positions in that array. It is mainly used to judge whether an element is in a set.
  • Many business scenarios require judging whether an element belongs to a certain collection. The straightforward approach is to store every element of the collection and answer by comparison.
  • Linked lists, trees, and hash tables all work this way.
  • As the number of elements grows, however, the storage space needed grows linearly and eventually becomes a bottleneck, and retrieval can slow down as well (the retrieval time complexities of these three structures are O(n), O(log n), and O(1) respectively). The Bloom filter was designed for this situation.
  • A Bloom filter (BloomFilter) consists of a bit array whose bits are all initially zero and multiple hash functions, and it is used to quickly determine whether a given piece of data exists.
  • In essence, it judges whether a specific piece of data exists in a large collection.
  • A Bloom filter is a data structure similar to a set, but its answers are not fully accurate.

Features

  • Inserts and queries are efficient and the filter occupies little space, but a positive result is uncertain.
  • If the filter reports that an element exists, the element does not necessarily exist; if it reports that an element does not exist, the element definitely does not exist.
  • Elements can be added to a Bloom filter but cannot be removed. Clearing an element's bits may also clear bits shared with other elements, which would then be wrongly reported as absent.
  • Misjudgments occur only for elements that were never added to the filter; elements that were added are never misjudged.
  • In short:
    • Reported present: the element may exist.
    • Reported absent: the element definitely does not exist. If the Bloom filter judges that an element is not in a set, the element is guaranteed not to be in the set.

Bloom filter usage scenarios

Solve the problem of cache penetration

  • What is cache penetration
    • Normally a request first checks whether the Redis cache holds the data, and queries the database only on a cache miss.
    • When the data does not exist in the database either, every query for it reaches the database. This is cache penetration.
    • The problem with cache penetration is that a large number of requests for data that does not exist in the database puts pressure on the database and can even bring it down.
  • A Bloom filter can be used to solve cache penetration
    • Store the keys of the existing data in the Bloom filter, which amounts to placing a Bloom filter in front of Redis.
    • When a request arrives, first check whether its key is in the Bloom filter:
    • If the key is not in the Bloom filter, return immediately;
    • If the Bloom filter reports that the key may exist, query the Redis cache; if the key is not found in Redis, fall through to the MySQL database.
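
The request flow above can be sketched in Python roughly as follows. This is a minimal illustration rather than production code: plain dictionaries stand in for Redis and MySQL, the k hash functions are simulated by salting SHA-256, and `BloomFilter`, `lookup`, `stats`, and the key names are hypothetical names chosen for this example.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Assumption: derive several hash functions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


database = {"user:1": "Alice", "user:2": "Bob"}  # stand-in for MySQL
cache = {}                                       # stand-in for Redis
stats = {"db_reads": 0}                          # counts simulated database queries

# Load the keys of all existing rows into the filter.
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for existing_key in database:
    bloom.add(existing_key)


def lookup(key):
    """Check the Bloom filter first, then the cache, then the database."""
    if not bloom.might_contain(key):
        return None                  # definitely not in the database: stop here
    if key in cache:
        return cache[key]            # cache hit
    stats["db_reads"] += 1           # cache miss: fall through to the database
    value = database.get(key)
    if value is not None:
        cache[key] = value           # populate the cache for later requests
    return value


print(lookup("user:1"))    # Alice (read from the database, then cached)
print(lookup("user:999"))  # None
print(stats["db_reads"])
```

Requests for keys that were never stored are normally rejected by the filter before they reach the cache or the database. A rare false positive only costs one extra cache and database lookup, so the result is still correct.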

Blacklist verification

  • If an item is found on the blacklist, a specific action is taken. For example, in spam detection, any mail whose mailbox address is on the blacklist is treated as spam.
  • If the blacklist holds hundreds of millions of entries, storing them all consumes a great deal of space, and a Bloom filter is a better solution. Put every blacklisted address into the Bloom filter and, when an email arrives, just check whether its address is in the filter.
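
To give a feel for the space involved, the sketch below estimates how large a Bloom filter would need to be for such a blacklist. It relies on the standard sizing formulas, m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions, which are not given in the original text. The figures of 100 million addresses and a 1% false positive rate are assumptions chosen for illustration, and `bloom_parameters` is a hypothetical helper name.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (number of bits, number of hash functions) from the standard formulas."""
    num_bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


# Assumed workload: 100 million blacklisted addresses, 1% false positives.
num_bits, num_hashes = bloom_parameters(100_000_000, 0.01)
size_mib = num_bits / 8 / 1024 / 1024

print(f"bits: {num_bits}, hash functions: {num_hashes}, size: {size_mib:.0f} MiB")
```

Under these assumptions the filter needs a little under 10 bits per address, a bit more than 100 MiB in total, no matter how long the addresses themselves are.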

Bloom filter principle

Traditional hash in Java

  • A hash function converts input data of any size into output data of a fixed size. The output is called a hash value or hash code.
  • If two hash values produced by the same function differ, the two original inputs must differ as well. This follows from the hash function being deterministic: the same input always yields the same output.
  • The inputs and outputs of a hash function do not correspond one to one. If two hash values are the same, the two inputs are probably the same, but they may also differ. This situation is called a hash collision.
  • When a hash table stores a large amount of data, its space efficiency is still low, and with only a single hash function, collisions occur easily.
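
A collision is easy to reproduce. The sketch below reimplements, in Python, the polynomial formula that Java documents for `String.hashCode()` (h = 31 * h + c over the characters, with 32-bit overflow). `java_string_hash` is a hypothetical helper name, and the port assumes characters in the Basic Multilingual Plane, so that each Python character corresponds to one Java `char`.

```python
def java_string_hash(text):
    """Mimic Java's String.hashCode(): h = 31 * h + char, as a signed 32-bit int."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF        # keep the low 32 bits, like int overflow
    return h - (1 << 32) if h >= (1 << 31) else h  # reinterpret as a signed value


print(java_string_hash("Aa"))  # 2112
print(java_string_hash("BB"))  # 2112, a different input with the same hash value
```

The two strings differ, yet 65 * 31 + 97 and 66 * 31 + 66 both equal 2112, so a single hash value cannot tell them apart.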

First look

Bloom filter implementation principle and data structure

  • The Bloom filter is an advanced data structure designed specifically for deduplication problems.
  • In essence it is a large bit array plus several different unbiased hash functions (unbiased meaning that their outputs are uniformly distributed). The bit array starts as all zeros, and together with the hash functions it is used to quickly determine whether a piece of data exists. Like HyperLogLog, it is slightly inaccurate: there is a certain probability of misjudgment.
  • To add a key, hash it with each of the hash functions to obtain an integer, then take that integer modulo the length of the bit array to obtain a position. Each hash function yields its own position (two functions may occasionally land on the same one). Setting all of these positions to 1 completes the add operation.
  • To query a key, compute the same positions. If any one of those bits is 0, the key definitely does not exist; if they are all 1, the key does not necessarily exist.
  • In conclusion:
    • Reported present: the key may exist
    • Reported absent: the key definitely does not exist
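
The add and query steps described above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not a tuned implementation: the several hash functions are simulated by salting SHA-256 with a seed, and the names `BloomFilter`, `add`, and `might_contain` are chosen for this example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array, initially all zero

    def _positions(self, key):
        """Yield one bit position per hash function."""
        for seed in range(self.num_hashes):
            # Assumption: simulate independent hash functions by salting SHA-256.
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big")  # integer hash value
            yield index % self.num_bits                # modulo the array length

    def add(self, key):
        """Set every position computed for the key to 1."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bloom = BloomFilter(num_bits=1024, num_hashes=3)
bloom.add("apple")
bloom.add("banana")

print(bloom.might_contain("apple"))   # True: added keys are always reported present
print(bloom.might_contain("cherry"))  # almost certainly False; True would be a false positive
```

The method is named `might_contain` rather than `contains` to reflect that a positive answer is only probable.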

A closer look

  • When a variable is added to the set, N mapping functions map it to N points in the bitmap, and those points are set to 1 (for example, two variables each passing through 3 mapping functions).
  • When querying a variable, we only need to check whether these points are all 1 to know, with high probability, whether it is in the set
    • If any of these points is 0, the queried variable is definitely not in the set
    • If all of them are 1, the queried variable probably exists
    • Why "probably" rather than "definitely"? Because the mapping functions are hash functions, and hash functions have collisions.
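
The behaviour described above can be reproduced with a deliberately tiny filter. In the sketch below, the 16-bit array and the three modulo-based mapping functions are toy assumptions, chosen only so that the bit positions can be checked by hand. Real filters use far larger arrays and well-mixed hash functions.

```python
NUM_BITS = 16
bits = [0] * NUM_BITS


def positions(x):
    """Three toy mapping functions for integer keys (illustrative only)."""
    return [x % 16, x % 13, x % 11]


def add(x):
    for pos in positions(x):
        bits[pos] = 1


def might_contain(x):
    return all(bits[pos] == 1 for pos in positions(x))


add(20)  # sets bits 4, 7, 9
add(33)  # sets bits 1, 7, 0 (bit 7 is shared with 20)

print(might_contain(20))   # True: it was added
print(might_contain(50))   # False: bit 2 is still 0, so 50 is definitely absent
print(might_contain(100))  # True, although 100 was never added
```

The key 100 maps to bits 4, 9, and 1. Bits 4 and 9 were set by 20 and bit 1 by 33, so the filter cannot distinguish 100 from an element that was really added.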

Bloom filter false positives, and why elements cannot be deleted

  • A misjudgment happens when several inputs, after hashing, set the same bit positions to 1, so it is impossible to tell which input set a given bit. The root cause of misjudgment is therefore that the same bit is mapped to, and set to 1, by more than one input.
  • The same situation causes the Bloom filter's deletion problem: no bit belongs exclusively to one element, and it is quite likely that several elements share a given bit.
  • If we simply clear that bit, the other elements that share it are affected.
  • If the filter says an element is absent, it is definitely absent. If it says the element is present, the element does not necessarily exist.
  • Elements can be added to a Bloom filter but cannot be removed. Clearing an element's bits may make other elements that share those bits appear absent.
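
To see why clearing bits is unsafe, the following sketch uses a toy 16-bit filter with three modulo-based mapping functions (assumptions for illustration, so the positions can be verified by hand) and naively "deletes" one element.

```python
bits = [0] * 16


def positions(x):
    """Three toy mapping functions for integer keys (illustrative only)."""
    return [x % 16, x % 13, x % 11]


def might_contain(x):
    return all(bits[pos] == 1 for pos in positions(x))


# Add 20 (bits 4, 7, 9) and 33 (bits 1, 7, 0); they share bit 7.
for key in (20, 33):
    for pos in positions(key):
        bits[pos] = 1

print(might_contain(20))  # True

# Naive delete of 33: clear every bit it maps to, including the shared bit 7.
for pos in positions(33):
    bits[pos] = 0

print(might_contain(20))  # False: 20 was added, but is now reported absent
```

After the naive delete the filter gives a wrong "absent" answer for 20, which breaks the guarantee that a negative result is always correct.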

Summary

  • In use, the actual number of elements should not grow far beyond the number the filter was initialized for.
  • Once the actual number of elements exceeds that initial capacity, the Bloom filter should be rebuilt: allocate a larger filter and add all historical elements to it in batches.
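
The effect of overfilling can be estimated with the standard approximation for the false positive rate, p ≈ (1 - e^(-k*n/m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted elements. The formula is not part of the original text, and the filter dimensions below (9.6 million bits and 7 hash functions, intended for about one million elements) are assumed purely for illustration.

```python
import math


def estimated_false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


NUM_BITS = 9_600_000  # assumed filter size in bits
NUM_HASHES = 7        # assumed number of hash functions

for num_items in (1_000_000, 2_000_000, 5_000_000):
    rate = estimated_false_positive_rate(NUM_BITS, NUM_HASHES, num_items)
    print(f"{num_items:>9} elements -> estimated false positive rate {rate:.3f}")
```

With these assumed numbers the estimate is about 1% at one million elements and roughly 16% at two million, which is why an overfilled filter should be rebuilt with a larger size.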

Advantages and disadvantages of Bloom filter

  • Advantages
    • Efficient insertion and query, small footprint
  • Disadvantages
    • Elements cannot be deleted. Because of hash collisions, a single position may be shared by several elements, so deleting one element may also delete others and lead to misjudgments.
    • Misjudgments exist. Different data may hash to the same positions.

Origin blog.csdn.net/m0_56709616/article/details/130965518