Bloom filter and its usage

1 Definition

  Bloom Filter (BF) is a binary vector data structure with high time and space efficiency proposed by Howrad Bloom in 1970. It is used to detect whether an element belongs to this set. Note that the Bloom filter only determines whether the element appears in the set, and cannot give the specific position of the element in the set.

1.1 Constructing a Bloom filter

  Regarding how to construct a Bloom filter, the following is a set S = { S 1 , S 2 , S 3 } S=\{S_{1},S_{2} ,S_{3}\} S={ S1,S2,S3} is explained as an example. First of all, let me explain that although Bloom filters also need to use hash query algorithms, unlike traditional hash query algorithms, Bloom filters do not store specific elements in the collection, but use multiple hash functions to The elements in the collection are mapped to binary bit strings. Once a collection element is mapped to the corresponding binary bit, change the corresponding position to 1.
Insert image description here
  In the above Bloom filter, each position occupies 1 bit. If each position is changed to multiple bits, the information can be expressed. For example, the counting Bloom filter shown in the figure below.
Insert image description here
Note: Bloom filters generally do not delete elements.

1.2 Bloom filter query process

  During the query process, relevant elements need to be calculated S i S_{i} SiMultiple hash addresses corresponding to , and then check whether the corresponding positions in the Bloom filter are all 1. If they are all 1, then S i S_{i} Simay on the Bloom filter (called a false positive problem); if not all 1, then the elements S i S_{ i} Simust not be on Bloom filter.

1.3 Advantages

  The main advantages of Bloom filter are as follows:

  • Low time complexity. Whether it is an insertion or a query operation, the time complexity of the Bloom filter is O ( k ) O(k) O(k), among them, k k k is the number of hash functions.
  • Confidentiality is good. Bloom filters do not store the specific values ​​of elements.
  • Takes up little space.

  The main disadvantage of the Bloom filter is: there is a false positive rate. When two different values ​​produce the same hash value, the Bloom filter cannot determine whether the query element actually exists in the collection. In order to reduce the false positive rate as much as possible, multiple hash functions are generally set up in Bloom filters.

2 Using bloom filter in python

2.1 Installation package

  There are many third-party packages that implement bloom filters in python. The one used here is pybloom-live (pybloomfiltermmap and other packages encounter errors when installing them on Windows and cannot solve them).

pip install pybloom_live
2.2 Simple use case
from pybloom_live import BloomFilter

bf=BloomFilter(capacity=100000,error_rate=0.001)
#构建布隆过滤器
for i in range(100000):
    bf.add(i)

#查询元素
for i in [1,200,345,100323,3233232,'hello']:
    print(i in bf)

The result is as follows:

True
True
True
False
False
False

Replenish

  • What is cache penetration?
    Cache penetration means that in a high-concurrency scenario, a certain Key in the cache (including local cache and Redis cache) is not hit by high-concurrency access. At this time, the data will be accessed back to the database. , causing the database to execute a large number of query operations concurrently, causing huge pressure on it.
    Using Bloom filters can solve the problem of cache penetration. For example, in the redis database, all keys are first stored in the Bloom filter. When a request comes in, the key is first checked in the filter to see if it exists. If it does not exist, null is returned directly.

Guess you like

Origin blog.csdn.net/yeshang_lady/article/details/133382518