Principle and Practice of Redis Bloom Filter

Background

Under highly concurrent load, business systems usually cache data to increase throughput, because disk I/O and network I/O are hundreds or thousands of times slower than memory access. A simple calculation: if reading some data from the database's disk takes 0.1 s and transferring it through the switch takes 0.05 s, each request takes at least 0.15 s (real disk and network I/O are not this slow; this is just an example), so a database server handling requests one at a time can respond to fewer than 7 requests per second. If the data is in local memory and takes only 10 µs to read, the server can respond to 100,000 requests per second. Caching means storing frequently used data closer to the CPU to reduce data transfer time and thereby improve processing efficiency.

[Figure: cache]

However, caching brings its own problems, such as cache breakdown and cache penetration. To deal with them, we can introduce a Bloom filter so that excessive pressure on the database does not cause system failures.

Bloom Filter

To judge whether an element is in a set, the usual approach is to store all the elements and then compare against them. Linked lists, trees, and similar data structures all work this way. But as the number of elements grows, they need more and more storage space, and lookups get slower and slower (O(n) or O(log n)).

There is, however, another data structure called a hash table, which uses a hash function to map an element to a point in a bit array. We then only need to check whether that point is 1 to know whether the element is in the set. This is the basic idea behind the Bloom filter.

The Bloom filter was proposed by Bloom in 1970. It consists of a very long binary vector and a series of random mapping functions. You can think of it as an imprecise set structure: when you use its contains method to check whether an object exists, it may give a wrong answer. But a Bloom filter is not wildly inaccurate. As long as the parameters are set sensibly, its accuracy can be controlled quite well, and misjudgments occur only with a small probability.

When a Bloom filter says that a value exists, the value may in fact not exist; when it says a value does not exist, it definitely does not exist. By analogy, when it says it does not know you, it really does not; but when it says it knows you, that may be only because you look like another friend it knows, so it mistakenly thinks it knows you.

Principle

When an element is added to the Bloom filter, the following operations are performed:

  1. Hash the element with each of the Bloom filter's hash functions to get the hash values (one value per hash function).
  2. For each hash value, set the bit at the corresponding index of the bit array to 1.

When we need to determine whether an element exists in the Bloom filter, we will perform the following operations:

  1. Perform the same hash calculations on the given element.
  2. Check whether the bit at each resulting index of the bit array is 1. If all of them are 1, the element may be in the Bloom filter.

If any of those bits is not 1, the element is definitely not in the Bloom filter.
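The add and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Redis implements it: the class name `BloomFilter`, the method name `might_contain`, and the use of SHA-256 with double hashing to derive the k positions are all assumptions made for this example.

```python
import hashlib
from typing import List


class BloomFilter:
    """Toy Bloom filter: an m-bit array plus k derived hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m            # number of bits in the array
        self.k = k            # number of hash functions
        self.bits = [0] * m   # every bit starts at 0

    def _positions(self, item: str) -> List[int]:
        # Derive k indexes from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # make h2 odd so it is never 0
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set the bit at every index the hash functions produce.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True only if every bit is 1; a single 0 means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

After `bf = BloomFilter(m=1024, k=3)` and `bf.add("u1")`, the call `bf.might_contain("u1")` returns True. For an element that was never added, the result is usually False, but it can be True if all of its bits happen to have been set by other elements.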

[Figure: bloom_1]

As shown above,

  1. There are 8 bits, all 0 by default.
  2. When the word tencent is hashed, bits 2, 5, and 7 are set to 1.
  3. Next, cloud is hashed, and bits 3, 4, and 6 are set to 1.
  4. If we then hash another word and get, say, bits 1, 2, and 3, bit 1 is still 0, so we can conclude that the word is not in the set.
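The walk-through above can be replayed with the bit positions hard-coded. The positions are copied from the example rather than computed by a real hash function, and the sketch assumes the bits are numbered from 1; the helper names `set_bits` and `all_set` are made up for illustration.

```python
# Hash results taken from the example above (assumed 1-based bit numbers).
POSITIONS = {
    "tencent": [2, 5, 7],
    "cloud": [3, 4, 6],
}

bits = [0] * 8  # 8 bits, all 0 initially


def set_bits(positions):
    for p in positions:
        bits[p - 1] = 1  # convert a 1-based bit number to a 0-based index


def all_set(positions):
    return all(bits[p - 1] == 1 for p in positions)


set_bits(POSITIONS["tencent"])
set_bits(POSITIONS["cloud"])

query = [1, 2, 3]      # positions assumed for the word being looked up
print(bits)            # [0, 1, 1, 1, 1, 1, 1, 0]
print(all_set(query))  # False, because bit 1 is still 0
```

Note that six of the eight bits are already 1 after only two insertions, so most other lookups in such a small array would report "may exist". This is why the array length matters.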

Summary of the process: [Figure: bloom_2]

How to choose the number of hash functions and the length of the Bloom filter

If the Bloom filter is too short, all of its bits will soon be set to 1, and every query will return "may exist". If it is too long, the false positive rate will be very small, but memory is wasted. Likewise, the more hash functions there are, the faster the bits fill up; the fewer there are, the higher the false positive rate. The length of the Bloom filter and the number of hash functions therefore have to be balanced against the business scenario.

Three parameters

  • The number of hash functions k;
  • The capacity m of the Bloom filter's bit array;
  • The number of elements n inserted into the Bloom filter.

An expert summarized how to set these three values in an article about ten years ago. It requires a solid mathematical background, so I won't go into details here. The article is at http://blog.csdn.net/jiaomeng/article/details/1495500

The main mathematical conclusions are:

  • The false positive rate is lowest when k = ln 2 * (m/n).
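As a rough sizing aid, the relation k = ln 2 * (m/n) can be combined with the standard companion formula m = -n * ln(p) / (ln 2)^2, where p is the target false positive rate. That second formula is not stated in this article; it is the usual textbook result. The function names below are made up for this sketch.

```python
import math


def optimal_bits(n: int, p: float) -> int:
    """Bits m needed to hold n elements at target false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hashes(m: int, n: int) -> int:
    """Number of hash functions k = ln 2 * (m/n), rounded, at least 1."""
    return max(1, round(math.log(2) * m / n))


n, p = 1000, 0.01          # 1000 expected elements, 1% target error rate
m = optimal_bits(n, p)
k = optimal_hashes(m, n)
print(m, k)
```

For 1,000 elements at a 1% error rate this works out to roughly 9,600 bits (about 1.2 KB) and 7 hash functions.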

Pros and cons

Advantages: The bit array takes up very little memory, and insertion and lookup are both fast.

Disadvantages: The false positive rate rises as more data is added; the filter can never confirm that an element definitely exists; and, importantly, elements cannot be deleted.
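To see how the false positive rate grows with the amount of data, one can evaluate the standard approximation p = (1 - e^(-k*n/m))^k for a fixed filter while n increases. The formula is the usual textbook estimate and is not taken from this article; m = 9586 and k = 7 are illustrative values, roughly what a filter sized for 1,000 elements at a 1% error rate would use.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive rate of an m-bit, k-hash filter holding n elements."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, k = 9586, 7  # illustrative sizing (assumption, see the text above)
for n in (500, 1000, 2000, 4000):
    print(n, round(false_positive_rate(m, k, n), 4))
```

The rate stays near the 1% target at 1,000 elements and climbs steeply once the filter holds several times more elements than it was sized for.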

Use cases

  • Checking whether an element exists in a very large data set: this enables de-duplication. If your server has enough memory, a HashMap is a good solution, and in theory its time complexity is O(1). But once the data volume becomes very large, a Bloom filter becomes the practical choice.

  • Mitigating cache penetration (the problem mentioned in the background): we can load the primary keys used in data queries, such as user IDs or article IDs, into a Bloom filter in advance. When querying data by ID, we first check whether the ID exists in the filter. If it does, we proceed to the next step; if it does not, we return immediately, so no database query is triggered. Note that cache penetration cannot be eliminated completely this way; we can only keep it within a tolerable range.

  • Filtering in crawlers, mail systems, and similar systems: you may have noticed that some legitimate emails end up in the spam folder. This can be the result of a Bloom filter false positive.
  • Google Chrome uses Bloom filters to identify malicious URLs.

Basic operation

The Bloom filter has two basic commands: bf.add adds an element, and bf.exists checks whether an element exists. They are used in the same way as sadd and sismember on a set. Note that bf.add adds only one element at a time; to add several at once, use bf.madd. Likewise, to check several elements at once, use bf.mexists.

127.0.0.1:6379> bf.reserve users 0.01 1000
OK
127.0.0.1:6379> bf.add users u1
(integer) 1
127.0.0.1:6379> bf.add users u2
(integer) 1
127.0.0.1:6379> bf.add users u3
(integer) 1
127.0.0.1:6379> bf.exists users u3
(integer) 1
127.0.0.1:6379> bf.madd users user4 user5 user6
1) (integer) 1
2) (integer) 1
3) (integer) 1
127.0.0.1:6379> bf.mexists users user4 user5 user6 user7
1) (integer) 1
2) (integer) 1
3) (integer) 1
4) (integer) 0

bf.reserve takes three parameters: key, error_rate, and initial_size.

The lower the error_rate, the more space is required. Where high precision is not needed, a somewhat larger value does no harm. In a content push system, for example, false positives filter out only a small portion of the content, and the overall viewing experience is not greatly affected.

initial_size is the expected number of elements. Once the actual number exceeds this value, the false positive rate rises, so set a sufficiently large value in advance to keep the error rate from being exceeded.

If bf.reserve is not used, the default error_rate is 0.01 and the default initial_size is 100.
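The same commands can be issued from application code. The sketch below assumes the redis-py client and a Redis server with the Bloom filter module loaded; it only relies on redis-py's generic `execute_command` method, and the wrapper function names are made up for this example. The `initial_size` parameter name follows this article's wording.

```python
from typing import List


def connect(host: str = "127.0.0.1", port: int = 6379):
    # Imported here so the helpers below work with any compatible client.
    import redis  # third-party package: pip install redis
    return redis.Redis(host=host, port=port)


def bf_reserve(client, key: str, error_rate: float, initial_size: int):
    # BF.RESERVE must run before the key exists, otherwise Redis returns an error.
    return client.execute_command("BF.RESERVE", key, error_rate, initial_size)


def bf_add(client, key: str, item: str) -> bool:
    # True if the element was newly added, False if it may already exist.
    return bool(client.execute_command("BF.ADD", key, item))


def bf_exists(client, key: str, item: str) -> bool:
    return bool(client.execute_command("BF.EXISTS", key, item))


def bf_madd(client, key: str, items: List[str]) -> List[bool]:
    return [bool(r) for r in client.execute_command("BF.MADD", key, *items)]


def bf_mexists(client, key: str, items: List[str]) -> List[bool]:
    return [bool(r) for r in client.execute_command("BF.MEXISTS", key, *items)]
```

With a running server, `client = connect()` followed by `bf_reserve(client, "users", 0.01, 1000)` and `bf_add(client, "users", "u1")` mirrors the first commands of the session shown above.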


Author: merlinfeng
 


Origin blog.csdn.net/m0_50180963/article/details/112725247