Data structure: Bloom filter

Table of Contents

1. Background

2. Query whether a string exists in massive data

2.1 Use set and map

2.2 unordered_map

2.3 Summary

3. Bloom filter

3.1 Principle

3.2 False positive rate


1. Background

  • When editing a document in Word, how does Word judge whether a word is spelled correctly?

  • How can a web crawler avoid crawling the same URL twice? (Some error is allowable.)

  • How would you design a spam (SMS) filtering algorithm? (Some error is allowable.)

  • When handling a case, how do the police judge whether a suspect is on the fugitive list? (The false positive rate must be controlled.)

  • How can the cache penetration problem be solved? (Some error is allowable.)

1. Reading steps:

    1) Query redis first; if the key exists, return it directly; if it does not exist, go to 2;

    2) Query mysql; if the key does not exist, return directly; if it exists, go to 3;

    3) Write the key found in mysql back to redis.

2. Cache penetration:

Attackers frequently request data that exists in neither redis nor mysql. Every such request falls through to mysql, and the excessive pressure on mysql can paralyze the entire system.

3. Solution:

    1) Store a <key,null> key-value pair on the redis side to avoid frequent access to mysql. The disadvantage is that too many <key,null> pairs take up too much memory. (The <key,null> pair can be given an expiration time, for example 500 ms. Even under frequent requests, only redis is hit frequently, and mysql is queried at most once every 500 ms for that key.)

    2) Keep a Bloom filter on the server side and put every key contained in mysql into it; the Bloom filter can then reject requests for data that definitely does not exist.
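The read path with solution 2) in front of it can be sketched as follows. This is only an illustration under stated assumptions: the `ReadPath` type, its members, and the `may_exist` predicate are hypothetical names, two in-memory maps stand in for redis and mysql, and the Bloom filter is represented only by a predicate that may return false positives but never false negatives.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the redis/mysql read path described above.
struct ReadPath {
    std::unordered_map<std::string, std::string> cache;  // stand-in for redis
    std::unordered_map<std::string, std::string> db;     // stand-in for mysql
    // Stand-in for the Bloom filter query: "false" means definitely absent.
    std::function<bool(const std::string&)> may_exist;
    int db_hits = 0;  // how many requests actually reached the database

    // Returns true and fills `out` when the key is found.
    bool get(const std::string& key, std::string& out) {
        assert(!key.empty());
        auto cached = cache.find(key);
        if (cached != cache.end()) {      // step 1: cache hit
            out = cached->second;
            return true;
        }
        if (may_exist && !may_exist(key)) {
            return false;                 // filtered: never touches the database
        }
        ++db_hits;                        // step 2: query the database
        auto row = db.find(key);
        if (row == db.end()) {
            return false;                 // false positive of the filter, or no filter
        }
        cache[key] = row->second;         // step 3: write back to the cache
        out = row->second;
        return true;
    }
};
```

With a predicate installed, repeated requests for keys that were never stored return early without incrementing `db_hits`, which is the effect the Bloom filter is meant to have on mysql.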

2. Query whether a string exists in massive data

2.1 Use set and map

The set and map structures in the C++ Standard Library (STL) are implemented with red-black trees, and the time complexity of insertion, deletion, modification, and lookup is O(log₂ n).

  • For a perfectly balanced binary search tree (an AVL tree comes close to this) holding 1,000,000 items, about 20 comparisons at most are needed to find a value; for 1 billion items, about 30 comparisons at most; that is, the number of comparisons is bounded by the height of the tree;
  • For red-black trees, balance is defined by the black height, so the difference in path lengths has to be taken into account for the number of comparisons. The best case is a path made up only of black nodes; call its height h1. The worst case is a path that alternates black and red nodes, in which case the height is 2*h1;
  • Each node in the red-black tree stores key and val fields, and key is the field used for comparison. The red-black tree itself does not require the key field to be unique; uniqueness is enforced by the implementations of set and map. Let's look at the red-black tree implementation in nginx:

Nginx source code address: http://hg.nginx.org/nginx/file

The implementation position of ngx_rbtree_insert_value in nginx

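// Part of the insert operation; after this function runs, the caller still has to check whether the tree is balanced after the insertion (mainly whether the new node's parent is also a red node)
// temp: root node of the red-black tree; node: the node to be inserted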
void
ngx_rbtree_insert_value(ngx_rbtree_node_t *temp, ngx_rbtree_node_t *node,
    ngx_rbtree_node_t *sentinel)
{
    ngx_rbtree_node_t  **p;

    for ( ;; ) {

        p = (node->key < temp->key) ? &temp->left : &temp->right;

        if (*p == sentinel) {
            break;
        }

        temp = *p;
    }

    *p = node;
    node->parent = temp;
    node->left = sentinel;
    node->right = sentinel;
    ngx_rbt_red(node);
}
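To make the comparison counts from section 2.1 concrete, here is a small sketch that counts how many key comparisons a single `std::set::find` performs. The `CountingLess` comparator and the `comparisons_for_lookup` helper are hypothetical names introduced for this illustration, and the exact count depends on the standard library implementation; only the logarithmic order of magnitude is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Comparator that behaves like operator< but counts every call.
struct CountingLess {
    std::size_t* counter;
    bool operator()(int a, int b) const {
        ++*counter;
        return a < b;
    }
};

// Builds a set of the integers [0, n) and returns the number of
// comparisons spent on one lookup of `target`.
inline std::size_t comparisons_for_lookup(int n, int target) {
    assert(n > 0);
    std::size_t counter = 0;
    std::set<int, CountingLess> s(CountingLess{&counter});
    for (int i = 0; i < n; ++i) {
        s.insert(i);
    }
    counter = 0;            // ignore comparisons made while inserting
    (void)s.find(target);   // only the lookup is measured
    return counter;
}
```

For 100,000 elements the count stays in the low tens, in line with a tree whose height is at most about twice log₂ n, while n itself is five orders of magnitude larger.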

2.2 unordered_map

unordered_map is implemented with a hash table.

Advantages: faster access; no need for repeated string comparisons along a tree path;

Disadvantages: a strategy for handling collisions is required, and storage efficiency is low; it trades space for time.

2.3 Summary

Neither the red-black tree nor the hash table can handle massive data. Both have to store the actual strings, and for a large enough data set that would take hundreds of gigabytes of memory, which is not practical. We therefore need a solution that does not store the keys and still keeps the advantage of the hash table (no need to compare strings).

3. Bloom filter

3.1 Principle

A Bloom filter is a clever probabilistic data structure with efficient insertion and query. It can tell you that "something definitely does not exist, or may exist".

Compared with traditional data structures such as List, Set, and Map, it is more efficient and takes up less space; the disadvantage is that it returns a probabilistic result rather than a certain one.

Let's first describe the Bloom filter data structure visually.

The principle of the Bloom filter is that when an element is added to the set, K hash functions map the element to K positions in a bit array, and those positions are set to 1. When querying, we only need to check whether these positions are all 1 to know (approximately) whether the element is in the set:

  • If any of these positions is 0, the queried element definitely does not exist;
  • If all of them are 1, the queried element may exist (the probability of a wrong "may exist" answer can be tuned).
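The add and query rules above can be sketched as a small class. This is a minimal illustration, not a production design: the `BloomFilter` class and its method names are hypothetical, and deriving the k positions from two base hashes (`std::hash` and FNV-1a, combined as h1 + i*h2) is an assumption made here to keep the example short; the article itself does not prescribe which hash functions to use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    BloomFilter(std::size_t m_bits, std::size_t k_hashes)
        : bits_(m_bits, false), k_(k_hashes) {
        assert(m_bits > 0 && k_hashes > 0);
    }

    // Set the k positions of the key to 1.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) {
            bits_[index(key, i)] = true;
        }
    }

    // false: the key was definitely never added.
    // true:  the key may have been added (false positives are possible).
    bool possibly_contains(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (!bits_[index(key, i)]) {
                return false;
            }
        }
        return true;
    }

private:
    // 64-bit FNV-1a, used here as the second base hash (an assumption).
    static std::uint64_t fnv1a(const std::string& s) {
        std::uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // i-th position for the key: (h1 + i * h2) mod m.
    std::size_t index(const std::string& key, std::size_t i) const {
        const std::uint64_t h1 = std::hash<std::string>{}(key);
        const std::uint64_t h2 = fnv1a(key) | 1ULL;  // keep the step non-zero
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

Note that the class stores only bits, never the keys themselves, which is what makes it suitable for the massive-data case discussed in section 2.3.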

Suppose this Bloom filter is designed with three hash functions:

Hash1("bytedance") = 0; Hash2("bytedance") = 3; Hash3("bytedance") = 8;

Hash1("tencent") = 1; Hash2("tencent") = 3; Hash3("tencent") = 5;

After both strings are inserted, positions 0, 1, 3, 5, and 8 are set to 1. When searching for "bytedance", hashing yields positions 0, 3, and 8; these positions are all 1, so the Bloom filter tells us that "bytedance" may exist.

If we now query the value "huawei" and its hash values happen to be 3, 5, and 8, all three positions are already 1, so the Bloom filter tells us that "huawei" may exist, even though "huawei" was never added. This is a false positive, and it is the disadvantage of the Bloom filter.
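The example above can be replayed in a few lines. To stay faithful to the numbers in the text, this sketch skips real hashing and feeds the positions in directly; the `ToyBloom` type, the constant names, and the bitmap size of 10 bits are assumptions made for the illustration (the text only requires that index 8 exists).

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

using Positions = std::array<std::size_t, 3>;

// Positions taken from the example in the text.
const Positions kBytedance = {{0, 3, 8}};
const Positions kTencent   = {{1, 3, 5}};
const Positions kHuawei    = {{3, 5, 8}};  // never inserted

class ToyBloom {
public:
    static const std::size_t kBits = 10;  // assumed bitmap size

    void add(const Positions& p) {
        for (std::size_t i : p) {
            assert(i < kBits);
            bits_.set(i);
        }
    }

    bool possibly_contains(const Positions& p) const {
        for (std::size_t i : p) {
            if (!bits_.test(i)) {
                return false;  // a 0 bit proves the element was never added
            }
        }
        return true;           // all bits are 1: may exist
    }

private:
    std::bitset<kBits> bits_;
};

// Inserts "bytedance" and "tencent", as in the text.
inline ToyBloom build_example() {
    ToyBloom f;
    f.add(kBytedance);
    f.add(kTencent);
    return f;
}
```

Querying `kHuawei` on the filter returned by `build_example()` answers "may exist" because bits 3, 5, and 8 were all set by the other two strings, while a query that touches any unset bit, such as position 2, is rejected with certainty.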

A Bloom filter does not support deletion, because each slot in the bitmap has only two states (0 or 1). Once a slot is set to 1, there is no record of how many times it has been set, that is, how many elements map to it or through which hash function. Clearing the slot for one element could therefore break other elements that share it, so the delete operation is not supported.
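A short sketch shows what would go wrong if bits were simply cleared on deletion. It reuses the positions from the example above ("bytedance" at 0, 3, 8 and "tencent" at 1, 3, 5); the `NaiveBloom` type, the `naive_remove` method, and the 10-bit bitmap are hypothetical and exist only to demonstrate the failure.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

using BitPositions = std::array<std::size_t, 3>;

struct NaiveBloom {
    std::bitset<10> bits;  // assumed bitmap size

    void add(const BitPositions& p) {
        for (std::size_t i : p) {
            assert(i < bits.size());
            bits.set(i);
        }
    }

    // Incorrect on purpose: clears bits that other elements may share.
    void naive_remove(const BitPositions& p) {
        for (std::size_t i : p) {
            bits.reset(i);
        }
    }

    bool possibly_contains(const BitPositions& p) const {
        for (std::size_t i : p) {
            if (!bits.test(i)) {
                return false;
            }
        }
        return true;
    }
};

// Returns true if removing "tencent" makes the filter deny "bytedance".
inline bool removal_causes_false_negative() {
    const BitPositions bytedance = {{0, 3, 8}};
    const BitPositions tencent   = {{1, 3, 5}};
    NaiveBloom f;
    f.add(bytedance);
    f.add(tencent);
    f.naive_remove(tencent);  // also clears bit 3, which bytedance needs
    return !f.possibly_contains(bytedance);
}
```

Because both strings share position 3, clearing it for "tencent" makes the filter claim that "bytedance" definitely does not exist, which violates the one guarantee a Bloom filter is supposed to give.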

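3.2 False positive rate

A Bloom filter can say with certainty that an element does not exist, but it cannot say with certainty that an element exists. When it reports "may exist" for an element that was never added, that is a false positive, and the false positive rate is the probability of such a wrong judgment.

n -- number of elements in the Bloom filter; as in the figure above, with only "tencent" and "bytedance" inserted, n = 2
p -- false positive rate, a value between 0 and 1
m -- size of the bitmap, in bits
k -- number of hash functions
The formulas are as follows:
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))));
k = round((m / n) * log(2));

/*
round: round to the nearest integer

ceil: round up

exp: exponential with base e

pow: power

log: natural logarithm
*/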

In practice, we decide on n and p first and then compute m and k with the formulas above; you can also pick suitable values on the following website: https://hur.st/bloomfilter/
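The formulas for m, k, and p translate almost directly into code. The sketch below is one possible transcription; the `BloomParams` struct and the function names are hypothetical, and clamping k to at least 1 is an addition made here so that very large p values do not produce a filter with zero hash functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct BloomParams {
    std::uint64_t m;  // bitmap size in bits
    unsigned k;       // number of hash functions
};

// Computes m and k from the expected element count n and target rate p.
inline BloomParams size_for(std::uint64_t n, double p) {
    assert(n > 0 && p > 0.0 && p < 1.0);
    const double ln2 = std::log(2.0);
    const double nd = static_cast<double>(n);
    // m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
    const double m = std::ceil((nd * std::log(p)) /
                               std::log(1.0 / std::pow(2.0, ln2)));
    // k = round((m / n) * log(2)), clamped to >= 1 (clamp is an assumption)
    double k = std::round((m / nd) * ln2);
    if (k < 1.0) {
        k = 1.0;
    }
    return {static_cast<std::uint64_t>(m), static_cast<unsigned>(k)};
}

// p = pow(1 - exp(-k / (m / n)), k)
inline double false_positive_rate(std::uint64_t n, std::uint64_t m, unsigned k) {
    assert(n > 0 && m > 0 && k > 0);
    const double kd = static_cast<double>(k);
    const double bits_per_element = static_cast<double>(m) / static_cast<double>(n);
    return std::pow(1.0 - std::exp(-kd / bits_per_element), kd);
}
```

For example, `size_for(1000000, 0.01)` asks for roughly 9.6 million bits (a little over 1 MiB) and 7 hash functions, and feeding those values back into `false_positive_rate` gives a rate close to the requested 1%.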

In the example chart, the false positive rate p rises sharply once the number of elements n grows past 823.

 

The false positive rate p decreases as the bitmap size m grows: the larger the space, the lower the probability of collisions.

 

The chart of p against k shows that adding hash functions lowers the false positive rate at first, up to an optimal value; beyond that value, the false positive rate starts to rise again. This is easy to understand: the more hash functions there are, the more bits each stored value sets to 1, so once the number of hash functions exceeds a critical value, the bitmap fills up and collisions increase significantly.
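The existence of an optimal number of hash functions can be checked numerically with the formula for p from this section. The sketch below fixes the ratio m/n (bits per element) and sweeps k; the function names `fp_rate` and `best_k` and the upper limit `max_k` are hypothetical choices for this illustration.

```cpp
#include <cassert>
#include <cmath>

// p = pow(1 - exp(-k / (m / n)), k), with m / n passed as bits_per_element.
inline double fp_rate(double bits_per_element, unsigned k) {
    assert(bits_per_element > 0.0 && k > 0);
    const double kd = static_cast<double>(k);
    return std::pow(1.0 - std::exp(-kd / bits_per_element), kd);
}

// Tries k = 1 .. max_k and returns the k with the lowest false positive rate.
inline unsigned best_k(double bits_per_element, unsigned max_k) {
    assert(max_k >= 1);
    unsigned best = 1;
    for (unsigned k = 2; k <= max_k; ++k) {
        if (fp_rate(bits_per_element, k) < fp_rate(bits_per_element, best)) {
            best = k;
        }
    }
    return best;
}
```

With 10 bits per element the sweep settles on k = 7, which agrees with round((m / n) * log(2)) = round(6.93); both k = 1 and k = 20 give a noticeably higher false positive rate than k = 7, matching the U-shaped curve described above.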

Origin blog.csdn.net/weixin_40179091/article/details/112521477