Bloom filter technology principles and practical applications

Two weeks ago I shared an article on the geohash algorithm: GeoHash technical principles and practical applications. This week I bring you another data structure with a very clever design: the Bloom filter.

1 Deduction of Bloom filter ideas

1.1 Tribute to Zuo Shen

Let’s start with a practical scenario problem.

I first came across this scenario, and the concept of the Bloom filter itself, while watching Zuo Shen's algorithm class at Xiaopo Station during my self-study programming days. I quote his example here as a small tribute to Zuo Shen. Thanks to everyone who has helped me along the way ^_^

1.2 Scenario issues

The scenario is as follows:

We are building a crawler. We already have a list holding a large number of URLs as the input source. The crawler must visit each URL and, based on the content of each fetched page, keep expanding outward to every URL link it encounters.

Because the web is a graph, the same URL may be reached repeatedly, so we need a sound deduplication mechanism to keep the crawler from falling into an endless loop.

Now assume there are 1 billion URLs in total and the whole process must run within the memory of a single machine. How should it be designed and implemented?

1.3 Rough solution

The simplest and most intuitive approach is to maintain a set of URLs that have already been visited. Before processing a URL, check whether it is in the set: if it is, skip it; if not, add it to the set and process it.

This achieves the goal, but the cost has to be considered. With 1 billion URLs and an assumed size of 16 bytes per URL, the set needs 1 billion * 16 bytes = 16 GB of memory, which is far too heavy for a single machine.

1.4 bitmap solution

Notice that in this scenario we only need to know whether a URL is already in the set; we never need its actual content. So let's imagine using a single bit to mark each URL: 0 means the URL has not been seen, 1 means it has. That is enough for our purposes, and it cuts the cost of recording one URL from 16 bytes (128 bits) to 1 bit, so the space consumed is only 1/128 of the original plan.
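The two estimates can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and it assumes exactly one bit per URL and decimal units (1 GB = 10^9 bytes).

```go
package main

import "fmt"

const (
    urlCount    = 1_000_000_000 // 1 billion URLs, as in the scenario
    bytesPerURL = 16            // assumed URL size from section 1.3
)

// setBytes is the memory needed to store every URL verbatim.
func setBytes() int64 { return int64(urlCount) * bytesPerURL }

// bitmapBytes is the memory needed when each URL costs a single bit.
func bitmapBytes() int64 { return int64(urlCount) / 8 }

func main() {
    fmt.Printf("plain set: %d GB\n", setBytes()/1_000_000_000)
    fmt.Printf("bitmap:    %d MB\n", bitmapBytes()/1_000_000)
    fmt.Printf("ratio:     1/%d\n", setBytes()/bitmapBytes())
}
```

Under these assumptions the bitmap needs 125 MB instead of 16 GB, which fits comfortably on a single machine.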

Using a bitmap is certainly a good idea, but how do we map a URL to a bit? The solution that naturally comes to mind is hashing:

  • First, fix the total length of the bitmap, say m.
  • Next, compute a hash value for each URL.
  • Finally, take the hash value modulo m to obtain the index of the corresponding bit.

This gives us the mapping url -> bit, and because a hash function scatters its outputs, the bits belonging to different URLs are spread as evenly as possible across the bitmap.
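The three steps can be sketched as follows. This is a minimal illustration, not the implementation used later in the article: it uses FNV-1a only because it ships with the Go standard library, and the function name bitIndex and the bitmap length are hypothetical.

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// bitIndex maps a URL to a bit position in a bitmap of length m.
func bitIndex(url string, m uint32) uint32 {
    h := fnv.New32a()
    _, _ = h.Write([]byte(url)) // hash the URL
    return h.Sum32() % m        // take the hash modulo m
}

func main() {
    const m = 1 << 20 // hypothetical bitmap length in bits
    for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
        fmt.Printf("%s -> bit %d\n", u, bitIndex(u, m))
    }
}
```

The same URL always lands on the same bit, and every index is smaller than m.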

However, hash functions come with an unavoidable problem: hash collisions. The input domain of a hash function is infinite while its output domain is finite, so different inputs will inevitably produce the same output. This phenomenon is called a hash collision.

For more information about hashing, please see my previous article - Golang map implementation principle.

What's more, after hashing a URL we also take the result modulo the bitmap length m, which makes it even more likely that several different URLs map to the same bit.

In the end, we still cannot establish a strict one-to-one mapping between URLs and bits, so a bit can no longer tell us exactly whether a given URL exists.

1.5 Bloom filter

Building on the bitmap solution, let's change the way we look at the problem.

Because of hash collisions, a bitmap can "misjudge" whether a URL exists. In which situations does this misjudgment occur, and what are its consequences?

  • If a URL exists, can the bitmap misjudge it as not existing?

The answer is no. A hash function is deterministic: the same URL always produces the same hash value and is therefore always mapped to the same bit. If the URL has been entered before, that bit must already be 1, so later queries will report that it exists.

  • If a URL does not exist, can the bitmap misjudge it as existing?

The answer is maybe. Because of hash collisions, several URLs may map to the same bit. Suppose url1 and url2 both map to bit1. Once url1 has been entered, bit1 is set to 1, so when url2 arrives for the first time its bit is already 1 and it is misjudged as existing, even though it has never been entered.
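Both answers can be reproduced with a deliberately tiny bitmap. The sketch below is an illustration under assumed names (tinyBitmap, findCollision) with FNV-1a standing in as the hash function; with only 8 bits, the pigeonhole principle guarantees that two different URLs share a bit.

```go
package main

import (
    "fmt"
    "hash/fnv"
)

const m = 8 // deliberately tiny bitmap so that collisions are easy to find

// tinyBitmap holds 8 bits, one per slot.
type tinyBitmap uint8

func bitIndex(url string) uint32 {
    h := fnv.New32a()
    _, _ = h.Write([]byte(url))
    return h.Sum32() % m
}

func (b *tinyBitmap) add(url string)        { *b |= 1 << bitIndex(url) }
func (b tinyBitmap) exists(url string) bool { return b&(1<<bitIndex(url)) != 0 }

// findCollision returns two different URLs that map to the same bit.
// With m slots, a collision must show up within the first m+1 URLs.
func findCollision() (string, string) {
    seen := make(map[uint32]string)
    for i := 0; ; i++ {
        u := fmt.Sprintf("https://example.com/page/%d", i)
        idx := bitIndex(u)
        if prev, ok := seen[idx]; ok {
            return prev, u
        }
        seen[idx] = u
    }
}

func main() {
    url1, url2 := findCollision()
    var b tinyBitmap
    b.add(url1)
    fmt.Println(b.exists(url1)) // true: an inserted URL is always found
    fmt.Println(b.exists(url2)) // true as well, although url2 was never added
}
```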

As for non-existent URLs being misjudged as existing: first, this only affects the small number of URLs that suffer a hash collision. Second, our crawler targets big-data scenarios, where what matters is the overall statistical picture and the order of magnitude of the data, so skipping a small number of URLs is acceptable.

The bitmap-based solution is therefore already acceptable. What remains is to reduce the probability of misjudgment as much as possible through sensible design.

The method we adopt is to increase the number of hash functions. If we go from 1 hash function to k, each URL corresponds to k bits, and a misjudgment only occurs when all k bits have already been set to 1 by other URLs. Compared with a single bit, the probability of misjudgment is greatly reduced, and its exact value can be derived mathematically.

At this point the idea behind the Bloom filter is complete. Let's step back and define it:

A Bloom filter consists of a bitmap and a series of random mapping functions. It does not store the data itself, only whether the data exists. Its biggest advantages are excellent space efficiency and query efficiency.

1.6 Advantages and Disadvantages of Bloom Filters

The advantages and disadvantages of Bloom filters are summarized below.

Advantages:

  • Space saving: bits, rather than the data itself, record whether a piece of data exists, and with k hash functions doing the mapping, the bitmap length m can be reduced further.
  • Efficient queries: a lookup evaluates k hash functions, and since k is a constant the time complexity is O(1).

Disadvantages:

  • False positives:

Data that does not exist may be misjudged as existing, but data that does exist is never misjudged as missing.

  • Deletion is a problem:

Because of hash collisions, one bit may be shared by several inputs, so it cannot simply be cleared. As a result, the longer a bitmap is in use, the more bits are set to 1 and the higher the probability of misjudgment.

In the extreme case where every bit is 1, every non-existent item is misjudged as existing, that is, the misjudgment probability is 100%.

Two ways of working around the deletion problem are described below.

Option 1: Data archiving

This option suits cases where the full, detailed records still live in a database and the Bloom filter only serves as a caching layer that protects the relational database. We can periodically archive old data in the database, rebuild a fresh bitmap from the data that falls within the retained time range, and overwrite the old bitmap with it, which extends the life of the Bloom filter.

Option 2: Cuckoo filter

The cuckoo filter is an alternative data structure that supports deleting data to a certain extent. It is a rich topic, and I will write a separate article about it later.

2 Derivation of Bloom filter misjudgment rate

The derivation in this part grew out of a technical sharing session given by Teacher Shi in my team, during my current work on marketing technology at Didi Chuxing. Special thanks to Teacher Shi.

2.1 Deduction of false positive rate

First, we set the three basic parameters of the Bloom filter:

  • The length of bitmap is set to  m ;
  • The number of hash functions is set to  k ;
  • The number of input elements in the bitmap is  n ; (note that it is the input elements rather than the bits set to 1)

Now we start the probability deduction:

  • When an element is input and mapped once through the hash function, the probability that a bit will be set to 1 due to this operation is 1/m;
  • On the contrary, the probability that this bit will not be set to 1 due to this operation is 1-1/m;
  • It is further obtained that the probability that this bit is still not set to 1 after k hash mappings is (1-1/m)^k;
  • It is further obtained that the probability that this bit is still not set to 1 after n elements are input is (1-1/m)^(k·n);
  • On the contrary, after inputting n elements, the probability of 1 bit being set to 1 is 1-(1-1/m)^(k·n);

With these results in hand: when a new element is queried, a misjudgment occurs only if all k bits it maps to have already been set to 1. Treating the k bits as independent, the probability of a misjudgment is:

[1-(1-1/m)^(k·n)]^k

Next we simplify this expression using a standard limit from calculus.

As x->0, (1+x)^(1/x) -> e, where e is the natural constant, approximately 2.7182818.

Substituting x = -1/m: as m->∞, -1/m -> 0, so (1-1/m)^(-m) ~ e.

Therefore (1-1/m)^(k·n) = [(1-1/m)^(-m)]^(-k·n/m) ~ e^(-k·n/m).

Finally, for large m the misjudgment probability simplifies to [1-e^(-k·n/m)]^k.
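To get a feel for how good the approximation is, the exact and the simplified expressions can be evaluated side by side. The sketch below is only a numerical check; the function names and the sample values of m, n and k are chosen for illustration.

```go
package main

import (
    "fmt"
    "math"
)

// exactRate evaluates [1-(1-1/m)^(k·n)]^k.
func exactRate(m, n, k float64) float64 {
    return math.Pow(1-math.Pow(1-1/m, k*n), k)
}

// approxRate evaluates the simplified form [1-e^(-k·n/m)]^k.
func approxRate(m, n, k float64) float64 {
    return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
    const n, k = 1000.0, 5.0 // hypothetical element count and hash count
    for _, m := range []float64{5_000, 10_000, 100_000} {
        fmt.Printf("m=%-7.0f exact=%.6f approx=%.6f\n",
            m, exactRate(m, n, k), approxRate(m, n, k))
    }
}
```

Even for bitmaps of only a few thousand bits the two values should already agree closely, which is why the simplified form is the one normally used for parameter selection.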

2.2 Parameter tuning ideas

From Section 2.1 we know that the false positive probability of a Bloom filter depends on three things: the bitmap length m, the number of hash functions k, and the number of input elements n.

The next question is how to reduce the false positive probability through sensible parameter choices.

Our approach is to treat m and n as known constants and ask which value of k minimizes the misjudgment probability, so k is the variable to solve for.

To simplify the expression, denote the constant e^(n/m) by t. The misjudgment probability then becomes f(k) = [1-t^(-k)]^k.

Differentiating f(k) and locating its minimum (f'(k)=0, f''(k)>0) shows that f(k) is smallest when k·n/m = ln2, that is, when k = (m/n)·ln2.
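The differentiation itself is only stated, not shown. One common way to reach the same result, given here as a sketch and not necessarily the route the author took, is to minimize ln f(k) after the substitution p = e^(-k·n/m), where p is an auxiliary symbol introduced only for this derivation:

```latex
\[
\begin{aligned}
f(k) &= \left(1 - e^{-kn/m}\right)^{k}, \qquad p := e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p,\\
\ln f &= k\,\ln(1-p) = -\frac{m}{n}\,\ln p\,\ln(1-p),\\
\frac{d}{dp}\bigl[\ln p\,\ln(1-p)\bigr] &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
\;\Longleftrightarrow\; (1-p)\ln(1-p) = p\ln p,\\
p &= \tfrac{1}{2} \;\Rightarrow\; \frac{kn}{m} = \ln 2, \quad k = \frac{m}{n}\ln 2, \quad
f_{\min} = \left(\tfrac{1}{2}\right)^{k} = 2^{-\frac{m}{n}\ln 2}.
\end{aligned}
\]
```

Since ln p · ln(1-p) is positive on (0, 1), vanishes at both ends and is symmetric in p and 1-p, the stationary point p = 1/2 is its maximum, which makes ln f, and therefore f, minimal. In other words, the optimum is reached when about half of the bits in the bitmap are set to 1.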

Therefore, when choosing the parameters of a Bloom filter, we can proceed as follows:

  • First, set the bitmap length m to a sufficiently large value.
  • Second, estimate the number of elements n that the Bloom filter will hold.
  • Next, compute a suitable number of hash functions from k·n/m = ln2, rounding k to an integer.
  • Finally, evaluate the misjudgment probability [1-e^(-k·n/m)]^k and check whether it meets the requirements.
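The four steps can be turned into a small calculator. The following is a sketch with hypothetical function names; m and n are taken from the query string of the simulator link in this section, and rounding k to the nearest integer is an assumption of this sketch.

```go
package main

import (
    "fmt"
    "math"
)

// optimalK returns the integer closest to (m/n)·ln2, but never less than 1.
func optimalK(m, n uint64) uint64 {
    k := math.Round(float64(m) / float64(n) * math.Ln2)
    if k < 1 {
        return 1
    }
    return uint64(k)
}

// falsePositiveRate evaluates [1-e^(-k·n/m)]^k.
func falsePositiveRate(m, n, k uint64) float64 {
    return math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(m)), float64(k))
}

func main() {
    const m, n = 65_000_000, 9_000_000 // bitmap length and expected element count
    k := optimalK(m, n)
    fmt.Printf("k=%d p=%.4f\n", k, falsePositiveRate(m, n, k))
    fmt.Printf("k=6 p=%.4f\n", falsePositiveRate(m, n, 6))
}
```

For these values the rounded optimum is k = 5 with a rate of roughly 3.1%, slightly better than the roughly 3.2% obtained with k = 6. If the resulting rate is too high for the use case, the remedy is a larger m, since k is already at its optimum.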

A ready-made parameter calculator for Bloom filters is available here:

https://hur.st/bloomfilter/?n=9000000&p=&m=65000000&k=6

2.3 Hash algorithm selection

When choosing hash functions for a Bloom filter, computing performance comes first. The hash functions do not need cryptographic properties (for example, it does not have to be infeasible to find two different inputs that produce the same output).

We therefore rule out cryptographic hash algorithms such as sha1 and md5 and choose among non-cryptographic ones. Among these, murmur3 performs relatively well, and it is the hash function used in the Bloom filter implementation code in Chapters 3 and 4.

murmur3 github open source address: https://github.com/spaolacci/murmur3

3 Local Bloom filter implementation

The following shows the specific code to implement Bloom filter based on local bitmap:

3.1 Hash encoding

The following is the hash encoding module implemented through murmur3, which converts the input string into a hash value of type int32:

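import (
    "math"

    "github.com/spaolacci/murmur3"
)

// Encryptor is the hash encoding module. Despite its name, it only hashes
// with murmur3 and performs no encryption.
type Encryptor struct {
}

func NewEncryptor() *Encryptor {
    return &Encryptor{}
}

// Encrypt hashes origin with murmur3 and returns a non-negative int32.
func (e *Encryptor) Encrypt(origin string) int32 {
    hasher := murmur3.New32()
    _, _ = hasher.Write([]byte(origin))
    // the modulo keeps the value inside the non-negative int32 range
    return int32(hasher.Sum32() % math.MaxInt32)
}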

3.2 Bloom filter service

The following is a bloom filter service built based on local bitmap:

  • m: the length of the bitmap, supplied by the user
  • k: the number of hash functions, supplied by the user
  • n: the number of elements in the Bloom filter, counted by the filter itself
  • bitmap: the bitmap, of type []int. Only 32 bits of each int element are used, so the length of []int is m/32. Because m may not be divisible by 32, the slice is made one element longer
  • encryptor: the hash encoding module

import (
    "github.com/demdxx/gocast"
)

type LocalBloomService struct {
    m, k, n   int32
    bitmap    []int
    encryptor *Encryptor
}

func NewLocalBloomService(m, k int32, encryptor *Encryptor) *LocalBloomService {
    return &LocalBloomService{
        m:         m,
        k:         k,
        bitmap:    make([]int, m/32+1),
        encryptor: encryptor,
    }
}

3.3 Query process

The following is the query process to determine whether an element val exists in the Bloom filter:

  • First, call LocalBloomService.getKEncrypted to obtain the k bit offsets that correspond to val.
  • Each int element in []int holds 32 bits, so for each offset the index into []int is offset >> 5, that is, offset / 32.
  • The position of the bit inside that int element is offset & 31, that is, offset % 32.
  • If any of the k bits is 0, val is definitely not in the Bloom filter.
  • If all k bits are 1, val is probably in the Bloom filter.

func (l *LocalBloomService) Exist(val string) bool {
    for _, offset := range l.getKEncrypted(val) {
        index := offset >> 5     // equivalent to / 32
        bitOffset := offset & 31 // equivalent to % 32

        if l.bitmap[index]&(1<<bitOffset) == 0 {
            return false
        }
    }

    return true
}
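The shift-and-mask arithmetic is easy to verify on a concrete offset. The sketch below uses a hypothetical helper named locate and a []uint32 slice (instead of the []int used by the service) so that each element is exactly 32 bits wide.

```go
package main

import "fmt"

// locate splits a bit offset into the slice index and the position inside
// a 32-bit word, using the same shift and mask as the Exist method.
func locate(offset int32) (index, bitOffset int32) {
    return offset >> 5, offset & 31
}

func main() {
    bitmap := make([]uint32, 100/32+1) // room for 100 bits

    index, bitOffset := locate(70)  // 70 = 2*32 + 6
    bitmap[index] |= 1 << bitOffset // set bit 70

    fmt.Println(index, bitOffset)                  // 2 6
    fmt.Println(bitmap[index]&(1<<bitOffset) != 0) // true
    fmt.Println(bitmap[index]&(1<<7) != 0)         // false: bit 71 is untouched
}
```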

The k bit offsets for an element val are obtained as follows:

  • In the first round, val itself is the input and murmur3 produces its hash value.
  • In each later round, the hash value from the previous round is the input and murmur3 produces a new hash value.
  • Once k hash values have been collected, they are returned.

func (l *LocalBloomService) getKEncrypted(val string) []int32 {
    encrypteds := make([]int32, 0, l.k)
    origin := val
    for i := 0; int32(i) < l.k; i++ {
        encrypted := l.encryptor.Encrypt(origin)
        // take the hash modulo m to get an offset inside the bitmap
        encrypteds = append(encrypteds, encrypted%l.m)
        if int32(i) == l.k-1 {
            break
        }
        // the previous hash, as a string, is the input of the next round
        origin = gocast.ToString(encrypted)
    }
    return encrypteds
}

3.4 Add process

The following is the process of appending elements into the Bloom filter:

  • Each time a new element arrives, n in the Bloom filter is incremented.
  • Call LocalBloomService.getKEncrypted to obtain the k bit offsets that correspond to val.
  • Obtain the index into []int with offset >> 5, as in Section 3.3.
  • Obtain the bit position inside the int with offset & 31, as in Section 3.3.
  • Use the | operation to set the corresponding bit to 1.
  • Repeat until all k bits are set to 1.

func (l *LocalBloomService) Set(val string) {
    l.n++
    for _, offset := range l.getKEncrypted(val) {
        index := offset >> 5     // equivalent to / 32
        bitOffset := offset & 31 // equivalent to % 32

        l.bitmap[index] |= (1 << bitOffset)
    }
}
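To see how a filter of this shape behaves in practice, the formula from Section 2.1 can be compared with a measured false positive rate. The sketch below is an experiment under stated assumptions, not the article's implementation: it swaps murmur3 for the standard library's hash/maphash (whose seed is random per process, so the measured value varies slightly between runs), stores bits in a []uint64, and uses made-up names (bloom, measure) and sample parameters.

```go
package main

import (
    "fmt"
    "hash/maphash"
    "math"
    "strconv"
)

// bloom is a stripped-down filter used only for this experiment.
type bloom struct {
    m, k uint64
    bits []uint64
    seed maphash.Seed
}

func newBloom(m, k uint64) *bloom {
    return &bloom{m: m, k: k, bits: make([]uint64, m/64+1), seed: maphash.MakeSeed()}
}

// offsets chains k hashes: each round hashes the previous round's value,
// mirroring the idea of getKEncrypted.
func (b *bloom) offsets(val string) []uint64 {
    out := make([]uint64, 0, b.k)
    for i := uint64(0); i < b.k; i++ {
        var h maphash.Hash
        h.SetSeed(b.seed)
        _, _ = h.WriteString(val)
        sum := h.Sum64()
        out = append(out, sum%b.m)
        val = strconv.FormatUint(sum, 10)
    }
    return out
}

func (b *bloom) set(val string) {
    for _, off := range b.offsets(val) {
        b.bits[off>>6] |= 1 << (off & 63)
    }
}

func (b *bloom) exist(val string) bool {
    for _, off := range b.offsets(val) {
        if b.bits[off>>6]&(1<<(off&63)) == 0 {
            return false
        }
    }
    return true
}

// measure inserts n keys and returns the share of never-inserted probe keys
// that the filter reports as present.
func measure(m, k uint64, n, probes int) float64 {
    b := newBloom(m, k)
    for i := 0; i < n; i++ {
        b.set("in-" + strconv.Itoa(i))
    }
    hits := 0
    for i := 0; i < probes; i++ {
        if b.exist("out-" + strconv.Itoa(i)) {
            hits++
        }
    }
    return float64(hits) / float64(probes)
}

func main() {
    const m, k, n = 100_000, 7, 10_000
    theory := math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(m)), k)
    fmt.Printf("measured=%.4f theory=%.4f\n", measure(m, k, n, 100_000), theory)
}
```

With a well-mixing hash function the measured rate should land close to the theoretical one, while inserted keys are never reported as missing.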

4 Implement Bloom filter based on redis

The following shows the specific code for implementing Bloom filter based on redis bitmap:

4.1 Hash encoding

The murmur3 hash coding module is the same as the implementation in the local Bloom filter module in Section 3.1 of this article, and will not be described again:

import (
    "math"

    "github.com/spaolacci/murmur3"
)

type Encryptor struct {
}

func NewEncryptor() *Encryptor {
    return &Encryptor{}
}

func (e *Encryptor) Encrypt(origin string) int32 {
    hasher := murmur3.New32()
    _, _ = hasher.Write([]byte(origin))
    return int32(hasher.Sum32() % math.MaxInt32)
}

4.2 redis client

redigo github open source address: https://github.com/gomodule/redigo

The redis client implementation code based on redigo is as follows:

  • Connections are reused through a redis connection pool. Each operation first takes a connection from the pool and must return it to the pool after use.
  • The redis client wraps an Eval method, which executes Lua scripts so that compound commands run atomically.

import (
    "context"
    "fmt"

    "github.com/demdxx/gocast"
    "github.com/gomodule/redigo/redis"
)

type RedisClient struct {
    pool *redis.Pool
}

func NewRedisClient(pool *redis.Pool) *RedisClient {
    return &RedisClient{
        pool: pool,
    }
}

// Eval executes a Lua script so that the compound operation is atomic.
func (r *RedisClient) Eval(ctx context.Context, src string, keyCount int, keysAndArgs []interface{}) (interface{}, error) {
    args := make([]interface{}, 2+len(keysAndArgs))
    args[0] = src
    args[1] = keyCount
    copy(args[2:], keysAndArgs)

    // take a connection from the pool
    conn, err := r.pool.GetContext(ctx)
    if err != nil {
        return -1, err
    }

    // return the connection to the pool when done
    defer conn.Close()

    // execute the Lua script
    return conn.Do("EVAL", args...)
}

4.3 Bloom filter service

Define Bloom filter service module:

  • m: bitmap length, supplied by the user
  • k: number of hash functions, supplied by the user
  • client: the client connected to redis

// BloomService is the Bloom filter service.
type BloomService struct {
    m, k      int32
    encryptor *Encryptor
    client    *RedisClient
}

// m -> length of the bitmap; k -> number of hash functions;
// client -> redis client; encryptor -> hash mapper
func NewBloomService(m, k int32, client *RedisClient, encryptor *Encryptor) *BloomService {
    return &BloomService{
        m:         m,
        k:         k,
        client:    client,
        encryptor: encryptor,
    }
}

4.4 Query process

Query whether the input content exists in the Bloom filter:

  • key is the key that identifies a bitmap of the Bloom filter. Elements stored under different keys are isolated from each other.
  • val is the input element, which belongs to the bitmap of a particular key.
  • Call BloomService.getKEncrypted to obtain the k bit offsets.
  • Call RedisClient.Eval to execute the Lua script. If any of the k bits is not 1, the element does not exist and false is returned; otherwise true is returned.

// key -> key of the Bloom filter bitmap; val -> value mapped into the bitmap by hashing
func (b *BloomService) Exist(ctx context.Context, key, val string) (bool, error) {
    // map val to its bit offsets
    keyAndArgs := make([]interface{}, 0, b.k+2)
    keyAndArgs = append(keyAndArgs, key, b.k)
    for _, encrypted := range b.getKEncrypted(val) {
        keyAndArgs = append(keyAndArgs, encrypted)
    }

    rawResp, err := b.client.Eval(ctx, LuaBloomBatchGetBits, 1, keyAndArgs)
    if err != nil {
        return false, err
    }

    resp := gocast.ToInt(rawResp)
    if resp == 1 {
        return true, nil
    }
    return false, nil
}

The method that maps an input element to its k bit offsets is getKEncrypted. The logic is the same as in Section 3.3 and will not be described again.

func (b *BloomService) getKEncrypted(val string) []int32 {
    encrypteds := make([]int32, 0, b.k)
    origin := val
    for i := 0; int32(i) < b.k; i++ {
        encrypted := b.encryptor.Encrypt(origin)
        // take the hash modulo m, as in Section 3.3, so the offset stays inside the bitmap
        encrypteds = append(encrypteds, encrypted%b.m)
        if int32(i) == b.k-1 {
            break
        }
        origin = gocast.ToString(encrypted)
    }
    return encrypteds
}

The following is a Lua script that performs bitmap query operations in batches: k bits will be queried. As long as one bit is marked as 0, 0 will be returned; if all bits are marked as 1, 1 will be returned.

const LuaBloomBatchGetBits = `
  local bloomKey = KEYS[1]
  local bitsCnt = ARGV[1]
  for i=1,bitsCnt,1 do
    local offset = ARGV[1+i]
    local reply = redis.call('getbit',bloomKey,offset)
    if (not reply) then
        error('FAIL')
        return 0
    end
    if (reply == 0) then
        return 0
    end
  end
  return 1
`

4.5 Add process

The process of adding an input element to a Bloom filter is as follows:

  • key is the key that identifies a bitmap of the Bloom filter. Elements stored under different keys are isolated from each other.
  • val is the input element, which belongs to the bitmap of a particular key.
  • Call BloomService.getKEncrypted to obtain the k bit offsets.
  • Call RedisClient.Eval to execute the Lua script and set all k bits to 1.

func (b *BloomService) Set(ctx context.Context, key, val string) error {
    // map val to its bit offsets
    keyAndArgs := make([]interface{}, 0, b.k+2)
    keyAndArgs = append(keyAndArgs, key, b.k)
    for _, encrypted := range b.getKEncrypted(val) {
        keyAndArgs = append(keyAndArgs, encrypted)
    }

    rawResp, err := b.client.Eval(ctx, LuaBloomBatchSetBits, 1, keyAndArgs)
    if err != nil {
        return err
    }

    resp := gocast.ToInt(rawResp)
    if resp != 1 {
        return fmt.Errorf("resp: %d", resp)
    }
    return nil
}

The write path likewise relies on a Lua script, so that the compound operation is atomic and all k bits are set to 1 together.

const LuaBloomBatchSetBits = `
  local bloomKey = KEYS[1]
  local bitsCnt = ARGV[1]


  for i=1,bitsCnt,1 do
    local offset = ARGV[1+i]
    redis.call('setbit',bloomKey,offset,1)
  end
  return 1
`

5 Project case introduction

In xtimer, a distributed timer I built earlier as a personal project, a Bloom filter serves as an auxiliary tool for checking task idempotence.

For a detailed introduction to this project, see the article: Implementing distributed timer XTimer based on coroutine pool architecture

The open source address of xtimer is as follows: https://github.com/xiaoxuxiansheng/xtimer

The xtimer architecture diagram is as follows:

In xtimer, scheduled tasks are actually executed by the executor module, which is started asynchronously by the upstream trigger module. Through an ack-like operation that extends the expiration time of a slice, xtimer can only guarantee at-least-once semantics for scheduled tasks; it cannot achieve exactly-once semantics.

Therefore, before the executor module actually runs a task, it needs to query the task's execution status in the database as an idempotence check. Here I use a Bloom filter to identify with certainty the tasks that have not been executed yet: for those, the database check can be skipped and execution proceeds directly. For tasks the Bloom filter marks as executed, the database must still be checked to confirm.
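The decision logic described here can be sketched in a few lines. The interfaces and names below (existChecker, statusStore, shouldRun, setOf) are hypothetical stand-ins written for this illustration; they are not xtimer's actual API.

```go
package main

import "fmt"

// existChecker and statusStore stand in for the Bloom filter and the
// task table in the database.
type existChecker interface {
    Exist(taskID string) bool
}

type statusStore interface {
    Executed(taskID string) bool
}

// shouldRun decides whether a task still has to be executed.
func shouldRun(bf existChecker, db statusStore, taskID string) bool {
    if !bf.Exist(taskID) {
        // A Bloom filter never reports a stored element as absent,
        // so the database lookup can be skipped.
        return true
    }
    // "Exists" may be a false positive, so the database has the final say.
    return !db.Executed(taskID)
}

// setOf is a toy in-memory stand-in used for both interfaces in the demo.
type setOf map[string]bool

func (s setOf) Exist(id string) bool    { return s[id] }
func (s setOf) Executed(id string) bool { return s[id] }

func main() {
    bf := setOf{"task-1": true, "task-2": true} // task-2 plays a false positive
    db := setOf{"task-1": true}

    fmt.Println(shouldRun(bf, db, "task-1")) // false: already executed
    fmt.Println(shouldRun(bf, db, "task-2")) // true: the database says it has not run
    fmt.Println(shouldRun(bf, db, "task-3")) // true: no database lookup needed
}
```

The benefit comes from the first branch: as long as most tasks have not been executed yet, most checks never reach the database.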

The entire execution flow chart is as follows:

6 Summary

In this issue I shared a data structure with a very clever design: the Bloom filter.

A Bloom filter consists of a bitmap and a series of random mapping functions. It does not store the data itself, only whether a piece of data exists. Its biggest advantages are excellent space efficiency and query efficiency. Its disadvantages are that data is hard to delete and that there is a certain probability of false positives.

Small advertisement at the end of the article:

You are welcome to follow my personal public account: Mr. Xiao Xu’s Programming World


Origin blog.csdn.net/iamonlyme/article/details/133278859