From bitmap to Bloom filter to high-concurrency cache design strategies

bitmap and bloom filter

Checking whether a value exists in a massive set of integers: the bitmap

Programs often need to judge whether a certain number exists in a set. In most cases a simple data structure such as a map or a list is enough; with a high-level language we can call a few packaged APIs, add a few if-elses, and in two or three lines watch our "perfect" and "robust" code run on the console.

However, nothing is perfect. Under high concurrency, every case becomes extreme. Suppose the collection is very large — to put a concrete number on "very large", say 100 million elements. Even ignoring the pointer overhead of a hash map's linked lists, 100 million 4-byte ints already need more than 380 MB (4 bytes × 10^8), and 1 billion would need about 4 GB. Performance aside, the memory cost alone is unacceptable — even if 128 GB servers were everywhere, this is not a bill we want to pay.

A bitmap uses the position of a bit to represent a number, and the 0 or 1 stored in that bit marks whether the number exists. The model looks like this:

This is a "bitmap" that can represent 0-9, in which the four numbers 1, 2, 3, and 4 are present

Now calculate the memory overhead of the bitmap. For lookups within a range of 100 million, we only need 100 million bits ≈ 12 MB of memory to support a search over massive data — an extremely attractive reduction. Here is a bitmap implemented in Java:

public class MyBitMap {
 
    private byte[] bytes;
    private int initSize;
 
    public MyBitMap(int size) {
        if (size <= 0) {
            // fail fast instead of silently leaving bytes null
            throw new IllegalArgumentException("size must be positive");
        }
        initSize = size / 8 + 1;
        bytes = new byte[initSize];
    }
 
    public void set(int number) {
        // number >> 3 is equivalent to number / 8: which byte holds this bit
        int index = number >> 3;
        // number & 0x07 is equivalent to number % 8: the bit position inside bytes[index]
        int position = number & 0x07;
        // bitwise OR: the result bit is 1 if either operand bit is 1
        bytes[index] |= 1 << position;
    }
 
 
    public boolean contain(int number) {
        int index = number >> 3;
        int position = number & 0x07;
        return (bytes[index] & (1 << position)) != 0;
    }
 
    public static void main(String[] args) {
        MyBitMap myBitMap = new MyBitMap(32);
        myBitMap.set(30);
        myBitMap.set(13);
        myBitMap.set(24);
        System.out.println(myBitMap.contain(2));
    }
 
}

With a simple byte array and bit operations, we get a near-perfect balance of time and space. Beautiful, isn't it? Wrong! Just imagine: we know the values are all below 100 million, but there are only about ten of them. A bitmap still costs 12 MB. If the values range up to 1 billion, the overhead rises to about 120 MB. A bitmap's space overhead is always tied to the range of its values, not to how many there are, so it can only show its skills on genuinely massive data.

Back to the extreme case just mentioned: suppose there are 10 million values, but they range up to 1 billion. Must we inevitably face the 120 MB overhead, or is there a way to deal with it?

Bloom filter

Facing the problem above, let's combine it with a conventional trick: hashing. What if we hash each value in the 1-billion range down to a value within 100 million, and then check the bitmap? As the figure below shows, this is exactly what a Bloom filter does:

Using the values obtained from multiple hash functions reduces the probability of hash collisions

As the figure suggests, multiple hash functions lower the collision probability, but as long as collisions are possible there can be false positives: we can never be 100% sure a value really exists. The charm of the construction is its asymmetry — it cannot confirm that you exist, but it can be certain that you truly do not. This is why the structure above is called a "filter".
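
To make the idea concrete, here is a minimal Bloom filter sketch in Java. The class name, the hash-seed values, and the bit-array size are all illustrative choices, not a tuned or production design; real deployments size the bit array and the number of hash functions from the expected element count and target false-positive rate.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch; seeds and sizes are illustrative, not tuned. */
public class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;
    // one simple multiplicative hash per seed stands in for k independent hash functions
    private final int[] seeds = {7, 11, 13, 31};

    public SimpleBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int hash(String value, int seed) {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = seed * h + value.charAt(i);
        }
        return (h & 0x7FFFFFFF) % size; // clear the sign bit, then map into the bit array
    }

    public void add(String value) {
        for (int seed : seeds) {
            bits.set(hash(value, seed)); // set one bit per hash function
        }
    }

    /** false means definitely absent; true means only "possibly present". */
    public boolean mightContain(String value) {
        for (int seed : seeds) {
            if (!bits.get(hash(value, seed))) {
                return false; // any unset bit proves the value was never added
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20);
        filter.add("user:1001");
        System.out.println(filter.mightContain("user:1001")); // true
        System.out.println(filter.mightContain("user:9999")); // very likely false, but false positives are possible
    }
}
```

Note the one-sided guarantee in `mightContain`: a single unset bit is proof of absence, while all-bits-set is only probabilistic evidence of presence — exactly the asymmetry described above.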

High Concurrency Cache Design Strategy

why cache?

If the reader studied computer science, the word "cache" has been heard often enough to wear calluses on the ears. In a computer system, the CPU cache is a peacemaker between the CPU and memory, easing the gap between their processing speeds; in the OS, the page cache is a peacemaker between memory and IO.

A cache as a peacemaker? It sounds odd, but it is also quite apt.

The first half of this article was mostly algorithm theory. To keep readers from dozing off, let's move straight into the second half of the topic: high-concurrency cache design.

At the software layer we need the same kind of peacemaker. Start from the simplest service architecture: the server receives a request and performs CRUD against a relational database such as MySQL. An architecture like this relies on disk as the persistence terminal, and even with indexes and the B+-tree structure optimizing queries, efficiency is still throttled by IO that requires frequent seeks. Here the peacemaker's role is obvious: we add an in-memory layer to relieve the pressure caused by slow disk IO. Having a cache is not the problem — how to use it well is.

cache consistency issues

There are several common mechanisms for keeping the cache and the database consistent:

  • cache aside;
  • read through;
  • write through;
  • write behind caching;
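
Of these, cache-aside is the pattern most applications implement by hand: the application reads through the cache, backfills it on a miss, and invalidates (rather than updates) it on a write. A minimal sketch, using in-memory maps to stand in for the real cache and database — the class and field names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Cache-aside sketch: the application manages the cache around a (mocked) database. */
public class CacheAsideDemo {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> db = new ConcurrentHashMap<>(); // stands in for MySQL

    public String read(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;          // cache hit
        }
        value = db.get(key);       // cache miss: fall through to the database
        if (value != null) {
            cache.put(key, value); // backfill so later reads stay in memory
        }
        return value;
    }

    public void write(String key, String value) {
        db.put(key, value); // update the database first...
        cache.remove(key);  // ...then invalidate, not update, the cached copy
    }

    public static void main(String[] args) {
        CacheAsideDemo demo = new CacheAsideDemo();
        demo.write("user:1", "alice");
        System.out.println(demo.read("user:1")); // alice (first read backfills the cache)
        demo.write("user:1", "bob");             // the write evicts the stale entry
        System.out.println(demo.read("user:1")); // bob
    }
}
```

Invalidating instead of updating on write avoids one class of race where two concurrent writers leave the cache holding the older value.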

cache penetration problem

So-called cache penetration means that when a request arrives and the data cannot be found in the cache, the request falls through to the database anyway. When this happens, the cache no longer relieves any pressure.

Imagine a scenario where a malicious user floods traffic to query a record that does not exist in the database, punching through the cache on every request; the database is bound to be killed. How to avoid this kind of penetration is the problem.

There are two options. The first is to cache a null value: if the database query comes back empty, we cache null for that key so the next lookup does not reach the database again. This is simple and convenient, but it wastes some space.
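
A sketch of the null-value approach. The sentinel string and the in-memory maps are illustrative stand-ins; with Redis one would typically store an empty value with a short TTL so the sentinel eventually expires:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of option 1: cache null results so repeated misses skip the database. */
public class NullCachingDemo {

    // sentinel meaning "the database has no such row" (illustrative choice)
    private static final String NULL_SENTINEL = "__NULL__";

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> db = new ConcurrentHashMap<>(); // stands in for MySQL
    int dbHits = 0; // counts how often the database is actually touched

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            // a cached sentinel answers "not found" without touching the database
            return NULL_SENTINEL.equals(cached) ? null : cached;
        }
        dbHits++;
        String value = db.get(key);
        // cache the miss too (ideally with a short TTL in a real cache)
        cache.put(key, value == null ? NULL_SENTINEL : value);
        return value;
    }

    public static void main(String[] args) {
        NullCachingDemo demo = new NullCachingDemo();
        demo.get("ghost");               // miss: goes to the database once
        demo.get("ghost");               // served by the cached null
        System.out.println(demo.dbHits); // 1
    }
}
```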

The second solution — tying back to the first half of this article — is to use a Bloom filter: add a Bloom filter layer between the web server and the cache that records the keys, so requests for keys the filter has definitely never seen can be rejected before reaching the database. This also solves cache penetration.

cache avalanche problem

A cache avalanche occurs when many cache entries become invalid at the same moment — for example, because they were all set with the same expiration time — triggering a flood of linked cache misses.

Adding a distributed lock is one solution: only the request that acquires the lock may access the database. However, this treats the symptom, not the cause — when requests pile up, large numbers of threads block and memory pressure grows.
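
The single-process version of this idea can be sketched with one lock per key, so that on a miss only one thread rebuilds the entry while the rest wait and then read the refilled cache. A real distributed lock (e.g. via Redis) replaces the `ReentrantLock`; the class name and the lambda-based database stand-in below are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Sketch: one lock per key, so only one caller rebuilds a missing entry. */
public class LockedCacheLoader {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public String load(String key, Function<String, String> dbLookup) {
        String value = cache.get(key);
        if (value != null) {
            return value; // fast path: a cache hit takes no lock at all
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            value = cache.get(key); // re-check: another thread may have loaded it while we waited
            if (value == null) {
                value = dbLookup.apply(key); // only one caller reaches the database
                if (value != null) {
                    cache.put(key, value);
                }
            }
            return value;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        LockedCacheLoader loader = new LockedCacheLoader();
        AtomicInteger dbCalls = new AtomicInteger();
        for (int i = 0; i < 5; i++) {
            loader.load("hot-key", k -> { dbCalls.incrementAndGet(); return "row"; });
        }
        System.out.println(dbCalls.get()); // 1: the other four loads were cache hits
    }
}
```

The double-check after acquiring the lock is what prevents waiting threads from each hitting the database in turn.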

Warming up the data in advance and spreading out the expiration times can reduce the probability of a cache avalanche.
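
Spreading out expirations is usually done by adding random jitter to each key's TTL, so a batch of keys written together does not die together. A minimal sketch; the base TTL and jitter window are illustrative numbers:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: add a random offset to each key's TTL so entries do not expire together. */
public class TtlJitter {

    /** Base TTL plus a uniform random offset of 0..maxJitterSeconds. */
    static long expireSeconds(long baseSeconds, long maxJitterSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextLong(maxJitterSeconds + 1);
    }

    public static void main(String[] args) {
        // e.g. each value could feed a redis SET with an EX option
        for (int i = 0; i < 3; i++) {
            System.out.println(expireSeconds(600, 120)); // somewhere in 600..720
        }
    }
}
```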

Improving cache availability also helps: a single-point cache is itself a hidden avalanche hazard. Most cache middleware provides a high-availability architecture, such as Redis master-slave replication plus Sentinel.

Origin: blog.csdn.net/m0_63437643/article/details/123733601