Solutions for Massive Data: Bitmap

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reproducing it, please include a link to the original source and this statement.
Original link: gudepeng.github.io/note/2019/1...

I. About Bitmap

1: Bitmap is a well-known algorithm. Its principle is that a value is not stored directly; instead, the value is used as an index, and the bit at that index is set to 0 or 1 to indicate whether the value (or a property of it) exists.
2: Bitmap is efficient and space-saving, which makes it well suited to de-duplicating and querying large amounts of data. A flag takes a single bit, whereas an int takes 32 bits, so a bitmap can use up to 32 times less space.
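
The principle above can be sketched as a small fixed-size bitmap backed by a long array. This is only an illustration of the idea, not code from any library; the class name SimpleBitmap and its methods are made up for this example.

```java
// A minimal fixed-size bitmap backed by a long[] (illustrative sketch only;
// SimpleBitmap is a hypothetical name, not a library class).
final class SimpleBitmap {
    private final long[] words;

    SimpleBitmap(int capacity) {
        // One long holds 64 flags; round up so every id in [0, capacity) fits.
        this.words = new long[(capacity + 63) / 64];
    }

    // Mark the id as present by setting its bit to 1.
    void set(int id) {
        words[id >>> 6] |= 1L << (id & 63);
    }

    // Check whether the bit for this id is 1.
    boolean get(int id) {
        return (words[id >>> 6] & (1L << (id & 63))) != 0;
    }

    int sizeInBytes() {
        return words.length * Long.BYTES;
    }

    public static void main(String[] args) {
        SimpleBitmap seen = new SimpleBitmap(1_000_000);
        seen.set(42);
        seen.set(42); // setting the same id twice changes nothing, so duplicates collapse
        System.out.println(seen.get(42));
        System.out.println(seen.get(43));
        // 1,000,000 flags fit in 125,000 bytes; an int[] of the same length needs 4,000,000.
        System.out.println(seen.sizeInBytes());
    }
}
```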

II. Usage Scenario

Suppose you have information about a group of people and want to tag them, for example with a "member" tag and a "purchased goods within the last 30 days" tag. You could attach the tags to each person, but then getting the set of members means traversing every person and checking whether they carry the member tag, which is far too much computation.
Think of it the other way around. If we have 10 people, we define a 10-bit array and give each person an auto-incrementing id (starting from 0). Each bit then corresponds to one person, and 0 or 1 indicates whether that person has the tag.
Member tag |0|1|1|1|0|0|0|1|0|0|
This makes the query very intuitive: the people with ids 1, 2, 3, and 7 are members. Next, we tag the people who purchased goods within the last 30 days.
Purchased within 30 days |1|1|0|0|0|0|0|0|0|0|
We can see at a glance that the people with ids 0 and 1 purchased goods within the last 30 days. To find the members who purchased goods within the last 30 days, we only need an AND operation (&).
Member tag |0|1|1|1|0|0|0|1|0|0|
Purchased within 30 days |1|1|0|0|0|0|0|0|0|0|
Members who purchased within 30 days |0|1|0|0|0|0|0|0|0|0|
To find the people who are members or who purchased goods within the last 30 days, we only need an OR operation (|).
Member tag |0|1|1|1|0|0|0|1|0|0|
Purchased within 30 days |1|1|0|0|0|0|0|0|0|0|
Members or purchased within 30 days |1|1|1|1|0|0|0|1|0|0|
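
The two queries above can be reproduced with java.util.BitSet from the standard library. This is a small sketch using the same ten-person example; the class name TagQueryDemo and its helper methods are hypothetical.

```java
import java.util.BitSet;

// Reproduces the AND / OR tag queries with java.util.BitSet
// (TagQueryDemo and its helpers are hypothetical names for this sketch).
final class TagQueryDemo {
    // Build a tag bitmap in which the given person ids are set to 1.
    static BitSet tagOf(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    // Intersection: people who carry both tags. Works on a copy so inputs stay unchanged.
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    // Union: people who carry at least one of the tags.
    static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet members = tagOf(1, 2, 3, 7); // member tag
        BitSet buyers = tagOf(0, 1);        // purchased within 30 days
        System.out.println(and(members, buyers)); // {1}
        System.out.println(or(members, buyers));  // {0, 1, 2, 3, 7}
    }
}
```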
You might say: I could build a map to get the same result, so why use a bitmap?
Because an int occupies 4 bytes (32 bits) in the computer, whereas a bitmap needs one bit per person, only 1/32 of the memory.
You might also say: if I have a crowd of 100,000 people, every tag needs 100,000 bits, and if only one person matches a tag, a lot of space is wasted. The specific implementations described in the next section address this problem.

III. Specific implementation

This section covers two main implementations. The first is EWAHCompressedBitmap (from the JavaEWAH library, package com.googlecode.javaewah). The second is RoaringBitmap, which is also the one most widely used in mainstream systems such as Spark, Hive, and Kylin.

1. EWAHCompressedBitmap

EWAH stores the bitmap in a long array. Each element can be viewed as a 64-bit binary number, which EWAH calls a word. EWAH starts with four words and expands the array when all of them are occupied. When a newly added value lies far from the existing data, EWAH creates an RLW (Running Length Word) instead of allocating every word in between. An RLW has two parts: the lower 32 bits record how many empty words the run spans, and the upper 32 bits record how many consecutive words follow this RLW. This avoids allocating a large amount of space when the values are widely spaced.
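
To make the idea of skipping empty words concrete, here is a toy run-length sketch. It is not the real EWAH word layout and does not use the JavaEWAH library: each entry is simply a pair of (number of empty words skipped, one literal word), and the class name RunLengthSketch is invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of run-length compression of empty words.
// NOT the actual EWAH format; RunLengthSketch is a hypothetical name.
final class RunLengthSketch {
    // Encode sorted, non-negative ids as entries {emptyWordsSkipped, literalWord}.
    static List<long[]> encode(int... sortedIds) {
        List<long[]> out = new ArrayList<>();
        long nextWord = 0;      // index of the first word not yet covered by an entry
        long currentIndex = -1; // word index of the literal currently being built
        long literal = 0;
        for (int id : sortedIds) {
            long wordIndex = id >>> 6;
            if (wordIndex != currentIndex) {
                if (currentIndex >= 0) {
                    out.add(new long[] {currentIndex - nextWord, literal});
                    nextWord = currentIndex + 1;
                }
                currentIndex = wordIndex;
                literal = 0;
            }
            literal |= 1L << (id & 63);
        }
        if (currentIndex >= 0) {
            out.add(new long[] {currentIndex - nextWord, literal});
        }
        return out;
    }

    static boolean contains(List<long[]> encoded, int id) {
        long target = id >>> 6;
        long index = -1;
        for (long[] entry : encoded) {
            index += entry[0] + 1; // skip the empty run, land on the literal word
            if (index == target) {
                return (entry[1] & (1L << (id & 63))) != 0;
            }
            if (index > target) {
                return false; // target fell inside a run of empty words
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Ids 1 and 1,000,000 would need 15,626 plain words; here they need 2 entries.
        List<long[]> encoded = encode(1, 1_000_000);
        System.out.println(encoded.size());
        System.out.println(contains(encoded, 1_000_000));
        System.out.println(contains(encoded, 500));
    }
}
```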

2. RoaringBitmap

Roaring Bitmap partitions the 32-bit integers into chunks of 2^16 integers that share the same 16 most significant bits, and uses specialized containers to store their 16 least significant bits. When a chunk contains no more than 4096 integers, it is stored as a sorted array of 16-bit integers (a short array in Java). When it contains more than 4096 integers, it is stored as a 2^16-bit bitmap (a long array in Java). So there are two types of containers: an array container for sparse chunks (ArrayContainer) and a bitmap container for dense chunks (BitmapContainer). The 4096 threshold guarantees that, at the container level, each integer uses no more than 16 bits. A bitmap container holds more than 4096 (= 2^12) integers in 2^16 bits, so it uses fewer than 16 bits per integer (2^16 / 2^12 = 2^4 = 16; if the long array is completely filled, the best case is 1 bit per integer). An array container uses exactly 16 bits per integer.
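
As a rough sketch of this layout (not the library's internal code), the snippet below splits a 32-bit value into its high and low 16 bits and picks a container type by cardinality. The class name RoaringLayoutSketch and its methods are hypothetical; the container names are returned as plain strings purely for illustration.

```java
// Sketch of how a 32-bit value maps to a chunk key and a container type.
// RoaringLayoutSketch is a hypothetical name, not part of the RoaringBitmap library.
final class RoaringLayoutSketch {
    static final int ARRAY_MAX_CARDINALITY = 4096;

    // The 16 most significant bits select the chunk.
    static int highBits(int value) {
        return value >>> 16;
    }

    // The 16 least significant bits are what the container stores.
    static int lowBits(int value) {
        return value & 0xFFFF;
    }

    // Up to 4096 values: sorted 16-bit array; above that: 2^16-bit bitmap.
    static String containerFor(int cardinality) {
        return cardinality <= ARRAY_MAX_CARDINALITY ? "ArrayContainer" : "BitmapContainer";
    }

    public static void main(String[] args) {
        int value = 131_075; // 2 * 65536 + 3
        System.out.println(highBits(value)); // chunk key 2
        System.out.println(lowBits(value));  // stored low bits 3
        System.out.println(containerFor(4096));
        System.out.println(containerFor(4097));
    }
}
```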
Why is the threshold 4096? With fewer than 4096 integers, a bitmap container would use more than 16 bits per integer. With more than 4096 integers, an array container would use more than 2^16 bits (2^12 * 16 = 2^16), which exceeds the fixed 2^16 bits a bitmap container needs for the same chunk. In short, when the cardinality is low an array saves space, and when the cardinality is high a bitmap saves space.
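
The break-even point can be checked with a few lines of arithmetic. This is a sketch of the calculation only; ContainerCost is a hypothetical class name.

```java
// Compares the storage cost of the two container types for one chunk.
// ContainerCost is a hypothetical name used only for this calculation.
final class ContainerCost {
    // A bitmap container always needs one bit per possible low-16-bit value.
    static final int BITMAP_BITS = 1 << 16;

    // An array container needs 16 bits for every stored integer.
    static long arrayBits(int cardinality) {
        return 16L * cardinality;
    }

    // Effective bits per stored integer if the chunk is kept as a bitmap.
    static double bitmapBitsPerInteger(int cardinality) {
        return (double) BITMAP_BITS / cardinality;
    }

    public static void main(String[] args) {
        int[] cardinalities = {100, 4096, 10_000, 65_536};
        for (int c : cardinalities) {
            System.out.println(c + " integers: array = " + arrayBits(c)
                    + " bits, bitmap = " + BITMAP_BITS + " bits ("
                    + bitmapBitsPerInteger(c) + " bits per integer)");
        }
    }
}
```

At exactly 4096 integers both representations cost 65,536 bits, which is why the switch happens there.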
The containers are stored in a dynamic array, with the shared 16 most significant bits serving as the index key. Using an array keeps the entries sorted by their high 16 bits. The index is expected to be small in general: when all values lie below n = 1,000,000, it contains at most 16 entries, so it should fit in the CPU cache. A container itself never uses more than 8 KB.
Here is a paper comparing their performance:
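
The two figures above follow from simple arithmetic, sketched here under the assumption that n means the values lie in the range [0, n). RoaringIndexSize is a hypothetical class name.

```java
// Arithmetic behind the "at most 16 entries" and "8 KB" figures.
// RoaringIndexSize is a hypothetical name; n is assumed to bound the values: [0, n).
final class RoaringIndexSize {
    // Number of 2^16-value chunks needed to cover [0, n), for n > 0.
    static int maxChunks(int n) {
        return ((n - 1) >>> 16) + 1;
    }

    // A bitmap container holds 2^16 bits, which is the largest a container gets.
    static int bitmapContainerBytes() {
        return (1 << 16) / 8;
    }

    public static void main(String[] args) {
        System.out.println(maxChunks(1_000_000));   // 16
        System.out.println(bitmapContainerBytes()); // 8192
    }
}
```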
db.ucsd.edu/wp-content/...

IV. Using RoaringBitmap

1. Maven dependency

<dependencies>
    <dependency>
        <groupId>org.roaringbitmap</groupId>
        <artifactId>RoaringBitmap</artifactId>
        <version>0.8.12</version>
    </dependency>
</dependencies>

2. Usage

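import org.roaringbitmap.RoaringBitmap;

public class RoaringBitmapDemo {
    public static void main(String[] args) {
        RoaringBitmap rb = RoaringBitmap.bitmapOf(1, 2, 3, 4, 7, 33, 55);
        // select: returns the value at the given position in sorted order (counting from 0)
        System.out.println(rb.select(1));
        // rank: returns the number of values less than or equal to the argument
        System.out.println(rb.rank(55));
        // contains: whether the bitmap contains the argument
        System.out.println(rb.contains(56));
        // contains (range): whether the bitmap contains every value in [5, 56)
        System.out.println(rb.contains(5L, 56L));
        // add: adds every value in the half-open range [10, 15)
        rb.add(10L, 15L);
        System.out.println(rb);

        RoaringBitmap rb1 = RoaringBitmap.bitmapOf(2, 3, 4, 44);
        System.out.println(rb1);
        // union of the two bitmaps (returns a new bitmap)
        RoaringBitmap rb1or2 = RoaringBitmap.or(rb, rb1);
        System.out.println(rb1or2);
        // intersection of the two bitmaps (returns a new bitmap)
        RoaringBitmap rb1and2 = RoaringBitmap.and(rb, rb1);
        System.out.println(rb1and2);
        // in-place intersection: rb keeps only the values that are also in rb1
        rb.and(rb1);
        System.out.println(rb);
        // first: returns the smallest value in the bitmap
        System.out.println(rb.first());
    }
}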

Origin juejin.im/post/5e0725146fb9a01626646cdf