Exact deduplicated counting in Redis: roaring bitmaps

If you want to count how many times an article has been read, you can simply use the Redis incr command. If the read count must be deduplicated by user, you can use a set to record the ids of all users who have read the article; the size of the set is then the deduplicated read count. However, if a hit article is read by a huge number of users, the set wastes too much storage space. In that case we can replace the set with the HyperLogLog data structure that Redis provides, which needs at most 12 KB of memory to do deduplicated counting at massive scale. The price is accuracy: the count is approximate, with a standard error of about 0.81%.

Is there an exact counting method that does not waste so much space? The first thing that comes to mind is a bitmap, using one bit to represent one user id. If a user id is 32 bytes (256 bits), a bitmap needs only 1/256 of the space to count exactly. But how do we map a user id to a bit position? If user ids were consecutive integers this would be easy, but in most systems user ids are not consecutive integers; they are strings in some format or large random integers.
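The 1/256 figure can be checked with a little arithmetic. The sketch below is only an estimate: it assumes 32-byte ids, counts payload bytes only (a real Redis set has extra per-entry overhead), and assumes the bitmap positions are packed densely, which is the best case. The function names are made up for illustration.

```python
# Rough space comparison for one article read by `readers` distinct users.
ID_BYTES = 32  # assumed size of one user id


def set_bytes(readers):
    """Payload bytes if every reader id is stored verbatim in a set."""
    return readers * ID_BYTES


def bitmap_bytes(readers):
    """Bytes for a bitmap with one bit per reader, rounded up to whole bytes."""
    return (readers + 7) // 8


readers = 1_000_000
ratio = set_bytes(readers) / bitmap_bytes(readers)
print(set_bytes(readers), bitmap_bytes(readers), ratio)  # 32000000 125000 256.0
```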

We can forcibly assign each user id a sequential integer, and store the correspondence between user id and integer in Redis.

$next_user_id = incr user_id_seq
set user_id_xxx $next_user_id
$next_user_id = incr user_id_seq
set user_id_yyy $next_user_id
$next_user_id = incr user_id_seq
set user_id_zzz $next_user_id

You may ask here: the whole point was to save space, so doesn't storing the mapping between user ids and integers waste space too? That is a good question. But note that this mapping is reusable: it can serve the read counts of every article, daily and monthly active user statistics for check-ins, and many other scenarios that need per-user deduplicated counts. It is effort spent once that keeps paying off.
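The mapping described above can be sketched in a few lines. This is an in-memory stand-in, not Redis client code: the class and method names are hypothetical, the counter plays the role of user_id_seq, and the dictionary plays the role of the per-user keys. Unlike the bare incr/set sequence, it also checks whether a user already has a number before assigning a new one.

```python
class UserIdMapper:
    """In-memory stand-in for the Redis counter plus one key per user."""

    def __init__(self):
        self._seq = 0    # plays the role of the user_id_seq counter
        self._ids = {}   # plays the role of the user_id_* keys

    def get_or_assign(self, user_id):
        """Return the integer for user_id, assigning the next one if needed."""
        existing = self._ids.get(user_id)
        if existing is not None:
            return existing
        self._seq += 1   # like INCR user_id_seq
        self._ids[user_id] = self._seq
        return self._seq


mapper = UserIdMapper()
first = mapper.get_or_assign("xxx")
second = mapper.get_or_assign("yyy")
again = mapper.get_or_assign("xxx")
print(first, second, again)  # 1 2 1
```

Against a real Redis server the check-then-assign step would have to be made atomic in some way; that part is left out of this sketch.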

With this mapping in place, we can easily build a bitmap for every article: each time a user reads the article, set the bit at that user's position. If the bit changes from 0 to 1, add 1 to the read count. This makes it easy to get the article's deduplicated read count.
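A minimal sketch of this marking step, using a bytearray as a stand-in for the Redis string behind a bitmap. The class name is hypothetical. The bit numbering follows Redis, where offset 0 is the most significant bit of the first byte.

```python
class ArticleReads:
    """Dense per-article bitmap with a running deduplicated read count."""

    def __init__(self):
        self.bits = bytearray()  # grows on demand, like a Redis string
        self.count = 0

    def mark_read(self, user_seq):
        """Set the user's bit; return True only if it flipped from 0 to 1."""
        byte_index, bit_index = divmod(user_seq, 8)
        if byte_index >= len(self.bits):
            # Pad with zero bytes up to the byte that holds this bit.
            self.bits.extend(bytes(byte_index + 1 - len(self.bits)))
        mask = 0x80 >> bit_index
        was_set = bool(self.bits[byte_index] & mask)
        self.bits[byte_index] |= mask
        if not was_set:
            self.count += 1
        return not was_set


reads = ArticleReads()
results = [reads.mark_read(5), reads.mark_read(5), reads.mark_read(9)]
print(results, reads.count)  # [True, False, True] 2
```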

We can also compute on the fly how many users two articles have in common: AND the two bitmaps and count the 1 bits in the result. OR, XOR and other operations work the same way.
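As a small illustration of these set operations, the sketch below uses Python integers as bitmaps. The helper names and the sample reader numbers are made up for the example.

```python
def to_bitmap(user_seqs):
    """Build a bitmap (as a Python int) with one bit set per user number."""
    bitmap = 0
    for seq in user_seqs:
        bitmap |= 1 << seq
    return bitmap


def popcount(bitmap):
    """Count the 1 bits in the bitmap."""
    return bin(bitmap).count("1")


article_a = to_bitmap([1, 2, 3, 5, 8])
article_b = to_bitmap([2, 3, 4, 8, 13])

both = popcount(article_a & article_b)      # readers of both articles
either = popcount(article_a | article_b)    # readers of at least one
only_one = popcount(article_a ^ article_b)  # readers of exactly one
print(both, either, only_one)  # 3 7 4
```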

But here comes another problem! A Redis bitmap is a dense bitmap. What does that mean? If a very large bitmap has only its last bit set to 1 and every other bit 0, it still occupies memory for the entire range, which is a serious waste. As you can imagine, most articles have few readers, yet their bitmaps take up almost as much memory as those of hit articles.
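To put a number on that waste: in a dense bitmap, memory is determined by the highest set bit, not by how many bits are set. A rough sketch (the function name is hypothetical, and allocator overhead is ignored):

```python
def dense_bitmap_bytes(highest_set_bit):
    """Bytes a dense bitmap needs so that `highest_set_bit` is addressable."""
    return highest_set_bit // 8 + 1


one_early_reader = dense_bitmap_bytes(0)
one_late_reader = dense_bitmap_bytes(2**32 - 1)
print(one_early_reader)                 # 1
print(one_late_reader // (1024 * 1024)) # 512 (MB) for a single set bit
```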

So this approach does not work as it stands, and we need something else. This is where the roaring bitmap (RoaringBitmap) comes in.

It splits the one big bitmap into blocks; if a block is entirely zero, the block is not stored at all. But what if the 1 bits are scattered so that every block contains some, even though each block contains very few? Then splitting into blocks is not enough. Can a single block be optimized further? If a block contains only a small number of 1 bits, we can store just the offsets of those bits within the block (as integers), that is, a list of integers, and the block takes much less storage. This is the sparse storage form of a single block: a list of integer offsets. Only when the number of 1 bits in a block exceeds a threshold is the block converted, in one step, from sparse storage to dense storage.
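The sparse-to-dense switch for a single block might look roughly like the sketch below. It is an illustration, not the library's code: the class name is hypothetical, the block size of 65536 bits and the limit of 4096 entries are the figures this article quotes further on, and the bit order inside the dense form is an arbitrary choice.

```python
import bisect

BLOCK_BITS = 1 << 16  # assumed block size in bits
ARRAY_LIMIT = 4096    # assumed threshold for switching to dense storage


class Block:
    """One block: a sorted offset list while sparse, a bitmap once dense."""

    def __init__(self):
        self.offsets = []   # sparse form: sorted offsets of the 1 bits
        self.bitmap = None  # dense form: bytearray of BLOCK_BITS // 8 bytes

    def is_dense(self):
        return self.bitmap is not None

    def add(self, offset):
        if self.bitmap is not None:
            self.bitmap[offset >> 3] |= 1 << (offset & 7)
            return
        i = bisect.bisect_left(self.offsets, offset)
        if i < len(self.offsets) and self.offsets[i] == offset:
            return  # already present
        self.offsets.insert(i, offset)
        if len(self.offsets) > ARRAY_LIMIT:
            self._to_dense()

    def _to_dense(self):
        """One-step conversion from the offset list to a full bitmap."""
        bitmap = bytearray(BLOCK_BITS // 8)
        for off in self.offsets:
            bitmap[off >> 3] |= 1 << (off & 7)
        self.bitmap = bitmap
        self.offsets = []

    def contains(self, offset):
        if self.bitmap is not None:
            return bool(self.bitmap[offset >> 3] & (1 << (offset & 7)))
        i = bisect.bisect_left(self.offsets, offset)
        return i < len(self.offsets) and self.offsets[i] == offset


block = Block()
for n in range(ARRAY_LIMIT):
    block.add(n * 3)
dense_before = block.is_dense()  # still a list at exactly 4096 entries
block.add(1)                     # entry 4097 triggers the conversion
dense_after = block.is_dense()
```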

Besides saving space, the roaring bitmap also reduces the cost of AND, OR and other bit operations. Previously the whole bitmap had to be processed; now only some of the blocks need to be. If a block is very sparse, the AND or OR only has to be done on small lists of integers, which reduces the work even further.
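A sketch of why AND gets cheaper, under simplifying assumptions: each bitmap is modeled as a dictionary from block index to a sorted list of offsets (sparse blocks only), and the function names are made up. Blocks that exist in only one operand are skipped without looking at their contents.

```python
def intersect_sorted(a, b):
    """Intersect two sorted integer lists with a linear merge."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def roaring_and(left, right):
    """AND two bitmaps given as {block index: sorted offsets}."""
    out = {}
    # Only blocks present on both sides can contribute to the result.
    for key in left.keys() & right.keys():
        common = intersect_sorted(left[key], right[key])
        if common:
            out[key] = common
    return out


left = {0: [1, 5, 9], 7: [100]}
right = {0: [5, 9, 12], 3: [100]}
result = roaring_and(left, right)
print(result)  # {0: [5, 9]}
```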

This is neither trading space for time nor trading time for space; it trades more complex logic for savings in both space and time.

A roaring bitmap has a maximum length of 2^32 bits, which corresponds to 512 MB as an ordinary bitmap. A bit offset is split into its high 16 bits and low 16 bits: the high 16 bits give the block index and the low 16 bits give the position within the block. A single block therefore covers 64K bits, that is 8 KB, and there are at most 64K blocks. The L1 cache of a modern processor is generally larger than 8 KB, so a single block fits entirely in L1 cache, which can improve performance significantly.
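The high/low split is plain bit arithmetic. A minimal sketch, with hypothetical function names:

```python
def split(position):
    """Split a 32-bit position into (block index, offset within block)."""
    return position >> 16, position & 0xFFFF


def join(block, offset):
    """Rebuild the 32-bit position from its two 16-bit halves."""
    return (block << 16) | offset


print(split(70000))      # (1, 4464)
print(split(2**32 - 1))  # (65535, 65535)

BLOCK_BYTES = (1 << 16) // 8  # 64K bits per block
print(BLOCK_BYTES)            # 8192
```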

If all bits in a block are zero, the block does not need to be stored. Whether each block exists can itself be expressed as a bitmap: when few blocks exist it is represented as a list of integers, and when many exist it can be converted to an ordinary bitmap. The list of integers takes less space, and it grows dynamically in a way similar to ArrayList, which avoids repeatedly copying the array contents on expansion. When the list grows beyond 4096 entries, it is immediately converted to an ordinary bitmap.
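The 4096 limit lines up with a simple break-even calculation, assuming each list entry is a 16-bit (2-byte) integer: at 4096 entries the list is exactly as large as the 8 KB bitmap, so beyond that point the bitmap is the smaller form. The names below are made up for the example.

```python
ENTRY_BYTES = 2                  # assumed: one 16-bit integer per entry
BITMAP_BYTES = (1 << 16) // 8    # 8192 bytes for 64K bits


def list_bytes(entries):
    """Payload size of the integer-list form."""
    return entries * ENTRY_BYTES


def list_is_smaller(entries):
    """True while the list form takes less space than the bitmap form."""
    return list_bytes(entries) < BITMAP_BYTES


print(list_bytes(4096), BITMAP_BYTES)                  # 8192 8192
print(list_is_smaller(4095), list_is_smaller(4097))    # True False
```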

The structure that records whether blocks exist can be the same as the structure that holds the data of a single block, because whether a block exists is, in essence, also just a 0 or 1 flag like an ordinary bit.

But Redis has no native roaring bitmap data structure. How can we use it?

Redis does not support it natively, but there is a Redis Module that implements roaring bitmaps.

github.com/aviggiano/r…

This project does not have many stars. Let's look at its official performance comparison.

OP            TIME/OP (us)  ST.DEV. (us)
R.SETBIT             31.89         28.49
SETBIT               29.98         29.23
R.GETBIT             29.90         14.60
GETBIT               28.63         14.58
R.BITCOUNT           32.13          0.10
BITCOUNT            192.38          0.96
R.BITPOS             70.27          0.14
BITPOS               87.70          0.62
R.BITOP NOT         156.66          3.15
BITOP NOT           364.46          5.62
R.BITOP AND          81.56          0.48
BITOP AND           492.97          8.32
R.BITOP OR          107.03          2.44
BITOP OR            461.68          8.42
R.BITOP XOR          69.07          2.82
BITOP XOR           440.75          7.90

Obviously this comparison uses sparse bitmaps; only sparse bitmaps produce figures this good. For dense bitmaps, a roaring bitmap will certainly perform somewhat worse than an ordinary bitmap, but usually not by much.

Let's look at the source code to see what its internal structure looks like.

// Array of blocks: one entry per stored block
typedef struct roaring_array_s {
    int32_t size;
    int32_t allocation_size;
    void **containers;  // each entry points to an integer array or an ordinary bitmap
    uint16_t *keys;
    uint8_t *typecodes;
    uint8_t flags;
} roaring_array_t;

// The whole bitmap (all blocks)
typedef struct roaring_bitmap_s {
    roaring_array_t high_low_container;
} roaring_bitmap_t;

As the source shows, which blocks exist and the data inside each block are all reached through roaring_bitmap_t: keys records the high 16 bits of each stored block, and containers points to each block's data. A block can use one of several encoding formats, and the typecodes field indicates which one.

#define BITSET_CONTAINER_TYPE_CODE 1
#define ARRAY_CONTAINER_TYPE_CODE 2
#define RUN_CONTAINER_TYPE_CODE 3
#define SHARED_CONTAINER_TYPE_CODE 4

Looking at the types defined here, besides the two forms mentioned above (the ordinary bitmap and the array list), there are two more types: RUN and SHARED. RUN is a compressed form of bitmap: for example, the consecutive positions 101, 102, 103, 104, 105, 106, 107, 108, 109 are stored in RUN form as 101,8 (101 followed by 8 consecutive increments), which can compress long runs substantially. Roaring bitmaps do not use the RUN type for ordinary blocks by default; blocks are converted to RUN format only by an explicit call to the optimization API, roaring_bitmap_run_optimize.
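The RUN idea can be sketched as a plain run-length encoder over sorted positions, producing (start, extra length) pairs in the same sense as the 101,8 example. This illustrates the encoding only; it is not the library's code, and the function names are made up.

```python
def run_encode(sorted_values):
    """Encode sorted, distinct integers as (start, extra_length) runs."""
    runs = []
    for value in sorted_values:
        if runs and value == runs[-1][0] + runs[-1][1] + 1:
            runs[-1][1] += 1         # value extends the current run
        else:
            runs.append([value, 0])  # value starts a new run
    return [tuple(run) for run in runs]


def run_decode(runs):
    """Expand (start, extra_length) runs back into the integer list."""
    values = []
    for start, extra in runs:
        values.extend(range(start, start + extra + 1))
    return values


print(run_encode(range(101, 110)))          # [(101, 8)]
print(run_encode([1, 2, 3, 10, 20, 21]))    # [(1, 2), (10, 0), (20, 1)]
```

Isolated values cost a full pair each, so this form only pays off when the data contains long consecutive runs.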

The SHARED type is used to share a block among multiple roaring bitmaps, and it provides copy-on-write: when a shared block is modified, a new copy is made.

There are many more details in the computation logic of roaring bitmaps; we will continue with them later when time allows.
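Copy-on-write for a shared block can be sketched with a reference count. This is a generic illustration of the technique under assumed names (SharedBlock, BlockHandle); how the library actually tracks sharing is not shown in the excerpt above.

```python
class SharedBlock:
    """Block data plus a count of how many bitmaps reference it."""

    def __init__(self, offsets):
        self.offsets = offsets
        self.refcount = 1


class BlockHandle:
    """One bitmap's reference to a possibly shared block."""

    def __init__(self, shared):
        self.shared = shared

    def share(self):
        """Give another bitmap a handle to the same data without copying."""
        self.shared.refcount += 1
        return BlockHandle(self.shared)

    def add(self, offset):
        """Modify the block, copying it first if someone else still uses it."""
        if self.shared.refcount > 1:
            self.shared.refcount -= 1
            self.shared = SharedBlock(list(self.shared.offsets))
        if offset not in self.shared.offsets:
            self.shared.offsets.append(offset)
            self.shared.offsets.sort()


original = BlockHandle(SharedBlock([1, 2]))
copy = original.share()
shared_before = original.shared is copy.shared  # True: no data copied yet
copy.add(3)                                     # triggers the private copy
print(original.shared.offsets, copy.shared.offsets)  # [1, 2] [1, 2, 3]
```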

Origin blog.csdn.net/weixin_34273046/article/details/91379679