Redis bitmaps and their applications, explained in detail

Redis bitmaps let us store per-element flags compactly, one bit each.

1. What is a Redis bitmap

A bitmap is not a separate data type. It is a set of operations on the String type that read or modify the bit at a given offset of the string stored at a key; a write returns the value the bit held before.

1.1 Advantages:
Space saving: one bit records the value or state of one element, and the element itself (for example its id) is encoded as the bit's offset. Eight such states fit in a single byte, so the structure is extremely compact.
High efficiency: setbit and getbit both run in O(1), and the other bit operations are efficient too.

1.2 Disadvantages:
In essence a bit can only distinguish 0 from 1, so a bitmap fits business data only when each record is a yes/no flag; it cannot store any richer value.

2. Redis bitmap commands

2.1 The setbit command
Sets or clears the bit at the given offset of the string value stored at key.

Syntax: setbit key offset value
Return value: the value the bit at the offset held before the call.
Note: if the offset lies beyond the current end of the string, the string is grown and the gap is filled with 0 bits. The offset must be at most 2^32-1, which limits the string to 512 MB.

2.2 The getbit command
Returns the bit at the given offset of the string value stored at key.

Syntax: getbit key offset
Return value: the bit stored at the offset. If the key does not exist, or the offset is beyond the end of the string, 0 is returned.
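The setbit and getbit behavior described above can be sketched in a few lines of Python. This is a toy in-memory model, not a Redis client: the class name `Bitmap` and its attributes are made up for illustration. It assumes Redis' bit numbering, in which offset 0 is the most significant bit of the first byte.

```python
class Bitmap:
    """Toy in-memory model of a Redis string used as a bitmap (hypothetical helper)."""

    MAX_OFFSET = 2**32 - 1  # largest offset allowed, i.e. a 512 MB string

    def __init__(self):
        self.data = bytearray()

    def setbit(self, offset, value):
        """Set the bit at offset to 0 or 1 and return its previous value."""
        if not 0 <= offset <= self.MAX_OFFSET:
            raise ValueError("bit offset is out of range")
        if value not in (0, 1):
            raise ValueError("bit value must be 0 or 1")
        byte_index, bit_index = divmod(offset, 8)
        # Grow the string with zero bytes if the offset is past its end.
        if byte_index >= len(self.data):
            self.data.extend(bytes(byte_index + 1 - len(self.data)))
        mask = 0x80 >> bit_index  # offset 0 is the most significant bit
        previous = 1 if self.data[byte_index] & mask else 0
        if value:
            self.data[byte_index] |= mask
        else:
            self.data[byte_index] &= ~mask & 0xFF
        return previous

    def getbit(self, offset):
        """Return the bit at offset; offsets past the end read as 0."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):
            return 0
        return (self.data[byte_index] >> (7 - bit_index)) & 1


bm = Bitmap()
print(bm.setbit(7, 1))   # 0: the bit was clear before
print(bm.setbit(7, 1))   # 1: now it reports the old value
print(bm.getbit(100))    # 0: past the end of the string
```

Setting bit 20 on this object grows the underlying string to three bytes, which is the zero-filling that the setbit note refers to.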
2.3 The bitcount command
Counts the number of bits set to 1 in the string value stored at key.

Syntax: bitcount key [start] [end]
Return value: the number of bits set to 1.
Note: setbit sets or clears individual bits; bitcount counts how many bits in the key are 1.
Note: [start] and [end] are byte indexes, not bit indexes. Byte index n covers bit offsets 8n through 8n+7, so a byte position is converted to a bit offset by multiplying by 8.
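To make the byte-based range concrete, here is a small Python sketch of the counting rule. The function `bitcount` below is a model written for this article, not the Redis implementation; it assumes that start and end are inclusive byte indexes and that negative indexes count from the end of the string.

```python
def bitcount(data: bytes, start: int = 0, end: int = -1) -> int:
    """Count 1 bits in data[start..end], where both ends are inclusive BYTE indexes."""
    n = len(data)
    if start < 0 and end < 0 and start > end:
        return 0
    # Negative indexes count from the end of the string.
    if start < 0:
        start = max(n + start, 0)
    if end < 0:
        end = max(n + end, 0)
    end = min(end, n - 1)
    if n == 0 or start > end:
        return 0
    return sum(bin(byte).count("1") for byte in data[start:end + 1])


value = b"foobar"
print(bitcount(value))        # 26: all six bytes
print(bitcount(value, 0, 0))  # 4: only the first byte, 'f' = 01100110
print(bitcount(value, 1, 1))  # 6: only the second byte, 'o' = 01101111
```

Here `bitcount(value, 0, 0)` covers bit offsets 0 through 7, not just bit 0.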

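For reference, this is how an older version of the Redis source (bitops.c) counts the set bits:

#include <stddef.h>
#include <stdint.h>

// Count the number of bits set to 1 in the byte array s of length count.
// This function only works on strings of at most 512 MB.
size_t redisPopcount(void *s, long count) {
    size_t bits = 0;
    unsigned char *p = s;
    uint32_t *p4;
    // Lookup table: for every value a single byte can hold,
    // the number of 1 bits in its binary representation.
    // For example, 3 is 0011 in binary, which has two 1 bits,
    // so bitsinbyte[3] == 2.
    static const unsigned char bitsinbyte[256] = {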
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8};

    /* Count initial bytes not aligned to 32 bit. */
    while((unsigned long)p & 3 && count) {
        bits += bitsinbyte[*p++];
        count--;
    }

    /* Count bits 16 bytes at a time */
    // For the optimized algorithm used here, see:
    // http://yesteapea.wordpress.com/2013/03/03/counting-the-number-of-set-bits-in-an-integer/
    p4 = (uint32_t*)p;
    while(count>=16) {
        uint32_t aux1, aux2, aux3, aux4;

        aux1 = *p4++;
        aux2 = *p4++;
        aux3 = *p4++;
        aux4 = *p4++;
        count -= 16;

        aux1 = aux1 - ((aux1 >> 1) & 0x55555555);
        aux1 = (aux1 & 0x33333333) + ((aux1 >> 2) & 0x33333333);
        aux2 = aux2 - ((aux2 >> 1) & 0x55555555);
        aux2 = (aux2 & 0x33333333) + ((aux2 >> 2) & 0x33333333);
        aux3 = aux3 - ((aux3 >> 1) & 0x55555555);
        aux3 = (aux3 & 0x33333333) + ((aux3 >> 2) & 0x33333333);
        aux4 = aux4 - ((aux4 >> 1) & 0x55555555);
        aux4 = (aux4 & 0x33333333) + ((aux4 >> 2) & 0x33333333);
        bits += ((((aux1 + (aux1 >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24) +
                ((((aux2 + (aux2 >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24) +
                ((((aux3 + (aux3 >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24) +
                ((((aux4 + (aux4 >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24);
    }

    /* Count the remaining bytes. */
    // Fewer than 16 bytes are left; count each of them with the lookup table.
    p = (unsigned char*)p4;
    while(count--) bits += bitsinbyte[*p++];
    return bits;
}

2.4 The bitop command
Performs a bitwise operation on one or more string keys and saves the result to destkey.

Syntax: bitop operation destkey key [key...], where operation is one of and, or, xor, and not.
bitop and destkey key [key...]: bitwise AND of one or more keys; the result is saved to destkey.
bitop or destkey key [key...]: bitwise OR of one or more keys; the result is saved to destkey.
bitop xor destkey key [key...]: bitwise XOR of one or more keys; the result is saved to destkey.
bitop not destkey key: bitwise NOT of a single key; the result is saved to destkey.
Except for NOT, which takes exactly one input key, every operation accepts one or more keys as input.
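The four operations can be modeled on plain byte strings. The sketch below is an illustration only: `bitop` is a hypothetical Python function, and it assumes that inputs of different lengths are treated as if the shorter ones were padded with zero bytes up to the longest.

```python
def bitop(operation: str, *sources: bytes) -> bytes:
    """Toy model of BITOP: combine byte strings bit by bit and return the result."""
    op = operation.upper()
    if op == "NOT":
        if len(sources) != 1:
            raise ValueError("NOT takes exactly one source")
        return bytes(~b & 0xFF for b in sources[0])
    if op not in ("AND", "OR", "XOR") or not sources:
        raise ValueError("unsupported operation or no sources")
    length = max(len(s) for s in sources)
    # Shorter inputs are treated as if padded with zero bytes.
    padded = [s.ljust(length, b"\x00") for s in sources]
    result = bytearray(padded[0])
    for s in padded[1:]:
        for i, b in enumerate(s):
            if op == "AND":
                result[i] &= b
            elif op == "OR":
                result[i] |= b
            else:
                result[i] ^= b
    return bytes(result)


a, b = b"\xff\x0f", b"\xf0"
print(bitop("AND", a, b))  # b'\xf0\x00'
print(bitop("OR", a, b))   # b'\xff\x0f'
print(bitop("XOR", a, b))  # b'\x0f\x0f'
print(bitop("NOT", b))     # b'\x0f'
```

Every byte of every input is visited once, which is why the cost grows linearly with the size of the bitmaps.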

Key point: BITOP runs in O(N) time. When it has to process very large bitmaps or large volumes of statistics, it is best to run it on a replica node so that the master is not blocked.

Advantages
1. Storage is based on the smallest unit, the bit, so it is very space-saving.
2. Setting and reading a bit are both O(1), so operations are very fast.
3. The data is stored in binary, so bitwise calculations on it are very fast.
4. It is easy to extend.

Limitation
A Redis bitmap is limited to 512 MB, so it holds at most 2^32 bits.

3. Usage scenarios of bitmap

There are many ways to use bitmaps, depending on the business need, but in general they fall into two layouts. Taking users as an example:

1. One key per user (horizontal extension): the user's various status values are all recorded in that key's value, and more can be added without limit (up to 2^32 bits).

Comment: this layout is rarely used. Every key carries the uid, so when the key takes more space than its value there is room to optimize storage. It is worth considering for long-tail records.

2. One key per business attribute (vertical extension): each key records only the status of one business attribute, and each uid selects one bit in it (beyond 2^32 users, the data must be sharded across keys).

Comment: most projects use this layout. Keys are separated by business, which makes it easy to reclaim resources. Each attribute needs only one key, and the uid is converted to a bit offset, so the value for a uid can be located directly. Most of the storage goes to the values, as intended.

1. Unlimited extension of video attributes

Requirements analysis:

A short-video app has billions of videos. Each video has various attributes (whether it is locked, whether it has special effects, and so on) that need to be flagged.

Possible solutions:

1. Store the attributes in MySQL: definitely not. Attributes keep being added as the business grows, and some are time-limited, so adding and dropping database columns directly is unreasonable. Packing them into a single field with a format such as JSON hurts read efficiency, and with hundreds of millions of rows it is very troublesome to reclaim discarded fields.

2. Store them directly in Redis, using business attribute + id as the key. Read and write efficiency is fine, but the keys hold more data than the values, which wastes space. Packing the values with a format such as JSON has its own problems: decoding takes time, and reclaiming hundreds of millions of records is also a problem.

Design:

Use a Redis bitmap for storage.
The key is composed of the attribute id and the video segment id. The offset within the value is the video id modulo the segment size. One attribute for one billion videos takes about 120 MB (10^9 bits / 8 ≈ 119 MiB), which is quite cost-effective.

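Pseudocode:

// Illustrative wrapper class; $redis is assumed to be a connected phpredis client.
class MediaAttributeBitmap
{
    const SEGMENT_SIZE = 10000; // video ids stored per key

    private $redis;

    public function __construct($redis)
    {
        $this->redis = $redis;
    }

    // Set or clear attribute $business_id for video $media_id; returns the previous bit.
    public function set($business_id, $media_id, $switch_status = 1)
    {
        $switch_status = $switch_status ? 1 : 0;
        $key = $this->_getKey($business_id, $media_id);
        $offset = $this->_getOffset($media_id);
        return $this->redis->setBit($key, $offset, $switch_status);
    }

    // Read attribute $business_id for video $media_id.
    public function get($business_id, $media_id)
    {
        $key = $this->_getKey($business_id, $media_id);
        $offset = $this->_getOffset($media_id);
        return $this->redis->getBit($key, $offset);
    }

    // Key = attribute id + segment id.
    private function _getKey($business_id, $media_id)
    {
        return 'm:' . $business_id . ':' . intval($media_id / self::SEGMENT_SIZE);
    }

    // Offset = position of the video inside its segment.
    private function _getOffset($media_id)
    {
        return $media_id % self::SEGMENT_SIZE;
    }
}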

This covers attribute storage; adding a new attribute later just means using another business_id value.

Why shard, and how should the shard granularity be chosen?

There are two reasons for sharding: 1. When ids are sparsely distributed, a long bitmap holds a large number of useless 0 bits that still occupy memory. 2. A bitmap is limited to 2^32 bits.

How to choose the shard granularity: 1. If the primary key ids have gaps, pick the granularity so that, as far as possible, the gaps cover whole shards and no key is ever written for them. A shard that looks like 00000...0001 is stored in full for the sake of a single attribute value, which is wasteful. 2. Use the id growth per unit of time as a reference; this also helps to budget how much space will be used, although it will not be much.
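The effect of sharding on sparse ids can be checked with a little arithmetic. The sketch below mirrors the key and offset scheme of the pseudocode above; the function names `locate` and `bytes_needed` are made up for illustration, and the size estimate assumes a bitmap occupies one byte per eight offsets up to the highest bit that has been set.

```python
SEGMENT_SIZE = 10000  # ids per key, the same value the pseudocode uses


def locate(business_id: int, media_id: int) -> tuple:
    """Return the (key, offset) pair that holds the flag of media_id."""
    segment, offset = divmod(media_id, SEGMENT_SIZE)
    return f"m:{business_id}:{segment}", offset


def bytes_needed(highest_offset: int) -> int:
    """Bytes a bitmap occupies once the bit at highest_offset has been set."""
    return highest_offset // 8 + 1


media_id = 4_000_001_234          # a single, isolated id
key, offset = locate(7, media_id)
print(key, offset)                # m:7:400000 1234
print(bytes_needed(media_id))     # 500000155 bytes in one unsharded key
print(bytes_needed(offset))       # 155 bytes in the sharded key
```

With one unsharded key, a single high id forces roughly 500 MB of zero bytes; with shards of 10,000 ids, the same flag costs at most 1,250 bytes.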

2. User online status

Requirements analysis:

Sub-projects need an interface that reports whether a given user is online.

Design:

A bitmap is a space-saving and efficient solution. It needs only one key, with the user id as the offset: the bit is set to 1 when the user is online and to 0 when offline. 300 million users need only about 36 MB.

Pseudocode:

$status = 1;
$redis->setBit('online', $uid, $status);
$redis->getBit('online', $uid);

Add the same sharding as in Example 1: a single key covering hundreds of millions of users is too large. Split it into shards of 100,000 users each.

3. Counting active users

Requirements analysis:

We need statistics on active users.

Design:

Use the date as the cache key and the user id as the offset, and set the bit to 1 if the user was active that day. Bitwise operations with bitOp then give the users' activity over a period of time.

Pseudocode:

$status = 1;
$redis->setBit('active_20170708', $uid, $status);
$redis->setBit('active_20170709', $uid, $status);
$redis->bitOp('AND', 'active', 'active_20170708', 'active_20170709'); 

With hundreds of millions of users, add sharding as in Example 1. With a few hundred thousand or fewer, skip sharding to keep things simple.
Other similar cases:
key: date;
offset: user id [numeric or binary];
value: whether the user logged in / performed some operation;
one bitmap is generated per date.

Monthly active users: OR together the bitmaps of all 30 days, then run bitcount on the result.
Retention rate: next-day retention for yesterday = number of users who logged in both yesterday and today / number of users who logged in yesterday. That is, AND yesterday's bitmap with today's, run bitcount on the result, and divide by the bitcount of yesterday's bitmap.
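The two calculations can be tried out without a Redis server by modeling each daily bitmap as a Python integer. This is only a sketch: the sample user ids are invented, `bitmap_of` and `popcount` are hypothetical helpers, and the bit order inside the integer differs from Redis', which does not affect the counts.

```python
def bitmap_of(user_ids):
    """Build a bitmap as a Python int: bit uid is 1 if that user was active."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits


def popcount(bits):
    """Number of 1 bits, the role bitcount plays in Redis."""
    return bin(bits).count("1")


# Hypothetical sample data: active user ids per day.
daily = {
    "20170708": bitmap_of([1, 2, 3, 5]),
    "20170709": bitmap_of([2, 3, 8]),
    "20170710": bitmap_of([3, 9]),
}

# Active users over the period: OR all days together, then count.
period_active = 0
for bits in daily.values():
    period_active |= bits
print(popcount(period_active))  # 6 distinct users: 1, 2, 3, 5, 8, 9

# Next-day retention for 20170708: AND with the following day, divide by the first day.
both = daily["20170708"] & daily["20170709"]
retention = popcount(both) / popcount(daily["20170708"])
print(retention)  # 0.5: 2 of the 4 users came back
```

A monthly figure works the same way with 30 daily bitmaps instead of three.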

4. User sign-in

Requirements analysis:

Users sign in daily, and the sign-in data must be analyzed to drive the corresponding operations strategy.

Design:

Use a Redis bitmap. Because this is a long-tail record, the key is built mainly from the uid, and a start date is fixed. Each later day adds one to the offset, so every day maps to one bit position in the value.

Pseudocode:

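$start_date = '20170708'; // first tracked day, offset 0
$end_date = '20170709';   // day of this sign-in
// Days elapsed since the start date (end minus start, so the offset is not negative)
$offset = (int) floor((strtotime($end_date) - strtotime($start_date)) / 86400);
$redis->setBit('sign_123456', $offset, 1);

// Count the days on which the user signed in
$redis->bitCount('sign_123456', 0, -1);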

No sharding is required. Tracking 365 days a year for 300 million users takes about 300000000*365/8/1000/1000/1000 ≈ 13.69 GB, so the storage cost is very low.
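The date-to-offset mapping is the part that is easiest to get wrong, so here is a small Python sketch of it. The start date, the helper `sign_offset`, and the sample sign-in days are all invented for illustration, and the user's bitmap is modeled as a Python integer rather than a Redis key.

```python
from datetime import date

START = date(2017, 7, 8)  # hypothetical first tracked day, offset 0


def sign_offset(day: date) -> int:
    """Offset of a day in the user's sign-in bitmap: whole days since START."""
    offset = (day - START).days
    if offset < 0:
        raise ValueError("day is before the start date")
    return offset


signed = 0  # one user's bitmap, modeled as a Python int
for day in (date(2017, 7, 8), date(2017, 7, 9), date(2017, 7, 12)):
    signed |= 1 << sign_offset(day)

days_signed = bin(signed).count("1")
print(sign_offset(date(2017, 7, 9)))  # 1
print(days_signed)                    # 3
```

Subtracting calendar dates gives whole days directly, which sidesteps the rounding that dividing a timestamp difference by 86400 needs around daylight-saving changes.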

Pitfalls you may run into when using bitmaps

1. The bitcount trap
If you read the usage notes above carefully, you will have seen that the range bitcount takes is in bytes, not bits. This is where the pitfall lies.

So bitcount key 0 0 counts the 1 bits in the first byte. Mind the unit: the first byte covers the eight bit positions 0, 1, 2, 3, 4, 5, 6, 7.

4. Comparison of set and bitmap storage


The comparison shows that when there are many unique users, a bitmap has the clear advantage and saves a lot of memory. When there are few unique users, a set is the better choice, because a bitmap would carry redundant storage overhead.
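A rough back-of-the-envelope estimate illustrates the trade-off. Everything in this sketch is an assumption made for the sake of the arithmetic: the id range, the user counts, and especially the 8 bytes per set member, which is an optimistic lower bound (a real Redis set carries more overhead per member).

```python
def bitmap_bytes(max_user_id: int) -> int:
    """A bitmap must span every id up to the largest one, active or not."""
    return max_user_id // 8 + 1


def set_bytes(active_users: int, bytes_per_member: int = 8) -> int:
    """Rough lower bound for a set: one fixed-size entry per member.

    bytes_per_member=8 is an assumption (a bare 64-bit id, no overhead).
    """
    return active_users * bytes_per_member


max_id = 100_000_000  # hypothetical largest user id
for active in (50_000_000, 100_000):
    print(active, bitmap_bytes(max_id), set_bytes(active))
# 50000000 active: bitmap 12500001 bytes vs set 400000000 bytes
# 100000 active:   bitmap 12500001 bytes vs set 800000 bytes
```

The bitmap's size depends only on the id range, while the set's size depends on how many users are actually recorded, which is why the set wins when few users are active.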

Practical notes
type = string: a bitmap is a string, so its maximum size is 512 MB.
Watch the offset when calling setbit: a very large offset can make the call time-consuming.
A bitmap is not always the best choice.


Origin blog.csdn.net/qq_26249609/article/details/103563391