How the Redis SCAN command achieves its limited guarantees

The SCAN command makes the following guarantee to users: a full iteration, from the start of the traversal to its completion, returns every element that was present in the dataset for the entire duration of the iteration, although the same element may be returned multiple times. An element that is added to the dataset during the iteration, or removed from it during the iteration, may or may not be returned.

How is this achieved? Let's start with the Redis dictionary dict, which is the underlying implementation of the Redis database.

Dictionary data structures

The Redis dictionary is represented by the dict.h/dict structure:

typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

A dictionary is made up of two hash tables (dictht). ht[1] is used only during rehashing; under normal operation only ht[0] is used.

A hash table consists of an array of dictEntry pointers. The size attribute records the size of the array, the used attribute records the number of nodes currently stored, and the sizemask attribute equals size - 1. The array size is always a power of two (2^n), so sizemask in binary has the form 0b11111...; it is used as a mask that, together with a key's hash value, determines the key's position in the array.

The index of a key in the array is calculated as follows:

index = hash & d->ht[table].sizemask; 

That is, the low bits of the hash value are taken according to the mask.
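
As a concrete illustration (a standalone snippet, not Redis code), for a table of size 4 only the two low bits of the hash matter:

#include <stdio.h>

int main(void) {
    unsigned long hash = 22;              /* binary 10110 */
    unsigned long sizemask = 4 - 1;       /* binary 00011 */
    /* 10110 & 00011 = 00010, so the key lands in bucket 2 */
    printf("bucket = %lu\n", hash & sizemask);
    return 0;
}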

Rehash issues

When a dictionary is rehashed, both hash tables are used. First, space is allocated for ht[1]: for an expand operation, the size of ht[1] is the first power of two (2^n) greater than or equal to ht[0].used * 2; for a shrink operation, the size of ht[1] is the first 2^n greater than or equal to ht[0].used. Then all keys in ht[0] are rehashed into ht[1]; finally ht[0] is released, ht[1] becomes the new ht[0], and a new empty hash table is created as ht[1]. The rehash is not done in a single pass but incrementally, spread over multiple steps.
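
As a rough sketch of this sizing rule (modeled on _dictNextPower in dict.c, simplified and not the actual Redis code), the size of ht[1] could be computed like this:

#include <stdio.h>

#define DICT_HT_INITIAL_SIZE 4

/* Smallest power of two >= size (the real _dictNextPower also
 * guards against overflow). */
static unsigned long next_power(unsigned long size) {
    unsigned long i = DICT_HT_INITIAL_SIZE;
    while (i < size) i *= 2;
    return i;
}

int main(void) {
    unsigned long used = 5;                           /* ht[0].used */
    printf("expand: %lu\n", next_power(used * 2));    /* prints 16 */
    printf("shrink: %lu\n", next_power(used));        /* prints 8  */
    return 0;
}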

For example, suppose a hash table ht[0] of size 4 (sizemask 0b11, index = hash & 0b11) is rehashed into a hash table ht[1] of size 8 (sizemask 0b111, index = hash & 0b111).

A key whose hash value has 00 as its two low bits sits in bucket 0 of ht[0]. After the rehash the three low bits are used, so its index in ht[1] can be either 000 (0) or 100 (4). In other words, the elements of bucket 0 in ht[0] are scattered across bucket 0 and bucket 4 of ht[1] after the rehash, and so on. The correspondence (see the small check after the table) is:

    ht[0]  ->  ht[1]
    ----------------
      0    ->   0,4 
      1    ->   1,5
      2    ->   2,6
      3    ->   3,7
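
A quick way to check this mapping (a standalone snippet, not Redis code) is to mask the same hash with both sizemasks; the size-8 index is always either the size-4 index or the size-4 index plus 4:

#include <stdio.h>

int main(void) {
    for (unsigned long hash = 0; hash < 16; hash++) {
        unsigned long old_idx = hash & 3;   /* ht[0]: size 4, sizemask 0b11  */
        unsigned long new_idx = hash & 7;   /* ht[1]: size 8, sizemask 0b111 */
        /* new_idx is always old_idx or old_idx + 4 */
        printf("hash %2lu: ht[0] bucket %lu -> ht[1] bucket %lu\n",
               hash, old_idx, new_idx);
    }
    return 0;
}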

If the SCAN command traversed the buckets in the sequential order 0 -> 1 -> 2 -> 3, problems would arise:

  • During an expand operation, suppose the cursor 1 has been returned while a rehash is in progress. Part of the data in bucket 0 of ht[0] may already have been rehashed into bucket 0 or bucket 4 of ht[1]. Continuing the traversal in ht[1] from bucket 1, we eventually reach bucket 4, whose elements were already visited while traversing bucket 0 of ht[0]; this produces duplicates.
  • During a shrink operation, suppose the cursor 5 has been returned, but the shrunken hash table has only 4 buckets. How should the cursor be reset?

SCAN traversal order

The traversal order of the SCAN command can be seen from an example:

127.0.0.1:6379[3]> keys *
1) "bar"
2) "qux"
3) "baz"
4) "foo"
127.0.0.1:6379[3]> scan 0 count 1
1) "2"
2) 1) "bar"
127.0.0.1:6379[3]> scan 2 count 1
1) "1"
2) 1) "foo"
127.0.0.1:6379[3]> scan 1 count 1
1) "3"
2) 1) "qux"
   2) "baz"
127.0.0.1:6379[3]> scan 3 count 1
1) "0"
2) (empty list or set)

The order is 0 -> 2 -> 1 -> 3. It is hard to spot a pattern, but it becomes clear in binary:

00 -> 10 -> 01 -> 11

In binary the pattern is clear: the traversal order is still produced by repeatedly adding 1, but the 1 is added to the highest bit, with the carry propagating from left to right, i.e. from the high bit toward the low bit.

SCAN source code

The dictionary traversal behind SCAN is implemented in dict.c/dictScan. There are two cases: the dictionary is not being rehashed, or a rehash is in progress.

When no rehash is in progress, the next cursor is calculated as follows:

m0 = t0->sizemask;

/* set the cursor bits above the mask to 1 */
v |= ~m0;

/* reverse the cursor */
v = rev(v);
/* add 1 after reversing; the +1 effectively carries from the high bit */
v++;
/* reverse back */
v = rev(v);

When the size is 4 and sizemask is 3 (00000011), the cursor evolves as follows:

         v |= ~m0    v = rev(v)    v++       v = rev(v)

00000000(0) -> 11111100 -> 00111111 -> 01000000 -> 00000010(2)

00000010(2) -> 11111110 -> 01111111 -> 10000000 -> 00000001(1)

00000001(1) -> 11111101 -> 10111111 -> 11000000 -> 00000011(3)

00000011(3) -> 11111111 -> 11111111 -> 00000000 -> 00000000(0)

So for a table of size 4, the cursor state transitions are 0 -> 2 -> 1 -> 3.

Similarly, for size 8 the cursor transitions are 0->4->2->6->1->5->3->7, i.e. 000->100->010->110->001->101->011->111.
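
The rev function used in that calculation reverses all the bits of the cursor. The following standalone sketch (using a simple loop-based bit reversal; the rev in dict.c does the same job with a faster divide-and-conquer method) reproduces both cursor sequences:

#include <stdio.h>

/* Reverse all bits of an unsigned long, one bit at a time. */
static unsigned long rev(unsigned long v) {
    unsigned long r = 0;
    for (unsigned i = 0; i < sizeof(v) * 8; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* One cursor step for a table with the given sizemask. */
static unsigned long next_cursor(unsigned long v, unsigned long sizemask) {
    v |= ~sizemask;   /* set the bits above the mask to 1 */
    v = rev(v);
    v++;              /* the +1 now carries from the high bit downward */
    v = rev(v);
    return v;
}

int main(void) {
    for (unsigned long mask = 3; mask <= 7; mask += 4) {   /* sizes 4 and 8 */
        unsigned long v = 0;
        printf("size %lu:", mask + 1);
        do {
            printf(" %lu", v);
            v = next_cursor(v, mask);
        } while (v != 0);
        printf("\n");   /* size 4: 0 2 1 3    size 8: 0 4 2 6 1 5 3 7 */
    }
    return 0;
}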

Combining this with the rehash mapping shown earlier:

    ht[0]  ->  ht[1]
    ----------------
      0    ->   0,4 
      1    ->   1,5
      2    ->   2,6
      3    ->   3,7

As can be seen, when the table grows, every cursor position of the small table also appears in the large table's traversal sequence, and in the same relative order, so no bucket is missed and none is read twice.

When the table shrinks, suppose the size goes from 8 to 4. There are two cases. In the first, the returned cursor is one of 0, 2, 1, 3; traversal simply continues, with no gaps and no overlaps.

In the second, the returned cursor is not one of those four values, for example 7. Then 7 & 0b11 (the new mask) gives 3, and traversal continues from bucket 3 of the size-4 table. But bucket 3 of the size-4 table contains what used to be in buckets 3 and 7 of the size-8 table, so the contents of bucket 3 of the size-8 table are read again, producing duplicates.

So when Redis rehashes from a smaller table to a larger one, SCAN neither repeats nor misses elements. When the table shrinks, SCAN may return duplicates, but it still misses nothing.

When a rehash is in progress, the cursor is calculated as follows:

        /* Make sure t0 is the smaller and t1 is the bigger table */
        if (t0->size > t1->size) {
            t0 = &d->ht[1];
            t1 = &d->ht[0];
        }

        m0 = t0->sizemask;
        m1 = t1->sizemask;

        /* Emit entries at cursor */
        if (bucketfn) bucketfn(privdata, &t0->table[v & m0]);
        de = t0->table[v & m0];
        while (de) {
            next = de->next;
            fn(privdata, de);
            de = next;
        }

        /* Iterate over indices in larger table that are the expansion
         * of the index pointed to by the cursor in the smaller table */
        do {
            /* Emit entries at cursor */
            if (bucketfn) bucketfn(privdata, &t1->table[v & m1]);
            de = t1->table[v & m1];
            while (de) {
                next = de->next;
                fn(privdata, de);
                de = next;
            }

            /* Increment the reverse cursor not covered by the smaller mask.*/
            v |= ~m1;
            v = rev(v);
            v++;
            v = rev(v);

            /* Continue while bits covered by mask difference is non-zero */
        } while (v & (m0 ^ m1));

The algorithm first makes sure that t0 is the smaller hash table, swapping t0 and t1 if it is not. It then traverses the cursor's bucket in t0, followed by the corresponding buckets in the larger table t1.

Computing the next cursor works basically the same way, except that m0 is replaced by m1, the mask of the larger hash table, and an extra loop condition is added:

v & (m0 ^ m1)

For size 4, m0 is 00000011; for size 8, m1 is 00000111; m0 ^ m1 is therefore 00000100, i.e. the bit positions where the two masks differ. The loop condition checks whether the cursor has a 1 in any of those positions.

Suppose the returned cursor is 2 and a rehash from size 4 to size 8 is in progress. The two masks differ in the third-lowest bit.

First bucket 2 of t0 is traversed, then bucket 2 of t1. The formula then yields the next cursor 6 (00000110); its third-lowest bit is 1, so the loop continues and bucket 6 of t1 is traversed. The next cursor computed after that is 1, whose third-lowest bit is 0, so the loop ends.
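
Under these assumptions (a rehash from size 4 to size 8, returned cursor 2), the bucket order can be traced with a standalone sketch of the cursor logic (reusing the simple rev from the earlier sketch; an illustration, not the real dictScan):

#include <stdio.h>

/* Simple bit reversal, as in the earlier sketch. */
static unsigned long rev(unsigned long v) {
    unsigned long r = 0;
    for (unsigned i = 0; i < sizeof(v) * 8; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

int main(void) {
    unsigned long m0 = 3, m1 = 7;   /* sizemasks of the size-4 and size-8 tables */
    unsigned long v = 2;            /* cursor returned by the previous SCAN call */

    printf("visit ht[0] bucket %lu\n", v & m0);      /* bucket 2 */
    do {
        printf("visit ht[1] bucket %lu\n", v & m1);  /* bucket 2, then bucket 6 */
        v |= ~m1;        /* advance the reverse cursor using the larger mask */
        v = rev(v);
        v++;
        v = rev(v);
    } while (v & (m0 ^ m1));   /* keep looping while the differing bit is set */
    printf("next cursor: %lu\n", v);                 /* prints 1 */
    return 0;
}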

So while a rehash is in progress, both hash tables are traversed, which ensures no elements are missed.


Source: www.linuxidc.com/Linux/2019-07/159652.htm