Redis: hash

foreword

A hash table is a key-value data structure that supports lookups in O(1) time: pass in a key and it returns the corresponding value. It is very widely used.

But because capacity is finite, hash collisions are inevitable. Redis resolves them with separate chaining (the "zipper" method).

As data grows, the original capacity eventually cannot hold it all, so Redis triggers an expansion once certain conditions are met. Expansion works by keeping two hash tables and migrating between them gradually.

inserting elements

The bottom layer of a hash table is usually an array. To insert an element, Redis first computes the hash of its key, then takes that hash modulo the array length; the result is the array index at which the element is inserted.
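In sketch form (Redis in fact masks the hash rather than using `%`, which is equivalent because its table sizes are always powers of two; the function name here is illustrative):

```c
#include <stdint.h>

/* Illustrative sketch: map a hash value to an array index.
 * When table_size is a power of two, hash % table_size is the same
 * as hash & (table_size - 1), which is what Redis relies on. */
static uint64_t bucket_index(uint64_t hash, uint64_t table_size)
{
    return hash & (table_size - 1);
}
```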


structure

struct dict {
    // hash table type; different types supply different comparison functions
    dictType *type;

    // the 2 hash tables
    dictEntry **ht_table[2];

    // how many elements each of the 2 hash tables currently holds
    unsigned long ht_used[2];

    // index of the hash slot that progressive rehash is currently processing
    long rehashidx;

    // switch used to temporarily pause progressive rehash
    int16_t pauserehash;

    // size of each hash table, stored as the exponent n of 2^n
    signed char ht_size_exp[2];
};


  1. type: the hash table's type, itself a structure that defines a set of function pointers, similar to an interface in Java; the caller passes in a concrete implementation.
  2. ht_table: an array of size 2, because two hash tables are needed during rehash.
  3. ht_used: how many elements each of the two hash tables holds.
  4. rehashidx: the hash slot currently being processed by rehash; progressive rehash spreads the migration across individual operations, so it must record which slot to process next.
  5. pauserehash: whether progressive rehash is paused.
  6. ht_size_exp: the capacity of the two hash tables, stored as the exponent n (the table size is 2^n).
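Storing the exponent rather than the size itself keeps the field to a single byte. A small sketch of the mapping (this mirrors the behavior of Redis's `DICTHT_SIZE` macro, though the function name here is made up):

```c
#include <stdint.h>

/* Illustrative helper: recover a table's capacity from its stored
 * exponent. An exponent of -1 means the table is unallocated. */
static unsigned long ht_size_from_exp(signed char exp)
{
    return exp == -1 ? 0 : (unsigned long)1 << exp;
}
```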

The structure of each element in the hash table is as follows:

typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;     /* Next entry in the same hash bucket. */
    void *metadata[];           /* An arbitrary number of bytes (starting at a
                                 * pointer-aligned address) of size as returned
                                 * by dictType's dictEntryMetadataBytes(). */
} dictEntry;

Each element in the hash table contains three parts: the key, the value, and a pointer to the next element. The next pointer exists to resolve hash collisions.

The value is defined as a union, meaning it can be any one of its members: a pointer, a 64-bit unsigned integer, a 64-bit signed integer, or a double. The advantage of this definition is that an integer value can be embedded directly in the entry, saving memory because no pointer to a separately allocated value is needed.

hash collision

As described in the section on inserting elements, each element's key is hashed and the hash is then taken modulo the array size to get the insertion index. But hash values are finite and the array capacity is finite, while the data is unbounded, which leads to two situations:

  1. different keys producing the same hash value
  2. different hash values producing the same index after the modulo

In both cases, different elements map to the same array index, which cannot be stored directly.

To solve this, Redis uses chaining: each element stores a pointer to the next element that landed on the same index. When an element's computed slot is already occupied, the new element is linked into that slot's chain, so that all colliding elements form a linked list.
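The chaining just described can be sketched as follows. The `entry` type is a simplified stand-in for `dictEntry`, and the sketch uses head insertion, which is what Redis itself does because it keeps insertion O(1) with no need to walk the chain:

```c
#include <stdlib.h>

/* Simplified stand-in for dictEntry: just a key and a next pointer. */
typedef struct entry {
    int key;
    struct entry *next;
} entry;

/* Head-insert a new element into one bucket's chain. */
static entry *chain_insert(entry **bucket, int key)
{
    entry *e = malloc(sizeof(*e));
    if (e == NULL) return NULL;
    e->key = key;
    e->next = *bucket;   /* new element points at the old chain head */
    *bucket = e;         /* the bucket now points at the new element */
    return e;
}
```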


To look up the element whose key is jshd, Redis first computes its index, say slot 1, then compares keys: finding that a is not equal to jshd, it follows the chain and keeps comparing keys, until it either finds a match or reaches NULL.
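That walk along the chain can be sketched like this (types are simplified; in real Redis the comparison goes through the dictType's key-compare function rather than a bare `strcmp`):

```c
#include <string.h>
#include <stddef.h>

/* Simplified chain node holding a string key. */
typedef struct node {
    const char *key;
    struct node *next;
} node;

/* Walk one bucket's chain, comparing keys until a match or NULL. */
static node *chain_find(node *head, const char *key)
{
    for (node *e = head; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;    /* reached the end of the chain: not found */
}
```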

So the underlying structure of a hash table is a combination of an array and linked lists.

Consider the extreme case where every element lands in the same slot: the hash table degenerates into a linked list, and lookups degrade from O(1) to O(n).

To avoid this, the table must be expanded.

expansion

Redis expands the table in two cases:

  • when elements / capacity >= 1 and no RDB snapshot or AOF rewrite is running, Redis expands
  • when elements / capacity > 5, Redis expands immediately, regardless of background work
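A sketch of those two triggers (in the Redis source the check lives in `_dictExpandIfNeeded`, and the factor of 5 is `dict_force_resize_ratio`; the function below is an illustration, not the real code):

```c
#include <stdbool.h>

/* Illustrative check: should the hash table expand?
 * used  = number of stored elements, size = current capacity,
 * child_process_running = an RDB snapshot or AOF rewrite is in progress. */
static bool needs_expand(unsigned long used, unsigned long size,
                         bool child_process_running)
{
    if (size == 0) return true;                               /* unallocated */
    if (used >= size && !child_process_running) return true;  /* load >= 1  */
    if (used / size > 5) return true;                 /* load > 5: force it */
    return false;
}
```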

Remember from earlier that a dict structure holds two hash tables? Normally only table 1 is in use; table 2 has no space allocated at all.

When expansion is needed, the following steps run:

  1. Allocate memory for table 2, typically twice the size of table 1.
  2. Gradually migrate the data from table 1 into table 2; once migration finishes, free table 1's memory.
  3. Switch table 2 to be the table in normal use.

If table 1 holds few elements, the migration is very fast; but if it holds many, migrating them all at once would block client requests, and Redis would appear unable to serve traffic. That is unacceptable.

progressive rehash

During expansion the data in table 1 must be migrated, but if table 1 holds a lot of data, doing it in one go would stall the Redis service. So Redis migrates with an approach called progressive rehash.

Progressive rehash does not migrate all the data at once; instead, each access to the hash table migrates a few elements, so that at some point all the data in table 1 has been moved to table 2.

This spreads the migration work across many operations, greatly reducing the cost of any single one.

During migration, both tables are in normal use. Delete, update, and query operations work on both: for example, a lookup first searches table 1 and, if the key is not found there, continues in table 2.

Inserts, however, go only into table 2. This guarantees that table 1 only shrinks and never grows, so its elements will eventually all be migrated.
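One migration step can be sketched as below. This is a simplified illustration of what Redis's `dictRehash` does for a single bucket, with made-up type names: move every entry in the bucket at rehashidx from table 0 to table 1, then advance rehashidx.

```c
#include <stddef.h>

/* Simplified chain node carrying its precomputed hash. */
typedef struct item {
    unsigned long hash;
    struct item *next;
} item;

/* Simplified two-table dict; both sizes are powers of two. */
typedef struct {
    item **table[2];
    unsigned long size[2];
    unsigned long used[2];
    long rehashidx;         /* -1 when no rehash is in progress */
} dict_sketch;

/* Migrate one non-empty bucket from table 0 to table 1. */
static void rehash_step(dict_sketch *d)
{
    if (d->rehashidx == -1) return;

    /* skip over empty buckets */
    while ((unsigned long)d->rehashidx < d->size[0] &&
           d->table[0][d->rehashidx] == NULL)
        d->rehashidx++;

    /* table 0 fully migrated: mark rehash as finished */
    if ((unsigned long)d->rehashidx >= d->size[0]) {
        d->rehashidx = -1;
        return;
    }

    /* move this bucket's whole chain into table 1 */
    item *e = d->table[0][d->rehashidx];
    d->table[0][d->rehashidx] = NULL;
    while (e != NULL) {
        item *next = e->next;
        unsigned long idx = e->hash & (d->size[1] - 1); /* index in table 1 */
        e->next = d->table[1][idx];   /* head-insert into the new table */
        d->table[1][idx] = e;
        d->used[0]--;
        d->used[1]++;
        e = next;
    }
    d->rehashidx++;
}
```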

If inserts come fast enough, could it happen that table 1 has not finished migrating while table 2 already needs to expand?

No: when Redis detects that a rehash is already in progress, it will not start another expansion.


Origin juejin.im/post/7245936314850459706