Redis principle - the underlying implementation of the data structure

Original article: https://www.codermast.com/database/redis/redis-datastruct-underlying-implementation.html

# Dynamic String SDS

#basic concept

The keys stored in Redis are strings, and the values are often strings or collections of strings, which makes the string the most common data structure in Redis.

Redis is written in C, and C has its own strings, but Redis does not use them directly, because C strings have several problems:

  • Getting the length of a C string requires traversing it, an O(n) operation
  • C strings are not binary safe and cannot contain special characters

Because a C string uses the null character to mark its end, binary content such as an image, which may itself contain null bytes, cannot be read correctly. The SDS APIs instead treat the bytes in buf as plain binary data: SDS does not rely on a null byte to detect the end of a string, but on the length recorded in the len attribute.

  • C strings are hard to modify: a C string is essentially a character array ending with \0

To solve these problems, Redis built a new string type called Simple Dynamic String (SDS for short).

#underlying implementation

SDS is implemented in the /src/sds.h and /src/sds.c files; the core structures are as follows:

/* Note: sdshdr5 is never used, we just access the flags byte directly.
 * However is here to document the layout of type 5 SDS strings. */
struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
  • len: the number of bytes stored in buf, excluding the terminator
  • alloc: the total number of bytes allocated for buf, excluding the header and the terminator
  • flags: the SDS header type, which determines the size of the SDS header
  • buf: the character array that actually stores the data

The flags field identifies the header type; the corresponding values are:

  • SDS_TYPE_5 = 0
  • SDS_TYPE_8 = 1
  • SDS_TYPE_16 = 2
  • SDS_TYPE_32 = 3
  • SDS_TYPE_64 = 4

For example, an sds struct holding the string "name" has len = 4, alloc = 4, an SDS_TYPE_8 flag, and buf containing "name" followed by the \0 terminator.

#Memory pre-allocation

SDS is called a dynamic string because it can grow dynamically. Consider an SDS whose content is "hi": len = 2, alloc = 2, and buf holds "hi" plus the terminator.

If we want to append the string ",Amy" to this SDS, there is not enough space, so a new block of memory must be allocated:

  • If the new string is shorter than 1 MB, the allocated space is twice the new string length, plus 1 byte for the terminator
  • If the new string is 1 MB or longer, the allocated space is the new string length plus 1 MB, plus 1 byte for the terminator

This strategy is called memory pre-allocation, as sketched below.
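A minimal sketch of this growth rule, modeled on sdsMakeRoomFor() in src/sds.c (names simplified; the real function also handles header-type changes):

#include <stddef.h>

#define SDS_MAX_PREALLOC (1024 * 1024)  /* 1 MB */

/* How many bytes to allocate for the buffer when the string must
 * grow from curlen to curlen + addlen bytes of content. */
static size_t sds_new_alloc(size_t curlen, size_t addlen) {
    size_t newlen = curlen + addlen;    /* bytes actually needed */
    if (newlen < SDS_MAX_PREALLOC)
        newlen *= 2;                    /* short string: double it */
    else
        newlen += SDS_MAX_PREALLOC;     /* long string: add 1 MB */
    return newlen;                      /* the '\0' terminator is added on top */
}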

Advantages

  1. The time complexity of getting the length of a string is O(1)
  2. Support dynamic expansion
  3. Reduce the number of memory allocations
  4. Binary safe, strings can store special characters

# SDS Summary

Redis represents strings as SDS rather than C strings (char* terminated by \0). SDS is the string representation used by the Redis core and appears in almost every Redis module.

Besides holding string values in the database, SDS is also used as a buffer: for example, the AOF buffer in the AOF module and the input buffer in the client state.

#Integer set IntSet

#basic concept

IntSet is an implementation of the Set collection type in Redis. It is implemented based on an integer array and has the characteristics of variable length and order.

When a collection contains only integer-valued elements, and the number of elements in this collection is not large, Redis will use the integer collection as the underlying implementation of the collection key.

To make lookups fast, Redis stores all the integers in an IntSet in the contents array in ascending order. Suppose an IntSet holds three integers that all fit in the int16_t range. The encoding is then INTSET_ENC_INT16, and each part occupies:

  • encoding: 4 bytes
  • length: 4 bytes
  • contents: 2 bytes * 3 = 6 bytes
  • Total 4 + 4 + 6 = 16 bytes

#underlying implementation

typedef struct intset {
    uint32_t encoding;
    uint32_t length;
    int8_t contents[];
} intset;
  • encoding: encoding method, supports storage of 16-bit, 32-bit, and 64-bit integers

    The encoding contains three modes, indicating that the stored integer sizes are different:

    /* Note that these encodings are ordered, so:
    * INTSET_ENC_INT16 < INTSET_ENC_INT32 < INTSET_ENC_INT64. */
    #define INTSET_ENC_INT16 (sizeof(int16_t))
    #define INTSET_ENC_INT32 (sizeof(int32_t))
    #define INTSET_ENC_INT64 (sizeof(int64_t))
    
    • int16_t: 2-byte integer, similar in range to Java's short
    • int32_t: 4-byte integer, similar in range to Java's int
    • int64_t: 8-byte integer, similar in range to Java's long
  • length: the number of elements
  • contents[]: the integer array that holds the set's data, a contiguous region of memory. Each element of the set is one item of the contents array; items are kept in ascending order and the array contains no duplicates. (Although the struct declares contents as an array of int8_t, the array never actually stores int8_t values; its real element type depends on the encoding attribute.)

#Array expansion

When an element that exceeds the current encoding is added, for example an int32 value inserted into an int16-encoded set, every element in the set is upgraded to the larger type, and the array is expanded if memory is insufficient. The specific steps are as follows (see the sketch after this list):

  1. Determine the new encoding from the data type of the new element, and expand the array according to the new encoding and the new number of elements
  2. Copy the existing elements to their post-expansion positions in reverse order; copying back to front prevents data that has not yet been moved from being overwritten, so no data is lost
  3. Put the new element at the end of the array
  4. Update the encoding information and increment length, keeping the other attributes consistent
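A minimal, self-contained sketch of an int16-to-int32 upgrade, illustrating the reverse-order copy. The real code (intsetUpgradeAndAdd in src/intset.c) is generic over all three encodings and also handles negative values that belong at the front:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct intset {
    uint32_t encoding;   /* element size in bytes, e.g. 2 or 4 */
    uint32_t length;
    int8_t contents[];
} intset;

static intset *upgrade16to32_and_append(intset *is, int32_t value) {
    uint32_t len = is->length;
    /* 1. reallocate for the wider element size plus the new element */
    is = realloc(is, sizeof(intset) + (len + 1) * sizeof(int32_t));
    /* 2. widen items back to front so unread items are never overwritten */
    for (int i = (int)len - 1; i >= 0; i--) {
        int16_t old;
        memcpy(&old, is->contents + i * sizeof(int16_t), sizeof(old));
        int32_t wide = old;
        memcpy(is->contents + i * sizeof(int32_t), &wide, sizeof(wide));
    }
    /* 3. the value that forced the upgrade is larger than all others */
    memcpy(is->contents + len * sizeof(int32_t), &value, sizeof(value));
    /* 4. update encoding and length */
    is->encoding = sizeof(int32_t);
    is->length = len + 1;
    return is;
}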

Underlying implementation

The core routines live in src/intset.c:

  • insert element: see intsetAdd, which upgrades the encoding first if the new value does not fit
  • array expansion: see intsetUpgradeAndAdd and intsetResize
  • array query: see intsetSearch (a binary search)

Think about it

Capacity grows when data is added, but is it reduced when data is deleted? If the element that triggered an upgrade is deleted, is the set downgraded again?

Answer: no. This is a deliberate trade-off to reduce overhead.

# IntSet summary

IntSet can be regarded as a special integer array with some characteristics:

  • Redis will ensure that the elements in IntSet are unique and ordered
  • With a type upgrade mechanism, it can save memory space
  • The bottom layer uses a binary search method to query

# dictionary/hash table Dict

#basic concept

A Dict consists of three parts: the hash table (dictht), the hash node (dictEntry), and the dictionary (dict).

Hash algorithm

Redis calculates hash and index values as follows:

  1. Use the hash function configured for the dictionary to compute the hash of the key: hash = dict->type->hashFunction(key);

  2. Combine the hash with the table's sizemask attribute to compute the index: index = hash & dict->ht[x].sizemask; For example, with a table of size 4 (sizemask = 3), a key whose hash is 11 lands in bucket 11 & 3 = 3.

Hash collision

A hash collision occurs when two or more different keys are mapped by the hash function to the same slot of the hash table. Collisions complicate the storage and lookup of data, so they must be resolved.

Dict resolves hash collisions with separate chaining (the chained-address method).

Other approaches

Besides separate chaining, hash collisions can also be resolved with open addressing, rehashing, or a shared overflow area.

#underlying implementation

  • hash table
typedef struct dictht {
    // entry array; each slot stores a pointer to a dictEntry
    dictEntry **table;
    // hash table size
    unsigned long size;
    // mask for the table size, always equal to size - 1
    unsigned long sizemask;
    // number of entries stored
    unsigned long used;
} dictht;
  • hash node
typedef struct dictEntry {
    void *key;  // key
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;   // value
    // pointer to the next entry in the chain
    struct dictEntry *next;
} dictEntry;

When we add a key-value pair to a Dict, Redis first computes the hash value h from the key, and then uses h & sizemask to decide which slot of the array the element should be stored in.

  • dictionary
typedef struct dict {
    // dict type, which supplies the hash function and other callbacks
    dictType *type;
    // private data, passed to special hash/compare functions
    void *privdata;
    // a Dict holds two hash tables: one holds the current data,
    // the other is normally empty and is used during rehash
    dictht ht[2];
    // rehash progress; -1 means no rehash in progress
    long rehashidx;
    // whether rehash is paused; 1 paused, 0 running
    int16_t pauserehash;
} dict;
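A minimal sketch of how a key is looked up, assuming the structs above and a dictType that exposes the key-comparison callback (key_equal stands in for it here):

dictEntry *dict_find_sketch(dict *d, void *key, uint64_t hash,
                            int (*key_equal)(void *, void *)) {
    for (int table = 0; table <= 1; table++) {
        unsigned long idx = hash & d->ht[table].sizemask; /* bucket index */
        for (dictEntry *he = d->ht[table].table[idx]; he; he = he->next)
            if (key_equal(key, he->key))
                return he;               /* found in the collision chain */
        if (d->rehashidx == -1) break;   /* not rehashing: skip ht[1] */
    }
    return NULL;
}

This is simplified from dictFind() in src/dict.c: while a rehash is in progress, both tables must be searched.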

#Expansion and contraction

When there are too many or too few key-value pairs stored in the hash table, the hash table must be expanded or contracted by rehash (re-hashing).

expansion

The HashTable in Dict is an array combined with singly linked lists. When the set has many elements, hash collisions are inevitable; if the chains grow too long, query performance drops sharply.

Dict checks the load factor every time a new key-value pair is added, and triggers an expansion of the hash table when either condition holds:

  • The hash table's LoadFactor is >= 1 and the server is not running a background process such as BGSAVE or BGREWRITEAOF
  • The hash table's LoadFactor is > 5, regardless of whether a BGSAVE or BGREWRITEAOF command is being executed

load factor

Load factor = the number of nodes saved in the hash table / the size of the hash table.

shrink

Besides expansion, Dict also checks the load factor every time an element is deleted; when the LoadFactor drops below 0.1, the hash table is shrunk.

The specific steps of expansion and contraction are as follows:

  1. For an expansion, a new hash table is created whose size is the first power of 2 greater than or equal to ht[0].used * 2 (that is, each expansion roughly doubles the space in use). Conversely, for a contraction, a new, smaller hash table is created sized to the space actually in use (ht[0].used).

  2. Reuse the hash algorithm, calculate the index value, and then put the key-value pair in the new hash table position.

  3. After all key-value pairs are migrated, release the memory space of the original hash table.

#ReHash

Whether expanding or shrinking, a new hash table must be created, which changes the table's size and sizemask; since a key's bucket is computed with the sizemask, every key's index must be recalculated and the key inserted into the new table. This process is called rehash. The specific steps are as follows.

  1. Compute the realSize of the new hash table; the value depends on whether this is an expansion or a contraction:
    • expansion: the new size is the first 2^n greater than or equal to dict.ht[0].used + 1
    • contraction: the new size is the first 2^n greater than or equal to dict.ht[0].used (but not less than 4)
  2. Allocate memory according to the new realSize, create a new dictht, and assign it to dict.ht[1]
  3. Set dict.rehashidx = 0 to mark the start of the rehash
  4. Rehash every dictEntry of dict.ht[0] into dict.ht[1]
  5. Assign dict.ht[1] to dict.ht[0], reset dict.ht[1] to an empty hash table, and free the memory of the original dict.ht[0]
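A tiny helper showing how the new size is rounded up to a power of two, in the spirit of _dictNextPower() in src/dict.c:

/* smallest power of two >= needed, starting from the minimum size 4 */
unsigned long next_power(unsigned long needed) {
    unsigned long size = 4;   /* DICT_HT_INITIAL_SIZE */
    while (size < needed)
        size *= 2;
    return size;
}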

The rehash of a Dict is not done in one pass. If the Dict contains millions of entries and the rehash had to finish in a single pass, the main thread could block. Therefore, Dict rehashes in multiple small steps, which is why it is called progressive rehash.

  1. Compute the realSize of the new hash table; the value depends on whether this is an expansion or a contraction:

    • expansion: the new size is the first 2^n greater than or equal to dict.ht[0].used + 1
    • contraction: the new size is the first 2^n greater than or equal to dict.ht[0].used (but not less than 4)
  2. Allocate memory according to the new realSize, create a new dictht, and assign it to dict.ht[1]

  3. Set dict.rehashidx = 0 to mark the start of the rehash

  4. On every add, query, modify, or delete, first check whether dict.rehashidx is greater than -1; if so, rehash the entire chain at dict.ht[0].table[rehashidx] into dict.ht[1], then do rehashidx++, and repeat until all the data in dict.ht[0] has been rehashed into dict.ht[1]

  5. Assign dict.ht[1] to dict.ht[0], reset dict.ht[1] to an empty hash table, and free the memory of the original dict.ht[0]

  6. Set rehashidx back to -1, marking the end of the rehash

  7. While the rehash is in progress, new keys are written directly to ht[1], whereas queries, modifications, and deletions search dict.ht[0] first and then dict.ht[1]. This guarantees that the data in ht[0] only shrinks and never grows, so it eventually becomes empty

What is progressive rehash?

It means the expansion or contraction is not completed all at once, but in many small steps. If Redis holds only a few dozen key-value pairs, a rehash finishes in an instant; but with millions, tens of millions, or hundreds of millions of pairs, a one-shot rehash would stall Redis for a noticeable time. So Redis rehashes progressively: while it is underway, deletions, lookups, and updates may touch both hash tables (if a key is not found in the first table, the second is searched), while additions go only to the new table.

It can be simply understood as slowly migrating the old hash table to the new hash table.
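A minimal sketch of one progressive-rehash step over a single bucket, assuming the structs above. It is simplified from dictRehash()/_dictRehashStep() in src/dict.c:

void dict_rehash_step_sketch(dict *d) {
    if (d->rehashidx == -1 || d->pauserehash) return;   /* nothing to do */
    /* skip empty buckets until a non-empty chain is found */
    while (d->ht[0].table[d->rehashidx] == NULL)
        d->rehashidx++;
    dictEntry *he = d->ht[0].table[d->rehashidx];
    while (he) {                                   /* move the whole chain */
        dictEntry *next = he->next;
        uint64_t h = d->type->hashFunction(he->key);
        unsigned long idx = h & d->ht[1].sizemask; /* index in new table */
        he->next = d->ht[1].table[idx];            /* prepend to new bucket */
        d->ht[1].table[idx] = he;
        d->ht[0].used--;
        d->ht[1].used++;
        he = next;
    }
    d->ht[0].table[d->rehashidx] = NULL;
    d->rehashidx++;
}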

# Dict Summary

Dict structure

  • Similar to Java's HashTable: the underlying structure is an array plus linked lists to resolve hash collisions

  • A Dict contains two hash tables: ht[0] is the one normally used, ht[1] is used for rehash

Dict scaling

  • Dict expands when the LoadFactor is greater than 5, or when it is greater than or equal to 1 and no child-process task is running
  • Dict shrinks when the LoadFactor is less than 0.1
  • The expansion size is the first 2^n greater than or equal to used + 1
  • The shrink size is the first 2^n greater than or equal to used
  • Dict uses progressive rehash, performing one rehash step each time the Dict is accessed
  • During a rehash, ht[0] only shrinks and never grows; new entries go only to ht[1], while the other operations consult both hash tables

# Compression list ZipList

ZipList can be regarded as a special kind of double-ended linked list: it consists of a series of specially encoded, contiguous memory blocks. Push and pop operations can be performed at either end in O(1) time.

Its overall layout is: zlbytes | zltail | zllen | entry | entry | ... | entry | zlend

  • zlbytes: uint32_t, 4 bytes; records the total number of bytes occupied by the whole compressed list.
  • zltail: uint32_t, 4 bytes; records how many bytes the tail node is from the start address of the list, so the tail node's address can be found from this offset.
  • zllen: uint16_t, 2 bytes; records the number of nodes in the list. The largest storable count is UINT16_MAX - 1 (65534); if the real count exceeds that, 65535 is stored and the actual count can only be obtained by traversing the whole list.
  • entry: the list nodes, of variable length; each node's length is determined by the content it stores.
  • zlend: uint8_t, 1 byte; the special value 0xFF (decimal 255) marking the end of the compressed list.

#ZipListEntry

Unlike an ordinary linked list, an Entry in ZipList does not record pointers to the previous and next nodes, because two pointers would take 16 bytes and waste memory. Instead it uses the following layout:

previous_entry_length | encoding | contents

  • previous_entry_length: the length of the previous node, stored in 1 or 5 bytes
    • If the previous node is shorter than 254 bytes, its length is stored in 1 byte
    • If the previous node is 254 bytes or longer, 5 bytes are used: the first byte is 0xfe, and the next four bytes hold the actual length
  • encoding: records the data type of contents (string or integer) and its length; occupies 1, 2, or 5 bytes
  • contents: the node's actual data, which can be a string or an integer
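A small sketch of decoding previous_entry_length according to these rules (illustrative; the real code uses the ZIP_DECODE_PREVLEN macros in src/ziplist.c):

#include <stdint.h>
#include <string.h>

/* Returns the previous entry's length; *hdrlen receives 1 or 5,
 * the number of bytes the field itself occupies. */
uint32_t decode_prevlen(const unsigned char *p, unsigned int *hdrlen) {
    if (p[0] < 0xFE) {        /* < 254: the length fits in one byte */
        *hdrlen = 1;
        return p[0];
    }
    uint32_t len;             /* first byte 0xFE, next 4 bytes = length */
    memcpy(&len, p + 1, sizeof(len));
    *hdrlen = 5;
    return len;
}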

Why is ZipList particularly memory-efficient

After understanding the Entry structure of ZipList, it is easy to understand why ZipList saves memory.

  • The memory saving of ZipList is relative to an ordinary list. In an ordinary array, every element occupies the same amount of memory, sized for the largest element, so space is clearly reserved and wasted
  • ZipList therefore stores each element in as little space as its actual content needs, adding the encoding field so that storage can be sized differently per encoding
  • That raises a problem: how do you locate the next element while traversing? In an ordinary array every element has a fixed length, so the question never arises; in a ZipList each datum occupies a different amount of memory, so to support traversal each node records the length of the previous element, hence the prevlen (previous_entry_length) field

# Encoding

The encoding attribute of a ZipList entry comes in two flavors, string and integer:

  • String: if encoding starts with "00", "01" or "10", the content is a string

    • |00pppppp| : the encoding is 1 byte; the last six bits of the byte hold the length of the string stored in the entry, so the string can be at most 63 bytes long
    • |01pppppp|qqqqqqqq| : the encoding is 2 bytes; its last 14 bits hold the string length, which therefore cannot exceed 16383
    • |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| : the encoding is 5 bytes; the 4 bytes after the first hold the string length, which cannot exceed 2^32 - 1
  • Integer: if encoding starts with "11", the content is an integer and the encoding occupies exactly 1 byte

    • 11000000: int16_t (2 bytes)
    • 11010000: int32_t (4 bytes)
    • 11100000: int64_t (8 bytes)
    • 11110000: 24-bit signed integer (3 bytes)
    • 11111110: 8-bit signed integer (1 byte)
    • 1111xxxx: the value is stored directly in the xxxx bits; xxxx ranges from 0001 to 1101, and the actual value is xxxx minus 1 (i.e. 0 to 12)
  • 11111111: zlend
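A tiny sketch classifying the first encoding byte by these rules (illustrative; the real code uses the ZIP_* encoding macros in src/ziplist.c):

#include <stdint.h>

int is_string_encoding(uint8_t b) { return (b & 0xC0) != 0xC0; } /* top bits not 11 */
int is_int16_encoding(uint8_t b)  { return b == 0xC0; }          /* 11000000 */
int is_zlend(uint8_t b)           { return b == 0xFF; }          /* 11111111 */
/* immediate values 1111xxxx, xxxx in 0001..1101, store xxxx - 1 */
int is_immediate(uint8_t b)       { return b >= 0xF1 && b <= 0xFD; }
int immediate_value(uint8_t b)    { return (b & 0x0F) - 1; }     /* 0..12 */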

#Chain update problem

Each ZipList node stores the length of the previous node: 1 byte if the previous node is shorter than 254 bytes, 5 bytes if it is 254 bytes or longer. So when a node's data changes and its size crosses from below 254 bytes to 254 or more, the next node's previous_entry_length field must grow from 1 byte to 5 bytes; that growth can in turn push the next node itself past 254 bytes, and so on down the list. Because the entries sit in contiguous memory, every such growth means shifting all subsequent nodes, and if the existing space is insufficient, new memory must be requested.

These cascading space-expansion operations in this special case are called chain updates (cascade updates). Both insertions and deletions can trigger them. ZipList reserves no spare memory and shrinks immediately when a node is removed, which means every write operation performs a memory reallocation.

# ZipList Summary

  1. The compressed list ZipList can be regarded as a "double-ended linked list" laid out in contiguous memory.
  2. Its nodes are not connected by pointers; instead each node records the lengths of the previous and current nodes for addressing, keeping memory usage low.
  3. If the list holds a lot of data, it grows long, which hurts query performance: lookups can only traverse, costing O(n).
  4. Adding or removing large values may trigger chain updates.

Think about it

  • Although ZipList saves memory, it must be allocated as one contiguous block. When it occupies a lot of memory, allocating that block becomes very inefficient. What can be done?

To alleviate this, we must limit the length of a ZipList and the size of its entries.

  • What if we want to store a lot of data and exceed the optimal upper limit of a single ZipList?

We can create multiple ZipLists and store the data in pieces.

  • Once the data is split up, it is scattered and inconvenient to manage and search. How do the multiple ZipLists relate to each other?

Redis 3.2 introduced a new data structure, QuickList: a double-ended linked list in which every node is a ZipList.

#Quick list QuickList

#basic concept

The QuickList structure was added in Redis 3.2; before that version, the List type was implemented with a plain doubly linked list (LinkedList) or, for small lists, a ZipList.

QuickList is a double-ended linked list whose nodes are ZipLists. Macroscopically, a QuickList is a doubly linked list; microscopically, each node of the QuickList is a ZipList.

QuickList Diagram

#underlying implementation

The core structures are defined in src/quicklist.h (see the sketch below):

  • quicklistNode
  • quicklistLZF
  • quicklistBookmark
  • quicklist
  • quicklistIter
  • quicklistEntry
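An abridged sketch of the three central structs, following src/quicklist.h around Redis 6.x (field sets vary slightly between versions):

typedef struct quicklistNode {
    struct quicklistNode *prev;
    struct quicklistNode *next;
    unsigned char *zl;           /* the node's ziplist (or quicklistLZF) */
    unsigned int sz;             /* ziplist size in bytes */
    unsigned int count : 16;     /* number of entries in the ziplist */
    unsigned int encoding : 2;   /* RAW == 1 or LZF == 2 */
    unsigned int container : 2;
    unsigned int recompress : 1; /* was this node temporarily decompressed? */
    /* ... a few more bit-fields ... */
} quicklistNode;

typedef struct quicklistLZF {
    unsigned int sz;             /* size of the compressed data in bytes */
    char compressed[];           /* LZF-compressed ziplist */
} quicklistLZF;

typedef struct quicklist {
    quicklistNode *head;
    quicklistNode *tail;
    unsigned long count;         /* total entries across all ziplists */
    unsigned long len;           /* number of quicklistNodes */
    int fill;                    /* ziplist size limit per node */
    unsigned int compress;       /* depth of head/tail nodes left uncompressed */
    /* ... bookmarks array at the end ... */
} quicklist;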
  • quicklistNode: macroscopically, a quicklist is a linked list, and this structure describes one of its nodes. It holds the underlying ziplist through the zl field; in short, it wraps one ziplist instance
  • quicklistLZF: a ziplist is one contiguous block of memory, and after compression with the LZF algorithm it is wrapped in a quicklistLZF structure. Whether the ziplists in a quicklist are compressed is configurable; when compression is enabled, quicklistNode.zl points not to a ziplist but to a compressed quicklistLZF instance
  • quicklistBookmark: a bookmark appended at the end of the quicklist struct. It is only used when a large number of nodes must be iterated in batches and the extra memory is negligible; when unused, bookmarks add no memory overhead
  • quicklist: the definition of the doubly linked list itself. head and tail point to the head and tail nodes; len is the number of nodes in the list; count is the number of entries across all ziplists in the whole quicklist. The fill field bounds how much space the ziplist in each node may occupy, and compress controls whether ziplists are further compressed with the LZF algorithm to save memory
  • quicklistIter: an iterator over the quicklist
  • quicklistEntry: a re-wrapping of the ziplist entry concept; quicklist is a well-encapsulated data structure, and users are not supposed to perceive its internal implementation

# Limits and compression

limit

To avoid having too many entries in each ZipList of a QuickList, Redis provides the configuration item list-max-ziplist-size to limit them.

  • A positive value sets the maximum number of entries allowed per ZipList
  • A negative value sets the maximum memory size of each ZipList, with 5 cases:
    • -1: each ZipList may not exceed 4 KB
    • -2: each ZipList may not exceed 8 KB
    • -3: each ZipList may not exceed 16 KB
    • -4: each ZipList may not exceed 32 KB
    • -5: each ZipList may not exceed 64 KB
    • The default value is -2, which can be checked with the command config get list-max-ziplist-size

compression

Besides controlling the size of each ZipList, QuickList can also compress the ZipLists of its nodes, controlled by the configuration item list-compress-depth. Because linked lists are mostly accessed at the two ends, the head and tail are left uncompressed; this parameter sets how many nodes at each end stay uncompressed:

  • 0: special value meaning no compression
  • 1: the first and last node of the QuickList are not compressed; the middle nodes are
  • 2: the first two and last two nodes are not compressed; the middle nodes are
  • ... and so on
  • The default value is 0, which can be checked with the command config get list-compress-depth

# QuickList Summary

  • QuickList is a double-ended list whose nodes are ZipList
  • The node adopts ZipList, which solves the memory occupation problem of the traditional linked list
  • Control the size of ZipList to solve the problem of continuous memory space application efficiency
  • Intermediate nodes can be compressed, further saving memory

# Skip List SkipList

In a singly linked list, even if the stored data is ordered, finding an element requires traversing from the head, which is inefficient: the time complexity is O(n). For example, finding the value 12 may take 7 lookups. To solve this, we can add multi-level index pointers to the list so the desired node can be found quickly.

#basic concept

SkipList (skip list) is first of all a linked list, but it differs from a traditional linked list in several ways:

  • The elements in a SkipList are stored in ascending order
  • A node may contain multiple pointers with different spans; up to 32 levels of pointers are supported

A pointer at a higher level spans more nodes in a single step.

 

SkipList memory structure

#underlying implementation

  • zskiplist
typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;
  • zskiplistNode
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned long span;
    } level[];
} zskiplistNode;
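A minimal sketch of how the number of levels for a new node is chosen, mirroring zslRandomLevel() in src/t_zset.c: each additional level is granted with probability 1/4, capped at 32:

#include <stdlib.h>

#define ZSKIPLIST_MAXLEVEL 32
#define ZSKIPLIST_P 0.25

int zsl_random_level(void) {
    int level = 1;
    while ((random() & 0xFFFF) < (int)(ZSKIPLIST_P * 0xFFFF))
        level += 1;   /* take another level with probability 1/4 */
    return (level < ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}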

# SkipList Summary

  • The skip list is a doubly linked list in which each node holds a score and an ele value
  • Nodes are sorted by score; nodes with equal scores are sorted by ele in dictionary order
  • Each node can contain multiple levels of pointers; the number of levels is a random number between 1 and 32
  • Pointers at different levels span different distances to the next node; the higher the level, the larger the span
  • Insertion, deletion, and lookup perform about the same as in a red-black tree, but the implementation is much simpler
