Redis dynamic string

Q: What is SDS?

A: SDS (simple dynamic string) is the dynamic string type that Redis uses in its implementation. Because Redis is written almost entirely in C, SDS still relies on a char buf[] to store the data underneath. The structure of an SDS object is described below.

An SDS structure has three members: len, free and buf. len is the number of valid characters in the string managed by the SDS object; free is the number of additional characters the SDS can store without growing its space; buf is a char[] that refers to a contiguous block of memory where the string is actually stored. (Valid characters are the characters of the string, not counting the terminating \0.)

Q: C already has strings, so why is SDS needed?

A: From reading the related material and the Redis documentation, the benefits of using SDS instead of native C strings can be summarized as follows:

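* Getting the length of the string stored in an SDS object is more efficient
* Buffer overflows are prevented
* Fewer memory allocations and frees are triggered by string modifications
* Binary safety
* Better compatibility with the C library's string functions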

If you have ever implemented a dynamic array on top of plain arrays in another language, everything here except perhaps "binary safety" should look familiar. The benefits are discussed one by one below.

Q: How to get the length of the string more efficiently?

A: This is a pain point of traditional C strings. A C string is a linear data structure with no stored length, so the only way to get its exact length is to traverse all of its valid elements, which takes O(N) time. Once the C string is just one member of the SDS structure, however, another member, len, can track the exact length as the string changes. The bookkeeping is simple: increment len when an element is added to the string and decrement it when an element is removed. The length of the string stored in the SDS is then available simply by reading len. An implementation along these lines:

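static char buf[64]; /* fixed capacity, for illustration only */
static int len = 0;  /* number of valid characters in buf */

/* Append one character; len grows by 1. No bounds check here. */
void add(char a){
    buf[len++] = a;
}

/* Remove the last character; len shrinks by 1. */
void sub(void){
    if (len > 0) len--;
}

/* O(1): just read the counter. */
int length(void){
    return len;
}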

Q: How to prevent buffer overflow?

A: Put more bluntly, a buffer overflow means tampering with memory that does not belong to you. It is most common when concatenating strings or appending characters to a string. The remedy is straightforward: when a string needs more memory than it has, and memory is available, allocate a larger contiguous block and copy the valid data from the original block into it. To detect whether the remaining space would be exceeded, use the value of the free attribute, which records how much space is still available in the array. Looking closely, this way of preventing buffer overflow involves several "ugly" steps:

  1. A contiguous block of memory may have to be allocated many times
  2. The valid data may have to be copied from the old block to the new one many times
  3. If space keeps being allocated and is never reclaimed, "memory leaks" may result

These problems are addressed as follows:

  1. Allocate new memory space according to a certain strategy to minimize the number of allocations
  2. When the free space reaches a certain threshold, reclaim the excess memory space

In Redis, the size of the "pre-allocated" space is determined through two steps:

  1. If the length of the string after modification (len) is less than 1MB, then besides the space that is strictly necessary, free space equal to len is also allocated. For example, if the modified string has len = 10, the size of the new buffer is 10 + 10 + 1 = 21 bytes.
  2. If the length of the string after modification (len) is 1MB or more, then besides the space that is strictly necessary, 1MB of free space is also allocated.
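A minimal sketch of these two rules in C, for illustration only. The names sds_alloc_size and ONE_MB are invented here and are not Redis identifiers; the function returns the size of buf alone, without the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ONE_MB ((size_t)1024 * 1024)   /* assumed name for the 1MB threshold */

/* Bytes to allocate for buf, given the string length after modification.
 * Hypothetical helper written for this article, not part of Redis. */
size_t sds_alloc_size(size_t len)
{
    assert(len <= SIZE_MAX - ONE_MB - 1);               /* guard against overflow */
    size_t free_space = (len < ONE_MB) ? len : ONE_MB;  /* rule 1, else rule 2 */
    return len + free_space + 1;                        /* +1 for the trailing \0 */
}
```

With len = 10 this gives 21, the figure from rule 1; with len equal to 3MB it gives 4MB + 1.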

In SDS modification operations, the available space is compared with the space actually required. If the requirement exceeds what is available, new space is allocated; otherwise the existing space is reused. With this strategy, the number of memory reallocations (and of copies of the valid data into the new space) needed for N modifications drops from exactly N to at most N.
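To get a feel for the saving, the sketch below only simulates the bookkeeping (no memory is actually allocated) and counts how many reallocations a run of one-byte appends would trigger under the pre-allocation rule. count_reallocs and ONE_MB are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ONE_MB ((size_t)1024 * 1024)   /* assumed name for the 1MB threshold */

/* Simulate n one-byte appends to an initially empty string and return
 * how many times the buffer would have to be reallocated. */
unsigned count_reallocs(unsigned n)
{
    size_t len = 0, free_space = 0;
    unsigned reallocs = 0;

    for (unsigned i = 0; i < n; i++) {
        if (free_space < 1) {                     /* no room for one more byte */
            size_t newlen = len + 1;              /* length after this append */
            newlen = (newlen < ONE_MB) ? newlen * 2 : newlen + ONE_MB;
            free_space = newlen - len;
            reallocs++;
        }
        assert(free_space > 0);
        len++;                                    /* the append itself */
        free_space--;
    }
    return reallocs;
}
```

For 1000 one-byte appends the count is 9, against 1000 reallocations if the buffer were always sized exactly.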

A thought from the author: many programmers look for a perfect solution when solving a problem. Had I been the author of Redis, I might have wondered, on seeing this problem, whether some perfect solution exists. Yet an industrial-grade project such as Redis adopts a solution that is quite ordinary, the kind of implementation we routinely use in practice. An approach that seems only to postpone the risk is sometimes the most "perfect" one. A programmer should care more about solving the problem than about solving it "perfectly".

Besides reducing the number of allocations by reserving space, there is another worry: if space is only ever allocated, memory will eventually be exhausted, much like the memory leak problem we often talk about. The fix is again simple: reclaim allocated space according to some strategy. For example, when the usage of the memory bound to an SDS falls below 25%, shrink that memory to half its original size. Why shrink only to half instead of reclaiming all the free space? Because if reclamation is too aggressive, the advantages of pre-allocation are wiped out (the number of memory allocations goes up again).
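The example policy from this paragraph could look like the following. It is only a sketch of the "below 25%, shrink to half" rule used as an example here, not a claim about what Redis itself does; shrink_target is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the capacity the buffer should have after applying the example
 * policy: below 25% usage, halve the capacity; otherwise leave it alone. */
size_t shrink_target(size_t used, size_t capacity)
{
    assert(used <= capacity);
    if (used < capacity / 4)     /* integer division: slightly conservative for tiny buffers */
        return capacity / 2;     /* still more than twice what is in use */
    return capacity;
}
```

A buffer of 100 bytes holding 10 bytes would shrink to 50, leaving room for further appends without an immediate reallocation.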

Therefore, SDS modification operations (mainly those that delete elements) do not reclaim the free space immediately but keep it as reserved space. To prevent "memory leaks", Redis provides a dedicated API that actually releases the memory (sdsRemoveFreeSpace in sds.c).

Q: How to ensure that SDS is binary safe?

A: "Binary safety" may sound unfamiliar, but given how C strings work and what binary content looks like, the idea is clear: binary safety means that special characters such as \0 appearing in the content do not interfere with the correct interpretation of the original string. Problems that sound sophisticated often have simple solutions. To be binary safe, Redis does not use the C string terminator \0 as the boundary of the stored string; instead the len attribute records the number of valid characters in the string.

Binary safety lets SDS ignore the C convention that \0 ends a string. In most cases, however, people still use Redis to store text (content that follows C string rules and contains no \0), and operations on it may rely on the C library's string functions. The SDS implementation therefore keeps two conventions:

  1. When allocating memory for a string, 1 extra byte is allocated for the \0
  2. When the content of the string is modified, a \0 is appended at the end
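The difference between a \0-terminated view and a length-tracked view can be seen in a few lines of plain C. This is a sketch with invented names (payload_a, payload_b, bytes_equal, binary_safety_demo); it uses ordinary arrays rather than real SDS objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two 7-byte payloads that differ only after an embedded \0.
 * Each literal also ends with the usual trailing \0 (convention 2). */
static const char payload_a[] = "abc\0def";
static const char payload_b[] = "abc\0xyz";

/* Length-aware comparison: looks at every byte, including \0.
 * Returns 1 if the two buffers are equal, 0 otherwise. */
int bytes_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Shows what the C library sees versus what a length-tracked string sees. */
void binary_safety_demo(void)
{
    size_t len = sizeof(payload_a) - 1;   /* 7: tracked explicitly, like len in SDS */

    assert(strlen(payload_a) == 3);             /* strlen stops at the first \0 */
    assert(strcmp(payload_a, payload_b) == 0);  /* so strcmp treats them as equal */
    assert(!bytes_equal(payload_a, len, payload_b, len)); /* all 7 bytes compared */
}
```

Because the payloads still end in \0, text-only content stored this way can be handed to functions such as strlen or strcmp unchanged, which is the point of the two conventions.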

Strings in Redis

In C, a string is represented as a char array ending in \0. For example, hello world is stored in C as "hello world\0". This simple representation meets the requirements in most cases, but it does not efficiently support two operations, length calculation and append:

  • Every length calculation (strlen(s)) has complexity Θ(N).
  • Appending to a string N times requires N memory reallocations (realloc) of the string.

In Redis, appending to a string and computing its length are common operations, and APPEND and STRLEN map these two operations directly onto Redis commands, so they should not become a performance bottleneck. Besides C strings, Redis also needs to handle plain byte arrays and the server protocol. For convenience, Redis's string representation should therefore also be binary safe: the program should make no assumptions about the data stored in a string. The data can be a C string ending in \0, a plain byte array, or data in any other format.

Considering these two reasons, Redis uses the sds type to replace the default string representation of the C language: sds can efficiently implement appending and length calculation, and is binary safe.

Implementation of sds

In the previous content, we have been describing sds as an abstract data structure. In fact, its implementation consists of the following two parts:

typedef char *sds;

struct sdshdr 
{
    int len;    // number of bytes used in buf
    int free;   // number of bytes still available in buf
    char buf[]; // where the string data is actually stored
};

Here the type sds is an alias for char *, and the structure sdshdr holds the three attributes len, free and buf.

As an example, here is a newly created sdshdr structure holding the string hello world:

struct sdshdr {
    len = 11;
    free = 0;
    buf = "hello world\0";  // the actual length of buf is len + 1
};

Through the len attribute, sdshdr supports length calculation with complexity Θ(1).

On the other hand, by allocating some extra space for buf and recording the size of the unused space in free, sdshdr can greatly reduce the number of memory reallocations needed by append operations. The next section discusses this in detail.

Of course, sds also places a requirement on the operations that use it: every function that handles an sdshdr must update the len and free attributes correctly, otherwise bugs result.
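One way the two parts (the sds pointer and the sdshdr header) can fit together is sketched below. The sketch assumes that the sds pointer handed to callers points at buf, with the header sitting immediately before it in the same allocation. The my_ names and hdr_of are invented for this sketch, and the real sds.c differs in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef char *sds;

struct sdshdr {
    int len;     /* number of bytes used in buf */
    int free;    /* number of bytes still available in buf */
    char buf[];  /* string data, followed by \0 */
};

/* Step back from the buf pointer to the header that precedes it. */
static struct sdshdr *hdr_of(sds s)
{
    return (struct sdshdr *)(void *)(s - offsetof(struct sdshdr, buf));
}

/* Hypothetical constructor: header + data + 1 byte for \0, no spare space. */
sds my_sdsnew(const char *init)
{
    assert(init != NULL);
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);   /* copies the trailing \0 as well */
    return sh->buf;                 /* callers only ever see the char pointer */
}

int  my_sdslen(sds s)   { return hdr_of(s)->len; }   /* O(1), no traversal */
int  my_sdsavail(sds s) { return hdr_of(s)->free; }
void my_sdsfree(sds s)  { if (s != NULL) free(hdr_of(s)); }
```

Since the returned pointer is an ordinary char * to \0-terminated data, it can be passed to printf or strcmp directly, while the length stays one pointer subtraction away.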

Optimize append operation

As mentioned earlier, besides giving Θ(1) access to the string length, the sdshdr structure also reduces the number of memory reallocations needed by append operations. This section explains how the optimization works.

For ease of understanding, let's use a Redis execution instance as an example to explain what happens inside Redis when the following code is executed:

redis> SET msg "hello world"
OK

redis> APPEND msg " again!"
(integer) 18

redis> GET msg
"hello world again!"

First, the SET command creates an sdshdr and saves hello world in it. The value of this sdshdr is as follows:

struct sdshdr {
    len = 11;
    free = 0;
    buf = "hello world\0";
}

When the APPEND command is executed, the corresponding sdshdr is updated and the string " again!" is appended to the original "hello world":

struct sdshdr {
    len = 18;
    free = 18;
    buf = "hello world again!\0                  ";     // the blank area is pre-allocated space; 18 + 18 + 1 bytes in total
}

Note that when the SET command created the sdshdr, its free attribute was 0: Redis allocated no extra space for buf. After APPEND, Redis allocated about twice the required space for buf.

In this example, storing "hello world again!" needs 18 + 1 bytes, but the program allocates 18 + 18 + 1 = 37 bytes. If the same sdshdr is appended to again later, buf does not need to be reallocated as long as the appended content is no longer than the value of the free attribute.

For example, the following command does not cause buf to be reallocated, because the newly appended string is shorter than 18 bytes:

redis> APPEND msg " again!"
(integer) 25

After this second APPEND, the sdshdr structure for the value of msg looks like this:

struct sdshdr {
    len = 25;
    free = 11;
    buf = "hello world again! again!\0           ";     // the blank area is pre-allocated space; 18 + 18 + 1 bytes in total
}

The sdsMakeRoomFor function in sds.c implements this memory pre-allocation strategy for sdshdr. The following is a pseudo-code version of the function:

def sdsMakeRoomFor(sdshdr, required_len):

    # the pre-allocated space is sufficient, no allocation needed
    if (sdshdr.free >= required_len):
        return sdshdr

    # compute the total length of the new string
    newlen = sdshdr.len + required_len

    # if the total length of the new string is less than SDS_MAX_PREALLOC,
    # allocate twice the required length for the string;
    # otherwise allocate the required length plus SDS_MAX_PREALLOC
    if newlen < SDS_MAX_PREALLOC:
        newlen *= 2
    else:
        newlen += SDS_MAX_PREALLOC

    # allocate memory
    newsh = zrealloc(sdshdr, sizeof(struct sdshdr)+newlen+1)

    # update the free attribute
    newsh.free = newlen - sdshdr.len

    # return
    return newsh
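The pseudo-code can be turned into a small self-contained C sketch. It follows the pseudo-code above rather than the actual Redis source; the hdr_ function names and MAX_PREALLOC are invented for illustration, and plain malloc/realloc stand in for the Redis allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PREALLOC (1024 * 1024)   /* stands in for SDS_MAX_PREALLOC */

struct sdshdr {
    int len;
    int free;
    char buf[];
};

/* Create a header holding a copy of init, with no spare space. */
struct sdshdr *hdr_new(const char *init)
{
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);
    return sh;
}

/* Make sure at least addlen more bytes fit; the block may move. */
struct sdshdr *hdr_make_room_for(struct sdshdr *sh, size_t addlen)
{
    assert(sh != NULL);
    if ((size_t)sh->free >= addlen) return sh;   /* enough spare space already */

    size_t newlen = (size_t)sh->len + addlen;
    if (newlen < MAX_PREALLOC) newlen *= 2;      /* small string: double it */
    else newlen += MAX_PREALLOC;                 /* large string: add 1MB */

    struct sdshdr *newsh = realloc(sh, sizeof(struct sdshdr) + newlen + 1);
    if (newsh == NULL) return NULL;              /* the old block is still valid */
    newsh->free = (int)(newlen - (size_t)newsh->len);
    return newsh;
}

/* Append t, growing first if needed. Returns the (possibly moved) header,
 * or NULL on allocation failure, in which case sh is left untouched. */
struct sdshdr *hdr_cat(struct sdshdr *sh, const char *t)
{
    size_t n = strlen(t);
    struct sdshdr *grown = hdr_make_room_for(sh, n);
    if (grown == NULL) return NULL;
    memcpy(grown->buf + grown->len, t, n + 1);   /* includes the trailing \0 */
    grown->len += (int)n;
    grown->free -= (int)n;
    return grown;
}
```

Starting from hdr_new("hello world") and calling hdr_cat with " again!" twice reproduces the walkthrough: len and free go from 11 and 0 to 18 and 18, then to 25 and 11, and the second call does not reallocate.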

In the current version of Redis (the author's version is redis-6.0.9), SDS_MAX_PREALLOC is defined in sds.h:

#ifndef __SDS_H
#define __SDS_H

#define SDS_MAX_PREALLOC (1024*1024)
extern const char *SDS_NOINIT;

#include <sys/types.h>
#include <stdarg.h>
#include <stdint.h>

typedef char *sds;

....

void *sds_malloc(size_t size);
void *sds_realloc(void *ptr, size_t size);
void sds_free(void *ptr);

#ifdef REDIS_TEST
int sdsTest(int argc, char *argv[]);
#endif

#endif

As can be seen, the value of SDS_MAX_PREALLOC is 1024 * 1024. That is, when the string is smaller than 1MB after an append, sdsMakeRoomFor allocates about twice the required size for it; when the string is 1MB or larger, sdsMakeRoomFor allocates an extra 1MB.

Will this allocation strategy waste memory?

  • A string that has been the target of an APPEND command carries extra pre-allocated space, and that space is not released unless the key holding the string is deleted or Redis is shut down and restarted: string objects reloaded after a restart have no pre-allocated space.
  • Because the number of string keys that APPEND is run against is usually small, and their memory footprint is usually small too, this is generally not a problem.
  • On the other hand, if APPEND is run against many keys and the strings are very large, it may be necessary to modify the Redis server so that it periodically releases some of the pre-allocated space of string keys and uses memory more efficiently.

Summary:

1. To get the length of a string, a C string has to be traversed until '\0' is found, which is O(n); SDS reads the len attribute directly, which is O(1).

2. The SDS API prevents buffer overflow. When sdscat is called, it first checks whether the SDS has enough space. If not, it expands the SDS first and then performs the concatenation.

3. To reduce the performance impact of memory reallocation, SDS pre-allocates memory when a string grows. This pre-allocation strategy effectively reduces the number of memory allocations Redis performs.

4. SDS is binary safe. A C string finds its end by looking for '\0', whereas SDS finds the end through the len attribute, so a '\0' in the middle of the string is not a concern.

In addition,

  • Redis strings are represented as sds, not as C strings (char * ending in \0).
  • Compared with C strings, sds has the following characteristics:
    • Length calculation is efficient (strlen);
    • Append operations are efficient (append);
    • Binary safety;
  • sds optimizes the append operation: appends are faster and need fewer memory allocations, at the cost of using more memory, and this memory is not released proactively.

Origin blog.csdn.net/u013318019/article/details/110691642