SDS-Redis source code analysis

SDS (simple dynamic string) is Redis's own string abstraction. It is the most widely used data structure in Redis and the foundation for many of the others, so I chose to introduce it first. SDS is also compatible with parts of the C string API (strcmp, strlen), and the way it achieves that is a rather clever trick, which will be clear by the end of this post. Before getting into the details, here are a few questions (some of them common interview questions). Reading with questions in mind is a good way to learn.

  1. C already has strings, so why does Redis wrap its own?
  2. What does the D (dynamic) in SDS mean?
  3. What does the SDS data structure look like, and why is it designed that way?
  4. How does SDS stay compatible with C strings?

The SDS source code lives in src/sds.c and src/sds.h of the Redis source tree (in the original post these link to my Chinese-annotated copy of the Redis source). sds.h declares all the SDS APIs and also implements a few small ones inline, such as getting the length of an sds or its remaining free space. Don't rush into the code, though. Look at the data structure first; once you understand it, the reasons the code is written the way it is become clear.

sdshdr data structure

Redis provides five header variants for sds: sdshdr5, sdshdr8, sdshdr16, sdshdr32, and sdshdr64. Apart from the special sdshdr5, they differ only in the integer type of two fields, len and alloc. Taking sdshdr8 and sdshdr16 as examples, the struct definitions are as follows.

struct __attribute__ ((__packed__)) sdshdr8 {
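    uint8_t len; /* number of bytes in use */
    uint8_t alloc; /* usable capacity for characters: the real size of buf minus 1 (the trailing \0 required by C strings is not counted) */
    unsigned char flags; /* flag bits that identify which sdshdr this is; only the low 3 bits are used, 5 are spare */
    char buf[];   /* where the string itself is stored */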
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

sdshdr32 and sdshdr64 follow the same layout; the only difference is the integer type of len and alloc. Compared with a native C string, an sds carries three extra fields, len, alloc, and flags, that store additional information. Repeated string concatenation is expensive if every append has to reallocate, so whenever an sds has to grow, Redis allocates more space than is immediately needed in order to absorb future growth (a freshly created sds, as the sdsnewlen code later in this post shows, starts with alloc equal to len). Readers familiar with Java can think of the relationship between a C string and an sds as similar to that between String and StringBuffer. Because of this reserved space, Redis has to record both the used length and the total allocated capacity; the free space is simply the difference between the two.
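
To make the header overhead concrete, here is a minimal sketch that measures it. The demo_ structs are illustrative copies made up for this example, not the Redis definitions, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copies of the header layout; the demo_ names are not part of Redis. */
struct __attribute__ ((__packed__)) demo_packed8 {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];
};

struct __attribute__ ((__packed__)) demo_packed16 {
    uint16_t len;
    uint16_t alloc;
    unsigned char flags;
    char buf[];
};

/* Same fields as demo_packed16, but without the packed attribute. */
struct demo_unpacked16 {
    uint16_t len;
    uint16_t alloc;
    unsigned char flags;
    char buf[];
};

/* With packing, the header ends exactly where buf begins. */
static_assert(offsetof(struct demo_packed16, buf) == sizeof(struct demo_packed16),
              "packed header must end where buf starts");

size_t demo_packed8_size(void)    { return sizeof(struct demo_packed8); }
size_t demo_packed16_size(void)   { return sizeof(struct demo_packed16); }
size_t demo_unpacked16_size(void) { return sizeof(struct demo_unpacked16); }
```

With packing, the two headers take 3 and 5 bytes, that is, sizeof(len) + sizeof(alloc) + 1 byte of flags. Without it, the compiler is free to pad the 16-bit variant (typically to 6 bytes), and the size of the header would no longer equal the offset of buf, which is what code stepping back from buf by sizeof(header) depends on.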

Next question: why does Redis bother to provide five header types, from sdshdr5 to sdshdr64? I think it simply shows how frugal the Redis author is with memory: some code simplicity is sacrificed to save a few bytes on every sds. The initialization functions sdsnew and sdsnewlen show that the initial length is passed in when an sds is created, and the header type is chosen from that length. If the length is less than 2^8, sdshdr8 is enough, so len and alloc together occupy only two bytes (very short non-empty strings can even use sdshdr5, which has no len or alloc fields at all). Short strings are likely to be very numerous, so the total memory saved is considerable. With the data structure and design rationale in mind, the code of sdsnewlen is easy to follow:

sds sdsnewlen(const void *init, size_t initlen) {
    void *sh;
    sds s;
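    // choose the sdshdr type from the initial length
    char type = sdsReqType(initlen);
    /* An empty string is most likely created in order to append to it, and sdshdr5 is not suited to appending, so use sdshdr8 instead */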
    if (type == SDS_TYPE_5 && initlen == 0) type = SDS_TYPE_8;
    int hdrlen = sdsHdrSize(type);
    unsigned char *fp; /* flags pointer. */

    sh = s_malloc(hdrlen+initlen+1);
    if (sh == NULL) return NULL;
    if (init==SDS_NOINIT)
        init = NULL;
    else if (!init)
        memset(sh, 0, hdrlen+initlen+1);
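    /* Note: the returned s does not point to the start of the sds header; it points to the
     * string stored inside it. The header address has to be computed from s and hdrlen. */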
    s = (char*)sh+hdrlen;  
    fp = ((unsigned char*)s)-1;
    switch(type) {
        case SDS_TYPE_5: {
            *fp = type | (initlen << SDS_TYPE_BITS);
            break;
        }
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            sh->len = initlen;
            sh->alloc = initlen;
            *fp = type;
            break;
        }
        case SDS_TYPE_16: {
            SDS_HDR_VAR(16,s);
            sh->len = initlen;
            sh->alloc = initlen;
            *fp = type;
            break;
        }
        case SDS_TYPE_32: {
            SDS_HDR_VAR(32,s);
            sh->len = initlen;
            sh->alloc = initlen;
            *fp = type;
            break;
        }
        case SDS_TYPE_64: {
            SDS_HDR_VAR(64,s);
            sh->len = initlen;
            sh->alloc = initlen;
            *fp = type;
            break;
        }
    }
    if (initlen && init)
        memcpy(s, init, initlen);
    s[initlen] = '\0';
    return s;
}
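
sdsnewlen relies on sdsReqType and sdsHdrSize, which are not shown in this post. The following is a minimal sketch of how such a selection could work, assuming thresholds of 2^5, 2^8, 2^16 and 2^32 and a 64-bit platform. The demo_ names and constants are made up for this example and are not the Redis source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type codes for the five header variants. */
enum {
    DEMO_TYPE_5  = 0,
    DEMO_TYPE_8  = 1,
    DEMO_TYPE_16 = 2,
    DEMO_TYPE_32 = 3,
    DEMO_TYPE_64 = 4
};

/* Pick the smallest header whose length field can hold string_size. */
int demo_req_type(uint64_t string_size) {
    if (string_size < (1u << 5))  return DEMO_TYPE_5;   /* length fits in 5 bits of flags */
    if (string_size < (1u << 8))  return DEMO_TYPE_8;
    if (string_size < (1u << 16)) return DEMO_TYPE_16;
    if (string_size < (UINT64_C(1) << 32)) return DEMO_TYPE_32;
    return DEMO_TYPE_64;
}

/* Header size in bytes: type 5 has only flags, the others have len + alloc + flags. */
int demo_hdr_size(int type) {
    switch (type) {
        case DEMO_TYPE_5:  return 1;
        case DEMO_TYPE_8:  return 1 + 1 + 1;
        case DEMO_TYPE_16: return 2 + 2 + 1;
        case DEMO_TYPE_32: return 4 + 4 + 1;
        case DEMO_TYPE_64: return 8 + 8 + 1;
        default: assert(0 && "unknown header type"); return 0;
    }
}
```

Under these assumptions, a 200-byte string pays 3 bytes of header, while always using the 64-bit layout would cost 17 bytes per string.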

Use of SDS

I highlighted one comment in the code above: the pointer returned by sdsnewlen() does not point to the start of the sdshdr but directly to buf inside it. What is the benefit? It keeps an sds compatible with native C strings. buf holds an ordinary C string followed by any free space, and the string is terminated by '\0', which is exactly the character that marks the end of a C string. An sds can therefore be passed directly to the C string API. The drawback is that len and alloc cannot be read directly through the pointer, but they can still be recovered.
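
The idea of returning a pointer to buf can be tried out with a toy version. This is a minimal sketch with a single fixed header type; mini_hdr, mini_sds, mini_new, mini_len and mini_free are hypothetical names made up for this example, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-size header, similar in spirit to sdshdr8. */
struct __attribute__ ((__packed__)) mini_hdr {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];
};

typedef char *mini_sds;

/* Create a string from initlen bytes (at most 255). Returns a pointer to buf, not to the header. */
mini_sds mini_new(const void *init, size_t initlen) {
    if (initlen > UINT8_MAX) return NULL;
    struct mini_hdr *h = malloc(sizeof(*h) + initlen + 1);
    if (h == NULL) return NULL;
    h->len = (uint8_t)initlen;
    h->alloc = (uint8_t)initlen;
    h->flags = 1;
    if (init != NULL) memcpy(h->buf, init, initlen);
    else memset(h->buf, 0, initlen);
    h->buf[initlen] = '\0';   /* always terminate, so C string functions work */
    return h->buf;
}

/* Step back over the header to read the stored length in O(1). */
size_t mini_len(const mini_sds s) {
    assert(s != NULL);
    const struct mini_hdr *h = (const struct mini_hdr *)(s - sizeof(struct mini_hdr));
    return h->len;
}

/* Free the whole allocation, which starts at the header rather than at s. */
void mini_free(mini_sds s) {
    if (s != NULL) free(s - sizeof(struct mini_hdr));
}
```

A string created with mini_new("redis", 5) can be handed to strlen or strcmp unchanged. For data with an embedded zero byte, such as mini_new("a\0b", 3), strlen stops at 1 while mini_len still reports 3, which is one reason to store the length in a header instead of relying on the terminator.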

Suppose we are handed an sds; call it s. At first we know nothing about it, not even which sdshdr it uses. We can look at the byte just before s: from the layout of sdshdr, the byte immediately preceding buf is flags. From the value of flags we can tell which sdshdr type s uses, compute the real start address of the header, and from there read len and alloc. With this in mind, the somewhat obscure code below is easy to understand.

oldtype = s[-1] & SDS_TYPE_MASK; // SDS_TYPE_MASK = 7; read the byte just before s (flags) to work out the sdshdr type

// this macro computes the address of the sdshdr header directly
#define SDS_HDR(T,s) ((struct sdshdr##T *)((s)-(sizeof(struct sdshdr##T))))
#define SDS_TYPE_5_LEN(f) ((f)>>SDS_TYPE_BITS)

// return the length of the string stored in the sds
static inline size_t sdslen(const sds s) {
    unsigned char flags = s[-1];  // s[-1] is the flags field of the sdshdr
    switch(flags&SDS_TYPE_MASK) {  
        case SDS_TYPE_5:
            return SDS_TYPE_5_LEN(flags);
        case SDS_TYPE_8:
            return SDS_HDR(8,s)->len;  // the macro yields the header pointer, from which len is read
        ...
        // SDS_TYPE_16 and SDS_TYPE_32 cases omitted
        case SDS_TYPE_64:
            return SDS_HDR(64,s)->len;
    }
    return 0;
}
// return the remaining free space of the sds
static inline size_t sdsavail(const sds s) {
    unsigned char flags = s[-1];
    switch(flags&SDS_TYPE_MASK) {
        case SDS_TYPE_5: {
            return 0;
        }
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            return sh->alloc - sh->len;
        }
        ... 
        // SDS_TYPE_16 and SDS_TYPE_32 cases omitted
        case SDS_TYPE_64: {
            SDS_HDR_VAR(64,s);
            return sh->alloc - sh->len;
        }
    }
    return 0;
}
/* return the pointer to the actual start of the sds allocation */
void *sdsAllocPtr(sds s) {
    return (void*) (s-sdsHdrSize(s[-1]));
}
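
The flags byte deserves a closer look, because for sdshdr5 it carries the length as well as the type. Here is a minimal sketch of the bit layout implied by SDS_TYPE_MASK = 7 and SDS_TYPE_5_LEN above, assuming SDS_TYPE_BITS is 3 and that type 5 is encoded as 0. The demo_ names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TYPE_BITS 3
#define DEMO_TYPE_MASK 7   /* binary 111: the low 3 bits hold the type */
#define DEMO_TYPE_5    0   /* assumed encoding of the sdshdr5 type */

/* Build the flags byte of a type 5 string: type in the low 3 bits, length in the high 5. */
unsigned char demo_pack_type5(size_t len) {
    assert(len < 32);   /* only 5 bits are available for the length */
    return (unsigned char)(DEMO_TYPE_5 | (len << DEMO_TYPE_BITS));
}

/* Extract the header type from a flags byte. */
int demo_type_of(unsigned char flags) {
    return flags & DEMO_TYPE_MASK;
}

/* Extract the length stored in the flags byte of a type 5 string. */
size_t demo_type5_len(unsigned char flags) {
    return (size_t)(flags >> DEMO_TYPE_BITS);
}
```

For a length of 11, the flags byte is 11 << 3 = 88; masking with 7 gives type 0 and shifting right by 3 gives back 11. Five bits cap the length at 31, and there is no room left to record free space, which is consistent with sdsavail returning 0 for SDS_TYPE_5.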

SDS expansion

When concatenating strings, the free space left in an sds may not be enough, and it has to be expanded. When should it expand, and by how much? Many Java data structures, such as StringBuffer and HashMap, have dynamic expansion mechanisms similar to that of sds: they check during use whether there is enough space and grow when needed. A common strategy is to grow exponentially at first and switch to linear growth once a size limit is reached, and this is what Redis does. While the required length (the current length plus the bytes to add) is below 1024*1024 bytes, Redis allocates twice that length; once the required length reaches 1024*1024 bytes, it adds only 1024*1024 bytes on top of it each time. The expansion code is in sdsMakeRoomFor(), which many string-modifying APIs call, directly or indirectly, at the start. Unlike StringBuffer in Java, Redis also has to handle a change of sdshdr type as the string length changes. The code is as follows:

// Enlarge the usable space of the sds so that more data can be appended later.
// Note: this does not change the length of the sds; it only adds free space to buf.
sds sdsMakeRoomFor(sds s, size_t addlen) {
    void *sh, *newsh;
    size_t avail = sdsavail(s);
    size_t len, newlen;
    char type, oldtype = s[-1] & SDS_TYPE_MASK; // SDS_TYPE_MASK = 7 
    int hdrlen;

    /* return immediately if there is already enough free space */
    if (avail >= addlen) return s;

    len = sdslen(s);
    sh = (char*)s-sdsHdrSize(oldtype);
    newlen = (len+addlen);
    // below SDS_MAX_PREALLOC the required length is doubled; from there on only SDS_MAX_PREALLOC is added
    if (newlen < SDS_MAX_PREALLOC)  // SDS_MAX_PREALLOC = 1024*1024
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;

    type = sdsReqType(newlen);

    /* type 5 cannot record free space, so it is not used here; if type 5 comes up, use type 8 instead */
    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type);
    if (oldtype==type) {
        newsh = s_realloc(sh, hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        s = (char*)newsh+hdrlen;
    } else {
        // the header size changes, so allocate new space and move the old data over
        newsh = s_malloc(hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh+hdrlen, s, len+1);
        s_free(sh);
        s = (char*)newsh+hdrlen;
        s[-1] = type;
        sdssetlen(s, len);
    }
    sdssetalloc(s, newlen);
    return s;
}
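
The growth rule in sdsMakeRoomFor can be isolated and simulated. The following is a minimal sketch that applies the same arithmetic to plain numbers, with no memory allocation involved. demo_grow and demo_count_reallocs are hypothetical helpers made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PREALLOC (1024 * 1024)

/* New capacity after a growth step, following the rule in sdsMakeRoomFor. */
size_t demo_grow(size_t len, size_t addlen) {
    size_t newlen = len + addlen;
    if (newlen < DEMO_MAX_PREALLOC)
        newlen *= 2;                   /* small strings: double the required length */
    else
        newlen += DEMO_MAX_PREALLOC;   /* large strings: add a fixed 1 MB */
    return newlen;
}

/* Count how many growth steps are needed to append n bytes one at a time. */
size_t demo_count_reallocs(size_t n) {
    size_t len = 0, alloc = 0, count = 0;
    for (size_t i = 0; i < n; i++) {
        assert(alloc >= len);
        if (alloc - len < 1) {         /* not enough free space for one more byte */
            alloc = demo_grow(len, 1);
            count++;
        }
        len++;
    }
    return count;
}
```

Appending 5 bytes to a full 10-byte string yields a capacity of 30, and appending 1000 bytes one at a time triggers only 9 growth steps (at lengths 0, 2, 6, 14, 30, 62, 126, 254 and 510). Past the 1 MB threshold each step adds a fixed amount, which bounds the memory reserved but never used.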

Common API

I have not posted much of the source of sds.c. The rest of the code is essentially string operations built around the sdshdr data structure (creation, concatenation, copying, expansion, and so on). Once you understand the design of sds, I believe you could write it yourself. The SDS-related APIs are listed here; readers interested in the source can go to src/sds.c, where my Chinese-annotated version covers the same list.

sds sdsnewlen(const void *init, size_t initlen);  // create an sds of length initlen from the buffer init
sds sdsnew(const char *init); // create an sds from a C string; if init is NULL the length is 0
sds sdsempty(void);  // create an empty string ""
sds sdsdup(const sds s); // create a copy of s sized to its actual length
void sdsfree(sds s); // free the sds
sds sdsgrowzero(sds s, size_t len); // grow the sds to the given length, filling the new space with zero bytes
sds sdscatlen(sds s, const void *t, size_t len); // append the first len bytes of t to the sds
sds sdscat(sds s, const char *t);  // append the C string t to the sds
sds sdscatsds(sds s, const sds t); // append the sds t to the sds s
sds sdscpylen(sds s, const char *t, size_t len); // copy the first len bytes of t into the sds
sds sdscpy(sds s, const char *t); // copy the C string t into the sds

sds sdscatvprintf(sds s, const char *fmt, va_list ap); // append a printf-style formatted string to the sds

sds sdscatfmt(sds s, char const *fmt, ...);   // format the arguments into a string and append it to the sds
sds sdstrim(sds s, const char *cset);  // remove characters found in cset from the beginning and end of the sds
void sdsrange(sds s, ssize_t start, ssize_t end);  // keep only the substring in the given range
void sdsupdatelen(sds s); // update the recorded length of the sds
void sdsclear(sds s);  // clear the contents of the sds without freeing its space
int sdscmp(const sds s1, const sds s2);  // compare two sds strings
sds *sdssplitlen(const char *s, ssize_t len, const char *sep, int seplen, int *count);
void sdsfreesplitres(sds *tokens, int count);
void sdstolower(sds s); // convert the string to lower case
void sdstoupper(sds s);  // convert the string to upper case
sds sdsfromlonglong(long long value);  // convert a long long value to an sds
sds sdscatrepr(sds s, const char *p, size_t len);
sds *sdssplitargs(const char *line, int *argc);
sds sdsmapchars(sds s, const char *from, const char *to, size_t setlen);
sds sdsjoin(char **argv, int argc, char *sep); // join an array of C strings with the given separator
sds sdsjoinsds(sds *argv, int argc, const char *sep, size_t seplen); // join an array of sds with the given separator

/* low-level sds API */
sds sdsMakeRoomFor(sds s, size_t addlen);  // enlarge the free space of the sds
void sdsIncrLen(sds s, ssize_t incr); // adjust the recorded length of the sds by incr
sds sdsRemoveFreeSpace(sds s); // release the unused free space held by the sds
size_t sdsAllocSize(sds s); // return the total memory occupied by the sds
void *sdsAllocPtr(sds s); // return the pointer to the actual start of the sds allocation

void *sds_malloc(size_t size); // allocate memory for an sds
void *sds_realloc(void *ptr, size_t size); // reallocate memory for an sds
void sds_free(void *ptr);  // free memory allocated for an sds

Conclusion

Going back to the opening questions, I believe you can now answer them. If not, read the post again; reading it alongside the source code works even better.

To keep this post to a reasonable length, only the core code is explained here, and the source of many APIs is not shown.

Original link: https://my.oschina.net/xindoo/blog/4650366

Origin blog.csdn.net/weixin_50205273/article/details/108813176