"Redis Design and Implementation" - 17: Cluster gossip protocol

I. The gossip protocol

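The previous post walked through how the CLUSTER MEET command is implemented. Once the handshake completes, node A spreads node B's information to the other nodes in the cluster through the gossip protocol. Those nodes then handshake with B as well, so after a while every node in the cluster knows about B.

Some background first:

True to its name, the gossip algorithm is inspired by office gossip: once one person starts gossiping, everyone learns the news within a bounded amount of time. The spread also resembles a virus, which is why gossip goes by many other names, such as "chit-chat algorithm", "epidemic algorithm", "virus infection algorithm", and "rumor spreading algorithm".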

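Gossip is not a new idea, though: earlier flooding searches and routing algorithms fall into the same category. What gossip adds is clear semantics, a concrete implementation method, and a proof of convergence for this class of algorithms.

A gossip process is started by a seed node. When a seed node has a state update to propagate to the other nodes in the network, it randomly picks a few neighbouring nodes and sends them the message. Each node that receives the message repeats the process, until every node in the network has received it. This takes some time. There is no guarantee that all nodes have the message at any given moment, but in theory all of them eventually will, which makes gossip an eventually consistent protocol.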

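Let's walk through a concrete example to get a feel for the whole gossip propagation process.

To keep the description clear, we fix a few assumptions first:

(1) Gossip spreads messages periodically; the period is fixed at 1 second.
(2) An infected node randomly picks k neighbouring nodes (the fan-out) to send the message to. Here the fan-out is 3, so each round a node sends to at most 3 nodes.
(3) Each round, a node only picks nodes it has not sent to before.
(4) A node that receives the message does not send it back to the sender. For example, after A -> B, B no longer sends to A when it spreads the message.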

There are 16 nodes in total, and node 1 is the initially infected node. Through the gossip process, all nodes eventually become infected.

For the full introduction, see https://www.jianshu.com/p/8279d6fd65bb
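
The four rules above can be turned into a small simulation. The sketch below is only an illustration and not part of Redis: gossip_rounds, MAX_NODES, and the fully connected topology are assumptions made here, and rand() stands in for whatever random choice a real system would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_NODES 64   /* hypothetical upper bound for this sketch */

/* Simulate push gossip among n fully connected nodes under rules (1)-(4).
 * Returns the number of rounds until every node is infected, or -1 if the
 * arguments are out of range. Index 0 plays the role of "node 1". */
int gossip_rounds(int n, int fanout, unsigned seed) {
    bool infected[MAX_NODES] = {false};
    /* sent[a][b] is true once a may no longer send to b. */
    bool sent[MAX_NODES][MAX_NODES] = {{false}};
    int infected_count = 1, rounds = 0;

    if (n < 1 || n > MAX_NODES || fanout < 1) return -1;
    srand(seed);
    infected[0] = true;

    while (infected_count < n) {
        bool was_infected[MAX_NODES];
        for (int i = 0; i < n; i++) was_infected[i] = infected[i];

        for (int a = 0; a < n; a++) {
            if (!was_infected[a]) continue; /* infected this round: wait */
            for (int k = 0; k < fanout; k++) {
                /* Collect the nodes a has not exchanged the message with. */
                int cand[MAX_NODES], ncand = 0;
                for (int b = 0; b < n; b++)
                    if (b != a && !sent[a][b]) cand[ncand++] = b;
                if (ncand == 0) break;

                int target = cand[rand() % ncand];
                sent[a][target] = true;  /* rule (3): never send twice */
                sent[target][a] = true;  /* rule (4): never send back */
                if (!infected[target]) {
                    infected[target] = true;
                    infected_count++;
                }
            }
        }
        rounds++;   /* rule (1): one round per period */
    }
    return rounds;
}
```

With 16 nodes and a fan-out of 3, the infected set can grow at best from 1 to 4 to 16, so at least 2 rounds are needed. Node 1 alone reaches 3 new targets per round, so under these rules 5 rounds is the upper bound.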

II. The Redis implementation

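How does node information spread in Redis? It is carried inside PING and PONG messages: each of them includes information about other nodes, and that is how it propagates.

Let's first look at how messages are abstracted in Redis Cluster. A message can be a PING, PONG, or MEET, or an UPDATE, PUBLISH, FAIL, and so on. All of them are structures of type clusterMsg, which consists mainly of a message header and the message data.

The message header contains the signature, the total message size, the version, and information about the sending node.
The message data is a union, union clusterMsgData, whose members are different structs used to build the different kinds of messages.

PING, PONG, and MEET belong to one family: their data is an array of clusterMsgDataGossip, which can hold information about multiple nodes. The structure is as follows: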

/* Initially we don't know our "name", but we'll find it once we connect
 * to the first node, using the getsockname() function. Then we'll use this
 * address for all the next messages. */
typedef struct {
    // Node name.
    // For a node we only know by address, the name starts out random;
    // once the MEET handshake completes, it is replaced by the node's real name.
    char nodename[CLUSTER_NAMELEN];
    // Timestamp of the last PING sent to this node
    uint32_t ping_sent;
    // Timestamp of the last PONG received from this node
    uint32_t pong_received;
    char ip[NET_IP_STR_LEN];  /* IP address last time it was seen */
    uint16_t port;              /* port last time it was seen */
    uint16_t flags;             /* node->flags copy */
    uint16_t notused1;          /* Some room for future improvements. */
    uint32_t notused2;
} clusterMsgDataGossip;
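
To get a feel for how much space each gossip entry takes on the wire, the struct can be copied into a standalone file and measured. This is a rough sketch: gossip_entry and gossip_section_bytes are names made up here, and the two constants (40 for CLUSTER_NAMELEN, 46 for NET_IP_STR_LEN) are assumed values that should be checked against the Redis headers you are reading.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_NAMELEN 40  /* assumed: length of a hex SHA1 node name */
#define NET_IP_STR_LEN 46   /* assumed: room for an IPv6 address string */

/* Hypothetical local copy of clusterMsgDataGossip, same field order. */
typedef struct {
    char nodename[CLUSTER_NAMELEN];
    uint32_t ping_sent;
    uint32_t pong_received;
    char ip[NET_IP_STR_LEN];
    uint16_t port;
    uint16_t flags;
    uint16_t notused1;
    uint32_t notused2;
} gossip_entry;

/* Bytes occupied by `count` gossip entries in a PING/PONG/MEET body. */
size_t gossip_section_bytes(size_t count) {
    return count * sizeof(gossip_entry);
}
```

Under these assumptions, and on common ABIs where uint32_t is 4-byte aligned, the fields pack without padding and one entry is 104 bytes, so the minimum of 3 entries adds 312 bytes to a heartbeat.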

In clusterSendPing(), the first step is to add information about randomly selected nodes to the message. The code is as follows:

/* Send a PING or PONG packet to the specified node, making sure to add enough
 * gossip information. */
// Send a MEET, PING, or PONG message to the specified node.
void clusterSendPing(clusterLink *link, int type) {
    unsigned char *buf;
    clusterMsg *hdr;
    int gossipcount = 0; /* Number of gossip sections added so far. */
    int wanted; /* Number of gossip sections we want to append if possible. */
    int totlen; /* Total packet length. */
    /* freshnodes is the max number of nodes we can hope to append at all:
     * nodes available minus two (ourself and the node we are sending the
     * message to). However practically there may be less valid nodes since
     * nodes in handshake state, disconnected, are not considered. */
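    // freshnodes doubles as a counter while gossip entries are added:
    // it is decremented each time a node is added to the message (or rejected
    // as invalid), and once it drops to 0 no more entries are added.
    // Its initial value is the number of entries in the nodes table minus 2:
    // one for myself (the sender) and one for the node receiving the gossip.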
    int freshnodes = dictSize(server.cluster->nodes)-2;

    /* How many gossip sections we want to add? 1/10 of the number of nodes
     * and anyway at least 3. Why 1/10?
     *
     * (Note: the gossip section should cover 1/10 of all nodes, but at least
     * 3. The point of 1/10 is that within the failure detection window,
     * which is 2 * node_timeout, a node that goes down gets reported to us
     * by most of the cluster. The figure is derived as follows.)
     *
     * If we have N masters, with N/10 entries, and we consider that in
     * node_timeout we exchange with each other node at least 4 packets
     * (we ping in the worst case in node_timeout/2 time, and we also
     * receive two pings from the host), we have a total of 8 packets
     * in the node_timeout*2 failure reports validity time. So we have
     * that, for a single PFAIL node, we can expect to receive the following
     * number of failure reports (in the specified window of time):
     *
     * PROB * GOSSIP_ENTRIES_PER_PACKET * TOTAL_PACKETS:
     *
     * PROB = probability of being featured in a single gossip entry,
     *        which is 1 / NUM_OF_NODES.
     * ENTRIES = 10.
     * TOTAL_PACKETS = 2 * 4 * NUM_OF_MASTERS.
     *
     * If we assume we have just masters (so num of nodes and num of masters
     * is the same), with 1/10 we always get over the majority, and specifically
     * 80% of the number of nodes, to account for many masters failing at the
     * same time.
     *
     * Since we have non-voting slaves that lower the probability of an entry
     * to feature our node, we set the number of entries per packet as
     * 10% of the total nodes we have.
     *
     * (Note, in other words: with N cluster nodes, within node_timeout the
     * current node receives at least 4 heartbeat packets from any other
     * node. A node sends a PING to every other node at most node_timeout/2
     * apart, and a node that receives a PING replies with a PONG. So within
     * node_timeout we receive two PINGs from node A, plus two PONGs that A
     * sends in reply to our own PINGs. Within the failure report window
     * node_timeout*2 that makes 8 heartbeat packets from each other node,
     * about 8*N packets in total. Each packet mentions the failing node with
     * probability 1/10, so the expected number of failure reports is
     * 8*N*(1/10) = 0.8*N, meaning reports arrive from most of the nodes.)
     */
    // wanted is the number of other nodes to describe in the gossip section:
    // one tenth of the cluster nodes, rounded down, and at least 3.
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    // wanted can never exceed freshnodes.
    if (wanted > freshnodes) wanted = freshnodes;

    /* Compute the maximum totlen to allocate our buffer. We'll fix the totlen
     * later according to the number of gossip sections we really were able
     * to put inside the packet. */
    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
    totlen += (sizeof(clusterMsgDataGossip)*wanted);
    /* Note: clusterBuildMessageHdr() expects the buffer to be always at least
     * sizeof(clusterMsg) or more. */
    if (totlen < (int)sizeof(clusterMsg)) totlen = sizeof(clusterMsg);
    // Allocate the zero-initialized buffer.
    buf = zcalloc(totlen);
    hdr = (clusterMsg*) buf;

    /* Populate the header. */
    // For a PING, record the time it was sent.
    if (link->node && type == CLUSTERMSG_TYPE_PING)
        link->node->ping_sent = mstime();
    // Build the message header.
    clusterBuildMessageHdr(hdr,type);

    /* Populate the gossip fields */
    int maxiterations = wanted*3;
    // Build the message body.
    while(freshnodes > 0 && gossipcount < wanted && maxiterations--) {
        // Pick a random cluster node.
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);
        clusterMsgDataGossip *gossip;
        int j;

        /* Don't include this node: the whole packet header is about us
         * already, so we just gossip about other nodes. */
        if (this == myself) continue;

        /* Give a bias to FAIL/PFAIL nodes. */
        // Roughly the first third of the attempts only accept nodes that are
        // flagged FAIL or PFAIL.
        if (maxiterations > wanted*2 &&
            !(this->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL)))
            continue;

        /* In the gossip section don't include:
         * 1) Nodes in HANDSHAKE state.
         * 2) Nodes with the NOADDR flag set.
         * 3) Disconnected nodes if they don't have configured slots.
         */
        if (this->flags & (CLUSTER_NODE_HANDSHAKE|CLUSTER_NODE_NOADDR) ||
            (this->link == NULL && this->numslots == 0))
        {
            freshnodes--; /* Technically not correct, but saves CPU. */
            continue;
        }

        /* Check if we already added this node */
        // The inner loop breaks early when the node is already in the message.
        for (j = 0; j < gossipcount; j++) {
            if (memcmp(hdr->data.ping.gossip[j].nodename,this->name,
                    CLUSTER_NAMELEN) == 0) break;
        }
        // j == gossipcount means no duplicate was found; otherwise skip the
        // node so that it is not listed twice.
        if (j != gossipcount) continue;

        /* Add it */
        freshnodes--;
        // Point at the next free gossip slot.
        gossip = &(hdr->data.ping.gossip[gossipcount]);
        // Copy the node name.
        memcpy(gossip->nodename,this->name,CLUSTER_NAMELEN);
        // Time we last sent a PING to that node.
        gossip->ping_sent = htonl(this->ping_sent);
        // Time we last received a PONG from that node.
        gossip->pong_received = htonl(this->pong_received);
        // IP address and port of that node.
        memcpy(gossip->ip,this->ip,sizeof(this->ip));
        gossip->port = htons(this->port);
        // Copy of that node's flags as we see them.
        gossip->flags = htons(this->flags);
        gossip->notused1 = 0;
        gossip->notused2 = 0;
        // One more node added to the gossip section.
        gossipcount++;
    }

    /* Ready to send... fix the totlen field and queue the message in the
     * output buffer. */
    // Recompute the total length from the entries actually added.
    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
    totlen += (sizeof(clusterMsgDataGossip)*gossipcount);
    // Store the number of gossip entries in the header.
    hdr->count = htons(gossipcount);
    // Store the total message length in the header.
    hdr->totlen = htonl(totlen);
    // Send the message.
    clusterSendMessage(link,buf,totlen);
    zfree(buf);
}

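The gossip section carries information about wanted nodes, where wanted is one tenth of the number of cluster nodes rounded down, at least 3, and at most freshnodes. Why one tenth? Redis Cluster keeps a failure report valid for server.cluster_node_timeout*2, and with N/10 entries per packet a node that goes down will, within that window, be reported to us by most of the cluster's nodes.
The author explains where one tenth comes from in the comment. With N nodes, each packet carries N/10 entries, so the expected number of failure reports received about a failing node within the window is 8*N/10, that is 80% of N. Note that this is an expected count of reports, not a probability. It is enough to hear from the majority of the cluster.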
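
The clamping of wanted and the arithmetic from the comment can be checked with a few lines of C. This is a small sketch for illustration: gossip_wanted and expected_failure_reports are hypothetical helper names, and the second function bakes in the comment's assumption of a masters-only cluster with 8 packets per peer in the window.

```c
#include <assert.h>

/* Mirror of the clamping in clusterSendPing(): N/10, at least 3, and at
 * most N-2 (everyone except ourselves and the receiver). */
int gossip_wanted(int total_nodes) {
    int freshnodes = total_nodes - 2;
    int wanted = total_nodes / 10;  /* integer division already rounds down */
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;
    return wanted;
}

/* Expected number of failure reports about one PFAIL node received within
 * node_timeout*2, assuming n master-only nodes and `entries` gossip entries
 * per packet: PROB * ENTRIES * TOTAL_PACKETS. */
double expected_failure_reports(int n, int entries) {
    double prob = 1.0 / n;           /* one entry features the node */
    double packets = 2.0 * 4.0 * n;  /* 8 packets from each of n nodes */
    return prob * entries * packets;
}
```

For a 100-node cluster, gossip_wanted(100) is 10 and the expected number of reports is about 80, which matches the 80% figure. Since dictSize()/10 is already an integer division, the floor() call in the original code does not change the result.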

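Next the total message size, totlen, is computed: the message header plus wanted node entries. Space is allocated for the message, and clusterBuildMessageHdr() is called to build the header and fill in the sending node's information. After the gossip entries are filled in, totlen is recomputed from the number of entries actually added (gossipcount), and clusterSendMessage() sends the message. Through gossip, every message carries some node information to the target node, and since every node does the same, given enough time all nodes in the cluster will in theory come to know each other. The drawbacks of the gossip protocol are not discussed here.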
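
The random selection loop in clusterSendPing() can be modelled outside Redis to see how the bias and the duplicate check interact. The following is a simplified sketch, not the real code: sim_node and pick_gossip are invented for illustration, a plain array with rand() replaces dictGetRandomKey(), and the eligible field collapses the HANDSHAKE, NOADDR, and disconnected-without-slots checks into one boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for clusterNode. */
typedef struct {
    int id;
    bool failing;   /* flagged PFAIL or FAIL */
    bool eligible;  /* false for handshake, NOADDR, or linkless nodes */
} sim_node;

/* Pick up to `wanted` distinct gossip targets from nodes[0..n), skipping
 * index `self`. Writes the chosen ids to out[] and returns how many. */
int pick_gossip(const sim_node *nodes, int n, int self, int wanted, int *out) {
    int freshnodes = n - 2;
    int count = 0;
    int maxiterations = wanted * 3;

    while (freshnodes > 0 && count < wanted && maxiterations--) {
        int i = rand() % n;
        int j;

        if (i == self) continue;
        /* Early attempts accept failing nodes only (the FAIL/PFAIL bias). */
        if (maxiterations > wanted * 2 && !nodes[i].failing) continue;
        if (!nodes[i].eligible) {
            freshnodes--;
            continue;
        }
        /* Linear scan for duplicates, as in the original. */
        for (j = 0; j < count; j++)
            if (out[j] == nodes[i].id) break;
        if (j != count) continue;

        freshnodes--;
        out[count++] = nodes[i].id;
    }
    return count;
}
```

Because the loop is bounded by maxiterations, it may return fewer than wanted entries even when enough eligible nodes exist. That trade-off keeps the cost of building each heartbeat predictable.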

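On the receiving side, clusterProcessGossipSection() parses the gossip entries carried in the message. While the cluster is being formed, its main job is to start a handshake with each node it does not know yet, so that the node ends up in the local node table.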

/* Process the gossip section of PING or PONG packets.
 * Note that this function assumes that the packet is already sanity-checked
 * by the caller, not in the content of the gossip section, but in the
 * length. */
void clusterProcessGossipSection(clusterMsg *hdr, clusterLink *link) {
    // Number of gossip entries carried by this message
    uint16_t count = ntohs(hdr->count);
    // Address of the clusterMsgDataGossip array
    clusterMsgDataGossip *g = (clusterMsgDataGossip*) hdr->data.ping.gossip;
    // The node that sent the message
    clusterNode *sender = link->node ? link->node : clusterLookupNode(hdr->sender);

    // Iterate over every gossip entry
    while(count--) {
        // Flags of the gossiped node, as seen by the sender
        uint16_t flags = ntohs(g->flags);
        clusterNode *node;
        sds ci;

        if (server.verbosity == LL_DEBUG) {
            // Build an sds string ci with the flag names joined by commas
            ci = representClusterNodeFlags(sdsempty(), flags);
            serverLog(LL_DEBUG,"GOSSIP %.40s %s:%d %s",
                g->nodename,
                g->ip,
                ntohs(g->port),
                ci);
            sdsfree(ci);
        }

        /* Update our state accordingly to the gossip sections */
        // Look up the gossiped node by name in our nodes table.
        node = clusterLookupNode(g->nodename);
        if (node) {
            /* We already know this node.
               Handle failure reports, only when the sender is a master. */
            // Only a known master sender counts, and entries about myself
            // are ignored.
            if (sender && nodeIsMaster(sender) && node != myself) {
                // The sender flags the node as FAIL or PFAIL.
                if (flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL)) {
                    // Record the sender in the node's failure reports.
                    if (clusterNodeAddFailureReport(node,sender)) {
                        serverLog(LL_VERBOSE,
                            "Node %.40s reported node %.40s as not reachable.",
                            sender->name, node->name);
                    }
                    // Check whether the node should now be marked as FAIL.
                    markNodeAsFailingIfNeeded(node);
                } else { // The sender considers the node healthy.
                    // If the sender had reported the node as failing
                    // before, remove that report.
                    if (clusterNodeDelFailureReport(node,sender)) {
                        serverLog(LL_VERBOSE,
                            "Node %.40s reported node %.40s is back online.",
                            sender->name, node->name);
                    }
                }
            }

            /* If we already know this node, but it is not reachable, and
             * we see a different address in the gossip section of a node that
             * can talk with this other node, update the address, disconnect
             * the old link if any, so that we'll attempt to connect with the
             * new address. */
            // The node is known but flagged FAIL/PFAIL locally, while the
            // gossip entry reports it as healthy at a different address.
            // This suggests the node moved, so reconnect at the new address.
            if (node->flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL) &&
                !(flags & CLUSTER_NODE_NOADDR) &&
                !(flags & (CLUSTER_NODE_FAIL|CLUSTER_NODE_PFAIL)) &&
                (strcasecmp(node->ip,g->ip) || node->port != ntohs(g->port)))
            {
                // Free the old cluster link, if any.
                if (node->link) freeClusterLink(node->link);
                // Take the address from the gossip entry.
                memcpy(node->ip,g->ip,NET_IP_STR_LEN);
                node->port = ntohs(g->port);
                // Clear the NOADDR flag.
                node->flags &= ~CLUSTER_NODE_NOADDR;
            }
        } else { // Unknown node: not found in our nodes table
            /* If it's not in NOADDR state and we don't have it, we
             * start a handshake process against this IP/PORT pairs.
             *
             * Note that we require that the sender of this gossip message
             * is a well known node in our cluster, otherwise we risk
             * joining another cluster. */
            if (sender &&
                !(flags & CLUSTER_NODE_NOADDR) &&
                !clusterBlacklistExists(g->nodename))
            {
                // Start the handshake.
                clusterStartHandshake(g->ip,ntohs(g->port));
            }
        }

        /* Next node */
        g++;
    }
}
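
The branches of clusterProcessGossipSection() boil down to a small decision table for each gossip entry. The sketch below restates that table in isolation. It is an illustrative model only: gossip_case, gossip_actions, the ACT_* constants, and the NODE_* flag values are all made up here and do not match the real CLUSTER_NODE_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bits assumed for this sketch only. */
#define NODE_PFAIL  (1 << 0)
#define NODE_FAIL   (1 << 1)
#define NODE_NOADDR (1 << 2)

enum {
    ACT_NONE = 0,
    ACT_ADD_REPORT = 1,       /* clusterNodeAddFailureReport() */
    ACT_DEL_REPORT = 2,       /* clusterNodeDelFailureReport() */
    ACT_UPDATE_ADDR = 4,      /* drop the link, adopt the new ip/port */
    ACT_START_HANDSHAKE = 8   /* clusterStartHandshake() */
};

typedef struct {
    bool known;             /* gossiped node is in our nodes table */
    bool is_myself;         /* gossiped node is myself */
    bool sender_known;      /* sender resolved to a known node */
    bool sender_is_master;
    bool addr_differs;      /* gossip ip/port differs from ours */
    bool blacklisted;       /* name is in the handshake blacklist */
    int local_flags;        /* our own flags for the gossiped node */
    int gossip_flags;       /* flags carried in the gossip entry */
} gossip_case;

/* Return the set of actions (ACT_* bits) taken for one gossip entry. */
int gossip_actions(const gossip_case *c) {
    const int failing = NODE_PFAIL | NODE_FAIL;
    int actions = ACT_NONE;

    if (!c->known) {
        if (c->sender_known && !(c->gossip_flags & NODE_NOADDR) &&
            !c->blacklisted)
            actions |= ACT_START_HANDSHAKE;
        return actions;
    }
    if (c->sender_known && c->sender_is_master && !c->is_myself)
        actions |= (c->gossip_flags & failing) ? ACT_ADD_REPORT
                                               : ACT_DEL_REPORT;
    if ((c->local_flags & failing) && !(c->gossip_flags & NODE_NOADDR) &&
        !(c->gossip_flags & failing) && c->addr_differs)
        actions |= ACT_UPDATE_ADDR;
    return actions;
}
```

Two things stand out in this form: an unknown node is only contacted when the sender itself is known, and the failure report and address update checks are independent, so a single entry can trigger both.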

References:

https://blog.csdn.net/men_wen/article/details/72871618