Redis study notes (18): cluster (part 2)

Replication and failover

The nodes in a Redis cluster are divided into master nodes and slave nodes. A master node handles slots, while a slave node replicates a particular master node and, when that master node goes offline, takes its place and continues to process command requests.

Setting a slave node: CLUSTER REPLICATE <node_id> makes the node that receives the command become a slave node of the node specified by node_id and start replicating that master node.

1) The node that receives the command first finds the clusterNode structure for node_id in its clusterState.nodes dictionary and points its clusterState.myself.slaveof pointer at this structure, recording which master node it is replicating:

struct clusterNode {
    // ...

    // If this node is a slave node, points to its master node
    struct clusterNode *slaveof;
};

2) The node then modifies clusterState.myself.flags, turning off the REDIS_NODE_MASTER flag and turning on the REDIS_NODE_SLAVE flag, to indicate that it has changed from a master node into a slave node.

3) Finally, the node calls the replication code and starts replicating the master node, using the IP address and port number stored in the clusterNode structure that clusterState.myself.slaveof points to.
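
The three steps above can be sketched in C roughly as follows. This is a minimal illustration, not the actual Redis source: the trimmed-down clusterNode, the helper name set_my_master, and the flag values are assumptions made for the example, and step 3 (starting replication) is represented only by a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values are illustrative, not necessarily those used by Redis. */
#define REDIS_NODE_MASTER 1
#define REDIS_NODE_SLAVE  2

/* Trimmed-down node structure with only the fields used here. */
typedef struct clusterNode {
    int flags;                   /* role flags */
    char ip[16];                 /* address of this node */
    int port;
    struct clusterNode *slaveof; /* master being replicated, or NULL */
} clusterNode;

/* Hypothetical helper: turn `myself` into a slave of `master`. */
void set_my_master(clusterNode *myself, clusterNode *master) {
    assert(myself != NULL && master != NULL && myself != master);
    myself->slaveof = master;            /* step 1: remember the master */
    myself->flags &= ~REDIS_NODE_MASTER; /* step 2: clear the master flag */
    myself->flags |= REDIS_NODE_SLAVE;   /*         and set the slave flag */
    /* step 3: replication would start here, using master->ip and master->port */
}
```

Only the node's own view changes here; the other nodes learn about the new role through cluster messages.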

The fact that a node has become a slave node and started replicating a master node is sent to the other nodes in the cluster through messages, so eventually every node in the cluster knows that this slave node is replicating that master node.

All nodes in the cluster will record the list of slave nodes that are replicating the master node in the slaves and numslaves attributes of the clusterNode structure representing the master node:

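struct clusterNode {
    // ...

    // Number of slave nodes replicating this master node
    int numslaves;
    // Array in which each item points to the clusterNode of a slave node replicating this master node
    struct clusterNode **slaves;
};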
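
As a rough sketch of how the slaves array and numslaves counter could be kept in sync, the following C function appends a slave to a master's list. The helper name add_slave and the use of realloc are assumptions for illustration; the structure is reduced to the two fields discussed here.

```c
#include <assert.h>
#include <stdlib.h>

/* Only the two fields discussed in the text. */
typedef struct clusterNode {
    int numslaves;
    struct clusterNode **slaves;
} clusterNode;

/* Hypothetical helper: record that `slave` replicates `master`.
 * Returns 1 if added, 0 if already present or out of memory. */
int add_slave(clusterNode *master, clusterNode *slave) {
    assert(master != NULL && slave != NULL);
    for (int i = 0; i < master->numslaves; i++)
        if (master->slaves[i] == slave) return 0;   /* already recorded */
    clusterNode **grown = realloc(master->slaves,
                                  sizeof(*grown) * (master->numslaves + 1));
    if (grown == NULL) return 0;                    /* keep the old array */
    grown[master->numslaves] = slave;
    master->slaves = grown;
    master->numslaves++;
    return 1;
}
```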

Each node in the cluster periodically sends PING messages to the other nodes to check whether they are online. If the node receiving a PING message does not return a PONG message within the specified time, the node that sent the PING marks it as suspected offline (probable fail, PFAIL).
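
The timeout check can be sketched as below. The field ping_sent, the helper name check_ping_timeout, and the flag value are assumptions for this example; the time limit is passed in as a plain parameter.

```c
#include <assert.h>
#include <stddef.h>

typedef long long mstime_t;

#define REDIS_NODE_PFAIL 4   /* illustrative value */

typedef struct clusterNode {
    int flags;
    mstime_t ping_sent;      /* time of the unanswered PING, 0 if none pending */
} clusterNode;

/* Hypothetical check: returns 1 if the node was newly marked PFAIL. */
int check_ping_timeout(clusterNode *node, mstime_t now, mstime_t timeout) {
    assert(node != NULL);
    if (node->ping_sent == 0) return 0;             /* no PING pending */
    if (node->flags & REDIS_NODE_PFAIL) return 0;   /* already suspected */
    if (now - node->ping_sent <= timeout) return 0; /* still within the limit */
    node->flags |= REDIS_NODE_PFAIL;
    return 1;
}
```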

Nodes in the cluster exchange status information about every node by sending messages to each other, for example whether a node is online, suspected offline (PFAIL), or offline (FAIL).

When master node A learns through a message that master node B considers master node C to be suspected offline, master node A finds the clusterNode structure for master node C in its clusterState.nodes dictionary and adds master node B's offline report to the fail_reports linked list of that structure:

struct clusterNode {
    // ...

    // Linked list recording the offline reports that all other nodes have filed about this node
    list *fail_reports;
};

Offline report structure:

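typedef struct clusterNodeFailReport {
    // The node that reported the target node as offline
    struct clusterNode *node;
    // Time at which the most recent offline report was received from node
    // (the program uses this timestamp to check whether the report has expired)
    mstime_t time;
} clusterNodeFailReport;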

If more than half of the master nodes responsible for processing slots in the cluster report a master node x as suspected offline, master node x is marked as offline (FAIL). The node that marks master node x as offline broadcasts a FAIL message about master node x to the cluster, and every node that receives this message immediately marks master node x as offline.
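
A minimal sketch of the quorum decision, assuming the reports are kept in a simple singly linked list and that a report expires after a caller-supplied validity window. The type failReport, the integer reporter_id, and both function names are hypothetical simplifications of the structures above.

```c
#include <assert.h>
#include <stddef.h>

typedef long long mstime_t;

/* One report per reporting node (simplified stand-in for the report list). */
typedef struct failReport {
    int reporter_id;          /* stands in for the clusterNode pointer */
    mstime_t time;            /* when the report was last received */
    struct failReport *next;
} failReport;

/* Count the reports that are not older than `validity` milliseconds. */
int count_valid_reports(const failReport *head, mstime_t now, mstime_t validity) {
    int n = 0;
    for (; head != NULL; head = head->next)
        if (now - head->time <= validity) n++;
    return n;
}

/* "More than half" of the slot-holding masters must agree. */
int reaches_fail_quorum(int valid_reports, int voting_masters) {
    assert(valid_reports >= 0 && voting_masters > 0);
    return valid_reports >= voting_masters / 2 + 1;
}
```

With three slot-holding masters, two valid reports are enough; with four, three are needed.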

Steps of failover:

1) Among all the slave nodes replicating the offline master node, one slave node is selected.

2) The selected slave node will execute the SLAVEOF no one command and become the new master node.

3) The new master node will revoke all slot assignments to the offline master node and assign these slots to itself.

4) The new master node broadcasts a PONG message to the cluster. This message lets the other nodes in the cluster know immediately that this node has changed from a slave node to a master node and has taken over the slots that the offline node was responsible for.

5) The new master node starts to receive command requests related to the slot it is responsible for processing, and the failover is completed.
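
Step 3 of the failover can be pictured as a pass over a slot table. The sketch below assumes a table of REDIS_CLUSTER_SLOTS owner pointers and a numslots counter per node; the helper name take_over_slots is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define REDIS_CLUSTER_SLOTS 16384

typedef struct clusterNode {
    int numslots;   /* number of slots currently owned */
} clusterNode;

/* Hypothetical helper for step 3: every slot owned by `failed`
 * is reassigned to `promoted`. Returns the number of slots moved. */
int take_over_slots(clusterNode *slots[REDIS_CLUSTER_SLOTS],
                    clusterNode *failed, clusterNode *promoted) {
    assert(slots != NULL && failed != NULL && promoted != NULL);
    int moved = 0;
    for (int i = 0; i < REDIS_CLUSTER_SLOTS; i++) {
        if (slots[i] != failed) continue;   /* owned by someone else */
        slots[i] = promoted;
        failed->numslots--;
        promoted->numslots++;
        moved++;
    }
    return moved;
}
```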

Election of a new master node:

1) The cluster's configuration epoch is a counter whose initial value is 0.

2) When a node in the cluster starts a failover operation, the value of the cluster configuration epoch will be increased by 1.

3) In each configuration epoch, every master node responsible for processing slots has one vote, and the first slave node to request a vote from a master node gets that master node's vote.

4) When a slave node finds that the master node it is replicating has entered the offline state, it broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to the cluster, asking every master node that receives the message and has voting rights to vote for it.

5) If a master node has the right to vote and has not yet voted for another slave node, it returns a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message to the slave node requesting the vote, indicating that it supports that slave node becoming the new master node.

6) Each slave node taking part in the election receives CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK messages and counts how many master nodes support it based on how many such messages it has received.

7) If the cluster has N master nodes with voting rights, a slave node that collects at least N/2 + 1 supporting votes is elected as the new master node.

8) If no slave node collects enough supporting votes in a configuration epoch, the cluster enters a new configuration epoch and holds the election again, until a new master node is elected.
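
The voting rules above can be sketched as two small C functions: one for a master deciding whether to grant its single vote in an epoch, and one for a slave checking whether it has won. The master struct, its field names, and both function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Voting state of a master; field names are illustrative. */
typedef struct master {
    uint64_t last_vote_epoch;  /* last epoch in which this master voted */
    int has_voted;             /* 0 until the first vote is granted */
} master;

/* Returns 1 (vote granted) only for the first request seen in an epoch. */
int request_vote(master *m, uint64_t epoch) {
    assert(m != NULL);
    if (m->has_voted && m->last_vote_epoch >= epoch) return 0;
    m->has_voted = 1;
    m->last_vote_epoch = epoch;
    return 1;
}

/* A slave wins with at least N/2 + 1 votes out of N voting masters. */
int wins_election(int votes, int voting_masters) {
    assert(votes >= 0 && voting_masters > 0);
    return votes >= voting_masters / 2 + 1;
}
```

Because each master grants only one vote per epoch and a majority is required, at most one slave can win in any given epoch.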

Messages

The nodes in the cluster communicate by sending and receiving messages. We call the node that sends a message the sender and the node that receives it the receiver. Nodes mainly send the following five kinds of messages:

1) MEET message. When the sender receives a CLUSTER MEET command from a client, it sends a MEET message to the receiver, asking the receiver to join the cluster the sender currently belongs to.

2) PING message. By default, once every second each node in the cluster randomly picks five nodes from its list of known nodes and sends a PING message to the one among those five that has gone longest without being sent a PING, to check whether that node is online. In addition, if the time since node A last received a PONG message from node B exceeds half of the duration set by node A's cluster-node-timeout option, node A also sends a PING message to node B. This prevents node A's information about node B from going stale because node A has not randomly picked node B as a PING target for a long time.

3) PONG message. When the receiver gets a MEET or PING message from the sender, it returns a PONG message to confirm that the message has arrived. In addition, a node can broadcast its own PONG message to the cluster so that the other nodes immediately refresh what they know about it.

4) FAIL message. When master node A judges that another master node B has entered the FAIL state, node A broadcasts a FAIL message about node B to the cluster, and every node that receives this message immediately marks node B as offline.

5) PUBLISH message. When a node receives a PUBLISH command, the node will execute this command and broadcast a PUBLISH message to the cluster. All nodes that receive this PUBLISH message will execute the same PUBLISH command.
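
The two PING rules described in item 2 can be sketched as follows. The random sampling of five nodes is left to the caller, and the structure fields and function names (pick_ping_target, needs_forced_ping) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef long long mstime_t;

typedef struct clusterNode {
    mstime_t ping_sent;      /* last time a PING was sent to this node */
    mstime_t pong_received;  /* last time a PONG was received from it */
} clusterNode;

/* Among the sampled candidates, pick the one pinged least recently. */
clusterNode *pick_ping_target(clusterNode **sample, int n) {
    assert(sample != NULL);
    clusterNode *oldest = NULL;
    for (int i = 0; i < n; i++)
        if (oldest == NULL || sample[i]->ping_sent < oldest->ping_sent)
            oldest = sample[i];
    return oldest;   /* NULL when the sample is empty */
}

/* Second rule: PING anyway if the last PONG is older than timeout / 2. */
int needs_forced_ping(const clusterNode *node, mstime_t now, mstime_t node_timeout) {
    assert(node != NULL);
    return now - node->pong_received > node_timeout / 2;
}
```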

A message consists of a message header (header) and a message body (data).

Message header:

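typedef struct {
    // Total length of the message (header length plus body length)
    uint32_t totlen;
    // Message type
    uint16_t type;
    // Number of node entries contained in the message body;
    // only used by the three Gossip protocol messages MEET, PING, and PONG
    uint16_t count;

    // Configuration epoch the sender is currently in
    uint64_t currentEpoch;
    // If the sender is a master node, this is the sender's configuration epoch;
    // if the sender is a slave node, this is the configuration epoch of the master node it is replicating
    uint64_t configEpoch;
    // Name (ID) of the sender
    char sender[REDIS_CLUSTER_NAMELEN];
    // The sender's current slot assignment
    unsigned char myslots[REDIS_CLUSTER_SLOTS/8];
    // If the sender is a slave node, this is the name of the master node it is replicating;
    // if the sender is a master node, this is REDIS_NODE_NULL_NAME
    char slaveof[REDIS_CLUSTER_NAMELEN];
    // Port number of the sender
    uint16_t port;
    // Flags of the sender
    uint16_t flags;
    // State of the cluster as seen by the sender
    unsigned char state;
    // Message body
    union clusterMsgData data;
} clusterMsg;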
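
The myslots field is a bitmap with one bit per slot. A minimal sketch of setting, testing, and counting bits in such a bitmap is shown below; the function names are hypothetical, and the bit order within each byte (least significant bit first) is an assumption of this example.

```c
#include <assert.h>

#define REDIS_CLUSTER_SLOTS 16384

/* Set the bit for `slot` in a myslots-style bitmap. */
void slot_set(unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < REDIS_CLUSTER_SLOTS);
    bitmap[slot >> 3] |= (unsigned char)(1 << (slot & 7));
}

/* Return 1 if the bit for `slot` is set, 0 otherwise. */
int slot_test(const unsigned char *bitmap, int slot) {
    assert(slot >= 0 && slot < REDIS_CLUSTER_SLOTS);
    return (bitmap[slot >> 3] >> (slot & 7)) & 1;
}

/* Count how many slots the bitmap claims. */
int slot_count(const unsigned char *bitmap) {
    int n = 0;
    for (int s = 0; s < REDIS_CLUSTER_SLOTS; s++) n += slot_test(bitmap, s);
    return n;
}
```

With 16384 slots the bitmap takes 2048 bytes, which is why the whole slot assignment fits in every message header.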

clusterMsg.data structure:

union clusterMsgData {
    // Body of MEET, PING, and PONG messages
    struct {
        // Every MEET, PING, and PONG message contains two clusterMsgDataGossip structures
        clusterMsgDataGossip gossip[1];
    } ping;

    // Body of FAIL messages
    struct {
        clusterMsgDataFail about;
    } fail;

    // Body of PUBLISH messages
    struct {
        clusterMsgDataPublish msg;
    } publish;
};

The clusterMsgDataGossip structure records the name of the selected node, the timestamps of the last PING message the sender sent to that node and the last PONG message it received from it, the IP address and port number of the selected node, and the flags of the selected node:

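typedef struct {
    // Name of the node
    char nodename[REDIS_CLUSTER_NAMELEN];
    // Timestamp of the last PING message sent to this node
    uint32_t ping_sent;
    // Timestamp of the last PONG message received from this node
    uint32_t pong_received;
    // IP address of the node
    char ip[16];
    // Port number of the node
    uint16_t port;
    // Flags of the node
    uint16_t flags;
} clusterMsgDataGossip;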
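
As an illustration of how one gossip entry might be filled in, here is a short sketch. The helper names fill_gossip and gossip_body_size are hypothetical, and the value 40 for REDIS_CLUSTER_NAMELEN is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REDIS_CLUSTER_NAMELEN 40   /* assumed length of a node name */

typedef struct {
    char nodename[REDIS_CLUSTER_NAMELEN];
    uint32_t ping_sent;
    uint32_t pong_received;
    char ip[16];
    uint16_t port;
    uint16_t flags;
} clusterMsgDataGossip;

/* Hypothetical helper: describe one known node in a gossip entry. */
void fill_gossip(clusterMsgDataGossip *g, const char *name, const char *ip,
                 uint16_t port, uint16_t flags,
                 uint32_t ping_sent, uint32_t pong_received) {
    assert(g != NULL && name != NULL && ip != NULL);
    memset(g, 0, sizeof(*g));
    size_t len = strlen(name);
    if (len > REDIS_CLUSTER_NAMELEN) len = REDIS_CLUSTER_NAMELEN;
    memcpy(g->nodename, name, len);        /* fixed-width name field */
    len = strlen(ip);
    if (len > sizeof(g->ip) - 1) len = sizeof(g->ip) - 1;
    memcpy(g->ip, ip, len);                /* stays NUL-terminated after memset */
    g->port = port;
    g->flags = flags;
    g->ping_sent = ping_sent;
    g->pong_received = pong_received;
}

/* Size in bytes of a message body carrying `count` gossip entries. */
size_t gossip_body_size(uint16_t count) {
    return sizeof(clusterMsgDataGossip) * count;
}
```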

Learn a little every day, there will always be gains.

Note: Out of respect for the author's intellectual property rights, please be aware that the content of this article is based on "Redis Design and Implementation"; these notes are shared here for learning purposes only.

Origin blog.csdn.net/xuetian0546/article/details/106676636