Redis Cluster: failover process and principle

Introduction

        This article introduces the failover process of a Redis Cluster.

Fault discovery

        When a node in the cluster has a problem, there needs to be a robust way to confirm that the node is indeed faulty. Nodes in a Redis Cluster communicate through ping/pong messages, which propagate not only slot information but also other state such as master/slave roles and node failures. Fault discovery is therefore also realized through this message dissemination mechanism, and it consists of two main stages: subjective offline (pfail) and objective offline (fail).

  • Subjective offline: one node considers another node unavailable, i.e. offline. This is not the final fault judgment; it only represents the opinion of a single node and may be a misjudgment.
  • Objective offline: marks a node as genuinely offline. Multiple nodes in the cluster agree that the node is unavailable, reaching a consensus. If the offline node is a master holding slots, a failover must be performed for it.

Subjective offline

        Each node in the cluster periodically sends ping messages to the other nodes, and a receiving node replies with a pong message. If communication keeps failing for longer than cluster-node-timeout, the sending node considers the receiving node faulty and marks it as subjectively offline (pfail). The process is shown in the figure below:

Process description:
1) Node a sends a ping message to node b. If communication is normal, a pong message is received and node a updates the time of its last successful communication with node b.

2) If there is a communication problem between node a and node b, the connection is closed and will be re-established next time. If communication keeps failing, the last communication time node a recorded for node b cannot be updated.

3) When the timing task in node a detects that the last communication time with node b has exceeded cluster-node-timeout, it updates the local state of node b to subjectively offline (pfail).

        In short, subjective offline means that when a node cannot complete ping message communication with another node within cluster-node-timeout, it marks that node as subjectively offline. The clusterState structure in each node stores information about the other nodes, which is used to judge their state from its own perspective.
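
        As a rough illustration of the periodic check described above, the Python sketch below models how a node might mark a peer as pfail once the last successful pong is older than cluster-node-timeout. The names PeerNode, check_pfail and NODE_TIMEOUT_MS are hypothetical helpers for this article, not identifiers from the Redis source.

import time

NODE_TIMEOUT_MS = 15000  # corresponds to cluster-node-timeout (default 15 seconds)

class PeerNode:
    """A hypothetical local view of another cluster node."""
    def __init__(self, name):
        self.name = name
        self.pong_received_ms = int(time.time() * 1000)  # last successful pong
        self.pfail = False                               # subjective offline flag

def check_pfail(peers):
    """Timing task: mark peers whose last pong is older than the timeout as pfail."""
    now_ms = int(time.time() * 1000)
    for peer in peers:
        if not peer.pfail and now_ms - peer.pong_received_ms > NODE_TIMEOUT_MS:
            peer.pfail = True  # subjective offline: only this node's opinion

peers = [PeerNode("6385"), PeerNode("6380")]
check_pfail(peers)  # right after a pong, nothing is marked pfail yet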

        Redis Cluster is very strict when judging whether a node has actually failed. A single node considering another node subjectively offline is not enough to judge accurately whether it is faulty. Consider the following scene:

        The communication between node 6379 and node 6385 is interrupted, so 6379 judges 6385 to be subjectively offline, but communication between 6380 and 6385 is normal. In this case it cannot be determined that node 6385 is faulty. For a robust fault-discovery mechanism, only when the majority of nodes in the cluster judge 6385 to be faulty can 6385 be considered to have actually failed, after which failover is performed for it. This process, in which multiple nodes cooperate to complete fault discovery, is called objective offline.

Objective offline

        When a node judges another node to be offline, that node's status is spread through the cluster along with messages. The body of a ping/pong message carries the status data of 1/10 of the other nodes in the cluster. When the receiving node finds that a message body contains a node in the subjective offline state, it looks up the local ClusterNode structure of the faulty node and saves the report into its offline report linked list. The structure is as follows:

struct clusterNode { /* clusterNode structure of the node considered subjectively offline */
    list *fail_reports; /* records the offline reports from all other nodes for this node */
    ...
};

        Through Gossip message propagation, nodes in the cluster continuously collect offline reports for the faulty node. When more than half of the master nodes holding slots have marked a node as subjectively offline, the objective offline process is triggered. Two questions arise here:

1) Why must the master nodes responsible for slots take part in the fault decision? Because in cluster mode only masters that hold slots handle read/write requests and maintain key information such as the cluster's slot mapping, while slaves only replicate their master's data and state.

2) Why more than half of the slot-holding masters? A majority is required to cope with cluster splits caused by network partitions and similar problems. The smaller partition cannot complete the key step from subjective offline to objective offline, which prevents it from finishing a failover and continuing to serve requests on its own.
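
        To make the majority rule concrete, here is a small hypothetical illustration in Python. The function name objective_offline_quorum and the 6-master/4-2 split are assumptions made for this example, not values from the article's cluster.

def objective_offline_quorum(total_slot_masters):
    """Number of pfail reports required before a node can be marked fail."""
    return total_slot_masters // 2 + 1

# Hypothetical example: 6 masters hold slots; a partition splits them 4 / 2.
total = 6
quorum = objective_offline_quorum(total)   # 4
large_side, small_side = 4, 2
print(large_side >= quorum)  # True  -> can reach objective offline and fail over
print(small_side >= quorum)  # False -> cannot, so it never promotes a new master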

        Assume node a marks node b as subjectively offline. After a while, node a sends node b's status to other nodes in a message. When node c receives the message and parses a message body containing the pfail status of node b, it triggers the objective offline process shown in the following figure:

Flow Description:

  1. When a message body contains the pfail status of other nodes, the state of the sending node is checked. If the sender is a master node, the reported pfail status is processed; if it is a slave, the report is ignored.
  2. Find the node structure corresponding to the pfail status and update its clusterNode offline report linked list.
  3. Attempt to mark the node as objectively offline based on the updated offline report list.

Maintaining the offline report list and attempting the objective offline are described in detail below.

1. Maintain the offline report list

        The ClusterNode structure of each node contains an offline report linked list, which saves the offline reports from other master nodes about that node. The structure is as follows:

typedef struct clusterNodeFailReport {
    struct clusterNode *node; /* node that reported this node as subjectively offline */
    mstime_t time; /* time the offline report was last received */
} clusterNodeFailReport;

        An offline report saves the structure of the node that reported the failure and the time the report was last received. When a pfail status is received, the offline report linked list of the corresponding node is maintained. The pseudocode is as follows:

def clusterNodeAddFailureReport(clusterNode failNode, clusterNode senderNode):
    // Get the offline report linked list of the faulty node
    list report_list = failNode.fail_reports;

    // Check whether an offline report from the sending node already exists
    for(clusterNodeFailReport report : report_list):
        // A report from the sending node already exists
        if(senderNode == report.node):
            // Update the time of the offline report
            report.time = now();
            return 0;

    // The offline report does not exist yet, so insert a new one
    report_list.add(new clusterNodeFailReport(senderNode, now()));
    return 1;

        Each offline report has a validity period. Every time the objective offline is attempted, expired offline reports are detected and deleted. If an offline report is not updated within cluster-node-timeout * 2, it expires and is removed. The pseudocode is as follows:

def clusterNodeCleanupFailureReports(clusterNode node):
    list report_list = node.fail_reports;
    long maxtime = server.cluster_node_timeout * 2;
    long now = now();
    for(clusterNodeFailReport report : report_list):
        // Delete the report if it was last updated more than cluster_node_timeout * 2 ago
        if(now - report.time > maxtime):
            report_list.del(report);

        The validity period of an offline report is server.cluster_node_timeout * 2, mainly to guard against stale false positives. For example, node A reported node B offline an hour ago, but node B later returned to normal. If other nodes now report node B as subjectively offline, the earlier, outdated report should not be counted.

Operation and maintenance tips

        If offline reports from more than half of the slot-holding masters cannot be collected within cluster-node-timeout * 2, the earlier reports expire. In other words, if subjective offline reports arrive more slowly than they expire, the faulty node will never be marked as objectively offline and failover will fail. Therefore it is not recommended to set cluster-node-timeout too small.

2. Attempt objective offline

        Every time a node in the cluster receives the pfail status of other nodes, it will try to trigger an objective offline. The process is shown in the following figure:

Flow Description:

  1. First count the number of valid offline reports; if it does not exceed half of the total number of slot-holding masters in the cluster, exit.
  2. When the number of offline reports exceeds half of the slot-holding masters, mark the corresponding faulty node as objectively offline.
  3. Broadcast a fail message to the cluster to notify all nodes to mark the failed node as objectively offline. The message body of the fail message only contains the ID of the failed node.

Use pseudocode to analyze the objective offline process, as follows:

def markNodeAsFailingIfNeeded(clusterNode failNode):
    // Number of slot-holding masters in the cluster
    int slotNodeSize = getSlotNodeSize();
    // The number of subjective offline reports must exceed half of the slot-holding masters
    int needed_quorum = (slotNodeSize / 2) + 1;
    // Count the valid offline reports for failNode (excluding the current node)
    int failures = clusterNodeFailureReportsCount(failNode);
    // If the current node is a master, add its own vote to failures
    if (nodeIsMaster(myself)):
        failures++;
    // Exit if the number of offline reports is below the quorum
    if (failures < needed_quorum):
        return;
    // Mark the node as objectively offline (fail)
    failNode.flags = REDIS_NODE_FAIL;
    // Record the time of objective offline
    failNode.fail_time = mstime();
    // If the current node is a master, broadcast the fail message for that node to the cluster
    if (nodeIsMaster(myself)):
        clusterSendFail(failNode);

Broadcasting the fail message is the last step of objective offline, and it carries two important responsibilities:

  • Notify all nodes in the cluster to mark the faulty node as an objective offline state and take effect immediately.
  • Notify the slave node of the failed node to trigger the failover process.

        Note that although the fail message is broadcast, there is no guarantee that every node in the cluster learns that the faulty node has entered the objective offline state. For example, when a network partition occurs, the cluster may be split into one large and one small cluster. The large cluster holds more than half of the slot-holding masters and can complete the objective offline and broadcast the fail message, but the small cluster cannot receive it, as shown in the following figure:

        However, once the network recovers, as long as the faulty node has been marked objectively offline, the fail state will eventually be propagated to all nodes in the cluster through Gossip messages.

Operation and maintenance tips

        A network partition prevents the split-off small cluster from receiving the fail message from the large cluster. Therefore, if all the slave nodes of the faulty master end up in the small cluster, the subsequent failover cannot be completed. Deploying masters and their slaves across different racks reduces the possibility of a master and its slaves being partitioned off together.

Fault recovery

        After a faulty node is marked objectively offline, if the offline node is a master holding slots, one of its slave nodes needs to be selected to replace it in order to keep the cluster highly available. All slaves of the offline master are responsible for fault recovery. When a slave finds, through its internal timing task, that the master it replicates has entered the objective offline state, it triggers the fault recovery process shown in the following figure:

1. Eligibility check

        Each slave node must check how long it has been disconnected from its master to determine whether it is eligible to replace the failed master. If the disconnection time exceeds cluster-node-timeout * cluster-slave-validity-factor, the slave is not eligible for failover. The parameter cluster-slave-validity-factor is the validity factor for slave nodes and defaults to 10.
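
        A minimal sketch of this eligibility rule follows, assuming hypothetical parameters disconnected_ms, node_timeout_ms and slave_validity_factor that mirror cluster-node-timeout and cluster-slave-validity-factor; it is an illustration, not the actual Redis implementation.

def slave_is_eligible(disconnected_ms, node_timeout_ms=15000, slave_validity_factor=10):
    """A slave may take part in failover only if its disconnection from the master
    is shorter than cluster-node-timeout * cluster-slave-validity-factor."""
    return disconnected_ms <= node_timeout_ms * slave_validity_factor

# With the defaults (15 s * 10 = 150 s):
print(slave_is_eligible(120_000))  # True  -> eligible
print(slave_is_eligible(200_000))  # False -> skipped for failover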

2. Preparing for election time

        When a slave node is eligible for failover, it updates the time at which the failover election will be triggered; the subsequent steps can only run after that time is reached. The fields related to the election time are as follows:

struct clusterState {
    ...
    mstime_t failover_auth_time; /* time of the previous or next failover election */
    int failover_auth_rank; /* rank of the current slave node */
}

        The delayed trigger mechanism is used mainly to give multiple slave nodes different election delays and thus support priorities. A larger replication offset indicates a slave with lower replication lag, so it should have a higher priority for replacing the failed master. The priority-calculation pseudocode is as follows:

def clusterGetSlaveRank():
    int rank = 0;
    // Get the master of this slave node
    clusterNode master = myself.slaveof;
    // Get the current slave's replication offset
    long myoffset = replicationGetSlaveOffset();
    // Compare it with the replication offsets of the other slaves
    for (int j = 0; j < master.slaves.length; j++):
        // rank is this slave's position in the offset ranking of all slaves; 0 means the largest offset.
        if (master.slaves[j] != myself && master.slaves[j].repl_offset > myoffset):
            rank++;
    return rank;

The election trigger time is then updated using this priority ranking; the pseudocode is as follows:

def updateFailoverTime():
    // Default election trigger time: within one second after the objective offline is discovered.
    server.cluster.failover_auth_time = now() + 500 + random() % 500;
    // Get the rank of the current slave node
    int rank = clusterGetSlaveRank();
    long added_delay = rank * 1000;
    // Add added_delay to failover_auth_time
    server.cluster.failover_auth_time += added_delay;
    // Record the rank of the current slave node
    server.cluster.failover_auth_rank = rank;

Among all the slave nodes, the one with the largest replication offset therefore triggers the failure election earliest, as shown in the figure and the worked example below:
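
        The following worked example mirrors the updateFailoverTime pseudocode above (500 ms fixed delay, up to 500 ms of jitter, plus 1000 ms per rank). The function election_delay_ms and the three-slave scenario are assumptions made for illustration.

import random

def election_delay_ms(rank):
    """Delay before a slave of the given rank may start the election:
    500 ms fixed + up to 500 ms random jitter + rank * 1000 ms."""
    return 500 + random.randrange(500) + rank * 1000

# Hypothetical example with three slaves; rank 0 has the largest replication offset.
for rank in range(3):
    print(f"rank {rank}: possible delay {500 + rank * 1000}..{999 + rank * 1000} ms, "
          f"sampled {election_delay_ms(rank)} ms")
# Whatever the jitter, rank 0 always starts before rank 1, and rank 1 before rank 2.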


3. Initiate an election

        When the slave node's timing task detects that the failure election time (failover_auth_time) has been reached, it initiates the election process as follows:

(1) Update configuration epoch

        The configuration epoch is an integer that only increases. Each master node maintains a configuration epoch (clusterNode.configEpoch) indicating the version of that master; the configuration epochs of all masters are distinct, and slave nodes copy the configuration epoch of their master. The whole cluster also maintains a global configuration epoch (clusterState.currentEpoch), which records the largest configuration epoch among all masters in the cluster. Execute the cluster info command to view the configuration epoch information:

127.0.0.1:6379> cluster info
...
cluster_current_epoch:15 // largest configuration epoch in the whole cluster
cluster_my_epoch:13 // configuration epoch of the current master node

        Configuration epochs are propagated through the cluster along with ping/pong messages. When the sender and the receiver are both masters and their configuration epochs are equal, there is a conflict; the side with the larger nodeId increments the global configuration epoch and assigns it to itself to resolve the conflict. The pseudocode is as follows:

def clusterHandleConfigEpochCollision(clusterNode sender):
    if (sender.configEpoch != myself.configEpoch || !nodeIsMaster(sender) || !nodeIsMaster(myself)):
        return;

    // Ignore when the sender's nodeId is smaller than our own nodeId
    if (sender.nodeId <= myself.nodeId):
        return;

    // Update the global configuration epoch and our own configuration epoch
    server.cluster.currentEpoch++;
    myself.configEpoch = server.cluster.currentEpoch;

The main roles of the configuration epoch:

  1. It indicates the version of each master node in the cluster and the largest version in the current cluster.
  2. Every time an important event happens in the cluster (a new master appears, either newly joined or promoted from a slave, or slave nodes compete in an election), the cluster's global configuration epoch is incremented and assigned to the relevant master to record that key event.
  3. A larger configuration epoch on a master represents more up-to-date cluster state. Therefore, when nodes exchange ping/pong messages and find inconsistencies in key information such as slots, the side with the larger configuration epoch prevails, preventing stale state from polluting the cluster.

The application scenarios of the configuration epoch are:

  1. New nodes are added.
  2. Slot node mapping collision detection.
  3. Slave node voting for election conflict detection.

Development Tips

        As mentioned earlier, when modifying the slot-to-node mapping with the cluster setslot command, the master executing the request must have the largest local configuration epoch (configEpoch); otherwise the modified slot information will not be adopted by nodes with a higher configuration epoch during message propagation. Since the Gossip communication mechanism cannot tell exactly which node currently has the largest configuration epoch, the cluster setslot {slot} node {nodeId} command at the end of a slot migration needs to be executed once on every master node.

        Every time a slave node initiates a vote, the cluster's global configuration epoch is incremented automatically and stored separately in the clusterState.failover_auth_epoch variable, identifying the version of the election this slave initiated.

(2) Broadcast election message

        The slave broadcasts an election message (FAILOVER_AUTH_REQUEST) in the cluster and records that the message has been sent, ensuring that it can initiate only one election per configuration epoch. The message content is the same as a ping message except that its type is changed to FAILOVER_AUTH_REQUEST.
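
        A hedged sketch of this election start follows: the slave increments the cluster's currentEpoch, records it in failover_auth_epoch, and broadcasts one FAILOVER_AUTH_REQUEST per epoch. The class and function names are illustrative stand-ins, not the actual Redis source.

class ClusterState:
    def __init__(self):
        self.current_epoch = 15          # global configuration epoch (value illustrative)
        self.failover_auth_epoch = 0     # epoch of the election this slave initiated
        self.failover_auth_sent = False  # ensures only one request per epoch

def start_election(cluster, broadcast):
    """Bump the epoch, remember it, and broadcast FAILOVER_AUTH_REQUEST once."""
    if cluster.failover_auth_sent:
        return
    cluster.current_epoch += 1
    cluster.failover_auth_epoch = cluster.current_epoch
    broadcast({"type": "FAILOVER_AUTH_REQUEST", "epoch": cluster.failover_auth_epoch})
    cluster.failover_auth_sent = True

start_election(ClusterState(), broadcast=print)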

4. Election voting

        Only master nodes holding slots process the failure election message (FAILOVER_AUTH_REQUEST), because each slot-holding master has one and only one vote per configuration epoch. When it receives the first request-for-vote message from a slave, it replies with a FAILOVER_AUTH_ACK message as its vote; election messages from other slaves in the same configuration epoch are then ignored.
        The voting process is essentially a leader election. For example, N slot-holding masters in the cluster represent N votes. Since each slot-holding master can vote for only one slave per configuration epoch, only one slave can collect N/2 + 1 votes, which guarantees that a unique slave is chosen.

        Redis Cluster does not let the slave nodes elect a leader among themselves, mainly because that would require at least 3 slaves to guarantee an N/2 + 1 majority, wasting slave resources. Instead, all slot-holding masters in the cluster perform the leader election, so the election can complete even when there is only a single slave.

        When a slave collects votes from N/2 + 1 slot-holding masters, it may perform the operation of replacing its master. For example, the cluster has 5 slot-holding masters; after master b fails, 4 remain. When one of b's slaves collects 3 votes, it has enough votes to replace the master, as shown in the following figure:
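
        The sketch below illustrates the two sides of the vote described above: a slot-holding master grants at most one vote per configuration epoch, and the candidate slave checks whether it has reached the N/2 + 1 quorum. MasterVoter and has_quorum are hypothetical names for this illustration.

class MasterVoter:
    """Hypothetical slot-holding master: at most one FAILOVER_AUTH_ACK per epoch."""
    def __init__(self):
        self.last_vote_epoch = 0

    def handle_auth_request(self, request_epoch):
        if request_epoch <= self.last_vote_epoch:
            return None                       # already voted in this (or a newer) epoch
        self.last_vote_epoch = request_epoch
        return {"type": "FAILOVER_AUTH_ACK", "epoch": request_epoch}

def has_quorum(acks_received, slot_masters):
    """The candidate slave needs more than half of the slot-holding masters."""
    return acks_received >= slot_masters // 2 + 1

# Hypothetical example: 5 slot masters, the failed one cannot vote, 3 ACKs arrive.
print(has_quorum(3, 5))  # True -> the slave may replace its master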

Operation and maintenance tips

        The faulty master node is also counted in the vote total. Suppose the cluster consists of 3 masters and 3 slaves, with 2 of the masters deployed on one machine. When that machine goes down, the slaves cannot collect 3/2 + 1 = 2 master votes, so failover fails. The same problem applies to the fault-discovery stage. Therefore, when deploying a cluster, the master nodes need to be spread over at least 3 physical machines to avoid single points of failure.

        Vote invalidation: each configuration epoch represents one election cycle. If the slave does not obtain enough votes within cluster-node-timeout * 2 after voting starts, the election for this epoch is void. The slave then increments the configuration epoch and starts the next round of voting, until the election succeeds.

5. Replace the master node

When the slave node has collected enough votes, it triggers the operation of replacing the master node (a sketch follows the list below):

  1. The current slave cancels replication and becomes a master.
  2. Execute clusterDelSlot to revoke the slots the faulty master was responsible for, then execute clusterAddSlot to assign those slots to itself.
  3. Broadcast its own pong message to the cluster, informing all nodes that this slave has become a master and has taken over the slot information of the faulty master.
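
        The outline below illustrates those three takeover steps. The Node class, the helper names, and the slot range used for node 6385 are assumptions made for this sketch; the real logic lives in the Redis cluster implementation.

class Node:
    """Minimal stand-in for clusterNode, for illustration only."""
    def __init__(self, name, role, slots=None, slaveof=None):
        self.name, self.role, self.slaveof = name, role, slaveof
        self.slots = set(slots or [])

def replace_failed_master(slave, failed_master, broadcast_pong):
    """Illustrative outline of the takeover steps described above."""
    slave.role, slave.slaveof = "master", None   # 1. cancel replication, become a master
    taken = set(failed_master.slots)
    failed_master.slots.clear()                  # 2. revoke slots from the old master (clusterDelSlot)
    slave.slots |= taken                         #    and claim them itself (clusterAddSlot)
    broadcast_pong(slave)                        # 3. pong broadcast announcing the new master and slots

old = Node("6385", "master", slots=range(0, 5461))  # slot range illustrative
new = Node("6386", "slave", slaveof=old)
replace_failed_master(new, old,
                      broadcast_pong=lambda n: print(n.name, "now serves", len(n.slots), "slots"))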

Failover time

        After walking through fault discovery and recovery, we can now estimate the failover time:

  1. Subjective offline (pfail) identification time = cluster-node-timeout.
  2. Subjective offline status message propagation time <= cluster-node-timeout / 2.
    1. The message communication mechanism sends a ping to nodes that have not communicated for more than cluster-node-timeout / 2, and when selecting which nodes to include in the message body, nodes in the offline state are given priority, so pfail reports from more than half of the masters can usually be collected within this period, completing fault discovery.
  3. Slave transfer time <= 1000 ms.
    1. Because of the delayed election mechanism, the slave with the largest offset delays the election by at most 1 second. The first election usually succeeds, so the slave's transfer time is less than 1 second.

Based on the above analysis, the failover time can be estimated as follows:

failover-time (milliseconds) ≤ cluster-node-timeout + cluster-node-timeout/2 + 1000
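
        As a worked instance of this estimate with the default cluster-node-timeout of 15 seconds:

cluster_node_timeout = 15000  # default, in milliseconds

failover_time_ms = cluster_node_timeout + cluster_node_timeout // 2 + 1000
print(failover_time_ms)  # 23500 ms, i.e. roughly 23.5 seconds in the worst case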

        The failover time is therefore closely related to the cluster-node-timeout parameter, whose default is 15 seconds. It can be adjusted according to what the business can tolerate, but smaller is not always better; the resulting bandwidth consumption will be explained further in the next section.

Failover Simulation

        The main details of failover have now been covered. Below, we analyze the failover behavior by simulating a master failure in the cluster built earlier. Use kill -9 to forcibly terminate the process of master node 6385, as shown in the following figure:

 Confirm cluster status:

Force close the 6385 process:

Log analysis

1. The replication between the slave node 6386 and the master node 6385 is interrupted, and the log is as follows:

2. Master nodes 6379 and 6380 both marked 6385 as subjectively offline; since more than half did so, 6385 was marked objectively offline, and the following logs were printed:

3. The slave node detects that the master it is replicating has entered the objective offline state and prepares the election time. The log shows an election delay of 964 milliseconds and prints the current slave's replication offset.

4. After the delayed election time arrives, the slave node updates the configuration epoch and initiates a failure election.

5. 6379 and 6380 master nodes vote for slave node 6386, the log is as follows:


6. After obtaining votes from 2 master nodes (more than half), slave node 6386 performs the operation of replacing the master and completes the failover:

Manually restore a node

        After the failover completes successfully, we recover the failed node 6385 and observe whether its status is correct:

1) Restart the failed node 6385.

 2) After node 6385 starts up and finds that the slots it was responsible for have been assigned to another node, it accepts the existing cluster configuration and becomes a slave of the new master node 6386. The key log is as follows:

3) Other nodes in the cluster receive the ping message from 6385 and clear the objective offline status:

4) The 6385 node becomes the slave node, and initiates the replication process to the master node 6386:

5) The final cluster state is shown in the figure below.

 6386 becomes master and 6385 becomes its slave

Other URLs

"Redis Development and Operation" => Chapter 10 Clustering => 10.6 Failover


Origin blog.csdn.net/feiying0canglang/article/details/123580874