Principles of Redis Cluster

 1. Implementation of the CLUSTER MEET command

    By sending the CLUSTER MEET command to node A, a client can ask node A, the node receiving the command, to add another node B to the cluster that node A currently belongs to:

    CLUSTER MEET <ip> <port>

    On receiving the command, node A performs a handshake with node B to confirm each other's existence and lay the foundation for further communication:

    1) Node A will create a clusterNode structure for node B and add the structure to its own clusterState.nodes dictionary.

    2) After that, node A will send a MEET message to node B according to the IP address and port number given by the CLUSTER MEET command.

    3) If all goes well, node B will receive the MEET message sent by node A, create a clusterNode structure for node A, and add the structure to its own clusterState.nodes dictionary.

    4) After that, node B will return a PONG message to node A.

    5) If all goes well, node A will receive the PONG message returned by node B. Through this PONG message, node A can know that node B has successfully received the MEET message sent by itself.

    6) After that, node A will send a PING message to node B.

    7) If all goes well, node B will receive the PING message sent by node A. Through this PING message, node B knows that node A has successfully received the PONG message it returned, and the handshake is complete.

    After that, node A will spread the information of node B to other nodes in the cluster through the gossip protocol, so that other nodes also shake hands with node B. Finally, after a period of time, node B will be recognized by all nodes in the cluster.
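
    The handshake above can be summarized in a small, self-contained sketch. ClusterNode, send_message and the nodes dictionaries are simplified, illustrative stand-ins for the real clusterNode/clusterState structures and the cluster bus, not the actual Redis implementation:

    class ClusterNode:
        def __init__(self, ip, port):
            self.ip, self.port = ip, port

    def send_message(addr, msg_type):
        # stand-in for the real cluster bus connection
        print("->", addr, msg_type)

    # node A handles CLUSTER MEET <ip> <port>; nodes_a / nodes_b stand in
    # for each node's clusterState.nodes dictionary
    def cluster_meet(nodes_a, ip, port):
        nodes_a[(ip, port)] = ClusterNode(ip, port)   # 1) record node B
        send_message((ip, port), "MEET")              # 2) send MEET to B

    def on_meet(nodes_b, a_addr):                     # node B receives MEET
        nodes_b[a_addr] = ClusterNode(*a_addr)        # 3) record node A
        send_message(a_addr, "PONG")                  # 4) reply with PONG

    def on_pong(nodes_a, b_addr):                     # node A receives PONG
        send_message(b_addr, "PING")                  # 6)-7) final PING completes
                                                      #       the handshake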

 

2. Slot assignment

    Redis Cluster stores the key-value pairs of its database by sharding: the whole key space of the cluster is divided into 16384 slots, each key in the database belongs to exactly one of these 16384 slots, and each node in the cluster can handle anywhere from 0 up to all 16384 slots.

    When every one of the 16384 slots is handled by some node, the cluster is online (ok); conversely, if any slot is not handled by any node, the cluster is offline (fail).

    One or more slots can be assigned to a node by sending the CLUSTER ADDSLOTS command to the node:

    CLUSTER ADDSLOTS <slot> [slot ...]

    127.0.0.1:7000> CLUSTER ADDSLOTS 0 1 2 3 4 ... 5000

    OK

 

    127.0.0.1:7000> CLUSTER INFO

    cluster_state:ok

 

    The slots and numslots properties of clusterNode record which slots are handled by the node:

    struct clusterNode {
        // ...
        unsigned char slots[16384/8];
        int numslots;
        // ...
    };

    The slots attribute is a bit array: it is 16384/8 = 2048 bytes long and therefore contains 16384 bits in total. If the bit of the slots array at index i is 1, the node is responsible for processing slot i; if it is 0, it is not.
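
    A small illustrative sketch (Python, not the Redis source) of how such a 2048-byte bit array is read and written, assuming the bit for slot i is bit i % 8 of byte i // 8:

    SLOTS_BYTES = 16384 // 8        # 2048 bytes, one bit per slot

    def slot_is_set(slots, i):
        # bit i is 1 when the node is responsible for slot i
        return (slots[i // 8] >> (i % 8)) & 1 == 1

    def set_slot(slots, i):
        slots[i // 8] |= 1 << (i % 8)

    slots = bytearray(SLOTS_BYTES)
    set_slot(slots, 5000)
    assert slot_is_set(slots, 5000) and not slot_is_set(slots, 5001)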

 

    In addition to recording the slots it is responsible for in the slots and numslots attributes of its own clusterNode structure, a node also sends its slots array to the other nodes in the cluster through messages, to tell them which slots it is currently responsible for processing.

 

    The slots array in the clusterState structure records assignments for all 16384 slots in the cluster:

    typedef struct clusterState {
        // ...
        clusterNode *slots[16384]; // each array item points to a clusterNode
        // ...
    } clusterState;
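
    A small illustrative sketch of the two questions this array answers: whether every slot has been assigned, so the cluster can go online (see section 2), and which node is responsible for a given slot (an unassigned slot points to NULL, here None):

    def cluster_is_online(slots):
        # the cluster is ok only when all 16384 slots point to some node
        return all(node is not None for node in slots)

    def node_for_slot(slots, i):
        # the clusterNode responsible for slot i, or None if unassigned
        return slots[i]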

    

3. Execute commands in the cluster

    After all 16384 slots in the database are assigned, the cluster will enter the online state, and the client can send data commands to the nodes in the cluster.

    When a client sends a command involving a database key to a node, the node receiving the command first computes which slot the key belongs to, and then checks whether that slot is assigned to itself:

    If the slot is assigned to the current node, the node executes the command directly. Otherwise, the node returns a MOVED error to the client, redirecting the client to the correct node, where it must resend the command it wanted to execute.

    Computing which slot a key belongs to:

    def slot_number(key):
        # CRC-16 checksum of the key, masked into the range 0..16383
        return CRC16(key) & 16383

    Determine whether slot i is handled by the current node:

    clusterState.slots[i] == clusterState.myself
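
    Redis Cluster uses the CRC16-CCITT (XMODEM) variant of CRC-16 (polynomial 0x1021, initial value 0); a runnable sketch of the key-to-slot mapping:

    def crc16_xmodem(data: bytes) -> int:
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
                crc &= 0xFFFF
        return crc

    def slot_number(key: bytes) -> int:
        return crc16_xmodem(key) & 16383

    # the node compares this slot against the ones it is responsible for;
    # if the slot belongs to another node it replies with a MOVED error
    print(slot_number(b"msg"))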

 

    A cluster-aware client usually maintains socket connections to multiple nodes in the cluster; the so-called node switching is really just picking a different socket over which to send the command.

    One difference between a cluster node and a standalone Redis server, as far as databases are concerned, is that a cluster node can only use database 0.

 

4. Re-sharding

    Redis Cluster's resharding operation can reassign any number of slots currently assigned to one node (the source node) to another node (the target node), and the key-value pairs belonging to those slots are moved from the source node to the target node as well. (Resharding is not rehashing; be careful to distinguish it from client-side consistent-hash sharding.)

    The resharding operation can be performed online. During the resharding process, the cluster does not need to be offline, and both the source and target nodes can continue to process command requests.

    The resharding operation is performed by the Redis cluster management software redis-trib. Redis provides all the commands required for resharding, and redis-trib performs resharding operations by sending commands to the source node and the target node.

    The steps redis-trib follows to reshard a single slot are as follows (a sketch of the whole loop follows the list):

    1) redis-trib sends the CLUSTER SETSLOT <slot> IMPORTING <source_id> command to the target node, so that the target node prepares to import key-value pairs belonging to slot <slot> from the source node.

    2) redis-trib sends the CLUSTER SETSLOT <slot> MIGRATING <target_id> command to the source node, so that the source node is ready to migrate the key-value pairs belonging to the slot to the target node.

    3) redis-trib sends the CLUSTER GETKEYSINSLOT <slot> <count> command to the source node to obtain the key names of up to count key-value pairs belonging to the slot.

    4) For each key name obtained in step 3, redis-trib sends a MIGRATE <target_ip> <target_port> <key_name> 0 <timeout> command to the source node to atomically migrate the selected key from the source node to the target node.

    5) Repeat steps 3 and 4 until all key-value pairs belonging to the slot that are stored on the source node have been migrated to the target node.

    6) redis-trib sends the CLUSTER SETSLOT <slot> NODE <target_id> command to any node in the cluster to assign the slot to the target node. This assignment information will be sent to the entire cluster through a message, and finally all nodes in the cluster will know that the slot has been assigned to the target node.
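
    The whole single-slot loop above can be sketched as follows; send_command is a hypothetical helper standing in for sending a Redis command to a node over an open connection:

    def send_command(node, *args):
        raise NotImplementedError("stand-in for a real Redis connection")

    def reshard_slot(slot, source, target, source_id, target_id,
                     target_ip, target_port, timeout_ms=1000, batch=100):
        # 1) the target node prepares to import the slot from the source
        send_command(target, "CLUSTER", "SETSLOT", slot, "IMPORTING", source_id)
        # 2) the source node prepares to migrate the slot to the target
        send_command(source, "CLUSTER", "SETSLOT", slot, "MIGRATING", target_id)
        while True:
            # 3) fetch up to `batch` key names of this slot still on the source
            keys = send_command(source, "CLUSTER", "GETKEYSINSLOT", slot, batch)
            if not keys:
                break                 # 5) stop once the slot holds no more keys
            for key in keys:
                # 4) atomically move each key to the target node (database 0)
                send_command(source, "MIGRATE", target_ip, target_port,
                             key, 0, timeout_ms)
        # 6) finally assign the slot to the target node; the new assignment is
        #    then propagated to the rest of the cluster through messages
        send_command(target, "CLUSTER", "SETSLOT", slot, "NODE", target_id)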

    ASK error:

    During resharding, while a slot is being migrated from the source node to the target node, the following situation may arise: part of the key-value pairs belonging to the slot are still stored on the source node, while the rest have already been stored on the target node.

    When a client sends a command related to a database key to the source node, and the database key to be processed by the command happens to belong to the slot being migrated:

    The source node first looks for the specified key in its own database and, if it is found, executes the command sent by the client directly. If the key is not found, it may already have been migrated to the target node, so the source node returns an ASK error to the client, redirecting the client to the target node that is importing the slot, where it must resend the command it wanted to execute.
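
    A sketch of how a cluster-aware client might react to MOVED and ASK errors; execute_on and the two exception classes are hypothetical stand-ins for a real client library (the ASKING command itself is real and must precede the command retried after an ASK redirect):

    class MovedError(Exception):
        def __init__(self, node):
            self.node = node

    class AskError(Exception):
        def __init__(self, node):
            self.node = node

    def execute_on(node, *command):
        raise NotImplementedError("stand-in for sending a command on one socket")

    def execute(slot_map, key, *command):
        slot = slot_number(key)              # see the sketch in section 3
        try:
            return execute_on(slot_map[slot], *command)
        except MovedError as e:
            # the slot has been permanently reassigned: update the cached
            # slot map and retry on the node named in the error
            slot_map[slot] = e.node
            return execute_on(e.node, *command)
        except AskError as e:
            # the slot is only being migrated: retry once on the importing
            # node, prefixed with ASKING, without touching the slot map
            execute_on(e.node, "ASKING")
            return execute_on(e.node, *command)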

 

5. Replication and Failover

    The nodes in a Redis cluster are divided into master nodes and slave nodes: master nodes handle slots, while a slave node replicates a master node and, when the master it replicates goes offline, takes over for the offline master and continues to process command requests in its place.

    A node can be made the slave of another node with: CLUSTER REPLICATE <node_id>

    Fault detection:

    Each node in the cluster periodically sends PING messages to the other nodes to detect whether they are online. If a node that receives a PING does not return a PONG within the specified time, the node that sent the PING marks it as suspected offline (probable fail, PFAIL).

    If more than half of the master nodes responsible for handling slots in the cluster report a master node x as suspected offline, then x is marked as offline (FAIL). The node that marks x as FAIL broadcasts a FAIL message about x to the cluster, and every node that receives this FAIL message immediately marks x as FAIL as well.
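
    A small sketch of the PFAIL-to-FAIL promotion described above; fail_reports is a hypothetical mapping from a node to the set of masters currently reporting it as suspected offline:

    def should_mark_fail(fail_reports, node, voting_masters):
        # voting_masters: the masters currently responsible for at least one slot
        reporters = fail_reports.get(node, set()) & set(voting_masters)
        # promote PFAIL to FAIL once more than half of the slot-handling
        # masters report the node as suspected offline
        return len(reporters) > len(voting_masters) // 2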

    Failover:

    When a slave node finds that the master it is replicating has entered the FAIL state, it starts a failover of the offline master. The failover is performed in the following steps (a small sketch of step 3 follows the list):

    1) Among all the slave nodes that replicate the offline master node, one slave node will be selected.

    2) The selected slave node executes the SLAVEOF NO ONE command and becomes the new master node.

    3) The new master node will revoke all slot assignments to the offline master node and assign all these slots to itself.

    4) The new master node broadcasts a PONG message to the cluster. This PONG message lets the other nodes in the cluster know immediately that this node has changed from a slave into a master, and that it has taken over the slots the offline master used to be responsible for.

    5) The new master node starts receiving command requests related to the slots it is now responsible for, and the failover is complete.
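
    A small sketch of step 3), reassigning every slot of the offline master to the newly promoted node, reusing the clusterState.slots array from section 2 (names are illustrative):

    def take_over_slots(slots, failed_master, new_master):
        for i in range(16384):
            if slots[i] is failed_master:
                # revoke the assignment to the offline master and assign
                # the slot to the new master instead
                slots[i] = new_master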

    To elect a new master node (a small sketch of the vote count follows below):

    1) The configuration epoch of the cluster is an auto-incrementing counter whose initial value is 0.

    2) When a node in the cluster starts a failover operation, the value of the cluster configuration epoch will be incremented by one.

    3) For each configuration epoch, each master node responsible for processing slots in the cluster has one chance to vote, and the first slave node that requests a vote from the master node will receive the master node's vote.

    4) When a slave node finds that the master it is replicating has entered the offline state, it broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to the cluster, asking every master node that receives the message and has voting rights to vote for it.

    5) If a master node has voting rights (it is responsible for handling slots) and has not yet voted for another slave node, it returns a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message to the slave node requesting the vote, indicating that this master supports the slave becoming the new master node.

    6) Each slave node taking part in the election receives CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK messages and, by counting how many such messages it has collected, knows how many master nodes support it.

    7) If there are N master nodes with voting rights in the cluster, then as soon as a slave node collects at least N/2 + 1 support votes, it is elected as the new master node.

    8) Because each master node with voting rights can vote only once per configuration epoch, with N voting masters there can be at most one slave node that collects N/2 + 1 or more support votes, which ensures that only one new master node is elected.

    9) If no slave node collects enough support votes within a configuration epoch, the cluster enters a new configuration epoch and holds the election again, until a new master node is elected.

    This method of electing a new master is very similar to the method of electing a leader Sentinel, because both are implemented based on the Raft algorithm's leader election method.
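
    A small sketch of the vote counting in steps 7) and 8); names are illustrative:

    def is_elected(support_votes, n_voting_masters):
        # a slave wins the election once it has collected at least
        # N/2 + 1 FAILOVER_AUTH_ACK messages in the current configuration epoch
        return support_votes >= n_voting_masters // 2 + 1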
