Redis Sentinel mode: how does Redis recover quickly after a failure?

Author: Kaito

kaito-kidd.com/2020/07/02/redis-sentinel/

In this article, let's look at how Redis implements automatic failure recovery. The implementation builds on the data persistence and data replication mechanisms discussed earlier.

Redis is a very popular in-memory database. Besides delivering very high performance, it also needs to guarantee high availability, so that when a failure occurs the impact on the business is kept as small as possible. For this, Redis provides a complete failure recovery mechanism: Sentinel.

Let's take a look at how Redis performs failure recovery and the principles behind it.

Deployment mode

Redis can be deployed in several ways, and each deployment mode offers a different level of availability.

Single-node deployment: only one node provides service, and both reads and writes go to it. If this node goes down, the service is unavailable and data may be lost, which directly impacts the business.

Master-slave deployment: two nodes form a master-slave pair. Writes go to the master and reads go to the slave; this read/write separation improves access performance. If the master goes down, you must manually promote the slave to master, so the business impact depends on how quickly that manual promotion happens.

Master-slave + Sentinel deployment: the master-slave setup is the same as above, but a group of sentinel nodes is added to check the master's health in real time and automatically promote a slave to the new master when the master goes down, keeping the window of unavailability short.

As these deployment modes show, the key to improving Redis availability is multi-replica deployment plus automatic failure recovery, and multiple replicas in turn depend on master-slave replication.
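As a rough illustration of the read/write split in the master-slave mode above, a client can simply hold one connection to each node. This is only a sketch: the host names are placeholders, and it assumes the redis-py client.

```python
import redis

# Hypothetical addresses for a master-slave pair; adjust to your deployment.
master = redis.Redis(host="redis-master", port=6379, decode_responses=True)
replica = redis.Redis(host="redis-slave", port=6379, decode_responses=True)

# Writes go to the master, reads go to the slave (read/write separation).
master.set("page:home:views", 42)
print(replica.get("page:home:views"))  # may lag briefly: replication is asynchronous
```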

High-availability practices

Redis natively provides master-slave replication to keep the slave's data consistent with the master's.

When the master has a problem, we need to promote a slave to master so that service can continue. If that promotion is done manually, timeliness obviously cannot be guaranteed. Therefore Redis provides sentinel nodes to manage the master and slave nodes and to perform failure recovery automatically when the master runs into trouble.

The entire failure recovery work is done automatically by Redis Sentinel.

Sentinel introduction

Sentinel is Redis's high-availability solution. It is a management tool for multiple Redis instances that can monitor them, send notifications, and fail over automatically.

When deploying Sentinel, we only need to specify the master nodes to be managed in the configuration file; the sentinel nodes then manage those Redis nodes according to the configuration to achieve high availability.

Generally, we deploy multiple sentinel nodes. In a distributed environment, detecting whether a node on some machine has failed from only one other machine can be inaccurate: it may simply be that the network between the two machines has failed while the node itself is fine.

Therefore, for node health detection, multiple detecting nodes are used at the same time, spread across different machines, and the number of sentinels is odd to avoid decision errors caused by network partitions. The sentinel nodes exchange their detection results with each other, and the final decision of whether a node really has a problem is made jointly.

Once the sentinel nodes are deployed and configured, they automatically manage the configured master-slave setup; when the master fails, a slave is promptly promoted to the new master to keep the service available.

So how does it work?

How Sentinel Works

The working process of the sentinel is mainly divided into the following stages:

  • State awareness

  • Heartbeat detection

  • Election of the sentinel leader

  • Choose a new master

  • Promote the new master

  • Client perceives the new master

These stages are described in detail below.

State awareness

When Sentinel starts, only the master addresses are specified in its configuration. To recover from a master failure, the sentinel needs to know the slaves belonging to each master, and a master may have more than one slave, so the sentinel must learn the complete topology of the whole cluster. How does it get this information?

The sentinel sends the INFO command to each master node every 10 seconds. The reply to INFO contains the master-slave topology, including the address and port of every slave. With this information, the sentinel remembers the topology of these nodes and, when a failure occurs later, can pick a suitable slave for failure recovery.
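To get a feel for what the sentinel learns from INFO, the replication section of the reply lists the node's role and its connected slaves. A small sketch with redis-py (the address is a placeholder):

```python
import redis

r = redis.Redis(host="redis-master", port=6379, decode_responses=True)

# The "replication" section of INFO exposes the master-slave topology.
info = r.info("replication")
print(info["role"], "with", info.get("connected_slaves", 0), "slave(s)")

# redis-py parses each slaveN line into a dict with ip/port/offset fields.
for key, value in info.items():
    if key.startswith("slave") and isinstance(value, dict):
        print(key, value.get("ip"), value.get("port"), value.get("offset"))
```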

In addition to sending INFO to the master, the sentinel also publishes the master's current state and its own information to a special pub/sub channel on each master node. By subscribing to this channel, the other sentinels receive what every sentinel publishes, as illustrated after the list below.

There are two main purposes for this:

  • Sentinel nodes can discover other sentinels joining, which makes communication between multiple sentinels possible and lays the groundwork for later joint negotiation

  • Exchanging the master's state with other sentinel nodes provides the basis for later deciding whether the master has failed
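You can observe this exchange yourself by subscribing to the channel the sentinels use for these announcements, which in Redis is `__sentinel__:hello`. A small observation sketch with redis-py (the master address is a placeholder):

```python
import redis

# Connect to the monitored master and listen to the sentinels' hello channel.
r = redis.Redis(host="redis-master", port=6379, decode_responses=True)
p = r.pubsub()
p.subscribe("__sentinel__:hello")

# Each message carries the announcing sentinel's ip/port/runid/epoch plus
# the master's name/ip/port/epoch, which is how sentinels discover each other.
for message in p.listen():
    if message["type"] == "message":
        print(message["data"])
```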

Heartbeat detection

When a failure occurs, the recovery mechanism needs to kick in promptly. How is timeliness ensured?

Every second, each sentinel node sends a PING command to the master, the slaves, and the other sentinel nodes. If the other side responds within the specified time, it is considered healthy and alive; if it fails to respond within that (configurable) time, the sentinel marks the node as subjectively down.

Why is it called subjective offline?

Because when the current sentinel gets no response from the other side, it may simply be that the network between the two machines has failed while the master itself is fine; concluding that the master has failed would then be wrong.

To confirm that the master is really down, multiple sentinel nodes must confirm it together.

Each sentinel node asks other sentinel nodes about the status of the master to jointly confirm whether there is a real failure on this node.

If more than the configured number (the quorum) of sentinel nodes consider the node subjectively down, the node is marked as objectively down.
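A rough sketch of the idea: each checker pings with a timeout and forms its own (subjective) opinion, and only when enough checkers agree to reach the quorum is the node treated as objectively down. The address and quorum value below are placeholders, not Sentinel internals.

```python
import redis
from redis.exceptions import ConnectionError, TimeoutError

MASTER = ("redis-master", 6379)
QUORUM = 2  # configurable: how many sentinels must agree

def subjectively_down(host, port, timeout=1.0):
    """One checker's opinion: no PING reply within the timeout => s_down."""
    try:
        return not redis.Redis(host=host, port=port,
                               socket_timeout=timeout).ping()
    except (ConnectionError, TimeoutError):
        return True

# Imagine three sentinels on different machines each running the same check.
opinions = [subjectively_down(*MASTER) for _ in range(3)]
objectively_down = sum(opinions) >= QUORUM
print("s_down votes:", sum(opinions), "o_down:", objectively_down)
```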

Election of the sentinel leader

Once the node is confirmed to be truly faulty, the failure recovery phase begins. Recovering from the failure also involves a sequence of steps.

First, a sentinel leader must be elected; this one sentinel then carries out the failure recovery, rather than having every sentinel perform the recovery at the same time. Electing the leader requires the sentinel nodes to negotiate among themselves.

In distributed systems, this kind of election and negotiation is called reaching consensus, and the algorithm used for the negotiation is called a consensus algorithm.

A consensus algorithm solves the problem of how multiple nodes in a distributed system agree on a single result for a given question.

There are many consensus algorithms, such as Paxos, Raft, and Gossip; interested readers can look them up on their own, as they are not covered here.

The process Sentinel uses to select a leader is similar to the Raft algorithm, which is simple and easy to understand.

In short, the process is as follows:

  • Each sentinel sets a random timeout; when its timeout expires, it sends a request to the other sentinels asking to become the leader

  • Each of the other sentinels replies to, and confirms, only the first request it receives

  • The first sentinel node to collect confirmations from a majority becomes the leader

  • If after this round of replies no sentinel reaches a majority, a new round of election is held until a leader is chosen

After the sentinel leader is selected, subsequent failure recovery operations will be performed by the sentinel leader.
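The following toy simulation only mirrors the four rules listed above (random timeouts, first-request-wins voting, a required majority, retry on a split vote); it is not Sentinel's actual implementation, and every name in it is made up.

```python
import random

SENTINELS = ["s1", "s2", "s3", "s4", "s5"]
MAJORITY = len(SENTINELS) // 2 + 1

def run_election_round():
    # Each sentinel picks a random timeout; its request goes out when it fires.
    timeouts = {s: random.uniform(0, 1) for s in SENTINELS}
    votes = {s: 0 for s in SENTINELS}
    for voter in SENTINELS:
        # Network delay can reorder which request this voter sees first;
        # the voter confirms only that first request.
        arrival = {c: timeouts[c] + random.uniform(0, 0.3) for c in SENTINELS}
        first = min(arrival, key=arrival.get)
        votes[first] += 1
    winner = max(votes, key=votes.get)
    return winner if votes[winner] >= MAJORITY else None  # need a majority

rounds, leader = 0, None
while leader is None:          # split vote => run another round
    rounds += 1
    leader = run_election_round()
print(f"leader {leader} elected after {rounds} round(s)")
```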

Choose a new master

For the failed master, the sentinel leader needs to select one of its slave nodes to take over its work.

Selecting the new master also follows an order of priority. When there are multiple slaves, the candidate is chosen by: slave-priority configuration > data completeness > smaller runid.

That is, the slave with the smallest slave-priority is preferred; if the priorities are the same, the slave with the most complete data (the largest replication offset) is chosen; and if the data is also equal, the slave with the smaller runid is chosen last.
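This selection rule can be written as a simple sort key: lowest slave-priority first, then the largest replication offset (most complete data), then the lexicographically smallest runid. A sketch with made-up candidate data:

```python
# Hypothetical slave candidates as the sentinel leader might see them.
candidates = [
    {"addr": "10.0.0.11:6379", "priority": 100, "offset": 52100, "runid": "c3d9"},
    {"addr": "10.0.0.12:6379", "priority": 100, "offset": 52400, "runid": "a1f2"},
    {"addr": "10.0.0.13:6379", "priority": 50,  "offset": 51000, "runid": "b7e4"},
]

def selection_key(slave):
    # Lower priority value wins, then higher offset, then smaller runid.
    return (slave["priority"], -slave["offset"], slave["runid"])

new_master = min(candidates, key=selection_key)
print("promote:", new_master["addr"])   # 10.0.0.13: lowest slave-priority
```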

Promote the new master

After the candidate master has been selected by this priority rule, the next step is the actual master-slave switchover.

The sentinel leader sends the slaveof no one command to the candidate node to turn it into a master.

After that, the sentinel leader sends the slaveof $newmaster command to all the other slaves of the failed node, so that they become slaves of the new master and start to synchronize data from it.

Finally, the sentinel leader demotes the faulty node to a slave and records this in its own configuration; when the faulty node comes back, it automatically becomes a slave of the new master.
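In terms of plain Redis commands, the switchover performed by the leader looks roughly like this. The addresses are placeholders; redis-py's slaveof() issues the same SLAVEOF command (newer Redis versions also accept REPLICAOF as an alias).

```python
import redis

# The slave chosen as the new master: SLAVEOF NO ONE promotes it.
new_master = redis.Redis(host="10.0.0.13", port=6379)
new_master.slaveof()                     # no arguments => promote to master

# Every remaining slave is repointed at the new master.
for host in ["10.0.0.11", "10.0.0.12"]:
    redis.Redis(host=host, port=6379).slaveof("10.0.0.13", 6379)
```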

At this point, the entire failover is complete.

Client perceives the new master

Finally, how does the client get the latest master address?

After the failover completes, the sentinel publishes a message to a designated pub/sub channel on its own node, and the client can subscribe to this channel to be notified of the master change. The client can also actively query the sentinel node for the current master to get the latest master address.
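For the pub/sub approach, the sentinels publish failover events on their own nodes; the `+switch-master` event carries the master name plus the old and new addresses. A minimal listener sketch (the sentinel address is a placeholder):

```python
import redis

# Subscribe on a sentinel node (default port 26379), not on a data node.
sentinel = redis.Redis(host="sentinel-1", port=26379, decode_responses=True)
p = sentinel.pubsub()
p.subscribe("+switch-master")

# Message payload: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
for message in p.listen():
    if message["type"] == "message":
        name, old_ip, old_port, new_ip, new_port = message["data"].split()
        print(f"{name} moved from {old_ip}:{old_port} to {new_ip}:{new_port}")
```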

In addition, Sentinel provides a "hook" mechanism: we can configure a script in the sentinel configuration file, and when the failover completes this hook is triggered to notify the client that a switch has happened, after which the client can query the sentinel again for the latest master address.

In general, the first approach is recommended. Many client SDKs already integrate the logic for obtaining the current master from the sentinel nodes, and we can use it directly.
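For example, redis-py ships a Sentinel helper that asks the sentinel nodes for the current master (and slaves) and reconnects after a failover. The sentinel addresses and the master name "mymaster" below are placeholders for whatever your configuration uses.

```python
from redis.sentinel import Sentinel

# Point the client at the sentinel nodes, not at the Redis data nodes.
sentinel = Sentinel([("sentinel-1", 26379),
                     ("sentinel-2", 26379),
                     ("sentinel-3", 26379)], socket_timeout=0.5)

print(sentinel.discover_master("mymaster"))   # current (ip, port) of the master

master = sentinel.master_for("mymaster", socket_timeout=0.5)   # for writes
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # for reads
master.set("foo", "bar")
print(replica.get("foo"))
```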

Summary

As we have seen, to keep Redis highly available, the sentinel nodes must determine accurately and promptly that a failure has occurred and quickly select a new node to take over the service; the process in between is fairly involved.

The process involves distributed consensus and distributed negotiation, all aimed at making the failover decision accurate.

Understanding how Redis high availability works helps us use Redis more effectively and correctly.
