redis04-Sentinel (sentinel mode)

1. Background

From " redis03-Master-Slave Architecture ", we know that configuring the master-slave mode for Redis can greatly improve the availability of the Redis service and reduce or even avoid the possibility of Redis service downtime.

It has the following capabilities:

  • Fault isolation and recovery: whether the master or a slave goes down, the remaining nodes keep the service running, and the master and slave roles can be switched manually.
  • Read/write separation: the master handles writes and the slaves handle reads, sharing the traffic and balancing the load.
  • High availability foundation: master-slave mode is the most basic form of high availability and is also the prerequisite for sentinel mode and cluster mode.

But there are still quite a few problems. One indicator for measuring system availability is MTTR, the mean time to repair. Although master-slave mode supports manual switching, the window from discovering the failure, to manually switching to stop the loss, to full recovery can be long. The losses during this period are hard to measure, which is a disaster for large, ultra-high-concurrency systems. Therefore, we need the system to detect a master failure automatically and promote one of the slaves to master, achieving automatic failover.

Mean time to repair (MTTR) is the average time it takes to bring a system from a faulty state back to a working state.

2. What is Sentinel Mode?

In actual production environments, servers inevitably run into emergencies: machine crashes, power outages, hardware damage, and so on. Once this happens, the consequences can be disastrous.
At its core, sentinel mode is still an evolution of master-slave mode, but compared with plain master-slave mode it adds probing and an election mechanism for the case where the master goes down and can no longer accept writes: a new master is elected from the slaves and the switch happens automatically. The election mechanism is implemented by running sentinel processes that monitor each server, as shown below:
[Figure: sentinel processes monitoring the master and slave nodes]

3. The main responsibilities of Sentinel Mode

We know that there are many details to consider in order for the Redis service to implement automatic failover, such as:

  • What are the conditions for deciding that a node has failed? Could it merely appear dead (a hung but still-alive process) or simply be responding slowly?
  • Since it is an election, every slave can compete and has a chance to become the master; choosing which slave to promote is key.
  • Once a new master is elected, the other slaves need to replicate from it, so message notification and communication are also core concerns.

With these thoughts, let's take a look at the official definition of Redis Sentinel:
Sentinel, as an operating mode of Redis, focuses on monitoring the running status of Redis instances (the master and the slaves), and when the master node fails it carries out a series of operations to elect a new master, switch master and slave, and fail over, ensuring the availability of the entire Redis service.

Therefore, the Sentinel’s capabilities should at least include the following points:

  • Monitoring: continuously monitor whether the master and slaves are healthy and working as expected.
  • Master-slave dynamic switching: when the master fails, Sentinel starts the automatic failure-recovery process and selects one of the slaves as the new master.
  • Notification mechanism: after a new master is elected, clients are notified to connect to it, and the slaves replicate from the new master to keep master-slave data consistent.

Next, let's look at how each of these capabilities is implemented.

3.1 Monitoring capabilities

When sentinel mode is enabled, a separate process called sentinel is started alongside the Redis instances. The sentinel process sends a heartbeat (a PING, once per second) to the master, the slaves, and the other sentinel processes to check whether they respond normally.

  • If a slave does not respond to the sentinel's PING within the specified time, the sentinel considers the instance dead and tags it as offline;
  • Likewise, if the master does not respond to the sentinel's PING within the specified time, it is also judged offline, with the extra step of triggering the automatic master switchover process. (The "specified time" is configurable; see the sketch below.)
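
For reference, the "specified time" above corresponds to the down-after-milliseconds directive in sentinel.conf; a minimal sketch with an illustrative value:

# sentinel.conf (the 30-second value below is only illustrative)
# If no valid reply to PING arrives within this window, the sentinel
# marks the instance as subjectively offline.
sentinel down-after-milliseconds mymaster 30000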

A PING reply falls into one of two categories:

  • Valid reply: any of +PONG, -LOADING, -MASTERDOWN;
  • Invalid reply: any other reply, or no reply at all within the specified time.

However, misjudgments can happen: network congestion, a temporarily hung master, or request delays may make an instance unavailable for a short time before it quickly recovers.
If we proactively take it offline at that moment, the availability of the whole system is actually degraded. Moreover, the series of operations that follow a misjudgment, such as master election, message notification, and data synchronization between the slaves and the new master, consume a lot of resources. Misjudgments should therefore be avoided as much as possible.
To make the judgment reliable, two kinds of offline states are distinguished: subjective offline and objective offline.

  • Subjective offline
    Sentinel uses the PING command to monitor the liveness of the master and slave nodes. If the reply is invalid, Sentinel marks the node as subjectively offline. For a slave this is usually final, but for the master more care is needed: a single sentinel can easily misjudge, so multiple sentinels vote on the decision. This is why sentinels are deployed as a group of instances, i.e. a sentinel cluster. Having multiple sentinels judge together prevents a single sentinel from mistakenly deciding that the master is offline just because its own network is poor.
    At the same time, the probability that several sentinels have unstable networks simultaneously is small, so deciding together also reduces the misjudgment rate.

  • Objective offline
    Whether the master should be taken offline cannot be decided by a single sentinel. As mentioned above, there is a sentinel cluster, and this is where it comes into play: the sentinels vote together, and once enough of them (the configured quorum, typically more than half) have judged the master subjectively offline, the master is marked objectively offline, i.e. considered truly dead.
    Once the master is judged objectively offline, there is effectively no master, and the top priority is to quickly elect a new one.

[Figure: subjective offline versus objective offline]

  • How to distinguish subjective offline from objective offline
    Subjective offline means that a single sentinel, on its own, considers the node offline; at this point the node is not yet treated as truly offline. Objective offline means that a certain number of sentinels (for example, more than half) agree that the node is offline; at this point the node really is taken offline, which further triggers the offline handling and the election of a new master.

The "certain amount" here is a quorum, which is determined by the sentinel monitoring configuration. Explain the configuration:

# sentinel monitor <master-name> <master-host> <master-port> <quorum>
# For example:
sentinel monitor mymaster 127.0.0.1 6379 2

This configuration item is used to inform the sentinel of the master node that needs to be monitored:

  • sentinel monitor: represents monitoring.
  • mymaster: represents the name of the master node and can be customized.
  • 127.0.0.1: the IP of the monitored master node; 6379 is its port.
  • 2: the quorum, meaning that only when two or more sentinels believe the master is unavailable will it be marked objectively offline, after which a failover is performed.

To be precise, the quorum decides how many sentinels must report subjective offline before the master is marked objectively offline, while the failover itself is only carried out after the sentinel leading it wins the votes of a majority of all sentinels (at least N/2 + 1 out of N). This majority mechanism keeps a small, possibly misjudging, minority from switching the master on its own.
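
As an illustrative example, suppose five sentinels are deployed and each carries the following line in its sentinel.conf:

sentinel monitor mymaster 127.0.0.1 6379 2

Two sentinels agreeing on subjective offline is enough to mark the master objectively offline, but the sentinel that actually leads the failover still needs votes from a majority of all sentinels (three of the five) before it may perform the switch.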

3.2 Master-slave dynamic switching

A very important job of the sentinel is to elect a new master from the slaves. This election process is rigorous and consists of two steps: filtering plus a comprehensive evaluation.

3.2.1 Filtering

  • Filter out unhealthy slave nodes: those that are offline, disconnected, or do not respond to the sentinel's PING.
  • Evaluate each instance's past network connectivity using down-after-milliseconds: if a slave was disconnected from the master too often within a certain period (for example, more than 10 times in 24 hours), it is not considered.

In this way, only the healthier instances are retained.

3.2.2 Comprehensive assessment

After filtering out unhealthy instances, we can conduct a comprehensive evaluation of the remaining healthy instances in order.

  • Slave priority: through the slave-priority configuration item (redis.conf) you can give different slaves different priorities, and the one with the higher priority is promoted first (see the configuration sketch after this list).
  • Replication progress: select the slave whose slave_repl_offset is closest to master_repl_offset, i.e. the one that lags the original master the least.
  • Slave run ID: when priority and replication progress are identical, the slave with the smallest run ID is chosen as a simple deterministic tiebreak (first come, first served).
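
Below is a hedged sketch of the priority setting in redis.conf; the directive is called replica-priority in recent Redis versions (slave-priority in older ones), and the value shown is only illustrative:

# redis.conf on a replica: a lower value means a higher promotion priority, 0 means never promoted
replica-priority 100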

After all these conditions are evaluated, the most suitable slave is selected and promoted as the new master.
[Figure: filtering and scoring slaves to select the new master]

3.3 Information notification

After the new master is elected, all subsequent write operations go to it. Therefore, all slaves need to be notified as soon as possible so that they run REPLICAOF against the new master and re-establish the run ID and slave_repl_offset, ensuring normal data transfer and master-slave consistency. As shown below:
[Figure: notifying the slaves and clients of the new master]
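
Conceptually, the effect of this notification is the same as each remaining slave running a REPLICAOF against the new master (the address below is illustrative):

# issued on every remaining slave after the failover
REPLICAOF 127.0.0.1 6380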

4. Sentinel Cluster

As mentioned before, a single sentinel may misjudge the offline status of a Redis instance, hence the concept of a sentinel cluster: only when more than a certain proportion of the sentinels (for example, more than half) judge an instance subjectively offline is it really marked objectively offline.
There are several knowledge points here that we need to sort out.

4.1 How sentinels in the cluster implement communication

Sentinels communicate with each other, and discover the slaves, through Redis's pub/sub (publish/subscribe) capability.
Once a sentinel has established a connection to the master, it can use the pub/sub mechanism provided by the master to publish its own IP, port and other information. The master has a dedicated channel, __sentinel__:hello, for messages between sentinels: each sentinel publishes its own name, IP and port on this channel and at the same time subscribes to the messages published by the other sentinels. After discovering each other this way, the sentinels establish connections and can then exchange subsequent messages directly.
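
You can observe these hello messages yourself by subscribing to the channel on the master (the port is illustrative):

# redis-cli -p 6379
SUBSCRIBE __sentinel__:hello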

This is similar to service registration and discovery in microservices, followed by direct RPC communication.
[Figure: sentinels discovering each other via the __sentinel__:hello channel]

4.2 How does the sentinel connect to the slave?

  • The sentinel sends the INFO command to the master.
  • The master returns the list of slaves attached to it.
  • Based on this list, the sentinel establishes a connection with each slave and keeps monitoring it (an illustrative INFO reply is sketched below).

[Figure: the sentinel obtaining the slave list via the INFO command]
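
A hedged sketch of what the sentinel sees in the replication section of the INFO reply (addresses and offsets are illustrative):

# redis-cli -p 6379 INFO replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=12345,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=12345,lag=1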

4.3 How Sentinel communicates with clients about events

Here, too, the pub/sub mechanism is used: Sentinel publishes different events to different channels, and clients subscribe to the messages they care about. Sentinel provides many subscription channels, and different channels carry different key events in the master-slave switchover process.
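
For example, a client can connect to any sentinel (default port 26379) and subscribe to the +switch-master channel to learn when the master changes; a minimal sketch:

# redis-cli -p 26379
SUBSCRIBE +switch-master
# a published message looks roughly like:
# mymaster 127.0.0.1 6379 127.0.0.1 6380   (old ip/port -> new ip/port)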

5. Summary

5.1 Sentinel’s main tasks

The Redis sentinel mechanism is one of the high-availability means of keeping the Redis service uninterrupted. Data synchronization in the master-slave cluster is the basic guarantee of data reliability; when the master goes down, automatic master-slave switching is the key support for uninterrupted service.
The Redis sentinel mechanism automates master-slave switching, so you no longer have to worry about being dragged out of bed in the middle of the night because the master went down:

  • Monitor the running status of master and slave to determine whether they are offline objectively;
  • After the master objectively goes offline, select a slave to switch to the master;
  • Notify slaves and clients of new master information.

5.2 Principle of Sentinel Cluster

In order to avoid the failure of master-slave switching after a single sentinel failure, and to reduce the misjudgment rate, a sentinel cluster was introduced; the sentinel cluster needs some mechanisms to support its normal operation:

  • Implement communication between sentinel clusters based on pub/sub mechanism;
  • Obtain the slave list based on the INFO command to help the sentinel establish a connection with the slave;
  • Through Sentinel's pub/sub, event notification between the client and Sentinel is implemented.

Master-slave switching is not performed by randomly selecting a sentinel. Instead, a leader is selected through voting arbitration, and this leader is responsible for the master-slave switching.

Source: blog.csdn.net/d495435207/article/details/131565271