The core principles of Redis Sentinel mode

1 Introduction

Sentinel (the Chinese name translates as "sentry")

Sentinel is a very important component in the redis cluster architecture. Its main functions are as follows:

(1) Cluster monitoring: monitors whether the redis master and slave processes are working normally
(2) Message notification: if a redis instance fails, the sentinel sends an alarm notification to the administrator
(3) Failover: if the master node goes down, its role is automatically transferred to a slave node
(4) Configuration center: if a failover occurs, the sentinel notifies clients of the new master address

Sentinel itself is also distributed: it runs as a cluster of sentinels that work together.

(1) During failover, deciding that a master node is down requires most of the sentinels to agree. This involves distributed election.
(2) Even if some sentinel nodes go down, the sentinel cluster can still work normally. If the failover system, itself a key part of the high-availability mechanism, were a single point of failure, it could not deliver high availability.

2 Core knowledge

(1) Sentinel needs at least 3 instances to be robust
(2) The sentinel + redis master-slave deployment architecture does not guarantee zero data loss; it only guarantees high availability of the redis cluster
(3) Because sentinel + redis master-slave is a complex deployment architecture, carry out thorough testing and drills in both the test environment and the production environment

3 Sentinel cluster

1 A sentinel cluster must deploy more than 2 nodes

Suppose the sentinel cluster deploys only 2 sentinel instances, with quorum = 1:

+----+         +----+
| M1 |---------| R1 |
| S1 |         | S2 |
+----+         +----+

Configuration: quorum = 1

If the master goes down, the switch can proceed as soon as either S1 or S2 considers the master down, and one of S1 and S2 is then elected to perform the failover.

A majority is also required at this point, meaning that most of the sentinels must be running. The majority of 2 sentinels is 2 (majority of 2 = 2, of 3 = 2, of 4 = 3, of 5 = 3). If both sentinels are running, failover is allowed.

But if the whole machine hosting M1 and S1 goes down, only one sentinel remains. There is then no majority to authorize a failover, so even though R1 is still available on the other machine, no failover is performed.
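The majority arithmetic above can be checked with a short sketch. The function names `majority` and `can_failover` are hypothetical, and the rule "majority = more than half of all sentinels" is an assumption about how the threshold is computed, not something taken from the Redis source.

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of sentinels that is more than half of the total."""
    return total_sentinels // 2 + 1


def can_failover(total_sentinels: int, running_sentinels: int, quorum: int) -> bool:
    """A failover needs `quorum` sentinels to agree the master is down
    and a majority of all sentinels to authorize the switch."""
    enough_to_agree = running_sentinels >= quorum
    enough_to_authorize = running_sentinels >= majority(total_sentinels)
    return enough_to_agree and enough_to_authorize


# Two sentinels, quorum = 1: losing the M1/S1 machine leaves one sentinel.
print(can_failover(total_sentinels=2, running_sentinels=1, quorum=1))  # False
# Only the master process died, both sentinels still run: failover allowed.
print(can_failover(total_sentinels=2, running_sentinels=2, quorum=1))  # True
```

With two sentinels the majority equals the total, so losing either machine blocks failover regardless of the quorum setting.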

2 The classic sentinel cluster

       +----+
       | M1 |
       | S1 |
       +----+
          |
+----+    |    +----+
| R2 |----+----| R3 |
| S2 |         | S3 |
+----+         +----+

Configuration: quorum = 2, majority = 2

If the machine hosting M1 goes down, two of the three sentinels remain. S2 and S3 can agree that the master is down and then elect one of themselves to perform the failover.

The majority of three sentinels is 2, so with the two remaining sentinels running, failover is allowed.

5 Mechanism of Sentinel

5.1 sdown and odown conversion mechanism

There are two failure states, sdown and odown.

sdown is subjective downtime: a single sentinel believes that a master is down.

odown is objective downtime: the quorum number of sentinels believe that a master is down.

The condition for sdown is simple: if a sentinel pings a master and gets no valid reply within the number of milliseconds specified by down-after-milliseconds, it considers the master subjectively down.

The condition for moving from sdown to odown is also simple: if, within a specified time, a sentinel learns that the quorum number of sentinels also consider the master sdown, the master is considered objectively down.
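The two states can be sketched as a small state holder kept by one sentinel for one master. This is an illustration only: the class `MasterView` and its fields are hypothetical, timestamps are passed in explicitly as milliseconds, and counting the local sentinel as one vote toward the quorum is an assumption.

```python
class MasterView:
    """One sentinel's view of a single master (hypothetical helper)."""

    def __init__(self, down_after_ms: int, quorum: int):
        self.down_after_ms = down_after_ms
        self.quorum = quorum
        self.last_reply_ms = 0            # time of the last valid ping reply
        self.peers_reporting_down = set() # run ids of peers that report sdown

    def is_sdown(self, now_ms: int) -> bool:
        # Subjectively down: no valid reply for longer than the threshold.
        return now_ms - self.last_reply_ms > self.down_after_ms

    def is_odown(self, now_ms: int) -> bool:
        # Objectively down: this sentinel sees sdown and enough peers agree.
        if not self.is_sdown(now_ms):
            return False
        votes = 1 + len(self.peers_reporting_down)  # assumption: self counts
        return votes >= self.quorum


view = MasterView(down_after_ms=30_000, quorum=2)
print(view.is_sdown(30_001))   # True: silent for more than 30 s
print(view.is_odown(30_001))   # False: only this sentinel thinks so
view.peers_reporting_down.add("peer-sentinel-1")
print(view.is_odown(30_001))   # True: quorum of 2 reached
```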

5.2 Automatic discovery mechanism of sentinel cluster

Sentinels discover each other through redis' pub/sub system. Each sentinel publishes messages to the __sentinel__:hello channel, and all other sentinels can consume these messages and so become aware of each other's presence.

Every two seconds, each sentinel publishes a message to the __sentinel__:hello channel of every master+slaves group it monitors. The message contains its own host, IP, and run ID, together with its monitoring configuration for that master.

Each sentinel also subscribes to the __sentinel__:hello channel of every master+slaves group it monitors, and in this way senses the other sentinels that are monitoring the same master+slaves.

The sentinels exchange their monitoring configuration for the master with one another and keep it synchronized.
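As a rough sketch of how a sentinel might turn a hello message into knowledge about a peer, the code below parses one payload and records the sender. The eight comma-separated fields and their order are an assumption about the wire format and should be checked against the Redis version in use; `Hello`, `parse_hello`, and `discover` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Hello:
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> Hello:
    """Split an assumed 8-field, comma-separated hello payload."""
    fields = payload.split(",")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    ip, port, runid, epoch, name, m_ip, m_port, m_epoch = fields
    return Hello(ip, int(port), runid, int(epoch),
                 name, m_ip, int(m_port), int(m_epoch))


def discover(known_peers: dict, payload: str, my_runid: str) -> None:
    """Record the sender as a peer sentinel unless the message is our own."""
    hello = parse_hello(payload)
    if hello.sentinel_runid != my_runid:
        known_peers[hello.sentinel_runid] = (hello.sentinel_ip, hello.sentinel_port)


peers = {}
discover(peers, "10.0.0.2,26379,runid-s2,7,mymaster,10.0.0.1,6379,7", "runid-s1")
discover(peers, "10.0.0.1,26379,runid-s1,7,mymaster,10.0.0.1,6379,7", "runid-s1")
print(peers)  # only the other sentinel is recorded
```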

5.3 Automatic correction of slave configuration

The sentinel is responsible for automatically correcting some of the slave's configuration. For example, if a slave is a potential master candidate, the sentinel ensures that it is replicating the data of the current master; if a slave is connected to the wrong master, for instance after a failover, the sentinel ensures that it is reconnected to the correct master.

5.4 slave->master election algorithm

If a master is considered odown and a majority of sentinels authorize the master/slave switch, one sentinel performs the switch. It must first elect a slave to promote.

The election considers the following information about each slave:

(1) Time to disconnect from the master
(2) Slave priority
(3) Copy offset
(4) run id

If a slave has been disconnected from the master for more than 10 times down-after-milliseconds plus the length of time the master has been down, it is considered unsuitable for election as master:

(down-after-milliseconds * 10) + milliseconds_since_master_is_in_SDOWN_state

Next, the remaining slaves are sorted:

(1) Sort by slave priority. The lower the slave priority value, the higher the slave ranks.
(2) If the slave priority is the same, compare the replication offset. The slave that has replicated more data, and therefore has the larger offset, ranks higher.
(3) If both of the above are the same, choose the slave with the smaller run id.
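The filter and the three sort keys can be put together in a few lines. This is a minimal sketch of the rules as described here, not the Redis implementation: `Replica` and `select_replica` are hypothetical names, and a larger replication offset is taken to mean more data replicated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    run_id: str
    priority: int         # slave priority; lower value ranks first
    repl_offset: int      # replication offset; larger means more data copied
    disconnected_ms: int  # how long the link to the master has been down


def select_replica(replicas: List[Replica], down_after_ms: int,
                   master_sdown_ms: int) -> Optional[Replica]:
    # Step 1: drop slaves that have been disconnected for too long.
    max_disconnect = down_after_ms * 10 + master_sdown_ms
    candidates = [r for r in replicas if r.disconnected_ms <= max_disconnect]
    if not candidates:
        return None
    # Step 2: priority ascending, offset descending, run id ascending.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("aaa", priority=100, repl_offset=500, disconnected_ms=0),
    Replica("bbb", priority=100, repl_offset=900, disconnected_ms=0),
    Replica("ccc", priority=10, repl_offset=100, disconnected_ms=10_000_000),
]
chosen = select_replica(replicas, down_after_ms=30_000, master_sdown_ms=5_000)
print(chosen.run_id)  # bbb: "ccc" is filtered out, "bbb" has the larger offset
```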

5.5 quorum and majority

Every time a master/slave switch is needed, the quorum number of sentinels must first consider the master odown. One sentinel is then elected to do the switching, and it must be authorized by a majority of sentinels before it can actually execute the switch.

If quorum < majority, for example with 5 sentinels where the majority is 3 and the quorum is set to 2, authorization from 3 sentinels is enough to perform the switch.

But if quorum >= majority, the quorum number of sentinels must all authorize the switch. For example, with 5 sentinels and a quorum of 5, all 5 sentinels must agree before the switch can be executed.
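Both cases reduce to taking the larger of the two thresholds. A minimal sketch, with the hypothetical helper `required_authorizations` and the majority assumed to be more than half of all sentinels:

```python
def required_authorizations(total_sentinels: int, quorum: int) -> int:
    """Number of sentinels that must authorize a switch."""
    majority = total_sentinels // 2 + 1
    return max(quorum, majority)


print(required_authorizations(total_sentinels=5, quorum=2))  # 3
print(required_authorizations(total_sentinels=5, quorum=5))  # 5
```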

5.6 configuration epoch

A sentinel monitors a set of redis master+slaves and holds the corresponding monitoring configuration.

The sentinel that performs the switch gets a configuration epoch from the new master (slave->master) it is switching to. This is a version number, and it must be unique for each switch.

If the switch by the first elected sentinel fails, the other sentinels wait for the failover-timeout period and then take over the switch, obtaining a new configuration epoch as the new version number.

5.7 configuration propagation

After a sentinel completes the switch, it generates the latest master configuration locally and then synchronizes it to the other sentinels through the pub/sub mechanism described earlier.

The version number mentioned earlier matters here, because all messages are published and received through a single channel. When a sentinel completes a new switch, the new master configuration carries the new version number.

The other sentinels update their own master configuration according to which version number is larger.
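The update rule can be sketched as "accept an announced configuration only if its version number is larger than the one already held". The function `apply_announcement` and the dictionary layout are hypothetical and only illustrate the comparison.

```python
def apply_announcement(configs: dict, master_name: str,
                       address: tuple, config_epoch: int) -> bool:
    """Adopt the announced master address if its epoch is newer.

    `configs` maps master name -> (address, config_epoch).
    Returns True when the local configuration was updated.
    """
    current = configs.get(master_name)
    if current is None or config_epoch > current[1]:
        configs[master_name] = (address, config_epoch)
        return True
    return False


local = {"mymaster": (("10.0.0.1", 6379), 3)}
# A sentinel that finished a failover announces the new master with epoch 4.
print(apply_announcement(local, "mymaster", ("10.0.0.2", 6379), 4))  # True
# A delayed message carrying the old epoch is ignored.
print(apply_announcement(local, "mymaster", ("10.0.0.1", 6379), 3))  # False
print(local["mymaster"])  # (('10.0.0.2', 6379), 4)
```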

Origin blog.csdn.net/qq_15769939/article/details/113845728