How Redis Sentinel works: I've put up with you long enough!

When people discuss what Redis master-slave replication is for, one line comes up again and again: "Master-slave replication is the cornerstone of high availability." So what is high availability? It means keeping the time during which the system cannot serve requests as short as possible, which is why you so often hear targets such as "six nines" (99.9999% uptime). Sentinel and Cluster are the two pieces you cannot do without to achieve it.
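To make the "nines" concrete, here is a minimal Python sketch that converts an availability target into the downtime it allows per year. The function name `allowed_downtime_seconds` and the 365-day year are illustrative assumptions, not anything Redis defines.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget for an availability of N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        budget = allowed_downtime_seconds(n)
        print(f"{n} nines: {budget:.2f} seconds of downtime per year")
```

Under these assumptions, six nines leaves a budget of roughly half a minute per year, which is why manual recovery is not an option and automatic failover is needed.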


This article covers the sentinel mechanism from the following angles:

  • What is a sentinel

  • The role of the sentinel

  • How to configure Sentinel

  • How Sentinel works

  • Summary


Environment used in this article:

  • CentOS 7.3

  • Redis 4.0

  • Redis working directory: /usr/local/redis

  • Everything is run inside a virtual machine


What is a sentinel


Let me start with a short story. Once master-slave replication is configured, there is one situation we have to face: the master node goes down. Who provides the service then?

With the master node down, master-slave replication loses its purpose. In an era where data is king, we cannot live without highly available data.

At that moment a big brother named Sentinel shows up and says: "I will take care of this problem for you."

"Since your boss, the master node, can't lead you anymore, I will pick a new boss from the four of you, and you will follow him."

"When the old boss comes back, his title will have expired. He is no longer your boss and can only follow the boss I picked."

That little dialogue is exactly what configuring a sentinel is about: whoever is the current boss is the one who hands out the data. Now that we know what a sentinel is for, let's continue.


Finally, in more formal terms:

Sentinel is a distributed system that monitors every server in a master-slave setup. When the master node fails, the sentinels choose a new master node through a voting mechanism and connect all slave nodes to the new master node.


The role of the sentinel


The dialogue above describes one of the sentinel's jobs: automatic failover.

Talking about its role means talking about what the sentinel actually does while it is running. Let's first state it as fairly dry concepts; the working principles are discussed one by one further down.

The three functions of the sentinel:

  • Monitoring: Who is monitored? A master-slave setup is made of a master node and slave nodes, so those are what the sentinel watches. It keeps checking whether the master node is alive and whether the master node and the slave nodes are running normally.

  • Notification: When a sentinel detects a problem with a server, it notifies the other sentinels. Think of the sentinels as members of a WeChat group: whatever problem one sentinel finds, it posts in the group.

  • Automatic failover: When the master node is detected to be down, all slave nodes connected to it are disconnected, one of them is chosen as the new master node, and the remaining slave nodes are connected to the new master node. Clients are then told the new server address.

One thing to note: a sentinel is itself a Redis server, but it does not serve data to clients. When you deploy sentinels, deploy an odd number of them.

Why an odd number? Keep that question in mind; the answer comes in the section on how Sentinel works.

How to configure Sentinel


Preparation


To configure the sentinels we open eight terminal sessions: three sentinels, one master node, two slave nodes, one client for the master node, and one client for a slave node.

sentinel.conf configuration explained


The configuration file used by the sentinel is sentinel.conf. Most of the file consists of comments, so here is a command that filters them out, along with blank lines:

cat sentinel.conf | grep -v '#' | grep -v '^$' 
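For readers who prefer to do the same filtering in a script, here is a rough Python equivalent of the pipeline above. Like `grep -v '#'`, it drops every line that contains a `#` anywhere, and like `grep -v '^$'` it drops empty lines. The function name and the sample text are made up for illustration.

```python
def strip_comments_and_blanks(text: str) -> str:
    """Mimic `grep -v '#' | grep -v '^$'`: drop lines containing '#' and empty lines."""
    kept = [
        line
        for line in text.splitlines()
        if "#" not in line and line != ""
    ]
    # Keep a trailing newline only when something survived the filter.
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    # Hypothetical excerpt, not the real sentinel.conf shipped with Redis.
    sample = "# Example sentinel.conf\n\nport 26379\n# working directory\ndir /tmp\n"
    print(strip_comments_and_blanks(sample), end="")
```

Note that a setting with a trailing `# comment` on the same line would be dropped as well, exactly as it would be by the shell pipeline.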

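The filtered output contains six settings:

①port 26379: the port the sentinel listens on.

②dir /tmp: the working directory where the sentinel stores its information.

③sentinel monitor mymaster 127.0.0.1 6379 2: which master node to monitor. The name (mymaster) can be anything you like. The trailing 2 means that the master node is considered down once two sentinels judge it to be down; it is usually set to half the number of sentinels plus one.

④sentinel down-after-milliseconds mymaster 30000: how long a node must go without a valid response before the sentinel considers it down. The value is in milliseconds, so 30000 means 30 seconds.

⑤sentinel parallel-syncs mymaster 1: the maximum number of slave nodes that may synchronize with the new master node at the same time during a failover.

The smaller this value, the longer the failover takes to complete; the larger this value, the more slave nodes are unavailable at once because they are busy synchronizing.

⑥sentinel failover-timeout mymaster 180000: the time limit, in milliseconds, within which the failover and its synchronization must complete to count as successful. The default is 180000, that is, 3 minutes.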
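As a way to check your reading of these six settings, the following sketch parses a filtered configuration into a plain dictionary and reports the values in friendlier units. The parser, its dictionary keys and the `majority` helper are assumptions made for illustration; Sentinel itself does not expose anything like this.

```python
# Options from the list above whose value is a plain number.
NUMERIC_OPTIONS = {"down-after-milliseconds", "parallel-syncs", "failover-timeout"}


def majority(sentinel_count: int) -> int:
    """Half the sentinels plus one, the usual choice for the quorum value."""
    return sentinel_count // 2 + 1


def parse_sentinel_conf(text: str) -> dict:
    """Parse only the handful of directives discussed above."""
    conf = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        if parts[0] == "port":
            conf["port"] = int(parts[1])
        elif parts[0] == "dir":
            conf["dir"] = parts[1]
        elif parts[0] == "sentinel" and parts[1] == "monitor" and len(parts) == 6:
            # sentinel monitor <name> <ip> <port> <quorum>
            conf["master_name"] = parts[2]
            conf["master_addr"] = (parts[3], int(parts[4]))
            conf["quorum"] = int(parts[5])
        elif parts[0] == "sentinel" and parts[1] in NUMERIC_OPTIONS and len(parts) == 4:
            # sentinel <option> <name> <value>
            conf[parts[1]] = int(parts[3])
    return conf


SAMPLE = """\
port 26379
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
"""

if __name__ == "__main__":
    conf = parse_sentinel_conf(SAMPLE)
    print("quorum:", conf["quorum"], "| majority of 3 sentinels:", majority(3))
    print("down after:", conf["down-after-milliseconds"] / 1000, "seconds")
    print("failover timeout:", conf["failover-timeout"] / 60000, "minutes")
```

With three sentinels, half plus one gives 2, which matches the trailing 2 in the sentinel monitor line.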


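Start configuring


Use the command cat sentinel.conf | grep -v '#' | grep -v '^$' > ./data/sentinel-26379.conf to save the filtered settings as sentinel-26379.conf. The command shown here writes into ./data; adjust the target path if you keep your configuration files under /usr/local/redis/conf.

Then open sentinel-26379.conf and change the directory in which the sentinel stores its information (the dir setting).

Next, quickly create two more sentinel configuration files, for ports 26380 and 26381:

sed 's/26379/26380/g' sentinel-26379.conf > sentinel-26380.conf
sed 's/26379/26381/g' sentinel-26379.conf > sentinel-26381.conf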
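The same copy-and-substitute step can be scripted. The sketch below mirrors the sed substitution by replacing every occurrence of the base port in the configuration text. The helper name and the sample text are hypothetical, and the code works on strings only, so it does not touch any real files.

```python
def derive_sentinel_conf(base_text: str, base_port: int, new_port: int) -> str:
    """Return a copy of the config with every occurrence of base_port replaced."""
    return base_text.replace(str(base_port), str(new_port))


if __name__ == "__main__":
    # Hypothetical content of sentinel-26379.conf after filtering.
    base = "port 26379\ndir /tmp\nsentinel monitor mymaster 127.0.0.1 6379 2\n"
    for port in (26380, 26381):
        print(f"--- sentinel-{port}.conf ---")
        print(derive_sentinel_conf(base, 26379, port), end="")
```

Because the substitution looks for the full string 26379, the master port 6379 in the sentinel monitor line is left alone, just as with sed.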

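Before going further, check that master-slave replication is working. Start three Redis servers on ports 6379, 6380 and 6381.

Looking at the master node's replication info, two slave nodes are connected, on ports 6380 and 6381.

One small detail: why is lag 1 for one slave node and 0 for the other? lag is the delay, in seconds, since the master node last heard from that slave node. I am testing locally, which is why 0 shows up; on cloud servers it is seen much less often.

A lag of 0 or 1 is normal in either case.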
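If you want to check lag from a script rather than by eye, the per-slave lines in the master's replication info can be split into fields. This is a minimal sketch that assumes the `slaveN:ip=...,port=...,state=...,offset=...,lag=...` layout; the function name and the offset values in the sample lines are invented for illustration.

```python
def parse_slave_line(line: str) -> dict:
    """Split 'slave0:ip=...,port=...,state=...,offset=...,lag=...' into typed fields."""
    name, _, rest = line.strip().partition(":")
    fields = dict(item.split("=", 1) for item in rest.split(","))
    return {
        "name": name,
        "ip": fields["ip"],
        "port": int(fields["port"]),
        "state": fields["state"],
        "offset": int(fields["offset"]),
        "lag": int(fields["lag"]),  # seconds
    }


if __name__ == "__main__":
    # Sample lines with made-up offsets.
    lines = [
        "slave0:ip=127.0.0.1,port=6380,state=online,offset=1234,lag=1",
        "slave1:ip=127.0.0.1,port=6381,state=online,offset=1234,lag=0",
    ]
    for line in lines:
        slave = parse_slave_line(line)
        healthy = slave["state"] == "online" and slave["lag"] <= 1
        print(slave["name"], slave["port"], "lag", slave["lag"], "healthy" if healthy else "check")
```

Treating a lag of 0 or 1 as healthy follows the observation above; a stricter or looser threshold is a judgment call for your own environment.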

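On the master node, add a hash value with hset kaka name kaka.

Then read kaka from slave1 and slave2 to confirm that replication works. In my test the master-slave setup behaved as expected.

Start the first sentinel with redis-sentinel sentinel-26379.conf.

Connect to the sentinel on port 26379 and look at the information it reports. The last line is the important one: the monitored master node is named mymaster, its status is ok, it has two slave nodes, and there is one sentinel.

Now look at the configuration file of sentinel 26379 again. It has already been modified at this point.

Start the second sentinel with redis-sentinel sentinel-26380.conf. Note the extra line at the end of its output: the ID in it is the one that was just added to the configuration file of sentinel 26379.

Switch to the terminal of sentinel 26379, and there is likewise a new line carrying the ID of sentinel 26380.

Check the configuration file of sentinel 26379 once more. The first time we looked, it contained nothing about sentinel 26380; now that sentinel 26380 is running, its information has been added.

Finally, start the third sentinel on port 26381. Once it is up, the configuration files and the sentinel output change in the same way: whatever was added for sentinel 26380 is added for sentinel 26381 as well.

That completes the sentinel configuration. Next, we take the master node down.

After waiting 30 seconds, go back to the terminal of sentinel 26379. Several new lines have appeared. What do they all mean? Let's go through them.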

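Here are the entries we need to understand first:

①+sdown: one of the three sentinels considers the master node to be down.

②+odown: the other two sentinels tried to reach the master node as well and confirmed that it really is down, after which a round of voting starts. This walkthrough uses Redis 4.0; the exact messages differ slightly between versions.

③+switch-master mymaster 127.0.0.1 6379 127.0.0.1 6380: the result of the sentinels' vote. The Redis instance on port 6380 has been chosen as the new master node.

④+slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6380: the instances on ports 6381 and 6379 are attached to the new master node 6380 as slave nodes.

⑤+sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6380: the instance on port 6379 is still not back online, so it is marked as down.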
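The event lines quoted above have a regular shape, so a small parser can pull out the addresses involved. The sketch below is only an illustration: it assumes you pass it the event part of the line exactly as quoted here (without any process ID or timestamp prefix), and the function name and dictionary keys are my own.

```python
def parse_event(line: str) -> dict:
    """Parse one sentinel event such as '+switch-master ...' or '+sdown slave ...'."""
    parts = line.split()
    event = parts[0]
    if event == "+switch-master":
        # +switch-master <master name> <old ip> <old port> <new ip> <new port>
        return {
            "event": event,
            "master": parts[1],
            "old_addr": (parts[2], int(parts[3])),
            "new_addr": (parts[4], int(parts[5])),
        }
    # Other events: <event> <type> <name> <ip> <port> [@ <master name> <ip> <port>]
    info = {
        "event": event,
        "type": parts[1],
        "name": parts[2],
        "addr": (parts[3], int(parts[4])),
    }
    if "@" in parts:
        at = parts.index("@")
        info["master"] = parts[at + 1]
        info["master_addr"] = (parts[at + 2], int(parts[at + 3]))
    return info


if __name__ == "__main__":
    for line in (
        "+switch-master mymaster 127.0.0.1 6379 127.0.0.1 6380",
        "+slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6380",
        "+sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6380",
    ):
        print(parse_event(line))
```

A monitoring script could watch for +switch-master and read new_addr to learn where the master node has moved.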


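When the Redis server on 6379 is brought back online, the sentinel prints two more lines. The first clears the down state of 6379; the second reconnects 6379 to the new master node.

At this point the master node is 6380. Set a value from the Redis client on 6380 to check that replication still works.

On the new master node 6380, add a value of type list.

Then read that value on 6379 and 6381. With that, the sentinel setup is complete.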

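How Sentinel works


With the sentinels configured, it is time to look at how they work. Only when you know the workflow can you really understand Sentinel.

The explanation here is not meant to be dry. Ideally you can read this technical article like a story.

Down to business. The sentinel's roles are monitoring, notification and failover, so the working principle is explained around those three points as well.

Monitoring workflow

The monitoring workflow goes like this:

The first sentinel sends the info command to the master node and stores the state of all sentinels, the master node and the slave nodes.

The master node also keeps a record of the Redis instances. What the master node records looks the same as what the sentinel records, but there are in fact some differences.

Using the slave node information it obtained from the master node, the sentinel sends the info command to each of those slave nodes too.

Then sentinel 2 arrives. It likewise sends the info command to the master node and establishes a cmd connection.

Sentinel 2 now stores the same information as sentinel 1, except that its list of sentinels has two entries.

To keep every sentinel's information consistent, the sentinels set up a publish/subscribe channel between themselves. To keep that information in sync over time, they also send ping commands to one another.

When a sentinel 3 joins, it does the same thing: it sends info to the master node and the slave nodes, and establishes connections with sentinel 1 and sentinel 2.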
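The discovery part of this workflow can be modelled in a few lines. The following is a toy, in-memory sketch and not how Redis is implemented: `ToyMaster` and `ToySentinel` are invented classes, the info reply is reduced to a list of slave addresses, and the hello channel is reduced to a list of subscribers.

```python
class ToyMaster:
    """Stands in for the master node: knows its slaves and relays hello messages."""

    def __init__(self, slaves):
        self.slaves = list(slaves)
        self.subscribers = []  # sentinels listening on the hello channel

    def info(self):
        # A real info reply has many more fields; only the slave list matters here.
        return list(self.slaves)

    def publish_hello(self, sender_name):
        # Deliver the announcement to every subscribed sentinel.
        for sentinel in self.subscribers:
            sentinel.known_sentinels.add(sender_name)


class ToySentinel:
    def __init__(self, name):
        self.name = name
        self.known_slaves = set()
        self.known_sentinels = {name}

    def start_monitoring(self, master):
        self.known_slaves.update(master.info())  # learn the slaves from the master
        master.subscribers.append(self)          # subscribe to the hello channel

    def announce(self, master):
        master.publish_hello(self.name)


if __name__ == "__main__":
    master = ToyMaster(["127.0.0.1:6380", "127.0.0.1:6381"])
    sentinels = [ToySentinel(f"sentinel{i}") for i in (1, 2, 3)]
    for s in sentinels:
        s.start_monitoring(master)
    for s in sentinels:  # in this toy model each sentinel announces itself once
        s.announce(master)
    for s in sentinels:
        print(s.name, sorted(s.known_sentinels), sorted(s.known_slaves))
```

After every sentinel has announced itself, all three hold the same view: three sentinels and two slave nodes, which is the "same information" the text describes.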


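Notification workflow


Each sentinel sends commands to every node in the master-slave setup to obtain its state, and publishes what it learns on the channel the sentinels subscribe to.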

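Failover principle


A sentinel keeps sending publish __sentinel__:hello to the master node until it reports sdown. Does that word look familiar? It is the message the sentinel printed earlier when we took the master node down.

Reporting sdown for the master node is not the end of it. The sentinel also announces to the other sentinels on its network that this master node is down. The command it sends is sentinel is-master-down-by-addr.

The other sentinels receive the command and think: "The master node is down? Let me go and see for myself." What they send to check is hello as well.

The other sentinels then share what they found by sending sentinel is-master-down-by-addr themselves, in effect telling the sentinel that raised the alarm: "You are right, this one really is down."

Once enough sentinels agree that the master node is down, its state is changed to odown. When a single sentinel considers the master node down, it is marked sdown; when enough of them agree to reach the configured quorum (usually more than half of the sentinels), it is marked odown. This is also why an odd number of sentinels is configured: a vote among an odd number of sentinels cannot end in a tie.

A single sentinel considering the master node down is called subjectively down; the quorum of sentinels considering it down is called objectively down.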
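The difference between the two states comes down to counting opinions against the quorum. Here is a minimal sketch of that decision from the point of view of one sentinel; the function name, the dictionary of reports and the sentinel names are illustrative assumptions.

```python
def master_state(down_reports: dict, me: str, quorum: int) -> str:
    """Classify the master as seen by sentinel `me`.

    down_reports maps a sentinel name to True if that sentinel
    currently believes the master node is down.
    """
    if not down_reports.get(me, False):
        return "ok"  # this sentinel can still reach the master node
    agreeing = sum(1 for is_down in down_reports.values() if is_down)
    # One opinion is subjective; reaching the quorum makes it objective.
    return "odown" if agreeing >= quorum else "sdown"


if __name__ == "__main__":
    quorum = 2  # the trailing 2 from the sentinel monitor line
    print(master_state({"s1": True, "s2": False, "s3": False}, "s1", quorum))
    print(master_state({"s1": True, "s2": True, "s3": False}, "s1", quorum))
```

With three sentinels and a quorum of 2, a single report stays at sdown, and the second agreeing report is what moves the master node to odown.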


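Once the master node is considered objectively down, the sentinels move on to the next step.

They have detected the problem, but which sentinel is responsible for choosing the new master node? It cannot be that Zhang San goes, Li Si goes and Wang Wu goes too; that would be chaos. So a leader has to be elected from among all the sentinels. How is it chosen?

In this example, five sentinels hold a meeting. All of them are on the same network, and each of them sends the command sentinel is-master-down-by-addr, this time carrying its own election round number and runid.

Every sentinel is both a candidate and a voter, and every sentinel has exactly one vote. Picture each vote as a sealed envelope.

Suppose sentinel1 and sentinel4 both post their command to the group at the same time to run for leader. Sentinel2 then says: "Whoever's command reaches me first gets my vote."

If sentinel1's request arrives first, sentinel2 votes for sentinel1.

Voting continues by this rule until one sentinel has collected votes from more than half of all sentinels.

Say sentinel1 is the one whose votes exceed half of the total number of sentinels: sentinel1 is elected, and the process moves on to the next stage.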
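The first-come-first-served voting rule can be sketched as a small simulation. This is a simplified model under stated assumptions: a single election round, vote requests given as an ordered list of (candidate, voter) pairs, and a candidate voting for itself by appearing as its own voter. The function name and sentinel names are made up.

```python
def elect_leader(sentinels, request_order):
    """Return the elected leader, or None if nobody reaches a majority.

    request_order lists (candidate, voter) pairs in the order the vote
    requests arrive. Each voter grants its single vote to the first
    candidate that asks.
    """
    votes = {}  # voter -> candidate it voted for
    for candidate, voter in request_order:
        votes.setdefault(voter, candidate)  # later requests are ignored

    tally = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1

    needed = len(sentinels) // 2 + 1  # more than half of all sentinels
    for candidate, count in tally.items():
        if count >= needed:
            return candidate
    return None  # split vote: another round would be needed


if __name__ == "__main__":
    sentinels = ["s1", "s2", "s3", "s4", "s5"]
    arrivals = [
        ("s1", "s1"), ("s4", "s4"),  # both candidates vote for themselves
        ("s1", "s2"), ("s4", "s2"),  # s1 reached s2 first
        ("s1", "s3"), ("s4", "s5"),
        ("s4", "s3"), ("s1", "s5"),  # too late, those votes are already cast
    ]
    print(elect_leader(sentinels, arrivals))
```

In this run sentinel1 collects three of the five votes (its own, sentinel2's and sentinel3's), which is more than half, so it becomes the leader.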

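The sentinels have now chosen sentinel1 as their representative, and its job is to pick one of the slave nodes to become the master node. This is not a random pick; it follows a set of rules.

First, slave nodes that are offline are eliminated.

Next, slow responders are eliminated. The sentinel sends a message to every Redis instance, and those that respond slowly are dropped.

Then the slave nodes that have been disconnected from the original master node the longest are eliminated. (The example ran out of slave nodes at this point, so a slave5 was added purely for demonstration; it carries no other meaning.)

After these three checks, slave4 and slave5 remain, and they are ranked by the following rules:

  • Priority comes first. If the priorities are equal, the next check applies.

  • Next is the replication offset, which shows how up to date the data is. Say slave4 has an offset of 90 and slave5 has an offset of 100.

    The sentinel then suspects that slave4 has a network problem and chooses slave5 as the new master node. And what if slave4 and slave5 have the same offset? Then there is one last check.

  • The last check is the runid. You can think of it as seniority in the workplace, although in practice Redis simply picks the slave node with the lexicographically smaller runid, so that the choice is deterministic.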
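The elimination and ranking rules above translate naturally into a filter followed by a sort key. The sketch below is an illustration rather than Redis's actual code: the `Slave` class and its boolean fields are invented, and it assumes that a lower priority number is preferred and that a priority of 0 means "never promote", as with Redis's slave-priority setting.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    name: str
    online: bool          # still reachable
    responsive: bool      # answered the sentinel quickly enough
    recently_linked: bool  # not disconnected from the old master for too long
    priority: int
    offset: int
    runid: str


def pick_new_master(slaves):
    """Apply the three eliminations, then rank by priority, offset and runid."""
    candidates = [
        s for s in slaves
        if s.online and s.responsive and s.recently_linked and s.priority != 0
    ]
    if not candidates:
        return None
    # Lower priority number first, then larger offset, then smaller runid.
    return min(candidates, key=lambda s: (s.priority, -s.offset, s.runid))


if __name__ == "__main__":
    slaves = [
        Slave("slave1", online=False, responsive=True, recently_linked=True, priority=100, offset=100, runid="aa"),
        Slave("slave2", online=True, responsive=False, recently_linked=True, priority=100, offset=100, runid="bb"),
        Slave("slave3", online=True, responsive=True, recently_linked=False, priority=100, offset=100, runid="cc"),
        Slave("slave4", online=True, responsive=True, recently_linked=True, priority=100, offset=90, runid="dd"),
        Slave("slave5", online=True, responsive=True, recently_linked=True, priority=100, offset=100, runid="ee"),
    ]
    print(pick_new_master(slaves).name)
```

With the offsets from the example (90 for slave4, 100 for slave5) and equal priorities, slave5 is selected; if the offsets were equal too, the smaller runid would decide.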

After the new master node has been selected, the sentinel still has to send instructions to all nodes so that they switch over to it.

Summary


That covers everything I wanted to say about Sentinel. The most important part of this article is how Sentinel works.

Here is a brief recap of its working principle:

  • Monitoring first: all sentinels synchronize their information.

  • Notification: each sentinel publishes what it learns to the channel the sentinels subscribe to.

  • Failover: the sentinels find that the master node is offline → the sentinels vote for a leader → the leader chooses a new master node → the new master node detaches from the original master node, and the other slave nodes connect to the new master node. When the original master node comes back online, it joins as a slave node.


The above is my own understanding of Sentinel. If you spot a mistake, please point it out so that I can correct it promptly.



Origin blog.51cto.com/14410880/2545875