Redis High Availability Architecture Best Practices

http://www.sohu.com/a/150426358_505802

1. Introduction

On May 13, 2017, the Guangzhou stop of the Application Performance Management Lecture Hall came to a successful close. At the event, Wen Guobing, a DBA from Sanqi Interactive Entertainment, shared his Redis experience with the audience.

Redis is an open-source key-value database written in ANSI C. It is accessed over the network, can run purely in memory or persist data to disk (including an append-only log), and provides client APIs for many programming languages.

Today, Internet businesses generate data at an ever-faster rate and in ever more varied forms, which places higher demands on the speed and capability of data processing. Redis, an open-source in-memory non-relational database, has brought a disruptive experience to developers: it was designed for high performance from the start, making it one of the fastest NoSQL databases available today.

While pursuing high performance, high availability is just as important a consideration. Internet services run 24x7; when a failure does happen, failing over as quickly as possible keeps the loss to the business to a minimum.

So, which high-availability architectures are used in practice? What are their pros and cons? How should we choose among them, and what are the best practices?

2. Sentinel Principles

Before going into the Redis high-availability solutions, let's first look at how Redis Sentinel works (https://redis.io/topics/sentinel).

  1. When it starts, the Sentinel cluster discovers the master from the given configuration file and begins monitoring it. By sending the INFO command to the master, it obtains all of the slaves replicating from that master.

  2. Every two seconds, each Sentinel publishes a hello message to the monitored masters and slaves over the command connection (on the __sentinel__:hello channel). The message contains the Sentinel's own IP, port, run ID, and so on, announcing its presence to the other Sentinels.

  3. Through its subscription connections, each Sentinel receives the hello messages sent by the other Sentinels and thus discovers the other Sentinels monitoring the same master. The Sentinels then create command connections to one another for communication; since the masters and slaves already act as intermediaries for exchanging hello messages, no subscription connections are created between Sentinels.

  4. Sentinels use the PING command to probe the state of each instance. If no valid reply arrives within the configured time (down-after-milliseconds), or an invalid reply is returned, the instance is judged to be offline (subjectively down).

  5. When a failover (master/standby switch) is triggered, it does not proceed immediately: the Sentinel that will perform the failover needs the agreement of enough other Sentinels. It first gathers agreement from at least the configured quorum of Sentinels, at which point the master enters the ODOWN (objectively down) state. For example, with five Sentinels and a quorum of 2, the failover is carried out once two Sentinels consider the master to be dead.

  6. Sentinel sends the SLAVEOF NO ONE command to the slave chosen to become the new master. The slave is chosen as follows: Sentinel first sorts the slaves by priority, with a smaller priority value ranking higher; if priorities are equal, it compares replication offsets and prefers the slave that has received more replicated data from the master; if both priority and offset are equal, the slave with the smaller run ID is selected.

  7. Once authorized, the Sentinel obtains a new configuration version number (config-epoch) for the failed master. When the failover finishes, this version number is attached to the latest configuration and broadcast to the other Sentinels, which then update their configuration for that master.

Steps 1 to 3 form the automatic discovery mechanism:

  • Every 10 seconds, each Sentinel sends the INFO command to the monitored masters and derives their current state from the replies.

  • Every second, each Sentinel sends a PING command to all Redis servers, including the other Sentinels, and judges from the reply whether each is reachable.

  • Every 2 seconds, each Sentinel sends a message containing its own information and the current master configuration to all monitored masters and slaves.

Step 4 is the detection mechanism, steps 5 and 6 are the failover mechanism, and step 7 is the configuration-update mechanism. [1]
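
To see these mechanisms from the outside, you can simply query a Sentinel. Below is a minimal sketch using the Python redis-py client; the SENTINEL subcommands are standard, while the host name, port 26379 (Sentinel's default), and the master group name mymaster are placeholders.

    import redis

    # Talk to one Sentinel and inspect what it currently knows about the
    # monitored master group "mymaster".
    s = redis.Redis(host="sentinel-1.example.com", port=26379)

    # Address of the current master, maintained by the discovery mechanism.
    print(s.execute_command("SENTINEL", "GET-MASTER-ADDR-BY-NAME", "mymaster"))

    # Slaves discovered through INFO, and the other Sentinels discovered
    # through the hello messages.
    print(s.execute_command("SENTINEL", "SLAVES", "mymaster"))
    print(s.execute_command("SENTINEL", "SENTINELS", "mymaster"))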

3. Redis High-Availability Architectures

With Redis Sentinel covered, let's go through the commonly used Redis high-availability architectures:

  • Redis Sentinel cluster + intranet DNS + custom script

  • Redis Sentinel cluster + VIP + custom script

  • Wrapped client connecting directly to the Redis Sentinel port

    • JedisSentinelPool, for Java

    • A wrapper built on phpredis, for PHP

  • Redis Sentinel cluster + Keepalived/HAProxy

  • Redis M/S + Keepalived

  • Redis Cluster

  • Twemproxy

  • Codis

Next, let's look at each of them in turn, with diagrams.

1. Redis Sentinel cluster + intranet DNS + custom script

(Figure: Redis Sentinel cluster + intranet DNS + custom script)

The diagram above shows a solution that is already running in our production environment. At the bottom is the Redis Sentinel cluster, which manages the Redis master and slaves; the web tier connects through an intranet DNS name. Intranet DNS names are allocated according to a rule such as xxxx.redis.cache/queue.port.xxx.xxx: the first segment is the business abbreviation, the second segment marks it as a Redis intranet domain, the third segment is the Redis type (cache for caching, queue for queues), the fourth segment is the Redis port, and the fifth and sixth segments form the main intranet domain.

When the master fails, for example due to a machine failure, a Redis process failure, or the network becoming unreachable, the Sentinel cluster invokes the script configured with client-reconfig-script and changes the intranet domain name for the corresponding port so that it points to the new Redis master.
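
As an illustration of this mechanism, here is a minimal sketch of what such a client-reconfig-script could look like in Python. The argument order matches the one listed in the best-practices section below; the update_dns helper, the dns-admin command line, and the record naming are hypothetical stand-ins for whatever intranet DNS API is actually in use.

    #!/usr/bin/env python
    # Hypothetical client-reconfig-script: point the intranet DNS record for the
    # failed master's port at the newly promoted master.
    import subprocess
    import sys

    def update_dns(record, new_ip):
        # Placeholder: replace with the real intranet DNS API or nsupdate call.
        subprocess.check_call(["dns-admin", "update", record, "--a", new_ip])

    def main():
        # Arguments passed by Sentinel:
        # <service_name> <role> <state> <from_ip> <from_port> <to_ip> <to_port>
        service_name, role, state, from_ip, from_port, to_ip, to_port = sys.argv[1:8]

        # Only the Sentinel that leads the failover should touch the DNS record.
        if role != "leader":
            return

        # Derive the record from the naming rule described above
        # (business.redis.type.port.domain); this mapping is an assumption.
        record = "myapp.redis.cache.%s.example.internal" % to_port
        update_dns(record, to_ip)

    if __name__ == "__main__":
        main()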

Advantages:

  • Switchover in seconds; the whole operation finishes within 10 s

  • Custom scripts, so the architecture stays under our control

  • Transparent to the application; the front end need not care what changes in the back end

Disadvantages:

  • Slightly higher maintenance cost; a Redis Sentinel cluster should use at least 3 machines

  • Depends on DNS, so there is resolution latency

  • Sentinel mode has a short window of service unavailability

  • This solution cannot be used when the service is accessed over the public Internet

2. Redis Sentinel cluster + VIP + custom script

(Figure: Redis Sentinel cluster + VIP + custom script)

Compared with the previous solution, this one differs only slightly: the intranet DNS is replaced with a virtual IP (VIP). At the bottom is the Redis Sentinel cluster, which manages the Redis master and slaves, and the web tier is served through the VIP. When deploying the Redis master and slaves, the VIP must be bound to the current Redis master. When the master fails, for example due to a machine failure, a Redis process failure, or the network becoming unreachable, the Sentinel cluster invokes the script configured with client-reconfig-script and floats the VIP over to the new master.
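
For the VIP variant, the script's job is to move the address instead of a DNS record. The sketch below assumes the script runs on a Sentinel host and reaches the Redis machines over SSH with paramiko (as recommended in the best practices later); the interface name eth0, the /24 prefix, the user name, and the ssh_exec helper are illustrative assumptions.

    # Hypothetical VIP drift for a Sentinel client-reconfig-script.
    import paramiko

    VIP = "10.0.0.100"  # placeholder virtual IP

    def ssh_exec(host, command):
        # One short-lived SSH session per command; a production script would
        # reuse connections to save time.
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username="redis-admin", timeout=5)
        try:
            stdin, stdout, stderr = client.exec_command(command)
            return stdout.channel.recv_exit_status()
        finally:
            client.close()

    def drift_vip(old_master_ip, new_master_ip):
        # Best effort on the old master: it may already be dead or unreachable.
        try:
            ssh_exec(old_master_ip, "ip addr del %s/24 dev eth0" % VIP)
        except Exception:
            pass
        # Bind the VIP on the new master and refresh the neighbours' ARP caches.
        ssh_exec(new_master_ip, "ip addr add %s/24 dev eth0" % VIP)
        ssh_exec(new_master_ip, "arping -c 3 -A -I eth0 %s" % VIP)

    # In the client-reconfig-script itself: call drift_vip(from_ip, to_ip),
    # guarded by the same role == "leader" check as in the DNS example above.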

Advantages:

  • Switchover in seconds; the whole operation finishes within 5 s

  • Custom scripts, so the architecture stays under our control

  • Transparent to the application; the front end need not care what changes in the back end

Disadvantages:

  • Slightly higher maintenance cost; a Redis Sentinel cluster should use at least 3 machines

  • Using a VIP adds maintenance cost and brings a risk of IP confusion

  • Sentinel mode has a short window of service unavailability

3. Wrapped client connecting directly to the Redis Sentinel port

Some business systems can only reach Redis over the public network, so neither of the two solutions above is usable; that is how this solution came about. The web client connects to a port on one of the machines in the Redis Sentinel cluster, obtains the current master from that Sentinel, and then connects to the real Redis master to carry out its business operations. Note that both the Redis Sentinel ports and the Redis master must be opened up for access. If the front end is written in Java, JedisSentinelPool can be reused; if it is written in PHP, a wrapper can be built on top of phpredis.
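
The article names JedisSentinelPool for Java and a phpredis-based wrapper for PHP. As a neutral illustration of the same idea, here is a minimal sketch using the Python redis-py client, assuming three Sentinels on the default port 26379 and a master group named mymaster:

    from redis.sentinel import Sentinel

    # Ask the Sentinels who the current master is, then talk to it directly.
    sentinel = Sentinel(
        [("sentinel-1.example.com", 26379),
         ("sentinel-2.example.com", 26379),
         ("sentinel-3.example.com", 26379)],
        socket_timeout=0.5,
    )

    master = sentinel.master_for("mymaster", socket_timeout=0.5)   # read/write
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # read-only

    master.set("greeting", "hello")
    print(replica.get("greeting"))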

Advantages:

  • Failures are detected promptly

  • Low maintenance cost for the DBA

Disadvantages:

  • Depends on the client supporting Sentinel

  • The Sentinel servers and Redis nodes must be opened up for access

  • Intrusive to the application

4. Redis Sentinel cluster + Keepalived/HAProxy

(Figure: Redis Sentinel cluster + Keepalived/HAProxy)

At the bottom is the Redis Sentinel cluster, which manages the Redis master and slaves; the web tier is served through a VIP. When the master fails, for example due to a machine failure, a Redis process failure, or the network becoming unreachable, the switchover between Redis nodes is guaranteed by Redis Sentinel's internal mechanisms, while the VIP switchover is guaranteed by Keepalived.

Advantages:

  • Switchover in seconds

  • Transparent to the application

Disadvantages:

  • High maintenance cost

  • Split-brain is possible

  • Sentinel mode has a short window of service unavailability

5. Redis M/S + Keepalived

(Figure: Redis M/S + Keepalived)

This solution does not use Redis Sentinel at all. It relies on native master/slave replication plus Keepalived: the VIP switchover is guaranteed by Keepalived, while the switchover between the Redis master and slave has to be implemented with a custom script.
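
As a sketch of what that custom switchover script might look like, the example below assumes it is wired up as a Keepalived notify script (Keepalived passes the instance type, name, and new state as arguments) and that redis-py is installed on each node; the VIP and port are placeholders.

    #!/usr/bin/env python
    # Hypothetical Keepalived notify script: when this node takes over the VIP
    # (state MASTER), promote the local Redis; otherwise re-attach it as a slave.
    import sys
    import redis

    MASTER_VIP = "10.0.0.100"  # the VIP managed by Keepalived (placeholder)
    REDIS_PORT = 6379

    def main():
        # Keepalived calls: <script> GROUP|INSTANCE <name> <MASTER|BACKUP|FAULT>
        state = sys.argv[3]
        r = redis.Redis(host="127.0.0.1", port=REDIS_PORT)

        if state == "MASTER":
            r.slaveof()                        # SLAVEOF NO ONE: become master
        else:
            r.slaveof(MASTER_VIP, REDIS_PORT)  # replicate from the VIP holder

    if __name__ == "__main__":
        main()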

Advantages:

  • Switchover in seconds

  • Transparent to the application

  • Simple to deploy, low maintenance cost

Disadvantages:

  • Requires a script to implement the switchover

  • Split-brain is possible

6. Redis Cluster

(Figure: Redis Cluster)

From: http://intro2libsys.com/focused-redis-topics/day-one/intro-redis-cluster

Redis 3.0.0 was officially released on April 2, 2015, more than two years ago now. Redis Cluster uses a peer-to-peer, decentralized model. The key space is divided into 16384 slots, and each instance is responsible for part of the slots. When a client requests data and the instance it asked does not hold the corresponding slot, that instance redirects the client to the instance that does. In addition, Redis Cluster nodes exchange node information via the Gossip protocol.
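
To make the slot mapping concrete, here is a small sketch that computes HASH_SLOT = CRC16(key) mod 16384 the way Redis Cluster specifies it (CRC16/XMODEM, with hash-tag handling so that keys sharing the same {tag} land in the same slot):

    def crc16(data: bytes) -> int:
        """CRC16/XMODEM (polynomial 0x1021, init 0), as used by Redis Cluster."""
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
        return crc

    def key_slot(key: str) -> int:
        """Map a key to one of the 16384 Redis Cluster hash slots."""
        # If the key contains a non-empty {hash tag}, only the tag is hashed,
        # so related keys can be forced into the same slot.
        start = key.find("{")
        if start != -1:
            end = key.find("}", start + 1)
            if end > start + 1:
                key = key[start + 1:end]
        return crc16(key.encode()) % 16384

    print(key_slot("user:1000"))           # some slot in 0..16383
    print(key_slot("{user:1000}.cart"))    # same slot as "{user:1000}.profile"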

Advantages:

  • All components in one box; simple to deploy and economical with machines

  • Better performance than proxy-based solutions

  • Automatic failover; data stays available during slot migration

  • The official, native cluster solution, so updates and support are assured

Disadvantages:

  • The architecture is relatively new, with few established best practices

  • Limited support for multi-key operations (client drivers can work around this)

  • For better performance, clients need to cache the slot routing table

  • Node discovery and resharding are not automated enough

7. Twemproxy

(Figure: Twemproxy)

From: http://engineering.bloomreach.com/the-evolution-of-fault-tolerant-redis-cluster

Multiple identical Twemproxy instances (with the same configuration) run at the same time, accept client requests, and forward each request to the corresponding Redis instance according to a hash algorithm.
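
As a rough illustration of this kind of proxy-side sharding (not Twemproxy's actual code), the sketch below hashes each key with FNV-1a 64-bit, one of the hash functions Twemproxy can be configured with, and picks a backend by simple modulo distribution; the server list is a placeholder.

    # Illustrative proxy-style sharding: hash the key, then pick a backend.
    SERVERS = ["redis-1:6379", "redis-2:6379", "redis-3:6379"]  # placeholders

    def fnv1a_64(data: bytes) -> int:
        """FNV-1a 64-bit hash."""
        h = 0xCBF29CE484222325
        for byte in data:
            h ^= byte
            h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
        return h

    def pick_server(key: str) -> str:
        # Plain modulo distribution; Twemproxy also offers ketama consistent
        # hashing, which moves far fewer keys when the server list changes.
        return SERVERS[fnv1a_64(key.encode()) % len(SERVERS)]

    print(pick_server("user:1000"))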

The Twemproxy approach is fairly mature; our team used it for a long time, but the results were not ideal. On the one hand, it makes problems hard to pin down; on the other hand, its support for automatically ejecting failed nodes is not very friendly.

Advantages:

  • Simple to develop against; almost transparent to the application

  • A long history and a mature solution

Disadvantages:

  • The proxy layer costs performance

  • The LVS and Twemproxy nodes can become performance bottlenecks

  • Scaling Redis out is very cumbersome

  • Twitter has abandoned this solution internally, and its new architecture has not been open-sourced

8. Codis

(Figure: Codis)

From: https://github.com/CodisLabs/codis

Codis is an open-source product from Wandoujia and involves quite a few components: ZooKeeper stores the routing table and proxy-node metadata and distributes commands for Codis-Config; Codis-Config is the integrated management tool, with a web UI; Codis-Proxy is a stateless proxy that speaks the Redis protocol; and Codis-Redis is a fork of Redis 2.8 with slot support added to make data migration easier.
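
Because Codis-Proxy speaks the Redis protocol, an application connects to it exactly as it would to a single Redis instance; for example, with redis-py (the proxy host and port below are placeholders):

    import redis

    # Connect to Codis-Proxy as if it were an ordinary Redis server; the proxy
    # routes each command to the right backend group behind it.
    client = redis.Redis(host="codis-proxy.example.internal", port=19000)
    client.set("user:1000:name", "robin")
    print(client.get("user:1000:name"))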

Advantages:

  • Simple to develop against; almost transparent to the application

  • Better performance than Twemproxy

  • Has a graphical interface; easy to scale out and convenient to operate

Disadvantages:

  • The proxy layer still costs performance

  • Too many components, requiring a lot of machine resources

  • The Redis code is modified, so it cannot stay in sync with upstream and is slow to pick up new features

  • The development team plans to shift its focus to reborndb, which is likewise based on a modified Redis

4. Best Practices

The so-called best practices are simply the practices that best fit a specific scenario.

The recommended solutions are:

  • Redis Sentinel cluster + intranet DNS + custom script

  • Redis Sentinel cluster + VIP + custom script

The following best practices were distilled from hands-on experience:

  • Use 5 or more machines for a Redis Sentinel cluster

  • Each large business line can use one Redis Sentinel cluster to manage all the ports under that business

  • Partition the Redis port ranges by business

  • Implement the custom scripts in Python, which makes them easy to extend

  • The custom scripts must take care to check the current Sentinel role (see the sketch after this list)

  • The arguments passed to the custom script are: <service_name> <role> <state> <from_ip> <from_port> <to_ip> <to_port>

  • The custom scripts need to operate on remote machines over SSH; use the paramiko library and avoid establishing SSH connections repeatedly, which wastes time

  • To speed up SSH connections, disable the following two sshd options:

  • UseDNS no

  • GSSAPIAuthentication no

  • For WeChat or email alerts, fork a separate process so the main process does not block

  • Both planned switchovers and failovers should complete all operations within 15 s
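
Here is a minimal sketch of the role check and the forked alert recommended above, assuming the script is invoked as a client-reconfig-script with the argument order listed; send_alert is a placeholder for the real WeChat or email integration.

    import os
    import sys

    def send_alert(message):
        # Placeholder for the real WeChat / email notification call.
        print("ALERT:", message)

    def alert_async(message):
        # Fork a child so that a slow notification never blocks the main script.
        if os.fork() == 0:
            try:
                send_alert(message)
            finally:
                os._exit(0)

    def main():
        # <service_name> <role> <state> <from_ip> <from_port> <to_ip> <to_port>
        service_name, role = sys.argv[1], sys.argv[2]
        to_ip, to_port = sys.argv[6], sys.argv[7]

        if role != "leader":
            return  # only the Sentinel that leads the failover should act

        alert_async("%s failed over to %s:%s" % (service_name, to_ip, to_port))
        # ... update DNS or drift the VIP here, as in the earlier sketches ...

    if __name__ == "__main__":
        main()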

5. Summary

This session covered why Redis high availability is necessary, how Sentinel works, the commonly used Redis high-availability architectures, and the best practices distilled from hands-on experience. I hope it is helpful to readers. If you would like to follow up, add me on WeChat (Wentasy) or send an email to [email protected].

Slides (PPT): https://github.com/dbarobin/slides

Video replay: Redis High Availability Architecture Best Practices, http://www.itdks.com/dakashuo/detail/1437

6. Acknowledgements

Thanks to Tingyun and Yunweibang for organizing the event so carefully, and to everyone who braved the heavy rain to attend. The session was recorded in full by IT Dakashuo; thanks to them for the technical support.
