Disaster Recovery in Seconds: The Three-Generation Architectural Evolution of UCloud's Highly Available Network Services

In today's fast-paced life, no business failure or interruption is tolerated.

You are at an unmanned supermarket checkout and the payment page suddenly freezes, so the purchase cannot be completed. Do you abandon the goods in your hands or keep waiting?

You are checking in at a hotel when the management system crashes and bookings cannot be looked up, so check-in stalls and a long queue forms at the reception desk...

High availability affects every one of us, and the upcoming Double Eleven (November 11) shopping festival places even higher availability demands on every e-commerce business. To meet them, UCloud provides highly available services based on the intranet VIP (virtual IP). Over three generations of design evolution, the intranet VIP adopted a broadcast cluster, which solves the problem of implementing broadcast in complex, heterogeneous Overlay networks, achieves high-availability switchover within seconds, and also supports the physical cloud well.

The rest of this article describes in detail how UCloud's network services achieve high-availability switchover within seconds.

Highly available services based on the intranet VIP

1. The idea and key points of high availability

From a business perspective, application failures should of course be avoided as far as possible. But eliminating failures entirely is not possible.

How, then, can this problem be solved? The answer is to assume that no single node is reliable and to add a backup for every node. When any node fails, traffic is automatically switched to a healthy node, and the whole switchover is imperceptible to the user. This is the basic idea of high availability. Its two key points are backup nodes and automatic failover.

Figure: once A fails, traffic quickly switches to B

2. High-availability solutions in traditional networks

In a traditional network, Keepalived + virtual IP is a classic high-availability solution.

Keepalived is high-availability software based on the VRRP protocol. It runs one master node and multiple backup nodes, all deploying the same service. The master node serves external traffic through a virtual IP. When the master fails, Keepalived starts a VRRP negotiation and promotes one backup node to master. The virtual IP automatically floats to that node, which then uses GARP (gratuitous ARP) to announce the virtual IP's updated location, so that service continues normally.
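The promotion step can be illustrated with a small sketch. It collapses VRRP's distributed, timer-driven negotiation into a single function, so it shows only the outcome of the election rather than how Keepalived actually implements it. The `Backup` class, its fields, and the sample addresses are hypothetical names introduced for illustration. The rule itself (highest priority wins, ties go to the higher IP address) follows the VRRP specification.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Backup:
    """One backup node as seen by the election (illustrative fields)."""
    name: str
    priority: int   # VRRP priority; backups use 1-254
    address: str    # primary IPv4 address of the node
    alive: bool = True


def elect_master(backups):
    """Pick the new master: highest priority wins, ties go to the higher IP."""
    candidates = [b for b in backups if b.alive]
    if not candidates:
        raise RuntimeError("no live backup node to promote")
    return max(
        candidates,
        key=lambda b: (b.priority, int(ipaddress.IPv4Address(b.address))),
    )


nodes = [
    Backup("backup-1", priority=100, address="10.0.0.11"),
    Backup("backup-2", priority=150, address="10.0.0.12", alive=False),
    Backup("backup-3", priority=100, address="10.0.0.13"),
]
print(elect_master(nodes).name)  # backup-3
```

Here backup-2 has the highest priority but is down, so the two remaining nodes tie on priority and the higher address decides.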

3. High availability in cloud Overlay networks

Network architecture has changed dramatically in cloud computing: traditional networks have been replaced by Overlay networks, and various complex, heterogeneous networks have emerged. How can the high-availability problem be solved in this new network environment?

First, let us look at the basic principles of a cloud computing network:

Figure: implementation of the cloud computing network

As shown above, cloud resources are bridged onto the OVS bridge, and the service NIC is bridged onto the same OVS bridge. The Controller was developed in-house by UCloud on the open-source Ryu framework. It interacts with the back-end Manager to pull ACLs, routing tables, and VPC connectivity and isolation information, and installs Flows on the OVS bridge through OVS messages. Managing these Flows provides ACL permit and block behaviour and Layer 3 forwarding, which in turn delivers VPC connectivity and tenant isolation. The actual upper-layer service packets are encapsulated in GRE, which keeps them transparent to the underlying network.

Given how complex it is for users to implement high availability in a cloud computing network, UCloud designed the intranet VIP product, which serves both cloud hosts and physical cloud hosts on the cloud platform. It acts as a floating intranet entry point for user-defined highly available services: once a fault is detected, failover happens automatically, with no additional API calls or in-host configuration, and the switchover completes within seconds.

Figure: intranet VIP console interface

How does the intranet VIP switch over within seconds?

Intranet VIP failover generally involves the following two steps:

1. After the Master fails, the standby servers need to elect a new Master;

2. The other nodes in the broadcast domain need to be informed that the IP's location has changed.
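The second step is carried by a gratuitous ARP frame. As a rough sketch of what such a frame looks like on the wire, the following builds one with the standard library only. The function name `build_garp` and the sample VIP and MAC address are assumptions made for illustration. The sketch uses the request form of GARP, whereas real implementations may send a reply instead.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def build_garp(vip: str, mac: str) -> bytes:
    """Build a gratuitous ARP request announcing that `vip` is now at `mac`."""
    sha = bytes.fromhex(mac.replace(":", ""))
    spa = socket.inet_aton(vip)
    ethernet = BROADCAST_MAC + sha + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware / protocol address lengths
        1,            # opcode: request
        sha, spa,     # sender MAC and sender IP (the VIP)
        b"\x00" * 6,  # target MAC: unused in a request, zero-filled here
        spa,          # target IP equals sender IP: the GARP signature
    )
    return ethernet + arp


frame = build_garp("10.0.0.100", "52:54:00:12:34:56")
print(len(frame))  # 42
```

Because the frame is addressed to the Ethernet broadcast address, delivering it to every node in the broadcast domain is exactly the problem the Overlay network has to solve.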

As described above, in an Overlay network, ARP resolution, IP addressing, unicast, multicast, and broadcast for upper-layer service packets all have to be re-implemented, which is no small challenge. So how should broadcast be implemented?

UCloud's broadcast mechanism has evolved through the following three versions.

The first generation: simulated broadcast

Figure: simulated broadcast

As shown above, a broadcast packet is copied N times and sent directly to the other nodes in the broadcast domain, which completes the broadcast. Because OVS supports copying and sending packets, this only requires specifying multiple Output actions in a Flow. The Flow has the following form:

Figure: Flow form for simulated broadcast
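The original figure is not reproduced here, so the following is only a guess at the general shape of such a Flow, not UCloud's actual rule. It generates a flow in `ovs-ofctl` text syntax with one `output` action per member port. The function name, the priority value, and the use of local port numbers are assumptions. In a real Overlay deployment, remote members would presumably be reached through GRE tunnel ports rather than plain local ports.

```python
def broadcast_flow(in_port, member_ports, priority=100):
    """Return an ovs-ofctl style flow that copies broadcast frames to each member port."""
    # One output action per member, skipping the port the frame arrived on.
    outputs = [f"output:{p}" for p in member_ports if p != in_port]
    if not outputs:
        raise ValueError("broadcast domain has no other members")
    match = f"priority={priority},in_port={in_port},dl_dst=ff:ff:ff:ff:ff:ff"
    return f"{match},actions={','.join(outputs)}"


flow = broadcast_flow(in_port=1, member_ports=[1, 2, 3, 4])
print(flow)
# priority=100,in_port=1,dl_dst=ff:ff:ff:ff:ff:ff,actions=output:2,output:3,output:4
```

The action list grows by one entry for every member of the broadcast domain, and every host needs its own copy of the rule.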

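This implementation does meet the requirement, but it has several obvious drawbacks:

1. Flow updates. A user's broadcast domain changes over time, and whenever it changes, the broadcast Flows on the hosts of all nodes in that broadcast domain must be pushed again. If the broadcast domain is large, these updates are very costly in performance.

2. Flow length is limited. OVS limits the length of a Flow: a single Flow cannot exceed 64K bit, and the Flow necessarily grows as the broadcast domain grows. If a customer's subnet is large enough to exceed the limit, the Flow can no longer be updated, broadcast behaviour becomes abnormal, and high availability is affected in turn.

3. Broadcast on heterogeneous networks has to be implemented separately. For example, physical cloud hosts are not built on an OVS-based architecture underneath, so the logic has to be implemented all over again, at high development and maintenance cost.

To solve these problems, UCloud developed its second-generation broadcast solution, the broadcast cluster:

The second generation: broadcast cluster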

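Figure: broadcast cluster

As shown above, all broadcast traffic is directed by Flows to the in-house broadcast cluster. The broadcast cluster pulls broadcast information from the business database, then copies and distributes the packets. It is a highly available cluster that UCloud built in-house on DPDK, and it implements the broadcast logic with high performance.

The broadcast cluster solves the problems of the first-generation broadcast logic well:

1. Changes to the broadcast domain. Only the broadcast cluster needs to be notified of a change; there is no need to inform the whole network.

2. Size of the broadcast domain. The broadcast cluster uses DPDK to copy and forward packets, so in theory the broadcast domain has no upper limit.

3. Adapting to different networks. Each type of network only needs to deliver broadcast packets to the broadcast cluster, with no additional logic to develop, so all kinds of network scenarios are accommodated.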
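The division of labour described above can be modelled in a few lines. This is a toy, pure-Python model of the idea only: the real cluster is built on DPDK and pulls membership from a business database, neither of which is shown. The class name, method names, and host identifiers are hypothetical.

```python
class BroadcastCluster:
    """Toy model of a central replicator keyed by broadcast domain."""

    def __init__(self):
        self._domains = {}  # domain id -> set of member endpoints

    def set_members(self, domain, members):
        """Replace a domain's membership; the only update a change requires."""
        self._domains[domain] = set(members)

    def replicate(self, domain, source, packet):
        """Return one (destination, packet) pair per member except the sender."""
        members = self._domains.get(domain, set())
        return [(dst, packet) for dst in sorted(members) if dst != source]


cluster = BroadcastCluster()
cluster.set_members("subnet-a", ["host-1", "host-2", "host-3"])
copies = cluster.replicate("subnet-a", "host-1", b"garp")
print([dst for dst, _ in copies])  # ['host-2', 'host-3']
```

A membership change touches one table in one place, and the senders only need a fixed rule pointing at the cluster, whatever network type they sit on.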

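Later, building on the second generation, UCloud introduced a third-generation broadcast solution:

The third generation: broadcast cluster + GARP sniffing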

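Figure: broadcast cluster based on GARP sniffing

If the second-generation broadcast cluster already delivers highly available services well, why did UCloud develop a third generation?

As described earlier, during a VIP switchover GARP uses broadcast to inform the whole broadcast domain, and the VIP floats accordingly. Servers outside the broadcast domain, however, have no way of learning this information. The consequence is that a VIP switchover breaks access across Layer 3.

Access across Layer 3 requires the back-end database to learn of the VIP's location change by some means. During an intranet VIP switchover, GARP packets notify the nodes in the broadcast domain of the change, and the broadcast cluster sees all broadcast traffic. The broadcast cluster therefore filters out GARP traffic by its signature ARP_SPA=ARP_TPA, reports the corresponding location information to the back end, and updates the Flow information, which keeps Layer 3 access working.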
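The ARP_SPA=ARP_TPA filter can be sketched as follows for plain Ethernet/IPv4 frames. This is a minimal illustration, not the cluster's DPDK code. The names `sniff_garp`, `on_broadcast`, and the `locations` dictionary (a stand-in for the back-end location table) are assumptions.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806


def sniff_garp(frame: bytes):
    """Return (vip, mac) if `frame` is a gratuitous ARP over Ethernet/IPv4, else None."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, _oper, sha, spa, _tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", frame[14:42]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    if spa != tpa:  # ARP_SPA == ARP_TPA is the GARP signature
        return None
    return socket.inet_ntoa(spa), ":".join(f"{b:02x}" for b in sha)


locations = {}  # hypothetical stand-in for the back-end location table


def on_broadcast(frame: bytes) -> None:
    """Record the VIP's new location when the frame is a GARP."""
    hit = sniff_garp(frame)
    if hit is not None:
        vip, mac = hit
        locations[vip] = mac
```

Ordinary ARP requests ask about someone else's address, so their sender and target IPs differ and they pass through without touching the location table.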

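Under the third-generation architecture, the broadcast cluster supports public cloud, physical cloud, and other heterogeneous networks, meeting the needs of high-availability scenarios across different kinds of cloud computing.

Application examples

1. High availability in practice for an e-commerce payment system

An e-commerce company, with its frequent everyday purchases and promotional campaigns, placed very high availability demands on its payment system. Consumers are very sensitive to the availability of a payment system: even a minor fault such as "payment failed", "pay again", or "payment timed out" makes for a poor experience, and in serious cases may even drive users away.

Leaving aside sudden failures of external dependencies, such as network problems or large-scale unavailability of third-party payment providers and banks, the company wanted to safeguard its customers' experience by improving the reliability of its own payment system.

To achieve high availability, UCloud used Keepalived + the intranet VIP product to quickly build a highly reliable service for the company's online payment system, which avoids single points of failure of its own and greatly improves the system's availability.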

Figure: example of building a highly available service

As shown in the figure, the VIP is bound to a UPHost (physical cloud host) that acts as the Master node. When the Master node fails, the VIP floats to another node. The physical cloud gateway receives the GARP packet and forwards it to the broadcast cluster. After analysing the GARP packet, the broadcast cluster reports the new location to the back end, and the physical cloud gateway configuration and the public cloud platform's Flows are updated. The broadcast cluster then copies the GARP packet to every UHost and UPHost in the broadcast domain. Both Layer 2 and Layer 3 access information is updated within seconds, which keeps the service highly available.

2. High availability for UCloud's cloud database product UDB

The high-availability design of UCloud's cloud database product UDB also uses intranet VIP technology. As shown in the figure below, UDB uses a dual-master architecture with data synchronised through Semi-Sync. An underlying availability management module monitors node availability in real time. Once it detects that the Master DB is unavailable, it automatically triggers the disaster recovery switchover, and the intranet VIP floats statelessly to the Standby DB, which keeps the user's UDB database service stable and reliable.

Figure: UCloud DB high availability implemented with the intranet VIP

Because UDB exposes a single intranet VIP as its access point, the switchover is seamless for the application layer, with no manual intervention or user configuration changes at any point. Relying on the intranet VIP, UDB provides users with a highly available database service. The product now serves tens of thousands of businesses and provides tens of thousands of database instances.
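The monitor-then-switch behaviour of the availability management module can be pictured with a small sketch. The article does not say how UDB decides that the Master DB is unavailable, so the rule used here (a fixed number of consecutive failed probes) and the default threshold of 3 are assumptions, as are the class and node names.

```python
class FailoverMonitor:
    """Moves the VIP to the standby after several consecutive failed probes."""

    def __init__(self, master, standby, threshold=3):
        self.vip_holder = master
        self.standby = standby
        self.threshold = threshold  # assumed value, not from the article
        self.failures = 0

    def observe(self, master_ok: bool) -> str:
        """Feed one probe result; return the node currently holding the VIP."""
        if self.vip_holder == self.standby:
            return self.vip_holder  # already failed over
        # A successful probe resets the counter; a failure increments it.
        self.failures = 0 if master_ok else self.failures + 1
        if self.failures >= self.threshold:
            self.vip_holder = self.standby
        return self.vip_holder


monitor = FailoverMonitor("master-db", "standby-db")
for ok in (True, False, False, False):
    holder = monitor.observe(ok)
print(holder)  # standby-db
```

Clients keep connecting to the same VIP throughout, which is why no configuration change is needed on their side.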

Epilogue

High availability is a complex topic. Beyond using the intranet VIP product to avoid potential single points of failure, it also requires strict operational discipline in service maintenance, including splitting services sensibly beforehand and monitoring them well afterwards.

But is that all? Murphy's Law tells us that anything that can go wrong has a good chance of going wrong. Ask yourself three questions every day: Is the business architecture stable enough? Is exception handling complete? Is the disaster recovery plan robust enough? Keep optimising business systems accordingly. May every operations engineer sleep soundly!

Origin blog.51cto.com/13832960/2447056