UCloud Physical Cloud Gateway: Design Practice for a 100G-Class Cluster

UCloud's Physical Cloud Host is a dedicated physical server offering with excellent computing performance. It meets the high-performance and stability requirements of core business scenarios and can be flexibly combined with other products. The physical cloud gateway carries the internal network traffic between the physical cloud and each public cloud product. Because users need to deploy in multiple regions, the gateway cluster faces heavy cross-region, cross-cluster traffic pressure.

We used multi-tunnel traffic scattering and related measures to solve the overload caused by hash polarization, and we limit elephant flows through capacity management and lossless migration to an isolation zone. Since the new scheme went live, the cluster has gone from carrying tens of G of traffic to carrying more than 100G, helping users such as Dada (达达) get through the Double Eleven traffic peak smoothly. The practical experience is shared below.

I. Physical Cloud Traffic Overload

To ensure high availability of their cloud services, users usually deploy their business in different regions. Physical cloud users in those regions then need to reach each other through the physical cloud gateway, so the gateway inevitably carries a large amount of cross-cluster traffic between physical cloud hosts.

Meanwhile, to keep the internal network traffic of different users isolated as it travels between data centers, the physical cloud gateway encapsulates user packets in a tunnel before sending them to the receiver.

1. The problem: hash polarization and physical cloud overload

As shown in the figure below, we found that the bandwidth of gateway device e in physical cloud cluster 2 was overloaded, which affected every service accessing cluster 2. Monitoring showed that traffic in cluster 2 was unevenly distributed: some devices had their bandwidth saturated while the rest carried very little traffic. Packet captures showed that the traffic on gateway device e came almost entirely from physical cloud cluster 1.


Figure: Tunnel encapsulation for cross-cluster access

Combining this with an analysis of the business, we determined the cause of the overload: the traffic between physical cloud cluster 1 and cluster 2 suffered from hash polarization, which made the traffic distribution uneven.

So what is hash polarization?

Clusters communicate with each other through a single tunnel. Tunnel encapsulation hides the user's original information, such as IP and MAC addresses, and exposes only the outer tunnel headers, and the tunnel uses a single source IP (SIP) and destination IP (DIP). With the same hash algorithm, the computed result is therefore always the same, so the traffic cannot be load-balanced well. A single device in the cluster sees its load surge and, in extreme cases, becomes saturated, which affects every user of the cluster. This is hash polarization, and it often appears in scenarios where traffic is hashed across multiple devices.
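The effect can be reproduced with a minimal sketch. It assumes the switch picks a device by hashing only the outer tunnel SIP and DIP; the device names, the addresses, and the use of `zlib.crc32` as the hash are illustrative assumptions, not the actual switch algorithm.

```python
import zlib
from collections import Counter

DEVICES = ["a", "b", "c", "d", "e"]  # hypothetical gateway devices in cluster 2


def pick_device(outer_sip: str, outer_dip: str) -> str:
    """Pick a device from the outer tunnel header only (assumed switch behavior)."""
    key = f"{outer_sip}->{outer_dip}".encode()
    return DEVICES[zlib.crc32(key) % len(DEVICES)]


# 1000 distinct user flows, all encapsulated in the same single tunnel.
# The inner flow identity never reaches the hash, so it cannot influence the result.
TUNNEL_SIP, TUNNEL_DIP = "192.0.2.1", "198.51.100.1"  # documentation addresses
load = Counter(pick_device(TUNNEL_SIP, TUNNEL_DIP) for _flow in range(1000))
print(load)  # every flow lands on the same device
```

However many user flows there are, the hash input never changes, so one device receives all of the traffic while the others stay idle.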

Based on this situation, we tried to address the problem from two angles:

① If user traffic can be broken up, how do we avoid hash polarization after tunnel encapsulation?

② If user traffic cannot be broken up, how do we prevent an "elephant flow" from saturating the physical cloud network?

Below, we examine solutions from these two angles.

2. How do we avoid hash polarization after tunnel encapsulation?

To address this question, we initially proposed several solutions:

① Scheme 1: The switch sends user traffic to the devices in the cluster in round-robin fashion. The advantage is that the traffic is fully broken up and hash polarization does not occur. The disadvantage is that the ordering of network packets may be disrupted, which can affect the user's business.

② Scheme 2: The switch hashes on the inner packet of the tunnel. Because the hash is computed on the user's own packet, traffic is spread more evenly across the devices in the cluster. The problem is that user packets may be fragmented again after tunnel encapsulation, in which case the inner packet information is missing from some fragments and the fragments are hashed to different devices.

③ Scheme 3: Each device in the cluster is assigned its own tunnel source IP. This breaks up traffic effectively, but because the number of tunnels is limited, uneven hashing is still evident in the production network.

All three methods have drawbacks to varying degrees, and none of them completely solves hash polarization. After a series of investigations, we arrived at a multi-tunnel solution. It breaks with the single-tunnel gateway model: every gateway is bound to a network segment of tunnel IPs, hashes the user's inner packet information, and selects the SIP and DIP from the pre-assigned tunnel segments. This spreads different flows across different tunnels as far as possible, which breaks up the user traffic.
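A minimal sketch of the multi-tunnel selection follows. The article only says that the gateway hashes "inner packet information"; using the inner 5-tuple, the two /28 tunnel segments, and `zlib.crc32` are all assumptions made for illustration.

```python
import ipaddress
import zlib
from collections import Counter

# Hypothetical tunnel segments bound to the local and remote gateway clusters.
LOCAL_TUNNEL_IPS = [str(ip) for ip in ipaddress.ip_network("192.0.2.0/28").hosts()]
REMOTE_TUNNEL_IPS = [str(ip) for ip in ipaddress.ip_network("198.51.100.0/28").hosts()]


def select_tunnel(src_ip, dst_ip, proto, src_port, dst_port):
    """Map an inner flow (assumed 5-tuple) to an outer (SIP, DIP) pair.

    The mapping is deterministic, so all packets of one inner flow
    use the same tunnel.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    h = zlib.crc32(key)
    sip = LOCAL_TUNNEL_IPS[h % len(LOCAL_TUNNEL_IPS)]
    dip = REMOTE_TUNNEL_IPS[(h // len(LOCAL_TUNNEL_IPS)) % len(REMOTE_TUNNEL_IPS)]
    return sip, dip


# 1000 inner flows that differ only in source port.
flows = [("10.0.0.1", "10.1.0.1", 6, 10000 + i, 80) for i in range(1000)]
tunnels = Counter(select_tunnel(*f) for f in flows)
print(len(tunnels), "distinct tunnels used")
```

Because the outer SIP and DIP now vary per flow, a downstream device that hashes only on the outer header sees many different inputs instead of one, while each individual flow still follows a single path.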


Figure: Multi-tunnel solution

3. How do we prevent an "elephant flow" from saturating the physical cloud network?

The multi-tunnel solution assumes that user traffic can be broken up. But what if we encounter an "elephant flow"? Even multiple tunnels cannot keep a device from being saturated. For users with elephant flows, a purely technical fix is not enough; we also need to prevent and avoid the problem in advance at the hardware configuration level.

■ Per-device capacity management

First, the capacity of the physical cloud gateway must be managed sensibly: the bandwidth a single gateway can carry must be greater than the bandwidth of a user's physical cloud host, and the carrying capacity of the whole cluster must meet user demand.


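Figure: Example of raising per-device capacity from 10G to 25G

This is closely tied to the cloud vendor's own capabilities. At present, the capacity of a single device in a UCloud gateway cluster far exceeds the traffic of any single user, so even while carrying the aggregated traffic of many users, a burst "elephant flow" from an individual user will not saturate the gateway.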
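The two capacity rules can be written down as a simple check. This is only a sketch: the 10G and 25G gateway figures come from the example in the figure caption, while the host bandwidth, device count, and expected peak are hypothetical values.

```python
def capacity_ok(gateway_g, host_g, gateway_count, expected_peak_g):
    """Return True if both capacity rules from the text hold.

    Rule 1: a single gateway carries more bandwidth than one physical cloud host.
    Rule 2: the whole cluster can carry the expected aggregate traffic.
    """
    single_device_ok = gateway_g > host_g
    cluster_ok = gateway_g * gateway_count >= expected_peak_g
    return single_device_ok and cluster_ok


# Hypothetical inputs: 10G hosts, 8 gateway devices, 100G expected peak.
print(capacity_ok(10, 10, 8, 100))  # 10G gateway: one host can fill a whole device
print(capacity_ok(25, 10, 8, 100))  # 25G gateway: both rules hold
```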

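■ Lossless migration to the isolation zone

Raising per-device capacity is far from enough. As a precaution, UCloud also maintains an isolation zone, which normally carries no traffic.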


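Figure: Lossless migration to the isolation zone

As shown in the figure above, once traffic is detected to be too high and the cluster is at risk of being saturated, the cluster's automatic migration system modifies the database records of the physical machines to be migrated and automatically updates the corresponding forwarding rules, so that part of the business traffic is offloaded through the isolation zone. We also verify the migration result automatically using strong validation, to ensure that the migration is lossless and reliable.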
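The migration flow can be sketched as follows. Everything here is an illustrative assumption rather than the actual migration system: the `Cluster` type, the 80% threshold, the "heaviest host first" policy, and the in-memory dictionary that stands in for the database and forwarding rules.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    capacity_g: float
    hosts: dict = field(default_factory=dict)  # host id -> current traffic in G

    def load(self) -> float:
        return sum(self.hosts.values())


def migrate_if_overloaded(cluster, isolation, threshold=0.8):
    """Move the heaviest hosts to the isolation zone until load is under the threshold."""
    before = {**cluster.hosts, **isolation.hosts}
    moved = []
    while cluster.hosts and cluster.load() > threshold * cluster.capacity_g:
        host = max(cluster.hosts, key=cluster.hosts.get)
        # Stands in for updating the host's database record and forwarding rules.
        isolation.hosts[host] = cluster.hosts.pop(host)
        moved.append(host)
    # Verification step: no host was lost and none is served by both clusters.
    after = {**cluster.hosts, **isolation.hosts}
    assert after == before
    assert not (cluster.hosts.keys() & isolation.hosts.keys())
    return moved


prod = Cluster("cluster-2", 100, {"h1": 60, "h2": 25, "h3": 10})
iso = Cluster("isolation", 100)
print(migrate_if_overloaded(prod, iso))  # hosts moved to the isolation zone
```

The verification at the end is a much weaker stand-in for the strong validation mentioned above; it only checks that every host is still assigned to exactly one cluster.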

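4. Example: user workloads under the old and new schemes

Before the new scheme went live, hash polarization meant the cluster could usually carry only tens of G of traffic and was overloaded from time to time.

After the new scheme went live, the monitoring graph below shows that traffic is essentially spread across the whole cluster and the advantages of clustering are fully realized. The cluster can now carry more than 100G of traffic and fully withstand sudden surges in user business. For example, during Double Eleven, 60G of traffic pressure was common for Dada (达达), with bursts reaching 100G; the cluster continued to forward traffic normally, with no impact on the business.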


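Figure: Traffic monitoring graph

Besides improving performance, this cluster upgrade also optimized the high-availability design.

II. High-Availability Optimization After the Cluster Upgrade

For a cluster upgrade, we normally deploy a new canary (grayscale) cluster first and then migrate user business to it gradually. The benefit is that if the new cluster version has defects, the scope of impact is kept as small as possible: when a fault occurs, the affected user business can be moved back to the old cluster promptly, so that it is not disrupted.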


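Figure: Expected result, with the new Manager taking over the canary cluster

During the canary rollout, we discovered a problem.

After the new cluster's Manager was deployed, a configuration error caused the canary cluster's Manager to take over the old cluster. The Manager automatically takes control of a cluster based on the cluster information in its configuration file and pushes configuration directly, so the old cluster accepted the wrong configuration. Because the configurations of the old and new clusters differ considerably, the old cluster misinterpreted the new configuration, and a high-availability anomaly occurred.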


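Figure: The canary Manager mistakenly takes over the old cluster

1. Risk analysis

To avoid this class of problem systematically, we retraced the configuration process and summarized the risks:

• Deployment involves a lot of manual intervention, which increases the probability of failure;

• The program's protection against abnormal input is insufficient;

• Clusters are not effectively isolated from one another, so a failure has a wide impact.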

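2. Optimization: automated operations, program hardening, and impact isolation

■ Automated operations

Automated operations replace manual steps with automation, which effectively prevents human error. We optimized the cluster deployment process and split it into two stages, configuration registration and deployment. Operators only need to enter the necessary configuration information; everything else is generated and deployed automatically.

■ Improved validation and alerting

In addition, we hardened parts of the program to validate abnormal configuration more strictly. For example, before a configuration is loaded, it must first pass a whitelist filter; if the configuration is found to be abnormal, loading is aborted and an alert is raised so that staff can step in.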
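A minimal sketch of such a whitelist check is shown below. The manager names, the allowed configuration keys, the `ConfigRejected` exception, and the list used to collect alerts are all hypothetical; the article does not describe the actual checks performed.

```python
class ConfigRejected(Exception):
    """Raised when a pushed configuration fails the whitelist check."""


ALLOWED_MANAGERS = {"manager-old"}                   # hypothetical trusted control planes
ALLOWED_KEYS = {"cluster_id", "tunnels", "routes"}   # hypothetical config schema
alerts = []                                          # stands in for an alerting system


def load_config(device_cluster_id, manager_id, config, alert=alerts.append):
    """Validate a pushed config before applying it; abort and alert on any anomaly."""
    problems = []
    if manager_id not in ALLOWED_MANAGERS:
        problems.append(f"manager {manager_id!r} is not whitelisted")
    if config.get("cluster_id") != device_cluster_id:
        problems.append("config targets a different cluster")
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    if problems:
        alert("; ".join(problems))
        raise ConfigRejected(problems)
    return dict(config)  # only a config that passed every check gets this far


# A canary Manager pushing a new-format config to a device in the old cluster.
try:
    load_config("old", "manager-gray", {"cluster_id": "gray", "tunnels_v2": []})
except ConfigRejected:
    print("rejected:", alerts[-1])
```

With a check like this, a device in the old cluster refuses the canary Manager's push and keeps its existing configuration instead of trying to interpret one it does not understand.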


Figure: Whitelist restriction, allowing only the correct control plane to synchronize configuration

■ Impact isolation

Finally, no matter how refined the automated operations and the program itself are, we must always assume that anomalies can happen. On that premise, we also need to consider how to minimize the scope and duration of the impact when a fault occurs. Our approach is as follows:

• Remove shared dependencies

The earlier incident happened mainly because the devices of both clusters depended on the same faulty Manager, so both clusters were affected. Shared dependencies between cluster devices therefore need to be removed to narrow the impact. For example, binding different clusters to different Managers effectively limits the scope. Of course, a shared dependency is not necessarily the Manager; it may also be an IP, a rack, and so on, which has to be screened carefully in each actual project.

• Set up an isolation zone

With the scope of impact under control, a Manager anomaly affects only some of the devices in a cluster. In that case, the faulty devices should be removed quickly, or all users on the cluster should be migrated directly to the isolation zone, so that the fault can be resolved as fast as possible.

Summary

As technology and business grow, system architectures become increasingly complex and tightly coupled, and the demands on engineers rise accordingly. While developing the physical cloud gateway cluster, we inevitably ran into many pitfalls, but we always held to one principle: all technology exists to serve the business. We are sharing this design experience in the hope that it gives readers some ideas and takeaways.


Origin blog.51cto.com/13832960/2465210