Design and Implementation of High-Performance Load Balancing

Abstract: At the 2017 Alibaba Cloud Network Technology Summit online forum, Alibaba Cloud's Weizheng gave a talk on the design and implementation of high-performance load balancing. This article starts with early load balancing, then moves on to high-performance load balancing with a focus on LVS and Tengine and on how high availability is achieved, and closes with a brief summary.

Here are the highlights:

Load Balancing

Load balancing is a fundamental component of cloud computing and the entry point for network traffic; its importance is self-evident.

What is load balancing? Traffic coming from users is distributed by a load balancer across multiple backend servers according to a load balancing algorithm, and each server that receives a request can respond to it independently, so the load is shared. In terms of application scenarios, common models include global load balancing and intra-cluster load balancing; by product form, they can be further divided into hardware load balancing and software load balancing. Global load balancing is generally implemented through DNS: a domain name is resolved to different VIPs to schedule traffic across regions. Common hardware load balancers include F5, A10, and Array. Their advantages and disadvantages are clear: they are powerful, backed by dedicated after-sales teams, and perform well, but they lack customization flexibility and are expensive to maintain. Today's Internet companies mostly go with software load balancing, which can satisfy all kinds of customization needs; common software load balancers include LVS, Nginx, and HAProxy.
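
To make the idea of "distributing traffic according to a load balancing algorithm" concrete, here is a minimal Python sketch of a weighted round-robin selector. It is an illustration only, not Alibaba Cloud's implementation; the backend addresses and weights are hypothetical.

```python
import itertools

class WeightedRoundRobin:
    """Minimal weighted round-robin selector over a list of backends.

    backends: list of (address, weight) tuples; a backend with weight 2
    receives roughly twice as many requests as a backend with weight 1.
    """

    def __init__(self, backends):
        # Expand each backend according to its weight, then cycle forever.
        expanded = [addr for addr, weight in backends for _ in range(weight)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical backend addresses, for illustration only.
    lb = WeightedRoundRobin([("10.0.0.1:80", 2), ("10.0.0.2:80", 1)])
    for _ in range(6):
        print(lb.pick())   # 10.0.0.1 is picked twice as often as 10.0.0.2
```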

Alibaba Cloud's high-performance load balancing uses LVS and Tengine. Within a region we distinguish different data centers; each data center has an LVS cluster and a Tengine cluster. For Layer 4 listeners configured by the user, the user's ECS instances are mounted directly behind LVS; for Layer 7 listeners, the user's ECS instances are mounted behind Tengine. Traffic on Layer 4 listeners is forwarded by LVS straight to ECS, while traffic on Layer 7 listeners goes through LVS to Tengine and then to the user's ECS. Each region contains multiple availability zones for active/standby disaster recovery, and each cluster contains multiple devices, first to improve performance and second for disaster recovery.

The figure shows an overview of how the high-performance load balancer is controlled and managed. The SLB product also follows the SDN idea of separating forwarding from control: all user configuration first reaches the controller through the console, and the centralized controller translates it and pushes it to the different devices. On each device an agent receives what the controller sends and converts it locally into configuration that LVS and Tengine can understand. This process supports hot configuration: it does not affect user forwarding, and no reload is required for new configuration to take effect.
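
The controller/agent split can be pictured with the hedged sketch below. The message format and the apply step are invented for illustration and do not reflect the real SLB controller protocol; the point is only that the agent translates an abstract listener definition into device-specific configuration and applies it without reloading the data plane.

```python
import json

def controller_push(listener):
    """Controller side: serialize an abstract listener config for the agents."""
    return json.dumps(listener)

def agent_apply(message):
    """Agent side: translate the abstract config into a device-specific form.

    Here we only build an in-memory dict; a real agent would program LVS
    (Layer 4) or Tengine (Layer 7) through their own interfaces, hot,
    without a reload.
    """
    cfg = json.loads(message)
    if cfg["protocol"] in ("tcp", "udp"):       # Layer 4 listener -> LVS
        return {"engine": "lvs", "vip": cfg["vip"], "rs": cfg["backends"]}
    return {"engine": "tengine", "vip": cfg["vip"], "upstream": cfg["backends"]}

if __name__ == "__main__":
    # Hypothetical listener pushed from the console via the controller.
    msg = controller_push({"vip": "203.0.113.10:80", "protocol": "tcp",
                           "backends": ["10.0.0.1:80", "10.0.0.2:80"]})
    print(agent_apply(msg))
```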

LVS

Three modes supported by LVS

Early LVS supports three modes: DR mode, TUN mode, and NAT mode. In DR mode, LVS only rewrites the MAC header and leaves the inner IP packet unchanged: after LVS selects an RS, the source MAC is rewritten to LVS's own and the destination MAC to the RS's. Because MAC addressing only works within a Layer 2 network, this mode constrains the network deployment, and in large-scale distributed clusters its flexibility cannot meet the requirements. In TUN mode, LVS encapsulates an additional IP header around the original packet; when the packet reaches the backend RS, the RS must decapsulate it to recover the original packet. In both DR and TUN mode the backend RS sees the real client source IP, and the destination IP is the VIP, so the VIP must be configured on the RS; this lets the RS reply to the user directly, bypassing LVS. The problem with TUN mode is that a decapsulation module is needed on the backend ECS; Linux supports this but Windows does not, so the user's choice of system images is limited. In NAT mode the user accesses the VIP, and after selecting an RS, LVS performs DNAT on the destination IP. Because the client IP is unchanged, the RS would otherwise route the reply directly to the real public client IP; since LVS has performed DNAT, the return packet must pass back through LVS so the header can be translated back. And because the ECS sees the real client source address, the default route on the user's ECS must point to LVS, which also limits the usage scenarios.
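
The essential difference between the three modes is which headers LVS rewrites. The toy Python sketch below models a packet as a dict and shows the rewrite each mode performs; it is purely conceptual, ignores checksums, ARP, and real encapsulation details, and all addresses are made up for illustration.

```python
def dr_forward(pkt, lvs_mac, rs_mac):
    """DR mode: only the MAC header changes; the inner IP packet is untouched."""
    pkt = dict(pkt)
    pkt["src_mac"], pkt["dst_mac"] = lvs_mac, rs_mac
    return pkt

def tun_forward(pkt, lvs_ip, rs_ip):
    """TUN mode: the original IP packet is wrapped in an outer IP header;
    the RS must decapsulate it to recover the inner packet."""
    return {"outer_src_ip": lvs_ip, "outer_dst_ip": rs_ip, "inner": pkt}

def nat_forward(pkt, rs_ip):
    """NAT mode: the destination IP (the VIP) is rewritten to the chosen RS
    (DNAT); the client source IP is kept, so the return packet must pass
    back through LVS to be translated back."""
    pkt = dict(pkt)
    pkt["dst_ip"] = rs_ip
    return pkt

if __name__ == "__main__":
    # client -> VIP packet with placeholder MAC/IP addresses
    pkt = {"src_mac": "aa:aa", "dst_mac": "bb:bb",
           "src_ip": "198.51.100.7", "dst_ip": "203.0.113.10"}
    print(dr_forward(pkt, "cc:cc", "dd:dd"))
    print(tun_forward(pkt, "192.0.2.1", "10.0.0.1"))
    print(nat_forward(pkt, "10.0.0.1"))
```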

LVS is implemented on the Netfilter framework

Netfilter is an open network framework provided by Linux; on top of it you can develop your own functional modules, and in the early days many security vendors built products on Netfilter. The model is flexible, but as a general-purpose framework it emphasizes compatibility, so the processing path is very long, and it cannot take full advantage of multiple cores. CPUs today scale out horizontally: we often see multi-socket servers with many cores per socket, but the early general-purpose model was not particularly friendly to multi-core and has shortcomings in its multi-core design, so applications developed on it have limited scalability, and as the number of cores grows performance actually decreases rather than increases.

Improvements to LVS

The limitations of the early modes restricted our development, so we first implemented FullNAT. Compared with the original NAT mode, FullNAT adds SNAT, translating the client's source IP as well. Second, we parallelized the processing to make full use of multiple cores and achieve a linear performance improvement. Then there is the fast path: when designing a forwarding model it is natural to split it into a fast path and a slow path. The slow path mainly solves how the first packet of a connection passes through the device: it may need to check ACLs or routes and evaluate various policies, after which all subsequent packets can be forwarded on the fast path. There are also instruction-level optimizations that use special Intel instructions to improve performance; in addition, on multi-core architectures with NUMA multi-node memory, accessing local node memory gives better latency.
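
The fast path/slow path split can be sketched as follows: the first packet of a connection goes through the expensive policy checks and installs an entry in a connection table, and every later packet is forwarded by a plain table lookup. The function names and table layout below are illustrative assumptions, not the real LVS data structures.

```python
# Connection table keyed by the 5-tuple of the flow.
conn_table = {}

def pick_backend(five_tuple):
    # Placeholder scheduling decision for the sketch.
    backends = ["10.0.0.1:80", "10.0.0.2:80"]
    return backends[hash(five_tuple) % len(backends)]

def slow_path(five_tuple):
    """First packet: check ACLs/routes/policy once, pick an RS, install state."""
    # ... ACL and routing checks would happen here ...
    rs = pick_backend(five_tuple)
    conn_table[five_tuple] = rs
    return rs

def fast_path(five_tuple):
    """Subsequent packets: a single lookup, no policy evaluation."""
    return conn_table.get(five_tuple)

def forward(five_tuple):
    return fast_path(five_tuple) or slow_path(five_tuple)

if __name__ == "__main__":
    flow = ("198.51.100.7", 40001, "203.0.113.10", 80, "tcp")
    print(forward(flow))   # first packet: slow path installs the entry
    print(forward(flow))   # later packets: fast path, same RS
```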

The client's packet first reaches the VIP of LVS: the source IP belongs to the client and the destination IP is the LVS VIP. After FullNAT translation, the source IP becomes a local address of LVS and the destination address becomes the RS selected by LVS. The return path is then simple: as long as the route is reachable, the packet is delivered back to LVS, and no special configuration is needed on the RS. The right side of the figure shows the DNAT plus SNAT translation on the return path, after which the packet is forwarded back to the client through LVS. The main benefit of this approach is deployment flexibility across application scenarios.
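
A hedged sketch of the FullNAT translation: on the way in, both source and destination are rewritten (SNAT to an LVS local address plus DNAT to the chosen RS), and the stored mapping is used to reverse the translation on the way back. The addresses and table layout are invented for illustration.

```python
# Per-connection NAT table: (local_addr, rs_addr) -> (client_addr, vip)
nat_table = {}

def fullnat_in(client, vip, local, rs):
    """Inbound: client->VIP becomes local->RS; remember how to undo it."""
    nat_table[(local, rs)] = (client, vip)
    return {"src": local, "dst": rs}

def fullnat_out(local, rs):
    """Return traffic: RS->local is rewritten back to VIP->client."""
    client, vip = nat_table[(local, rs)]
    return {"src": vip, "dst": client}

if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    inbound = fullnat_in(client="198.51.100.7:40001", vip="203.0.113.10:80",
                         local="192.0.2.1:10240", rs="10.0.0.1:80")
    print(inbound)                                        # local -> RS
    print(fullnat_out("192.0.2.1:10240", "10.0.0.1:80"))  # VIP -> client
```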

Parallelization improves LVS performance. The reason performance could not scale linearly before is that every path had to access global resources, which inevitably introduces lock overhead; in addition, packets of the same connection might be scattered across different cores, and accessing global resources also causes cache misses. So we use RSS to steer packets with the same five-tuple to the same CPU, ensuring that all inbound packets of the same connection are handled by the same CPU. When forwarding, each core uses a local address bound to that CPU, and by setting fdir rules, the destination address the backend RS replies to is the local address of the corresponding CPU, so the return packet is also delivered to that CPU. In this way both directions of a connection are handled by the same CPU and flows are isolated across CPUs. We also copy all configuration resources, including dynamic cache resources, onto every CPU to localize them, so that from the moment a flow enters LVS until it is forwarded out, everything it touches is a local resource pinned to one core. This maximizes performance and achieves linear scaling.
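
The per-CPU steering described above can be approximated with the sketch below: a hash of the five-tuple chooses the CPU, and each CPU uses its own local address on the outbound leg so the return packet comes back to the same CPU. The hash and the address assignment are simplified stand-ins for RSS and fdir rules, not the real mechanism.

```python
NUM_CPUS = 4
# Each CPU owns one local address; fdir-like rules would steer traffic sent
# to that address back to the same CPU (simplified here to a plain lookup).
LOCAL_ADDR = {cpu: f"192.0.2.{cpu + 1}" for cpu in range(NUM_CPUS)}

def cpu_for_flow(five_tuple):
    """RSS-style steering: the same 5-tuple always hashes to the same CPU."""
    return hash(five_tuple) % NUM_CPUS

def forward_on_cpu(five_tuple):
    cpu = cpu_for_flow(five_tuple)
    # The outbound leg uses this CPU's local address, so the RS replies to it
    # and the return packet lands on the same CPU: no cross-CPU shared state.
    return cpu, LOCAL_ADDR[cpu]

if __name__ == "__main__":
    flow = ("198.51.100.7", 40001, "203.0.113.10", 80, "tcp")
    print(forward_on_cpu(flow))
    print(forward_on_cpu(flow))   # same flow -> same CPU and local address
```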

After these improvements, LVS performs as follows:

  • For disaster recovery and higher performance, we deploy in clusters: each region has multiple data centers, each data center has multiple scheduling units, and each unit has multiple LVS devices;
  • After optimization, each LVS device delivers higher performance and larger capacity: a single LVS can reach 40 million PPS and 6 million CPS, and a single group can sustain 100 million concurrent connections;
  • High availability is supported at the region, IDC, cluster, and application level;
  • We implemented anti-attack features and, on top of the original LVS, richer functionality: management and control along various dimensions, precise statistics, traffic analysis, and so on.

Tengine

Tengine also ran into various problems in practice, the most serious being performance: we found that as the number of CPUs grew, QPS did not scale linearly. Nginx itself uses a multi-worker model in which each worker is a single process; with CPU affinity for the workers and an event-driven model inside, it already delivers high performance, and a single Nginx core can reach 15K-20K QPS. Below Nginx the first layer is the socket API, below the socket layer is VFS, and below that TCP and IP. The socket layer is thin; quantitative analysis and evaluation showed that the biggest performance costs are in the TCP stack and the VFS layer, mainly because of synchronization overhead, which is why horizontal scaling did not work. We therefore made a number of optimizations.

A Layer 7 reverse proxy has a longer path and more complex processing, so its performance is much lower than LVS's. We care about both single-machine and cluster performance: cluster performance can always be bought with more devices, but if single-machine performance does not improve, cost keeps growing. From a performance point of view, the optimization directions are:

  • Kernel-based development, for example optimizing the protocol stack;
  • Optimization based on Alisocket, a high-performance TCP stack developed by Alibaba on top of DPDK; it localizes resources and distributes packets across cores, and its performance is excellent;
  • HTTPS traffic keeps growing, so we use hardware acceleration cards to speed up encryption and decryption, together with HTTPS session reuse;
  • Performance optimization at the web transport layer.

From an elasticity point of view, some companies' applications track user hotspots: when a social hotspot occurs, traffic surges, and a load balancing model built on physical machines has limited elastic scalability. To address this we can run the reverse proxy function on VMs, monitor the load of the instances, and scale out or in according to real-time demand. Besides VMs there are also scheduling units: we can switch smoothly between units and, based on their load levels, move load balancing instances to different units to improve capacity management. Tengine itself is also deployed as clusters: within a region there are different data centers and different scheduling units, and each scheduling unit has multiple groups of devices. There are health checks from LVS to Tengine, so if a Tengine instance fails it can be removed by the health check without affecting user traffic. Tengine's flexible scheduling helps us handle more complex situations, and it also offers many advanced features such as cookie-based session persistence, domain/URL-based forwarding rules, HTTP/2, and WebSocket. Currently a single Layer 7 VIP can sustain 100K HTTPS QPS.
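
Health checking, whether from LVS to Tengine or from Tengine to the backend ECS, boils down to probing each backend and removing the ones that fail. The sketch below shows the idea with a plain TCP connect probe; the timeout, the probe type, and the backend list are assumptions for illustration, not the product's actual health-check implementation.

```python
import socket

def is_healthy(host, port, timeout=1.0):
    """A simple Layer 4 health check: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_backends(backends):
    """Keep only backends that pass the probe; the rest drop out of rotation
    without affecting traffic to the healthy ones."""
    return [(h, p) for h, p in backends if is_healthy(h, p)]

if __name__ == "__main__":
    # Hypothetical backend pool, for illustration only.
    pool = [("10.0.0.1", 80), ("10.0.0.2", 80)]
    print(healthy_backends(pool))
```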

High Availability

Group

High availability is a very important part of the product. The figure shows the high-availability architecture within a cluster: the network path is fully redundant with no single point of failure. Specifically:

  • Dual-socket servers; each node's two NIC ports uplink to different switches, increasing bandwidth and avoiding cross-node packet reception
  • VIP routes are announced on both sides with different priorities; for different VIPs the high-priority route sits on different switches
  • 160G forwarding capacity per machine, 80G bandwidth per VIP, 40G bandwidth per flow
  • A NIC failure does not affect forwarding; upstream and downstream routes switch over automatically
  • ECMP: VIP routes are announced on both sides, with priorities controlling the ingress point
  • 640G forwarding capacity per cluster, 320G bandwidth per VIP
  • Session synchronization via multicast, packet-triggered synchronization, and periodic synchronization (see the sketch after this list)
  • A single machine failure does not affect forwarding
  • A switch failure does not affect forwarding; routes switch over within seconds
  • Upgrades and changes are transparent to users; the few connections that were not synchronized in time only need to reconnect
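
As referenced in the session synchronization item above, the idea is that the active device serializes per-connection state and ships it to its peers so they can keep forwarding existing connections after a failover. The sketch below only shows state being encoded, shipped, and installed; the real implementation uses multicast plus packet-triggered and periodic sync, and the message format here is invented for illustration.

```python
import json

def encode_session(five_tuple, rs):
    """Serialize one connection's state so a peer can take over mid-connection."""
    return json.dumps({"flow": five_tuple, "rs": rs}).encode()

def install_session(conn_table, message):
    """Peer side: install the replicated state into the local connection table."""
    state = json.loads(message.decode())
    conn_table[tuple(state["flow"])] = state["rs"]

if __name__ == "__main__":
    primary_flow = ("198.51.100.7", 40001, "203.0.113.10", 80, "tcp")
    msg = encode_session(primary_flow, "10.0.0.1:80")  # sent over multicast in practice
    backup_table = {}
    install_session(backup_table, msg)
    print(backup_table)   # the backup now knows the flow -> RS mapping
```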

AZ

Each data center connects to two different routers. When one AZ fails, we can switch seamlessly to another data center. Specifically:

  • VIPs announce routes with different priorities in different AZs (switchover within seconds, fully automatic)
  • VIPs distinguish primary and backup AZs, and different VIPs have different primary/backup AZs
  • Load across multiple AZs is distributed by the control system
  • Multi-AZ disaster recovery for VIPs is provided by default
  • Cross-AZ session synchronization is not supported; after a cross-AZ switchover all connections must be re-established

Region

When a user accesses a domain name, DNS resolution can be configured to return VIP addresses in multiple regions. Looking inside a single region, if one data center fails, traffic can be switched to another availability zone and keep being forwarded; if traffic enters a data center and an LVS forwarding device fails, we can switch to another LVS; and if an RS behind LVS has problems, health checks can quickly remove it and shift traffic to healthy devices. We build high availability along multiple dimensions to satisfy user needs as much as possible.

Summary

Currently, high-performance load balancing is applied mainly in the following areas:

  1. As a basic component of the public cloud, it provides load balancing for public cloud websites, gaming customers, and apps, and also offers dedicated-cloud support for customers with high security requirements such as government and finance;
  2. It provides load balancing for Alibaba Cloud's internal products such as RDS, OSS, and Anti-DDoS;
  3. As the entry point of the e-commerce platforms, it provides unified VIP access for Taobao, Tmall, and 1688;
  4. The traffic entry of trading platforms such as Alipay and online banking also sits on the load balancing devices.

Looking ahead, we hope for better elastic scalability, higher single-machine processing capacity, active probing of users from the VIP, and full-path network monitoring.

This article is original content from the Yunqi Community. It may not be reproduced without permission; to request permission, email [email protected].
