Frontier Advances in Open Switches and Networking Technology

At the SDN/NFV Application and Technology Innovation sub-forum of the Third Future Network Development Conference, Dr. Zhou Yongkai of the China UnionPay National Engineering Laboratory for E-Commerce and E-Payment delivered a keynote speech titled "Frontier Progress in Open Switches and Networking Technology".
The talk had three parts. The first covered frontier progress in open switches; the second covered networking technology, in particular cloud-native networking in data centers; the third covered open switch research and verification carried out in the financial sector. This article summarizes the first and second parts.

The network technology stack

Dr. Zhou began with the network technology stack, which can be divided into a control plane and a data plane. On the control plane, northbound interfaces include the Neutron interface, the K8S CNI interface, and the intent-declaration interfaces of IBN (intent-based networking), which is currently popular in the SDN community. Open source network controllers include ODL and ONOS; commercial ones include Cisco ACI, Huawei AC, and others. Among southbound interfaces, the best known is OpenFlow and the newest is P4Runtime, although traditional device vendors may prefer BGP, NETCONF, or OpFlex.

On the data plane, a switch can be divided into the switch operating system, the hardware abstraction layer, and the switching chip. The switch operating system is currently the focus of competition in open source networking; open source systems include SONiC, FBOSS, and others. For the hardware abstraction layer, the best developed today is SONiC's SAI layer. Switching chips are moving toward programmability. Open networking remains the trend: the items marked in red on the left of the figure above are open source projects, and open source software can already cover the full network stack. An open switch ecosystem of standard hardware plus layered open source software is gradually taking shape.

Technical characteristics of the open switch

Dr. Zhou summarized the technical features of open switches in three points. First, small switches can build large networks: compared with the large, complex chassis switches of the past, standard small box switches can now scale out into a very large network. Second, open control on standard hardware: how to build a more controllable, more streamlined network operating system on top of the whole hardware system. Third, programmable switching chips: for SDN, chip programmability is the most thorough form, because it pushes the software-defined boundary down to the forwarding pipeline. These three points are described in detail below.

Building large networks from small switches

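Before discussing how small switches build large networks, consider the difference between chassis and box switches. A chassis switch connects multiple line cards through a switching backplane, and its internal wiring is itself a CLOS structure. A large chassis can therefore be built by combining small switches. Using small switches has the following advantages: 1) Small switches are cheaper, which saves cost. 2) The architecture is scalable, whereas once a chassis switch has been designed its capacity is completely fixed. 3) Controllability is higher, although management difficulty also gradually increases. The figure below compares three-tier networks built from chassis switches and from box switches. With chassis switches, a network of a few thousand nodes can be handled by a single chassis, but tens of thousands of nodes require stacking chassis on chassis. Each chassis contains at least 3 chips, so a path from one end to the other passes through 11 hops, while a three-tier network of box switches needs only 5 hops. Box switches therefore have an advantage in latency and hop count.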
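
The hop counts can be reproduced with a small tally. This is only a sketch: the talk gives the totals (11 and 5) and the figure of at least 3 chips per chassis, while the per-device breakdown below, with single-chip top-of-rack boxes at both ends of the chassis path, is an assumption chosen to match the stated total.

```python
# Chips traversed on a leaf -> aggregation -> core -> aggregation -> leaf path.
# Assumption (not from the talk): the chassis path still starts and ends on a
# single-chip top-of-rack box, with three 3-chip chassis in between.
BOX_PATH = [1, 1, 1, 1, 1]
CHASSIS_PATH = [1, 3, 3, 3, 1]


def chip_hops(path):
    """Total number of switching chips a packet crosses along the path."""
    return sum(path)


print("box switches:    ", chip_hops(BOX_PATH), "hops")
print("chassis switches:", chip_hops(CHASSIS_PATH), "hops")
```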

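The port density of a single box-switch chip is already very high: 12.8T chips (128 100G ports) are already available, so a three-stage CLOS can build a very large network. Concretely, a 2-stage CLOS forms the basic unit, a POD, which can attach several thousand machines. Extending to a 3-stage CLOS can interconnect roughly 100,000 servers without blocking, which is more than enough for a single data center. The scale of a three-stage CLOS network depends on the port density of the middle-layer switches.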
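
As a rough check on these numbers, the sketch below uses the textbook fat-tree (folded CLOS) sizing rule for identical switches with k ports each, with half the ports facing down and half facing up. The function names are illustrative, and real designs deviate from this ideal (mixed port speeds, oversubscription, ports reserved for other uses), so treat the results as upper bounds rather than as the figures from the talk.

```python
def pod_hosts(k: int) -> int:
    """Hosts in one non-blocking 2-stage pod built from k-port switches.

    A pod has k/2 access switches, each with k/2 host-facing ports.
    """
    return (k // 2) * (k // 2)


def three_stage_hosts(k: int) -> int:
    """Hosts in a non-blocking 3-stage fat-tree: k pods of k^2/4 hosts each."""
    return k * pod_hosts(k)


for radix in (32, 64, 128):
    print(radix, pod_hosts(radix), three_stage_hosts(radix))
```

With k = 128, a pod holds 4,096 hosts and the full fabric 524,288; with k = 64 the fabric holds 65,536. The talk's figure of around 100,000 servers lies between these two idealized cases.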

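At this year's OCP, Facebook released its latest data center network design, F16. Its predecessor is the classic F4 design from a few years ago, whose basic unit has 48 access switches and 4 middle-layer switches, so that each POD can connect about 1,000 servers. The top-layer Spine switches are then interconnected through CLOS to scale to tens of thousands of servers, with multiple redundant paths between any two server nodes for load sharing. In the latest F16, the middle layer is changed to 16*100G interconnects, and the top layer has 36 Spine switching planes. With six buildings, this interconnection scheme can also fully mesh the switch fabrics of six AZs.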

Open control of the network

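In this part, Dr. Zhou first discussed routing control, which comes in two kinds: traditional routing control and SDN routing control. For traditional routing control, Dr. Zhou compared the open source network operating systems SONiC and Stratum.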

SONiC
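The figure below shows a simplified architecture of SONiC. The control plane implements only the most essential protocol, BGP, to guarantee connectivity across the large-scale layer-3 networks of cloud data centers. The core of the data plane is the SAI layer, which matters because its ecosystem is well developed and it supports a large number of chips underneath. Users can support SAI on a purely programmable chip through Switch.p4, or implement the SAI interface on chips from Broadcom, Centec, and others, ultimately mapping to the physical chip target.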

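Among open switch innovations, de-stacking deserves a mention. To ensure high availability, a server is usually dual-homed to two switches, so that if one switch goes down the other can take over. In the figure above there are two links between TOR1 and TOR2; these stacking links synchronize state such as MAC and ARP tables. The most extreme approach to high availability is to virtualize the two switches into a single control plane: a user logging in to TOR1 and TOR2 finds that their management addresses are identical. The degree of virtualization is high, but so are the added complexity and instability. In response, Alibaba proposed an innovative de-stacking technique (VPC-lite). The idea is that the server's bond interface sends ARP on both links, so TOR1 and TOR2 no longer need to synchronize ARP tables. When a link fails, the change is explicitly advertised through BGP. This achieves the same effect as before, but the stacking links are gone, the switches are independent of each other, and the implementation is much simpler.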

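SONiC has become a flagship of OCP. OCP is mainly hardware-focused, and on the software side it now promotes SONiC, currently the open switch operating system with the most mature ecosystem. It was created by Chinese engineers at Microsoft; its design is lean and forward-looking, and many of its components are well modularized. In terms of deployments, Microsoft has deployed SONiC in 44 regions worldwide, and LinkedIn currently runs SONiC at scale in 40% of its data centers. OCP has also specifically highlighted China's contribution to SONiC: led by Alibaba, ODCC (China's open data center alliance) set up a dedicated Phoenix project responsible for promoting SONiC in China. Alibaba was the earliest to run SONiC in production and is among the larger deployments. Tencent, Baidu, and JD.com are carrying out intensive verification testing and will go into production soon.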

Stratum
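Compared with SONiC, Stratum's philosophy leans more toward computing: it manages the whole CT system in an IT way, which is an interesting approach. At the top of the design is a remote controller, and the interfaces fall into three categories: P4Runtime, gNMI, and gNOI. The g stands for gRPC, as opposed to the NETCONF used by traditional network devices, which makes pushing policies much more efficient. The blue box in the figure below marks the scope covered by Stratum.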

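Stratum cannot build a working network on its own. At the upper layer it needs ONOS or another controller, and at the lower layer the Trellis component provides routing control for the SDN fabric. The system uses pure SDN path selection, so when a link goes down it can respond quickly and reprogram the forwarding entries, which also avoids complications such as de-stacking.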

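The Stratum project was initiated by Google, so Google must already be using Stratum at scale internally (though Google's controller is not ONOS). The project is expected to be formally open sourced in June this year. In China, Tencent led a Stratum Developer Day around December last year, and UCloud, Alibaba, and Ruijie are also actively following or closely watching the project.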

RDMA
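Within open control of the network, SDN addresses routing control, while RDMA addresses traffic control. What kind of traffic? Consider the figure below. If hosts send to each other point to point in pairs, the switch is under little pressure: however much traffic each pair generates, a switching chip with a line-rate guarantee can handle it. In a many-to-one (incast) pattern, however, no switching chip is powerful enough. The problem can only be solved at the source, by cutting each large flow to one third of its original rate so that the egress port can cope. The most common way to throttle at the source is TCP's end-host flow control, but it has a drawback: it is slow, and by the time feedback from the receiver arrives, the switch may already have dropped packets. Hence RDMA, which allows end-to-end flow control along the whole path, with the entire network participating in backpressure against congestion.
This many-to-one pattern often appears in big data training scenarios. It is also especially pronounced on 25G and 100G networks: they are so fast that switch buffers cannot hold out for long, and once congestion occurs the buffers overflow quickly. RDMA is therefore mostly used on 25G/100G networks.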
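
The arithmetic behind the incast problem can be sketched as follows. The buffer size and sender count are hypothetical values picked for illustration, and the model is deliberately crude: every sender transmits at line rate into one egress port of the same speed, and the whole buffer is available to that port.

```python
def fair_share_gbps(link_gbps: float, senders: int) -> float:
    """Rate each sender must fall back to so the egress port is not overloaded."""
    return link_gbps / senders


def buffer_fill_time_us(buffer_mb: float, link_gbps: float, senders: int) -> float:
    """Microseconds until the buffer overflows under N-to-1 line-rate incast."""
    excess_gbps = link_gbps * (senders - 1)   # arrival rate minus drain rate
    if excess_gbps <= 0:
        return float("inf")                   # a single sender never overfills the port
    buffer_bits = buffer_mb * 8e6             # 1 MB taken as 10^6 bytes
    return buffer_bits / (excess_gbps * 1e9) * 1e6


# Hypothetical 12 MB buffer, 3 senders converging on one receiver.
for speed in (10, 25, 100):
    t = buffer_fill_time_us(12, speed, 3)
    print(f"{speed}G: each sender must drop to {fair_share_gbps(speed, 3):.1f} Gbps, "
          f"buffer lasts about {t:.0f} us")
```

Under these assumptions the buffer lasts ten times longer at 10G than at 100G, which is why feedback that needs a round trip to arrive can come too late on the faster links.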

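The figure below shows how RDMA is implemented. On the network side, parameters such as PFC and ECN must be tuned; choosing these parameters is the hardest part of RDMA as a whole. The smart NIC moves data remotely and also relieves the CPU of the flow control burden. Finally, the existing TCP stack has to be rewritten and replaced with the RoCEv2 verbs interface. The ultimate goals of RDMA are high throughput, low latency, and no packet loss.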
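
To give a feel for what tuning ECN involves, here is a minimal sketch of threshold-based (RED-style) ECN marking on a switch queue, the general scheme commonly used with RoCEv2 congestion control. The parameter names kmin, kmax, and pmax follow common usage, but the values are hypothetical and this is not any vendor's configuration.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Probability that an arriving packet gets an ECN congestion mark."""
    if queue_kb <= kmin_kb:
        return 0.0                      # short queue: never mark
    if queue_kb >= kmax_kb:
        return 1.0                      # long queue: mark every packet
    # In between, the probability rises linearly from 0 to pmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


# Hypothetical thresholds: kmin = 100 KB, kmax = 400 KB, pmax = 0.2
for depth in (50, 250, 500):
    print(depth, ecn_mark_probability(depth, 100, 400, 0.2))
```

Setting the thresholds too low marks packets needlessly and costs throughput; setting them too high lets the queue grow until PFC pauses kick in or packets are dropped. That trade-off is one reason these parameters are hard to choose.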

RDMA was first used in scientific computing, where it was a relatively closed and expensive technology. On Ethernet, RDMA is mainly used in big data computing, deep learning, distributed storage, and other scenarios that need high throughput and low latency. RDMA is now fairly widely used: Microsoft was the first to apply it in cloud data centers, and BAT and other Internet companies use it mainly for AI training and distributed storage. Huawei has also launched AIFabric, a heavyweight product. It is worth noting that the RDMA network stack has one single point of dependency: Mellanox smart NICs. Mellanox has contributed greatly to RDMA and is itself a principal inventor of InfiniBand and RDMA technology. In March this year, Nvidia agreed to acquire this Israeli semiconductor company for 6.9 billion US dollars; in the future, data in GPU memory may also be moved remotely through RDMA. In the financial sector, China Merchants Bank and Shanghai Pudong Development Bank have carried out in-depth verification and production deployment, respectively, and China UnionPay (CUP) is also running verification.

Programmable switching chips (see more here)

Origin blog.51cto.com/14355923/2403162