Single-machine forwarding throughput boosted 5x to the terabit level: programmable load balancing gateway 1.0 is launched

1. Background

The load balancing gateway is a key piece of cloud network infrastructure, providing high-performance forwarding for a wide range of cloud application services.

Today, cloud gateways are generally built on general-purpose x86 servers with DPDK. BGW (BaiduGateWay), the layer-4 load balancing gateway developed in-house by Baidu Smart Cloud, has been in service since 2012 and has evolved from an initial single-machine throughput of 10 Gbps to 200 Gbps today. It is the most widely deployed gateway in the cloud network.

As Baidu Smart Cloud's business has grown, it has placed new requirements and challenges on the load balancing gateway:

  • Limited single-core compute. To prevent packet reordering, all packets of a given flow must be dispatched to the same CPU core on the same gateway (see the sketch after this list). Because single-core CPU performance has essentially stopped improving, the maximum throughput of a single flow grows slowly: even on the latest CPUs, a single flow can only reach about 10-20 Gbps in practice, and only under ideal conditions.

    If two or more large flows are scheduled onto the same CPU core at the same time, they compete for the core's limited processing capacity, reducing the overall throughput of the affected services; other traffic being processed on that core is also affected and may suffer probabilistic packet loss.

  • Unstable latency. Software processing on a CPU usually incurs higher latency than hardware forwarding. On a software gateway, every packet travels a long path: it is received by the NIC, delivered over PCIe to the DPDK driver on the CPU, processed by the gateway's business logic, handed back to the DPDK driver, and finally sent over PCIe to the NIC for transmission.

    Measurements show that Baidu Smart Cloud's current software gateways have an average per-packet processing latency of 30-50 us under typical load, and under heavier load latency can reach the millisecond level. Moreover, latency fluctuation is closely tied to how well packets hit the CPU cache, which is hard to predict. Large latency fluctuations generally have no material impact on cross-datacenter or cross-region communication, but they significantly affect latency-sensitive services within the same datacenter.

  • High TCO (Total Cost of Ownership) in high-bandwidth scenarios. Although CPU core counts keep growing, software packet processing in I/O-heavy services such as gateways does not scale linearly with the number of cores. For example, when BGW runs on an AMD Milan server with 64 physical cores, increasing the core count beyond 32 yields no significant gain in overall throughput, a phenomenon strongly related to current CPU microarchitecture (especially the L3 cache).

    In practice, today's software gateways typically commit to a bandwidth specification of roughly 100-200 Gbps (up to 400 Gbps if only large packets are considered). For a gateway cluster to support 10 Tbps of bandwidth, 50-100 servers would have to be deployed even before accounting for redundancy.
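To make the dispatch constraint concrete, here is a minimal sketch in plain C (not BGW's actual implementation): each packet's 5-tuple is hashed, and the hash taken modulo the number of worker cores picks the core. All packets of one flow therefore land on the same core, which preserves ordering, but it also means two unrelated elephant flows can collide on a single core.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical 5-tuple key identifying a flow. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a hash over the tuple fields; a stand-in for NIC RSS hashing. */
static uint32_t flow_hash(const struct five_tuple *t)
{
    uint32_t words[3] = {
        t->src_ip, t->dst_ip,
        ((uint32_t)t->src_port << 16) | t->dst_port
    };
    const uint8_t *p = (const uint8_t *)words;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(words); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    h ^= t->proto;
    h *= 16777619u;
    return h;
}

/* Every packet of a flow maps to the same core, so no reordering occurs,
 * but two unrelated elephant flows may still hash onto the same core. */
static unsigned dispatch_core(const struct five_tuple *t, unsigned n_cores)
{
    return flow_hash(t) % n_cores;
}

int main(void)
{
    struct five_tuple a = { 0x0a000001, 0x0a000002, 40001, 80,  6 };
    struct five_tuple b = { 0x0a000003, 0x0a000004, 40002, 443, 6 };

    printf("flow A -> core %u\n", dispatch_core(&a, 8));
    printf("flow B -> core %u\n", dispatch_core(&b, 8));
    return 0;
}
```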

In summary, as business volume grows and individual gateways are pushed harder, software gateways built on general-purpose x86 servers can no longer keep up: their throughput is difficult to improve further, and under heavy traffic the CPUs become saturated, causing problems such as packet loss.

2. Solution

To meet these evolving business needs, Baidu Smart Cloud built its third-generation programmable gateway platform, UNP (Universal Networking Platform). UNP combines x86 CPUs, a programmable switching chip, and FPGA accelerator cards into a scalable, heterogeneous converged gateway platform. Compared with the x86 software gateway platform, UNP offers the following advantages:

  • The programmable switching chip provides terabit-level bandwidth throughput;

  • By pairing the hardware switching chip with x86 CPUs, it can run hardware gateways and traditional software gateways side by side, offering strong flexibility and hyper-converged capability;

  • Expansion slots allow additional hardware acceleration capabilities to be added.

The UNP platform makes full use of the combined strengths of software and hardware. In January 2023, Baidu Smart Cloud launched the programmable load balancing gateway UNP-BGW 1.0, which effectively addresses requirements such as high bandwidth, elephant flows, and low latency, bringing the following benefits to the load balancing gateway:

  • Combining software and hardware with session-table offloading to hardware raises gateway throughput, allowing a single gateway to deliver terabit-level bandwidth while greatly reducing the cost of running gateways.

  • It lowers the network latency of gateway products and eliminates gateway packet loss and jitter in high-load scenarios.

The programmable load balancing gateway UNP-BGW 1.0 consists of two main parts: the x86 gateway and the programmable switching chip.

The x86 part still uses DPDK to handle management and control-plane configuration, routing control, session management, and load balancing forwarding for packets that are not offloaded. Viewed from this angle alone, it resembles a dual-NUMA deployment of X86-BGW.

The routing control design is shown in the figure below. Two NICs appear in user space as standard network interfaces, with their other ends wired directly to the programmable switching chip. Vnic0 through VnicN in the figure are virtual network devices created by the drivers that ship with the programmable switching chip. These devices serve two main purposes, sending and receiving routing packets and packet-capture diagnostics:

  • They send and receive BGP routing protocol packets, establishing a routing adjacency with the switch and attracting traffic to the device (a small classification sketch follows the figure below);

  • When packet capture is needed, traffic entering a port can be mirrored and sent to the x86 user-space virtual network device, where it can be captured for troubleshooting.

[Figure: routing control design, with virtual devices Vnic0-VnicN connected to the programmable switching chip]
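As an illustration of the control-plane role, here is a minimal sketch in plain C (not the switch driver's actual API): BGP runs over TCP port 179, so an IPv4/TCP packet to or from that port is control traffic destined for the routing stack, while traffic delivered on a mirror device is treated as diagnostic capture. The frame-layout assumptions are noted in the comments.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BGP_TCP_PORT 179  /* BGP's well-known TCP port */

/* Return true if the frame is an IPv4/TCP packet to or from the BGP port.
 * Assumes an untagged Ethernet frame carrying IPv4; purely illustrative. */
static bool is_bgp_packet(const uint8_t *frame, size_t len)
{
    if (len < 14 + 20 + 20)                       /* Eth + min IP + min TCP */
        return false;
    if (frame[12] != 0x08 || frame[13] != 0x00)   /* EtherType IPv4 (0x0800) */
        return false;

    const uint8_t *ip = frame + 14;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;      /* IPv4 header length */
    if (ip[9] != 6 || len < 14 + ihl + 20)        /* IP protocol 6 = TCP */
        return false;

    const uint8_t *tcp = ip + ihl;
    uint16_t sport = (uint16_t)((tcp[0] << 8) | tcp[1]);
    uint16_t dport = (uint16_t)((tcp[2] << 8) | tcp[3]);
    return sport == BGP_TCP_PORT || dport == BGP_TCP_PORT;
}

int main(void)
{
    /* Minimal fake frame: Ethernet + IPv4 (IHL=5, proto=TCP) + TCP dport 179. */
    uint8_t frame[54] = {0};
    frame[12] = 0x08; frame[13] = 0x00;           /* EtherType IPv4 */
    frame[14] = 0x45;                             /* version 4, IHL 5 */
    frame[14 + 9] = 6;                            /* protocol TCP */
    frame[34 + 3] = 179;                          /* TCP destination port 179 */

    printf("BGP packet? %s\n", is_bgp_packet(frame, sizeof(frame)) ? "yes" : "no");
    return 0;
}
```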

The data forwarding design is shown in the figure below. Unlike an ordinary X86-BGW, UNP-BGW splits forwarding into two paths:

  • Fast-Path: traffic that hits a session is forwarded by the programmable ASIC hardware, providing terabit-level forwarding capacity and microsecond-level latency;

  • Slow-Path: traffic that misses the session table is sent to the CPU, and configuration determines whether the session is then pushed down to the fast path (a minimal sketch of this decision follows the figure below).

[Figure: Fast-Path and Slow-Path forwarding in UNP-BGW]
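The split boils down to a single lookup decision. Below is a minimal sketch in plain C, where the session table and its API are hypothetical stand-ins rather than the ASIC's actual interface: if the flow key hits the hardware session table, the packet stays on the fast path; on a miss it is punted to the CPU slow path, which may later install a session.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum path { FAST_PATH, SLOW_PATH };

/* Hypothetical flow key and a tiny fixed-size "hardware" session table. */
struct flow_key { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; uint8_t proto; };

#define TABLE_SIZE 1024
static struct { bool valid; struct flow_key key; } session_table[TABLE_SIZE];

static unsigned slot_of(const struct flow_key *k)
{
    /* Trivial hash; a real ASIC uses its own hashing/TCAM lookup. */
    return (k->src_ip ^ k->dst_ip ^ k->src_port ^ k->dst_port ^ k->proto) % TABLE_SIZE;
}

static bool key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Fast path if the flow already has a hardware session; otherwise punt to CPU. */
static enum path classify(const struct flow_key *k)
{
    unsigned s = slot_of(k);
    if (session_table[s].valid && key_equal(&session_table[s].key, k))
        return FAST_PATH;
    return SLOW_PATH;
}

/* Called by the slow path once it decides a session should be offloaded. */
static void offload_session(const struct flow_key *k)
{
    unsigned s = slot_of(k);
    session_table[s].valid = true;
    session_table[s].key = *k;
}

int main(void)
{
    struct flow_key k = { 0x0a000001, 0x0a000002, 40001, 80, 6 };

    printf("before offload: %s\n", classify(&k) == FAST_PATH ? "fast" : "slow");
    offload_session(&k);
    printf("after offload:  %s\n", classify(&k) == FAST_PATH ? "fast" : "slow");
    return 0;
}
```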

When UNP-BGW receives the first connection-setup packet of a service flow, the ASIC finds no matching session in hardware, so the packet takes the Slow-Path and is sent to the CPU for new-session processing.

BGW periodically collects per-session traffic statistics. When a flow has been in the forwarding state for a certain period of time and its bandwidth (bps) or packet rate (pps) exceeds the configured threshold, BGW classifies it as an elephant flow and offloads the session to the programmable ASIC hardware.
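The offload decision can be sketched as follows in plain C, with hypothetical threshold values (the real values are configuration-dependent): counters are sampled once per interval, per-flow bps and pps are computed over that interval, and a session that has been forwarding long enough and crosses either threshold is marked for offload.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical thresholds; the real values are configurable. */
#define MIN_ACTIVE_SECONDS   5               /* flow must have forwarded this long */
#define BPS_THRESHOLD        1000000000ULL   /* 1 Gbps */
#define PPS_THRESHOLD        200000ULL       /* 200 kpps */

struct session_stats {
    uint64_t bytes, packets;                 /* counters at the current sample */
    uint64_t prev_bytes, prev_packets;       /* counters at the previous sample */
    uint32_t active_seconds;                 /* time spent in the forwarding state */
    bool     offloaded;                      /* already pushed to the ASIC? */
};

/* Called once per 1-second sampling interval for each session. */
static bool should_offload(struct session_stats *s)
{
    uint64_t bps = (s->bytes - s->prev_bytes) * 8;   /* bits in the last second */
    uint64_t pps = s->packets - s->prev_packets;     /* packets in the last second */

    s->prev_bytes = s->bytes;
    s->prev_packets = s->packets;

    return !s->offloaded &&
           s->active_seconds >= MIN_ACTIVE_SECONDS &&
           (bps >= BPS_THRESHOLD || pps >= PPS_THRESHOLD);
}

int main(void)
{
    struct session_stats s = {0};
    s.active_seconds = 10;
    s.bytes = 2ULL * 1000 * 1000 * 1000;   /* ~16 Gbit moved in the last second */
    s.packets = 1500000;

    if (should_offload(&s)) {
        s.offloaded = true;
        printf("elephant flow detected, offloading session to ASIC\n");
    }
    return 0;
}
```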

When a connection is actively closed or times out due to inactivity, the CPU promptly ages out the corresponding session in the programmable ASIC hardware so that hardware resources are used efficiently.
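Aging can be sketched the same way, assuming a hypothetical CPU-driven idle-timeout check: a session is removed from the hardware table when the connection has been closed or when no packet has been seen for longer than the timeout.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define IDLE_TIMEOUT_SECONDS 120   /* hypothetical no-traffic timeout */

struct hw_session {
    bool     in_hardware;     /* currently offloaded to the ASIC */
    bool     closed;          /* FIN/RST seen: connection actively ended */
    uint64_t last_packet_ts;  /* timestamp of the last packet, in seconds */
};

/* Periodic aging pass run by the CPU; returns true if the entry was removed. */
static bool age_session(struct hw_session *s, uint64_t now)
{
    if (!s->in_hardware)
        return false;
    if (s->closed || now - s->last_packet_ts > IDLE_TIMEOUT_SECONDS) {
        s->in_hardware = false;   /* free the ASIC table entry for reuse */
        return true;
    }
    return false;
}

int main(void)
{
    struct hw_session s = { .in_hardware = true, .closed = false,
                            .last_packet_ts = 1000 };

    printf("aged at t=1050? %d\n", age_session(&s, 1050));  /* still active: 0 */
    printf("aged at t=1200? %d\n", age_session(&s, 1200));  /* idle too long: 1 */
    return 0;
}
```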

Compared with traditional software gateways, UNP-BGW 1.0 delivers the following improvements:

  • Capacity: single-machine bandwidth throughput is increased more than 5x, from 200 Gbps to over 1 Tbps;

  • Latency: average forwarding latency is reduced by more than 20x, from around 100 us under high load to less than 4 us, with no jitter, making forwarding faster;

  • Packet loss rate: reduced from roughly 1 in 100,000 to 1 in hundreds of millions, making the network more reliable;

  • Cost: higher single-machine forwarding capacity lowers deployment cost when carrying large-scale traffic;

  • Energy consumption: with fewer machines required, overall energy consumption for terabit-level throughput drops by more than 50%, reducing carbon emissions.

3. Typical case

One customer whose workload reads and writes storage frequently generates a single elephant flow of about 15 Gbps when reading business data. On the X86-BGW software gateway cluster, CPU usage on a single gateway reached as high as 90%, affecting the forwarding of other services' traffic.

[Figure: CPU usage on an X86-BGW gateway while carrying the elephant flow]

After the customer switched to the UNP-BGW gateway, a single elephant flow reached 16 Gbps under the same read/write conditions, and single-CPU usage after traffic offload dropped below 1%.

[Figure: CPU usage on UNP-BGW after session offload]

The programmable load balancing gateway UNP-BGW 1.0 brings significant improvements in bandwidth, latency, packet loss rate, and other performance indicators. It is already being used to accelerate the BOS object storage service: on the service network card product page of the Baidu Smart Cloud website, select the BOS access point when creating a service network card.

[Figure: selecting the BOS access point when creating a service network card]

In version 1.0, the ASIC hardware has only a few hundred megabits of on-chip storage, so session table capacity is limited. Adding an FPGA accelerator card can expand the capacity for large tables and further enhance the programmable load balancing capability.

Baidu Smart Cloud is accelerating the launch of UNP-BGW 2.0, which offers greater offload capacity and meets the load balancing needs of tens of millions of high-bandwidth sessions.

