Multicore networking in Linux user space with no performance overhead

Dronamraju Subramanyam, John Rekesh and Srini Addepalli, Freescale Semiconductor

FEBRUARY 26, 2012

In this Product How-To Design article, the Freescale authors discuss multicore network SoCs and how to leverage them efficiently for data path processing, the limitations of current software programming models, and how to use the VortiQa zero-overhead user space software framework in designs based on the QorIQ processor family.

System-on-chip architectures incorporating multiple general purpose CPU cores along with specialized accelerators have become increasingly common in the networking and communications industry.

These multi-core SoCs are used in network equipment including layer 2/3 switches and routers, load balancing devices, wireless base stations, and security appliances, among others. Network equipment vendors have traditionally used ASICs or network processors for datapath processing, but are migrating to multi-core SoCs.

Multi-core SoCs offer high performance and scalability, and include multiple general purpose cores and acceleration engines with in-chip distribution of workloads. However, exploiting their capabilities requires intimate knowledge of SoC hardware and deep software expertise.

In this article we discuss multi-core SoC capabilities and how to leverage these capabilities efficiently for data path processing, limitations of current software programming models, and finally discuss a zero-overhead user space software framework.

Multicore SoC Hardware Elements 
As shown in Figure 1 below, a multicore SoC has multiple general purpose cores that run application software. It has hardware units that assist with data path acceleration. Incoming packets are usually directed toward the general purpose cores, where application processing takes place.

Figure 1. A multicore SoC has multiple general purpose cores that run application software. It has hardware units that assist with data path acceleration. 

Application cores make use of hardware accelerator engines to offload standard processing functions. Implementing networking applications on a multi-core SoC requires certain basic capabilities from the SoC:

1. Partitioning: The SoC must provide the flexibility to partition the available general purpose cores to run multiple application modules, or even different applications.

2. Parsing, classification and distribution: Once partitioned, there must be flexibility and intelligence in the hardware to parse and classify incoming packets, and then direct them to appropriate partitions and/or cores.

3. Queuing and scheduling: Once parsing is complete, the hardware must have a mechanism to direct each incoming packet to the desired processing unit or core. This requires a queuing and scheduling unit within the hardware.

4. Look-aside processing: The queuing and scheduling unit must manage the flow of packets between cores and acceleration engines. Cryptography, pattern matching, compression/decompression, de-duplication, timer management, and protocol processing (IPsec, SSL, PDCP, etc.) are some standard examples of acceleration units in multicore SoCs.

5. Egress processing: The queuing and scheduling unit must direct packets to their interface destinations at a very high rate. Here, QoS algorithms for shaping and congestion avoidance are required to offload these standard tasks from application cores.

6. Buffer management: Packet buffers need to be allocated by hardware, and often freed by hardware as packets leave the SoC. Therefore hardware packet buffer pool managers are a necessity.

7. Interfaces to cores: The multi-core SoC architecture needs to present a unified interface to the cores for working with the packet processing units (a minimal sketch of such an interface follows this list).

8. Semi-autonomous processing: Semi-autonomous processing of flows without intervention from cores is desired to offload some processing tasks from the cores. A few multi-core SoCs provide programmable micro engines that enable ingress acceleration on incoming packets, performing functions such as IP reassembly, TCP LRO or IPsec before packets are given to the cores.
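
To make requirement 7 concrete, the sketch below shows what such a unified core-side interface might look like in C. It is a minimal illustration, not a real API: every name (soc_portal_dequeue, struct frame_desc, and so on) is hypothetical, standing in for the portal primitives that a particular SoC family actually exposes.

/* Hypothetical unified core-side interface to the SoC's queuing/
 * scheduling and buffer management units.  All names are
 * illustrative, not a real vendor API. */

#include <stdint.h>

/* Descriptor the hardware hands to a core for each received frame. */
struct frame_desc {
    void     *data;       /* pointer into a hardware-managed buffer pool */
    uint32_t  len;        /* frame length in bytes */
    uint32_t  flow_hash;  /* hash precomputed by the parse/classify unit */
    uint16_t  queue_id;   /* frame queue the packet arrived on */
};

/* Dequeue up to 'max' frames from this core's portal; returns count. */
int soc_portal_dequeue(struct frame_desc *fd, int max);

/* Enqueue a frame to an egress or accelerator queue. */
int soc_portal_enqueue(uint16_t queue_id, const struct frame_desc *fd);

/* Release a buffer back to the hardware buffer pool manager. */
void soc_buffer_free(void *data);

Note that the parse results (the flow hash) and buffer ownership travel with the descriptor, so application code never touches device registers directly.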

Multicore SoC Software Programming Methods
Two models are prevalent in software programming for packet processing on the cores. One is pipeline processing, where functionality is split across cores and packets are processed in pipeline fashion from one set of cores to the next, as shown in Figure 2 below.

Figure 2. The pipeline processing model splits functionality across cores; packets move through the pipeline from one set of cores to the next.

The more popular model is a run-to-completion model, where each core or a set of cores executes the same processing on a packet as shown in Figure 3 below.

Figure 3. In the run-to-completion network execution model, each core or a set of cores executes the same processing on a packet.

In the run-to-completion model, effective load balancing of packets across cores is important. It is also important to preserve packet ordering in flows, as network devices are not expected to cause re-ordering of packets in a flow.

This means that the packet scheduling unit should be intelligent enough to support a mechanism that ensures that packets of a flow are not sent to more than one core at the same time.

Otherwise the cores could complete processing of those packets at slightly different times and send them out in a different order. Thus order preservation mechanisms are an important part of the hardware scheduling unit, which can be leveraged by run-to-completion applications that are flow order sensitive.
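
A minimal run-to-completion worker built on the hypothetical portal interface sketched earlier illustrates how these pieces fit together; process_packet stands in for the application's own logic. Because the hardware scheduler serializes each flow, the loop needs no ordering locks.

/* Run-to-completion worker loop (sketch).  Reuses the hypothetical
 * soc_portal_* interface and struct frame_desc from the earlier
 * listing. */

#define BURST    8
#define EGRESS_Q 100              /* illustrative egress queue id */

extern int process_packet(struct frame_desc *fd);  /* application logic */

static void worker_loop(void)
{
    struct frame_desc burst[BURST];

    for (;;) {
        /* The hardware never hands packets of one flow to two cores
         * at once, so per-flow order is preserved without locks. */
        int n = soc_portal_dequeue(burst, BURST);

        for (int i = 0; i < n; i++) {
            if (process_packet(&burst[i]) == 0)
                soc_portal_enqueue(EGRESS_Q, &burst[i]);  /* forward */
            else
                soc_buffer_free(burst[i].data);           /* drop */
        }
    }
}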

It is often possible to combine pipelining with run-to-completion, where one group of cores is dedicated to certain application functions, another group to a different set of functions, and so on.

Within a group, all cores perform the same application processing on every packet and, once finished, hand the packet off to the next group of cores, which implements a different set of application functions.

High-performance Data Plane Processing
The software architecture of networking equipment typically comprises data, control, and management planes. The data plane, also called the fast path, represents packet flows that have been validated and admitted into the system, and avoids expensive per-packet policy processing.

Packets representing flows in the data plane pass through an efficient and optimized processing path, including some hardware accelerators. For example, a web download of a music file may move through the data plane of a device in the network path, after the device has established the flow as valid by processing the initial packets of that download in the control plane.

The control plane checks and enforces policy decisions that can result in establishing, removing or modifying flows in the data plane. It runs protocols or portions of protocols that deal with these aspects.

The management plane handles configuration of the device, such as installing policies or creating or removing virtual instances. It also manages other operational information and notifications such as device alerts. The rest of this article concentrates mainly on data plane processing.

Data plane processing on the network
Data plane processing in different network devices tends to use similar types of operations. Multicore SoCs accelerate and substantially improve performance of data plane processing, by providing mechanisms that address common data path processing elements. Typical data plane processing involves steps from ingress to egress, as illustrated in Figure 4 below.

Figure 4. In the typical network, data plane processing involves execution of multiple steps from ingress to egress.

Packet ingress involves parsing, classification and activating the right application module to handle the packet. This is now facilitated in hardware, such as by a parse/classify/distribute unit. Packet (protocol) integrity checks may also be conducted at this stage.

The next step is core-based packet processing, which begins by locating the context, or flow, associated with the packet within the data plane. Much of the policy-related processing by application modules need not happen per packet; in many cases only the first packet (or first few packets) of a flow requires it.

When a flow context is not found in the data plane, the packet is sent to the control plane for policy lookup and enforcement. If policy allows, the control plane creates a flow context within the data plane. Further packets of the flow are matched against this context and are processed fully within the data plane.

A flow is typically defined by an N-tuple, a set of fields extracted from the packet. A hash table lookup using these fields is the most common implementation of a flow lookup to find its context. Both the extraction of the necessary fields and the required hash computation can be offloaded to the hardware parsing unit of an SoC.
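
The sketch below shows such a lookup, reusing the flow_hash that the hypothetical descriptor from the earlier listing carries, so the core only walks a short collision chain. The structure layout and names are illustrative assumptions, not any particular SoC's API.

#include <stdint.h>
#include <string.h>

/* 5-tuple flow key; in practice these fields are extracted by the
 * hardware parse unit rather than by software.  The struct is
 * assumed zero-initialized so the memcmp below is padding-safe. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct flow_ctx {                 /* per-flow state set up by the control plane */
    struct flow_key  key;
    struct flow_ctx *next;        /* hash-bucket collision chain */
    /* ... application state, counters, policy verdict ... */
};

#define FLOW_TABLE_SIZE (1u << 20)           /* ~1M buckets */
static struct flow_ctx *flow_table[FLOW_TABLE_SIZE];

/* Look up a flow using the hash the hardware already computed. */
static struct flow_ctx *flow_lookup(const struct flow_key *key,
                                    uint32_t hw_hash)
{
    struct flow_ctx *f = flow_table[hw_hash & (FLOW_TABLE_SIZE - 1)];

    for (; f != NULL; f = f->next)
        if (memcmp(&f->key, key, sizeof(*key)) == 0)
            return f;             /* hit: process entirely in the data plane */

    return NULL;                  /* miss: punt the packet to the control plane */
}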

Within data plane processing stages there can be multiple application modules that need to process the packet in sequence. Each of these modules in the data plane may have its own control plane module that handles application specific flows.

An efficient communication mechanism between data plane and control plane modules is therefore required. This is essentially a core-to-core communication mechanism, facilitated by the hardware.
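
Where the hardware does not provide dedicated message queues between cores, the usual software realization of this channel is a lock-free single-producer, single-consumer ring shared between a data plane core and its control plane peer. A minimal C11 sketch, with illustrative names:

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024            /* must be a power of two */

/* SPSC ring: exactly one core produces and one core consumes, so a
 * pair of atomic indices is all the synchronization required. */
struct msg_ring {
    _Atomic unsigned head;        /* written only by the producer */
    _Atomic unsigned tail;        /* written only by the consumer */
    void *slots[RING_SIZE];
};

static bool ring_send(struct msg_ring *r, void *msg)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;             /* ring full */
    r->slots[head & (RING_SIZE - 1)] = msg;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static void *ring_recv(struct msg_ring *r)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return NULL;              /* ring empty */
    void *msg = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return msg;
}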

Each application module that involves standard protocols may implement standard processing algorithms. Many of these algorithms, methods and even protocols are common enough to be implemented in look-aside hardware accelerators. An application module can then make use of these accelerators during appropriate stages of its own processing, by directing packets to those engines and collecting responses.
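
Look-aside offload typically follows an asynchronous enqueue/dequeue pattern, again using the hypothetical portal interface from earlier; the queue id here is an assumption for illustration.

/* Look-aside crypto offload (sketch).  CRYPTO_REQ_Q is an
 * illustrative request queue id for a crypto engine; completed
 * packets come back on a response queue and are picked up by the
 * worker's normal dequeue loop. */
#define CRYPTO_REQ_Q 200

static void ipsec_encrypt_stage(struct frame_desc *fd)
{
    /* Hand the packet to the crypto engine and return immediately;
     * the core is free to process other packets while the
     * accelerator works. */
    soc_portal_enqueue(CRYPTO_REQ_Q, fd);
}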

One thing common to all data plane processing is the handling of statistics. Statistics counters, such as byte and packet counters and application specific counters, often need to be kept per flow, and also at higher abstraction levels as applications require. A large number of counters can therefore be expected in higher end devices.

Since multi-core synchronized access to shared counters is costly, a multi-core SoC can also provide a statistics acceleration mechanism, one that would make incrementing statistics for millions of counters very efficient.
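
Where no such accelerator exists, the common software workaround is per-core counters that are summed only when read, so the fast path never takes a lock or bounces a cache line. A sketch, with MAX_CORES and the slot layout as assumptions:

#include <stdint.h>

#define MAX_CORES 8

/* One cache-line-sized slot per core so increments never contend. */
struct flow_stats {
    struct {
        uint64_t packets;
        uint64_t bytes;
        char     pad[48];         /* pad the slot to a 64-byte line */
    } percore[MAX_CORES];
};

/* Fast path: plain, unsynchronized increment of this core's slot. */
static inline void stats_update(struct flow_stats *s, int core,
                                uint32_t pkt_len)
{
    s->percore[core].packets++;
    s->percore[core].bytes += pkt_len;
}

/* Slow path (management plane): aggregate across cores on read. */
static uint64_t stats_read_packets(const struct flow_stats *s)
{
    uint64_t total = 0;
    for (int c = 0; c < MAX_CORES; c++)
        total += s->percore[c].packets;
    return total;
}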

Once a packet is processed through all necessary application modules in the data plane, the packet is sent out to the egress interface. Typical processing here requires scheduling and shaping (or rate limiting). Since standard QoS algorithms are generally used, these functions can also be offloaded to hardware units, so that the application modules need only enqueue packets to egress processing units.
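
With scheduling and shaping offloaded, the application's egress step can reduce to choosing a class-specific hardware queue; a one-function sketch reusing the hypothetical portal interface:

/* Egress: pick the hardware queue for the packet's traffic class and
 * enqueue it; shaping and scheduling then happen entirely in
 * hardware.  EGRESS_Q_BASE is illustrative. */
#define EGRESS_Q_BASE 300

static int egress_send(struct frame_desc *fd, int traffic_class)
{
    return soc_portal_enqueue(EGRESS_Q_BASE + traffic_class, fd);
}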

A zero-overhead user space network software framework
Multi-core SoC software developers have been challenged with writing applications that suit specific SoC families and their derivatives. This means writing specialized code that is suitable for, and specific to, a given SoC.

For example, queuing, buffer management, statistics, accelerators and other technologies each have a very specific modus operandi that applications must follow. There are also specific methods of interfacing with the hardware to receive and send packets, and for distributing work.

This also results in the application software architecture being dictated to some extent by the SoC. Migrating software across SoCs, even within the same SoC product family, can be a large and expensive development effort, and a burden on maintenance and support.

There is a need for a software framework that can leverage the features provided by a multicore SoC without requiring in-depth expertise in the hardware's operation.

Applications need to be portable, and be able to leverage different SoCs and families without software application changes, essentially through configuration of features and an abstracted execution environment.

Much like a traditional (i.e. non-embedded) operating system that hides many hardware details from applications, a network software framework that hides SoC specific details and offers a consistent programming model to applications is the need of the hour.

Limits of Linux in the data plane
Direct use of the Linux kernel for data plane implementations has limitations. The Linux kernel provides abstractions for disk I/O, USB, processor features and other hardware elements. However, scaling to millions of flows/sessions in the datapath is not easy in Linux kernel space. Kernel-resident applications suffer from limited memory and an environment that is hard to develop and debug in. Vendors also have GPL concerns with Linux kernel modules.

To overcome these limitations, applications need to execute in user space with virtually zero overhead and with direct access to the SoC hardware, supported by a software framework that caters to the various needs of networking applications without requiring knowledge of hardware-specific details.

Such a framework needs to support layer 2, layer 3 and higher layer processing, orchestrate packet flow, manage packet buffers, and provide access to hardware accelerators, timers, and statistics.

It also needs to support inter-application communication and provide multiple execution models for applications. A user-space network software framework that leverages the advantages of the Linux OS while overcoming these limitations is essential for the next generation of networking and embedded applications.
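
Linux already offers one building block for this pattern in UIO (userspace I/O), which lets a process map device registers directly and poll them with no syscalls on the fast path. The sketch below assumes a UIO driver has been bound to a queue manager portal and exposed as /dev/uio0; the register offsets and the ready bit are purely illustrative.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PORTAL_MAP_SIZE 4096      /* illustrative size of one portal region */

int main(void)
{
    /* Map the hardware portal registers straight into user space. */
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/uio0");
        return 1;
    }

    volatile uint32_t *portal = mmap(NULL, PORTAL_MAP_SIZE,
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
    if (portal == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Poll-mode fast path: zero syscalls, zero copies per packet. */
    for (;;) {
        uint32_t status = portal[0];   /* hypothetical status register */
        if (status & 0x1) {
            /* A frame descriptor is available: dequeue and process. */
        }
    }
}

This is the style of abstraction a framework such as Freescale's VortiQa software, mentioned in the introduction, builds on for the QorIQ processor family: the hardware specifics stay behind the framework, while applications see a portable, zero-overhead programming model.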
