DPDK: PCIe and packet processing

References:

  "Layman DPDK" (the book 《深入浅出DPDK》)

  The "Linux Yuemachang" (Linux阅码场) WeChat public account

..............................................................................................................

I. PCIe introduction (based on the Linux Yuemachang article)

First, let us look at what kind of system architecture PCIe forms on an x86 system. Below is an example of a PCIe topology. The PCIe protocol supports up to 256 buses, each bus supports up to 32 devices, and each device supports up to 8 functions, so a BDF (Bus, Device, Function) triple constitutes the ID of every PCIe device node in the system.
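For illustration, a BDF can be packed into the 16-bit value used when addressing configuration requests: bits [15:8] are the bus, [7:3] the device and [2:0] the function. The small C sketch below (the helper name pci_make_bdf is made up for this example) shows the encoding using the STAR1000 device discussed later:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack bus/device/function into a 16-bit BDF:
     * bits [15:8] = bus, [7:3] = device, [2:0] = function. */
    static uint16_t pci_make_bdf(uint8_t bus, uint8_t dev, uint8_t func)
    {
        return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (func & 0x7));
    }

    int main(void)
    {
        /* The STAR1000 NVMe controller discussed below sits at 3C:00.0. */
        uint16_t bdf = pci_make_bdf(0x3c, 0x00, 0x0);
        printf("BDF %02x:%02x.%x -> 0x%04x\n",
               bdf >> 8, (bdf >> 3) & 0x1f, bdf & 0x7, bdf);
        return 0;
    }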

 

A PCIe hierarchy generally consists of a root complex, switches, endpoints and other types of PCIe devices, and there are usually also some embedded endpoints (devices with no external PCIe interface) inside the switches and the root complex. With so many devices, how does the CPU find and recognize them after it starts up? The host scans for PCIe devices using a depth-first algorithm: briefly, it follows each possible path as deep as it can go before backtracking, and each node is visited only once. We usually call this process PCIe device enumeration. During enumeration the host reads the configuration information of downstream devices with configuration read transaction packets and configures those downstream devices with configuration write transaction packets.
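The sketch below illustrates that depth-first scan in C. It is only a sketch under simplifying assumptions: cfg_read16()/cfg_read8() are stand-in configuration-space accessors (stubbed here to report an empty bus so the code compiles and runs), multi-function checks are skipped, and the primary/secondary/subordinate bus-number register writes described in the steps below are only hinted at in comments.

    #include <stdint.h>

    #define VENDOR_ID_OFF 0x00   /* reads as 0xffff when no device is present */
    #define HDR_TYPE_OFF  0x0e   /* header type 0x01 means PCI-PCI bridge     */

    /* Stubs standing in for real configuration-space reads (ECAM or ports
     * 0xCF8/0xCFC); they report an empty bus so this sketch runs as-is. */
    static uint16_t cfg_read16(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
    { (void)bus; (void)dev; (void)func; (void)off; return 0xffff; }
    static uint8_t cfg_read8(uint8_t bus, uint8_t dev, uint8_t func, uint8_t off)
    { (void)bus; (void)dev; (void)func; (void)off; return 0; }

    static uint8_t next_bus;     /* next bus number to hand out */

    /* Depth-first enumeration: scan one bus, descend into every bridge found
     * before moving on to the next device. */
    static void scan_bus(uint8_t bus)
    {
        for (uint8_t dev = 0; dev < 32; dev++) {
            for (uint8_t func = 0; func < 8; func++) {
                if (cfg_read16(bus, dev, func, VENDOR_ID_OFF) == 0xffff)
                    continue;                        /* nothing at this BDF */
                if ((cfg_read8(bus, dev, func, HDR_TYPE_OFF) & 0x7f) == 0x01) {
                    /* A bridge: assign the next bus number as its secondary bus
                     * (subordinate temporarily 0xFF) and recurse into it first. */
                    uint8_t secondary = ++next_bus;
                    scan_bus(secondary);
                    /* Back here, the subordinate bus number is known. */
                }
                /* Otherwise it is an endpoint (leaf node): nothing below it. */
            }
        }
    }

    int main(void) { scan_bus(0); return 0; }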

In the first step, the host scans the devices on Bus 0. (In a single-processor system, the PCI bus directly connected to the Host Bridge inside the Root Complex is usually named PCI Bus 0.) The system first ignores devices on Bus 0 that have no PCI bridge below them, such as embedded endpoints. The first bridge found is Bridge 1; the system numbers the PCI bus below Bridge 1 as Bus 1, initializes the bridge's configuration space, and sets its Primary Bus Number and Secondary Bus Number registers to 0 and 1 respectively, indicating that the upstream bus of Bridge 1 is Bus 0 and its downstream bus is Bus 1. Since the system cannot yet determine what exactly hangs below Bridge 1, it temporarily sets the Subordinate Bus Number to 0xFF.

In the second step, the system starts scanning Bus 1 and finds Bridge 3, discovering that it belongs to a switch device. The system numbers the PCI bus below Bridge 3 as Bus 2 and sets the bridge's Primary Bus Number and Secondary Bus Number registers to 1 and 2. As in the previous step, Bridge 3's Subordinate Bus Number is temporarily set to 0xFF.

In the third step, the system continues scanning Bus 2 and finds Bridge 4. Scanning further, it finds an NVMe SSD mounted below that bridge. The system numbers the PCI bus below Bridge 4 as Bus 3 and sets the bridge's Primary Bus Number and Secondary Bus Number registers to 2 and 3. Since what hangs below Bus 3 is an endpoint device (a leaf node), there is no further downstream bus, so the Subordinate Bus Number of Bridge 4 can be determined to be 3.

In the fourth step, having finished scanning Bus 3, the system goes back and continues scanning Bus 2, where it finds Bridge 5. Scanning further, it finds a NIC mounted below that bridge. The system numbers the PCI bus below Bridge 5 as Bus 4 and sets the bridge's Primary Bus Number and Secondary Bus Number registers to 2 and 4. Because the NIC is likewise an endpoint device, the Subordinate Bus Number of Bridge 5 can be determined to be 4.

In the fifth step, there are no devices below Bus 2 other than Bridge 4 and Bridge 5, so the flow returns to Bridge 3. Bus 4 is the highest-numbered bus found below Bridge 3, so Bridge 3's Subordinate Bus Number is set to 4. The devices downstream of Bridge 3 have all been scanned, so the flow backtracks further to Bridge 1, whose Subordinate Bus Number is likewise set to 4.

In the sixth step, the system returns to scanning Bus 0 and finds Bridge 2. It numbers the PCI bus below Bridge 2 as Bus 5 and sets Bridge 2's Primary Bus Number and Secondary Bus Number registers to 0 and 5. The graphics card below it is an endpoint device, so Bridge 2's Subordinate Bus Number can be determined to be 5.

At this point all devices hanging off the PCIe bus have been scanned and the enumeration process ends; through this process the host has obtained a complete picture of the PCIe device topology.

After the system powers on, the host completes the device enumeration process described above automatically. Apart from some proprietary systems, a typical system scans for devices only during the boot stage; once it has started successfully (i.e. once enumeration has finished), the system will not recognize a PCIe device that is plugged in afterwards.

On a Linux system we can use the lspci -v -t command to query the PCIe devices found during the power-on scan; the result lists all PCIe devices in the system in tree form. As shown below, the PCIe device highlighted in yellow is a STAR1000 series NVMe SSD controller chip released by Beijing Starblaze Technology Co., Ltd.; the 9d32 shown in the figure is the vendor ID that the PCI-SIG, the licensing organization, assigned to Starblaze, and 1000 is the device serial number.

The BDF of the STAR1000 device can be identified from the figure above: the bus is 0x3C, the device is 0x00 and the function is 0x0, so the BDF is written as 3C:00.0, and the corresponding upstream port is 00:1d.0.

We can use the lspci -xxx -s 3C:00.0 command to list the details of this PCIe device (hardware technology enthusiasts, pay close attention to this part). What is listed is the content stored in the PCIe configuration space, which describes the characteristics of the PCIe device itself. As shown below (lower addresses on the left, starting from 0x00), it can be seen that this is a non-volatile memory controller. Starting at address 0x00 are the Vendor ID and Device ID of the PCIe device. The Class Code 0x010802 indicates that this is an NVMe storage device. Address 0x40 is the start of the first capability group; to inspect the PCIe capabilities you need to begin querying from this position, and the header field of each capability group gives the starting address of the next group. Starting from address 0x40 are, in order, the power management, MSI interrupt, link control and status, and MSI-X interrupt capability groups. In particular, a field of the link characteristics at 0x43 indicates that the STAR1000 device's link is x4 lane and supports the PCIe Gen3 rate (8 GT/s).
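As a small illustration of how that capability chain is organized, the sketch below walks the linked list inside a 256-byte configuration-space dump such as the bytes printed by lspci -xxx; the capabilities pointer at offset 0x34 gives the first entry, and the second byte of each capability points to the next one.

    #include <stdint.h>
    #include <stdio.h>

    #define PCI_CAPABILITY_LIST 0x34   /* offset of the capabilities pointer */

    /* Walk the capability list in a 256-byte configuration-space dump.
     * Each capability starts with a 1-byte ID followed by a 1-byte
     * pointer to the next capability (0x00 terminates the list). */
    void walk_capabilities(const uint8_t cfg[256])
    {
        uint8_t off = cfg[PCI_CAPABILITY_LIST];

        while (off != 0x00 && off != 0xff) {
            uint8_t cap_id = cfg[off];
            /* 0x01 = power management, 0x05 = MSI,
             * 0x10 = PCI Express, 0x11 = MSI-X */
            printf("capability id 0x%02x at offset 0x%02x\n", cap_id, off);
            off = cfg[off + 1];
        }
    }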

Of course, you can also use the lspci -vvv -s 3C:00.0 command to view the device properties directly.

 

II. PCIe from the packet processing perspective

The PCIe specification follows the Open Systems Interconnection (OSI) reference model: from top to bottom it is divided into the transaction layer, the data link layer and the physical layer.
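To make the transaction layer a little more concrete, the sketch below shows one common way to view the first three double words of a 32-bit-address request TLP header (for example the configuration read/write packets used during enumeration). This is only an illustration of the header layout, not DPDK or driver code; field accessors use shifts and masks rather than C bitfields.

    #include <stdint.h>

    /* A 3-DW TLP header (32-bit addressing), e.g. a memory or configuration
     * request, stored as raw double words. */
    struct tlp_hdr_3dw {
        uint32_t dw0;  /* Fmt/Type, traffic class, attributes, Length in DWs     */
        uint32_t dw1;  /* Requester ID, Tag, Last/First DW byte enables          */
        uint32_t dw2;  /* Address[31:2] for memory, or BDF + register for config */
    };

    static inline unsigned tlp_fmt(const struct tlp_hdr_3dw *h)    { return (h->dw0 >> 29) & 0x7;   }
    static inline unsigned tlp_type(const struct tlp_hdr_3dw *h)   { return (h->dw0 >> 24) & 0x1f;  }
    static inline unsigned tlp_len_dw(const struct tlp_hdr_3dw *h) { return  h->dw0        & 0x3ff; }
    static inline unsigned tlp_req_id(const struct tlp_hdr_3dw *h) { return (h->dw1 >> 16) & 0xffff;}
    static inline unsigned tlp_tag(const struct tlp_hdr_3dw *h)    { return (h->dw1 >>  8) & 0xff;  }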

 

III. Mbuf

The mbuf library provides functions to allocate and free mbufs; DPDK applications use these buffers to store message (packet) buffers. The message buffers are stored in a mempool.

 

The rte_mbuf data structure can carry network packet data or general-purpose control messages (indicated by CTRL_MBUF_FLAG), and it can also be extended to other types. The rte_mbuf header structure is kept as small as possible; it currently occupies only two cache lines, with the most commonly used fields in the first cache line.

To store packet data (including protocol headers), two approaches were considered:

  1. Embed the metadata in a single memory buffer, followed by a fixed-size area for the packet data.
  2. Store the metadata and the packet data in separate buffers.

The advantage of the first approach is that it needs only one allocate/free operation for the entire representation of a packet. The second approach is more flexible, however, and allows the allocation of metadata buffers to be completely separated from the allocation of packet data buffers.

DPDK chose the first approach. The metadata contains control information such as the message type, the length, the offset to the start of the data, and so on, as well as a pointer that allows additional mbuf structures to be chained.

For buffers that carry network packets, handling jumbo frames may require multiple buffers to store a complete packet. This is the case for a jumbo frame composed of many mbufs linked together through their next fields, as in the sketch below.
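A minimal sketch of walking such a chain using the rte_mbuf fields next, data_len and pkt_len (m is assumed to be the first segment of a received packet):

    #include <rte_mbuf.h>

    /* Walk a chained (multi-segment) packet mbuf and count its bytes;
     * the total equals the pkt_len stored in the first segment. */
    static uint32_t count_chain_bytes(const struct rte_mbuf *m)
    {
        uint32_t total = 0;

        for (const struct rte_mbuf *seg = m; seg != NULL; seg = seg->next)
            total += seg->data_len;              /* bytes in this segment only */

        /* rte_pktmbuf_pkt_len(m), i.e. m->pkt_len, reports the same value. */
        return total;
    }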

For a newly allocated mbuf, the data area begins RTE_PKTMBUF_HEADROOM bytes after the start of the buffer, which is cache aligned. Message buffers can be used to carry control information, packets, events and so on between different entities in the system. Message buffers can also use their pointers to refer to the data areas of other message buffers or to other data structures.

 

Fig. 6.1 An mbuf with One Segment

Fig. 6.2 An mbuf with Three Segments

 

The buffer manager implements a fairly standard set of buffer access operations for manipulating network packets.

 

The buffer manager uses the Mempool Library to allocate buffers, which among other things ensures that packet headers are spread evenly across the memory channels for L3 processing. An mbuf contains a field indicating which pool it was allocated from. When rte_ctrlmbuf_free(m) or rte_pktmbuf_free(m) is called, the mbuf is returned to its original pool.

 

Packet and control mbuf constructors are provided by the API. The rte_pktmbuf_init() and rte_ctrlmbuf_init() functions initialize certain fields of the mbuf structure that, once created, are not modified by the user (such as the mbuf type, the source pool, the buffer start address, and so on). These functions are passed as callbacks to rte_mempool_create() when the pool is created.
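A minimal sketch of creating a packet mbuf pool this way with rte_mempool_create(); the pool name, element count, size and cache values are only illustrative (newer applications usually call the rte_pktmbuf_pool_create() wrapper instead, as shown further below):

    #include <rte_mempool.h>
    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    #define NB_MBUF   8192
    #define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)

    static struct rte_mempool *create_pktmbuf_pool(void)
    {
        /* rte_pktmbuf_pool_init() fills in the pool's private area once,
         * rte_pktmbuf_init() is run on every mbuf at pool creation time. */
        return rte_mempool_create("mbuf_pool", NB_MBUF, MBUF_SIZE,
                                  32,                      /* per-lcore cache */
                                  sizeof(struct rte_pktmbuf_pool_private),
                                  rte_pktmbuf_pool_init, NULL,
                                  rte_pktmbuf_init, NULL,
                                  rte_socket_id(), 0);     /* NULL on failure */
    }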

 

Allocating a new mbuf requires the user to specify the pool to allocate it from. Any newly allocated mbuf contains one segment, with a length of zero. The offset from the buffer to the data is initialized so that the buffer has RTE_PKTMBUF_HEADROOM bytes of headroom.

Freeing an mbuf means returning it to its original mempool. The content of an mbuf is not modified while it is stored in a pool (as a free mbuf). Fields initialized by the constructor do not need to be re-initialized when the mbuf is allocated.

When a packet mbuf consisting of multiple segments is freed, all of the segments are freed and returned to their original mempool.
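A minimal sketch of the allocate/use/free cycle, here using the rte_pktmbuf_pool_create() convenience wrapper; the pool name, pool sizes and the 64-byte payload are illustrative only:

    #include <string.h>
    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    static void alloc_and_free_example(void)
    {
        /* Each mbuf gets RTE_MBUF_DEFAULT_BUF_SIZE bytes of data room,
         * which already accounts for RTE_PKTMBUF_HEADROOM. */
        struct rte_mempool *mp = rte_pktmbuf_pool_create("example_pool",
                                                         8192, 256, 0,
                                                         RTE_MBUF_DEFAULT_BUF_SIZE,
                                                         rte_socket_id());
        if (mp == NULL)
            return;

        struct rte_mbuf *m = rte_pktmbuf_alloc(mp);   /* one segment, length 0 */
        if (m == NULL)
            return;

        /* Reserve 64 bytes after the headroom and fill them in. */
        char *payload = rte_pktmbuf_append(m, 64);
        if (payload != NULL)
            memset(payload, 0, 64);

        rte_pktmbuf_free(m);    /* returns the mbuf to its original mempool */
    }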

 

This library provides functions to manipulate the data in a packet mbuf. For example:

  • get the data length
  • get a pointer to the start of the data
  • prepend data before the existing data
  • append data after the existing data
  • remove data from the beginning of the buffer (rte_pktmbuf_adj())
  • remove data from the end of the buffer (rte_pktmbuf_trim())

See the DPDK API Reference for details; a short usage sketch follows this list.
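A minimal sketch exercising these helpers on a single-segment mbuf (the 14- and 4-byte sizes are arbitrary; m is assumed to have enough headroom and tailroom):

    #include <rte_mbuf.h>

    static void manipulate(struct rte_mbuf *m)
    {
        uint32_t plen = rte_pktmbuf_pkt_len(m);        /* total data length          */
        char    *data = rte_pktmbuf_mtod(m, char *);   /* pointer to start of data   */
        (void)plen; (void)data;

        char *hdr  = rte_pktmbuf_prepend(m, 14);       /* grow into the headroom     */
        char *tail = rte_pktmbuf_append(m, 4);         /* grow into the tailroom     */

        if (hdr != NULL)
            rte_pktmbuf_adj(m, 14);                    /* strip bytes from the front */
        if (tail != NULL)
            rte_pktmbuf_trim(m, 4);                    /* strip bytes from the end   */
    }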

 

Some information is retrieved by the network driver and stored in the mbuf to make processing simpler, for example the VLAN tag, the RSS hash result (see Poll Mode Driver), and flags indicating that a checksum was computed by hardware.

The mbuf also contains the input port of the data and the number of mbufs in the packet chain. For chained mbufs, only the first mbuf of the chain stores this meta information.
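A minimal RX-side sketch reading some of this metadata; which flags are actually set depends on the NIC, the poll mode driver and the offloads enabled (the flag names here are the pre-21.11 PKT_RX_* spellings, matching the PKT_TX_* names used elsewhere in this article):

    #include <stdio.h>
    #include <rte_mbuf.h>

    static void inspect_rx_mbuf(const struct rte_mbuf *m)
    {
        printf("input port %u, %u segment(s), pkt_len %u\n",
               m->port, m->nb_segs, m->pkt_len);

        if (m->ol_flags & PKT_RX_RSS_HASH)
            printf("RSS hash computed by hardware: 0x%08x\n", m->hash.rss);

        if (m->ol_flags & PKT_RX_VLAN_STRIPPED)
            printf("VLAN TCI stripped by hardware: 0x%04x\n", m->vlan_tci);

        if (m->ol_flags & PKT_RX_IP_CKSUM_BAD)
            printf("hardware reports a bad IP checksum\n");
    }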

This is the case on the RX side, for example, for IEEE 1588 packet timestamping, VLAN tagging and IP checksum computation. On the TX side, an application can also delegate some processing to the hardware; for example, the PKT_TX_IP_CKSUM flag allows offloading the computation of the IPv4 checksum.

The following examples explain how to configure different TX offloads on a vxlan-encapsulated TCP packet: out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload

  • calculate the checksum of out_ip:

    mb->l2_len = len(out_eth)
    mb->l3_len = len(out_ip)
    mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM
    set out_ip checksum to 0 in the packet

    Hardware calculation is supported by configuring DEV_TX_OFFLOAD_IPV4_CKSUM.

  • calculate the checksums of out_ip and out_udp:

    mb->l2_len = len(out_eth)
    mb->l3_len = len(out_ip)
    mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_UDP_CKSUM
    set out_ip checksum to 0 in the packet
    set out_udp checksum to pseudo header using rte_ipv4_phdr_cksum()

    Hardware calculation is supported by configuring DEV_TX_OFFLOAD_IPV4_CKSUM and DEV_TX_OFFLOAD_UDP_CKSUM.

  • calculate the checksum of in_ip:

    mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM
    set in_ip checksum to 0 in the packet

    This is similar to case 1, but l2_len is different. Hardware calculation is supported by configuring DEV_TX_OFFLOAD_IPV4_CKSUM. Note that it can only work if the outer L4 checksum is 0.

  • calculate the checksums of in_ip and in_tcp:

    mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM
    set in_ip checksum to 0 in the packet
    set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()

    This is similar to case 2, but l2_len is different. Hardware calculation is supported by configuring DEV_TX_OFFLOAD_IPV4_CKSUM and DEV_TX_OFFLOAD_TCP_CKSUM. Note that it can only work if the outer L4 checksum is 0.

  • segment inner TCP:

    mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->l4_len = len(in_tcp)
    mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_TCP_SEG;
    set in_ip checksum to 0 in the packet
    set in_tcp checksum to pseudo header without including the IP payload length, using rte_ipv4_phdr_cksum()
    

    Hardware implementation is supported by configuring DEV_TX_OFFLOAD_TCP_TSO. Note that it can only work if the outer L4 checksum is 0.

  • calculate the checksums of out_ip, in_ip and in_tcp:

    mb->outer_l2_len = len(out_eth)
    mb->outer_l3_len = len(out_ip)
    mb->l2_len = len(out_udp + vxlan + in_eth)
    mb->l3_len = len(in_ip)
    mb->ol_flags |= PKT_TX_OUTER_IPV4 | PKT_TX_OUTER_IP_CKSUM | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
    set out_ip checksum to 0 in the packet
    set in_ip checksum to 0 in the packet
    set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()

    Hardware implementation is supported by configuring DEV_TX_OFFLOAD_IPV4_CKSUM, DEV_TX_OFFLOAD_UDP_CKSUM and DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM.

The meaning of each flag is described in detail in the mbuf API documentation (rte_mbuf.h). More details can also be found in the testpmd sources (in particular csumonly.c).

 

A direct buffer is a buffer that is completely separate and self-contained. An indirect buffer behaves like a direct buffer, but its buffer pointer and data offset refer to data in another direct buffer. This is useful when packets need to be duplicated or fragmented, since an indirect buffer provides a means of reusing the same packet data across multiple buffers.

A buffer becomes indirect when it is attached to a direct buffer with the rte_pktmbuf_attach() function. Each buffer has a reference counter field: whenever an indirect buffer is attached to a direct buffer, the reference counter of the direct buffer is incremented; similarly, whenever an indirect buffer is detached, the reference counter of the direct buffer is decremented. If the resulting reference counter is 0, the direct buffer is freed, since it is no longer in use.

There are a few things to keep in mind when dealing with indirect buffers. First, an indirect buffer is never attached to another indirect buffer: an attempt to attach buffer A to an indirect buffer B (where B is attached to C) makes rte_pktmbuf_attach() automatically attach A to C. Second, for a buffer to become indirect, its reference count must be equal to 1, that is, it must not already be referenced by another indirect buffer. Finally, it is not possible to re-attach an indirect buffer to a direct buffer (unless it is detached first).

Although attach and detach can be invoked directly with the rte_pktmbuf_attach() and rte_pktmbuf_detach() functions, it is recommended to use the higher-level rte_pktmbuf_clone() function, which takes care of correctly initializing the indirect buffer and can clone buffers that have multiple segments.
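A minimal sketch of the recommended clone path; indirect_pool is assumed to be a mempool created specifically for indirect mbufs, as discussed in the next paragraph:

    #include <rte_mbuf.h>

    static struct rte_mbuf *clone_packet(struct rte_mbuf *orig,
                                         struct rte_mempool *indirect_pool)
    {
        /* The clone's segments are indirect mbufs taken from indirect_pool;
         * they point at the data of orig's segments, and the reference
         * counters of those direct buffers are incremented. */
        struct rte_mbuf *clone = rte_pktmbuf_clone(orig, indirect_pool);

        /* orig's data is only really released once both orig and the clone
         * have been handed to rte_pktmbuf_free(). */
        return clone;    /* NULL if the indirect pool is exhausted */
    }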

Since an indirect buffer is not supposed to actually hold any data, the mempool for indirect buffers should be configured to reflect the reduced memory consumption. Examples of initializing a mempool for indirect buffers (and examples of using indirect buffers) can be found in several sample applications, for example the IPv4 multicast sample application.

 


Origin www.cnblogs.com/mysky007/p/11117704.html