DPDK Ethernet code notes

The overall flow is shown in the figure: devices and queues are configured during initialization, and at runtime packets are sent and received through rte_eth_tx_burst and rte_eth_rx_burst.

Creating the packet receive memory pool

rte_pktmbuf_pool_create(pool_name, num, cache_size, priv_size, data_room_size, socket_id);

  1. num is the number of mbufs (blocks) in the pool.

  2. cache_size is the number of mbuf pointers cached per lcore. Note that each lcore reserves this many pointers into the mempool, and mbufs sitting in one lcore's cache are not available to the other lcores.

  3. priv_size is private space reserved in each mbuf, which can be used, for example, to store custom headers.

  4. data_room_size is the actual payload size plus RTE_PKTMBUF_HEADROOM.

  5. socket_id can be 0 if there is only one socket, or an arbitrary value such as SOCKET_ID_ANY (which is -1). It mainly matters across NUMA nodes, where different DMA paths are used.

  6. Some drivers, such as certain ice drivers, do not support the iova=va mode, because the PMD requires a large block of physically contiguous memory.

  7. Received packets are DMA'd into the pre-allocated receive mempool. Note that no separate mempool needs to be created for transmit in the driver: the buffers passed to rte_eth_tx_burst have already been allocated.
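The parameters above can be sketched as follows; NB_MBUF, CACHE_SIZE, the pool name, and the 1518-byte frame size are illustrative values, not from the original text.

```c
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NB_MBUF    8192   /* number of mbufs (blocks) in the pool */
#define CACHE_SIZE 256    /* per-lcore cache of mbuf pointers */

struct rte_mempool *
create_rx_pool(int socket_id)
{
    /* data room = headroom + room for the largest expected frame */
    uint16_t data_room = RTE_PKTMBUF_HEADROOM + 1518;

    return rte_pktmbuf_pool_create("rx_pool", NB_MBUF, CACHE_SIZE,
                                   0 /* priv_size */, data_room,
                                   socket_id /* or SOCKET_ID_ANY */);
}
```

Under traffic bursts, a larger NB_MBUF leaves more headroom before rx runs out of buffers (see rx_nombuf in the debugging section).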

Device port initialization

  1. rte_eth_dev_count_avail()
    shows how many eth ports (i.e., eth devices) there are, counting all PFs and VFs.

  2. rte_eth_dev_info_get(port, &dev_info);
    queries what the current port supports, such as the maximum numbers of rx and tx queues.

  3. rte_eth_dev_configure(port, n_rx_q, n_tx_q + n_free_tx_q, &port_conf)
    configures how many tx and rx queues the port has.

  4. rte_eth_dev_set_vlan_offload(port_id, vlan_offload)
    First obtain the current VLAN offload configuration with rte_eth_dev_get_vlan_offload(port_id), then apply the new configuration with rte_eth_dev_set_vlan_offload().

  5. rte_eth_dev_vlan_filter(port_id, vlan_id, 1)
    enables the VLAN filter for vlan_id on port_id (1 means enable), so that packets with this VLAN are allowed through; after they are received, the VLAN tag is stripped automatically.
    When sending a packet, just fill the VLAN into mbuf->vlan_tci.

  6. rte_eth_dev_callback_register(portid, RTE_ETH_EVENT_QUEUE_STATE, eth_queue_state_change_callback, NULL);
    registers a callback: when the RTE_ETH_EVENT_QUEUE_STATE event fires (a queue's state changed), eth_queue_state_change_callback() is invoked.
    On some devices, when the mbufs used for rx are exhausted, the device stops receiving and the queue is stopped; the rx queue then has to be restarted with
    rte_eth_dev_rx_queue_stop(port, queue) followed by rte_eth_dev_rx_queue_start(port, queue).
    A callback can likewise be registered for link state changes (RTE_ETH_EVENT_INTR_LSC). Some NICs support this and some do not; see https://doc.dpdk.org/guides/nics/overview.html

  7. rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd, &nb_txd);
    checks the rx and tx descriptor counts and clamps them to the device limits if they exceed them.

  8. rte_eth_dev_get_port_by_name(port_name, &port_id)
    looks up the port id by device name.

  9. rte_tm_node_add(port, node, parent_node, prio, weight, level, param, error)
    on devices that support TM, configures the traffic manager. For example, prio can be set to 5, weight to 5, and level to RTE_TM_NODE_LEVEL_ID_ANY (i.e., UINT32_MAX).
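Steps 1-3 and 7 above can be sketched as a minimal port bring-up; error handling is abbreviated and the queue and descriptor counts are illustrative.

```c
#include <rte_ethdev.h>

static int
init_port(uint16_t port, uint16_t n_rxq, uint16_t n_txq)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_conf port_conf = {0};
    uint16_t nb_rxd = 1024, nb_txd = 1024;   /* illustrative */

    /* check what the port supports before configuring it */
    if (rte_eth_dev_info_get(port, &dev_info) != 0)
        return -1;
    if (n_rxq > dev_info.max_rx_queues || n_txq > dev_info.max_tx_queues)
        return -1;                           /* too many queues requested */

    if (rte_eth_dev_configure(port, n_rxq, n_txq, &port_conf) != 0)
        return -1;

    /* clamp descriptor counts to the device limits */
    return rte_eth_dev_adjust_nb_rx_tx_desc(port, &nb_rxd, &nb_txd);
}
```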

Initialize the queues

  1. rte_eth_rx_queue_setup(port, rx_q_id, nb_rx_desc, rte_eth_dev_socket_id(port), NULL, mp)
    must specify the mempool, so that the rx driver takes its buffers from this mempool.

  2. rte_eth_tx_queue_setup(port, tx_q_id, nb_tx_desc, rte_eth_dev_socket_id(port), &txconf);
    tx does not need a mempool, but tx_free_thresh and tx_rs_thresh should be configured:
    struct rte_eth_txconf txconf = dev_info.default_txconf;
    txconf.tx_free_thresh = nb_tx_desc - txconf.tx_rs_thresh;
    If more tx queues are needed at runtime, the same API can be used to add them:
    rte_eth_tx_queue_setup(port, txq, nb_txd, socket_id, &txconf);

  3. rte_eth_cp_queue_setup(portid, cp_q_id, NUM_PKTS_COMPL_Q, socket_id, &cp_conf);
    configures the completion queue; you are notified when transmission completes.
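Queue setup following steps 1-2 might look like the sketch below (single rx and tx queue; nb_rxd/nb_txd are passed in by the caller):

```c
#include <rte_ethdev.h>
#include <rte_mempool.h>

static int
setup_queues(uint16_t port, struct rte_mempool *mp,
             uint16_t nb_rxd, uint16_t nb_txd)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_txconf txconf;
    int socket_id = rte_eth_dev_socket_id(port);

    if (rte_eth_dev_info_get(port, &dev_info) != 0)
        return -1;

    /* rx queue takes the mempool the driver will allocate mbufs from */
    if (rte_eth_rx_queue_setup(port, 0, nb_rxd, socket_id, NULL, mp) != 0)
        return -1;

    /* tx queue needs no mempool, but tune the free/RS thresholds */
    txconf = dev_info.default_txconf;
    txconf.tx_free_thresh = nb_txd - txconf.tx_rs_thresh;
    return rte_eth_tx_queue_setup(port, 0, nb_txd, socket_id, &txconf);
}
```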

Start the device

rte_eth_dev_start(port)
rte_eth_link_get_nowait(port, &link);
Use link.link_status to check whether the link is up; 1 means the link is up and can carry traffic.
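The two calls above can be combined as below; note that rte_eth_link_get_nowait() returns an error code in recent DPDK releases (it returned void in older ones).

```c
#include <rte_ethdev.h>

/* Start the port and poll the link state without blocking. */
static int
start_and_check_link(uint16_t port)
{
    struct rte_eth_link link;

    if (rte_eth_dev_start(port) != 0)
        return -1;

    if (rte_eth_link_get_nowait(port, &link) != 0)
        return -1;

    /* link_status == RTE_ETH_LINK_UP (1) means ready for traffic */
    return link.link_status == RTE_ETH_LINK_UP ? 0 : -1;
}
```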

Stop the queues

  1. tx queue stop
    rte_eth_dev_tx_queue_stop(port, tx_queue);
    rte_tm_node_delete(port, tm_node_id, &error);

  2. rx queue stop
    rte_eth_dev_rx_queue_stop(port, rx_queue);

Stop the device

rte_eth_dev_stop(port);
rte_eth_dev_close(port);
Stop the data flow first, then release the eth device.

Receive data

After rte_eth_rx_burst(port, 0, mbufs, BURST_SIZE),
rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *)
returns a pointer to the eth hdr, and you can start processing the data.
rte_pktmbuf_mtod_offset(mbufs[i], struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr))
returns a pointer at the specified offset.
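A receive loop using these macros might look like this sketch; it assumes untagged IPv4 frames and frees each mbuf after processing.

```c
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void
rx_loop_once(uint16_t port)
{
    struct rte_mbuf *mbufs[BURST_SIZE];
    uint16_t i, nb_rx = rte_eth_rx_burst(port, 0, mbufs, BURST_SIZE);

    for (i = 0; i < nb_rx; i++) {
        struct rte_ether_hdr *eth =
            rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr *);

        if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
            /* step past the Ethernet header to the IPv4 header */
            struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(mbufs[i],
                struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
            (void)ip;                    /* process the packet here */
        }
        rte_pktmbuf_free(mbufs[i]);      /* done with this mbuf */
    }
}
```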

Send data

After filling in the eth hdr, ip hdr, and udp hdr, the data can be sent:
rte_eth_tx_burst(port, 0, &txbuf, 1)
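A transmit sketch, showing allocation from a mempool and the Ethernet header fill (the IP/UDP headers are elided). The field names src_addr/dst_addr follow DPDK 21.11+; older releases use s_addr/d_addr.

```c
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

static int
tx_one(uint16_t port, struct rte_mempool *mp,
       const struct rte_ether_addr *src, const struct rte_ether_addr *dst)
{
    struct rte_mbuf *txbuf = rte_pktmbuf_alloc(mp);
    struct rte_ether_hdr *eth;

    if (txbuf == NULL)
        return -1;

    /* reserve room for and fill the Ethernet header */
    eth = (struct rte_ether_hdr *)
        rte_pktmbuf_append(txbuf, sizeof(struct rte_ether_hdr));
    if (eth == NULL) {
        rte_pktmbuf_free(txbuf);
        return -1;
    }
    rte_ether_addr_copy(src, &eth->src_addr);
    rte_ether_addr_copy(dst, &eth->dst_addr);
    eth->ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);

    /* ... append and fill the IP/UDP headers and payload here ... */

    if (rte_eth_tx_burst(port, 0, &txbuf, 1) != 1) {
        rte_pktmbuf_free(txbuf);         /* not queued: free it ourselves */
        return -1;
    }
    return 0;
}
```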

Debugging

  1. Inspect the interface and queue configuration
    Interface configuration:
    rte_eth_link_get()
    rte_eth_promiscuous_get(i)
    rte_eth_dev_get_mtu(i, &mtu)
    Queue configuration:
    Rx
    rte_eth_rx_queue_info_get(port, queue, &queue_info) // shows how many descriptors the queue is configured with and whether it is full
    rte_eth_rx_descriptor_status(port, queue, offset);
    Tx
    rte_eth_tx_queue_info_get(port, queue, &queue_info)
    rte_eth_tx_descriptor_status(port, queue, offset);
    rte_eth_dev_rss_hash_conf_get(port, rss_conf)
    TM
    rte_tm_capabilities_get()
    rte_tm_node_capabilities_get()
    rte_tm_node_type_get()
    rte_tm_level_capabilities_get()
    rte_tm_get_number_of_leaf_nodes()
    rte_tm_node_stats_read()

  2. Inspect the packet counters
    rte_eth_stats_get(port, &stats)
    stats.ipackets is rx packets, ierrors is rx errors, and ibytes is rx bytes. rx_nombuf counts packets dropped for lack of an mbuf; if it is non-zero, the rx pool's mbufs are exhausted, and under traffic bursts you need to increase the number of blocks in that mempool.
    stats.opackets is tx packets, oerrors is tx errors, and obytes is tx bytes.
    q_ipackets is per-queue rx packets, q_errors is per-queue rx errors, q_ibytes is per-queue rx bytes.
    q_opackets is per-queue tx packets, q_obytes is per-queue tx bytes.

rte_eth_xstats_get_names_by_id(port, name, len, ids)
rte_eth_xstats_get_by_id(port, ids, values, len)
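The basic counters can be dumped like this; watching rx_nombuf is the quickest way to spot an undersized rx mempool.

```c
#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

static void
dump_stats(uint16_t port)
{
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port, &stats) != 0)
        return;

    printf("rx: %" PRIu64 " pkts, %" PRIu64 " bytes, %" PRIu64 " errors, "
           "%" PRIu64 " no-mbuf\n",
           stats.ipackets, stats.ibytes, stats.ierrors, stats.rx_nombuf);
    printf("tx: %" PRIu64 " pkts, %" PRIu64 " bytes, %" PRIu64 " errors\n",
           stats.opackets, stats.obytes, stats.oerrors);
}
```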

You can also use dpdk-pdump to produce captures that Wireshark can read, or add pcap generation in rte_eth_add_rx_callback() / rte_eth_add_tx_callback().

  1. Mirror VF traffic to the PF


    For example, on an X550 NIC, the following commands mirror traffic to the PF, where it can then be captured with tcpdump:
    echo "write 0x0000F600 0x702" > /sys/kernel/debug/ixgbe/{PF_PCI_BDF}/reg_ops
    echo "write 0x0000F604 0x704" > /sys/kernel/debug/ixgbe/{PF_PCI_BDF}/reg_ops
    Here {PF_PCI_BDF} is the PF's PCI address, e.g., 0000:00:06.0.
    tcpdump -i eth0 -s 1000 -B 2000 -W 5 -C 1000 -w traffic.pcap
    where -s is the snap length per packet, -B is the OS capture buffer size in KiB (2000 means about 2 MB), -W keeps at most 5 capture files, and -C sets the maximum size of each file in units of 1,000,000 bytes (1000 is about 1000 MB).

Overall process

The indices of rx_ring and sw_ring correspond one to one; data is moved by DMA into mbufs from the rx pool. The tx ring code is similar.

The DD bit (Descriptor Done Status) is used to identify whether a descriptor is available.

Whether the network card supports hardware offloading can be found on this page: https://ark.intel.com/content/www/cn/zh/ark/compare.html?productIds=39774,88209,189534,192558

DDIO (Direct Data I/O, also known as Direct Cache Access, DCA): if the NIC supports DDIO, the CPU and the NIC exchange data directly through the LLC (last-level cache).

On receive, the memory buffer and control structures are prefetched into the cache, which is then accessed directly; the path is external device/controller -> last-level cache -> CPU.

On transmit, the NIC initiates an I/O read request and the data block is sent to the NIC over the bus.

Without DDIO, data goes through memory over the PCI bus: the CPU requests memory access -> memory takes the data from the external device -> the last-level cache fetches it from memory.

RDMA (Remote Direct Memory Access): only the X810 supports it. It is used to bypass the TCP/IP stack, avoiding memory copies and process context switches.

Original link: https://mp.weixin.qq.com/s/-opp8LbqUcVoEGTG7Fmdvg



Origin blog.csdn.net/weixin_60043341/article/details/126623758