DPDK series, part 21: IOVA analysis in DPDK

1. IOVA

IOVA stands for IO virtual address. One of the jobs of the EAL (Environment Abstraction Layer) in DPDK is to map the registers of hardware devices into memory so that drivers can use them. In other words, a user-mode process can use IO addresses directly and perform the IO itself. As mentioned earlier, these addresses can be either physical addresses (PA) or IO virtual addresses (VA), and DPDK treats both under the single notion of IOVA. The upper layer does not distinguish between the two, so the difference is transparent to the application: all a user-mode process sees is an IOVA.
The advantage of IOVA as PA mode is that it works with kernel-space components and with all hardware. Its drawback is that it becomes troublesome wherever memory operations require special permissions. In addition, if physical memory is heavily fragmented, allocation may fail, which makes the whole DPDK initialization fail. The usual remedy is to use larger pages, such as 1G hugepages, and to have the system reserve them at boot. But it is easy to see that this is only a stopgap that treats the symptom rather than the cause.
IOVA as VA mode requires an IOMMU to translate addresses. This amounts to an extra layer of abstraction, and anyone who has done design work knows that abstraction usually costs some efficiency (zero-cost abstractions aside). Its advantage is that memory handling is not bound by the real physical memory layout and needs no special permissions. This gives DPDK a wider range of uses, especially in cloud environments, since an IOMMU fits virtualized setups well. It has shortcomings too: the hardware or platform may not provide an IOMMU, the software may not support it, or the IOMMU itself may be limited.
Under normal circumstances DPDK uses IOVA as PA mode by default, for two reasons: it is safe, and it is widely applicable. However, IOVA as VA mode is recommended where conditions permit. In DPDK 17.11 and later the EAL picks a suitable mode automatically, and the EAL command-line option
--iova-mode
(with the value pa or va) can be used to override that choice.

2. Application of IOVA

Since this is about IO, the functionality is in theory similar to that of a traditional driver. In DPDK, both interrupt mapping and register mapping of the hardware still need help from the kernel, so the device has to be bound to a generic PCI kernel driver. A feature of such a driver is that it is not tied to a specific set of devices. Anyone who has written a hardware driver knows that the device IDs a driver supports are usually hard-coded in it; a generic driver has no such list, so in theory it can be used with any PCI device. In practice, developers know how many pitfalls lie underneath.
In DPDK, the user-space IO (UIO) route, which uses igb_uio, is limited by design to IOVA as PA mode, and this restricts where UIO can be applied. Newer versions recommend the VFIO kernel driver (available since Linux 3.6), which was designed to work with an IOMMU, so both modes can be used with VFIO (although the permission problem of PA mode remains). With later kernels (>= 4.5), setting the enable_unsafe_noiommu_mode option makes it possible to use VFIO without an IOMMU, which is more flexible still.
Of course, the software PMDs (poll mode drivers) in DPDK and some related software do not need a generic PCI driver at all. They reach the hardware through the standard kernel infrastructure, so the IOVA considerations above do not matter for them.
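Whether the no-IOMMU switch mentioned above is turned on can be checked from user space. The following is a minimal sketch, not DPDK code: the function name vfio_noiommu_enabled and its return convention are made up for this example, and it assumes the vfio module exposes the parameter at the usual sysfs location and prints it as Y or N.

```c
#include <assert.h>
#include <stdio.h>

/* Usual sysfs location of the vfio module parameter (assumed to be present). */
#define NOIOMMU_PARAM "/sys/module/vfio/parameters/enable_unsafe_noiommu_mode"

/*
 * Hypothetical helper: returns 1 if the parameter reads as enabled,
 * 0 if it reads as disabled, and -1 if the file is missing or unreadable
 * (for example when the vfio module is not loaded).
 */
int
vfio_noiommu_enabled(const char *path)
{
	FILE *fp = fopen(path, "r");
	int c;

	if (fp == NULL)
		return -1;
	c = fgetc(fp);
	fclose(fp);

	if (c == 'Y' || c == 'y' || c == '1')
		return 1;
	if (c == 'N' || c == 'n' || c == '0')
		return 0;
	return -1;
}

/* A path that does not exist must be reported as "unknown". */
void
vfio_noiommu_selfcheck(void)
{
	assert(vfio_noiommu_enabled("/nonexistent/toy/path") == -1);
}
```

Calling vfio_noiommu_enabled(NOIOMMU_PARAM) on the target machine then tells whether PA mode is the only option for a VFIO-bound device.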

3. Data structure and source code

First look at the relevant address definitions:

/** Physical address */
typedef uint64_t phys_addr_t;
#define RTE_BAD_PHYS_ADDR ((phys_addr_t)-1)

/**
 * IO virtual address type.
 * When the physical addressing mode (IOVA as PA) is in use,
 * the translation from an IO virtual address (IOVA) to a physical address
 * is a direct mapping, i.e. the same value.
 * Otherwise, in virtual mode (IOVA as VA), an IOMMU may do the translation.
 */
typedef uint64_t rte_iova_t;
#define RTE_BAD_IOVA ((rte_iova_t)-1)

These two data types appear in the earlier analyses and in the data structures examined later. Now look at the related conversion code:

//\dpdk-stable-19.11.14\lib\librte_eal\linux\eal\eal_memory.c
rte_iova_t
rte_mem_virt2iova(const void *virtaddr)
{
	if (rte_eal_iova_mode() == RTE_IOVA_VA)
		return (uintptr_t)virtaddr;
	return rte_mem_virt2phy(virtaddr);
}
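
To make the difference between the two modes concrete, here is a toy model of the same conversion. It is a sketch and not DPDK code: every toy_ name is invented for illustration, and the hard-coded page table stands in for the lookup that rte_mem_virt2phy performs on a real system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_BAD_IOVA  UINT64_MAX /* plays the role of RTE_BAD_IOVA */

enum toy_iova_mode { TOY_IOVA_PA, TOY_IOVA_VA };

/* One entry of a made-up page table: virtual page base -> physical page base. */
struct toy_page {
	uint64_t virt_base;
	uint64_t phys_base;
};

/* Toy virtual-to-physical lookup: find the page, keep the offset inside it. */
uint64_t
toy_virt2phys(const struct toy_page *tbl, size_t n, uint64_t va)
{
	uint64_t offset = va % TOY_PAGE_SIZE;
	uint64_t base = va - offset;
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].virt_base == base)
			return tbl[i].phys_base + offset;
	return TOY_BAD_IOVA;
}

/* In VA mode the IOVA is the virtual address itself; in PA mode it is the physical one. */
uint64_t
toy_virt2iova(enum toy_iova_mode mode, const struct toy_page *tbl,
	      size_t n, uint64_t va)
{
	if (mode == TOY_IOVA_VA)
		return va;
	return toy_virt2phys(tbl, n, va);
}

/* Same virtual address, two different IOVAs depending on the mode. */
void
toy_virt2iova_selfcheck(void)
{
	const struct toy_page tbl[] = { { 0x7f0000001000ULL, 0x12345000ULL } };

	assert(toy_virt2iova(TOY_IOVA_VA, tbl, 1, 0x7f0000001010ULL) == 0x7f0000001010ULL);
	assert(toy_virt2iova(TOY_IOVA_PA, tbl, 1, 0x7f0000001010ULL) == 0x12345010ULL);
}
```

The point of the sketch is that VA mode needs no lookup at all, while PA mode depends on a table the process may not be allowed to read.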

Next, the VFIO-related structures:

struct user_mem_map {
	uint64_t addr;
	uint64_t iova;
	uint64_t len;
};
//\lib\librte_eal\linux\eal\eal_vfio.h
struct vfio_iommu_type {
	int type_id;
	const char *name;
	bool partial_unmap;
	vfio_dma_user_func_t dma_user_map_func;
	vfio_dma_func_t dma_map_func;
};
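
A structure like vfio_iommu_type, which pairs a type id with callbacks, is typically used as a dispatch table. The sketch below shows that pattern in isolation. It is not the DPDK implementation: the toy_ names, the two type ids, and the callback bodies are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up type ids; the real values come from the kernel VFIO headers. */
enum { TOY_TYPE1 = 1, TOY_NOIOMMU = 2 };

typedef int (*toy_dma_map_fn)(uint64_t vaddr, uint64_t iova, uint64_t len);

struct toy_iommu_type {
	int type_id;
	const char *name;
	toy_dma_map_fn dma_map;
};

/* Pretend to program an IOMMU mapping: only reject empty ranges. */
int
toy_type1_map(uint64_t vaddr, uint64_t iova, uint64_t len)
{
	(void)vaddr;
	(void)iova;
	return len > 0 ? 0 : -1;
}

/* Without an IOMMU there is nothing to program, so always succeed. */
int
toy_noiommu_map(uint64_t vaddr, uint64_t iova, uint64_t len)
{
	(void)vaddr;
	(void)iova;
	(void)len;
	return 0;
}

const struct toy_iommu_type toy_types[] = {
	{ .type_id = TOY_TYPE1,   .name = "Type 1",   .dma_map = toy_type1_map },
	{ .type_id = TOY_NOIOMMU, .name = "No-IOMMU", .dma_map = toy_noiommu_map },
};

/* Linear search of the table; returns NULL for an unknown type id. */
const struct toy_iommu_type *
toy_find_type(int type_id)
{
	size_t i;

	for (i = 0; i < sizeof(toy_types) / sizeof(toy_types[0]); i++)
		if (toy_types[i].type_id == type_id)
			return &toy_types[i];
	return NULL;
}

void
toy_dispatch_selfcheck(void)
{
	const struct toy_iommu_type *t = toy_find_type(TOY_TYPE1);

	assert(t != NULL && t->dma_map(0x1000, 0x1000, 4096) == 0);
	assert(toy_find_type(99) == NULL);
}
```

Callers only hold a pointer to the selected entry and never need to know which IOMMU flavour sits behind it.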

Next, the supported IOMMU types and the mode decision:

/* IOMMU types we support */
static const struct vfio_iommu_type iommu_types[] = {
	/* x86 IOMMU, otherwise known as type 1 */
	{
		.type_id = RTE_VFIO_TYPE1,
		.name = "Type 1",
		.partial_unmap = false,
		.dma_map_func = &vfio_type1_dma_map,
		.dma_user_map_func = &vfio_type1_dma_mem_map
	},
	/* ppc64 IOMMU, otherwise known as spapr */
	{
		.type_id = RTE_VFIO_SPAPR,
		.name = "sPAPR",
		.partial_unmap = true,
		.dma_map_func = &vfio_spapr_dma_map,
		.dma_user_map_func = &vfio_spapr_dma_mem_map
	},
	/* IOMMU-less mode */
	{
		.type_id = RTE_VFIO_NOIOMMU,
		.name = "No-IOMMU",
		.partial_unmap = true,
		.dma_map_func = &vfio_noiommu_dma_map,
		.dma_user_map_func = &vfio_noiommu_dma_mem_map
	},
};
//4\lib\librte_eal\common\eal_common_bus.c
/*
 * Get iommu class of devices on the bus.
 */
enum rte_iova_mode
rte_bus_get_iommu_class(void)
{
	enum rte_iova_mode mode = RTE_IOVA_DC;
	bool buses_want_va = false;
	bool buses_want_pa = false;
	struct rte_bus * bus;

	TAILQ_FOREACH(bus, &rte_bus_list, next) {
		enum rte_iova_mode bus_iova_mode;

		if (bus->get_iommu_class == NULL)
			continue;

		bus_iova_mode = bus->get_iommu_class();
		RTE_LOG(DEBUG, EAL, "Bus %s wants IOVA as '%s'\n",
			bus->name,
			bus_iova_mode == RTE_IOVA_DC ? "DC" :
			(bus_iova_mode == RTE_IOVA_PA ? "PA" : "VA"));
		if (bus_iova_mode == RTE_IOVA_PA)
			buses_want_pa = true;
		else if (bus_iova_mode == RTE_IOVA_VA)
			buses_want_va = true;
	}
	if (buses_want_va && !buses_want_pa) {
		mode = RTE_IOVA_VA;
	} else if (buses_want_pa && !buses_want_va) {
		mode = RTE_IOVA_PA;
	} else {
		mode = RTE_IOVA_DC;
		if (buses_want_va) {
			RTE_LOG(WARNING, EAL, "Some buses want 'VA' but forcing 'DC' because other buses want 'PA'.\n");
			RTE_LOG(WARNING, EAL, "Depending on the final decision by the EAL, not all buses may be able to initialize.\n");
		}
	}

	return mode;
}
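
The combination rule in rte_bus_get_iommu_class can be tried out without DPDK. The following sketch reproduces only the voting logic over an array of per-bus preferences; the names bus_pref and combine_bus_prefs are invented here, and logging is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-bus preference: don't care, physical, or virtual. */
enum bus_pref { PREF_DC, PREF_PA, PREF_VA };

/*
 * VA wins only if no bus asks for PA, PA wins only if no bus asks for VA,
 * and every other case (conflict or no opinion) yields "don't care".
 */
enum bus_pref
combine_bus_prefs(const enum bus_pref *prefs, size_t n)
{
	bool want_va = false;
	bool want_pa = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (prefs[i] == PREF_PA)
			want_pa = true;
		else if (prefs[i] == PREF_VA)
			want_va = true;
	}

	if (want_va && !want_pa)
		return PREF_VA;
	if (want_pa && !want_va)
		return PREF_PA;
	return PREF_DC;
}

void
combine_bus_prefs_selfcheck(void)
{
	const enum bus_pref only_va[] = { PREF_VA, PREF_DC };
	const enum bus_pref mixed[] = { PREF_VA, PREF_PA };

	assert(combine_bus_prefs(only_va, 2) == PREF_VA);
	assert(combine_bus_prefs(mixed, 2) == PREF_DC);
}
```

Note that a conflict does not pick a winner at this level: the function falls back to "don't care" and leaves the final decision to the EAL, which is what the two warning messages in the original code say.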

Next, how the PCI bus code checks IOMMU support and picks a mode:

//\drivers\bus\pci\linux\pci.c
#if defined(RTE_ARCH_X86)
bool
pci_device_iommu_support_va(const struct rte_pci_device *dev)
{
#define VTD_CAP_MGAW_SHIFT	16
#define VTD_CAP_MGAW_MASK	(0x3fULL << VTD_CAP_MGAW_SHIFT)
	const struct rte_pci_addr *addr = &dev->addr;
	char filename[PATH_MAX];
	FILE *fp;
	uint64_t mgaw, vtd_cap_reg = 0;

	snprintf(filename, sizeof(filename),
		 "%s/" PCI_PRI_FMT "/iommu/intel-iommu/cap",
		 rte_pci_get_sysfs_path(), addr->domain, addr->bus, addr->devid,
		 addr->function);

	fp = fopen(filename, "r");
	if (fp == NULL) {
		/* We don't have an Intel IOMMU, assume VA supported */
		if (errno == ENOENT)
			return true;

		RTE_LOG(ERR, EAL, "%s(): can't open %s: %s\n",
			__func__, filename, strerror(errno));
		return false;
	}

	/* We have an Intel IOMMU */
	if (fscanf(fp, "%" PRIx64, &vtd_cap_reg) != 1) {
		RTE_LOG(ERR, EAL, "%s(): can't read %s\n", __func__, filename);
		fclose(fp);
		return false;
	}

	fclose(fp);

	mgaw = ((vtd_cap_reg & VTD_CAP_MGAW_MASK) >> VTD_CAP_MGAW_SHIFT) + 1;

	/*
	 * Assuming there is no limitation by now. We can not know at this point
	 * because the memory has not been initialized yet. Setting the dma mask
	 * will force a check once memory initialization is done. We can not do
	 * a fallback to IOVA PA now, but if the dma check fails, the error
	 * message should advice for using '--iova-mode pa' if IOVA VA is the
	 * current mode.
	 */
	rte_mem_set_dma_mask(mgaw);
	return true;
}
#elif defined(RTE_ARCH_PPC_64)
bool
pci_device_iommu_support_va(__rte_unused const struct rte_pci_device *dev)
{
	return false;
}
#else
bool
pci_device_iommu_support_va(__rte_unused const struct rte_pci_device *dev)
{
	return true;
}
#endif
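
The bit arithmetic in the x86 branch is easier to follow with numbers. Below is a small worked sketch that reuses the two macros from the listing; the helper names vtd_cap_to_mgaw and max_addr_for_bits and the sample register values are made up for illustration and are not read from real hardware.

```c
#include <assert.h>
#include <stdint.h>

#define VTD_CAP_MGAW_SHIFT 16
#define VTD_CAP_MGAW_MASK  (0x3fULL << VTD_CAP_MGAW_SHIFT)

/* The 6-bit field at bits 21:16 stores the address width minus one. */
unsigned int
vtd_cap_to_mgaw(uint64_t cap)
{
	return (unsigned int)((cap & VTD_CAP_MGAW_MASK) >> VTD_CAP_MGAW_SHIFT) + 1;
}

/* Highest address that fits in the given number of address bits. */
uint64_t
max_addr_for_bits(unsigned int bits)
{
	return bits >= 64 ? UINT64_MAX : (1ULL << bits) - 1;
}

void
vtd_mgaw_selfcheck(void)
{
	/* Field value 0x2f (47) means a 48-bit address width. */
	assert(vtd_cap_to_mgaw(0x2f0000ULL) == 48);
	/* Field value 0x26 (38) means 39 bits, i.e. addresses up to 0x7fffffffff. */
	assert(vtd_cap_to_mgaw(0x260000ULL) == 39);
	assert(max_addr_for_bits(39) == 0x7fffffffffULL);
}
```

With a 39-bit width, a virtual address above 0x7fffffffff cannot be used as an IOVA, which is why the listing records the width as a DMA mask to be checked once memory is initialized.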

enum rte_iova_mode
pci_device_iova_mode(const struct rte_pci_driver *pdrv,
		     const struct rte_pci_device *pdev)
{
	enum rte_iova_mode iova_mode = RTE_IOVA_DC;

	switch (pdev->kdrv) {
	case RTE_KDRV_VFIO: {
#ifdef VFIO_PRESENT
		static int is_vfio_noiommu_enabled = -1;

		if (is_vfio_noiommu_enabled == -1) {
			if (rte_vfio_noiommu_is_enabled() == 1)
				is_vfio_noiommu_enabled = 1;
			else
				is_vfio_noiommu_enabled = 0;
		}
		if (is_vfio_noiommu_enabled != 0)
			iova_mode = RTE_IOVA_PA;
		else if ((pdrv->drv_flags & RTE_PCI_DRV_NEED_IOVA_AS_VA) != 0)
			iova_mode = RTE_IOVA_VA;
#endif
		break;
	}

	case RTE_KDRV_IGB_UIO:
	case RTE_KDRV_UIO_GENERIC:
		iova_mode = RTE_IOVA_PA;
		break;

	default:
		if ((pdrv->drv_flags & RTE_PCI_DRV_NEED_IOVA_AS_VA) != 0)
			iova_mode = RTE_IOVA_VA;
		break;
	}
	return iova_mode;
}
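
The decision made by pci_device_iova_mode boils down to three inputs: the kernel driver the device is bound to, whether VFIO runs in no-IOMMU mode, and whether the PMD sets RTE_PCI_DRV_NEED_IOVA_AS_VA. The sketch below restates it as a pure function so the cases can be checked side by side. The sk_ names are invented for this example, and it assumes a build where VFIO support is compiled in.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_kdrv { SK_KDRV_VFIO, SK_KDRV_UIO, SK_KDRV_OTHER };
enum sk_iova { SK_IOVA_DC, SK_IOVA_PA, SK_IOVA_VA };

/* Per-device mode request, mirroring the switch in the listing above. */
enum sk_iova
sk_pci_iova_mode(enum sk_kdrv kdrv, bool vfio_noiommu, bool drv_needs_va)
{
	switch (kdrv) {
	case SK_KDRV_VFIO:
		/* No IOMMU behind VFIO: only physical addresses can work. */
		if (vfio_noiommu)
			return SK_IOVA_PA;
		return drv_needs_va ? SK_IOVA_VA : SK_IOVA_DC;
	case SK_KDRV_UIO:
		/* igb_uio and uio_pci_generic are always PA. */
		return SK_IOVA_PA;
	default:
		return drv_needs_va ? SK_IOVA_VA : SK_IOVA_DC;
	}
}

void
sk_pci_iova_mode_selfcheck(void)
{
	assert(sk_pci_iova_mode(SK_KDRV_UIO, false, true) == SK_IOVA_PA);
	assert(sk_pci_iova_mode(SK_KDRV_VFIO, true, true) == SK_IOVA_PA);
	assert(sk_pci_iova_mode(SK_KDRV_VFIO, false, true) == SK_IOVA_VA);
	assert(sk_pci_iova_mode(SK_KDRV_VFIO, false, false) == SK_IOVA_DC);
}
```

A device bound to VFIO with a working IOMMU only asks for VA when its driver insists on it; otherwise it reports "don't care" and lets the bus-level vote decide.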

In fact, IOVA and the IOMMU are both means of controlling addresses, and the IOMMU serves IOVA particularly well. In the eyes of the upper-level application there is no physical versus virtual; it only cares about the address it operates on. How that address is finally resolved is not its concern. It is like depositing money in a bank: you hand the money over and do not care what the bank does with it.

4. Analysis

I hope that domestic programmers will, when they have time, work their way down toward the lower layers. That is the direction of the future. Upper-layer application development has boomed for ten years and has now more or less reached a crossroads. Go left, go right, or keep going straight? Opinions will differ. But whichever way you go, you cannot do without the support of the lower layers. Without a foundation, no building, however fine, can withstand wind and rain.

Origin blog.csdn.net/fpcc/article/details/131348677