Detailed explanation of DPDK vhost-user


In software-implemented network I/O paravirtualization, vhost-user strikes a nearly ideal balance between performance, flexibility, and compatibility. Although it was proposed more than four years ago and new features keep being added, its core design has remained the same. This article walks through the vhost-user architecture in detail, from how the data path is established to how packets are transmitted. The analysis is based on DPDK 17.11.

The best-known implementation of vhost-user is DPDK's vhost library, which contains the complete virtio back-end logic and can be abstracted directly into a port in a virtual switch. The most mainstream software virtual switch, OVS (Open vSwitch), can be built against the DPDK library to use it.

[Figure: typical vhost-user deployment scenario]

The most typical application scenario of vhost-user is shown in the figure. OVS creates a vhost port for each virtual machine to implement the virtio back-end driver logic, including responding to the VM's requests to send and receive packets and performing the packet copies. Each VM actually runs inside an independent QEMU process. QEMU emulates the virtual machine's devices and integrates the KVM communication module, so the QEMU process is the VM's main process and contains the vCPU threads, among others. The QEMU command line can select virtio as the NIC device type, in which case QEMU presents a virtio device to each VM; together with the virtio driver running inside the VM, this forms the virtio front end.

1. Establishing the connection

As mentioned above, the VM actually runs inside a QEMU process, so when the VM starts, if it wants to connect to an OVS vhost port and bring up the data path, a control channel must be established first. This control channel is socket-based inter-process communication between the OVS process and the QEMU process, not with the VM itself. The communication follows its own protocol standard and message format. All the message types can be seen in lib/librte_vhost/vhost_user.c of DPDK:

static const char *vhost_message_str[VHOST_USER_MAX] = {
    [VHOST_USER_NONE] = "VHOST_USER_NONE",
    [VHOST_USER_GET_FEATURES] = "VHOST_USER_GET_FEATURES",
    [VHOST_USER_SET_FEATURES] = "VHOST_USER_SET_FEATURES",
    [VHOST_USER_SET_OWNER] = "VHOST_USER_SET_OWNER",
    [VHOST_USER_RESET_OWNER] = "VHOST_USER_RESET_OWNER",
    [VHOST_USER_SET_MEM_TABLE] = "VHOST_USER_SET_MEM_TABLE",
    [VHOST_USER_SET_LOG_BASE] = "VHOST_USER_SET_LOG_BASE",
    [VHOST_USER_SET_LOG_FD] = "VHOST_USER_SET_LOG_FD",
    [VHOST_USER_SET_VRING_NUM] = "VHOST_USER_SET_VRING_NUM",
    [VHOST_USER_SET_VRING_ADDR] = "VHOST_USER_SET_VRING_ADDR",
    [VHOST_USER_SET_VRING_BASE] = "VHOST_USER_SET_VRING_BASE",
    [VHOST_USER_GET_VRING_BASE] = "VHOST_USER_GET_VRING_BASE",
    [VHOST_USER_SET_VRING_KICK] = "VHOST_USER_SET_VRING_KICK",
    [VHOST_USER_SET_VRING_CALL] = "VHOST_USER_SET_VRING_CALL",
    [VHOST_USER_SET_VRING_ERR]  = "VHOST_USER_SET_VRING_ERR",
    [VHOST_USER_GET_PROTOCOL_FEATURES]  = "VHOST_USER_GET_PROTOCOL_FEATURES",
    [VHOST_USER_SET_PROTOCOL_FEATURES]  = "VHOST_USER_SET_PROTOCOL_FEATURES",
    [VHOST_USER_GET_QUEUE_NUM]  = "VHOST_USER_GET_QUEUE_NUM",
    [VHOST_USER_SET_VRING_ENABLE]  = "VHOST_USER_SET_VRING_ENABLE",
    [VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
    [VHOST_USER_NET_SET_MTU]  = "VHOST_USER_NET_SET_MTU",
    [VHOST_USER_SET_SLAVE_REQ_FD]  = "VHOST_USER_SET_SLAVE_REQ_FD",
    [VHOST_USER_IOTLB_MSG]  = "VHOST_USER_IOTLB_MSG",
};

As versions iterate, more features and more message types keep being added, but the control channel's main functions remain: transmitting the data structures needed to establish the data path; controlling the opening, closing, and resetting of the data-path connection; and transmitting disconnect messages during live migration or VM shutdown.

From VM startup to data-path establishment, the transmitted messages are recorded in the OVS log file. After sorting these messages out, the actual flow is as shown in the following figure:

[Figure: vhost-user control channel message flow]

The left half covers feature negotiation: since the back-end and front-end drivers do not initially know each other's protocol version, these features must be negotiated first. Once negotiation is complete, the next step is to transfer the data structures needed to establish the data path, mainly the file descriptors of the shared memory together with the memory-address mapping relationship, and the state of the virtio virtual queues. The most critical parts are explained in detail below.

Setting up shared memory

In a virtual machine, memory is allocated in advance by the QEMU process. Once QEMU is configured to use vhost-user for networking, the VM's memory access mode must be set to shared. The relevant command-line parameter is also documented by DPDK:

-object memory-backend-file,share=on,...

This means the VM's memory must be pre-allocated huge pages that are allowed to be shared with other processes. The reason was mentioned in the previous article: both OVS and QEMU are user-space processes, and during packet copying the OVS process must be able to access the VM buffers inside the QEMU process, so the VM's memory must be shared with the OVS process.

Memory-saving schemes that dynamically resize VM memory, such as virtio_balloon, are no longer usable. The reason is simple: OVS cannot keep re-establishing its shared mapping of the VM memory every time that memory changes.

In the vhost library, the code with which the back-end driver receives this message from the control channel and maps the VM memory is as follows:

static int vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
{
   ...
   mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);

   mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_POPULATE, fd, 0);
   ...
   reg->mmap_addr = mmap_addr;
   reg->mmap_size = mmap_size;
   reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
                      mmap_offset;
   ...
}

It uses the Linux library function mmap() to map the VM memory; see the Linux programming manual (http://man7.org/linux/man-pages/man2/mmap.2.html) for details. Notice that it does one more thing before mapping: it aligns the size. This is because mmap() maps memory in units of pages, meaning the starting address and size must be integer multiples of the page size, which in a huge-page environment is 2 MB or 1 GB. Only after alignment is the shared mapping guaranteed to be correct, so that subsequent memory accesses do not cross the boundary.
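The alignment step can be sketched outside DPDK. The helper below mirrors what the RTE_ALIGN_CEIL macro computes; the name and the example sizes are illustrative, not DPDK's code:

```c
#include <stdint.h>

/* Round `size` up to the next multiple of `align` (a power of two).
 * With 2 MB huge pages, a 3 MB guest region must be mapped as two
 * whole pages, i.e. 4 MB. */
static uint64_t align_ceil(uint64_t size, uint64_t align)
{
    return (size + align - 1) & ~(align - 1);
}
```

With this, the mmap() length always covers whole huge pages, so the mapping cannot end in the middle of a page.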

The last three lines save the address-translation information. Several address translations are involved here, and address translation is the most complicated part of vhost-user. From the QEMU process's perspective, VM memory has two addresses: GPA (Guest Physical Address) and QVA (QEMU Virtual Address). From the OVS process's perspective, it also has two: GPA (Guest Physical Address) and VVA (vSwitch Virtual Address).

The GPA is the most important address in virtio: in the virtio standard, each entry in the virtqueue that stores a packet buffer address is expressed as a GPA. But to the operating system, a process accesses memory through virtual addresses; the physical address is hidden. In other words, a process can only access memory once it has the virtual address corresponding to the physical address (the pointers we use in programming are all virtual addresses).

This is easy for the QEMU process: after all, as the VM's main process it established the mapping between QVA and GPA when it pre-allocated the VM's memory.

[Figure: shared memory mapping relationship]

For the OVS process, take the figure above as an example. mmap() returns a virtual address, which is the start of the VM memory mapped into the OVS address space (that is, a region of, say, 1 GB starting at mmap_addr). This way, given a GPA, the OVS process can compute the corresponding VVA via the address offset and then access the memory.

In practice, however, the mapped memory is more complicated than the illustration: the VM memory may be divided into several blocks, and some blocks' starting GPA is not 0. A GPA offset must therefore be recorded for each mapped block to fully preserve the correspondence between VVA and GPA. These correspondences are the basis on which the data path is later implemented.
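A minimal sketch of the per-block bookkeeping and lookup might look as follows. The structure and function names here are illustrative, not DPDK's, although the vhost library saves equivalent fields per region after VHOST_USER_SET_MEM_TABLE:

```c
#include <stddef.h>
#include <stdint.h>

/* One mapped block of guest memory (illustrative). */
struct guest_region {
    uint64_t guest_phys_addr;   /* GPA where this block starts */
    uint64_t size;              /* length of the block         */
    uint64_t host_user_addr;    /* VVA it was mmap()ed to      */
};

/* Walk the region table and translate a GPA to a VVA.
 * Returns 0 if the GPA falls in no mapped block. */
static uint64_t gpa_to_vva(const struct guest_region *regs, size_t n,
                           uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= regs[i].guest_phys_addr &&
            gpa <  regs[i].guest_phys_addr + regs[i].size)
            return gpa - regs[i].guest_phys_addr + regs[i].host_user_addr;
    }
    return 0;
}
```

Every buffer address the back end reads out of the virtqueue goes through exactly this kind of translation before the packet copy.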

Another technical detail worth noting: a file descriptor cannot normally be passed directly through a socket. From a programming perspective a file descriptor is just an int variable, and passing it directly would only transmit an integer that means nothing in the receiving process. A small trick is used here; it is worth studying if you are interested.
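The trick in question is the SCM_RIGHTS ancillary-data mechanism of UNIX domain sockets: the descriptor is carried as control data, and the kernel installs a duplicate of it in the receiving process. A minimal sketch of the sending side (an illustrative helper, not DPDK's actual code):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one byte of payload with `fd` attached as SCM_RIGHTS
 * ancillary data over the UNIX socket `sock`.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd)
{
    char buf[1] = { 0 };
    struct iovec iov = { .iov_base = buf, .iov_len = 1 };
    union { struct cmsghdr hdr; char ctl[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.ctl;
    msg.msg_controllen = sizeof(u.ctl);

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));    /* the fd rides here */

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```

This is how QEMU hands the memory-region and eventfd descriptors to the vhost back end over the vhost-user socket.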

Setting up the virtual queue information

The virtual queue consists of three parts: the avail ring, the used ring, and the desc ring. The roles and design ideas of these three rings deserve a slightly more detailed description.

A traditional NIC design has only one ring, but two pointers, managed respectively by the driver and the NIC device. This is a classic producer-consumer problem: the side producing packets advances its pointer, and the other side chases it until everything is consumed. The drawback is that processing can only proceed in order: until an earlier descriptor has been handled, the later ones must wait.

In a virtual queue, however, producer and consumer are decoupled. The real packet descriptors (that is, the packets' buffer addresses) are stored in the desc ring, while the avail ring and used ring store indexes into the desc ring. The front-end driver places produced descriptors into the avail ring, and the back-end driver places consumed descriptors into the used ring (in fact, both write the index, i.e. the serial number, of the entry in the desc ring). This way the front-end driver can reclaim used descriptors according to the used ring: even if some descriptor in the middle is still occupied, it does not block reclaiming the descriptors after it. In addition, DPDK optimizes this structure with cache-level prefetching to make it more efficient.
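The index discipline described above can be modeled with a toy version of the split rings. The field names follow the virtio layout, but this is an illustrative model, not the real structures:

```c
#include <stdint.h>

#define VQ_SIZE 256  /* must be a power of two for the mask trick */

struct toy_vq {
    uint16_t avail_ring[VQ_SIZE];  /* driver -> device: desc indexes */
    uint16_t avail_idx;            /* free-running producer index    */
    uint16_t used_ring[VQ_SIZE];   /* device -> driver: desc indexes */
    uint16_t used_idx;             /* free-running consumer index    */
};

/* Front end publishes descriptor `desc_id` for the back end. */
static void vq_make_avail(struct toy_vq *vq, uint16_t desc_id)
{
    vq->avail_ring[vq->avail_idx & (VQ_SIZE - 1)] = desc_id;
    vq->avail_idx++;   /* slot is index mod ring size */
}

/* Back end returns a consumed descriptor, possibly out of order
 * relative to how the front end published them. */
static void vq_mark_used(struct toy_vq *vq, uint16_t desc_id)
{
    vq->used_ring[vq->used_idx & (VQ_SIZE - 1)] = desc_id;
    vq->used_idx++;
}
```

Because the used ring carries desc indexes rather than positions, a slow descriptor in the middle never blocks the ones completed after it.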

After the control channel has established the shared memory, the back end must also build the same virtual-queue data structures as the front end. The required information mainly includes: the number of desc ring entries (this differs between front-end drivers, e.g. the DPDK virtio driver vs. the kernel virtio driver); where the avail ring last stopped (this mainly matters for reconnection or live migration; on the first connection it should be 0); the start addresses of the three rings of the virtual queue; and the notification signals.

Once these messages are processed, the back-end driver uses the received start addresses to build virtual-queue data structures identical to the front end's, and is then ready to send and receive packets. The last two items, the eventfds, are used in scenarios that require notification. For example, if the VM uses the kernel virtio driver, every time the OVS vhost port sends packets to the VM it must signal the eventfd to notify the kernel driver to receive them. Under a polling driver, these eventfds are meaningless.
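These kick/call notifications ride on Linux eventfd: a write increments a 64-bit counter, and a read collects and resets it. A minimal sketch with illustrative helper names:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Notify the peer: add 1 to the eventfd counter. */
static int efd_notify(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Collect pending notifications: reads the accumulated counter
 * value and resets it to zero. */
static int efd_collect(int efd, uint64_t *val)
{
    return read(efd, val, sizeof(*val)) == sizeof(*val) ? 0 : -1;
}
```

Several kicks issued before the receiver wakes up coalesce into one counter value, which is exactly the behavior a vring notification needs.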

In addition, VHOST_USER_GET_VRING_BASE is a rather special message: QEMU sends it to the OVS process only when the virtual machine is shut down or disconnected, and it means tearing down the data path.

2. Data path processing

The data path is implemented in lib/librte_vhost/virtio_net.c of DPDK. Although the code looks lengthy, most of it deals with various features and hardware offloads; the main logic is very simple.

The main functions responsible for sending and receiving packets are:

uint16_t rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
    struct rte_mbuf **pkts, uint16_t count)
// packet direction: OVS -> VM
uint16_t rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
    struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
// packet direction: VM -> OVS

The sending process can be summarized as follows. When OVS sends packets to the VM, the corresponding vhost port reads the available buffer addresses from the avail ring, converts them to VVAs, and copies the packets; after the copy completes, it signals the eventfd to notify the VM. When the VM sends to OVS, the direction is reversed: data is copied from the packet buffers in the VM into DPDK mbuf structures. An annotated version of the code follows. Don't worry about the IOMMU and iova parts; those are newer vhost-user features, and for our purposes the iova can be understood as the GPA.

uint16_t
rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
    struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
{
    struct virtio_net *dev;
    struct rte_mbuf *rarp_mbuf = NULL;
    struct vhost_virtqueue *vq;
    uint32_t desc_indexes[MAX_PKT_BURST];
    uint32_t used_idx;
    uint32_t i = 0;
    uint16_t free_entries;
    uint16_t avail_idx;

    dev = get_device(vid);   // look up the device instance by vid
    if (!dev)
        return 0;

    if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->nr_vring))) {
        RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n",
            dev->vid, __func__, queue_id);
        return 0;
    }          // check that the virtqueue id is valid

    vq = dev->virtqueue[queue_id];      // get the virtqueue

    if (unlikely(rte_spinlock_trylock(&vq->access_lock) == 0))  // lock the virtqueue
        return 0;

    if (unlikely(vq->enabled == 0))     // if the vq is not enabled, unlock it and exit
        goto out_access_unlock;

    vq->batch_copy_nb_elems = 0;  // number of packets to copy in this batch

    if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
        vhost_user_iotlb_rd_lock(vq);

    if (unlikely(vq->access_ok == 0))
        if (unlikely(vring_translate(dev, vq) < 0))  // IOMMU in use: translate iova to vva
            goto out;

    if (unlikely(dev->dequeue_zero_copy)) {   // zero-copy dequeue
        struct zcopy_mbuf *zmbuf, *next;
        int nr_updated = 0;

        for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
             zmbuf != NULL; zmbuf = next) {
            next = TAILQ_NEXT(zmbuf, next);

            if (mbuf_is_consumed(zmbuf->mbuf)) {
                used_idx = vq->last_used_idx++ & (vq->size - 1);
                update_used_ring(dev, vq, used_idx,
                         zmbuf->desc_idx);
                nr_updated += 1;

                TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
                restore_mbuf(zmbuf->mbuf);
                rte_pktmbuf_free(zmbuf->mbuf);
                put_zmbuf(zmbuf);
                vq->nr_zmbuf -= 1;
            }
        }

        update_used_idx(dev, vq, nr_updated);
    }

    /*
     * Construct a RARP broadcast packet, and inject it to the "pkts"
     * array, to looks like that guest actually send such packet.
     *
     * Check user_send_rarp() for more information.
     *
     * broadcast_rarp shares a cacheline in the virtio_net structure
     * with some fields that are accessed during enqueue and
     * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
     * result in false sharing between enqueue and dequeue.
     *
     * Prevent unnecessary false sharing by reading broadcast_rarp first
     * and only performing cmpset if the read indicates it is likely to
     * be set.
     */
    // construct a RARP packet and inject it into the mbuf (pkts) array, as if the guest had sent it
    if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
            rte_atomic16_cmpset((volatile uint16_t *)
                &dev->broadcast_rarp.cnt, 1, 0))) {

        rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);    // allocate an mbuf from the mempool for the RARP packet
        if (rarp_mbuf == NULL) {
            RTE_LOG(ERR, VHOST_DATA,
                "Failed to allocate memory for mbuf.\n");
            return 0;
        }
        // build the RARP packet; returns 0 on success
        if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
            rte_pktmbuf_free(rarp_mbuf);
            rarp_mbuf = NULL;
        } else {
            count -= 1;
        }
    }
    // count pending packets: current avail index minus where we stopped last time; if none, release the vq lock and exit
    free_entries = *((volatile uint16_t *)&vq->avail->idx) -
            vq->last_avail_idx;
    if (free_entries == 0)
        goto out;

    LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);

    /* Prefetch available and used ring */
    // prefetch the avail and used ring entries at the last stop positions
    avail_idx = vq->last_avail_idx & (vq->size - 1);
    used_idx  = vq->last_used_idx  & (vq->size - 1);
    rte_prefetch0(&vq->avail->ring[avail_idx]);
    rte_prefetch0(&vq->used->ring[used_idx]);

    // packets to handle this round: the minimum of the pending total and the batch limit
    count = RTE_MIN(count, MAX_PKT_BURST);
    count = RTE_MIN(count, free_entries);
    LOG_DEBUG(VHOST_DATA, "(%d) about to dequeue %u buffers\n",
            dev->vid, count);

    /* Retrieve all of the head indexes first to avoid caching issues. */
    // fetch each packet's desc-ring index from the avail ring into the local desc_indexes array
    for (i = 0; i < count; i++) {
        avail_idx = (vq->last_avail_idx + i) & (vq->size - 1);
        used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
        desc_indexes[i] = vq->avail->ring[avail_idx];

        if (likely(dev->dequeue_zero_copy == 0))    // without zero-copy dequeue, write the index into the used ring right away
            update_used_ring(dev, vq, used_idx, desc_indexes[i]);
    }

    /* Prefetch descriptor index. */
    rte_prefetch0(&vq->desc[desc_indexes[0]]);  // prefetch the first descriptor to process
    for (i = 0; i < count; i++) {
        struct vring_desc *desc, *idesc = NULL;
        uint16_t sz, idx;
        uint64_t dlen;
        int err;

        if (likely(i + 1 < count))
            rte_prefetch0(&vq->desc[desc_indexes[i + 1]]); // prefetch the next entry

        if (vq->desc[desc_indexes[i]].flags & VRING_DESC_F_INDIRECT) {  // if this entry uses an indirect descriptor table, handle it accordingly
            dlen = vq->desc[desc_indexes[i]].len;
            desc = (struct vring_desc *)(uintptr_t)     // translate the address to a vva
                vhost_iova_to_vva(dev, vq,
                        vq->desc[desc_indexes[i]].addr,
                        &dlen,
                        VHOST_ACCESS_RO);
            if (unlikely(!desc))
                break;

            if (unlikely(dlen < vq->desc[desc_indexes[i]].len)) {
                /*
                 * The indirect desc table is not contiguous
                 * in process VA space, we have to copy it.
                 */
                idesc = alloc_copy_ind_table(dev, vq,
                        &vq->desc[desc_indexes[i]]);
                if (unlikely(!idesc))
                    break;

                desc = idesc;
            }

            rte_prefetch0(desc);   // prefetch the indirect descriptor table
            sz = vq->desc[desc_indexes[i]].len / sizeof(*desc);
            idx = 0;
        }
        else {
            desc = vq->desc;    // the desc array
            sz = vq->size;      // number of entries
            idx = desc_indexes[i];  // desc index of this packet
        }

        pkts[i] = rte_pktmbuf_alloc(mbuf_pool);   // allocate an mbuf for this packet
        if (unlikely(pkts[i] == NULL)) {    // allocation failed, stop processing
            RTE_LOG(ERR, VHOST_DATA,
                "Failed to allocate memory for mbuf.\n");
            free_ind_table(idesc);
            break;
        }

        err = copy_desc_to_mbuf(dev, vq, desc, sz, pkts[i], idx,
                    mbuf_pool);
        if (unlikely(err)) {
            rte_pktmbuf_free(pkts[i]);
            free_ind_table(idesc);
            break;
        }

        if (unlikely(dev->dequeue_zero_copy)) {
            struct zcopy_mbuf *zmbuf;

            zmbuf = get_zmbuf(vq);
            if (!zmbuf) {
                rte_pktmbuf_free(pkts[i]);
                free_ind_table(idesc);
                break;
            }
            zmbuf->mbuf = pkts[i];
            zmbuf->desc_idx = desc_indexes[i];

            /*
             * Pin lock the mbuf; we will check later to see
             * whether the mbuf is freed (when we are the last
             * user) or not. If that's the case, we then could
             * update the used ring safely.
             */
            rte_mbuf_refcnt_update(pkts[i], 1);

            vq->nr_zmbuf += 1;
            TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
        }

        if (unlikely(!!idesc))
            free_ind_table(idesc);
    }
    vq->last_avail_idx += i;

    if (likely(dev->dequeue_zero_copy == 0)) {  // perform the actual data copies for this batch
        do_data_copy_dequeue(vq);
        vq->last_used_idx += i;
        update_used_idx(dev, vq, i);  // update the used ring index and notify the guest via eventfd
    }

out:
    if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
        vhost_user_iotlb_rd_unlock(vq);

out_access_unlock:
    rte_spinlock_unlock(&vq->access_lock);

    if (unlikely(rarp_mbuf != NULL)) {  // if a RARP packet is pending, put it at the head of the pkts array so the vswitch's MAC learning table updates first
        /*
         * Inject it to the head of "pkts" array, so that switch's mac
         * learning table will get updated first.
         */
        memmove(&pkts[1], pkts, i * sizeof(struct rte_mbuf *));
        pkts[0] = rarp_mbuf;
        i += 1;
    }

    return i;
}

Batch processing is widely used in software-implemented network functions. The batch size is configurable, but by convention at most 32 packets are processed per batch.
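The clamping that enforces this limit can be sketched as follows, mirroring the two RTE_MIN() calls near the top of rte_vhost_dequeue_burst() (an illustrative helper, not DPDK's code):

```c
#include <stdint.h>

#define MAX_PKT_BURST 32  /* the conventional upper bound */

/* Clamp the caller's requested burst: never exceed the batch limit,
 * and never exceed the packets actually pending in the avail ring. */
static uint16_t clamp_burst(uint16_t requested, uint16_t free_entries)
{
    uint16_t n = requested < MAX_PKT_BURST ? requested : MAX_PKT_BURST;
    return n < free_entries ? n : free_entries;
}
```

Keeping the burst small bounds per-iteration latency while still amortizing the fixed per-call costs over many packets.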

3. OVS polling logic

OVS contains many such vhost ports, and DPDK-accelerated OVS polls them with cores pinned to ports. The many ports are therefore distributed as evenly as possible across a limited set of CPU cores. Some vendors even balance load at the granularity of individual queues, as in the production deployment shown in the figure below.

[Figure: OVS multi-queue load balancing]

There are also optimization concerns when pinning physical and virtual ports to cores, mainly to avoid read/write locks. For example, binding all physical NIC ports to one core and the software virtual ports to another ensures that sends and receives for the same port rarely contend on a single core.

The work of each OVS polling core is also very simple. For each polled port, it checks how many packets are waiting, receives them, looks them up in the flow table (mainly five-tuple matching, to find the destination port), then calls the destination port's send function (such as rte_vhost_enqueue_burst for a vhost port), and moves on to poll the next port once all are handled. SLA and QoS mechanisms usually run just before the port's send function is called: for example, checking whether the port's token bucket has enough tokens to send these packets; if rate limiting is required, some packets are selectively dropped before the send function executes.
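The token-bucket check mentioned above can be sketched like this. This is a simplified model with illustrative names; real OVS QoS policers are more elaborate:

```c
#include <stdint.h>

/* Minimal token bucket: tokens are bytes, refilled at a fixed rate
 * and capped at the bucket depth (the allowed burst). */
struct toy_tbf {
    uint64_t tokens;   /* bytes currently available   */
    uint64_t burst;    /* bucket depth in bytes       */
    uint64_t rate;     /* bytes added per refill tick */
};

/* Called once per tick: add tokens, never exceeding the depth. */
static void tbf_refill(struct toy_tbf *t)
{
    t->tokens += t->rate;
    if (t->tokens > t->burst)
        t->tokens = t->burst;
}

/* Returns 1 if a packet of `len` bytes may be sent, 0 if it must
 * be dropped for rate limiting. */
static int tbf_admit(struct toy_tbf *t, uint64_t len)
{
    if (t->tokens < len)
        return 0;
    t->tokens -= len;
    return 1;
}
```

The polling core would call tbf_admit() per packet just before the port's send function, dropping whatever the bucket cannot cover.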

In fact, no matter what software implements the forwarding logic, it follows the same philosophy: keep the data path as simple as possible; the surrounding mechanisms and message handling can be as complex and varied as needed.

Original link: https://mp.weixin.qq.com/s/cEC10Fmt0G9xfVSdG3KJhg

