Hardware virtualization and related logic

Edited from: https://mp.weixin.qq.com/s/3zHRuKKexffJQ7JkhzGhtA

This article is selected from the Jishu column "IC Design" and is republished with permission from the WeChat public account Xingong Awen. It walks through hardware virtualization and the logic around it, to make the definition of virtualization easier to grasp.

Some time ago, while sorting out some concepts, I arrived at a new understanding of virtualization. After reading up on the related concepts and technologies, I found I could infer what problem a given technique was designed to solve and how it achieves that.

This article lays out hardware virtualization and the logic around it, which may be helpful for understanding virtualization.

Virtualization is the application of the layering principle through enforced modularity, whereby the exposed virtual resource is identical to the underlying physical resource being virtualized.

In other words, virtualization layers the system by means of modules, so that the virtual resources it exposes are identical to the underlying physical resources.

—— Hardware and Software Support for Virtualization

This definition applies the layering principle and uses modules for isolation: seen from a given interface, virtual resources are guaranteed to behave like the physical ones. With virtualization support, a lower-layer resource can be multiplexed among several upper-layer virtual resources; "lower" and "upper" are relative terms. There are many different virtualization technologies, both software and hardware. Docker, for example, is software-level virtualization, namely virtualization of the operating system.

[figure]

Virtualization technology is a very broad topic; this article covers only hardware virtualization, and specifically hardware-assisted virtualization.

In a minimalist system structure model, an SoC can be divided into three parts: generalized DMA, memory, and interconnect. The earlier article on data transmission described two ways a module can initiate transfers: self-starting modules, typified by DMA engines, and started (passively triggered) modules, typified by accelerators. Both kinds belong to the generalized DMA category. The SoC system thus comprises self-starting DMA, started DMA, memory, and the interconnect bus, as shown in the figure below.

[figure]

In my understanding, any hardware system structure is essentially equivalent to this model; the remaining modules are all introduced to accelerate one of the elements above, for example a Cache reduces the latency of DMA accesses to Memory. Above the hardware sits the software operating system, and virtualization allows this one set of hardware to be used by multiple operating systems.

As mentioned in the previous article, a shared language is the prerequisite for dialogue, and the language of a system is the address. Analyzing address domains is therefore an important part of understanding a system. Within one address domain, data moves directly; between different address domains, addresses must be translated. If no translation mechanism exists between two domains, they cannot exchange data at all, which is precisely what provides isolation and security.

In a scenario with virtual machines, the address domains can be analyzed roughly as in the figure below. An address is just a binary number, so distinguishing address domains amounts to widening the bit field and naming the new bits: here the Virtual Machine Identifier (VMID) and Address Space Identifier (ASID). A process uses Virtual Addresses (VA), written system-wide as vmid + asid + va, and there is a translation relationship (vmid + asid + va) -> pa (abbreviated VA->PA below). Virtual addresses map to actual physical addresses, and this mapping is called a page table.

[figure]
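As a toy sketch (not real hardware), the fully qualified translation key can be modeled as a (vmid, asid, va) tuple looked up in a page-table mapping; all values here are illustrative:

```python
# Toy model: the system-wide translation key is (vmid, asid, va_page).
# Real page tables are multi-level radix trees; a flat dict is enough to
# illustrate that two VMs can map the same VA to different PAs.
PAGE_SIZE = 4096

page_table = {
    (0, 1, 0x1000): 0x8000_0000,  # VM 0, process 1
    (1, 1, 0x1000): 0x9000_0000,  # VM 1, process 1: same VA, different PA
}

def translate(vmid: int, asid: int, va: int) -> int:
    """Translate (vmid, asid, va) to a physical address, page-granular."""
    page = va & ~(PAGE_SIZE - 1)
    offset = va & (PAGE_SIZE - 1)
    pa_page = page_table[(vmid, asid, page)]
    return pa_page | offset
```

Two virtual machines can thus use identical VAs without colliding, which is exactly the isolation property described above.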

Looking at the interconnect bus of the hardware system, data transfers between different modules must share one address domain, which is the Physical Address (PA) in the figure.

The VMs and processes above are software-level concepts that live inside the CPU. When a process uses a VA to access the Memory system, address translation must be performed, and the hardware logic responsible for this translation is the MMU.

At this point, the hardware system structure model is as follows. The MMU translates the virtual address of the CPU's current process into a PA and sends it to the interconnect bus to access other modules. Since the VA->PA page tables are stored in Memory, a cache, the Translation Lookaside Buffer (TLB), is added so that not every translation has to query the page table in memory.

[figure]
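The TLB idea can be sketched as a small cache consulted before the in-memory page table; the capacity and LRU policy here are illustrative, real TLBs are set-associative hardware:

```python
from collections import OrderedDict

class TLB:
    """Tiny LRU translation cache sitting in front of the page table."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()   # (asid, va_page) -> pa_page
        self.hits = self.misses = 0

    def lookup(self, key, walk):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)       # refresh LRU position
            return self.entries[key]
        self.misses += 1
        pa = walk(key)                          # slow path: query memory
        self.entries[key] = pa
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return pa

memory_page_table = {(1, 0x1000): 0x8000_0000}
tlb = TLB()
pa1 = tlb.lookup((1, 0x1000), memory_page_table.__getitem__)  # miss
pa2 = tlb.lookup((1, 0x1000), memory_page_table.__getitem__)  # hit
```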

Note the address domain analysis diagram: an application inside a process reaches Memory through two stages of address translation. Stage 1 translates the VA into an address within the corresponding virtual machine, the Intermediate Physical Address (IPA); Stage 2 translates the IPA into the PA. The MMU performs both stages, along with the associated permission and attribute checks. Depending on the scenario, either stage can be bypassed; the translation relationship is shown in the figure below. Many system features, such as isolation and security, are built on this translation process: a process accesses memory through VAs, and without the relevant page table the access simply cannot happen.

[figure]
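The two stages and their bypass options can be sketched as follows; the tables and addresses are illustrative:

```python
# Toy two-stage translation: Stage 1 (VA -> IPA) uses the guest's own
# page table; Stage 2 (IPA -> PA) uses the hypervisor's table. Either
# stage may be bypassed (identity-mapped), as the text describes.
stage1 = {0x1000: 0x4_0000}        # guest table: VA page -> IPA page
stage2 = {0x4_0000: 0x8000_0000}   # hypervisor table: IPA page -> PA page

def two_stage(va_page, s1_bypass=False, s2_bypass=False):
    ipa = va_page if s1_bypass else stage1[va_page]
    pa = ipa if s2_bypass else stage2[ipa]
    return pa
```

Bypassing Stage 2, for instance, yields the IPA directly, which matches the case of software running without a hypervisor underneath.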

Started DMA includes on-chip devices and PCIe devices: on-chip devices are hardware accelerators or IO ports integrated in the chip, while PCIe devices attach through PCIe interfaces. Device virtualization means presenting one physical device to the upper software layers as multiple logical devices, which virtual machines and their processes can use directly.

As described in the data transmission article, a device must both accept configuration requests from other modules and access other modules in the system itself. Configuration requests arrive from the interconnect bus, which uses PAs, so each logical device's configuration space occupies real physical addresses: however many logical devices there are, that much configuration space must be allocated within the system's physical address map. The VMM first maps the configuration space PAs to VAs; the virtual machine and its processes then access the device directly through those VAs, and the MMU translates them back to PAs before sending them to the interconnect bus.

When the device receives a configuration request directly from a virtual machine or process, the addresses it is handed are VAs. To use those VAs to access other modules, it must first go through the interconnect bus, so an address translation module has to be introduced between the device's access port and the interconnect to convert VA to PA.

Devices adopt the same strategy of widening the signal bit field to distinguish address domains, just under different names: in the Arm world they are called StreamID and SubstreamID, and in the PCIe world RequesterID and PASID; functionally they are equivalent. The translation module converts (StreamID + SubstreamID + VA) into a PA and sends it to the interconnect bus to access other modules. This module is the SMMU (or IOMMU).
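A minimal sketch of the SMMU front end, assuming illustrative table contents: the configuration lookup maps the device-side identifiers to the CPU-side ones, after which translation reuses the same page tables as the MMU.

```python
# Toy SMMU: (StreamID, SubstreamID) is first resolved to (vmid, asid),
# then the ordinary (vmid, asid, va) page table applies.
stream_table = {(7, 2): (0, 1)}             # (StreamID, SubstreamID) -> (vmid, asid)
page_table = {(0, 1, 0x1000): 0x8000_0000}  # same view the MMU uses

def smmu_translate(stream_id, substream_id, va_page):
    vmid, asid = stream_table[(stream_id, substream_id)]
    return page_table[(vmid, asid, va_page)]
```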

The hardware system structure model is now as follows. Like the MMU, the SMMU also has a TLB that serves the same function, differing only in details. (Note: without an SMMU in the system, the device cannot use VAs directly; the operating system has to translate each VA to a PA before handing it to the device.)

[figure]

The system is essentially CPU-centric, and a Device is a module that gets started or called. Logically, therefore, a process can be bound to one or more device substreams, but each device substream can be bound to only one process. The logical relationship is shown in the figure below.

[figure]

To sum up, in a virtualization scenario, a typical flow for a process to call a Device has the following steps.

  1. The process, identified by VMID+ASID, requests Device resources and is bound to a StreamID+SubstreamID; it obtains the virtual address of the device's configuration address window, and the mapping between the process and the Device is recorded in memory;

  2. The process allocates Memory space, obtaining a virtual address within the process, and passes that address to the Device through the device configuration window;

  3. The Device then issues read and write accesses using that virtual address, and the SMMU translates the VA into a PA and sends it to the interconnect bus.
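The three steps above can be sketched end to end; all structures and values are illustrative:

```python
# Toy end-to-end flow: the process binds a device substream (step 1),
# shares a VA buffer with the device (step 2), and the device's access
# is translated by the SMMU (step 3).
bindings = {}     # (StreamID, SubstreamID) -> (vmid, asid)
page_table = {}   # (vmid, asid, va) -> pa

def step1_bind(vmid, asid, stream_id, substream_id):
    bindings[(stream_id, substream_id)] = (vmid, asid)

def step2_share_buffer(vmid, asid, va, pa):
    page_table[(vmid, asid, va)] = pa    # OS establishes the mapping

def step3_device_access(stream_id, substream_id, va):
    vmid, asid = bindings[(stream_id, substream_id)]
    return page_table[(vmid, asid, va)]  # SMMU translation to PA

step1_bind(vmid=1, asid=5, stream_id=7, substream_id=0)
step2_share_buffer(1, 5, va=0x2000, pa=0xA000_0000)
pa = step3_device_access(7, 0, 0x2000)
```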

The SMMU's address translation is the reverse of the binding flow above and is described in Arm's official documentation; the figure below lays out the relationship clearly. The SMMU is very similar to the MMU: it also performs two stages of address translation, each of which can be bypassed according to the scenario. The difference is the Configuration lookup step in the figure, which converts StreamID + SubstreamID into VMID + ASID.

[figure]

In the hardware system structure model above, Device includes PCIe devices. Taking the PCIe device out separately, its structure is as follows.

[figure]

The RequesterID of a PCIe transaction maps directly to the StreamID, and the PASID to the SubstreamID. When a request arrives, (RequesterID, PASID) is first converted to (vmid, asid); Stage 1 then translates VA to IPA based on (vmid, asid), Stage 2 translates IPA to PA based on vmid, and the result is sent to the interconnect bus. The PCIe PASID travels as a TLP prefix, 1 DW (32 bits) long, so it costs 32 bits of link bandwidth per transaction. Its format is as follows.

[figure]
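As a rough illustration of the 1-DW prefix cost, a 20-bit PASID plus its flag bits can be packed into a single 32-bit word. The field names (PASID, Execute Requested, Privileged Mode Requested) come from the PCIe spec, but the bit positions used below are illustrative only; consult the PCIe Base Specification for the normative layout.

```python
# Illustrative packing of a PASID TLP prefix into one 32-bit DW.
# Bit positions here are NOT the normative PCIe layout.
PASID_BITS = 20

def pack_pasid_prefix(pasid, execute_req=False, privileged=False):
    assert 0 <= pasid < (1 << PASID_BITS)
    dw = pasid                        # 20-bit PASID in the low bits
    dw |= int(execute_req) << 20      # Execute Requested flag
    dw |= int(privileged) << 21       # Privileged Mode Requested flag
    return dw

def unpack_pasid(dw):
    return dw & ((1 << PASID_BITS) - 1)
```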

In practice, most current PCIe devices do not support PASID, which means the SMMU need not perform Stage 1 translation for them. In that case, when a process hands a VA to a PCIe device, the virtual machine first converts the VA to an IPA and delivers the IPA instead; after receiving the request, the SMMU performs only the Stage 2 translation. As above, either translation stage may also be bypassed depending on the scenario.

A PCIe Endpoint supports device virtualization through SR-IOV, which lets one PCIe physical device be presented as multiple logical devices callable by one or more operating systems and processes, as shown in the figure below.

[figure]

From the software perspective, the device appears as multiple mutually independent Devices. In the PCIe protocol, each such Device is called a Function, and each Function is assigned a unique 16-bit ID that it uses as the RequesterID when issuing read and write requests. Because each Function is independent, each must have its own configuration space in the system's memory map: the physical device itself may be shared, but a BAR space allocation cannot be shared across Functions.
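The 16-bit RequesterID is the standard PCIe Bus/Device/Function (BDF) encoding, which can be sketched as:

```python
# RequesterID = bus[15:8] | device[7:3] | function[2:0].
# (With ARI/SR-IOV the low 8 bits act as a single function number,
# which is how many Virtual Functions fit behind one bus number.)
def make_requester_id(bus, device, function):
    assert bus < 256 and device < 32 and function < 8
    return (bus << 8) | (device << 3) | function

def split_requester_id(rid):
    return rid >> 8, (rid >> 3) & 0x1F, rid & 0x7
```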

After the Root Port receives a request, it passes it to the SMMU for address translation and then on to the interconnect bus. Translation takes time, so enabling the SMMU adds latency compared with bypassing it, and if the needed page table entry is not in the TLB, it must still be fetched from memory, which hurts performance significantly. Moreover, with many attached devices there will be a large number of translation requests, and devices also tend to issue many small-granularity requests (e.g. 512 bytes), putting pressure on the SMMU and the system.

Implementations use various means to mitigate this, such as allocating fixed page tables to a Device. Moreover, in some cases the Device has already obtained the address of an upcoming access but cannot issue the read or write yet because its processing task has not finished, so the translation could be done ahead of time.

PCIe ATS (Address Translation Services) is used to solve this problem by moving translation to the Device side: the device integrates an Address Translation Cache (ATC), and the ATS protocol maintains the ATC and its related flows. When the PCIe Device issues a read or write request, it translates the VA to a PA first and sends the PA; on the Host side no further translation is needed, and the request goes straight to the interconnect bus.

[figure]
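A minimal sketch of the ATS idea, with illustrative interfaces rather than the real ATS message formats:

```python
# Toy ATS flow: the device asks the host's SMMU for a translation once
# (a Translation Request), caches it in its ATC, and later requests go
# out already carrying the PA.
class AtsDevice:
    def __init__(self, smmu_translate):
        self.atc = {}                        # va_page -> pa_page (the ATC)
        self.smmu_translate = smmu_translate

    def read(self, va_page):
        if va_page not in self.atc:
            # Miss: issue a Translation Request to the host (slow, once)
            self.atc[va_page] = self.smmu_translate(va_page)
        # The outgoing request is already translated (a PA)
        return self.atc[va_page]

host_table = {0x1000: 0x8000_0000}
dev = AtsDevice(host_table.__getitem__)
pa1 = dev.read(0x1000)   # miss: translation request to the SMMU
pa2 = dev.read(0x1000)   # hit: served from the device-side ATC
```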

In practice, most current PCIe devices do not support the ATS mechanism, and it still leaves problems unsolved, such as security concerns and Cache maintenance issues.

Page table maintenance

All page tables are stored in Memory; without a TLB, every lookup would have to go to Memory. When a virtual machine or process allocates memory, it receives a virtual address, and a page table entry, the VA->PA mapping, is established in memory at the same time; when the space is freed, the corresponding page table entry is deleted.

To accelerate address translation, the MMU/SMMU implements a TLB that caches part of the page table. A request that hits the TLB gets its translation result immediately; otherwise the page table in Memory must be consulted. Systems generally maintain multi-level page tables, with each level holding pointers that index the next, so a lookup in Memory takes multiple accesses. This process is called a Page Table Walk.
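A toy two-level walk makes the per-miss cost concrete; the level widths and addresses are illustrative:

```python
# Toy two-level Page Table Walk: each TLB miss costs one memory read
# per level. 4 KiB pages and 10-bit indices per level are assumed
# purely for illustration.
PAGE_SHIFT, LEVEL_BITS = 12, 10

l2 = {0x001: 0x8000_0000}   # level-2 table: index -> physical page
l1 = {0x000: l2}            # level-1 table: index -> pointer to level 2

def walk(va):
    idx1 = (va >> (PAGE_SHIFT + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    idx2 = (va >> PAGE_SHIFT) & ((1 << LEVEL_BITS) - 1)
    table2 = l1[idx1]        # memory access #1
    pa_page = table2[idx2]   # memory access #2
    return pa_page | (va & ((1 << PAGE_SHIFT) - 1))
```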

Because of this caching, deleting a page table entry must also remove the cached copy in the TLB, otherwise stale translations will corrupt memory. The SMMU protocol defines a large set of TLB invalidation commands, which are fairly complicated; in past projects a great deal of logic went into handling them. If a PCIe device implements an ATC, its cached entries must be invalidated at the same time.
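The consistency requirement can be sketched in a few lines; the structures are illustrative:

```python
# Why unmapping must invalidate every cached copy of a translation:
# the in-memory page table, the (S)MMU TLB, and any device-side ATC
# must all drop the entry, or a stale PA keeps working after the free.
page_table = {0x1000: 0x8000_0000}
tlb = dict(page_table)       # SMMU TLB copy of the entry
atc = dict(page_table)       # device-side ATC copy of the entry

def unmap(va_page):
    del page_table[va_page]
    tlb.pop(va_page, None)   # TLB invalidation command
    atc.pop(va_page, None)   # ATS invalidation sent to the device's ATC

unmap(0x1000)
stale = 0x1000 in tlb or 0x1000 in atc
```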

Page table properties

Besides implementing the VA->PA mapping, page tables also define attribute and permission information, such as the attributes of a Memory region (Device, Cacheable, etc.) and whether a given virtual machine or process is permitted to use the page.

Page fault exception

If a virtual-address access request cannot be resolved to a valid address, a page fault exception is raised. Page faults have many causes, and their handling procedures differ accordingly; for example, if a process accesses an illegal address outside its legal address range, a page fault exception occurs.

Related ideas

Virtualization is one of the key technologies of cloud computing and plays a very important role; it is what makes cloud computing possible in its current form. But virtualization, whether in software or hardware, is complicated to implement: on the hardware side it requires a large amount of hardware resources, and on the software side enabling it gives up part of the hardware's performance. I believe cloud computing will continue developing toward data-centric and serverless models, providing services at the granularity of functions. Could virtualization then be simplified, or even dispensed with altogether?


Origin blog.csdn.net/qq_41854911/article/details/132527876