Virtualization principle (2)

Virtualization Overview

Virtualization divides the underlying computer resources into multiple virtual computing platforms that are isolated from one another. Each virtual platform should have all five components of a computer (arithmetic unit, controller, memory, input devices and output devices).

Categories of virtualization technology

Emulation: the virtual machine's CPU architecture can differ from the physical CPU's architecture; this is achieved with a software emulator running on top of the operating system and hardware. (Because the emulated CPU's instruction set differs from the physical CPU's, the virtual machine monitor must translate every emulated instruction into the physical CPU's instruction set. This translation is done in software, so performance is poor.)

Commonly used emulators: PearPC, Bochs, QEMU
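To make the cost of emulation concrete, here is a minimal Python sketch of a software interpreter for a made-up guest instruction set. The opcodes (LOAD, ADD, JNZ, HALT), the two registers and EXAMPLE_PROGRAM are hypothetical; real emulators such as QEMU handle real ISAs and use dynamic binary translation rather than a simple loop. What the sketch shows is that every guest instruction passes through a software dispatch step instead of running on the physical CPU.

```python
from typing import Dict, List, Tuple

# Hypothetical guest ISA: (opcode, register, immediate operand).
Instr = Tuple[str, str, int]

def emulate(program: List[Instr]) -> Tuple[Dict[str, int], int]:
    """Interpret each guest instruction in software; return (registers, count)."""
    regs: Dict[str, int] = {"r0": 0, "r1": 0}
    pc = 0          # guest program counter, maintained by the emulator
    executed = 0    # every instruction pays for a software dispatch
    while True:
        op, reg, imm = program[pc]
        executed += 1
        if op == "LOAD":      # reg <- imm
            regs[reg] = imm
            pc += 1
        elif op == "ADD":     # reg <- reg + imm
            regs[reg] += imm
            pc += 1
        elif op == "JNZ":     # if reg != 0: jump to instruction index imm
            pc = imm if regs[reg] != 0 else pc + 1
        elif op == "HALT":
            return regs, executed
        else:
            raise ValueError(f"unknown guest opcode: {op}")

# Example guest program: add 10 to r1 three times.
EXAMPLE_PROGRAM: List[Instr] = [
    ("LOAD", "r0", 3),
    ("LOAD", "r1", 0),
    ("ADD", "r1", 10),   # loop body starts at index 2
    ("ADD", "r0", -1),
    ("JNZ", "r0", 2),
    ("HALT", "r0", 0),
]
```

Running `emulate(EXAMPLE_PROGRAM)` returns `({'r0': 0, 'r1': 30}, 12)`: twelve guest instructions, each decoded and carried out by host software.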

Full virtualization: the virtual CPU has exactly the same architecture as the physical CPU. (Because the architectures match, ordinary instructions that are neither privileged nor sensitive can execute directly on the underlying physical CPU without any translation. When the virtual machine issues a privileged instruction, the virtual machine monitor (VMM) captures it and either translates it into the host's instruction set (full virtualization) or, in paravirtualization, the virtual machine calls the VMM, also called the hypervisor, through a hypercall. Full virtualization with hardware assistance (HVM) can skip the instruction-translation step.)

Common full-virtualization products: VMware Workstation, VMware Server, Parallels Desktop, KVM, Xen
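The "run directly unless it traps" rule can be modelled as a toy. This is a sketch only: the PRIVILEGED set, the Trap exception and VMM.handle_trap are illustrative assumptions, not the interface of any real hypervisor. Ordinary instructions count as running directly on the CPU, while privileged ones fault and are emulated by the VMM.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative privileged instructions (x86 mnemonics used only as labels).
PRIVILEGED = {"HLT", "LGDT", "MOV_CR3", "OUT"}

class Trap(Exception):
    """Raised when deprivileged guest code tries a privileged instruction."""
    def __init__(self, instr: str) -> None:
        super().__init__(instr)
        self.instr = instr

@dataclass
class VMM:
    log: List[str] = field(default_factory=list)

    def handle_trap(self, instr: str) -> None:
        # Emulate the instruction's effect on this VM's virtual state only.
        self.log.append(f"emulated {instr}")

def cpu_execute_deprivileged(instr: str) -> None:
    """The guest kernel does not run in the real ring 0, so privileged ops fault."""
    if instr in PRIVILEGED:
        raise Trap(instr)

def run_guest(instrs: List[str], vmm: VMM) -> int:
    """Run guest code; return how many instructions executed directly."""
    direct = 0
    for instr in instrs:
        try:
            cpu_execute_deprivileged(instr)
            direct += 1
        except Trap as trap:
            vmm.handle_trap(trap.instr)  # VMM steps in, then the guest resumes
    return direct
```

For the sequence ADD, MOV, MOV_CR3, SUB, HLT, three instructions run directly and the VMM emulates the other two.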

Paravirtualization (para-virtualization)

Here the hypervisor runs directly on the hardware and exposes the underlying functionality to guests through hypercalls. An operating system that is to run on the hypervisor must be developed against the hypervisor's call interface (its API) and, at run time, must invoke the hypervisor through that interface (its ABI). Each virtual machine's kernel is therefore modified so that the virtual machine knows it is running in a virtualized environment; in other words, paravirtualization requires modifying the guest OS kernel. (I/O paravirtualization usually does not require modifying the kernel; installing specific drivers is enough. CPU paravirtualization does require a modified kernel.)

Common paravirtualization implementations: Xen, UML (User-mode Linux)

OS-level virtualization: container-level virtualization

In all of the approaches above, each virtual machine has its own separate user space and kernel space, and the virtual machines are managed by a VMM (virtual machine monitor), or hypervisor, on the host. But the purpose of creating a virtual machine is to provide a user space in which services run; a separate kernel space per virtual machine adds little. OS-level virtualization therefore moves the virtualization layer up: a single kernel runs on the hardware, and a virtualization manager running inside that kernel provides only isolated user spaces, not kernel spaces. All user spaces share the one kernel, so the system is divided into multiple user spaces that are isolated from one another. (Stability is lower, but performance is good.)

Common container-level virtualization implementations: OpenVZ, LXC
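As a rough model of the shared-kernel idea, the sketch below keeps one process table in a single SharedKernel object and gives each Container its own PID numbering, loosely in the spirit of Linux PID namespaces. The class and method names are hypothetical, and real containers isolate far more than PIDs (mounts, network, users and so on).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SharedKernel:
    """The one kernel shared by every container (hypothetical model)."""
    next_global_pid: int = 1
    processes: Dict[int, str] = field(default_factory=dict)  # global PID -> name

    def spawn(self, name: str) -> int:
        pid = self.next_global_pid
        self.next_global_pid += 1
        self.processes[pid] = name
        return pid

@dataclass
class Container:
    """An isolated user space: its own PID numbering on top of the shared kernel."""
    kernel: SharedKernel
    local_to_global: Dict[int, int] = field(default_factory=dict)

    def spawn(self, name: str) -> int:
        global_pid = self.kernel.spawn(name)
        local_pid = len(self.local_to_global) + 1  # first process is PID 1
        self.local_to_global[local_pid] = global_pid
        return local_pid

    def ps(self) -> List[str]:
        # A container sees only its own processes, never its neighbours'.
        return [self.kernel.processes[g] for g in self.local_to_global.values()]
```

Two containers on the same kernel can each start a process with PID 1, while the kernel itself tracks every process in one table.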

Library virtualization: virtualizes the library environment that a program depends on to run.

Example: Wine, which runs Windows programs on Linux.

Application virtualization:

Example: the JVM.

Virtualization is implemented in two ways:

Type-I: no operating system is installed on the hardware; instead, the hypervisor is installed directly on it. The hypervisor is the virtualization software and controls the hardware directly.

Examples: Xen, VMware ESX/ESXi

Type-II: an operating system is installed on the hardware, the virtualization software is installed on that operating system, and virtual machines are created with the virtualization software.

Examples: KVM, VMware Workstation, VirtualBox

Basic Theory

CPU structure:

The CPU has four privilege rings. Ordinary instructions run in ring 3, and privileged instructions run in ring 0 (these include instructions that manipulate hardware directly or operate on the CPU's hardware control registers, and they must be executed in ring 0). Rings 1 and 2 are not used. Operating systems are designed so that kernel-space code executes in ring 0 and user-space code in ring 3. A user-space process runs directly on the CPU as long as it executes no privileged instructions; privileged operations are captured by the kernel, which performs them in ring 0 on the process's behalf through the system-call mechanism and then returns the result to the process.

Once an operating system has booted, it does not itself do any production work; the real work is done by processes. To coordinate multitasking, the operating system is divided into two parts, user space and kernel space:

Kernel space: can operate the hardware, has privileged authority, runs in ring 0;
User space: where processes run, in ring 3;
When a user-space process needs to execute a privileged instruction or use hardware, it must do so through a system call.
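The ring rules above can be sketched in a few lines. In this hypothetical model the CPU refuses privileged instructions outside ring 0, and a system call temporarily raises the privilege level so the kernel can do the work and return the result. The instruction names and the system-call number are illustrative only.

```python
from typing import Callable, Dict

PRIVILEGED = {"IN", "OUT", "HLT", "MOV_CR0"}  # illustrative only

class GeneralProtectionFault(Exception):
    """Raised when ring-3 code tries to execute a privileged instruction."""

class CPU:
    def __init__(self) -> None:
        self.ring = 3  # user processes run in ring 3

    def execute(self, instr: str) -> str:
        if instr in PRIVILEGED and self.ring != 0:
            raise GeneralProtectionFault(instr)
        return f"ring {self.ring}: {instr}"

class Kernel:
    def __init__(self, cpu: CPU) -> None:
        self.cpu = cpu
        # Hypothetical system-call table: number -> privileged work done in ring 0.
        self.syscalls: Dict[int, Callable[[], str]] = {
            1: lambda: self.cpu.execute("OUT"),  # e.g. write to a device port
        }

    def syscall(self, number: int) -> str:
        saved = self.cpu.ring
        self.cpu.ring = 0          # enter kernel mode
        try:
            return self.syscalls[number]()
        finally:
            self.cpu.ring = saved  # back to user mode, carrying the result
```

A direct `cpu.execute("OUT")` from ring 3 raises GeneralProtectionFault, while `kernel.syscall(1)` performs the same instruction in ring 0 and leaves the CPU back in ring 3.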

Concepts

Host: the virtualization layer on the host is usually called the VMM (virtual machine monitor), also known as the hypervisor. The hypervisor manages the hardware directly (usually only the CPU and memory; I/O devices are not managed by the hypervisor) and plays the role of a kernel. The hypervisor virtualizes the CPU and memory and allocates them to virtual machines, which call into it through hypercalls: just as calls into the kernel are called system calls, calls into the hypervisor are called hypercalls.

Problems introduced by virtualization

A virtual machine is a complete host, also made up of a kernel space and a user space. In theory the kernel in a virtual machine can control all resources, which would also let it affect other virtual machines, so virtual machines could interfere with one another. To avoid this, the CPU seen by the virtual machine is provided in software, and that virtual CPU also has rings 0, 1, 2 and 3: the guest's kernel space runs in ring 0 and its user space in ring 3. When the virtual architecture is the same as the physical architecture, user-space processes in the virtual machine can still run directly in ring 3 of the physical CPU; only privileged instructions are captured by the virtual machine manager on the host, translated, executed in ring 0 of the physical CPU, and their results returned to the virtual machine. If the virtual architecture differs from the physical architecture (for example, a virtual PPC on a physical x86), the user-space instruction set of the virtual architecture obviously differs from that of the physical architecture, so even user-space instructions must be translated into the host's instruction set, which is inefficient.

CPU virtualization: slicing CPU time to time-division multiplex the physical CPU's resources
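Time-division multiplexing of a physical CPU can be illustrated with a simple round-robin scheduler. This is a sketch under stated assumptions: the vCPU names, the millisecond units and the fixed quantum are hypothetical, and real hypervisor schedulers (for KVM, the Linux scheduler, which schedules vCPU threads) use more elaborate policies.

```python
from collections import deque
from typing import Dict, List

def time_slice(vcpu_work: Dict[str, int], quantum: int) -> List[str]:
    """Round-robin vCPUs on one physical CPU.

    vcpu_work maps a vCPU name to the CPU time (ms) it still needs; each turn
    a vCPU runs for at most `quantum` ms and then yields the physical CPU.
    Returns the order in which vCPUs held the physical CPU.
    """
    queue = deque(vcpu_work.items())
    schedule: List[str] = []
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)
        remaining -= min(quantum, remaining)
        if remaining > 0:
            queue.append((name, remaining))  # back of the line for its next slice
    return schedule
```

`time_slice({"vm1-vcpu0": 25, "vm2-vcpu0": 10}, 10)` gives `["vm1-vcpu0", "vm2-vcpu0", "vm1-vcpu0", "vm1-vcpu0"]`: each vCPU gets the physical CPU in turn for at most one quantum.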

CPU virtualization can be achieved in two ways: emulation and virtualization.

Emulation: the physical architecture and the virtual machine architecture may differ; emulation is implemented purely in software and is inefficient. When the CPU is emulated in software, all of rings 0, 1, 2 and 3 must be emulated.

Virtualization: the physical architecture and the virtual machine architecture are the same.

Full virtualization based on binary translation: the VMM virtualizes a complete platform, and the guest does not even know it is running on a virtual platform;

The VMM runs in ring 0, while the guest OS is deprivileged: the guest kernel runs in ring 1 and guest user space in ring 3. When the guest kernel executes a privileged instruction in ring 1, which is not a privileged level, an exception (trap) is generated; the VMM captures the exception, intercepts the instruction and translates it into the corresponding operation on the virtual machine. Non-privileged instructions in ring 3 are not intercepted by the VMM and execute directly in ring 3 of the physical CPU. (The x86 architecture has a virtualization hole: it contains 19 non-privileged instructions that can affect system state but cannot be intercepted by the VMM. Under x86 virtualization these 19 instructions must therefore be translated into instructions that pose no threat to the system. These 19 instructions are called critical instructions, and privileged and critical instructions together are called sensitive instructions. In full virtualization the virtual machine cannot perceive that it is running in a virtual environment, so the VMM performs BT, binary translation, on the sensitive instructions the virtual machine executes, turning them into virtual-machine operations. This closes the virtualization hole without modifying the guest kernel, but performance is poor.)
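Binary translation can be pictured as rewriting each guest code block before it runs. The sketch below is a simplification under stated assumptions: instructions are plain strings, the CRITICAL set lists only a few well-known examples of x86 instructions that are sensitive yet do not trap (it is not the full list the text counts), and the vmm_emulate_* targets are hypothetical handler names.

```python
from typing import List

# A few examples of x86 instructions that are sensitive but do not trap when
# run deprivileged (the "critical" instructions); not the complete list.
CRITICAL = {"POPF", "PUSHF", "SGDT", "SIDT", "SLDT", "SMSW"}

def translate_block(block: List[str]) -> List[str]:
    """Rewrite one guest basic block before it runs (binary translation sketch).

    Innocuous instructions are copied unchanged and later run natively;
    critical ones become an explicit call into the VMM, which emulates them
    against the VM's virtual state.
    """
    translated: List[str] = []
    for instr in block:
        mnemonic = instr.split()[0].upper()
        if mnemonic in CRITICAL:
            translated.append(f"CALL vmm_emulate_{mnemonic.lower()}")
        else:
            translated.append(instr)
    return translated
```

For example, `translate_block(["mov eax, 1", "POPF", "add eax, 2", "SGDT [ebx]"])` keeps the `mov` and `add` unchanged and turns the other two into `CALL vmm_emulate_popf` and `CALL vmm_emulate_sgdt`.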

Paravirtualization (para-virtualization)

CPU paravirtualization requires modifying the guest OS kernel so that the guest kernel knows explicitly that it is running in a virtual environment. When the guest needs to run a sensitive instruction, it does not execute it directly; instead it sends the request to the VMM through a hypercall. A sensitive instruction in the guest is thus reduced to a simple request asking the host VMM to perform an operation, so the VMM no longer needs to catch exceptions or use BT, which greatly simplifies the interaction between the guest and host kernels.
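The difference from trapping can be shown with a hypothetical hypercall table. Nothing here is Xen's real ABI: the hypercall numbers, Hypervisor.hypercall and ParavirtGuestKernel are illustrative names. The point is that the modified guest kernel asks the hypervisor explicitly, so there is no exception for the VMM to catch and no instruction to translate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Hypervisor:
    """Hypothetical hypercall table exposed to paravirtualized guests."""
    page_tables: Dict[str, int] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def hypercall(self, vm: str, number: int, arg: int) -> int:
        handlers: Dict[int, Callable[[str, int], int]] = {
            0: self._set_page_table,  # stands in for a guest "MOV CR3"
            1: self._yield_cpu,       # stands in for a guest "HLT"
        }
        self.log.append(f"{vm}: hypercall {number}")
        return handlers[number](vm, arg)

    def _set_page_table(self, vm: str, base: int) -> int:
        self.page_tables[vm] = base
        return 0

    def _yield_cpu(self, vm: str, _arg: int) -> int:
        return 0

class ParavirtGuestKernel:
    """A modified guest kernel: it asks the hypervisor instead of trapping."""

    def __init__(self, name: str, hv: Hypervisor) -> None:
        self.name = name
        self.hv = hv

    def switch_address_space(self, base: int) -> None:
        # Unmodified kernel: "MOV CR3, base" -> trap or binary translation.
        # Paravirtualized kernel: one explicit request to the hypervisor.
        self.hv.hypercall(self.name, 0, base)
```

A call such as `ParavirtGuestKernel("vm1", hv).switch_address_space(0x1000)` produces exactly one logged hypercall and updates the hypervisor's record for vm1.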

Hardware-assisted full virtualization: the full virtualization and paravirtualization described above apply when the physical CPU does not support VT virtualization, or when it is not enabled in the BIOS. (KVM runs only with hardware-assisted virtualization.)

When the physical CPU supports virtualization and VT-x is enabled in the BIOS, the CPU has two operating modes: root mode and non-root mode. The VMM works in root mode and the guest OS in non-root mode. Both modes have rings 0, 1, 2 and 3: the VMM runs in ring 0 of root mode, and the guest OS runs in ring 0 of non-root mode. When the guest OS issues a sensitive instruction, the CPU in non-root mode recognizes it as sensitive and switches to the VMM in root mode, which translates it into the corresponding virtual-machine operation. Other, non-privileged instructions execute directly on the physical CPU in ring 3.
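A minimal model of the root/non-root split follows. It is a sketch, not VT-x: EXIT_HANDLERS and its strings are illustrative labels rather than real VMX exit reasons. Non-sensitive instructions are counted as running natively in non-root mode; sensitive ones record a VM exit that the VMM handles in root mode before re-entering the guest.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

# Illustrative labels for instructions that force a VM exit in this model.
EXIT_HANDLERS = {
    "CPUID": "report virtual CPU features",
    "IO_PORT": "forward to the device model",
    "HLT": "deschedule the vCPU",
}

@dataclass
class VcpuRun:
    native: int = 0                                      # ran in non-root mode
    exits: List[Tuple[str, str]] = field(default_factory=list)

def run_vcpu(guest_instrs: Iterable[str]) -> VcpuRun:
    """Guest code runs in non-root mode until a sensitive instruction causes
    a VM exit; the VMM handles it in root mode, then re-enters the guest."""
    run = VcpuRun()
    for instr in guest_instrs:
        handler = EXIT_HANDLERS.get(instr)
        if handler is None:
            run.native += 1                      # executes on the physical CPU
        else:
            run.exits.append((instr, handler))  # VM exit -> VMM -> VM entry
    return run
```

Unlike software trap-and-emulate or binary translation, the decision to exit is made by the CPU itself, so the guest kernel can stay in ring 0 of non-root mode unmodified.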

The difference between emulation and virtualization:

Emulation: all of the virtual machine's hardware devices are simulated in software, and the physical architecture differs from the virtual machine architecture;

Virtualization: the virtual machine's platform architecture is the same as the architecture of the underlying hardware platform.

CPU virtualization performance calculation

Calculating the number of vCPUs:

Suggested reuse (overcommit) ratios:

Memory virtualization: partitioning the memory space

Memory is already virtualized: each process sees a linear address space, while the kernel sees the physical address space. In the simple non-virtualized case, once the operating system is installed, the kernel can allocate and use the entire physical memory space. In a virtualized environment, physical memory is managed by the hypervisor, which divides it into memory pages and allocates them to the virtual machines, so the memory a VM receives is discontiguous.
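To illustrate why a VM's memory is discontiguous on the host, the sketch below hands out host page frames to VMs one page at a time, in turn. The round-robin allocation policy and the function name are assumptions made for illustration; the returned table maps each guest frame number (the GPA page, used as the list index) to a host frame number (the HPA page).

```python
from typing import Dict, List

def allocate_round_robin(free_frames: List[int], requests: Dict[str, int]) -> Dict[str, List[int]]:
    """Hand host page frames to VMs one page at a time, in turn.

    Returns, per VM, a list whose index is the guest frame number and whose
    value is the host frame number backing it.
    """
    free = list(free_frames)
    remaining = dict(requests)
    gfn_to_hfn: Dict[str, List[int]] = {vm: [] for vm in requests}
    while any(remaining.values()):
        for vm in requests:
            if remaining[vm] > 0:
                if not free:
                    raise MemoryError("host has no free frames left")
                gfn_to_hfn[vm].append(free.pop(0))
                remaining[vm] -= 1
    return gfn_to_hfn
```

`allocate_round_robin(list(range(8)), {"vm1": 3, "vm2": 2})` returns `{"vm1": [0, 2, 4], "vm2": [1, 3]}`: vm1 sees guest frames 0, 1 and 2 as contiguous, but they live in host frames 0, 2 and 4.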

When a process in a virtual machine accesses data in memory:

A process running on the virtual machine's CPU issues an address in its linear address space (called the GVA, guest virtual address) to access data. The CPU passes the address to the MMU (memory management unit), which uses the page table that the guest kernel maintains for each process to find the corresponding guest physical address (the GPA). The GPA must then be translated again, in software by calling the hypervisor, into a host physical address (the HPA). Each virtual address is therefore translated twice. The resulting GVA-to-HPA mapping is cached in the TLB, but when the CPU switches between several virtual machines, the cached mappings of different VMs would be confused, so the TLB must be flushed every time a virtual machine is switched. This drives the TLB hit rate very low. To solve this problem, MMU virtualization is implemented in hardware.
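The two-step translation and the flush-on-switch problem can be sketched as follows. Assumptions: single-level page tables held in dicts, 4 KB pages, and a `p2m` dict standing in for the hypervisor's GPA-to-HPA table; the class name is hypothetical. The TLB caches the final GVA-to-HPA result but carries no VM tag, so it must be cleared whenever a different VM is scheduled.

```python
from typing import Dict, Optional

PAGE_SIZE = 4096  # assume 4 KB pages

class SoftwareTwoStageMMU:
    """GVA -> GPA through the guest page table, then GPA -> HPA through the
    hypervisor's table, cached in one TLB that has no VM tag (sketch)."""

    def __init__(self) -> None:
        self.tlb: Dict[int, int] = {}  # GVA page -> HPA page
        self.current_vm: Optional[str] = None
        self.hits = 0
        self.misses = 0

    def switch_to(self, vm: str) -> None:
        if vm != self.current_vm:
            self.tlb.clear()  # cached entries belong to the previous VM
            self.current_vm = vm

    def translate(self, gva: int, guest_pt: Dict[int, int], p2m: Dict[int, int]) -> int:
        page, offset = divmod(gva, PAGE_SIZE)
        if page in self.tlb:
            self.hits += 1
        else:
            self.misses += 1
            gpa_page = guest_pt[page]       # 1st translation: guest page table
            self.tlb[page] = p2m[gpa_page]  # 2nd translation: via the hypervisor
        return self.tlb[page] * PAGE_SIZE + offset
```

Translating one address for vm1, vm1, vm2 and then vm1 again yields one hit and three misses, because each VM switch wiped the cached entry.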

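MMU virtualization: Intel EPT (Extended Page Tables) and AMD NPT (Nested Page Tables)

It implements the mapping between virtual memory and physical memory.

When a process in the VM issues a linear address to the virtual CPU, the MMU translates the GVA into a GPA, and at the same time MMU virtualization automatically completes the translation to the HPA (in effect, there are two layers of MMU). So with this technology, each process in the virtual machine still translates from GVA to GPA as before, but MMU virtualization on the CPU also maps the GPA directly to the HPA in hardware, eliminating the software GPA-to-HPA conversion and thereby improving virtualization performance.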

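Problem: MMU virtualization on the CPU improves virtualization performance, but it does not raise the TLB hit rate. The TLB still caches GVA-to-HPA mappings, and it still has to be flushed when switching virtual machines. To raise the TLB hit rate, TLB virtualization was introduced.

TLB virtualization: tagged TLB

By default the TLB caches GVA-to-HPA mappings. TLB virtualization adds one field to each entry: the identifier of the virtual machine. Entries belonging to different virtual machines can then coexist in the TLB, so it no longer has to be flushed on every VM switch.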
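A tagged TLB changes only the cache key. In this sketch, written under simplifying assumptions (4 KB pages, a caller-supplied `walk` function standing in for the full GVA-to-HPA page walk, a hypothetical class name), entries are keyed by (VM identifier, GVA page), so entries of different VMs coexist and nothing is flushed on a switch.

```python
from typing import Callable, Dict, Tuple

PAGE_SIZE = 4096  # assume 4 KB pages

class TaggedTLB:
    """TLB whose entries are keyed by (VM id, GVA page): no flush on VM switch."""

    def __init__(self) -> None:
        self.entries: Dict[Tuple[str, int], int] = {}  # (vm, GVA page) -> HPA page
        self.hits = 0
        self.misses = 0

    def translate(self, vm: str, gva: int, walk: Callable[[int], int]) -> int:
        page, offset = divmod(gva, PAGE_SIZE)
        key = (vm, page)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = walk(page)  # full GVA -> HPA walk only on a miss
        return self.entries[key] * PAGE_SIZE + offset
```

With the access pattern vm1, vm1, vm2, vm1 on one address, this TLB records two hits and two misses, because vm1's entry survives the switch to vm2.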

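Memory technologies:

Large memory pages (huge pages)

The GPA-to-HPA mappings, that is, the memory mapping table, are stored in memory. To improve performance, the CPU caches mappings in the TLB (translation lookaside buffer), a small cache inside the CPU; because it is small, it can hold only part of the mapping table, which raises the question of hit rate. When a mapping is found in the TLB (a hit), performance is good; when it is not, the CPU must look it up in memory, which is slower, because searching the TLB is much faster than reading memory. To raise the hit rate, the page size is increased, for example from the standard 4 KB page to a 2 MB large page, so that each TLB entry covers more memory.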
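The effect of page size on TLB coverage is simple arithmetic. The 64-entry TLB used in the example below is a hypothetical figure chosen for illustration; real TLB sizes vary by CPU and by page size.

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory a TLB can map without a miss."""
    return entries * page_size

def pages_needed(working_set: int, page_size: int) -> int:
    """Number of page mappings (TLB entries) needed to cover a working set."""
    return -(-working_set // page_size)  # ceiling division
```

With a hypothetical 64 entries, 4 KB pages cover 256 KB and 2 MB pages cover 128 MB; a 1 GB working set needs 262,144 page mappings at 4 KB but only 512 at 2 MB.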

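NUMA architecture

Early servers had a single CPU whose performance could not keep up with demand, which led to multi-CPU servers. There are several ways to place multiple CPUs in one server:

1. SMP architecture

Multiple CPUs are placed in one server and share the same memory, which leads to memory contention.

2. MPP architecture

Multiple nodes; each node may or may not be configured as SMP. The nodes do not share memory.

3. NUMA architecture

Multiple nodes, each with its own CPU and its own memory; the nodes can still share memory with one another. A CPU accesses memory on its own node quickly and memory on other nodes slowly. Two technologies build on this property: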

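1. Host NUMA

Enable NUMA in the host BIOS. With host NUMA enabled, the host operating system can recognize the NUMA architecture inside the host. Benefits: 1. when the host creates a virtual machine, it prefers CPU and memory from the same node; 2. when the host creates a virtual machine, it checks which node has more idle CPU and memory and creates the virtual machine on that node.

2. Guest NUMA

Enable guest NUMA in the cluster. With guest NUMA enabled, virtual machines in the cluster can recognize the host's internal NUMA architecture, and applications running in a virtual machine prefer memory on their local node.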
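The host-NUMA placement rule above (keep a VM's CPU and memory on one node, and prefer the most idle node) can be sketched like this. The data model, the tie-breaking order and returning None when nothing fits are assumptions; real schedulers weigh many more factors.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NumaNode:
    node_id: int
    free_cpus: int
    free_mem_gb: int

def place_vm(nodes: List[NumaNode], vcpus: int, mem_gb: int) -> Optional[int]:
    """Pick one NUMA node that can hold the whole VM, so its CPU and memory
    stay local. Among nodes that fit, prefer the most idle one (most free
    memory, then most free CPUs). Returns None if no single node fits."""
    fitting = [n for n in nodes if n.free_cpus >= vcpus and n.free_mem_gb >= mem_gb]
    if not fitting:
        return None
    best = max(fitting, key=lambda n: (n.free_mem_gb, n.free_cpus))
    best.free_cpus -= vcpus
    best.free_mem_gb -= mem_gb
    return best.node_id
```

With `nodes = [NumaNode(0, 8, 16), NumaNode(1, 12, 48)]`, `place_vm(nodes, 4, 32)` returns 1, since only node 1 has 32 GB free, and node 1's free memory drops to 16 GB.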

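Host memory overcommitment

Memory reuse technology

Memory reuse conflicts with guest NUMA: memory reuse involves memory swapping, which moves memory contents out to disk, and guest NUMA cannot work with data that has been swapped to disk. Memory reuse takes priority.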

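I/O virtualization (the hardware must support VT-d)

External storage: hard disks, optical discs, USB flash drives

Network devices: network interface cards (NICs)

Display devices: VGA, via the frame buffer mechanism

Keyboard and mouse: implemented through emulation

PS/2, USB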

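I/O virtualization approaches:

Fully virtualized I/O devices: VMware Workstation, KVM, Xen

The VMM emulates I/O devices in software (QEMU). The VMM captures the guest OS's requests to access physical I/O devices and converts those I/O requests into operations on the underlying hardware devices;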
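The cost of a fully emulated device comes from trapping every register access. In the toy device model below, the register names TX_DATA and TX_GO and the byte-at-a-time protocol are hypothetical; real emulated NICs (for example QEMU's e1000 model) use DMA descriptor rings, but each guest access to a device register still exits to the VMM.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EmulatedNic:
    """Device model in the VMM (the role QEMU plays); register layout is hypothetical."""
    tx_buffer: bytearray = field(default_factory=bytearray)
    sent: List[bytes] = field(default_factory=list)
    vm_exits: int = 0

    def guest_write(self, register: str, value: int) -> None:
        # Every guest access to a device register is trapped by the VMM.
        self.vm_exits += 1
        if register == "TX_DATA":    # guest pushes one byte at a time
            self.tx_buffer.append(value)
        elif register == "TX_GO":    # doorbell: hand the frame to the host side
            self.sent.append(bytes(self.tx_buffer))  # stands in for a real NIC send
            self.tx_buffer.clear()

def guest_driver_send(nic: EmulatedNic, frame: bytes) -> None:
    """The unmodified guest driver, unaware the device is emulated."""
    for byte in frame:
        nic.guest_write("TX_DATA", byte)
    nic.guest_write("TX_GO", 1)
```

Sending the 4-byte frame `b"ping"` through `guest_driver_send` costs five VM exits in this model: four data writes and one doorbell.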

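Paravirtualization

With fully virtualized I/O devices, the virtual machine has its own kernel space and user space. When a program in the VM wants to send a network packet, it cannot deal with the NIC directly; it makes a system call to the kernel for the NIC device, and the kernel drives the NIC through its driver. But this NIC is emulated in software, so driving it still does not produce a device that can actually send packets to the outside world; the emulated device's operations must be converted into calls to the hypervisor, and they finally land on a software-emulated NIC on the hypervisor. There may be many virtual NICs on the hypervisor but only a limited number of physical NICs, so for a virtual NIC to send packets out, its traffic must be converted into transmissions on a real physical NIC. The I/O path is long and performance is poor.

In the process above, the driver step on the left (inside the guest) accomplishes nothing. In paravirtualization the guest kernel knows explicitly that its NIC is virtual, so it does not drive that virtual device locally; instead, a front-end driver forwards calls on the virtual device straight to the back end, which shortens the I/O path and improves performance. As shown in the figure below: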

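I/O paravirtualization is usually implemented with virtio, which is split into two halves. The virtio architecture is shown in the figure below:

Front-end drivers: the first half of virtio. They live in the virtual machine instance: the kernel has already loaded these driver modules when the virtual machine is created. The front half consists mainly of the virtio-blk, virtio-net, virtio-pci, virtio-balloon and virtio-console drivers. The kernels of CentOS 4.8+, 5.3+, 6.0+ and 7.0+ support the front-end drivers directly; Windows requires dedicated drivers to be installed.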

virtio-balloon: lets a guest OS running under KVM adjust its memory size dynamically

How to enable it:

#qemu-kvm -balloon virtio

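To check the guest OS's memory usage manually, use the QEMU monitor:

(qemu) info balloon

To change the guest's memory allocation to N MB:

(qemu) balloon N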
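The balloon mechanism can be modelled as a guest driver that "inflates" by taking pages away from the guest and handing them to the host, and "deflates" by giving them back. This is a toy with hypothetical names; sizes are counted in pages here, whereas the monitor's `balloon N` command takes a size in megabytes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BalloonedGuest:
    """Toy balloon driver; sizes are in pages and all names are hypothetical."""
    total_pages: int
    balloon: List[int] = field(default_factory=list)  # pages handed back to the host

    @property
    def usable_pages(self) -> int:
        return self.total_pages - len(self.balloon)

    def set_target(self, target_pages: int) -> None:
        """Host asks the guest to end up with `target_pages` usable pages."""
        while self.usable_pages > target_pages:                 # inflate
            self.balloon.append(self.usable_pages - 1)          # give a page to the host
        while self.usable_pages < target_pages and self.balloon:  # deflate
            self.balloon.pop()                                  # take a page back
```

For a 10-page guest, `set_target(7)` puts pages 9, 8 and 7 in the balloon; a later `set_target(9)` returns two of them, and the guest can never grow beyond its original 10 pages.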

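virtio-net: implements network paravirtualization

It depends on the front-end driver in the guest OS and the back-end driver in QEMU.

Front-end driver: virtio_net.ko

Back-end driver (list the supported NIC models): qemu-kvm -net nic,model=?

How to enable it: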

#qemu-kvm -net nic,model=virtio

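vhost-net: a driver that replaces the user-space back end that QEMU implements for virtio-net, in order to improve performance. (The back-end handler normally lives in QEMU, i.e. the back-end driver relies on QEMU's software emulation; the vhost-net driver moves this work into the host kernel, which improves performance.)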

#qemu-kvm -net tap,vnet_hdr=on,vhost=on

virtio-blk: implements block-device paravirtualization

It depends on the front-end driver in the guest OS and the back-end driver in QEMU.

How to enable it:

-drive file=/path/to/some_image_file,if=virtio

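virtio: the virtual queue. All of the virtual machine's I/O requests are sent to this queue, and the corresponding back-end handler takes requests off the queue and responds to them;

transport: the transport layer, which carries every queue sent by the front-end driver to the back-end handler;

Back-end drivers (virtio backend drivers): implemented in QEMU, i.e. they rely on QEMU's software emulation; the back-end handler drives the corresponding physical device to process the requests.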
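The front-end/back-end split can be sketched with a simplified queue. This is not the virtio ring layout: two deques stand in for the available and used rings that real virtio keeps in shared memory, and `kick()` stands in for the notification to the back end. The sketch shows the key saving: the front end queues a batch of requests and notifies the back end once.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

@dataclass
class Virtqueue:
    """Simplified stand-in for a virtqueue shared by guest and host."""
    avail: Deque[Tuple[int, bytes]] = field(default_factory=deque)  # requests
    used: Deque[Tuple[int, int]] = field(default_factory=deque)     # completions
    notifications: int = 0

    def kick(self) -> None:
        self.notifications += 1  # one notification to the back end per batch

def frontend_submit(vq: Virtqueue, packets: List[bytes]) -> None:
    """Guest front-end driver: queue requests, then notify once."""
    for req_id, pkt in enumerate(packets):
        vq.avail.append((req_id, pkt))  # no exit per request
    vq.kick()

def backend_process(vq: Virtqueue, wire: List[bytes]) -> None:
    """Host back end: drain the queue and drive the (simulated) device."""
    while vq.avail:
        req_id, pkt = vq.avail.popleft()
        wire.append(pkt)                    # stands in for the physical device
        vq.used.append((req_id, len(pkt)))  # (request id, bytes handled)
```

Submitting two packets with `frontend_submit` and draining them with `backend_process` moves both to the device with a single notification, and the used queue reports `(0, 1)` and `(1, 2)`.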

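IO-through (I/O passthrough): requires hardware support and is implemented through VFIO (USB, PCI and SCSI passthrough).

The virtual machine uses the physical device directly (the driver for that hardware must be installed in the virtual machine). The hypervisor is still needed for coordination, and performance can approach that of the bare hardware. However, the virtual machine cannot be migrated, so despite the clear performance gain its range of application is limited.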

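I/O virtualization summary:

In full virtualization, the virtual machine's I/O requests are captured by the VMM, which uses the underlying hardware resources on a time-shared basis; in paravirtualization, the virtual machine's I/O requests are sent to the back-end driver, which calls the underlying hardware resources directly; in hardware-assisted virtualization, the virtual machine's I/O requests access the underlying hardware resources directly.

Summary:

The difference between full virtualization and paravirtualization is whether the virtual machine must help the VMM build the virtualized environment it needs. If the virtual machine is unaware of virtualization, so the guest OS kernel need not be modified, it is full virtualization; if the virtual machine is made aware that it runs in a virtualized environment, which requires modifying the guest OS kernel, it is paravirtualization. Note: KVM offers only full virtualization and hardware-assisted virtualization, not (CPU) paravirtualization.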
