cells

Cells does not pass any accesses to the mux_fb driver from background VPs to the hardware back end, ensuring that the foreground VP has exclusive hardware access. Standard control ioctls are applied to virtual hardware state maintained in RAM. Custom ioctls, by definition, perform non-standard functions such as graphics acceleration or memory allocation, and therefore accesses to these functions from background VPs must be at least partially handled by the same kernel driver which defined them. Instead of passing the ioctl to the hardware driver, Cells uses a new notification API that allows the original driver to appropriately virtualize the access. If the driver does not register for this new notification, Cells either returns an error code, or blocks the calling process when the custom ioctl is called from a background VP. Returning an error code was sufficient for both the Nexus 1 and Nexus S systems. When an application running in a background VP mmaps the framebuffer device, the mux_fb driver will map its backing buffer into the process' virtual address space.
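The ioctl handling described above can be sketched as a small user-space model. This is only an illustration, not the mux_fb driver: the class `MuxFbModel`, its method names, the string command names, and the choice of `ENOTTY` as the returned error are all assumptions, and the blocking alternative is left out.

```python
import errno


class MuxFbModel:
    """Toy user-space model of the mux_fb ioctl path (not kernel code)."""

    def __init__(self, foreground_vp):
        self.foreground_vp = foreground_vp
        self.hw_state = {}         # state of the real display hardware
        self.virtual_state = {}    # per-VP virtual hardware state kept in RAM
        self.custom_handlers = {}  # custom ioctl cmd -> driver notification callback

    def register_custom_handler(self, cmd, callback):
        # Hypothetical stand-in for the driver registering for notifications.
        self.custom_handlers[cmd] = callback

    def ioctl(self, vp, cmd, arg, standard=True):
        if vp == self.foreground_vp:
            # Only the foreground VP reaches the hardware back end.
            self.hw_state[cmd] = arg
            return 0
        if standard:
            # Standard control ioctl: update virtual state only.
            self.virtual_state.setdefault(vp, {})[cmd] = arg
            return 0
        handler = self.custom_handlers.get(cmd)
        if handler is None:
            # No driver registered: return an error (assumed code).
            return -errno.ENOTTY
        # Let the defining driver virtualize the access itself.
        return handler(vp, cmd, arg)
```

In this model a background VP never changes `hw_state`; its standard ioctls accumulate in `virtual_state` until the VP is switched to the foreground.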


Switching the display from a foreground VP to a background VP is accomplished in four steps, all of which must occur before any additional FB operations are performed: (1) screen memory remapping, (2) screen memory deep copy, (3) hardware state synchronization, and (4) GPU coordination.

1. Screen memory remapping is done by altering the page table entries for each process which has mapped FB screen memory to redirect virtual addresses in each process to new physical locations. Processes running in the VP which is to be moved into the background have their virtual addresses remapped to backing memory in system RAM, and processes running in the VP which is to become the foreground have their virtual addresses remapped to physical screen memory.

2. The screen memory deep copy is done by copying the contents of the screen memory into the previous foreground VP's backing buffer and copying the contents of the new foreground VP's backing buffer into screen memory. This copy is not strictly necessary if the new foreground VP completely redraws the screen.

3. Hardware state synchronization is done by saving the current hardware state into the virtual state of the previous foreground VP and then setting the current hardware state to the new foreground VP's virtual hardware state. Because the display device only uses the current hardware state to output the screen memory, there is no need to correlate particular drawing updates with individual standard control ioctls; only the accumulated virtual hardware state is needed.

4. GPU coordination, discussed in Section 4.2, involves notifying the GPU of the memory address switch so that it can update any internal graphics memory mappings.
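The four switch steps can be modeled in a few lines. The sketch below is a user-space simulation under stated assumptions: `VP`, `switch_foreground`, and `notify_gpu` are hypothetical names, screen memory is a plain `bytearray`, and page-table remapping is reduced to a `maps_to` label.

```python
class VP:
    """Hypothetical per-VP framebuffer bookkeeping."""

    def __init__(self, name, size):
        self.name = name
        self.backing = bytearray(size)  # backing buffer in system RAM
        self.virtual_hw_state = {}      # accumulated standard-ioctl state
        self.maps_to = "backing"        # where this VP's FB mappings point


def switch_foreground(screen, hw_state, old_fg, new_fg, notify_gpu):
    # (1) Screen memory remapping (stands in for page-table updates).
    old_fg.maps_to = "backing"
    new_fg.maps_to = "screen"
    # (2) Screen memory deep copy, in both directions.
    old_fg.backing[:] = screen
    screen[:] = new_fg.backing
    # (3) Hardware state synchronization: save, then restore.
    old_fg.virtual_hw_state = dict(hw_state)
    hw_state.clear()
    hw_state.update(new_fg.virtual_hw_state)
    # (4) GPU coordination: tell the GPU driver about the address switch.
    notify_gpu(old_fg, new_fg)
```

The order matters in step (2): the screen contents are saved into the old foreground VP's buffer before being overwritten.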


To better scale the Cells FB virtualization, the backing buffer in system RAM could be reduced to a single memory page which is mapped into the entire screen memory address region of background VPs. This optimization not only saves memory, but also eliminates the need for the screen memory deep copy. However, it does require the VP's user space environment to redraw the entire screen when it becomes the foreground VP. Redraw overhead is minimal, and Android conveniently provides this functionality through the fbearlysuspend driver discussed in Section 5.1.


6.4 GPU
Cells virtualizes the GPU by leveraging the GPU's independent graphics contexts together with the FB virtualization of screen memory described in Section 4.1. Each VP is given direct pass-through access to the GPU device. Because each process which uses the GPU executes graphics commands in its own context, processes are already isolated from each other and there is no need for further VP GPU isolation. The key challenge is that each VP requires FB screen memory on which to compose the final scene to be displayed, and in general the GPU driver can request and use this memory from within the OS kernel.


Cells solves this problem by leveraging its foreground-background usage model to provide a virtualization solution similar to FB screen memory remapping. The foreground VP will use the GPU to render directly into screen memory, but background VPs that use the GPU will render into their respective backing buffers. When the foreground VP changes, the GPU driver locates all GPU addresses which are mapped to the physical screen memory as well as the background VP's backing buffer in system RAM. It must then remap those GPU addresses to point to the new backing buffer and to the physical screen memory, respectively. To accomplish this remapping, Cells provides a callback interface from the mux_fb driver which provides source and destination physical addresses on each foreground VP switch.
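As a rough illustration of this callback interface, the following model passes (source, destination) physical address pairs to a registered GPU driver, which rewrites its own mappings. The class names, the callback signature, and the dictionary-based mapping table are assumptions made for this sketch; the real interface lives in the kernel.

```python
class MuxFbSwitchNotifier:
    """Hypothetical model of the mux_fb foreground-switch callback list."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def foreground_switch(self, remaps):
        # remaps: list of (source_phys, dest_phys) pairs for this switch.
        for callback in self._callbacks:
            callback(remaps)


class GpuDriverModel:
    """Toy GPU driver: maps GPU linear addresses to physical addresses."""

    def __init__(self):
        self.gpu_mappings = {}

    def on_switch(self, remaps):
        table = dict(remaps)
        # Look up every mapping against its old target so that the
        # screen <-> backing-buffer swap is applied simultaneously.
        for gpu_addr, phys in list(self.gpu_mappings.items()):
            if phys in table:
                self.gpu_mappings[gpu_addr] = table[phys]
```

Mappings that point at neither screen memory nor a backing buffer involved in the switch are left untouched.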


While this technique necessitates a certain level of access to the GPU driver, it does not preclude the possibility of using a proprietary driver so long as it exposes three basic capabilities. First, it should provide the ability to remap GPU linear addresses to specified physical addresses as required by the virtualization mechanism. Second, it should provide the ability to safely reinitialize the GPU device or ignore re-initialization attempts as each VP running a stock user space configuration will attempt to initialize the GPU on startup. Third, it should provide the ability to ignore device power management and other non-graphics related hardware state updates, making it possible to ignore such events from a user space instance running in a background VP. Some of these capabilities were already available on the Adreno GPU driver, used in the Nexus 1, but not all. We added a small number of lines of code to the Adreno GPU driver and PowerVR GPU driver, used in the Nexus S, to implement these three capabilities.


While most modern GPUs include an MMU, there are some devices which require memory used by the GPU to be physically contiguous. For example, the Adreno GPU can selectively disable the use of the MMU. For Cells GPU virtualization to work under these conditions, the backing memory in system RAM must be physically contiguous. This can be done by allocating the backing memory either with kmalloc, or using an alternate physical memory allocator such as Google's pmem driver or Samsung's s3c_mem driver.


To provide Cells users the same power management experience as non-virtualized phones, we apply two simple virtualization principles: (1) background VPs should not be able to put the device into a low power mode, and (2) background VPs should not prevent the foreground VP from putting the device into a low power mode. We apply these principles to Android's custom power management, which is based on the premise that a mobile phone's preferred state should be suspended. Android introduces three interfaces which attempt to extend the battery life of mobile devices through extremely aggressive power management: early suspend, fbearlysuspend, and wake locks, also known as suspend blockers [33].


The early suspend subsystem is an ordered callback interface allowing drivers to receive notifications just before a device is suspended and after it resumes. Cells virtualizes this subsystem by disallowing background VPs from initiating suspend operations. The remaining two Android-specific power management interfaces present unique challenges and offer insights into aggressive power management virtualization.


Frame Buffer Early Suspend
The fbearlysuspend driver exports display device suspend and resume state into user space. This allows user space to block all processes using the display while the display is powered off, and redraw the screen after the display is powered on. Power is saved since the overall device workload is lower and devices such as the GPU may be powered down or made quiescent. Android implements this functionality with two sysfs files, wait_for_fb_sleep and wait_for_fb_wake. When a user process opens and reads from one of these files, the read blocks until the framebuffer device is either asleep or awake, respectively.


Cells virtualizes fbearlysuspend by making it namespace aware, leveraging the kernel-level device namespace and foreground-background usage model. In the foreground VP, reads function exactly as on a non-virtualized system. Reads from a background VP always report the device as sleeping. When the foreground VP switches, all processes in all VPs blocked on either of the two files are unblocked, and the return values from the read calls are based on the new state of the VP in which the process is running. Processes in the new foreground VP see the display as awake, processes in the formerly foreground VP see the display as asleep, and processes running in background VPs that remain in the background continue to see the display as asleep. This forces background VPs to pause drawing or rendering. Overall system load drops because fewer processes use hardware drawing resources, and graphics throughput in the foreground VP increases because its processes have exclusive access to the hardware.
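The namespace-aware read behavior can be captured by two pure functions. This is a simplified model, not the driver: the function names and the string states are assumptions, and real reads block in the kernel rather than returning a boolean.

```python
def display_state_seen_by(vp, foreground_vp, real_display_awake):
    """State of the display as reported to processes in `vp`."""
    if vp != foreground_vp:
        # Background VPs always see the display as asleep.
        return "asleep"
    # The foreground VP sees the real device state.
    return "awake" if real_display_awake else "asleep"


def read_would_block(file_name, vp, foreground_vp, real_display_awake):
    """Whether a read on one of the two sysfs files would block."""
    seen = display_state_seen_by(vp, foreground_vp, real_display_awake)
    if file_name == "wait_for_fb_sleep":
        return seen != "asleep"
    if file_name == "wait_for_fb_wake":
        return seen != "awake"
    raise ValueError("unknown fbearlysuspend file: " + file_name)
```

Under this model a background process reading wait_for_fb_wake stays blocked until its VP becomes the foreground, which is what pauses its drawing.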


Wake locks are a special kind of OS kernel reference counter with two states: active and inactive. When a wake lock is “locked”, its state is changed to active; when “unlocked,” its state is changed to inactive. A wake lock can be locked multiple times, but only requires a single unlock to put it into the inactive state. The Android system will not enter suspend, or low power mode, until all wake locks are inactive. When all locks are inactive, a suspend timer is started. If it completes without an intervening lock then the device is powered down.


Wake locks in a background VP can interfere with the foreground VP's ability to suspend the device. This, coupled with their distributed use and initialization, makes wake locks a challenging virtualization problem. Wake locks can be created statically at compile time or dynamically by kernel drivers or user space. They can also be locked and unlocked from user context, kernel context (work queues), and interrupt context (IRQ handlers) independently, making determination of the VP to which a wake lock belongs a non-trivial task.


Cells leverages the kernel-level device namespace and foreground-background usage model to maintain both kernel and user space wake lock interfaces while adhering to the two virtualization principles specified above. The solution is predicated on three assumptions. First, all lock and unlock coordination in the trusted root namespace was correct and appropriate before virtualization. Second, we trust the kernel and its drivers; when lock or unlock is called from interrupt context, we perform the operation unconditionally. Third, the foreground VP maintains full control of the hardware.


Under these assumptions, Cells virtualizes Android wake locks by allowing multiple device namespaces to independently lock and unlock the same wake lock. Power management operations are initiated based on the state of the set of locks associated with the foreground VP. The solution comprises the following set of rules:


1. When a wake lock is locked, a namespace “token” is associated with the lock indicating the context in which the lock was taken. A wake lock token may contain references to multiple namespaces if the lock was taken from those namespaces.

2. When a wake lock is unlocked from user context, remove the associated namespace token.

3. When a wake lock is unlocked from interrupt context or the root namespace, remove all lock tokens. This follows from the second assumption.

4. After a user context lock or unlock, adjust any suspend timeout value based only on locks acquired in the current device namespace.

5. After a root namespace lock or unlock, adjust the suspend timeout based on the foreground VP's device namespace.

6. When the foreground VP changes, reset the suspend timeout based on locks acquired in the newly active namespace. This requires per-namespace bookkeeping of suspend timeout values.
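Rules 1 to 3 can be illustrated with a small token model. This sketch makes several assumptions: the `WakeLock` class, the `ROOT` marker, and `may_suspend` are hypothetical names, the suspend-timeout bookkeeping of rules 4 to 6 is omitted, and treating a token held by the root namespace as blocking suspend is a guess rather than something the text states.

```python
ROOT = "root"  # hypothetical label for the trusted root namespace


class WakeLock:
    """Toy wake lock carrying a set of device-namespace tokens."""

    def __init__(self, name):
        self.name = name
        self.tokens = set()

    def lock(self, namespace):
        # Rule 1: record the namespace in which the lock was taken.
        self.tokens.add(namespace)

    def unlock(self, namespace=None, interrupt_context=False):
        if interrupt_context or namespace == ROOT:
            # Rule 3: interrupt context or root namespace drops all tokens.
            self.tokens.clear()
        else:
            # Rule 2: user context drops only the caller's token.
            self.tokens.discard(namespace)


def may_suspend(wake_locks, foreground_ns):
    """True if no lock is active for the foreground VP (or root, assumed)."""
    return not any(
        foreground_ns in wl.tokens or ROOT in wl.tokens for wl in wake_locks
    )
```

A lock held only by a background VP leaves `may_suspend` true for the foreground VP, which matches the second virtualization principle.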

One additional mechanism was necessary to implement the Cells wake lock virtualization. The set of rules given above implicitly assumes that, aside from interrupt context, the lock and unlock functions are aware of the device namespace in which the operation is being performed. While this is true for operations started from user context, it is not the case for operations performed from kernel work queues. To address this issue, we introduced a mechanism which executes a kernel work queue in a specific device namespace.


The third method is to modify a device driver to be aware of device namespaces. For example, Android includes a number of custom pseudo drivers which are not part of an existing kernel subsystem, such as the Binder IPC mechanism. To provide isolation among VPs, Cells needs to ensure that under no circumstances can a process in one VP gain access to Binder instances in another VP. This is done by modifying the Binder driver so that instead of allowing Binder data structures to reference a single global list of all processes, they reference device namespace isolated lists and only allow communication between processes associated with the same device namespace. A Binder device namespace context is only initialized when the Binder device file is first opened, resulting in almost no overhead for future accesses. While the device driver itself needs to be modified, pseudo device drivers are not hardware-specific and thus changes only need to be made once for all hardware platforms. In some cases, however, it may be necessary to modify a hardware-specific device driver to make it aware of device namespaces. For most devices, this is straightforward and involves duplicating necessary driver state upon device namespace creation and tagging the data describing that state with the device namespace. Even this can be avoided if the device driver provides some basic capabilities as described in Section 4.2, which discusses GPU virtualization.
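The per-namespace isolation of Binder can be pictured with the following model. It is a sketch only: `BinderModel`, its methods, and the use of `PermissionError` are assumptions, and the namespace is passed explicitly here whereas a kernel driver would derive it from the calling process.

```python
class BinderModel:
    """Toy model of Binder process lists isolated per device namespace."""

    def __init__(self):
        # device namespace -> set of pids that opened the Binder device
        self._contexts = {}

    def open(self, namespace, pid):
        # The namespace context is created lazily on first open.
        self._contexts.setdefault(namespace, set()).add(pid)

    def transact(self, namespace, sender_pid, target_pid):
        procs = self._contexts.get(namespace, set())
        # Only processes listed in the same device namespace are visible.
        if sender_pid not in procs or target_pid not in procs:
            raise PermissionError("target not visible in this device namespace")
        return "delivered"
```

Because lookups never consult a global list, a process in one VP has no way to name a Binder instance in another VP.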


The second method is to modify a device subsystem to be aware of device namespaces. For example, the input device subsystem in Linux handles various devices such as the touchscreen, navigation wheel, compass, GPS, proximity sensor, light sensor, headset input controls, and input buttons. The input subsystem consists of the input core, device drivers, and event handlers, the latter being responsible for passing input events to user space. By default in Linux, input events are sent to any process that is listening for them, but this does not provide the isolation needed for supporting VPs. To enable the input subsystem to use device namespaces, Cells only has to modify the event handlers so that, for each process listening for input events, event handlers first check if the corresponding device namespace is in the foreground. If it is not, the event is not raised to that specific process. The implementation is simple, and no changes are required to device drivers or the input core. As another example, virtualization of the power management subsystem is described in Section 5.
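The event handler check amounts to a filter on the listener's device namespace. The function below is a minimal model, assuming each listener is a hypothetical (namespace, queue) pair; the actual change is made inside the Linux input event handlers.

```python
def deliver_input_event(event, listeners, foreground_ns):
    """Append `event` to the queue of each foreground listener.

    listeners: list of (device_namespace, queue) pairs, where queue is a list.
    Returns the number of listeners that received the event.
    """
    delivered = 0
    for namespace, queue in listeners:
        if namespace != foreground_ns:
            # Listener belongs to a background VP: do not raise the event.
            continue
        queue.append(event)
        delivered += 1
    return delivered
```

Device drivers and the input core are not involved in the check, which is consistent with the claim that only the event handlers need to change.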


Figure 1 provides an overview of the Cells system architecture. We describe Cells using Android since our prototype is based on it. Each VP runs a stock Android user space environment. Cells leverages lightweight OS virtualization [3, 23] to isolate VPs from one another. Cells uses a single OS kernel across all VPs that virtualizes identifiers, kernel interfaces, and hardware resources such that several execution environments can exist side-by-side in virtual OS sandboxes. Each VP has its own private virtual namespace so that VPs can run concurrently and use the same OS resource names inside their respective namespaces, yet be isolated from and not conflict with each other. This is done by transparently remapping OS resource identifiers to virtual ones that are used by processes within each VP. File system paths, process identifiers (PIDs), IPC identifiers, network interface names, and user names (UIDs) must all be virtualized to prevent conflicts and ensure that processes running in one VP cannot see processes in other VPs. The Linux kernel, including the version used by Android, provides virtualization for these identifiers through namespaces [3]. For example: the file system (FS) is virtualized using mount namespaces that allow different independent views of the FS and provide isolated private FS jails for VPs [16].


Uploaded on May 25, 2013
