Mesh: Compacting Memory Management for C/C++ Applications

Remarks

Conference: PLDI 2019
Full Paper: https://dl.acm.org/doi/10.1145/3314221.3314582
Artifact: https://github.com/plasma-umass/Mesh

Summary

Significance: A technique that reduces the memory consumption of C/C++ programs and protects them from memory exhaustion. The paper shows analytically that Mesh provably avoids catastrophic memory fragmentation with high probability, and shows empirically that Mesh can substantially reduce memory fragmentation for memory-intensive C/C++ applications with low runtime overhead.
Novelty: Mesh combines remapping of virtual to physical pages (meshing) with randomized allocation and search algorithms to enable safe and effective compaction without relocation for C/C++. Mesh is a drop-in replacement for malloc() that can transparently recover from memory fragmentation without any changes to application code.
Soundness: Experiments are conducted on the SPEC 2006 benchmark suite, Redis 4.0.2, Ruby 2.5.1, and the Firefox browser. The results show that Mesh reduces the memory consumption of Firefox by 16% and of Redis by 39%.

Introduction

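Q: Why do we need to reduce a program's memory consumption?
A: Excessive memory consumption is a serious problem on modern computing platforms, from mobile to desktop to the datacenter. For example, Google reports that on low-end Android devices more than 99% of Chrome crashes are caused by running out of memory while trying to display a web page. The Firefox web browser has been the target of a five-year effort to reduce its memory footprint. In datacenters, developers implement a range of techniques, from custom allocators to other ad hoc approaches, to improve memory utilization.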

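Q: What makes it hard to reduce the memory consumption of C/C++ programs?
A: A key challenge is that, unlike in garbage-collected environments such as Java, automatically reducing a C/C++ application's memory footprint through compaction is not possible. Because the addresses of allocated objects are exposed directly to the programmer, C/C++ applications can freely modify or hide them. For example, a program may store an address as an integer, keep flags in the low bits of an aligned address, perform arithmetic on addresses and dereference the results later, or even write addresses to disk and reload them afterwards. This hostile environment makes it impossible to relocate objects safely: if an object is relocated, every pointer to its original location must be updated. However, there is no way to update every reference safely when references are ambiguous, let alone when they are not present in memory at all.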

Q: What are the main memory allocation functions?
A:
malloc()
malloc() is the user-space interface that C/C++ programmers use to allocate memory from the heap; the memory must be released manually with free().
glibc malloc() ← ptmalloc()
In user space, glibc tracks freed memory chunks in free lists. When the user calls free(), the chunk is added to a free list rather than returned to the kernel immediately. If the chunk's neighbors are also free, they may be merged to reduce fragmentation. If no chunk of the required size is available in the free lists, the allocator uses brk()/sbrk() (or mmap() for large requests) to obtain more memory from the kernel.
Other malloc() replacements
tcmalloc from Google
jemalloc (originally from FreeBSD, later developed at Facebook)

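Q: Why does memory fragmentation exist?
A: free() reclaims heap space, but a reclaimed block is not returned to the kernel immediately. Instead, the corresponding chunk is marked as free and added to a free list. When chunks at adjacent addresses appear in the free list, they can be merged to reduce fragmentation so that later large allocation requests can be satisfied. Fragmentation arises when free chunks are scattered among live ones: they cannot be merged, so the total free space may be large even though no single chunk can serve a big request.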
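The coalescing step described above can be sketched in a few lines. This is a simplified model, not glibc's implementation: the `Chunk` tuple and the helper names `coalesce` and `largest_free` are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Chunk = Tuple[int, int]  # (start address, size in bytes), illustrative model


def coalesce(free_chunks: List[Chunk]) -> List[Chunk]:
    """Merge free chunks whose address ranges touch each other."""
    merged: List[Chunk] = []
    for start, size in sorted(free_chunks):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_size = merged[-1]
            merged[-1] = (prev_start, prev_size + size)  # extend the neighbor
        else:
            merged.append((start, size))
    return merged


def largest_free(free_chunks: List[Chunk]) -> int:
    """Largest request that a single free chunk could satisfy."""
    return max((size for _, size in free_chunks), default=0)


# Two of the three chunks are adjacent, so they merge into one 128-byte chunk.
adjacent = coalesce([(0, 64), (64, 64), (256, 64)])

# Same total free space (192 bytes), but live objects sit between the chunks:
# nothing can merge, and a 128-byte request cannot be served from the free list.
scattered = coalesce([(0, 64), (128, 64), (256, 64)])
```

In the scattered case the free space is real but unusable for larger requests, which is exactly the fragmentation that a non-moving allocator cannot repair on its own.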

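Q: Which languages support compacting garbage collection?
A: Java and LISP, for example. Other languages such as Rust and Go currently do not support compaction; the Mesh authors say they may try to integrate Mesh into them.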

Key Technique I: Meshing

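Q: What is the meshing mechanism?
A: Mesh performs compaction without relocation; that is, without changing the addresses of objects. This property is essential for compatibility with arbitrary C/C++ applications. To achieve it, Mesh builds on a mechanism the authors call meshing, first introduced by the Hound memory leak detector of Novark et al. Hound used meshing to avoid the catastrophic memory consumption caused by its memory-inefficient allocation scheme, which can reclaim memory only when every object on a page has been freed. Hound first searches for pages whose live objects do not overlap. It then copies the contents of one page onto the other, remaps one of the virtual pages to point to the single physical page that now holds the contents of both, and finally relinquishes the other physical page to the operating system. The accompanying figure in the paper illustrates meshing in action.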

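Q: What problems can meshing run into?
A: Mesh works by finding pairs of pages and merging them physically (but not virtually): this merging lets it relinquish physical pages to the OS. Meshing is possible only when no objects on the two pages occupy the same offset. A key observation is that as fragmentation increases (that is, as there are more free objects), the likelihood of finding pairs of pages that mesh also increases.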
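The non-overlap condition can be made concrete with a toy model. The snippet below is a minimal sketch, not Mesh's actual code: a page is modeled as a list of slots where `None` marks a free slot, and the names `meshable` and `mesh` are hypothetical.

```python
from typing import List, Optional

Page = List[Optional[str]]  # one entry per object offset; None means free


def meshable(a: Page, b: Page) -> bool:
    """Two pages can mesh only if no offset is live in both of them."""
    return all(x is None or y is None for x, y in zip(a, b))


def mesh(a: Page, b: Page) -> Page:
    """Copy b's live objects into a's free slots at the same offsets.

    A real allocator would then remap both virtual pages to the single
    physical page modeled by the return value, so no object address changes.
    """
    if not meshable(a, b):
        raise ValueError("pages have live objects at the same offset")
    return [x if x is not None else y for x, y in zip(a, b)]


page1 = ["A", None, None, "B"]
page2 = [None, "C", None, None]
page3 = ["D", None, None, None]  # collides with page1 at offset 0

merged = mesh(page1, page2)
```

Here `page1` and `page2` mesh into one physical page, while `page1` and `page3` cannot because both hold a live object at offset 0.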

Key Technique II: Random Allocation

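Q: How does Mesh address this problem?
A: A key threat to meshing is that pages can contain objects at the same offset, which prevents them from being meshed. In the worst case, every span holds a single allocated object and all of those objects sit at the same offset, so none of the spans can be meshed. Mesh uses randomized allocation to make this worst case extremely unlikely: it places objects uniformly at random across all available offsets in a span. As a result, the probability that all objects occupy the same offset is (1/b)^(n−1), where b is the number of objects in a span (page) and n is the number of spans.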
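The formula (1/b)^(n−1) can be checked numerically. The sketch below computes it exactly and cross-checks it by enumerating every possible placement for small inputs; the function names are hypothetical, and the model assumes exactly one object per span, as in the worst case described above.

```python
from fractions import Fraction
from itertools import product


def worst_case_probability(b: int, n: int) -> Fraction:
    """Probability that n spans, each with one object placed uniformly at
    random among b offsets, all end up using the same offset."""
    return Fraction(1, b) ** (n - 1)


def brute_force(b: int, n: int) -> Fraction:
    """Enumerate all b**n placements and count those where every span uses
    the same offset. Only practical for small b and n."""
    placements = list(product(range(b), repeat=n))
    same = sum(1 for p in placements if len(set(p)) == 1)
    return Fraction(same, len(placements))


small_exact = worst_case_probability(4, 3)
small_enumerated = brute_force(4, 3)
```

The enumeration makes the reasoning visible: of the b^n equally likely placements, exactly b have every object at the same offset, giving b / b^n = (1/b)^(n−1). For example, with b = 256 and n = 64 the probability is 2^(−504), roughly 1.9 × 10^(−152).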

Implementation

In short, Mesh = MiniHeaps + Shuffle Vectors + Thread Local Heaps + Global Heap + Meshing

Page: The smallest block of memory managed by the operating system, 4 KiB on most architectures. Memory given to the allocator by the operating system is always a multiple of the page size and aligned to the page size.
Span: A contiguous run of 1 or more pages. It is often larger than the page size to account for large allocations and amortize the cost of heap metadata.
Arena: A contiguous range of virtual address space we allocate out of. All allocations returned by malloc() reside within the arena.
GlobalHeap: The global heap carves out the Arena into Spans and performs meshing.
MiniHeap: Metadata for a Span – at any time a live Span has a single MiniHeap owner. For small objects, MiniHeaps have a bitmap to track whether an allocation is live or freed.
ThreadLocalHeap: A collection of MiniHeaps and a ShuffleVector so that most allocations and free()s can be fast and lock-free.
ShuffleVector: A novel data structure that enables randomized allocation with bump-pointer-like speed.

MiniHeaps:

MiniHeaps manage allocated physical spans of memory and are either attached or detached. An attached MiniHeap is owned by a specific thread-local heap, while a detached MiniHeap is only referenced through the global heap. New small objects are only allocated out of attached MiniHeaps. Each MiniHeap contains metadata that comprises span length, object size, allocation bitmap, and the start addresses of any virtual spans meshed to a unique physical span.
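The metadata listed above can be pictured as a small record. The following is an illustrative sketch only: the field and method names are hypothetical, and the bitmap is modeled as a Python integer rather than a fixed-width bit array.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniHeap:
    """Metadata for one physical span (names are illustrative)."""
    span_pages: int                 # span length, in pages
    object_size: int                # size of each object slot, in bytes
    object_count: int               # number of slots in the span
    bitmap: int = 0                 # bit i is 1 while slot i is live
    attached: bool = False          # True while owned by a thread-local heap
    # Start addresses of every virtual span meshed onto this physical span.
    virtual_spans: List[int] = field(default_factory=list)

    def allocate(self, slot: int) -> None:
        if not self.attached:
            raise RuntimeError("new objects come only from attached MiniHeaps")
        if not 0 <= slot < self.object_count:
            raise IndexError("slot out of range")
        if (self.bitmap >> slot) & 1:
            raise ValueError("slot already live")
        self.bitmap |= 1 << slot

    def free(self, slot: int) -> None:
        self.bitmap &= ~(1 << slot)

    def live_objects(self) -> int:
        return bin(self.bitmap).count("1")


mh = MiniHeap(span_pages=1, object_size=256, object_count=16,
              attached=True, virtual_spans=[0x7F0000000000])
mh.allocate(3)
mh.allocate(9)
mh.free(3)
```

Frees are allowed on a detached MiniHeap in this sketch, while allocation is not, which mirrors the rule that new small objects are allocated only out of attached MiniHeaps.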

Shuffle Vectors:

The core idea comes down to one word: shuffling.
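One way to picture a shuffle vector is as a randomly ordered stack of free offsets. The sketch below assumes that allocation pops from the vector and that a free pushes the offset back and swaps it with a randomly chosen entry (a single Fisher-Yates step); the class and method names are illustrative, and this is not the allocator's actual data layout.

```python
import random
from typing import Iterable, List, Optional


class ShuffleVector:
    """Free offsets of one size class, kept in random order (sketch)."""

    def __init__(self, offsets: Iterable[int],
                 rng: Optional[random.Random] = None) -> None:
        self._rng = rng if rng is not None else random.Random()
        self._free: List[int] = list(offsets)
        self._rng.shuffle(self._free)  # full shuffle once, at refill time

    def malloc(self) -> Optional[int]:
        """Pop the next offset; as cheap as a bump-pointer allocation."""
        return self._free.pop() if self._free else None

    def free(self, offset: int) -> None:
        """Push the offset, then swap it with a random entry so the order
        of future allocations stays random."""
        self._free.append(offset)
        j = self._rng.randrange(len(self._free))
        self._free[-1], self._free[j] = self._free[j], self._free[-1]

    def __len__(self) -> int:
        return len(self._free)


sv = ShuffleVector(range(8), random.Random(0))
allocated = [sv.malloc() for _ in range(8)]
```

The design point is that randomness is paid for up front and on free, so the allocation fast path remains a plain pop.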

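Thread Local Heaps:

All malloc and free requests from an application start at the thread's local heap. Allocation requests are handled differently depending on their size. A request larger than 16K is forwarded to the global heap. Requests of 16K and smaller are small-object allocations and are handled directly by the shuffle vector for the size class that corresponds to the request. If that shuffle vector is empty, it is refilled by requesting an appropriately sized MiniHeap from the global heap.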
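The routing rule can be sketched as follows. Only the 16K threshold comes from the text; the size-class table, `SLOTS_PER_MINIHEAP`, and all class and method names are hypothetical, and the shuffle vector is reduced to a shuffled Python list.

```python
import random
from typing import Dict, List, Optional, Tuple

LARGE_THRESHOLD = 16 * 1024  # from the text: larger requests go to the global heap
SIZE_CLASSES = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384]  # hypothetical
SLOTS_PER_MINIHEAP = 8       # hypothetical, kept small for illustration


def size_class(size: int) -> int:
    """Smallest size class that fits the request."""
    return next(c for c in SIZE_CLASSES if size <= c)


class GlobalHeap:
    def __init__(self) -> None:
        self.miniheaps_handed_out = 0

    def alloc_large(self, size: int) -> Tuple[str, int]:
        return ("large", size)

    def alloc_miniheap(self, cls: int) -> List[int]:
        """Return the free slot offsets of a fresh MiniHeap for this class."""
        self.miniheaps_handed_out += 1
        return [i * cls for i in range(SLOTS_PER_MINIHEAP)]


class ThreadLocalHeap:
    def __init__(self, global_heap: GlobalHeap, seed: Optional[int] = None) -> None:
        self.global_heap = global_heap
        self.rng = random.Random(seed)
        self.shuffle_vectors: Dict[int, List[int]] = {}

    def malloc(self, size: int) -> tuple:
        if size > LARGE_THRESHOLD:
            return self.global_heap.alloc_large(size)
        cls = size_class(size)
        vec = self.shuffle_vectors.setdefault(cls, [])
        if not vec:  # empty shuffle vector: refill from the global heap
            vec.extend(self.global_heap.alloc_miniheap(cls))
            self.rng.shuffle(vec)
        return ("small", cls, vec.pop())


g = GlobalHeap()
t = ThreadLocalHeap(g, seed=1)
large = t.malloc(20000)
small = t.malloc(24)
```

In this model the global heap is touched only for large requests and refills, which is why most small allocations can avoid shared state.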

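Global Heap:

The global heap allocates MiniHeaps for thread-local heaps, handles all large-object allocations, performs non-local frees for both small and large objects, and coordinates meshing.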

Meshing:

Meshing is rate limited by a configurable parameter, settable at program startup and during runtime by the application through the semi-standard mallctl API. The default rate meshes at most once every tenth of a second. If the last meshing freed less than one MB of heap space, the timer is not restarted until a subsequent allocation is freed through the global heap. This approach ensures that Mesh does not waste time searching for meshes when the application and heap are in a steady state.
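The rate-limiting policy can be sketched as a small state machine. The 0.1 s period and the 1 MB threshold come from the text; the class name, method names, and the choice to pass timestamps in explicitly are assumptions made for illustration.

```python
class MeshRateLimiter:
    """Sketch of the meshing timer described above (names are hypothetical)."""

    def __init__(self, period_s: float = 0.1,
                 min_freed_bytes: int = 1 << 20) -> None:
        self.period_s = period_s                # at most one mesh per period
        self.min_freed_bytes = min_freed_bytes  # below this, stop the timer
        self._next_allowed = 0.0
        self._paused = False

    def should_mesh(self, now: float) -> bool:
        return not self._paused and now >= self._next_allowed

    def record_mesh(self, now: float, freed_bytes: int) -> None:
        if freed_bytes < self.min_freed_bytes:
            # Little was gained: do not restart the timer, so a heap in
            # steady state is not searched for meshes again and again.
            self._paused = True
        else:
            self._next_allowed = now + self.period_s

    def on_global_free(self, now: float) -> None:
        """A free reached the global heap, so the heap changed: rearm."""
        if self._paused:
            self._paused = False
            self._next_allowed = now + self.period_s


limiter = MeshRateLimiter()
```

Passing `now` as an argument keeps the sketch deterministic; a real allocator would read a monotonic clock instead.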

The implementation relies on the following OS-related APIs:

mmap
fallocate (FALLOC_FL_PUNCH_HOLE)
mallctl
memfd_create
mprotect
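To show why a file-backed heap helps, the sketch below maps the same file offset at two different virtual addresses, so that both mappings share one physical page. It is a Unix-only illustration under stated assumptions: it uses Python's `mmap` module rather than the raw system calls, it falls back to an unlinked temporary file where `os.memfd_create` is unavailable, and it does not reproduce fixed-address remapping, hole punching, or page protection. The helper name `open_backing_file` is hypothetical.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def open_backing_file(size: int) -> int:
    """Return a file descriptor for `size` bytes of shareable backing memory."""
    try:
        fd = os.memfd_create("mesh-demo")   # Linux only, Python 3.8+
    except (AttributeError, OSError):
        fd, path = tempfile.mkstemp()       # fallback: unlinked temporary file
        os.unlink(path)
    os.ftruncate(fd, size)
    return fd


fd = open_backing_file(PAGE)

# Two independent virtual mappings of the same file offset share one page.
span_a = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED, offset=0)
span_b = mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED, offset=0)

span_a[0:5] = b"hello"     # object written through the first virtual span
span_b[64:69] = b"world"   # object at a different offset, via the second span

seen_via_b = bytes(span_b[0:5])
seen_via_a = bytes(span_a[64:69])

span_a.close()
span_b.close()
os.close(fd)
```

Each write is visible through the other mapping, which is the property meshing depends on: objects keep their virtual addresses while sharing physical memory.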

Example

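The figure below shows an example of a meshing graph; an edge between two nodes means that the corresponding spans have no conflicting offsets and can be meshed.
[Figure: example meshing graph]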
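A meshing graph of this kind can be built directly from allocation bitmaps. The sketch below is an illustration only: spans are modeled as integers whose set bits are live offsets, and the greedy pairing is a simple stand-in rather than the randomized search that Mesh uses. All names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def meshing_graph(bitmaps: List[int]) -> Dict[int, Set[int]]:
    """Connect spans i and j when their live offsets do not overlap."""
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(bitmaps))}
    for i, j in combinations(range(len(bitmaps)), 2):
        if bitmaps[i] & bitmaps[j] == 0:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_pairs(graph: Dict[int, Set[int]]) -> List[Tuple[int, int]]:
    """Pick disjoint meshable pairs greedily (a simplification, not Mesh's search)."""
    used: Set[int] = set()
    pairs: List[Tuple[int, int]] = []
    for i in sorted(graph):
        if i in used:
            continue
        for j in sorted(graph[i]):
            if j not in used:
                pairs.append((i, j))
                used.update((i, j))
                break
    return pairs


spans = [0b1100, 0b0011, 0b0101, 0b1010]  # bit k set means offset k is live
graph = meshing_graph(spans)
pairs = greedy_pairs(graph)
```

Each chosen pair releases one physical span, so the number of pairs found is the number of spans that meshing can give back to the OS in this toy example.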

Evaluation

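Measuring memory usage: To measure an application's memory usage over time precisely, the authors developed a Linux-based utility, mstat, which runs a program in a new memory control group [26]. mstat polls the resident set size (RSS) and kernel memory usage statistics of all processes in the control group at a fixed frequency. This makes it possible to account, in the evaluation, for the memory required by the larger page tables that meshing causes. The authors verified that mstat does not perturb the performance results.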
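For a number of memory-intensive applications, including aggressively space-optimized applications such as Firefox, Mesh can substantially reduce memory consumption (by 16% to 39%) while imposing only a modest impact on runtime performance (for example, around 1% for Firefox and SPECint 2006).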