"Garbage Collection Algorithm Handbook The Art of Automatic Memory Management" - Runtime Interface (Notes)

11. Runtime interface

11.1 Object allocation interface

Object allocation proceeds in three stages:

  1. Allocation proper: the allocation subsystem of the memory manager obtains a cell of suitable size and alignment.

  2. System initialisation, that is, the initialization that must be complete before the user program may access the object: all of its fields must be set to appropriate values. In an object-oriented language, for example, setting the method dispatch vector of the newly allocated object belongs to this stage. This stage usually also fills in the header fields required by the programming language or the memory manager: for Java objects this includes the hash code and synchronization information, and Java arrays must additionally record their length.

  3. Secondary initialisation, that is, setting (or updating) some of the object's fields after it has "escaped" from the allocation subsystem and can potentially be accessed by other parts of the program or by other threads.

Here is how object allocation and initialization divide across these stages in C and in Java:

C:

  • All the work happens in stage 1; the language provides no system or secondary initialization. Everything is left to the developer (or simply not done).

  • Note that the allocator may still need to fill in a header for the allocated cell so that it can be freed later, but that header lies outside the cell returned to the caller.

Java:

  • Stages 1 and 2 together initialize the new object's method dispatch vector, hash code, and synchronization information, and set all other fields to a default value (usually all zeros).

  • The length field of an array is also initialized in these two stages. The object returned by the bytecode new is in this state: it already satisfies type safety, yet it is still a completely "blank" object.

  • Stage 3 corresponds in Java to constructors and static initializers, or to any code that sets fields to non-zero values after the object is created. Initialization of final fields also happens in stage 3, which is why it is so complicated to handle a newly created object that escapes to other threads prematurely while those threads must not observe later changes to its fields. A minimal sketch of the three stages follows.
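As a rough illustration, the three stages for a Java-like object might look as follows in C; gc_alloc_raw and the Header layout are assumptions of this sketch, not any particular VM's interface.

    /* A minimal sketch of the three-stage split for a Java-like object.
       gc_alloc_raw(), Class, and the Header layout are assumptions of
       this sketch, not part of any particular VM. */
    #include <stddef.h>
    #include <string.h>

    typedef struct Class Class;
    typedef struct { Class *vtable; unsigned hash_and_lock; } Header;

    extern void *gc_alloc_raw(size_t size);   /* stage 1: size + alignment */

    void *allocate_object(Class *cls, size_t size) {
        /* Stage 1: obtain a cell of suitable size and alignment. */
        Header *obj = (Header *)gc_alloc_raw(size);

        /* Stage 2: system initialization - set the dispatch vector and
           header info, and zero all other fields. The object is now
           type-safe but still "blank". */
        obj->vtable = cls;
        obj->hash_and_lock = 0;
        memset(obj + 1, 0, size - sizeof(Header));

        /* Stage 3 (secondary initialization) runs in the constructor,
           after the object has left the allocation subsystem. */
        return obj;
    }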


If the programming language requires fully initialized objects, the definition of the allocation interface raises a subtle problem:

  • To let developers supply an initial value for every field of every object type, an unbounded number of allocation interfaces would be needed, depending on the number and types of fields each object contains.

Modula-3 solves this by letting developers (optionally) write functional initialization methods: an initializing closure can be passed to the allocation routine, which allocates the appropriate space and then runs the closure to fill in the object's fields.

The initializing closure carries the initial values and the code that stores them into specific fields of the object. Modula-3 is statically scoped, so the closure itself need not be heap-allocated; it is just a static-chain pointer (to the variables of its enclosing environment), which avoids infinite regress during allocation. That said, if the compiler can generate the initialization code automatically, it hardly matters whether the initialization happens inside or outside the allocator.

The Glasgow Haskell compiler employs a different strategy to solve this problem:

It inlines all the work of stages 1 and 2 and invokes the collector when memory runs out. The allocator acquires memory for a new object by sequential (bump-pointer) allocation, which is simple to implement, and initialization usually just stores computed values into the object's header and fields. This is a case of the compiler being tightly coupled to a particular allocation algorithm (and hence to a particular collection algorithm).

Functional initialization has two significant advantages:

  1. It guarantees that the object's initialization is complete, and the initialization code appears atomic to the collector.
  2. The writes performed during initialization can avoid certain write barriers. In a generational collector in particular, the object being initialized must be younger than any object it refers to, so initializing stores may skip the generational write barrier. Note, however, that this conclusion generally does not hold inside Java constructors.

Language-level allocation requests eventually call the memory manager's allocation routine; some compilers inline this process, performing all of stage 1 and some or all of stage 2 inline.

A key requirement that the allocation process needs to meet is:

  • All work in stages 1 and 2 must be atomic with respect to other threads and the collector. Only then can we guarantee that other modules of the system never see an object that has not been system-initialized.

Looking more closely at the allocator interface (stage 1), there are many possible divisions of labour among the three stages. The parameters an allocation request may need to convey include the following.

  • The size of the space to allocate, usually in bytes, possibly in words or some other granularity. For array allocation, the interface may take the element size and the number of elements as separate arguments.

  • Alignment requirements: the allocator normally applies a default alignment, but a caller may demand something stricter. Such requirements typically involve a power of two (word, double-word, quad-word alignment, and so on), possibly with an added offset (say, one word past a quad-word boundary).

  • The kind of object to allocate: managed languages such as Java typically distinguish arrays from non-array objects; some systems distinguish objects containing no pointers from other objects; still others treat objects containing executable code specially. In short, anything the allocator must treat specially should be visible in the allocation interface.

  • The specific type of the object, in the programming language's sense. Unlike the "kind", the allocator usually does not care about the type itself but uses it to initialize the object. Passing type information to the allocation routine not only simplifies making stage 2 atomic (by shifting that task into stage 1) but also avoids emitting extra instructions at every allocation site, reducing code size.


Exactly which of these parameters the allocation interface should support depends partly on the language being served. Redundant parameters can also be passed to avoid recomputation at run time.

One implementation strategy is a single full-featured allocation function that takes many parameters and handles every case. To speed allocation and simplify argument passing, customized interfaces can also be provided for different kinds of objects, as in the sketch after the following list.

Taking Java as an example, the customized allocation interfaces might be:

  • Allocation of plain (non-array) objects
  • Allocation of byte/boolean arrays (1-byte elements)
  • Allocation of short/char arrays (2-byte elements)
  • Allocation of int/float arrays (4-byte elements)
  • Allocation of pointer arrays and long/double arrays (8-byte elements)
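A hedged sketch of what such an interface family might look like in C; gc_alloc, the Kind enum, and the wrapper names are all invented for illustration.

    /* Hypothetical full-featured allocation entry point plus specialized
       Java-style wrappers; names, kinds, and parameters are illustrative. */
    #include <stddef.h>

    typedef enum { KIND_PLAIN, KIND_ARRAY, KIND_CODE } Kind;
    typedef struct Class Class;

    extern void *gc_alloc(size_t size, size_t align, Kind kind, Class *type);

    /* Specialized entries fix most arguments, so the compiler can inline
       them and drop the generic argument setup at each allocation site. */
    static inline void *alloc_plain(Class *c, size_t size) {
        return gc_alloc(size, sizeof(void *), KIND_PLAIN, c);
    }
    static inline void *alloc_int_array(Class *c, size_t nelems) {
        /* 4-byte elements; array header space is ignored in this sketch */
        return gc_alloc(nelems * 4, 8, KIND_ARRAY, c);
    }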

Beyond this, allocation interfaces for system-internal objects must also be considered, such as objects representing types, method dispatch tables, and method code. How these are allocated depends on whether they live in the collected heap; even if they are not allocated from the managed heap, the system must still provide special interfaces for them, allocating from an explicitly-freed allocator.

After stage 1 completes, the allocator can check whether the stage succeeded against post-conditions such as the following.

  • The allocated cell satisfies the requested size and alignment, but the cell cannot yet be accessed by the mutator.

  • The allocated cell has been zeroed, ensuring that the program cannot mistake stale pointer or non-pointer data in the cell for valid references. Zero is a good choice of value: for pointers it is the null pointer, and for most types it is an ordinary, legal value. Some languages (such as Java) rely on zeroing, or something like it, to guarantee type safety. In a debugging system it is useful to set unallocated memory to a distinctive non-zero value such as 0xdeadbeef or 0xcafebabe, whose literal reading hints at the state the memory is in.

  • The cell has been given the type requested by the caller. Naturally this applies only when the caller passes type information to the allocator. Compared with the minimal post-condition (the first item above), the difference is that the allocator fills in the object's header.

  • The object is fully type-safe. This involves not only zeroing but also filling in the object header. Even after this step the object is not fully initialized: every field still holds only a safe, ordinary, default zero value, whereas the application usually requires at least one field to be set to a non-default value.

  • The object is fully initialized. This generally requires the caller to pass all the initial values through the allocation interface, so the requirement is uncommon. A good example is Lisp's cons function: it is called so often that providing it a dedicated allocation function, to speed up and simplify its interface, is well justified.

Which of these post-conditions is the most appropriate?

Some post-conditions (such as zeroing) depend on the semantics of the programming language; others depend on the degree of concurrency in the environment and on the ways an object may escape from the thread that created it (thereby becoming reachable by other threads or the collector). Generally, the higher the degree of concurrency and the more common escape is, the stronger the post-conditions required.

Next, let's consider what to do when the allocator cannot immediately satisfy the allocation request.

In most systems we want to invoke garbage collection inside the allocation routine and hide that fact from the caller. The caller then needs to do almost nothing, and we avoid emitting retry logic at every allocation site.

However, we can also inline the common case, the fast path where allocation succeeds, while keeping the collect-and-retry logic outside the inlined code. Once the stage 1 code is inlined there is no sharp boundary between stages 1 and 2, but the whole code sequence must still be implemented efficiently and atomically. The handshake mechanism between the mutator and the collector, introduced later, includes a concrete implementation of this atomicity requirement. With allocation made atomic, we can regard it as an action in which only the mutator participates.

11.1.1 Acceleration of the allocation process

One key technique is to inline the common-case code (the "fast path") while leaving the rarely executed, more complex "slow path" as a function call; the exact split should be chosen by careful comparative measurement under suitable workloads.

Fast path: a path through a program with a shorter instruction path length than the general path. An effective fast path handles the most common cases more efficiently than the general path, leaving special cases, corner cases, and error handling to the general path.

The obvious advantage of sequential allocation is that it is simple to implement, and its code sequence is generally shorter.

If the processor has enough registers, the system can even dedicate one register to the bump pointer and another to the heap limit. The typical code sequence is then:

  • Copy the bump pointer to the result register; add the requested size to the bump pointer; test whether the bump pointer exceeds the heap limit; if it does, call the slow path.

Note that keeping the bump pointer in a register is possible only with thread-local sequential allocation. Some ML and Haskell implementations go further and combine several allocation requests within one code sequence into a single larger request, so that only one limit test and branch is needed. A similar technique applies to other single-entry, multiple-exit code sequences: allocate at once the maximum needed along any execution path, or use that value for a single up-front limit test when the sequence begins.
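A minimal C sketch of this fast path, assuming thread-local free_ptr and top variables and a hypothetical gc_alloc_slow that refills the buffer or triggers collection:

    /* Inlined sequential-allocation fast path (sketch). */
    #include <stddef.h>

    extern __thread char *free_ptr;  /* bump pointer, ideally in a register */
    extern __thread char *top;       /* upper limit of the allocation buffer */
    extern void *gc_alloc_slow(size_t size);

    static inline void *alloc_fast(size_t size) {
        char *result = free_ptr;         /* copy bump pointer to result */
        char *next = result + size;      /* advance by the requested size */
        if (next > top)                  /* limit check */
            return gc_alloc_slow(size);  /* refill buffer / collect, retry */
        free_ptr = next;
        return result;
    }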

Although sequential allocation is almost certainly faster than free-list allocation, segregated-fits allocation can also be quite efficient with partial inlining and optimization. If the size class can be computed statically and a register holds the address of the array of free-list heads, allocation becomes:

  • Load the head of the corresponding free list; if it is null, call the slow path; otherwise load its next pointer and store that as the new list head.

In a multithreaded system the last store may need to be atomic, i.e., a compareAndSwap that retries on failure, as sketched below. Alternatively, each thread can be given its own free lists and reclaim into them independently.
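A sketch of the segregated-fits fast path using C11 atomics; freelists and gc_alloc_slow_class are assumptions of the sketch (a real implementation must also guard against ABA problems on a shared list):

    #include <stdatomic.h>

    typedef struct Cell { struct Cell *next; } Cell;
    extern _Atomic(Cell *) freelists[/* NUM_SIZE_CLASSES */];
    extern void *gc_alloc_slow_class(int size_class);

    static inline void *alloc_from_list(int sc) {
        Cell *head = atomic_load(&freelists[sc]);
        do {
            if (head == NULL)                    /* empty: take the slow path */
                return gc_alloc_slow_class(sc);
            /* on CAS failure, head is reloaded and we retry */
        } while (!atomic_compare_exchange_weak(&freelists[sc], &head, head->next));
        return head;
    }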

11.1.2 Zeroing

To be on the safe side, some systems require free memory to be set to a specified value, usually zero, or some other special value (usually for debugging). Systems providing only minimal allocation support (such as C) usually do not do this, or do it only in debug mode.

Systems with strong allocation guarantees (such as functional languages with complete initialization) generally do not need to zero free memory.

Even so, setting free memory to a known value helps with system debugging. Java is a typical language that requires zeroed memory.


When should the system zero memory, and how?

We could zero each object as it is allocated, but experience shows it is more efficient to zero a large region at once.

Zeroing with explicit memory writes can cause many cache misses, and on some hardware architectures bulk zeroing can also stall reads, since reads must wait in the hardware write buffer until all the zeroing writes complete.

Some ML implementations, as well as Sun's HotSpot Java virtual machine, prefetch precisely ahead of the bump pointer (during sequential allocation) to hide the latency of bringing newly allocated memory into the cache, but modern processors can often detect this access pattern and prefetch in hardware.


How to zero

Diwan et al. found that a write-allocate cache that can allocate on a per-word basis gives the best performance, but in practice this conclusion does not always hold.

From the allocator's point of view, the best way to zero a whole block of memory is usually to call a zeroing function from the runtime library, such as bzero.

extern void bzero(void *s, size_t n);
Parameters: s is the start address of the region to zero; n is the number of bytes to zero.

These functions are usually highly optimized for the particular system, and may even use special instructions that zero cache lines directly without writing to memory, such as the dcbz (Data Cache Block Zero) instruction on PowerPC. Using such instructions directly is hard for developers because the cache-line size is a parameter tied closely to the processor architecture. In any case, zeroing large blocks aligned to a power of two usually achieves the best performance.

Another zeroing technique is to use the demand-zero pages of the virtual memory system.

This technique is usually better suited to program startup. To use it at run time, the developer must explicitly remap the pages to be zeroed, and they are then zeroed on the next page-in. Because the operations involved are relatively expensive, performance may be worse than calling a library zeroing function directly; only when the pages to zero are numerous and contiguous can the overhead be amortized effectively enough for the technique to show an advantage.

Trap: an event that interrupts the normal operation of the CPU and forces it to run special handling code. Examples:

  • System calls
  • Exceptions
  • Interrupts

When to zero

  1. We can zero immediately after a collection completes, but the obvious disadvantage is that this lengthens the collection pause, and it may touch a large amount of memory long before it is used; the zeroed data will likely be written back from cache to memory and then reloaded into cache during allocation.

  2. Intuition might suggest that the best time to zero is shortly before the memory is allocated, so the processor can prefetch it into the cache before the allocator touches it; the problem is that even memory not far ahead of the bump pointer can easily be evicted back to main memory.

  3. On modern processors it is hard to say how effective the prefetching technique described by Appel is; at the least it needs careful tuning to pick an appropriate prefetch distance. In a debugging environment, zeroing freed memory (or writing a special value into it) should happen as soon as the cell is freed, so that errors can be caught over the widest possible window.

11.2 Pointer Lookup

The collector performs pointer lookup to determine object reachability. Certain collection algorithms require precise knowledge of every pointer in the program; this is especially true of moving collectors: if an object is moved from address x to a new address x', every pointer to x must be updated to point to x'.

The prerequisite for safely reclaiming an object is that the program will never access it again; the converse is not true: retaining objects the program no longer uses poses no safety problem, although it lowers space utilization (and, admittedly, if the program cannot obtain heap memory it may crash).

Therefore, the collector may conservatively treat all references as pointing to immovable objects, but it must not move objects it cannot prove are safe to move. The basic reference-counting algorithm is conservative in this sense. Another reason for conservative collection is that the collector lacks precise pointer information, so it may treat a non-pointer value as a pointer, especially when the value looks like a reference to an object.

11.2.1 Conservative pointer lookup

The basis of conservative pointer lookup is to treat every suitably aligned, pointer-sized byte sequence as a possible pointer value, that is, an ambiguous pointer.

The collector knows the set of memory regions that make up the heap, and may even know which parts of those regions are allocated, so it can quickly exclude values that cannot possibly be pointers.

To keep collection fast, validating pointers must be very efficient. The process usually involves two stages.

  1. The collector first filters out values that do not point into the heap at all. If the heap is one large contiguous block, a simple address-range check suffices. Otherwise, the high bits of the ambiguous pointer can be used to compute a block number for lookup in an index table of heap memory blocks.

  2. The collector must then determine whether the address the ambiguous pointer refers to is actually allocated; this can be done with a bitmap recording the allocated memory granules.


For example, the Boehm-Demers-Weiser conservative collector uses a block-structured heap in which each memory block holds cells of a single size. The cell size is kept in metadata associated with the block, while each cell's state (allocated or free) is recorded in a bitmap.

Given an ambiguous pointer, the collector first checks it against the heap bounds, then checks whether the block it refers to is allocated; if so, it further checks whether the cell it points into is allocated.

Only when this final check succeeds may the collector mark the target object of the ambiguous pointer. Figure 11.1 shows the whole filtering process; each check takes about 30 RISC instructions.
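A sketch of this two-stage filter under assumed block metadata (the block size, granule size, and the helper block_meta are all illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_SHIFT 12  /* 4 KiB blocks (assumed) */

    typedef struct {
        bool     allocated;   /* is this block in use at all?        */
        uint32_t cell_size;   /* all cells in a block share one size */
        uint8_t  alloc_bits[1 << (BLOCK_SHIFT - 7)]; /* 1 bit per 16-byte granule */
    } BlockMeta;

    extern uintptr_t heap_lo, heap_hi;
    extern BlockMeta *block_meta(uintptr_t addr);  /* index-table lookup */

    bool maybe_object(uintptr_t p) {
        if (p < heap_lo || p >= heap_hi)     /* stage 1: heap bounds      */
            return false;
        BlockMeta *m = block_meta(p);
        if (m == NULL || !m->allocated)      /* stage 1: block allocated? */
            return false;
        uintptr_t g = (p & ((1u << BLOCK_SHIFT) - 1)) / 16;
        return (m->alloc_bits[g / 8] >> (g % 8)) & 1; /* stage 2: cell allocated? */
    }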

Some programming languages require that a pointer refer to the first word of its target object, or to some standard offset within it (for example, past a few header words; see Figure 7.2). Under this rule the collector can ignore interior pointers and attend only to canonical pointers. Whether or not interior pointers must be supported, a conservative collector is comparatively simple to build; the Boehm-Demers-Weiser collector can be configured either way.

One detail needs attention when using a conservative collector with C:

  • C allows a pointer to refer to the first element past the end of an array. The conservative collector must then either retain both objects, or allocate one extra word for the array to remove the ambiguity.

An explicit-deallocation system can solve this by inserting an extra header between objects. Compiler optimizations may also "destroy" pointers, leading the collector to misjudge.

Certain non-pointer values may cause the collector to retain an object that is not actually reachable, so Boehm designed a blacklisting mechanism that avoids allocating heap objects at virtual addresses "pointed to" by such non-pointer values.

Specifically, if the collector finds that an ambiguous pointer refers to an unallocated memory block, it can add that block to the blacklist, but it must then guarantee never to allocate within it; otherwise a later trace might mistake the fake pointer for a real one.


The collector also supports allocating objects that contain no pointers (such as bitmaps) in dedicated memory blocks. This distinction not only improves collection efficiency (such objects need not have their contents scanned) but also avoids the expensive overhead of blacklist queries (the data in a bitmap can never be mistaken for pointers).

The collector can further distinguish whether an invalid pointer might be an interior pointer, and refine the blacklist accordingly (if interior pointers are not allowed, none can occur in the heap):

  • When interior pointers are allowed, the memory blocks recorded in the blacklist must never be used.
  • When interior pointers are not allowed, blacklisted blocks may still be used to allocate small objects that contain no pointers (this usually does not waste much space).

Before the mutator performs its first heap allocation, the collector runs a collection to initialize the blacklist. The allocator also usually avoids blocks whose addresses end in many zero bits, because non-pointer data on the stack is particularly likely to "refer to" such addresses.


11.2.2 Accurate pointer lookup with tagged values

Some systems (particularly dynamically typed ones) attach a special tag to every value to indicate its type. There are two basic tagging strategies:

  • bit stealing
  • big bags of pages

Bit stealing reserves one or more bits in every value (usually the highest or lowest bits of the word) and requires that objects which may contain pointers be laid out in a word-oriented fashion.

For example, on a byte-addressed machine with four-byte words, if every object must be word-aligned, the low two bits of a pointer are necessarily zero, so those two bits can serve as the tag. Other values can then represent integers; for instance, we may require that the lowest bit of every integer value be 1 and use the upper 31 bits for the integer itself (this does shrink the range of directly representable integers).

To keep the heap parsable (see Section 7.6), we can require that the first word of every heap object carry binary 10 in its low two bits.
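A sketch of such a low-bit tagging scheme on a 32-bit word (the exact encoding is illustrative, not the one in Table 11.1): pointers are word-aligned and end in 00, while integers carry a 1 in the lowest bit.

    #include <stdint.h>

    typedef uintptr_t value;

    static inline int      is_int(value v)      { return v & 1; }
    static inline int      is_ptr(value v)      { return (v & 3) == 0; }
    static inline value    from_int(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
    static inline intptr_t to_int(value v)      { return (intptr_t)v >> 1; }

    /* Tagged addition without untagging, in the spirit of the SPARC
       tagged-add instructions: (2a+1) + (2b+1) - 1 == 2(a+b) + 1.
       Real systems also check for overflow. */
    static inline value add_tagged(value a, value b) { return a + b - 1; }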


Table 11.1 shows a tag-encoding scheme similar to one actually used in Smalltalk.

Readers may question the efficiency of operating on tagged integers, but on modern pipelined processors this is rarely a problem: the latency of a single cache miss easily hides the overhead.

To support dynamically typed languages that use tagged integers, the SPARC architecture provides dedicated instructions that add and subtract tagged integers directly and detect overflow; some versions can even trap when the operation overflows or when the low two bits of an operand are non-zero.


On SPARC we can use the tag-encoding scheme shown in Table 11.2.

That scheme requires adjusting the references that pointers represent. In most cases the adjustment can be folded into an offset in load and store instructions, but array access is an exception:

  • When accessing an array element, the effective address must be computed from the array index plus this extra offset.

Real hardware support for tagged integers further illustrates the soundness of bit stealing:

  • The Motorola MC68000 once used such an encoding. It has a load instruction that forms an effective address from a base register, another register, and an immediate, so the encoding carried little extra cost on the MC68000.

The big-bags-of-pages scheme associates tag/type information with the memory block an object resides in, so the association is usually dynamic and requires an extra table lookup.

Its drawback is that fetching the tag/type information costs an extra memory load; its advantage is that integers and other primitive values keep their full native representation.

The scheme implies dedicated sets of memory blocks for integers, others for floating-point numbers, and so on. Since such pure values cannot change, allocating a new one may require a hash lookup to avoid creating a value that already exists.

11.2.3 Accurate pointer lookup in objects

Without tagged values, finding the pointers inside an object requires knowing the object's type (at least which of its fields are pointers).

In object-oriented languages (more precisely, languages with dynamic method dispatch), a pointer to an object does not fully determine the object's runtime type, so we must associate the type information with the object itself, usually via a header field pointing to the type information.

Object-oriented languages typically generate a method dispatch vector for each type and store a pointer to it in the object's header; the language can then keep the object's type information in the dispatch vector, or somewhere reachable from it.

In this way, the collector and any other runtime modules that rely on object type information (such as Java's reflection mechanism) can obtain it quickly.

What the collector needs is a table giving the locations of the pointer fields inside an object. Two implementations are common:

  1. A bit vector, similar to a mark bitmap
  2. A vector recording the offsets of the pointer fields within the object
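A sketch of tracing with the offset-vector representation; TypeInfo, type_of, and trace_edge are assumptions of the sketch:

    #include <stddef.h>

    typedef struct {
        int    num_ptr_fields;
        size_t ptr_offsets[];  /* byte offsets of the pointer fields */
    } TypeInfo;

    extern TypeInfo *type_of(void *obj);
    extern void trace_edge(void **slot);  /* mark/copy referent, update slot */

    void scan_object(void *obj) {
        TypeInfo *t = type_of(obj);
        for (int i = 0; i < t->num_ptr_fields; i++)
            trace_edge((void **)((char *)obj + t->ptr_offsets[i]));
    }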

Huang et al. obtained different tracing orders by adjusting the order of entries in the offset vector; a copying collector can then lay surviving objects out in correspondingly different orders, improving cache performance.

This adjustment needs to be done carefully if performed at run time (as opposed to in a fully static collector). Partitioning objects that contain pointers from objects that do not is, in some respects, a simpler pointer-identification method than table lookup.


This strategy works out of the box in some languages and system designs, but runs into problems in others. In ML, for example, objects can be polymorphic.

Suppose a field is a pointer in some instances and a non-pointer value in others: if the system generates a single piece of code to cover all the polymorphic cases, that code cannot distinguish the two situations at all.

In an object-oriented system that lets derived classes reuse base-class code, the subclass's fields are laid out after all the base-class fields, which inevitably mixes pointer and non-pointer fields.

One solution is to lay the two kinds of field out in opposite directions:

  • Pointer fields are laid out at negative offsets and non-pointer fields at positive offsets; this is called bidirectional object layout.

On a byte-addressed machine with word-aligned objects, we can set the lowest bit of the first header word to 1; word alignment guarantees that the low two bits of every pointer field are zero, which preserves heap parsability. In practice the flat (unidirectional) layout is usually not a problem anyway.
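A sketch of why the bidirectional layout simplifies scanning: all pointer fields lie at consecutive negative offsets, so a single count suffices and no per-field map is needed (the header encoding here is assumed):

    typedef unsigned long word;

    typedef struct {
        int num_ptrs;     /* pointer fields at offsets -1 .. -num_ptrs     */
        int num_nonptrs;  /* non-pointer fields at offsets +1 and upwards  */
    } BiHeader;  /* assumed reachable from the object's header word */

    extern BiHeader *bi_header(word *obj);
    extern void trace_edge(void **slot);

    void scan_bidirectional(word *obj) {
        BiHeader *h = bi_header(obj);
        for (int i = 1; i <= h->num_ptrs; i++)
            trace_edge((void **)(obj - i));  /* pointers only; scalars skipped */
    }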


Some systems generate object-oriented-style code for each type that performs tracing, copying, and so on. We can view the table-lookup approach as a type interpreter, and the per-type code approach as the corresponding compiled code.

Thomas proposed a particularly valuable idea along these lines: when copying a closure, one can install a copy function specialized to the closure's environment, which avoids copying environment entries that the particular function does not use.

This not only saves space when copying environments but, more importantly, avoids copying parts of the environment that are no longer used.


In managed languages, the indirect-call machinery of object-oriented method dispatch can be used to implement special collection-related operations. In the copying collector of Cheadle et al., a read barrier "self-erases" by dynamically changing an object's function pointer, similar to the technique Cheadle et al. applied in the Glasgow Haskell Compiler (GHC).

That system implements multiple versions of the stack barrier with a similar technique, and also builds on it a generational write barrier used when updating thunks (values still to be computed). One advantage of a system that can update closure environments is that it can shrink existing objects; to preserve heap parsability, the system must insert a dummy object into the gap left behind.

Correspondingly, the system may also need to grow an object; it then overwrites the original object with an indirection object holding a pointer to the enlarged object, and later collections can short-circuit the indirection. The collector can also do extra work on behalf of the mutator, such as evaluating "well-known" functions whose arguments are already computed; a function returning the first element of a list is one example.

In principle, statically typed languages can omit object headers and save space. Appel and Goldberg describe how to achieve this for ML; in their scheme the collector needs type information only for the roots (since tracing must start somewhere).

11.2.4 Exact Pointer Lookup in the Global Root

Accurate pointer lookup among the global roots is relatively simple, and most of the techniques for finding pointers in objects carry over.

With respect to global roots, the main difference between languages is whether the set of global roots can grow dynamically; dynamic code loading is one cause of such growth.

Some systems start from a basic collection of objects. Some Lisp and some Java systems start (especially in interactive environments) from a basic system "image", also called a boot image, containing numerous classes/functions and their object instances.

During execution the boot image may be partly modified, so boot objects can come to reference objects created at run time; the collector must then treat the fields of these boot objects as roots of the program.

Boot objects can themselves become garbage while the program runs, so occasionally tracing the boot image to find its unreachable objects is also worthwhile. Whether the boot image deserves such attention usually depends on whether a generational strategy is in use; the boot image can then be treated as a special old generation.

11.2.5 Exact Pointer Lookups in Stacks and Registers

One way to find exact pointers in the stack is to allocate activation records on the heap, as suggested by Appel and also demonstrated by Miller and Rozas.

Some language implementations manage stack frames the same way they manage the heap, killing two birds with one stone; the Glasgow Haskell Compiler and Non-Stop Haskell do this. Language implementers can also give the collector explicit guidance about stack contents: Henderson handles compiler-generated C code this way in the Mercury language, and Baker et al. use a similar technique.

See also: what is a stack frame.

However, for efficiency, most language implementations treat stack frames specially to obtain the best runtime performance. The collector implementer must then consider three questions:

  1. How to find the frames (activation records) in the stack

  2. How to find the pointers within a frame

  3. How to handle the calling convention: parameter passing, return values, and the saving and restoring of register values

In most systems the collector is not the only component that must find frames in the stack: exception handling and recovery mechanisms also need to "parse" the stack, to say nothing of the essential stack-inspection facilities of a debugging environment. Some systems (such as Smalltalk) require stack parsability outright.

From the developer's point of view the stack itself looks simple, but behind that simple appearance real stacks are highly optimized, and frame layouts are usually rather raw.

Since stack parsability is so often useful, frame-layout conventions usually need to support it.

For example, many stack designs keep in each frame a dynamic-chain pointer to the previous frame, while the other fields sit at fixed offsets within the frame (offsets relative to the frame pointer or the dynamic-chain pointer).

Many systems also keep a table mapping a function's return addresses to the function containing them. In systems without garbage collection this table is usually just part of the debug information, but many managed systems must consult it at run time, so the table must be a proper part of the program (loaded at startup or generated after startup) rather than mere auxiliary debug data.


To let the collector find the pointers in a frame precisely, the system may need to add explicit stack map information for each stack.

This metadata can be a bitmap whose bits record which slots of the frame contain pointers. Alternatively, the system can divide each frame into a pointer region and a non-pointer region, recording just the two sizes.
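A sketch of the bitmap variant, keyed by return address; the structures and lookup function are assumptions of the sketch (a 32-bit bitmap limits frames to 32 slots here):

    #include <stdint.h>

    typedef struct {
        uintptr_t ret_addr;    /* call site identifying the frame's state */
        uint16_t  num_slots;   /* frame size in slots                     */
        uint32_t  ptr_bitmap;  /* bit i set => slot i contains a pointer  */
    } StackMapEntry;

    extern StackMapEntry *lookup_stack_map(uintptr_t ret_addr);
    extern void trace_edge(void **slot);

    void scan_frame(void **frame_base, uintptr_t ret_addr) {
        StackMapEntry *m = lookup_stack_map(ret_addr);
        for (int i = 0; i < m->num_slots; i++)
            if (m->ptr_bitmap & (1u << i))
                trace_edge(&frame_base[i]);
    }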

Note that:

  • When a stack frame exists but is not yet fully initialized, the function may need to insert extra initialization instructions; otherwise a stack scan in this state can run into trouble.

  • The frame-initialization code may need careful analysis from the collector's standpoint, and push instructions (if the machine has them) or other special ways of growing the stack must be used with care.

Of course, if the compiler treats each slot of a frame as fixedly pointer or fixedly non-pointer, frame scanning becomes very simple: all functions can share a single map.

However, a single stack map is usually infeasible; with it, at least two language features cannot be implemented:

  • Generic/polymorphic functions
  • The Java virtual machine's jsr instruction

jsr: jump to the given 16-bit offset, pushing the address of the instruction following the jsr onto the stack.

As mentioned earlier, a polymorphic function may use one piece of code to handle both pointer and non-pointer values; a single map cannot distinguish the two cases, so the system needs extra information.

Although the caller of a polymorphic function may "know" the concrete types at the call, the caller may itself be polymorphic and so must obtain this information from its own caller. In the worst case, type information must be propagated all the way down from main(), much like identifying object types by tracing from the roots.


The Java virtual machine uses the jsr instruction for local calls; jsr creates no new frame, and the called code accesses the current frame's local variables in the caller's role.

Java implements the try-finally feature with this instruction: on both the normal and the exceptional path, the finally block is entered via jsr. The problem is that during a jsr call the types of some local variables are ambiguous: a local variable's type may depend on which jsr call site entered the finally block.

A variable that is unused inside the finally block but used afterwards may hold a pointer on the normal path and a non-pointer on the exceptional path.

There are two strategies for dealing with this.

  1. Rely on the jsr call sites to disambiguate. The slot categories in the stack map can then no longer be just pointer/non-pointer (one bit each); a third category, "ask the jsr caller", is needed, and we must find the return address of the jsr call, which requires analysing the Java bytecode.

  2. Simply duplicate the finally block, either transforming the bytecode or compiling the code dynamically; this solution is the more widely used in modern systems. Although in the worst case it can increase code size exponentially, it does simplify the system's handling of finally blocks. There is said to be evidence that generating stack maps for dynamically compiled code is an important source of obscure bugs, so controlling complexity may matter more here. Some systems defer generating a stack map until the collector actually needs it; this saves time and space on the normal execution path but may lengthen collection pauses.


Another problem with choosing a single stack map:

  • It further constrains register allocation: each register could hold only pointers, or only non-pointers, fixedly.

This factor alone makes the single-map scheme unsuitable for machines with few registers.

Note that whether we create one stack map per function or different maps for different parts of a function, the compiler must guarantee that slot type information is available even at the deepest call. If the importance of this requirement is recognized before the compiler is built, it is not especially hard to meet; retrofitting it onto an existing compiler is quite difficult.

Pointer lookup in registers.

So far we have ignored pointers held in registers. Finding them is considerably more complicated than finding pointers in the stack, for several reasons:

  • As mentioned, for a given function the compiler can treat each slot of the stack frame as fixedly pointer or non-pointer, but this scheme does not simply carry over to registers, or at least carries larger limitations:

    • It would partition the registers into two special subsets, one holding only pointers and the other only non-pointers, so it may suit only machines with many registers. In most systems each function therefore ends up with several register maps.
  • Even if we can ensure that global roots, heap objects, and local variables never contain interior or derived pointers, highly optimized native code sequences can still leave such "deformed" pointers in registers.

  • The function calling convention imposes requirements:

    • Some registers follow a caller-save convention: if the caller wants a register's value after the call completes, it must save the value before making the call.
    • Some registers follow a callee-save convention: the callee must save a register's value before using the register, and must restore it when done.
    • Caller-save registers pose little difficulty for pointer lookup, because the caller necessarily knows what kind of data it saved. But what kind of value a callee-save register holds is known only to some caller further up the chain (if any). The callee therefore cannot tell whether an as-yet-unsaved callee-save register holds a pointer; even after saving such a register into a frame slot, it cannot tell whether that slot holds a pointer.

Many systems reconstruct stack frames and the call chain through a stack-unwinding mechanism, especially systems without a dedicated "last frame" register.

Stack unwinding: if an exception thrown inside a function is not caught within that function, execution of the function ends at the throw point and all local variables allocated on the stack are freed.


One strategy for the pointer-lookup problem in callee-save registers works as follows. First, add a piece of metadata to each function recording which callee-save registers the function saves and into which frame slot each register's value goes.

We assume the most common convention: a function saves, in its prologue, the callee-save registers it may use. If the compiler is sophisticated enough that different code regions within one function use registers in different ways, it must emit callee-save information for each region.

  1. Starting from the top frame, as we reconstruct the registers we first restore the callee-save registers, obtaining their state at the moment the caller executed the call.

  2. To keep the unwinding smooth, we record which registers were restored and the values obtained by the restoration.

  3. On reaching the function at the bottom of the call stack, all callee-save registers can be ignored (it has no caller); at that point the pointers in all registers are known, and the collector can use this information and update the values.

Unwinding restores the callee-save registers; note that when the collector updates a pointer it must also update the saved copy of the register. Wherever a function saved a callee-save register, the register's original value is obtained from the side table, and the collector updates that value if necessary.

When subsequently processing the caller, we must avoid reprocessing callee-save registers that have already been handled. In some collectors reprocessing a root has no ill effect (mark-sweep, for instance), but in a copying collector any reference that has not yet been forwarded is assumed to point into fromspace, so processing the same root twice (as opposed to two roots referring to the same object) could copy the object a second time in tospace.

Algorithm 11.1 describes this processing in detail; Figure 11.2 shows a concrete example.

In Algorithm 11.1, func is the function the collector uses to scan frames and registers. It may be the body of the for-each loop in markFromRoots of Algorithm 2.2 (mark-sweep collection), or the root-scanning loop of collect in Algorithm 4.2 (copying collection). A much-simplified sketch of the recursion follows.
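This sketch compresses Algorithm 11.1's bookkeeping into a few assumed helpers (Frame, caller_of, and the save/restore functions); it shows only the unwind-then-scan structure and the Done set that prevents double processing:

    #define NUM_REGS 32  /* assumed register count */

    typedef struct Frame Frame;
    extern Frame *caller_of(Frame *f);                 /* NULL below main() */
    extern void restore_callee_saves(Frame *f, void **regs, void **restore);
    extern void re_save_callee_saves(Frame *f, void **regs, void **restore);
    extern void scan_frame_and_regs(Frame *f, void **regs, unsigned *done);

    static void process_frames(Frame *f, void **regs, void **restore,
                               unsigned *done) {
        if (f == NULL) return;
        /* Unwind: recover the registers' state at the caller's call site. */
        restore_callee_saves(f, regs, restore);
        process_frames(caller_of(f), regs, restore, done);
        /* On the way back down: re-establish f's own register state,
           picking up any values the collector updated while processing
           callers, then scan f's slots and registers. `done` is a bitmask
           of registers already traced, so a copying collector never
           processes the same root twice. */
        re_save_callee_saves(f, regs, restore);
        scan_frame_and_regs(f, regs, done);
    }

    void process_stack(Frame *top, void **regs) {
        void *restore[NUM_REGS] = { 0 };  /* values displaced by unwinding */
        unsigned done = 0;
        process_frames(top, regs, restore, &done);
        /* regs now holds the updated register values to write back into
           the suspended thread's state. */
    }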

[Algorithm 11.1 and Figure 11.2 appear here in the original text.]

Consider first the call stack shown in Figure 11.2a (the shaded boxes on the right), which was built as follows.

Execution starts in main(); initially register r1 holds 155 and r2 holds 784.

For efficiency, the caller of main() is assumed to lie outside the entire garbage-collected system, so its frame cannot refer to any heap object and its register values cannot be pointers. Likewise we need not care about main()'s return address oldIP.

The operations performed by the main() function are as follows:

  • saves r1 into slot 1
  • sets slot 2 to a pointer to object p
  • sets slot 3 to 75
  • main() then calls f(); before the call r1 holds p and r2 holds 784, and the return address is main()+52
  • f() first saves the return address
  • then stores r2 into slot 1
  • stores r1 into slot 2
  • sets slot 3 to -13
  • sets slot 4 to a pointer to object q
  • f() then calls g(); before the call r1 holds a pointer to r and r2 holds 17, and the return address is f()+178
  • g() saves the return address
  • stores r2 into slot 1
  • sets slot 2 to a reference to object r
  • sets slot 3 to -7
  • sets slot 4 to a pointer to object s

In Figure 11.2a each thick box represents a function's frame; the register values above a box show the registers' state when the function begins executing, and the values below the box show their state when the function makes its call. These register values must be restored during the subsequent stack unwinding.

Suppose the function g() triggers garbage collection during its execution.

The collection occurs at position g()+36 inside g(); at that point r1 holds a pointer to r and r2 a pointer to t. We assume the instruction pointer (IP) and the register values have already been saved in a data structure of the suspended thread, or in some frame of the collection process itself.

At some point the collector calls the processStack function on the thread's stack, with parameter func being the collector's frame-and-register scanning function. For a copying collector, func is the copy function; since target objects move, the collector must update the references in the stack and in the registers.

The boxes on the left of Figure 11.2a show how the variables Regs and Restore change during processing; the collector works in the order g(), f(), main(). The snapshots of Regs and Restore are numbered on the left, in the same order as the steps described below.

processStack writes the current register values from the thread state into Regs and initializes Restore to empty.

Execution reaches line 15 of Algorithm 11.1, processing the frame of g().

Execution reaches line 19, still processing g()'s frame. Regs has now been updated, and Restore records the modifications g() made to the registers before it triggered the collection.

Since g() saved the value of r2 into slot 1 at the start, we can infer that r2 held 17 when f() called g(). When g() triggered the collection r2 held t, and we record this information in Restore.

Before processing g() further, we recursively call processStack to handle its caller. In Figure 11.2a, the pair returned by the calleeSavedRegs function and the instruction pointer are recorded to the left of g()'s frame.

Execution reaches line 19 again, now processing f()'s frame; we restore the values of r1 and r2 from slot 2 and slot 1 respectively.

Execution reaches line 19 once more, for main()'s frame. Since main() effectively has no caller, no callee-save registers need to be restored.

More precisely, main()'s caller lies outside the entire garbage-collected system, and none of its registers can contain collector-relevant pointers.

Having reconstructed the register state from before main() called f(), we can process main()'s frame and registers; f() and g() are then processed in exactly the same way.

Next, Figure 11.2b shows the two states each frame reaches: one corresponding to line 35 of Algorithm 11.1, the other to the point after line 38.

Figure 11.2b reflects each frame's state at line 35 of the algorithm; bold values are updated values (even if some did not actually need updating), and grey values are not updated.


Regs records the register state before main() called f(), and the set Done is still empty.

func updates register r1 (because r1 belongs to the set pointerRegs at main()+52) and adds it to Done, recording that r1 has been updated to the new address of its referent (if it moved).

Regs now records the register state before f() called g(). Note that the values of r2 and r1 must be written back to slot 1 and slot 2 respectively, and their entries in Regs must be restored from Restore.

func updates r1 and adds it to Done.

Regs now records the register state before g() triggered the collection.

As in step 11, the collector writes the value of r2 back to slot 1, and r2's entry in Regs is restored from Restore. Since r1 was not restored from Restore, r1 remains in the set Done.

func skips register r1 (it is already in Done) but updates r2 and adds it to Done.

Finally, in step 15, processStack writes the register values in Regs back to the thread state.

Variants of Algorithm 11.1 and stack-map compression are omitted here.

11.2.6 Exact pointer lookup in code

References to heap objects may be embedded in program code, especially in managed runtime systems that allow code to be loaded or generated at run time. Even precompiled code may refer to static/global data that is allocated from the newly initialized heap at program startup.

Accurate pointer lookup in code presents the following difficulties:

  • It is often difficult, if not impossible, to distinguish embedded data from code.

  • In code produced by an "uncooperative" compiler, it is nearly impossible to distinguish non-pointer data from pointers that might refer to heap objects.

  • A pointer embedded in instructions may be split into several smaller pieces. On MIPS, loading a 32-bit static pointer into a register typically uses a load-upper-immediate instruction, which places a 16-bit immediate in the upper 16 bits of the register and clears the lower 16 bits, followed by an or-immediate instruction that fills in the lower 16 bits. Similar sequences occur on other instruction sets. Such a pointer value is a special kind of derived pointer (see Section 11.2.8); a sketch of patching one appears after this list.

  • An embedded pointer value may not point directly at its target object; see the discussions of interior pointers (Section 11.2.7) and derived pointers (Section 11.2.8).


  • In some cases embedded pointers can be found by disassembling the code, but disassembling all the code and processing these roots at every collection could be enormously expensive. Since the program does not modify embedded pointers, the collector can cache their locations for efficiency.

  • A more general solution is for the compiler to emit an additional table recording the positions of embedded pointers in the code.

  • Some systems avoid the problem by simply forbidding embedded pointers; the cost is that code performance may then vary across target architectures, compilation strategies, and access patterns.
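A sketch of reading and patching a pointer split across a MIPS-style lui/ori pair; the instruction layout assumed here is simplified (real MIPS encoding details differ), and the ori is assumed to immediately follow the lui:

    #include <stdint.h>

    static uintptr_t read_split_pointer(uint32_t *lui_addr) {
        uint32_t hi = *lui_addr & 0xFFFFu;        /* 16-bit immediate of lui */
        uint32_t lo = *(lui_addr + 1) & 0xFFFFu;  /* 16-bit immediate of ori */
        return ((uintptr_t)hi << 16) | lo;
    }

    static void write_split_pointer(uint32_t *lui_addr, uintptr_t p) {
        *lui_addr       = (*lui_addr       & ~0xFFFFu) | ((p >> 16) & 0xFFFFu);
        *(lui_addr + 1) = (*(lui_addr + 1) & ~0xFFFFu) | (p & 0xFFFFu);
        /* The collector must then flush the data-cache line and invalidate
           the corresponding instruction-cache line (see below). */
    }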

The case where the target object is movable.

If the target object of the embedded pointer is moved, the collector must update the embedded pointer.

  1. One difficulty in updating embedded pointers is that, for safety or security reasons, the program's code segment may be read-only, so the collector may have to change the protection of the code region temporarily (if that is possible at all), which can incur substantial system-call overhead. An alternative is to forbid embedded pointers from referring to movable objects.

  2. Another difficulty is that modifying code in memory usually does not invalidate or update copies of that code held in instruction caches, so it may be necessary to invalidate the affected instruction-cache lines on every affected processor.

On some machines the collector may also need to execute a special synchronization instruction after invalidating an instruction-cache line, to guarantee that future instruction fetches occur after the invalidation.

In addition, the collector may need to force the modified data-cache lines (which hold the code the collector changed) out to memory before invalidating the instruction-cache lines, using synchronization operations to ensure completion. The details depend on the specific hardware architecture.

The case where the code is movable.

A special case arises when the collector can move program code.

  1. Besides all the problems of movable target objects, the collector must now also fix up return addresses held in the stack and in registers, since it may have moved the code they point into.

  2. The collector must invalidate all instruction-cache lines associated with the code's new location, taking care to perform all the relevant operations listed above. A deeper problem: if even the collector's own code is movable, matters become more complicated still.

  3. Moving code in a concurrent collector is an extremely difficult task. The collector must either suspend all threads, or use a more elaborate protocol: first make both the old and the new copies of the code usable by threads, then migrate all threads to the new copy over some period, and finally reclaim the old copy only once it is certain that every thread has migrated.

11.2.7 Handling of interior pointers

An interior pointer is a pointer that refers to an address inside an object other than the object's standard reference address. More precisely, if we view an object as a set of memory addresses disjoint from those of all other objects, an interior pointer refers to some address in that set.

Looking back at Figure 7.2, an object's standard reference may differ from every one of its interior pointers. Moreover, the space an object actually occupies may exceed what its programmer-visible data requires. For example, C allows a pointer to refer just past the last element of an array, and that address is still a legal interior reference to the array.

In some systems a language-level object may consist of several non-contiguous memory segments; when discussing interior (and derived) pointers, "object" here means only a (language-level) object occupying a single contiguous block of memory.

The collector's main problem with an interior pointer is determining which object it points into, that is, deriving the target object's standard reference from the interior pointer's value. The options, one of which is sketched after this list, include:

  • Use a table recording the start address of each object.

    • The system can maintain the object start addresses in an array organized as a two-level map, similar to the strategy Tarditi uses to record GC points in code (see Section 11.2).
    • Another strategy is a bitmap with one bit per memory granule (allocation unit), in which the bit for the granule holding each object's first address is set. This scheme works with essentially any allocator and collector.
  • If the heap is parsable (see Section 7.6), the collector can scan it to determine which object contains the address the interior pointer refers to.

    • Searching from the start of the heap every time would be far too expensive, so the system typically records, for each k-byte block, the start address of the first (or last) object in it; for convenience and efficiency k is usually a power of two. With this information the collector can search within the memory block the interior pointer refers to, starting from the previous block if necessary. Extra tables add space overhead while heap parsing adds time overhead, and the collector must strike an appropriate balance between the two. (Compare the crossing maps described in Section 11.8.)
  • Under a big-bags-of-pages allocation strategy, the collector can obtain the object size from the metadata of the block the interior pointer refers to, compute the target address's offset within the block (AND the address with a suitable mask to get its low bits), and round the offset down to a multiple of the object size to obtain the object's first address.
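A sketch of the bitmap option: walk backwards from the granule containing the interior pointer to the nearest granule whose "object starts here" bit is set (heap_base, GRANULE, and start_bits are assumptions of the sketch):

    #include <stdint.h>
    #include <stddef.h>

    #define GRANULE 16  /* bytes per allocation granule (assumed) */

    extern uint8_t *heap_base;
    extern uint8_t  start_bits[];  /* bit i set => granule i starts an object */

    static int bit_is_set(size_t i) {
        return (start_bits[i / 8] >> (i % 8)) & 1;
    }

    /* Find the canonical object start for an interior pointer. Assumes the
       pointer really does fall inside some allocated object, so the scan
       always terminates at a set bit. */
    void *object_start(void *interior) {
        size_t i = ((uint8_t *)interior - heap_base) / GRANULE;
        while (!bit_is_set(i))
            i--;
        return heap_base + i * GRANULE;
    }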


We assume that for any internal pointer, the collector can calculate the canonical reference of its target object. When the target object of an internal pointer is moved (such as in a copy collector), the collector must update the internal pointer at the same time and ensure that the relative position of its target address in the new object is exactly the same as before the move. In addition, the system may also pin the object .

If internal pointers are allowed, the principal cost is the extra time and space needed to handle them. If internal pointers are few and can be distinguished from regular (tidy) pointers, that is, pointers that refer to an object's standard reference location, the time overhead of handling them need not be large.

Full support for internal pointers, however, may require additional tables (though particular collectors often already maintain some of the necessary tables or metadata), which adds space overhead, and maintaining those tables adds time overhead as well.

Return addresses in code are a special kind of internal pointer. Handling them poses no special difficulty, but for various reasons the table a collector uses to map a return address to its function is usually separate from the tables used for other objects.

11.2.8 Handling of derived pointers

Diwan et al. define derived pointers as:

  • A pointer obtained by performing an arithmetic operation on one or more pointers.

An internal pointer is a special case of a derived pointer, which can be expressed in the simple form of p + i or p + c, where p is a pointer, i is a dynamically calculated integer offset, and c is a static constant.

Since the address an internal pointer refers to must lie among the memory addresses covered by the object p, handling it is relatively simple; derived pointers, however, can take more general forms, for example:

  • upper_k(p) or lower_k(p), the high k bits or low k bits of the pointer p.
  • p ± c, where the computed address lies outside the object p.
  • p − q, the distance between two objects.

In some cases we can deduce the regular pointer (the pointer to the standard reference address) from the derived pointer, for example from p + c when c is a compile-time constant.

In general we must know the base expression from which a derived pointer was computed. That expression may itself be a derived pointer, but by tracing back far enough we can always find the regular pointer from which the derived pointer originates.


In a non-moving collector, the collector can simply treat the regular pointer as a root. Note, however, that at collection time the compiler's live-variable analysis may have deemed the target object's regular pointer dead even though the derived pointer is still live, so the compiler must keep at least one regular pointer alive for each derived pointer. The case p ± c is an exception: the collector can recover the corresponding regular pointer by adjusting the derived pointer by the compile-time constant, without relying on any other runtime data.


In a moving collector, handling derived pointers requires further compiler support:

  • The compiler must extend the stack map to record from which addresses each derived pointer was computed, and how to reconstruct it.

Diwan et al. give a general solution for derived pointers of the form $\sum_i p_i - \sum_j q_j + E$, where each $p_i$ and $q_j$ is a regular or derived pointer and E is an expression independent of the pointers (even if some $p_i$ or $q_j$ moves, E is unaffected).

Its processing flow is:

  • First subtract each $p_i$ from, and add each $q_j$ to, the derived pointer, yielding the value of E; then perform the move; finally compute the new derived pointer value from E and the moved $p'_i$ and $q'_j$.
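As a minimal illustration of this flow for the common single-base case derived = base + E, with E independent of object addresses (the record layout and names here are assumptions, not Diwan et al.'s actual data structures):

```c
#include <stdint.h>

/* One stack-map record for a derived pointer: where its base (regular)
   pointer and the derived pointer itself live in the frame. Illustrative. */
typedef struct {
    void **base_slot;      /* slot holding the regular pointer */
    void **derived_slot;   /* slot holding the derived pointer */
} DerivedRecord;

/* Before objects move: reduce the derived pointer to its base-independent
   part E and return it. */
intptr_t strip_derived(const DerivedRecord *r) {
    return (intptr_t)*r->derived_slot - (intptr_t)*r->base_slot;
}

/* After objects move (and *base_slot has been updated to the new address):
   reconstruct the derived pointer from the new base and the saved E. */
void rebuild_derived(const DerivedRecord *r, intptr_t e) {
    *r->derived_slot = (char *)*r->base_slot + e;
}
```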

Diwan et al. point out that compiler optimizations can create additional problems for derived-pointer handling:

  • dead base variables
  • multiple derived pointers mapped to the same code location (forcing the collector to involve more variables when processing one derived pointer)
  • indirect references (the variable's value is recorded somewhere in the middle of a reference chain)
  • and so on

To support derived pointers a compiler must sometimes optimize less aggressively, but the impact is usually minor.

11.3 Object table

For mutator performance and space-overhead reasons, many systems represent references as direct pointers to objects. A more general solution assigns each object a unique identifier and locates its data through some mapping mechanism.

This technique is attractive when objects are large and potentially persistent but the underlying hardware address space is relatively small. In this section we restrict attention to heaps that fit within the address space.

Beyond that scenario, object tables are useful in many other systems as well.

An object table is usually a dense array where each entry refers to an object. The object table can contain only pointers to object data, or it can contain other additional state information.

For speed, an object's reference is usually either its index in the object table or a pointer to its table entry. With indices, compacting the object table is easy for the collector, but every object access must first obtain the table's base address and then apply the offset; if a dedicated register holds the table's base address, this costs no extra instructions.
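A minimal sketch of this representation, assuming index-style references and a statically sized table (all names and sizes are illustrative):

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t Ref;                     /* a reference is a table index */

typedef struct {
    void    *data;                        /* current address of the object's data */
    uint32_t mark;                        /* mark bit kept in the table entry */
} TableEntry;

static TableEntry object_table[1 << 20];  /* table size is illustrative */

/* Every access indirects through the table (base ideally in a register). */
static inline void *deref(Ref r) { return object_table[r].data; }

/* Compaction: move the data, then update the single table entry. */
static void move_object(Ref r, void *new_addr, size_t size) {
    memcpy(new_addr, object_table[r].data, size);
    object_table[r].data = new_addr;
}
```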

Notable advantages:

  • It simplifies compaction of the heap: to move an object, the collector simply moves the data and updates the object's single entry in the object table.

To make this easy, each object should carry a hidden self-referencing field (or a pointer to its table entry), so that the collector can quickly find an object's table entry from its data.

With this in place, a mark-compact collector can mark in the traditional way (indirecting through the object table) and then simply "squeeze out" garbage objects, sliding the object data together. Free table entries can be kept on a free-list.

Note that keeping an object's mark bit in its table entry is more efficient, saving a memory access when the bit is tested or set. A separate mark bitmap has similar advantages. Other metadata, such as references to the object's type and size information, can also be placed in the table.

The object table itself can also be compacted, for instance with the two-finger algorithm described in Section 3.1. It can even be compacted in the same pass that compacts the object data, so that a single traversal handles both.

Object tables can be problematic, or even unworkable, if the programming language allows internal or derived pointers. Likewise they complicate references into the heap from external code, a problem we discuss in Section 11.4.


If a language forbids internal pointers, whether references go through an object table has no semantic effect on the implementation. One language feature, however, more or less depends on an object table for an efficient implementation: Smalltalk's become: primitive, which swaps the identities of two objects. With an object table it is trivial: the mutator merely swaps the two table entries. Without one, become: may require a scan of the entire heap. Even so, sparing use of become: (Smalltalk typically uses it to install a new version of an object) can be acceptable without an object table; after all, direct references are more efficient than object-table indirection in most cases.
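Reusing the hypothetical object-table sketch above, become: then amounts to an entry swap:

```c
/* Swap the identities of two objects: every Ref that named a now reaches b's
   old data and vice versa, with no heap scan (types from the sketch above). */
void become(Ref a, Ref b) {
    TableEntry tmp  = object_table[a];
    object_table[a] = object_table[b];
    object_table[b] = tmp;
}
```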

11.4 References from external code

Some languages or systems allow code outside the managed environment to use objects allocated in the heap. A typical example is the Java Native Interface, which lets code written in C, C++, or other languages access objects in the Java heap. More generally, almost every system must support input/output, which almost certainly requires passing data between the operating system and the heap.

If external code and data may refer to objects in the managed heap, two difficulties arise.

  1. If an object is reachable from external code, how does the collector treat it as live and ensure it is not reclaimed until the external access is over?

Usually this need only hold for the duration of the external call, so we can keep a live reference to the object on the stack of the thread that made the call. But some managed objects may be used by external code for a long time, beyond the lifetime of the function that initiated the call.

For this reason the collector usually maintains a table of registered objects. If external code needs an object beyond the current call, it must register the object, and it must explicitly unregister it once the object is no longer needed and will not be used again. The collector simply treats the references in the registered-object table as additional roots.
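In the Java Native Interface this registration is exposed as global references; the sketch below shows the usual pattern (the class and function names are illustrative, the JNI calls themselves are real):

```c
#include <jni.h>

static jobject cached;   /* must not be a local reference: those die with the call */

/* Register: promote the argument to a global reference so the collector
   treats it as a root until we delete it. */
JNIEXPORT void JNICALL Java_Example_retain(JNIEnv *env, jobject self, jobject obj) {
    cached = (*env)->NewGlobalRef(env, obj);
}

/* Unregister: the object becomes collectable again (if otherwise unreachable). */
JNIEXPORT void JNICALL Java_Example_release(JNIEnv *env, jobject self) {
    (*env)->DeleteGlobalRef(env, cached);
    cached = NULL;
}
```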

  2. How does external code determine the address of an object (a problem that arises only with moving collectors)?

Some interfaces isolate objects from external code, which may access the heap only through channels the collector provides; such interfaces suit moving collectors well. The collector typically wraps a pointer in a handle before handing it to external code. The handle contains the real reference to the heap object, possibly along with other managed data; it is at once an entry in the registered-object table and a root for collection. The Java Native Interface implements external calls in this way. Note how similar handles are to object-table entries.

Handles thus bridge the managed heap and the unmanaged world while accommodating moving collectors, but not every external access can follow this protocol, operating-system calls in particular.

In such cases the collector must avoid moving objects referenced by external code. To this end it may need to provide a pinning interface, with pin and unpin operations.

While an object is pinned, the collector will not move it; pinning also implies that the object is treated as reachable and will not be reclaimed.


If we know at allocation time that an object may need pinning, we can allocate it directly in a non-moving space; file-stream I/O buffers are often allocated this way. But programs usually cannot predict which objects will need pinning, so some languages expose pin and unpin functions that let developers pin and unpin arbitrary objects, along the lines sketched below.
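A hypothetical shape for such an interface, with a pin count in the object header so that pins nest (the header layout and names are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pin_count;    /* > 0: collector must neither move nor reclaim */
    /* ... other header fields ... */
} ObjectHeader;

void pin(ObjectHeader *o)   { o->pin_count++; }
void unpin(ObjectHeader *o) { o->pin_count--; }

/* Collector-side test while deciding whether an object may move. */
bool is_pinned(const ObjectHeader *o) { return o->pin_count != 0; }
```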

Pinning is no problem for a non-moving collector, but it causes some inconvenience for moving collectors. There are several solutions, each with its own advantages and disadvantages.

Defer collection, at least for regions containing pinned objects. This is simple to implement but risks running out of memory before objects are unpinned.

If the application asks to pin an object that currently lies in a movable region, we can immediately collect that region (together with any regions that must be collected with it) and move the object into a non-moving region.

This strategy is suitable for scenarios where pinning operations are infrequent, and it is also suitable for collectors (such as generational collectors) that promote surviving objects in the new generation to non-moving mature spaces.

Extend the collector so that pinned objects are simply not moved during collection; this increases the collector's complexity and may introduce new efficiency problems.

Let us take a basic non-generational copying collector as an example of extending a moving collector to support pinned objects.

To achieve this goal, the collector must first be able to distinguish pinned objects from unpinned objects.

  • The collector can still copy and forward unpinned objects

  • For pinned objects, the collector must trace and update any pointers they contain to moved objects, but cannot move the objects themselves; it must also record the pinned objects it encounters.

  • When all surviving objects have been copied, the collector cannot simply free the entire source space; it can free only the gaps between pinned objects.

Collection then yields not a single contiguous free region but a set of smaller, discontiguous ones; the allocator can use each as a separate sequential-allocation buffer.

Pinned objects inevitably cause memory fragmentation, but the fragmentation can be repaired by future collections once the objects are unpinned. As we saw in Section 10.3, some mostly non-moving collectors employ a similar scheme of sequential allocation in the gaps between surviving objects.


Another difficulty that pinning introduces for a moving collector is that:

  • even though a pinned object is not moved, the collector must still scan and update it, and external code may be accessing the object at the same time, creating a race condition.

Consequently the collector may need to pin not only objects directly referenced by external code but also other objects they refer to. Similarly, if the external code traverses from an object to other objects, or merely tests or copies an object's reference without caring about its internal data, the collector must take those objects into account when deciding what to pin.

The programming language itself, or its particular implementation, may also rely on a pinning mechanism.

For example, if the language allows an object's fields to be passed by reference, references to fields inside objects may appear on the stack. We could then use the internal-pointer techniques of Section 11.2.7 to move an object containing a referenced field, but such techniques are usually complex to implement, and code that handles internal pointers correctly can be hard to maintain.

Some language implementations therefore simply pin such objects, which requires the collector to determine simply and efficiently which objects contain fields directly referenced by other objects (or by roots).

This approach easily solves the internal-pointer problem, but it does not extend to the more general problem of derived pointers (see Section 11.2.8).

11.5 Stack barriers

A collector can scan stacks incrementally, but it can also use the stack barrier technique for mostly-concurrent scanning. The basic idea is to hijack a thread at the point where it returns (or unwinds, after throwing an exception) into a particular frame.

Suppose we place a barrier on stack frame F. The collector can then asynchronously process F's caller, that caller's caller, and so on, certain that while the asynchronous scan proceeds the thread will not pop the stack back into frame F.

The key step in installing a stack barrier is hijacking the frame's return address: the return address saved in the frame is overwritten with the entry point of a stack-barrier handler, while the original return address is stored at a standard location the handler can access, such as thread-local storage (see the sketch below). The handler can remove the stack barrier at an appropriate time, and it must take care not to disturb any registers of the frames above it.
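A sketch of the installation step, assuming we already have the address of the frame's saved-return-address slot (the frame layout and names are illustrative):

```c
/* Trampoline the hijacked return will enter; it scans frames, possibly
   reinstalls the barrier higher up, then jumps to the saved return address.
   In practice it is written in assembly so as not to disturb caller registers. */
extern void stack_barrier_handler(void);

static _Thread_local void *saved_return_address;  /* standard, known location */

void install_stack_barrier(void **ret_addr_slot) {
    saved_return_address = *ret_addr_slot;            /* keep the original */
    *ret_addr_slot = (void *)stack_barrier_handler;   /* hijack the return */
}
```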

  • Synchronous incremental scanning: when a mutator thread traps into the stack-barrier handler, it scans some number of frames upward and installs a new barrier where the scan stops (unless the handler has finished scanning the whole stack).

  • Asynchronous incremental scanning is performed by a collector thread; here the barrier's purpose is to stop the scanned thread before it catches up with the frames being scanned. After scanning a few frames, the scanning thread can move the barrier further along the call stack in the return direction, so the scanned thread may never encounter it. If it does reach the barrier, the scanned thread must wait until the scanning thread finishes and removes the barrier before continuing.


Cheng and Blelloch use stack barriers to bound the work in each collection increment, and use the technique to implement asynchronous stack scanning. They divide each thread's stack into fixed-size stacklets, each scannable in one increment; the return from one stacklet to the next is a candidate barrier position. The scheme requires neither that stacklets be laid out contiguously nor that barrier-eligible frames be determined in advance.

Stack barriers can also serve an entirely different purpose: recording which parts of the stack have not changed, so that the collector need not re-scan those portions for new pointers. In a mostly-concurrent collector this can shorten the flip time at the end of a collection cycle.

Another use is handling dynamic changes to code, especially optimized code. Suppose subroutine A calls B, and B calls C, and the system has inlined B into A so that A+B share one frame. If the user modifies B, subsequent calls to B should execute its new version.

Thus when the thread returns from C, the system must deoptimize A+B, creating separate frames for the unoptimized versions of A and B; only after the thread returns from B into A can A call the new version of B. The system may even re-optimize later and build a new version of A+B. The point here is that returning from C into A+B triggers the deoptimization, and a stack barrier is one way to implement that trigger.

11.6 Safe collection points and mutator suspension

We noted in Section 11.2 that the collector needs to know which stack slots and registers contain pointers; this information generally differs from one code location to another.

Concerning where garbage collection may happen, two issues need attention:

  1. Whether the collector can safely collect at a given IP (instruction pointer).
  2. How to bound the size of the stack maps (see Section 11.2 on stack-map compression); allowing collection at more places generally demands larger maps.

Consider first why a collector may be unable to collect safely at a particular IP.

Most systems contain short code sequences that must execute as a unit to preserve invariants on which collection depends. A typical write barrier, for example, performs the low-level write and also records some additional information.

If a collection happens between those two steps, objects may be missed or pointers incorrectly updated.

Systems contain many such short sequences, each of which should appear atomic to the collector (though they are not atomic in the strict concurrency sense). Further examples include creating a new stack frame and initializing a new object.

A system could simply allow the collector to initiate collection at any IP. The collector then need not care whether a mutator thread is suspended at a location where collection is safe, that is, a GC-safe point, or GC-point for short. Such systems, however, are usually more complicated to implement, because they must provide a stack map for every IP, or else use techniques that need no stack maps at all (such as those for "uncooperative" C and C++ compilers).

  • Suppose the system allows collection at most IPs. If, when a collection is requested, a thread is suspended at an unsafe IP, the collector can either interpret the instructions from the suspension point forward to the next safe point, or wake the thread briefly so that it runs (with some probability) to a safe point. Interpretation increases the risk of errors, while nudging a thread forward only reaches a safe point with some probability. Moreover, the stack-map space such systems require can be large.

  • Many systems take an entirely different approach: they allow collection only at specific, registered safe points and generate stack maps only for those points. For correctness, the minimal set of safe points must include every allocation site (since collection typically happens there), every call site from which allocation may occur, and every call site at which a thread may be suspended (since while one thread is suspended, another may trigger a collection).


To guarantee that every thread reaches a safe collection point within bounded time, the system can add collection points beyond this minimal set.

In particular, it may need to add safe collection points in every loop:

  • A simple rule is to make every backward branch within a function a safe collection point. The system should also place safe points at each function's entry and return; otherwise a thread might pass through many function calls, recursive calls especially, before reaching one.

Since these extra collection points do not themselves trigger collection, a thread at such a point merely checks whether another thread has initiated one, so we may call them GC-checkpoints.

Although GC-checkpoints impose some mutator overhead, it is usually small, and the compiler can reduce it by simple means, for example optimizing the checkpoint away when a function is very short or contains no loops or further calls.

To avoid performing the check on every loop iteration, the compiler can also introduce an extra level of loop so that the check runs only once every n iterations.

Of course, if the check is cheap enough these optimizations are unnecessary. In any case, the system must balance checking frequency against the latency of initiating a collection.


Agesen compared two strategies for suspending threads at safe collection points.

  • One is polling, the scheme just described: at every GC-checkpoint the thread tests a flag which, if set, means another thread has initiated a collection.

  • The other is patching: while a thread is suspended, the code at the next collection point (or points) on its execution path is modified so that the thread stops there once it resumes.

This resembles a debugger planting temporary breakpoints. Agesen found patching considerably cheaper than polling, but harder to implement and more error-prone in concurrent systems.

When we introduced GC-checkpoints we mentioned the handshake mechanism between the collector and the mutators.

A handshake is necessary even in the not-truly-concurrent case where several mutator threads run on a single processor. Before collection begins, the collector must wake every thread that is not suspended at a safe collection point and let it run to one. To avoid this extra complexity, some systems guarantee that threads are only ever suspended at safe points; but for other reasons a system may not control every aspect of thread scheduling, so handshakes may still be required.

Specific handshake mechanisms:

Each thread can maintain a thread-local variable indicating whether other threads need it to attend to some event at a safe collection point; this mechanism can serve several purposes, signalling garbage collection among them. The thread checks the variable at each GC-checkpoint and, if it is non-zero, runs the system routine selected by its value.

One special value means "time to collect". On seeing that request, the thread sets another thread-local variable to indicate that it is ready, and decrements a global counter that the collector is watching. Systems usually work hard to make thread-local variables cheap to access, so this can be a good handshake implementation; a sketch follows.
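A sketch of this polling handshake (the event codes, field names, and blocking step are assumptions):

```c
#include <stdatomic.h>

enum { EVENT_NONE = 0, EVENT_GC = 1 };    /* illustrative event codes */

typedef struct {
    atomic_int attention;  /* written by other threads to request attention */
    atomic_int ready;      /* set by this thread at a safe collection point */
} ThreadState;

atomic_int threads_pending;               /* collector waits for this to reach 0 */

/* Called at every GC-checkpoint. */
void gc_checkpoint(ThreadState *self) {
    if (atomic_load(&self->attention) == EVENT_GC) {
        atomic_store(&self->ready, 1);             /* advertise readiness */
        atomic_fetch_sub(&threads_pending, 1);     /* collector is watching */
        /* ... block here until the collector finishes, then resume ... */
    }
}
```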

Another option is to set a processor condition code in the suspended thread's saved state, so that at a GC-checkpoint the thread can branch to the matching system routine with a very cheap conditional branch.

This scheme applies only to processors with multiple sets of condition codes (such as the PowerPC), and it must also ensure that the thread is not in the middle of external code when woken. If the processor has enough registers, a dedicated register can carry the signal at almost the same cost as a condition code. If a thread is executing external code, the system needs some way to notice when it returns from that code (unless the thread happens to be suspended at a place equivalent to a safe point); hijacking the return address (compare Section 11.5) is one way to catch a thread returning from external code.

The system can also implement the handshake with operating-system-level inter-thread signals, as some POSIX threads implementations do. This may not be widely portable, and its efficiency can be a problem: delivering a signal to a user-level handler passes through the kernel along a fairly long path, and the mechanism relies on processor interrupts and disturbs the caches and the translation lookaside buffer (TLB), which also hurts efficiency.


To sum up, there are two main ways to implement the handshake between collector and mutator threads:

  1. Synchronous notification, also known as polling
  2. Asynchronous notification through some kind of signal or interrupt

We should further point out that if each thread scans its own stack, hardware and software concurrency must also be considered; the material of Chapter 13 is relevant here. Perhaps most relevant to handshaking is Section 13.7, where we describe how threads move from one collection phase to the next, and what work mutator threads must perform at the start and end of a collection.

11.7 Garbage collecting code

Many systems can dynamically load or construct code and optimize it at runtime. Since code can be loaded or generated dynamically, we naturally want the space it occupies to be reclaimed once it is no longer used. Straightforward tracing or reference-counting algorithms usually cannot meet this requirement, because many functions reachable from global variables or symbol tables will never become unreachable. Some languages rely on the developer to explicitly unload code instances; a language may not support such unloading at all.

In addition, two special scenarios deserve further attention.

  1. Closures bind a function to a set of environment variables. Suppose a simple closure consists of a function g nested inside a function f, together with f's complete environment; an environment object may be shared between them. Thomas and Jones [1994] describe a system that, at collection time, can specialise a closure's environment to just the variables used by the function g. This strategy allows some otherwise-retained closures to become unreachable and be reclaimed.

  2. In class-based systems, object instances typically refer to information about the type they belong to. The system usually stores type information and the corresponding method code in a non-moving, non-collected area, so the collector can ignore type-information pointers in objects. If type information is to be collected, however, the collector must trace those pointers in every object, which in the normal case may add significant collection overhead; the collector might trace type-information pointers only in a special mode.

In Java, a runtime class is identified by its class code together with its class loader.


Class life cycle:

  • loading
  • verification
  • preparation
  • initialization
  • unloading

The Java virtual machine specification places no mandatory constraint on when loading occurs; that is left to each implementation. For the initialization phase, however, the specification strictly prescribes exactly five situations in which a class must be initialized immediately (with loading, verification, and preparation naturally starting beforehand):

  1. When the bytecode instructions new, getstatic, putstatic, or invokestatic are encountered and the class has not been initialized, its initialization must be triggered first. The most common Java code scenarios generating these four instructions are: instantiating an object with the new keyword, reading or setting a static field of a class (except final static fields whose values were placed in the constant pool at compile time), and invoking a static method of a class.

  2. When a reflective call is made to a class via the java.lang.reflect package and the class has not been initialized, its initialization must be triggered first.

  3. When initializing a class whose superclass has not yet been initialized, the superclass's initialization must be triggered first.

  4. When the virtual machine starts, the user designates a main class (the class containing the main() method) to execute, and the virtual machine initializes that class first.

  5. When using JDK 1.7's dynamic-language support, if the final resolution of a java.lang.invoke.MethodHandle instance is a method handle of kind REF_getStatic, REF_putStatic, or REF_invokeStatic, and the class corresponding to that handle has not been initialized, its initialization must be triggered first.


Because loading a class usually has side effects (such as initializing static variables), unloading a class is not transparent (that is, it has side effects; translator's note), since the class may later be reloaded by the same class loader.

The only way to guarantee that a class will not be loaded again by some class loader is for the class loader itself to become reclaimable. A class loader holds a table of loaded classes (to avoid duplicate loading or initialization), and each runtime class must refer to its class loader (as part of its identity).

Therefore, to reclaim a class we must ensure that its class loader, the other classes loaded by that loader, and all instances of classes loaded by that loader are unreferenced by live threads and by global variables (meaning, here, instances of classes loaded by other class loaders).

Moreover, since the bootstrap class loader is never reclaimed, no class it loads can ever be reclaimed. Because Java class unloading is such a special case, programs or servers that rely on it may run out of space.

Even for user-visible code elements (methods, functions, closures, and so on), the system may generate multiple instances, for interpretation or native execution: optimized and unoptimized versions, specialised versions of functions, and the like.

Generating a new version of a function may make the old version unreachable for future calls, yet the old version may still be executing; return addresses in stack slots or closures keep it reachable.

Therefore the system can never reclaim an old code instance immediately; it can reclaim it only by tracing or by reference counting.

A relevant technique here is on-stack replacement, in which the system replaces an executing old version of code with a new version. It not only improves the performance of running method invocations but also helps reclaim old code versions, so it is increasingly widely used.

On-stack replacement usually serves optimization, or other needs such as deoptimizing code for debugging; but the collector can exploit it to reclaim old versions of code as well.

11.8 Read and write barriers

Many garbage collection algorithms require the mutator to detect and record interesting pointers at runtime. If the collector reclaims only part of the heap, any pointer into that region from outside it is an interesting pointer, and the collector must treat it as a root in subsequent processing.

For example, a generational garbage collector must record every write that stores a reference to a young-generation object into an old-generation object.

When mutator and collector run interleaved (whether or not the collector has its own thread), mutator activity can easily hide reachable objects from the collector; if such references are not correctly detected and passed to the collector, live objects may be reclaimed prematurely. In these scenarios the mutator must add interesting pointers to the collector's work list immediately, a task accomplished with read or write barriers.

In this section we abstract away from the read and write barriers of specific algorithms (generational or concurrent collectors, say) and concentrate on the detection and recording of interesting pointers.

  • Detection determines whether a pointer is interesting.
  • Recording registers interesting pointers for later use by the collector.

Detection and recording are somewhat orthogonal, but certain detection methods favour particular recording methods: if writes are detected through page-protection violations, for example, then recording modified locations at page granularity is the natural fit.

11.8.1 Design Engineering of Read-Write Barriers

A typical barrier performs some additional checks and operations besides the actual read or write. Typical checks include whether the pointer being written is null and the relationship between the generations of the target and the referring object; a typical action is recording the entry in a remembered set.

The complete check-and-record sequence may be too large to inline in full, though this depends on the barrier's implementation. Even a relatively short inlined sequence can noticeably bloat the compiled code and degrade instruction-cache performance.

Since most of a barrier's code is rarely executed, the designer can split the instruction sequence into a fast path and a slow path (see the sketch below):

  • the fast path is inlined for performance;
  • the slow path is invoked only when necessary, and to save space and improve instruction-cache behaviour there is usually a single copy of it.
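A sketch of such a split for a generational write barrier (the nursery boundaries and the helper function are assumptions):

```c
#include <stdint.h>

extern char *nursery_start, *nursery_end;   /* young-generation bounds */

void remember_slow(void **field);           /* slow path: one shared copy */

/* Fast path, meant to be inlined at every pointer store. */
static inline void write_barrier(void **field, void *value) {
    *field = value;                               /* the actual write */
    /* filter: only old->young stores are interesting */
    if ((char *)value >= nursery_start && (char *)value <  nursery_end &&
       ((char *)field <  nursery_start || (char *)field >= nursery_end))
        remember_slow(field);                     /* record in the remembered set */
}
```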

The fast path should cover the common cases, while the slow path should execute only rarely. The same principle applies within the slow path:

  • if a barrier performs several checks, the designer should order them so that the first filters out the most cases, the second the most of what remains, and so on, minimizing total checking overhead.

In practice the designer usually has to try several orderings of the checks and measure each, because modern hardware involves so many interacting factors that simple analytical models rarely give good enough guidance.


Another way to improve barrier performance is to speed up access to the necessary data structures, such as the card table. The system may even dedicate a register to a pointer into such a structure, the card-table base for instance, though whether this pays off depends on the machine and the algorithm.

Designers must also attend to software engineering: how the various pieces of the collection algorithm (read and write barriers, collection checks, allocation sequences, and so on) are integrated into the system's compiler.

If possible, the designer should be able to tell the compiler which subroutines to inline; the bodies of these subroutines are the fast-path code sequences. The compiler then need not know the details, and the designer is free to replace the inlined subroutines. But as noted earlier, these sequences may carry restrictions, such as garbage collection being forbidden during their execution, so the designer must be careful.

The compiler may also need to refrain from optimizing these sequences: preserving apparently useless writes (they write data the collector needs), and forbidding reordering of barrier code or interleaving it with surrounding code. Finally, the compiler may need to support special pragmas, or let designers mark code with special attributes, such as uninterruptible code sequences.

11.8.2 Write Barrier Precision

There are many strategies and mechanisms for recording interesting pointers, and the choice determines how precisely the remembered set records their locations. In choosing a recording strategy we must balance the respective overheads of mutator and collector.

In practice we usually prefer to increase the overhead of the relatively infrequent collector actions (such as finding the roots) in order to reduce the overhead of the far more frequent mutator actions (such as heap writes).

A write barrier may double or more the number of instructions a pointer write requires, but much of this overhead may be hidden, for example because the barrier's work overlaps other mutator activity (recording an interesting pointer typically need not stall user code).

In general, the more precisely the remembered set records where interesting pointers are, the lower the collector's search overhead and the higher the mutator's cost of filtering and recording.


As an extreme case, the mutator in a generational collector could record no pointer writes at all, shifting the entire cost onto the collector, which must then scan the whole heap to find every reference into the condemned generations.

Although this is not a generally successful generational strategy, it may be the only one available when pointer writes cannot be trapped with compiler or operating-system support; the collector can at least use a linear scan, which has better locality than tracing, to find the interesting pointers.

The design of a remembered set must be considered along three dimensions.

  1. How precisely should interesting pointers be recorded?

Not all pointers are interesting, and unconditional recording clearly imposes less per-write mutator overhead than recording only after filtering out uninteresting pointers. The remembered set's implementation determines the cost of filtering:

  • If entries can be added to the remembered set by a very cheap mechanism, such as unconditionally setting a byte in a fixed-size table, unconditional recording is attractive, especially when the add operation itself is idempotent.

  • If adding entries is expensive, or the remembered set's size must be controlled, a write barrier that filters out uninteresting pointers is necessary. For concurrent or incremental collectors, filtering is essential to guarantee that the collector's work list can eventually empty.

Each filtering strategy must decide how much of the filtering logic to inline and when to make an out-of-line call to filter or to add pointers to the remembered set. The more checking is inlined, the fewer instructions execute, but code size bloats and instruction-cache misses become more likely, hurting performance; developers must therefore tune the order of the filter checks and the subset of them that is inlined.

  2. At what granularity should pointer locations be recorded?
  • The most precise choice is to record the address of the field the pointer was written into, but if many pointer fields of one object are updated (when updating an array, say), this inflates the remembered set.
  • Alternatively, record the object whose pointer field was modified. This permits duplicate elimination per object, which is usually impossible at field granularity (a pointer field has no spare bits in which to mark that it has already been recorded).

Recording at object granularity requires the collector, during the tracing phase, to scan every pointer field in a recorded object to find the referents not yet traced.

A hybrid records arrays at object granularity and plain objects at field granularity, on the grounds that when one array element is updated the others usually are too. The opposite mix is also possible: arrays at field granularity (to avoid scanning whole arrays) and plain objects at object granularity (plain objects are usually small).

For arrays it is also possible to record only a portion of the array; this strategy is very similar to card marking, except that alignment follows array indices rather than the virtual-memory addresses of array fields. Whether to record objects or fields also depends on what information is available to the mutator:

  • If the write barrier has both the object's address and the field's address, it can choose either.

  • If it has only the address of the field being written, computing the address of the containing object may introduce extra overhead.

Hosking et al. addressed this problem in an interpreted Smalltalk system: their strategy records both the object's and the field's address in a sequential store buffer.


Card-table techniques logically divide the heap into small, fixed-size cards.

This scheme records pointer modifications at card granularity, usually by setting a mark byte in the card table. A card mark may correspond either to the modified field or to the modified object (the two may fall in different cards). At collection time the collector must first find all dirty cards relevant to the condemned generations and then locate the interesting pointers recorded within them; whether the table records objects or fields affects the performance of this lookup.

Coarser than card tables is recording at virtual-memory-page granularity. Its advantage is that the write barrier can be implemented with hardware and operating-system support, imposing no direct burden on the mutator; like card tables, however, it increases the collector's workload. The difference from card tables is that, since the operating system knows nothing about object layout, page marking can only correspond to the modified pointer field, not to the object containing it.

  3. Should the remembered set be allowed to contain duplicate entries?

Allowing duplicates reduces the mutator's deduplication work, at the cost of a larger remembered set and extra collector effort to process the duplicates.

  • Card tables and page marking record by setting a mark bit or mark byte in a table, so duplicates are eliminated automatically.

  • When recording objects, duplicates can likewise be eliminated by marking the objects, for instance with a header bit saying the object is already in the log; at field granularity no such simple deduplication is possible.

Although marking objects reduces the remembered set's space, it requires the mutator to perform an extra test and an extra write.

If duplicates are not allowed, the remembered set must be implemented as a true set rather than a multiset.


To summarize: with card-table or page-based recording, the collector's scanning overhead depends on the number of dirty cards or pages.

If duplicate entries are allowed in the remembered set, the collector's overhead depends on the number of pointer writes; if duplicates are not allowed, it depends on the number of modified pointer fields. In either case, filtering out uninteresting pointers reduces the collector's root-scanning overhead. Remembered sets can be implemented with hash tables, sequential store buffers, card tables, or virtual-memory mechanisms and hardware support; we introduce each in turn.

11.8.3 Hash tables

If there is not enough room in object headers to record whether an object has already been added to the remembered set, a set structure must record the objects. We further want additions to the remembered set to be fast, ideally constant-time. A hash table is one implementation that satisfies these requirements.

In the multi-generation memory-management toolkit of Hosking et al., the remembered set is implemented as a circular hash table using linear hashing, applied in a Smalltalk interpreter that keeps stack frames in step 0 of generation 0 of the heap.

Specifically, each generation has its own remembered set, which may record either objects or fields. The hash table is built on an array of 2^i + k elements (k = 2); an address is mapped to an i-bit hash value (taken from the middle bits of the object's address) used as its index into the array.

If the slot at that index is empty, the object's address or field is stored there; otherwise the next k slots are probed for a free one (this probe is not circular, which is why the array's size is 2^i + k). If that also fails, the search continues circularly through the array.
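A sketch of insertion into such a table, with illustrative parameters i = 12 and k = 2 (the particular hash, taken from middle address bits, is an assumption about layout):

```c
#include <stdint.h>
#include <stddef.h>

#define I_BITS 12
#define K      2
#define SLOTS  ((1u << I_BITS) + K)        /* 2^i + k slots, k extra at the end */

static void *remset[SLOTS];

void remset_add(void *addr) {
    size_t h = ((uintptr_t)addr >> 4) & ((1u << I_BITS) - 1);  /* middle bits */
    for (size_t j = 0; j <= K; j++) {      /* home slot plus k linear probes */
        if (remset[h + j] == addr) return;              /* already recorded */
        if (remset[h + j] == NULL) { remset[h + j] = addr; return; }
    }
    for (size_t j = (h + K + 1) % SLOTS; j != h; j = (j + 1) % SLOTS) {
        if (remset[j] == addr) return;                  /* circular search */
        if (remset[j] == NULL) { remset[j] = addr; return; }
    }
    /* table effectively full: grow or flush (omitted, see Section 11.8.5) */
}
```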

To reduce the pointer-recording workload, the write barrier first filters out all writes to generation-0 objects and all created young-to-young pointers (pointers from young objects to young objects). Furthermore, the write barrier adds every interesting pointer to a single "scratch" remembered set rather than directly to the remembered set of the target generation.

This strategy avoids spending mutator time determining which remembered set an interesting pointer belongs to, so it may be better suited to a multithreaded environment. Maintaining a separate "scratch" remembered set per processor also avoids potential contention, since a thread-safe hash table could introduce substantial runtime overhead.


Hosking et al. implement the write barrier's fast path in 17 inlined MIPS instructions, including the associated call for updating the remembered set.

Even on a register-rich architecture like MIPS, this is a fairly expensive barrier.

At collection time, the roots for a generation are either in that generation's remembered set or in the "scratch" remembered set. The collector can deduplicate by rehashing the interesting pointers of the generation's remembered set into the "scratch" set, and then add all interesting pointers in the "scratch" set to the appropriate remembered sets.

Garthwaite also used hash tables in his implementation of the Train collection algorithm.

Its hash-table operations are mainly insertion and iteration, so it resolves collisions with open addressing. Since the table often records adjacent addresses, it abandons the linear scheme that maps adjacent addresses to adjacent slots (simple address modulo N, where N is the table size) in favour of a universal hash function.

Garthwaite chose a 58-bit prime p and associated with each hash table two parameters a and b, generated by repeated calls to a pseudo-random function, with 0 < a, b < p. The index of an address r in the hash table is ((a·r + b) mod p) mod N.

When a collision occurs, open addressing needs some way to probe again. Linear probing and quadratic probing (where the next probe index is the current index plus d, with d increased by a constant i at each probe) can make a group of insertions generate identical probe sequences, so Garthwaite uses rehashing, replacing the quadratic-probing increment i with a function of the address.

For a hash table whose size is a power of 2, the whole table is guaranteed to be probed provided the probe increment i is odd. Garthwaite's strategy is:

  • at each probe, test whether d is odd; if so, set i to zero (linear probing), otherwise set both d and i to d + 1. This doubles the set of available probe sequences.

Finally, if the table's load grows too high it must be expanded. One alternative is to rebalance the table by modifying the insertion procedure: on a collision, decide whether to continue probing with the address being inserted or to evict the entry occupying the slot and continue probing with it (inserting it at a new position).

Garthwaite et al. use robin hood hashing, in which each stored entry records the number of probes taken during its insertion; since the addresses recorded in the table contain many zero bits (card addresses, for example), those bits can be reused to hold the probe count.

When a new address probes an occupied slot, whichever of the existing and the new address has the larger probe count stays in the slot, and probing continues with the other.

11.8.4 Sequential store buffers

A simple sequential store buffer (SSB), for example a linked list of memory blocks, can speed up pointer recording.

Each thread can maintain a single thread-local SSB shared by all generations:

  • this avoids the write barrier's cost of choosing the right buffer;
  • it eliminates contention between threads.

In general, only a few instructions are needed to add an entry to a sequential store buffer:

  • test whether the next pointer has reached its limit, store the reference at the next position in the buffer, and bump the next pointer.

MMTk implements its sequential store buffers as linked lists of memory blocks, each a power of two in size and aligned on that same power of two, filled from high addresses toward low. The write barrier can then detect overflow simply by testing whether the low bits of the next pointer are zero (usually a very fast test), as the sketch below shows.
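A sketch of this barrier fragment (the block size and overflow routine are assumptions; MMTk's actual code differs):

```c
#include <stdint.h>

#define SSB_BLOCK_LOG 12                 /* 4 KiB blocks (illustrative) */

extern void ssb_overflow(void);          /* link in a fresh block */
static _Thread_local void **ssb_next;    /* next free slot; filled downward */

void ssb_record(void *entry) {
    *--ssb_next = entry;                 /* bump down, then store */
    /* block base reached: the low bits of next are all zero */
    if (((uintptr_t)ssb_next & ((1u << SSB_BLOCK_LOG) - 1)) == 0)
        ssb_overflow();
}
```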

There are ways to eliminate explicit overflow tests altogether, reducing an SSB append to one or two instructions, as shown in Algorithm 11.4. On the PowerPC, with a dedicated register, it can be a single instruction: stwu fld, 4(next).
[Algorithm 11.4: appending to a sequential store buffer]

Appel, Hudson and Diwan, and Hosking et al. use write-protected guard pages to eliminate explicit overflow tests.

When the write barrier tries to add an entry on the guard page, the trap handler performs the appropriate overflow action. Triggering and handling a page-protection exception is very expensive, typically hundreds or even thousands of instructions, so the strategy pays off only when the trap is rare, that is, when the trap's cost is below the aggregate cost of explicit software tests:

cost of a page-protection trap ≤ cost of one overflow test × buffer size

Appel keeps the sequential store buffer in the young generation and organizes its memory blocks in a linked list, thereby ensuring that the page-protection trap fires exactly once per collection cycle.

Appel places the guard page in the reserved space at the end of the young generation, so that any allocation (of objects or of remembered-set blocks) may spring the trap and invoke garbage collection.

This technique requires the young generation's space to be contiguous. Some systems place the heap at the end of the data segment and grow (or shrink) it with the brk system call.

But as Reppy notes, protecting pages beyond the end of the heap interferes with malloc's use of brk, so it is better to use higher address space and manage heap growth and contraction with mmap.


Special mechanisms on some architectures can also eliminate overflow tests. For example, the Solaris UTRAP exception, intended for misaligned data accesses, is hundreds of times faster than the Unix signal-handling mechanism.

Detlefs et al. allocate sequential store buffers from a linked list of 2^n-byte memory blocks, each aligned on a 2^{n+1}-byte boundary but deliberately not on a 2^{n+2}-byte boundary; this alignment requirement may waste some space.

Algorithm 11.5 describes its insertion procedure:

  • The next register normally points 4 bytes past the next entry; when the buffer fills, that is, when next reaches the slot just before a 2^{n+2}-byte alignment boundary, the store triggers a UTRAP, as on line 5 of the example.

[Algorithm 11.5: SSB insertion with UTRAP-based overflow detection]
The example above may contain errors; the author's notes and corrections follow, and may themselves be imperfect.

Take n = 4: the memory block is 2^4 = 16 bytes, aligned on 2^5 = 32 bytes, with alignment boundary 2^6 = 64 bytes. The slot before 64 is at 64 − 4 = 60, so the trap fires when next reaches 60.

1. With the entry at 32, next is 40 and the insertion point is 36; after inserting, the entry is at 36, the insertion point 40, and next 44.
...
3. With the entry at 40, next is 48 and the insertion point 44; after inserting, the entry is at 44 and the insertion point 48; next >> (n−1) = 0b110000 >> 3 = 6 (the example may be wrong here), tmp = 6, and next becomes 54.
4. With the entry at 44, next is 54 and the insertion point 50; after inserting, the entry is at 48, the insertion point 50, and next = 60.
5. The next insertion triggers the trap.

Sequential store buffers can serve directly as the remembered-set implementation, or as a fast recording front end to a hash table. In a simple two-generation collector that promotes en masse, the young generation is empty after each minor collection, so the remembered set can simply be discarded; such a collector needs no more elaborate remembered-set structure (assuming the sequential store buffer does not overflow before a collection).

Other, more complex collectors, however, must retain remembered sets between collections. With more than two generations, even if the condemned generations are promoted en masse, inter-generational pointers among the older generations must still be preserved. If the condemned generations contain steps, or other delayed-promotion strategies are used (see Section 9.4), the remembered set must still record references from older generations to unpromoted objects.

One option is simply to remove entries that are no longer needed from the sequential store buffer, either by nulling the pointer at that position or by pointing it at an object processed only during full-heap collections (or never collected). An entry can also be removed if its object contains no interesting pointers.

Neither approach, however, controls the growth of the remembered set, and both may make the collector reprocess the same long-lived entries repeatedly. A better solution is to move the entries that must be kept into the remembered sets of the appropriate generations; those target sets may themselves be sequential store buffers, or may be converted to more precise hash-table records.

11.8.5 Overflow handling

Both hash tables and sequential store buffers can overflow; there are several ways to handle this. When a sequential store buffer overflows, MMTk allocates a new memory block and links it into the buffer. Hosking et al. instead periodically move the buffer's contents into a hash table and clear the buffer, whether or not it is about to overflow.

To keep the hash table relatively sparse, it must be expanded whenever an insertion collides, or still collides after k linear probes, or the table's occupancy exceeds some threshold (say 60%). Expansion extends the key length and simply doubles the table's size, but then the key length cannot be a compile-time constant, which increases the write barrier's code size and execution overhead.

Appel keeps his sequential store buffer in the heap and invokes garbage collection as soon as it overflows. MMTk likewise initiates a collection when the collector's own metadata (such as the sequential store buffers) grows too large.

11.8.6 Card table

The card-table (card-marking) strategy logically divides the heap into fixed-size contiguous regions, each called a card.

Cards are usually small, between 128 and 512 bytes. The simplest card-table implementation is a byte array indexed by card number: when a pointer write occurs within a card, the write barrier sets the card's byte in the table to dirty (see Figure 11.3).

  • A card's index is obtained by shifting the address of the modified field.
  • Card tables are designed so that the write barrier is small, fast, and inlinable into mutator code.
  • Unlike hash tables or sequential store buffers, card tables cannot overflow.

[Figure 11.3: card table]
But these gains always come at a price:

  • The collector's workload increases: it must scan every field in each dirty card to find the modified locations that may hold interesting pointers. The collector's work is thus proportional to the number of marked cards (and hence to card size), not to the number of writes that created interesting pointers.

The purpose of using the card table is to reduce the burden of the setter as much as possible, so it is usually applied in the unconditional write barrier, which means that the card table must be able to map all addresses that may be modified by the write operation to a certain slot in the card table .

If we can ensure that certain areas in the heap can never be written to recovery-related pointers, and introduce conditional checks to filter out pointer writes to these areas, the size of the card table can be reduced. For example, if the space in the heap above a certain fixed virtual address boundary is used as a new area (the collector processes this area during each collection), the card table only needs to create a corresponding space for the space below the boundary address. groove.

The most compact card table would be a bit array, but several factors make bits a poor choice. Modern instruction sets cannot write a single bit, so updating a bit takes more instructions than updating a byte:

  • read a byte, set or clear one bit with a logical operation, then write the byte back;
  • worse, this sequence is not atomic, so multiple threads updating the same card-table entry at the same time may lose updates, even when they modify different fields or objects in the heap.

For these reasons card tables normally use byte arrays. Because processors can store zero cheaply, 0 is usually chosen to represent "dirty". With a byte array, dirtying a card takes just two SPARC instructions (other architectures may need slightly more), as shown in Algorithm 11.6.

For brevity, ZERO denotes the SPARC register %g0, which always reads as 0. The BASE register must be initialised to CT1 - (H >> LOG_CARD_SIZE), where CT1 is the starting address of the card table and H is the starting address of the heap, both aligned to the card size (here, 512 bytes).
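The same displaced-base trick can be written in C; a sketch under the same assumed constants as the previous fragment. With the base precomputed, the barrier compiles down to a shift and a byte store.

```c
#include <stdint.h>

#define LOG_CARD_SIZE 9    /* as in the previous sketch */

/* card_base plays the role of BASE: the card table's address displaced
 * by the heap's starting card number, so one shift of the field's
 * address indexes the right slot directly. */
static uint8_t *card_base;

void barrier_init(uint8_t *ct1, uintptr_t h) {
    card_base = ct1 - (h >> LOG_CARD_SIZE);
}

static inline void dirty_card(void *field) {
    card_base[(uintptr_t)field >> LOG_CARD_SIZE] = 0;   /* 0 means dirty */
}
```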


Detlefs et al. use a SPARC local register as BASE, setting it whenever the program enters a function that may perform writes; the register-window mechanism saves and restores its value across calls.

[Algorithm 11.6 omitted]
Algorithm 11.7 reduces the write barrier's overhead in most cases, at the cost of some recording precision. In this algorithm, marking the i-th byte of the card table means that cards i through i+L may have been modified.

If the modified field lies within L cards of the start of its object, the barrier can mark the card containing the object's first address. Setting L to 1 covers most pointer writes; arrays are the exception, and their write barrier must mark cards in the traditional, precise way.

With 128-byte cards, a modification to any field of an object of at most 32 words can be recorded by dirtying the card containing the object's first address.
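Continuing the earlier sketch, an assumed C rendering of this imprecise variant:

```c
#include <stdint.h>

#define LOG_CARD_SIZE 9            /* as before */
extern uint8_t *card_base;         /* displaced base, as before */

/* Imprecise variant (after Algorithm 11.7): dirty the card holding the
 * object's header rather than the modified field.  Only valid when the
 * field lies within L cards of the header; array stores must use the
 * precise barrier instead. */
static inline void dirty_card_by_header(void *obj) {
    card_base[(uintptr_t)obj >> LOG_CARD_SIZE] = 0;
}
```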

[Algorithm 11.7 omitted]
Ambiguity arises only when the last object in a card spills over into the next card; the collector may then also need to scan that object (or the necessary initial part of it).

Even with fairly small cards, the card table's space overhead is usually acceptable. For example, on a 32-bit architecture the table for 128-byte cards occupies less than 1% of the heap. Choosing the card size is a trade-off between the table's space overhead and the time the collector spends scanning for roots:

  • Increasing the card size shrinks the card table but lowers its precision, and vice versa.

During collection, the collector must search all dirty cards for interesting pointers, so it must first scan the card table to find the dirty cards. Because mutator updates usually show strong locality, clean and dirty cards tend to cluster, and the collector can exploit this to speed up the search. With a byte-array card table, the collector can test a word of 4 or 8 card slots at a time.
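A sketch of that word-at-a-time scan in C, assuming (as in the sketches above) one byte per card, with 0 meaning dirty and cards reset to 1 (clean) after processing:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLEAN       1
#define CLEAN_WORD  0x0101010101010101ULL   /* eight clean slots at once */

void scan_card_table(const uint8_t *ct, size_t ncards,
                     void (*scan_card)(size_t card)) {
    size_t i = 0;
    for (; i + 8 <= ncards; i += 8) {
        uint64_t w;
        memcpy(&w, ct + i, sizeof w);       /* alignment-safe load */
        if (w == CLEAN_WORD) continue;      /* skip 8 clean cards at once */
        for (size_t j = i; j < i + 8; j++)
            if (ct[j] != CLEAN) scan_card(j);
    }
    for (; i < ncards; i++)                 /* remaining tail cards */
        if (ct[i] != CLEAN) scan_card(i);
}
```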

If a generational collector does not promote en masse, then after a minor collection some surviving objects remain in the young generation while others are promoted.

If a promoted object refers to an object that was not promoted, the resulting old-to-young pointer must dirty a card. But since those unpromoted objects will eventually be promoted themselves, we should avoid marking the promoted objects' cards where possible; otherwise the next collection will perform unnecessary card scans.

When promoting an object onto a clean card, Hosking et al. use a filtered copy barrier to scan the promoted object, so a card is dirtied only when necessary. Even so, if the heap is very large, the collector may still spend considerable time skipping over clean cards.


Detlefs et al. observed that the vast majority of cards are clean, and that a single card rarely contains more than 16 inter-generational pointers. A two-level card table can therefore speed up the collector's search for dirty cards, at the cost of some extra space.

The second-level table is smaller: each of its slots covers 2^n of the finer-grained cards, letting the collector skip clean regions that much faster. The write barrier can be implemented with the same technique as Algorithm 11.6 (only two extra instructions are needed), but the displaced start of the second-level table must line up with that of the first, i.e. CT1 - (H >> LOG_CARD_SIZE) = CT2 - (H >> LOG_SUPERCARD_SIZE), as shown in Algorithm 11.8. This requirement may waste some space.
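That alignment condition is exactly what keeps the barrier cheap; a sketch continuing the earlier C fragments (LOG_SUPERCARD_SIZE is an assumed constant):

```c
#include <stdint.h>

#define LOG_CARD_SIZE      9     /* as before */
#define LOG_SUPERCARD_SIZE 12    /* assumed: each second-level slot covers 8 cards */

extern uint8_t *card_base;       /* the shared displaced base */

/* Two-level dirtying (after Algorithm 11.8): the alignment requirement
 * makes the two tables' displaced bases equal, so the same card_base
 * serves both levels and only the shift amount differs.  The tables must
 * be laid out in memory so that the two index ranges land in the right
 * arrays. */
static inline void dirty_two_level(void *field) {
    card_base[(uintptr_t)field >> LOG_CARD_SIZE]      = 0;  /* fine-grained card */
    card_base[(uintptr_t)field >> LOG_SUPERCARD_SIZE] = 0;  /* second-level card */
}
```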

[Algorithm 11.8 omitted]

11.8.7 Crossing maps

During collection, the collector must process the dirty cards it finds in the card table, which means determining which objects in the card were modified, and which slots within them. Scanning an object's fields can usually only start from the object's first address, but a card's starting address does not necessarily coincide with an object's, so the scan cannot simply begin at the card boundary.

Worse, the pointer field that caused a card to be dirtied may belong to a large object whose header lies in some earlier card (one reason for storing large objects separately). To let the collector start scanning from an object's header, we need a crossing map that describes how objects are laid out within and across cards.

Each crossing-map entry corresponds one-to-one with a card, and records the offset within that card of the first object whose start address falls in the card. The collector sets the crossing-map entries for old-generation cards as it promotes objects; the allocator must do the same if it allocates objects directly into the old generation. Nursery objects cannot point to younger objects (they are themselves the youngest), so no card table need be maintained for the nursery. How the card table records its information (whole objects, or modified fields) determines how the crossing map must be designed.


If the write barrier uses the card table to record modified pointer fields, the crossing map must record the offset within each card of the last object whose start address falls in that card; if no object starts in a card, the map must record a negative offset for it.

Since objects may span several cards, the start of the object containing a modified slot may lie in a card before the dirty one. For example, in Figure 11.3 the white boxes in the heap represent objects; assume the figure depicts a 32-bit environment with 512-byte cards.

The last object starting in the first card begins at an offset of 408 bytes (102 words), and that value is recorded in the card's crossing-map entry. This object spans four cards, so the following two crossing-map entries hold negative values. When a field of the fifth object in the heap is modified (the area shown in grey), the corresponding card (the fourth) is marked dirty (the black entry). To find the start of the modified object, the collector must search backwards through the crossing map until it finds an entry with a non-negative offset (see Algorithm 11.9).

Note that the negative value encodes how far to step back: when objects are large, the collector can use it to jump quickly towards the entry holding the object's first address. The system could instead fill such entries with a fixed sentinel such as -1 to mean "step back one card", but that slows the search across large objects.
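A C sketch of this backward search, in the spirit of Algorithm 11.9 (entry width and constants assumed):

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_CARD_SIZE 9            /* 512-byte cards, as in the example */

/* One entry per card: the word offset of the last object starting in
 * that card, or a negative back-up distance if no object starts there. */
extern int16_t crossing_map[];

void *object_start(size_t card, uintptr_t heap_base) {
    int16_t v;
    while ((v = crossing_map[card]) < 0)
        card += v;                 /* v < 0: step back |v| cards */
    return (void *)(heap_base + ((uintptr_t)card << LOG_CARD_SIZE)
                              + (uintptr_t)v * sizeof(void *));
}
```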


The old generation is usually managed by a non-moving collector, so free blocks and allocated blocks are interleaved in the heap. In a parallel collector, different collector threads usually promote into different regions in order to avoid contention, so promoted objects easily form multiple islands separated by relatively large free areas.

To keep the heap parseable, each free area can be filled with a self-describing pseudo-object. The slot-based crossing-map scheme above, however, is better suited to densely packed objects:

  • If a large free block (say, 10MB) lies between two dirty cards, the first loop of the search in Algorithm 11.9 may need tens of thousands of iterations to find the header of the pseudo-object describing the free block.

One way to reduce this lookup cost is to store the logarithm of the back-up distance in the crossing map: if an entry records the value -k, the collector steps back 2^(k-1) cards and continues the search from the value recorded there (similar to a linear back-off strategy).

If an object must be allocated at the start of a large free block, the collector need only update log(n) crossing-map entries to restore the map's consistency, where n is the number of cards the block occupies.
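Continuing the previous sketch, the search step under this logarithmic encoding might look as follows:

```c
#include <stddef.h>
#include <stdint.h>

extern int16_t crossing_map[];       /* as in the previous sketch */

/* Logarithmic back-up (sketch): an entry of -k means "step back 2^(k-1)
 * cards and look again", so crossing an n-card free block takes O(log n)
 * hops, and allocating at its head invalidates only O(log n) entries. */
size_t start_card(size_t card) {
    int16_t v;
    while ((v = crossing_map[card]) < 0)
        card -= (size_t)1 << (-v - 1);   /* v == -k: back 2^(k-1) cards */
    return card;                         /* this entry holds a real offset */
}
```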

[Algorithm 11.9 omitted]
Garthwaite et al. designed an ingenious crossing-map encoding that eliminates the search loop. In this scheme, each crossing-map entry v is treated as a 16-bit unsigned integer (two bytes); Table 11.3 describes the encoding. If v is zero, no object in the corresponding card contains references. If v is no greater than 128, it gives the distance, in words, between the first object in the card and the end of the card.

Note that this recording scheme differs from the one described for Figure 11.3: recording the first object rather than the last ensures that in most cases the collector never needs to step back more than one card.

A large object such as an array may span several cards. In that case, an encoded value greater than 256 and no greater than 384 indicates that the object spans two or more cards: the first v - 256 words of the current card belong to the tail of such an object, and every word in that range is a pointer field.

The advantage of this range of encodings is that the collector can treat those words directly as pointers, without consulting the object's type information.

If, however, the object's words in the card mix pointer and non-pointer fields, this encoding cannot be used; v is then greater than 384, meaning the collector should step back v - 384 crossing-map entries and continue the search there. Alternatively, if an object spans two whole crossing-map slots, the object's address can be recorded in the four bytes formed by those two slots. This scheme assumes two-byte crossing-map entries, but with 512-byte cards and 64-bit alignment the same encoding fits in a single byte per entry.

[Table 11.3 omitted]
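A hedged C sketch of decoding such an entry, covering only the value ranges quoted above (the enum names are ours, and the ranges of Table 11.3 not quoted in the text are not reproduced):

```c
#include <stdint.h>

typedef enum { NO_REFS, FIRST_OBJ, SPANNING_PTRS, BACK_UP, OTHER } cm_kind;

cm_kind decode_entry(uint16_t v, unsigned *arg) {
    if (v == 0)                     return NO_REFS;
    if (v <= 128)  { *arg = v;       return FIRST_OBJ; }     /* words to card end */
    if (v > 256 && v <= 384)
                   { *arg = v - 256; return SPANNING_PTRS; } /* tail, all pointers */
    if (v > 384)   { *arg = v - 384; return BACK_UP; }       /* entries to step back */
    return OTHER;  /* 129..256: cases from Table 11.3 not covered here */
}
```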

11.8.8 Summarising cards

Some generational collectors do not promote en masse. If, while scanning a dirty card, the collector finds an interesting pointer but does not promote its target, it must leave the card's dirty mark set so that the card is scanned again in later collections.

Collection would be more efficient if those later collections could obtain such a card's interesting pointers directly, without rescanning the card. Fortunately, the vast majority of dirty cards contain only a few interesting pointers, so Hosking and Hudson suggest that, once a card has been scanned, its interesting pointers be added to a hash table and the card's dirty mark cleared. Hosking et al. adopt the same strategy, except that they use a sequential store buffer instead.

In Sun's Java virtual machine, the scavenger summarises cards that still contain interesting pointers after scavenging, and uses the summaries to optimise their later rescanning.

Because the card table is implemented as an array of bytes rather than bits, a card's state can be refined into "clean", "modified" and "summarised".

  • If the collector finds no more than k interesting pointers in a "modified" card, it marks the card "summarised" and records the offsets of those pointer fields in the card's entry in a summary table.

  • If a card contains more than k interesting pointers (say, k = 2), it remains "modified", and its entry in the summary table is marked "overflowed".

At the next collection the collector can thus find a summarised card's interesting pointers directly, without using the crossing map to scan it (unless the write barrier has marked the card dirty again). Moreover, since the card table is a byte array, a small number of offsets can be recorded directly in the card table itself if cards are relatively small.
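A small sketch of what such a summary table might look like in C; the layout, constants and names are ours:

```c
#include <stddef.h>
#include <stdint.h>

#define K 2                      /* max pointers summarised per card */
#define NO_OFFSET       0xFFFEu
#define ENTRY_OVERFLOW  0xFFFFu
enum card_state { MODIFIED = 0, SUMMARISED = 1, CLEAN = 2 };

struct summary { uint16_t offset[K]; };   /* word offsets within the card */

/* After scanning card `card` and finding n interesting fields at offs[]:
 * record up to K offsets and mark the card summarised; with more than K,
 * leave it modified and mark the entry overflowed for a later rescan. */
void summarise(struct summary *tab, uint8_t *state, size_t card,
               const uint16_t *offs, size_t n) {
    if (n <= K) {
        for (size_t i = 0; i < K; i++)
            tab[card].offset[i] = (i < n) ? offs[i] : NO_OFFSET;
        state[card] = SUMMARISED;
    } else {
        tab[card].offset[0] = ENTRY_OVERFLOW;
        state[card] = MODIFIED;           /* rescan via the crossing map */
    }
}
```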

Reppy adds generational information to the card-table encoding to reduce scanning costs. When a card has been cleaned, his multi-generational collector determines the generations of the objects referred to by every pointer field in the card, and records the number of the youngest such generation (0 denoting the nursery) in the summary card. During a later collection of generation n, the collector can then quickly skip any card whose summary entry is greater than n.

11.8.9 Hardware and virtual memory techniques

Some early generational garbage collectors depended on operating system and hardware support. Architectures that support tagged values can easily distinguish pointers from non-pointers, and some hardware can provide write barriers through the page table.

In the absence of special hardware, write operations can still be tracked with help from the operating system. Shaw, for example, modified the HP-UX operating system to use its paging system for this purpose. The virtual memory manager must in any case record dirty pages, in order to decide whether a page being evicted needs to be written back to the swap file.

Shaw's modification intercepts the virtual memory manager's paging operations to preserve the dirty state of evicted pages, and adds system calls to clear the dirty marks of a group of pages, or to return the set of pages modified since the last collection. The advantage of this strategy is that it imposes no ordinary-case overhead on the mutator, but its disadvantages are equally clear:

  • When the operating system marks a page dirty, it cannot tell whether the value written was a pointer, so the resulting remembered set is imprecise; and the overheads of page traps and system calls are far from negligible.

To avoid modifying the operating system, Boehm et al. write-protect the heap pages after each collection.

The first write to such a page then raises a write-protection fault; the trap handler sets the page's dirty flag and removes the page's protection, so that no further traps are taken on that page before the next collection.

During collection, the collector must of course remove the protection from the pages into which it will promote objects, to avoid triggering traps itself. Page protection adds no overhead to ordinary mutator stores, and, as with card tables, the write-barrier cost is proportional to the number of pages modified, not to the number of writes.

However, this strategy introduces other, more expensive overhead:

  • Reading the dirty-page information back from the operating system is usually expensive.
  • Page protection can cause so-called "trap storms": after a collection, the mutator triggers a flood of write-protection faults as it re-establishes write access to its working set.
  • The protection fault itself is costly, the more so if its handler runs in user space.
  • Operating system pages are usually much larger than cards, so the page scanning algorithm must be correspondingly more efficient (perhaps using techniques like the card summarising described above).

11.9 Address Space Management

Certain algorithms require large contiguous address spaces, or are simpler to implement with them. In a 32-bit address space, a static layout usually cannot guarantee that every space will be large enough for all applications.

To make matters worse, the operating system may load dynamically linked libraries (also known as shared object files) anywhere in the address space, fragmenting it and making a large contiguous region even harder to obtain.

In addition, for security reasons the operating system may load these libraries at randomised locations, different on every run of the program. A large 64-bit address space is one solution to this problem, but wider pointers also increase the application's physical memory footprint.

One of the main reasons for wanting a large contiguous layout is the efficiency of address-comparison write barriers: the barrier can compare a pointer directly against a fixed address, or against another pointer, without any table lookup. For example, if the nursery is placed at one end of the heap, the write barrier needs only a single address comparison to decide whether a pointer written into the heap refers to a nursery object.


When designing a new system, it is best not to rely on the heap being one large contiguous region; design it in terms of frames instead, or at least tolerate "holes" in the address space. Unfortunately, this may force write barriers to resort to table lookups.

If the cost of that lookup is acceptable, the system can map a logical address space onto the available virtual address space, and thereby manage a larger logical space. This does not increase the size of the heap, but it does remove the dependence on address-space contiguity and so gives the design a degree of flexibility.

Under this policy, available memory is divided into power-of-two aligned frames, each usually larger than a virtual memory page. The system maintains a table over all frames, indexed by frame number (typically the high bits of the frame's starting address), that records each frame's logical address; the various address-oriented write barriers can then compare addresses through this table.

For inter-generational write barriers, the system can also record in the table the generation each frame belongs to. Algorithm 11.11 gives pseudocode for this kind of write barrier; each line corresponds to roughly one instruction on a typical processor. If each frame-table entry occupies one byte, the array indexing is simplified.

Note that the barrier works even when ref is null: we simply give the frame at address zero the highest generation number, so the code never calls the remember method for that frame.
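A C sketch of such a frame-table barrier (after Algorithm 11.11; the frame size, table name and remember helper are assumptions of ours):

```c
#include <stdint.h>

#define LOG_FRAME_SIZE 16            /* assumed 64 KiB frames */

extern uint8_t frame_gen[];          /* generation number, one byte per frame;
                                        frame 0 holds the highest generation,
                                        which filters out null refs for free */
extern void remember(void **field);  /* record an interesting pointer */

static inline void write_ref(void **field, void *ref) {
    *field = ref;
    if (frame_gen[(uintptr_t)ref   >> LOG_FRAME_SIZE] <
        frame_gen[(uintptr_t)field >> LOG_FRAME_SIZE])
        remember(field);             /* older frame now refers to a younger one */
}
```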

[Algorithm 11.11 omitted]

We can even "condense" the available chunks of a larger address space into a smaller contiguous space; after all, this is exactly how the operating system presents virtual memory to processes.

One implementation strategy is to use wide addresses and check every address-space access, in effect simulating the virtual memory hardware in software, possibly including a software translation lookaside buffer. The performance penalty can be quite high, though it can be avoided by enlisting the virtual memory hardware itself; Section 11.10 gives more detail.

It is a good idea to build the system so that the heap can be relocated when the system starts. Many systems load a boot heap or system image at start-up. The image usually assumes it will reside at a particular address, and if that address is already occupied by a dynamically linked library, loading fails.

So if the image contains a table recording which of its words must be adjusted when it is moved (implemented much like the relocation of many code segments), the loader can fairly easily place it elsewhere in the address space. Likewise, making the whole heap, or parts of it, relocatable also improves the system's flexibility.

In practice, we can manage virtual memory by merely reserving address space for the runtime without asking the operating system to allocate real pages for it, which prevents the operating system from mapping dynamic libraries into the reserved range at run time. The reserved pages are usually demand-zero pages; reserving them is relatively cheap, though it may affect the operating system's resource accounting (such as swap space), and virtual memory mapping operations in general are expensive.
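On a POSIX-style system this reservation might look as follows (a Linux-flavoured sketch; the flag choice is ours):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Reserve a large contiguous range without committing memory: PROT_NONE
 * plus MAP_NORESERVE claims the addresses (keeping the loader out of
 * them) without charging swap; mprotect later commits pieces on demand
 * as demand-zero pages. */
void *reserve_heap(size_t size) {
    return mmap(NULL, size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

int commit_pages(void *addr, size_t len) {
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}
```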

With a large heap, a program might also try to find out whether the system has sufficient resources left by allocating pages in advance; but since the operating system usually commits no resources to a demand-zero page until it is actually touched, simply allocating pages can give the wrong answer.

11.10 Applications of Virtual Memory Page Protection

A garbage collection system can implement several kinds of checks with the help of the virtual memory page-protection mechanism. Such checks cost little or nothing in the normal case, and need no explicit conditional branches.

However, the cost of taking a page-protection trap must be considered: the handler must first enter the operating system and then return to user mode, which can be quite expensive.

Changing page protections also carries a cost, especially on a multiprocessor, where the system may need to stop all running processors and update their page mappings. So in some cases explicit checks are cheaper even where protection traps would work. Traps remain useful for dealing with "uncooperative" code, for which the system has no other way to impose barriers or checks.

Another consideration is that, for hardware performance reasons, page sizes may grow further in the future, while the total memory developers use, and the memory a system can map, keeps growing as well.

For speed and power reasons, however, translation lookaside buffers are unlikely to grow much. With TLB capacity more or less fixed, smaller pages would raise the TLB miss rate; but if pages grow larger instead, some of the virtual-memory techniques described here may no longer be applicable.

We assume an architecture whose page protection modes include:

  • read-write access
  • read-only access
  • no access

We do not consider execute permission here, since we know of no garbage collection technique that exploits no-execute protection, and some platforms cannot control it anyway.

11.10.1 Double mapping

Double mapping: the system maps the same page at two different virtual addresses, with a different protection at each.

Consider an incremental copying collector (see Chapter 17) that maintains the to-space invariant. To prevent the mutator from seeing from-space pointers in pages the collector has not yet processed, the collector can set those pages to no-access, in effect using hardware support to implement a read barrier cheaply.

In a concurrent system, if the collector simply removed the no-access protection from such pages, the mutator might access their contents before the collector finished processing them. Instead, the collector can map each page to be processed a second time, at another address with read-write access, process the contents through that second mapping, and only then remove the no-access protection from the first mapping and resume any mutator threads waiting to access the page.
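A POSIX sketch of setting up such a pair of mappings (the shared-memory object name is ours):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Back the pages with a shared-memory object and map it twice: no-access
 * for the mutator, read-write for the collector.  The collector later
 * calls mprotect on the mutator view once a page has been processed. */
int double_map(size_t size, void **mutator_view, void **collector_view) {
    int fd = shm_open("/gc-pages", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) != 0) return -1;
    *mutator_view   = mmap(NULL, size, PROT_NONE, MAP_SHARED, fd, 0);
    *collector_view = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    shm_unlink("/gc-pages");    /* the mappings keep the object alive */
    close(fd);
    return 0;
}
```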

When the address space is small (today even 32 bits counts as small), double mapping may be difficult. One solution is to fork a child process that maps the pages to be processed into its own address space under a different protection; by communicating with the child, the collector can direct it to process those pages.

Note that double mapping is problematic on some systems. If the cache is indexed by virtual address, the two mappings may give rise to cache inconsistencies. To avoid this, hardware usually disallows aliased entries from residing in the cache at the same time, which may cause additional cache misses.

In our scenario, however, the mutator and collector generally access the two mappings of a page at nearby times and often run on the same processor, so the cache-miss problem matters less (translator's note).

11.10.2 Applications of no-access pages

In describing double mapping we have already seen one application of no-access protection: an unconditional read barrier. There are at least two other common uses of this strategy.

  1. Detecting dereferences of null pointers (that is, pointers whose target address is 0).

The system sets page 0 (and possibly some following pages) to no-access. If the mutator tries to access a field through a null pointer, it must read or write a protected page.

Since handling a null-pointer exception rarely needs to be fast, no-access protection is a reasonable fit here. In the rare case where the program accesses an address at a large offset from 0, the compiler can add an explicit check.

If the object's header or other fields are laid out at negative offsets from the object pointer, the system can also protect the highest-addressed pages (a negative offset from address 0 wraps around to the top of the address space, translator's note). In most systems, however, the high end of the address space is reserved for the operating system.

  2. Guard pages (sentinel pages).

For example, a remembered set implemented as a sequential store buffer goes through three steps to insert a new element:

  • determine whether the buffer has enough space left;

  • write the new element into the buffer;

  • increment the buffer pointer.

If a no-access guard page is placed at the end of the buffer, the write barrier can omit both the space check and the call to the overflow-handling subroutine. Since write barriers execute very frequently and their code may be inlined in many places, the guard page both speeds up the mutator and reduces code size.
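A sketch of the resulting barrier in C (names assumed); the protection fault on the guard page stands in for the explicit overflow test:

```c
/* The buffer's last slot abuts a no-access guard page, so the barrier is
 * just a store and a bump; when the buffer fills, the store faults and
 * the trap handler empties or extends the buffer before resuming. */
extern void ***ssb_next;            /* next free slot in the buffer */

static inline void remember(void **field) {
    *ssb_next++ = field;            /* may fault on the guard page */
}
```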

Some systems use the same approach to detect stack or heap overflow, by placing a guard page at the end of the stack (or heap).

The best way to detect stack overflow is to touch the farthest word of the new stack frame as soon as a subroutine begins executing. When the guard-page trap fires, the instruction pointer is then at a known point, so the trap handler can grow the stack, either by reallocating it or by adding a new stack segment and adjusting the frame pointer, and then resume the mutator.

Similarly, with sequential allocation buffers the allocator can touch the last word of the new object before performing the allocation; if that word falls in the guard page at the end of the buffer, the allocator takes a trap.
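A corresponding allocation sketch (bump pointer and names assumed):

```c
#include <stddef.h>

extern char *bump_ptr;     /* next free byte; buffer ends at a guard page */

/* Probe the last byte of the would-be object first: if it lies in the
 * guard page, the read faults and the handler refills the buffer. */
static void *allocate(size_t bytes) {
    volatile char *probe = bump_ptr + bytes - 1;
    (void)*probe;                      /* may fault on the guard page */
    void *obj = bump_ptr;
    bump_ptr += bytes;
    return obj;
}
```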

In either case, if a new stack frame or object is so large that its farthest word might skip past the guard page, the system must still use an explicit bounds check. But such huge frames and objects are rare in most systems, and large objects usually take long enough to initialise and use that this hides the cost of the explicit check.


No-access page protection can also be used to obtain a large logical address space within a smaller virtual address space; the Texas persistent object store is one example.

Although that strategy was designed for data persistence (the heap's data survives from one run of the program to the next), the techniques involved apply equally to non-persistent settings such as garbage collection.

The system works in terms of pages, each the same size as a virtual memory page or a power-of-two multiple of it. It keeps a table recording the state of every logical page; besides its address in (virtual) memory, each page also has a definite place in a managed swap file that the system maintains on disk. A page is always in one of four states:

  • Unallocated: the page is empty and has not yet been used.
  • Resident: the page's data is in memory and accessible; its counterpart in the swap file on disk need not exist.
  • Non-resident: the page's data is on disk and cannot be accessed directly.
  • Reserved: the page's data is on disk and cannot be accessed directly, but virtual address space has been reserved for it.

A newly created page starts out resident, and the system assigns it a fresh logical address (unrelated to its virtual memory address). As virtual memory fills up, some pages must be evicted to disk. Saving a page works in terms of logical addresses: the system must convert every pointer in the page into a longer logical address, so a page's on-disk form is generally larger than its in-memory form.

In the literature this process is called unswizzling; it requires that the system be able to find every pointer in a page precisely.

Once a resident page has been evicted, its state becomes reserved, and the system further sets its virtual address range to no-access. From then on, any access to the evicted page triggers a page-protection trap, and the trap handler reloads the page into memory.

If the system needs to reuse the virtual address space of a reserved page, it must first ensure that no resident page refers to it. To achieve this, it can evict every resident page that refers to the page, after which the page's state can be changed to non-resident and its address space reused.

Note that a resident page may refer only to resident or reserved pages; it can never refer directly to data in a non-resident page.


To load a page, the system looks up its logical address in the table and reads it in from disk, then traverses the logical addresses inside it and converts them into shorter virtual addresses (a process called pointer swizzling). References to resident or reserved pages can be converted directly by consulting the table.

References to non-resident pages, however, require the system first to reserve virtual address space for each target page (moving it from non-resident to reserved) before the logical addresses can be converted into virtual ones. Assigning virtual addresses to these newly reserved pages may require evicting other pages, which may in turn set the evicted pages to non-resident and reclaim their virtual address space.

11.11 Choice of Heap Size

Other things being equal, the larger the heap, the higher the mutator's throughput and the lower the cost of garbage collection.

In some cases, however, a smaller heap may improve the mutator's locality and reduce TLB misses, and so improve mutator throughput.

Moreover, if the heap is too large for physical memory to hold, the program becomes prone to thrashing, especially during garbage collection.

What counts as "small enough" usually varies with the runtime system as well as the operating system, so there are strategies by which automatic memory managers adjust the heap size. (Notes omitted here.)

Besides adjusting the heap size, strategies for reducing a program's physical memory footprint include evicting some pages to disk (as in the bookmarking collector) and saving rarely accessed objects out to disk.

Appendix

[1] The Garbage Collection Handbook: The Art of Automatic Memory Management, Richard Jones, Antony Hosking, Eliot Moss; Chinese edition translated by Wang Yaguang and Xue Di.
[2] Understanding the Java Virtual Machine: JVM Advanced Features and Best Practices, Zhou Zhiming.
