Design a Garbage Collection System (Part II)

This is the second post of Design a Garbage Collection System series. If haven’t read our first post, please go check it since we’ll continue our discussion from last time.

In our previous post, we’ve been talking about the basic concept of garbage collection, which is a system that automatically recycles unused memory in programming languages. What’s cool about garbage collection is quite obvious. It allows developers care less about memory management and write more robust code. On the flip side, it may affect the performance and provide less flexibility when working with memory.

To continue our discussion, I’ll talk more about the naive mark-and-sweep approach and how we can improve it. I’ll also introduce other common ways to design a garbage collection system.

Naive mark-and-sweep (continued)

The idea of the naive mark-and-sweep approach is quite straightforward. The system does a tree traversal following object references and mark all the visited objects. In the second phase, for all the unreachable objects, free their memory.

Apparently, the approach is easy to understand and implement. So what are the disadvantages?

The most notable problem is that the entire system must be suspended during garbage collection. In other words, once in a while, the problem will be frozen when doing garbage collection and no mutation of the working set can be allowed. Thus, it’ll significantly affect the performance of time-critical applications.

Improvement

Given the performance issue of mark-and-sweep, one modern garbage collection system takes a slightly different approach – Tri-color making.

Let me briefly introduce the algorithm. In a nutshell, the system marks all the objects into three colors:

White – objects that have no reference and should be recycled.
Black – reachable objects that shouldn’t be recycled. Black objects have no reference to white objects.
Gray – objects that are reachable from roots and yet to be scanned for references to white.

Initially, all the objects that are referenced from roots are in gray and the white sets include everything else (black is empty). Each time the system picks an object from gray to black and move all its references from white to gray. In the end, gray becomes empty and all white objects are recycled.

The most notable advantage is that the system can do garbage collection on the fly, which is accomplished by marking objects as they are allocated and during mutation. Thus, the program won’t be halted for long time and performance gets improved.

Reference counting

So what are other ways to design a garbage collection system that won’t freeze the program?

A natural solution is reference counting and the idea is extremely simple. The core concept of garbage collection is when an object has zero reference, we should recycle it as soon as possible. So why not just keep track of the reference count for each object?

The reference counting system will keep a counter for each object that counts the number of references it has. The counter is incremented when a reference to it is created, and decremented when a reference is destroyed. When the counter is 0, the object should be recycled. Obviously, the system can do the garbage collection on the fly since it’s able to release the memory at the right time.

What are the disadvantages of this approach?

Disadvantage of reference counting

Apparently, the reference counter adds space overhead to the whole system. Since every single object needs additional storage for its reference count, the overall space needed can be increased significantly without any optimization.

Another problem is the speed overhead. Since the system needs to keep updating the counter, every operation in the program requires modification of one or more reference counters. Another way to understand this is that instead of freeze the program to recycle objects, reference counting system divides the overhead into every small operation. Since you can’t get everything for free, you need to decide if you want every operation becomes slightly slower or stop the entire program once in a while.

In addition, cycles can also be a problem of reference counting. If two objects reference each other, they will never be recycled. If you have experience with obj-c, you should already know the solution. Some languages introduce the concept of “weak reference” for the back pointers that creates the cycle. It is a special reference object whose existence does not increment the reference count of the referent object.

Summary

Garbage collection is an interesting topic since every programmer should have some basic ideas of it. What’s more, it helps you to understand why some programs were designed in different ways.

For a system design interview, it’s important to note that you don’t necessarily need to have too much prior knowledge. Providing reasonable analysis when facing a new problem is much more important. With that in mind, you can still pass an interview about garbage collection even if you are not an expert in programming languages.

If you want to know more about this topic, I also recommend you take a look at the discussion Why doesn’t C++ have a garbage collector?

The post is written by Gainlo - a platform that allows you to have mock interviews with employees from Google, Amazon etc..