Demystifying Garbage Collectors

There seems to be a lot of confusion among developers on how garbage collectors actually work. They really aren’t as magical as some people think; in many ways, some garbage collectors are actually quite crude and — by modern developer standards — evil, unmaintainable, and full of subtle gotchas (the latter is certainly true no matter one’s perspective). In this post, I’ll try to shed some light on how GCs work in a way that (hopefully) any developer can understand. I do assume that the reader is at least familiar with basic computer memory concepts (C knowledge is preferred).

Note: This is a very long blog post. Grab some coffee.

Reachability Analysis

All garbage collectors are essentially based on one fundamental concept, regardless of whether they support reference counting or similar mechanisms: Reachability analysis (also known as liveness analysis, though this is less common and usually more associated with compiler lingo). The idea is that, given a set of roots, the collector will scan the entire heap (the set AllObjects) recursively, constantly adding to a set LiveObjects those objects that it finds to be live. When the scan is complete, the relative complement of LiveObjects with respect to AllObjects (that is, all objects that are in AllObjects but not in LiveObjects) are considered unreachable, and are thus deallocated (garbage collected). This process is typically referred to as a mark-sweep algorithm.

(The set theoretic terminology is quite intentional, by the way; there can be no duplicate elements.)

We can express the above algorithm as:

collect()
    // First find all live objects.
    mark()

    // Then reclaim dead objects.
    sweep()

mark()
    // The root set contains pointers to
    // static fields that in turn *can*
    // contain pointers to objects. They
    // can also simply contain a null
    // value.
    for Field in Roots do
        Object = *Field

        // If not a null pointer, push
        // it into the work list for
        // scanning below. Also mark it
        // as live immediately.
        if Object != null do
            push(LiveObjects, Object)
            push(WorkList, Object)

    // Now start processing the work
    // list. It'll eventually become
    // empty once the entire heap has
    // been scanned.
    while !empty(WorkList) do
        Object = pop(WorkList)

        // Now recursively scan all
        // objects for references and
        // mark them live, push them
        // into the work list, rinse;
        // repeat.
        for Field in Object do
            ReferencedObject = *Field

            if ReferencedObject != null and
                !contains(LiveObjects, ReferencedObject) do
                push(LiveObjects, ReferencedObject)
                push(WorkList, ReferencedObject)

sweep()
    // The relative complement of
    // LiveObjects with respect to
    // AllObjects = dead objects.
    DeadObjects = complement(LiveObjects, AllObjects)

    // Free all dead objects.
    while !empty(DeadObjects) do
        Object = pop(DeadObjects)
        free(Object)

    // Since LiveObjects contains
    // all the relevant objects
    // that are left, we can simply
    // assign it to AllObjects.
    AllObjects = LiveObjects

(This algorithm is a variation of the one Jones uses in his Garbage Collection Handbook – which, by the way, is a fantastic book that I fully recommend if you’re interested in garbage collectors.)

So, to reiterate, what this algorithm does is:

  1. Marks all roots as live.
  2. Populates the initial work list with root objects.
  3. Walks the heap recursively marking reachable objects live.
  4. Frees all dead objects.
  5. Repopulates the AllObjects set with the LiveObjects set.

That’s really all there is to reachability analysis and mark-sweep. Of course, a real GC is more complicated for a number of reasons:

  • Simply maintaining sets of objects is not very efficient at all.
  • Marking is often implemented via bitmaps or flipping bits directly on objects instead of maintaining an expensive set.
  • Scanning thread stacks, thread-local storage (TLS) locations, and registers is much more complex than the above.
  • I omitted allocation code entirely – allocation algorithms make up a lot of the complexity in a GC, particularly because lock-free allocation is desirable.
  • Most collector types will want to pause all threads when starting a collection, and resume them when done.
  • Some GC algorithms will even want to mutate the heap layout (moving and copying collectors).
  • Some concurrent (and sometimes even generational) GC algorithms require read and write barriers to be inserted in generated code by the compiler.
  • In some cases, a GC has to consider that an object reference can not only be null, but also completely invalid even if non-null.

… And many others. But one thing at a time.

Memory Layout

We’ll need to discuss the memory layout of garbage collected objects a bit before getting further into the details of garbage collectors.

The memory layout of collector-managed objects varies a lot depending on the language or virtual machine involved. To give some examples:

  • C: An object consists purely of what has been declared/explicitly allocated.
  • D: An object contains a virtual table, a monitor pointer, and its fields.
  • Mono: An object consists of a virtual table, a synchronization block pointer, and then its fields.
  • MCI: An object consists of a type info pointer, a word of GC-specific bits, a user-modifiable reference field, and its regular fields.

The reason C doesn’t throw any hidden fields into objects is, of course, because of its very close-to-the-metal memory model. This means that the collector cannot (trivially, at least) store data before an object’s contents.

In D, things are quite different: Since the language is built with GC as the default memory allocation mechanism, the compiler reserves two words of memory in the header of all objects. The first is a simple pointer to a virtual table (i.e. your run-of-the-mill virtual function dispatch table). This virtual table also happens to point to a type information structure that the GC can make use of. The second is a pointer to a so-called monitor. This is just fancy terminology for a mutex. This monitor structure, though, happens to also be used by the GC for registering object finalizers (more on that further into the post). So, all in all, the D compiler helps the GC a lot by providing type information and places to store information.

In the Mono virtual machine, things are quite similar to D. The first word of an object is a pointer to a virtual table, which contains a type information structure and function addresses like in D. The second word points to the synchronization block. This one is a bit of a beast: It’s used to store a monitor (the entire thing, not just a mutex pointer), cache hash values (necessary for moving/copying collectors), and a few other things. For objects that don’t implement GetHashCode, their hash is calculated from their address. This poses a problem when objects are moved around in memory (and thus change addresses). The purpose of the cached hash field is thus to remember the hash value even when the object is moved. This is a reasonable approach (really, the only not-incredibly-complicated one) because, at worst, some objects will have equal hash codes occasionally.

In MCI, the design of object headers is quite different from other virtual machines. Since the VM has no particular support for virtual function dispatch (but certainly lets you implement it in the code you pass to it), the first word simply points to a type information structure (which also contains a bitmap of the physical type layout – used for faster scanning). The second word is completely opaque and the use is specific to the garbage collector that the VM is started with. The third word is a user-modifiable field that can contain any reference type (managed objects, arrays, and vectors all allow this). This field is always scanned by the GC. The motivation for that was to make implementation of some higher-level language features like virtual dispatch easier.

In general, in an environment where objects contain information usable by the GC, the GC is going to be more efficient and/or more precise. The GC will have an even better time if the compiler emits load/store GC barriers (not to be confused with memory barriers). More on those and type precision further into the post.

Threading

For now, we’ll assume concurrent garbage collectors don’t exist (which is not the case, but ignorance is bliss). Let’s look at how so-called stop-the-world (STW) collectors work.

For non-concurrent, STW collectors, it is absolutely essential that all non-collector threads are stopped before a collection starts. This is so because if other threads were running while the collector is scanning the heap, it might miss objects that were recently made live and end up freeing them, which is clearly a Bad Thing (TM). For moving and copying collectors, things could get even worse – some parts of the heap (or roots) might end up with lingering references!

So, a non-concurrent GC generally performs a collection like this:

  1. A thread attempts to allocate memory, but none is available.
  2. The collector hijacks the allocating thread.
  3. All other threads are suspended (typically via POSIX signals; Windows has a SuspendThread function).
  4. A collection is performed.
  5. Memory is allocated, either via newly freed memory or by asking the OS for more memory.
  6. All other threads are resumed (again, POSIX signals; ResumeThread on Windows).
  7. Any pending finalizers are executed (either on the hijacked thread or on a dedicated asynchronous thread).
  8. The GC returns the hijacked thread to the allocation site, with the newly allocated memory.
  9. The program continues on…

There are some things worth noting here: First, step 7 may not actually return any memory; if the GC failed to allocate memory one way or another, it would likely throw an exception or return a null pointer. Second, in step 3, the GC may actually skip some threads that it uses internally. This is the case for collectors that do parallel marking of the heap. Third, never mind finalization for now. We’ll get to it.

Probably the most prominent criticism of STW collectors is exactly that: They’re STW. There are several problems with this property:

  • Programs can stop at (practically) arbitrary locations/times.
  • Pause times are unbounded (for most collectors – real time collectors do exist, but are rare).
  • The programmer must take care that low-level threading code cannot be disrupted by pauses.

All of these make concurrent collectors generally more preferable than STW collectors, but concurrent collectors are significantly harder to implement – we’ll also get to that.

Finally, regardless of what collector type is in use, whenever a thread is started, the collector must be notified in such a way that it can add the thread’s TLS areas to its roots (and remove them when the thread goes down). Some languages and virtual machines take this to a whole other level; for example, the Erlang VM exclusively does per-process garbage collection, which significantly speeds up collection overall (keep in mind that Erlang processes are extremely small). Erlang can get away with this because there is no global state that needs to be garbage collected. It is very likely that the Rust language will go with a similar model.

Finalization

There is often a need to perform some kind of user-defined cleanup when an object is collected. Most garbage collected languages have features that allow this in some form or another.

C#:

class Foo
{
    ~Foo()
    {
        PerformCleanup();
    }
}

D:

class Foo
{
    ~this()
    {
        performCleanup();
    }
}

Rust:

struct Foo
{
    drop
    {
        perform_cleanup();
    }
}

Some languages call them destructors, but the universal term used in GC lingo is finalizer; it finalizes the object’s lifetime.

Exactly when a finalizer runs is very specific to the language or virtual machine. To name a few approaches that exist:

  • An object is finalized whenever nothing has references to it. This means that two objects pointing to each other can create a cycle in which no finalizers are invoked on the two objects, ever. Worse yet, even an object holding a pointer to itself can prevent finalization.
  • An object is finalized whenever nothing has references to it, except itself. This is a variation of the above aforementioned approach which is slightly safer. Cycles between multiple objects are still problematic.
  • An object is finalized whenever nothing has references to it, except itself or other dead objects. This permits both cycles and self pointers.

The last approach is arguably the safest, but it does also mean that a finalizer cannot assume anything about the state of other objects that the finalizer’s object refers to. This is by far the most common kind of finalization used in programming languages, due to its safety traits.

As I mentioned earlier, finalizers are executed just before the GC transfers control of the calling thread back to the application. Again, depending on the language and virtual machine, finalizers are executed either on the calling thread or on an asynchronous finalization thread (or multiple, in some cases). For implementations where finalizers are executed on the calling thread, the finalized objects can be trivially deallocated right after finalization. For those that enqueue finalizers on a separate thread, things go like this:

  1. The object is enqueued on a separate thread.
  2. The GC resumes program execution.
  3. The finalization thread finalizes the object.
  4. The object is deallocated.

If the garbage collector implements the third (and safest) finalization technique mentioned before, it can choose to do some extra work to make finalizers less prone to looking at objects with an invalid state (because they have been finalized before). This is fairly trivial: Given a finalizable object, go over all of its reference fields and set them to a null value. This will in all likelihood result in some kind of null reference exception, should the finalizer attempt to access these fields, thereby making it clear to the programmer what went wrong. This is not commonly done, however; for instance, the Java and C# GCs do not do this.

Finally, consider this piece of C# code:

static Foo F;

class Foo
{
    ~Foo()
    {
        F = this;
    }
}

When Foo’s finalizer is executed, it assigns itself to a global variable, effectively making itself reachable again – this is called resurrection. In most implementations, this makes the object perfectly reachable and live, but when the object again becomes unreachable, its finalizer will not run again. Typically, an implementation will provide a function to register an object for finalization again for these cases.

That’s it. Finalization is actually rather simple at its core; the only confusing aspect of it is that everyone does it differently in some way. I’m of the opinion that the third finalization approach is always the right one, but changing implementations to use that once they’ve already settled on another technique would of course be a very breaking change…

Weak References

When memory is scarce, a common technique in garbage collected languages is to use weak references to refer to non-essential resources. A weak reference is much like a regular reference except that the collector is free to collect the pointed-to object. That may seem insane, but it’s important to note that when the collector deallocates a weakly referenced object, it sets weak references to a null value such that no lingering references will stick around. Objects that are strongly referenced and weakly referenced will not be collected.

When an allocation attempt happens and the GC is out of memory, it can either do a collection cycle, then check if some memory has become free and return that, or it can choose to allocate from the OS immediately. Either way, if no memory is available, it’ll have to consult the OS. Now, the point of weak references is that after having scanned the heap, the collector can choose to deallocate a few weakly referenced objects in order to make space, instead of consulting the OS for more memory.

A common use of weak references is to maintain in-memory caches that are allowed to lose data.

Moving and Copying

One problem that typical mark-sweep collectors face is heap fragmentation. GCs typically allocate and deallocate memory from the OS in pages (which usually translates to 4096 bytes). So, imagine that you allocate a bunch of objects which all get put into a page. All but one of these objects eventually become unreachable and are deallocated. Now you have one object, possibly not bigger than a couple of bytes, keeping an entire page (4096 bytes) alive! This is clearly bad when thousands (or even millions) of objects are allocated over time in a program: Not only is redundant memory reserved, the OS will eventually run out of memory and resort to swapping because it can’t allocate more pages – meanwhile the GC is holding onto lots of pages with plenty of space, but a couple of objects happen to keep them committed.

Most garbage collectors deal with this problem by using heap compaction. Heap compaction algorithms can largely be separated into two categories: Moving algorithms, which move objects around to maximize use of page space, and copying algorithms, which copy between two separate heaps on every collection cycle, automatically gaining compaction. The latter is the one typically used in production garbage collectors today (especially in the young generation of generational collectors).

Describing all of the various compaction algorithms is out of scope for this blog post, but I’ll describe the fundamental ideas behind each category.

Moving compaction is based on simply pushing objects back into the beginning of the heap on every collection cycle. The goal is to make as efficient use of page space as possible to (try to) eliminate gaps. Once done, the collector can choose to either return excess empty pages to the OS or keep them around to serve allocation requests.

The purpose of copying compaction is to be faster than moving compaction at the expense of space. The heap is split up into two so-called semispaces called from-space and to-space. All allocation requests happen into to-space. When no more memory is available and the collector has to do a collection cycle, it flips the roles of the two semispaces; that is, the to-space is now from-space and vice versa. Then, the collector starts scanning the from-space. All objects that are found to be live are copied over to to-space in a simple bump pointer style. All remaining objects in from-space are effectively dead and can be deallocated in bulk. The advantage of this algorithm lies in the copying being very fast due to simply incrementing a pointer in to-space, while the obvious disadvantage is that two equally-sized heaps must be maintained even though one is largely unused until a collection cycle occurs.

It’s clear that both moving and copying algorithms have overhead compared to simple mark-sweep algorithms, but they are essential in long-running applications that must not go down due to a fragmented heap.

The above seems simple enough, but consider that when we copy and move objects, we effectively invalidate all pointers that point to them! This is not an unsolvable problem: The collector must simply update all pointers accordingly. This, as it turns out, is very simple in a type precise environment, while in an uncooperative environment (e.g. in C), it is practically impossible. More on that when we get to type precision.

Pinning

When interoperating with external, non-garbage collected code, it is essential to be able to tell the garbage collector that an object must not be collected even if it appears to be dead. This process is called pinning; that is, the object is pinned in place. The term actually originates from moving and copying collectors, because there it also means that the collector is not allowed to move or copy the object.

Let’s go with an example in pseudocode:

work()
    managed_func_1()
    managed_func_2()

managed_func_1()
    Object = new()

    // Pin the object since it will
    // be hidden in a global variable
    // not scanned by the GC.
    pin(Object)

    unmanaged_func_1(Object)

unmanaged_func_1(Object)
    // The object is logically live
    // because a reference exists,
    // but the GC can't see it, hence
    // the pinning above.
    GlobalVariable = Object

managed_func_2()
    Object = unmanaged_func_2()

    // Unpin the object so the GC
    // knows the object is eligible
    // for collection again.
    unpin(Object)

unmanaged_func_2()
    Object = GlobalVariable

    // Null out the global variable
    // so we don't have a lingering
    // reference hanging around.
    GlobalVariable = null

    return Object

This is a little pathological, but it illustrates the point of pinning: Keeping an object alive when the GC can’t see it.

Type Precision

Type precision is probably one of the most important aspects of modern garbage collectors. Without precise type information, it is impossible to tell whether something in memory is a plain integer or a pointer. Conservative garbage collectors (such as the Boehm-Demers-Weiser collector for C and C++) have no type information to work with, so they must conservatively assume that any integer that could be a pointer to a managed object is a pointer to a managed object. This can result in so-called false pointers that keep dead objects alive. This isn’t a huge problem on 64-bit systems due to the significantly larger address space, but on 32-bit systems, it can be a showstopper for some types of applications.

But that’s not the only use for type information. Moving and copying collectors are, in practice, impossible to implement without type information because they don’t know where to update pointers in the program. If no type information is available, the collector could easily end up overwriting a plain integer, actively altering program semantics. This would obviously (also) be a Bad Thing (TM).

Type information does not have to be completely precise about exactly what kind of data type is in every possible memory location – it suffices to know what memory locations hold managed references (and should thus be scanned and possibly rewritten by the collector). A collector isn’t concerned with any other data types (such as integers and floats), so those can be left out in the information provided by the compiler or virtual machine.

Now, type information makes precise heap scanning trivial, but what about the stack and machine registers? In practice, keeping track of when a reference is in a register is very hard and often not worth the effort, so scanning registers conservatively is a universally acceptable approach. The same goes for all roots (global variables, TLS data, and so on), though providing type information for those would be trivial too.

The stack is where things get interesting: The compiler can work out what slots in a stack frame can contain references when it generates code. It (or a virtual machine, for that matter) can then emit a so-called stack map that the garbage collector can use. The garbage collector will walk the call stack of stopped threads, working out exactly what’s present in the various slots. This makes stack scanning much more precise than usual – consider that, given a 2 MB stack, there are 524288 unique storage slots (on a 32-bit machine). That’s quite a few false pointers if the garbage collector is not precise.

On the other hand, the likelihood of these slots holding values that would actually keep objects alive is relatively low. It’s also worth considering that stack maps consume a lot of space by themselves, so using them may not always be desirable. This is particularly true in a JIT-compiling virtual machine where a lot of data structures already have to be maintained, eating quite some memory.

Concurrency and Barriers

In applications that run on workstations, it is desirable for the program not to have long pause times. This is not a problem for server applications and the like, but a user doesn’t want to sit and wait for an application to react to their input. Concurrent garbage collectors solve this problem by performing garbage collection while the program is running and never actually pausing any threads. This does mean that some processing time is lost due to synchronization between program threads and GC threads, but on the other hand, the application will be much smoother to work with.

Since concurrent GCs run on one or several separate threads from the main program, it is clear that there would not be much to gain if the machine the program is running on doesn’t have SMP. While that wouldn’t prevent a concurrent GC from working with the same semantics as on an SMP machine, it would be significantly slower than just using a plain STW collector. Thankfully, lack of SMP isn’t much of a problem in this day and age.

Probably the biggest problem with concurrent garbage collectors is maintaining collector invariants (that is, conditions that must hold throughout garbage collector logic). Since the program runs concurrently with the collector, it could easily race against the collector, resulting in lingering references and other nasty things. The solution to this problem is GC barriers: These are very small code stubs that are inserted into the program at locations where various memory operations are performed. They perform small, usually atomic, operations that help maintain invariants in the face of concurrent mutation and scanning of the heap.

Barriers require compiler or virtual machine support. One problem with barriers and static compilers is that they essentially form a very strict ABI between the program and the garbage collector, giving little flexibility to run the same program with different collectors without recompiling. JIT-compiling virtual machines are more flexible in this regard because they can simply mutate the intermediate representation of the program to insert barriers just before emitting native code.

What barriers do varies a lot depending on the GC implementation, and where they are inserted varies even more. Most commonly, they can be inserted for memory loads/stores, array loads/stores, field loads/stores, and so on.

To see an example of a real world barrier implementation, take a look at this paper.

Optimization

A lot of people are skeptical of the performance of garbage collectors – and rightfully so. Given all of the above, it’s not hard to imagine how much work an actual GC implementation has to do when performing collections. But, as with any other technology, you must optimize for the technology you’re using, not the technology you wish you were using.

A few guidelines to go by:

  • If your data structure is smaller than 16 (or so) bytes, it probably shouldn’t be an object.
  • Recycle objects that you create many instances of (object pooling).
  • Never allocate during intensive work where throughput is important.
  • Avoid boxing like the plague if you can.
  • Prefer mutable data structures over immutable ones when performance is essential.
  • Prefer a concurrent GC for workstation programs; an STW GC for server programs.
  • Use weak references for non-essential data.
  • Avoid pinning too many objects as it can severely inhibit the garbage collector’s work.
  • Don’t use finalizers if you don’t have to.

Conclusion

While this post is fairly abstract in comparison to actual GC implementations, I hope it has shed some light on how garbage collectors work. If you understand the things in this post, that’s enough to understand how to work with a garbage collector and optimize for it, in practice – you don’t have to know every little implementation detail.

If there’s anything you find unclear in this post or perhaps completely unanswered, please leave a comment! It’s entirely possible that I forgot something due to the lengthy nature of this post.

But, But, Internals?!

Sorry, but actual garbage collector implementations are complex beasts full of clever and subtle optimizations and algorithms. Even describing the algorithms behind a single implementation would have made this blog post — quite literally — 10 times as long as it is now.

If you’re interested in garbage collector implementations, I recommend taking a look at some open source code:

Many others exist out there. These are just the ones I happen to be familiar with.

Also, again, I really recommend the Garbage Collection Handbook.

Three Reasons I Love Functional Programming

I realize a lot of people reading my blog are imperative programmers who haven’t looked into functional programming before. I figured I’d write a bit about why I find functional programming interesting and worthwhile.

You might think that I like functional programming because of first-class functions. You might think that I like it because of purity and immutability. Hell, you might even think that I like it because of natural recursion and tail calls. But no. Below are the three reasons I like the paradigm.

Everything is an expression: Typing return everywhere is a pain in the ass. The fact that the last expression in a function will be its result really cuts down on typing and reading. Additionally, this means that what you would usually call an if statement is actually an if expression. This implies that the last expression in the true/false paths are the result of the if expression. The same goes for match expressions, loop expressions, and so on. This makes it possible to write very clear code compared to the imperative equivalent.

Pattern matching: Basically the switch statement on steroids. Some (me being among them) will argue that this is what the switch statement always should have been, and that the switch statement in its current form is a language design mistake. Being able to match any value against any value makes for much more readable code, as compared to an if/else forest.

Discriminated unions: Ever wanted C unions but with less insanity and more safety? DUs allow you to use any type for your cases and with well-defined behavior. That alone is already great, but DUs truly shine when combined with fluent pattern matching syntax.

Most of the features I described here are present in all functional languages in one form or another, but they are particularly important in F# and Erlang which are the functional languages I take most interest in.

It might surprise some people that I didn’t list first-class functions and immutability. I like those concepts too, but I don’t think they’re what makes functional programming truly beautiful. This is ironic, because they’re probably the most basic traits of a functional programming language. But, in the time I’ve worked with functional languages, I haven’t found those two features to be the most appealing to me (don’t take that the wrong way; they are still great).

.NET and Distributed Concurrency

This is just going to be a bit of a brain dump.

I’ve been talking with my friend from Microsoft for the last couple of days about .NET and concurrency. Me being the language geek that I am, I was arguing that Erlang is, and will continue to be, the better choice for highly concurrent systems. .NET currently does not offer any reasonable equivalent to Erlang’s concurrency model. Sure, we have message-passing frameworks, we have locks, we have concurrent collections, we have the TPL (and TDF), but it all boils down to the same thing: Threads. Threads are evil.

A thread is no better than an OS process. In order to create a thread, a 1 MB stack (on Windows, anyway) needs to be allocated, and the OS needs to schedule the thread like it would schedule a process. This is heavier than it might sound. With Erlang, you can have millions of processes running at almost no cost; with OS processes you’d die at a few thousands.

Well, this happens to be so because Erlang processes are designed to be small. A few hundred bytes is all it takes to run code in an Erlang process. This is huge difference from OS threads/processes. There is one primary reason why Erlang is so economic with process resources: It uses the Actor model. If you’re an OO developer, you could think of actors as objects. In Erlang, to invoke a method on these objects, you’d send a message that the objects process completely isolated from each other, and thus, concurrently. In other words, by using messaging as the means of communication between objects, you get concurrency for free. This is quite different from how we view objects and threads from a language such as C#. In C#, we often have some sort of primary object that runs in a separate thread, or in the thread pool, which takes care of running a bunch of child objects, because it’s simply too expensive and inefficient to spawn too many threads or add too many callbacks to the thread pool. In Erlang, this is not the case; in fact, you’re encouraged to run thousands or even millions of small processes in order to be able to scale your application. There is a clear difference between Erlang and, for example, .NET here: Distributed concurrency is first-class in Erlang.

.NET is a fine platform and I love to use it. But the problem with .NET is that concurrency is something that wasn’t really regarded as being important. Sure, .NET has threads, as well as high-level abstractions on top of them, such as TPL, TDF, PLINQ, and so on, but there is virtually no support in the runtime itself for distributed concurrency. We don’t have location-transparent processes, we don’t have load-balancing, we don’t have lightweight processes, we don’t have continuations, we don’t have code hot swapping, we don’t have fault-tolerance… The list goes on. And it doesn’t look like we’re getting it anytime soon, either. To Microsoft, this apparently seems too “academic” and “unrealistic”. Apparently, Ericsson, Facebook, RabbitMQ, T-Mobile, and Telia are not realistic companies.

I know exactly what the problem is, however. The problem is that Erlang is not popular. If it’s not popular, surely nobody needs it to solve their problems, and it’s just an academic toy. The usual excuse for not implementing something. You can probably guess what I’m getting at: Microsoft has recently done almost nothing innovative when it comes to development tools and frameworks. They spend their time mimicking what the open source community or other companies have provided for years. ORMs have existed for a long, long time, yet Microsoft is still playing the catch-up game with Entity Framework. Cloud services have existed for years, yet Microsoft had to push Azure forward. IoC and extension APIs have existed for almost a decade, yet Microsoft is working on MEF. MVC web frameworks have existed for years, yet Microsoft is working on ASP.NET MVC. The only Microsoft division I can say is truly rolling out new tech is Microsoft Research, with their work on F#, Code Contracts, Pex/Moles, Singularity, etc.

It’s sad. Microsoft doesn’t seem to dare encourage any sort of paradigm shift these days, and shamelessly dismisses new technologies as “academic” if they’re not popular and in wide use. What they don’t seem to realize is that all they’re doing is reinventing wheels and that innovation is what needs to happen. Now.

The Actor Model and Message-Passing

I’m sure people who have been following me and talking to me have noticed how I’ve been looking so much into actor/agent-based concurrency and message-passing – how I’ve been talking about Axum, TPL Dataflow, Erlang, and some of my homebrewn equivalents.

I believe the future of concurrency lies in message-passing. My reason for saying this actually boils down to OOP. In OOP, every method call is a message – both the request and the result.

For example, whenever you do:

var result = calc.Add(40, 2);

You are, in OOP terms, saying:

  1. Prepare a message to inform calc that I want to call its Add method
  2. Add the values 40 and 2 to the message
  3. Send the message to calc
  4. Wait for calc.Add to do its thing
  5. Receive the result through a reply message

Something should become obvious to you when I spell it out like this: Why should the thread be stalled waiting for the result to be sent back before going on with other tasks? Surely it could just go on doing something else while calc finishes doing its very heavy computation.

So the point I’m trying to get across is: Whenever you call a method and whenever you get the value of a property, you’re sending a message (or several). This very simple fact has been largely ignored by mainstream developers and frameworks up until recently. Previously, it was all about trying to parallelize algorithms and using locks to synchronize where you just couldn’t parallelize; this obviously doesn’t scale. There’s also asynchrony. For a long time, people have been writing asynchronous code in the form of lambda expressions or anonymous methods, or even worse, as separate methods, passed to some API that starts an asynchronous operation.

You probably saw this one coming: C# 5.0 pretty much solves this entire problem. With the new async and await keywords, we can easily express asynchronous code in its logical form. However, these two keywords alone don’t give us all the power we need. While they allow us to treat every method call as sending a message and retrieving the result, we have no real way to control how execution happens, at what degree of parallelism, in what synchronization context, etc. This is where the new API in .NET vNext, TPL Dataflow, comes in.

TPL Dataflow allows you to set up a bunch of so-called source and target blocks, and then link them together, so that whenever some data is posted to the initial block, the data will automatically be passed through this “dataflow network”. That’s all nice and fancy, but alone, it isn’t much different from any other message-passing framework. C# 5.0 augments TPL Dataflow with the async and await keywords, so that we can write code like this:

agent.Post(Tuple.Create(40, 2));
var result = await agent.Receive();

This looks a bit more awkward than the first code example above, but the advantage here is that we can set up our agents just the way we want to. We can specify the task scheduler they use, how many messages they process per task, what degree of parallelism they utilize, and so on, giving us absolute control over program flow. For more information, see the docs on the TC Labs page.

F# has solved the problem of asynchrony and concurrency differently, through asynchronous workflows and generic agents. For more about that, check Don Syme’s post on async and parallel design patterns in F#.

The cool thing about actors/agents and message-passing is that we get implicit concurrency without a whole lot of effort. We just write our code the way we naturally would, and watch it scale. I say naturally because in the real world, communication happens through messages – why wouldn’t it in programming, too?

Still, we’re not quite there yet when it comes to language support. Erlang/OTP is probably the single language/platform that has the best support for actors, but we have no means of interacting with Erlang/OTP from .NET currently. There’s Axum too, but given that the language effort has stopped, that’s a no-go.

It really is incredible that the most sane approach to concurrency has been ignored for so long. It seems as though, historically, we went from message-passing to parallel algorithms, and then back to message-passing again. How anyone could have believed that parallel algorithms were going to be more scalable is beyond me, but languages like Erlang and Axum have shown us just how important message-passing is.

A collection of links related to this and previous posts: