SRP and Primes with OpenSSL

So, I’ve been implementing the Secure Remote Password protocol (SRP) for a game project I’m working on. As you can see on the protocol design page, the protocol needs two values, N and g, in order to function. N (the modulus) must be a large, safe prime number and g (the generator) must be a primitive root of N (and thus also a coprime of it) also called a generator modulo N. In typical SRP speak, there is a key size KS which is the byte length of N. This key size is used in other computations in the protocol (such as when generating random numbers of a certain size) so we want it to be relatively large.

In my case, I’m configuring SRP with a hash function H of SHA-512. I want a prime number with a byte length of 256 (i.e. 2048 bits). I’ll set the generator to 7 (the size of the generator really doesn’t matter in practice, but its value does matter in some regard as we’ll see).

I wanted to learn the OpenSSL API and so I figured I might as well generate my prime using OpenSSL. To that end, I wrote a tool that generates a prime with a specified bit length and verifies that a given generator is indeed a primitive root of it.

For example:

$ pgen 7 2048
N = 30817315993488094950063876611235224995197015486777838966039582184466989601737137785061150319729619861304023701874943954720410365155449212378509728297079378161875051028799177659132903881263214197958240731547326531423454481362477841674050050915185206506619708099215010108502376591014173871205694210918500927079807342355799375661000520374063019571782387361197808323622083769298736910284318017757455603510572185239046633686272701011272606841336216587476945751457343136023480297100196085405628980635988198354477091187313063439790470202359125187362725727097509333399354569332837280881731107660517239937731269477580686853163
g = 7
gcd(g, N) = 1

OpenSSL verifies that the final value of N is probably (i.e. very likely; see the documentation for BN_generate_prime) a prime (that is, it is not composite). The last line showing the result of applying the GCD operation on N and g verifies that g is indeed a generator modulo N (see here for why).

With N and g defined, all random numbers are generated with a byte length of KS such that they are as long as N. It is worth noting that both ends should agree on how the return value of the hash function H is interpreted (consider that the hash function most likely returns a raw byte array). The typical way is to interpret it as an unsigned, little-endian integer with the same byte length as the digest. Everything else in the protocol follows mathematical convention and is trivially implemented.

Demystifying Garbage Collectors

There seems to be a lot of confusion among developers on how garbage collectors actually work. They really aren’t as magical as some people think; in many ways, some garbage collectors are actually quite crude and — by modern developer standards — evil, unmaintainable, and full of subtle gotchas (the latter is certainly true no matter one’s perspective). In this post, I’ll try to shed some light on how GCs work in a way that (hopefully) any developer can understand. I do assume that the reader is at least familiar with basic computer memory concepts (C knowledge is preferred).

Note: This is a very long blog post. Grab some coffee.

Reachability Analysis

All garbage collectors are essentially based on one fundamental concept, regardless of whether they support reference counting or similar mechanisms: Reachability analysis (also known as liveness analysis, though this is less common and usually more associated with compiler lingo). The idea is that, given a set of roots, the collector will scan the entire heap (the set AllObjects) recursively, constantly adding to a set LiveObjects those objects that it finds to be live. When the scan is complete, the relative complement of LiveObjects with respect to AllObjects (that is, all objects that are in AllObjects but not in LiveObjects) are considered unreachable, and are thus deallocated (garbage collected). This process is typically referred to as a mark-sweep algorithm.

(The set theoretic terminology is quite intentional, by the way; there can be no duplicate elements.)

We can express the above algorithm as:

collect()
    // First find all live objects.
    mark()

    // Then reclaim dead objects.
    sweep()

mark()
    // The root set contains pointers to
    // static fields that in turn *can*
    // contain pointers to objects. They
    // can also simply contain a null
    // value.
    for Field in Roots do
        Object = *Field

        // If not a null pointer, push
        // it into the work list for
        // scanning below. Also mark it
        // as live immediately.
        if Object != null do
            push(LiveObjects, Object)
            push(WorkList, Object)

    // Now start processing the work
    // list. It'll eventually become
    // empty once the entire heap has
    // been scanned.
    while !empty(WorkList) do
        Object = pop(WorkList)

        // Now recursively scan all
        // objects for references and
        // mark them live, push them
        // into the work list, rinse;
        // repeat.
        for Field in Object do
            ReferencedObject = *Field

            if ReferencedObject != null and
                !contains(LiveObjects, ReferencedObject) do
                push(LiveObjects, ReferencedObject)
                push(WorkList, ReferencedObject)

sweep()
    // The relative complement of
    // LiveObjects with respect to
    // AllObjects = dead objects.
    DeadObjects = complement(LiveObjects, AllObjects)

    // Free all dead objects.
    while !empty(DeadObjects) do
        Object = pop(DeadObjects)
        free(Object)

    // Since LiveObjects contains
    // all the relevant objects
    // that are left, we can simply
    // assign it to AllObjects.
    AllObjects = LiveObjects

(This algorithm is a variation of the one Jones uses in his Garbage Collection Handbook – which, by the way, is a fantastic book that I fully recommend if you’re interested in garbage collectors.)

So, to reiterate, what this algorithm does is:

  1. Marks all roots as live.
  2. Populates the initial work list with root objects.
  3. Walks the heap recursively marking reachable objects live.
  4. Frees all dead objects.
  5. Repopulates the AllObjects set with the LiveObjects set.

That’s really all there is to reachability analysis and mark-sweep. Of course, a real GC is more complicated for a number of reasons:

  • Simply maintaining sets of objects is not very efficient at all.
  • Marking is often implemented via bitmaps or flipping bits directly on objects instead of maintaining an expensive set.
  • Scanning thread stacks, thread-local storage (TLS) locations, and registers is much more complex than the above.
  • I omitted allocation code entirely – allocation algorithms make up a lot of the complexity in a GC, particularly because lock-free allocation is desirable.
  • Most collector types will want to pause all threads when starting a collection, and resume them when done.
  • Some GC algorithms will even want to mutate the heap layout (moving and copying collectors).
  • Some concurrent (and sometimes even generational) GC algorithms require read and write barriers to be inserted in generated code by the compiler.
  • In some cases, a GC has to consider that an object reference can not only be null, but also completely invalid even if non-null.

… And many others. But one thing at a time.

Memory Layout

We’ll need to discuss the memory layout of garbage collected objects a bit before getting further into the details of garbage collectors.

The memory layout of collector-managed objects varies a lot depending on the language or virtual machine involved. To give some examples:

  • C: An object consists purely of what has been declared/explicitly allocated.
  • D: An object contains a virtual table, a monitor pointer, and its fields.
  • Mono: An object consists of a virtual table, a synchronization block pointer, and then its fields.
  • MCI: An object consists of a type info pointer, a word of GC-specific bits, a user-modifiable reference field, and its regular fields.

The reason C doesn’t throw any hidden fields into objects is, of course, because of its very close-to-the-metal memory model. This means that the collector cannot (trivially, at least) store data before an object’s contents.

In D, things are quite different: Since the language is built with GC as the default memory allocation mechanism, the compiler reserves two words of memory in the header of all objects. The first is a simple pointer to a virtual table (i.e. your run-of-the-mill virtual function dispatch table). This virtual table also happens to point to a type information structure that the GC can make use of. The second is a pointer to a so-called monitor. This is just fancy terminology for a mutex. This monitor structure, though, happens to also be used by the GC for registering object finalizers (more on that further into the post). So, all in all, the D compiler helps the GC a lot by providing type information and places to store information.

In the Mono virtual machine, things are quite similar to D. The first word of an object is a pointer to a virtual table, which contains a type information structure and function addresses like in D. The second word points to the synchronization block. This one is a bit of a beast: It’s used to store a monitor (the entire thing, not just a mutex pointer), cache hash values (necessary for moving/copying collectors), and a few other things. For objects that don’t implement GetHashCode, their hash is calculated from their address. This poses a problem when objects are moved around in memory (and thus change addresses). The purpose of the cached hash field is thus to remember the hash value even when the object is moved. This is a reasonable approach (really, the only not-incredibly-complicated one) because, at worst, some objects will have equal hash codes occasionally.

In MCI, the design of object headers is quite different from other virtual machines. Since the VM has no particular support for virtual function dispatch (but certainly lets you implement it in the code you pass to it), the first word simply points to a type information structure (which also contains a bitmap of the physical type layout – used for faster scanning). The second word is completely opaque and the use is specific to the garbage collector that the VM is started with. The third word is a user-modifiable field that can contain any reference type (managed objects, arrays, and vectors all allow this). This field is always scanned by the GC. The motivation for that was to make implementation of some higher-level language features like virtual dispatch easier.

In general, in an environment where objects contain information usable by the GC, the GC is going to be more efficient and/or more precise. The GC will have an even better time if the compiler emits load/store GC barriers (not to be confused with memory barriers). More on those and type precision further into the post.

Threading

For now, we’ll assume concurrent garbage collectors don’t exist (which is not the case, but ignorance is bliss). Let’s look at how so-called stop-the-world (STW) collectors work.

For non-concurrent, STW collectors, it is absolutely essential that all non-collector threads are stopped before a collection starts. This is so because if other threads were running while the collector is scanning the heap, it might miss objects that were recently made live and end up freeing them, which is clearly a Bad Thing (TM). For moving and copying collectors, things could get even worse – some parts of the heap (or roots) might end up with lingering references!

So, a non-concurrent GC generally performs a collection like this:

  1. A thread attempts to allocate memory, but none is available.
  2. The collector hijacks the allocating thread.
  3. All other threads are suspended (typically via POSIX signals; Windows has a SuspendThread function).
  4. A collection is performed.
  5. Memory is allocated, either via newly freed memory or by asking the OS for more memory.
  6. All other threads are resumed (again, POSIX signals; ResumeThread on Windows).
  7. Any pending finalizers are executed (either on the hijacked thread or on a dedicated asynchronous thread).
  8. The GC returns the hijacked thread to the allocation site, with the newly allocated memory.
  9. The program continues on…

There are some things worth noting here: First, step 7 may not actually return any memory; if the GC failed to allocate memory one way or another, it would likely throw an exception or return a null pointer. Second, in step 3, the GC may actually skip some threads that it uses internally. This is the case for collectors that do parallel marking of the heap. Third, never mind finalization for now. We’ll get to it.

Probably the most prominent criticism of STW collectors is exactly that: They’re STW. There are several problems with this property:

  • Programs can stop at (practically) arbitrary locations/times.
  • Pause times are unbounded (for most collectors – real time collectors do exist, but are rare).
  • The programmer must take care that low-level threading code cannot be disrupted by pauses.

All of these make concurrent collectors generally more preferable than STW collectors, but concurrent collectors are significantly harder to implement – we’ll also get to that.

Finally, regardless of what collector type is in use, whenever a thread is started, the collector must be notified in such a way that it can add the thread’s TLS areas to its roots (and remove them when the thread goes down). Some languages and virtual machines take this to a whole other level; for example, the Erlang VM exclusively does per-process garbage collection, which significantly speeds up collection overall (keep in mind that Erlang processes are extremely small). Erlang can get away with this because there is no global state that needs to be garbage collected. It is very likely that the Rust language will go with a similar model.

Finalization

There is often a need to perform some kind of user-defined cleanup when an object is collected. Most garbage collected languages have features that allow this in some form or another.

C#:

class Foo
{
    ~Foo()
    {
        PerformCleanup();
    }
}

D:

class Foo
{
    ~this()
    {
        performCleanup();
    }
}

Rust:

struct Foo
{
    drop
    {
        perform_cleanup();
    }
}

Some languages call them destructors, but the universal term used in GC lingo is finalizer; it finalizes the object’s lifetime.

Exactly when a finalizer runs is very specific to the language or virtual machine. To name a few approaches that exist:

  • An object is finalized whenever nothing has references to it. This means that two objects pointing to each other can create a cycle in which no finalizers are invoked on the two objects, ever. Worse yet, even an object holding a pointer to itself can prevent finalization.
  • An object is finalized whenever nothing has references to it, except itself. This is a variation of the above aforementioned approach which is slightly safer. Cycles between multiple objects are still problematic.
  • An object is finalized whenever nothing has references to it, except itself or other dead objects. This permits both cycles and self pointers.

The last approach is arguably the safest, but it does also mean that a finalizer cannot assume anything about the state of other objects that the finalizer’s object refers to. This is by far the most common kind of finalization used in programming languages, due to its safety traits.

As I mentioned earlier, finalizers are executed just before the GC transfers control of the calling thread back to the application. Again, depending on the language and virtual machine, finalizers are executed either on the calling thread or on an asynchronous finalization thread (or multiple, in some cases). For implementations where finalizers are executed on the calling thread, the finalized objects can be trivially deallocated right after finalization. For those that enqueue finalizers on a separate thread, things go like this:

  1. The object is enqueued on a separate thread.
  2. The GC resumes program execution.
  3. The finalization thread finalizes the object.
  4. The object is deallocated.

If the garbage collector implements the third (and safest) finalization technique mentioned before, it can choose to do some extra work to make finalizers less prone to looking at objects with an invalid state (because they have been finalized before). This is fairly trivial: Given a finalizable object, go over all of its reference fields and set them to a null value. This will in all likelihood result in some kind of null reference exception, should the finalizer attempt to access these fields, thereby making it clear to the programmer what went wrong. This is not commonly done, however; for instance, the Java and C# GCs do not do this.

Finally, consider this piece of C# code:

static Foo F;

class Foo
{
    ~Foo()
    {
        F = this;
    }
}

When Foo’s finalizer is executed, it assigns itself to a global variable, effectively making itself reachable again – this is called resurrection. In most implementations, this makes the object perfectly reachable and live, but when the object again becomes unreachable, its finalizer will not run again. Typically, an implementation will provide a function to register an object for finalization again for these cases.

That’s it. Finalization is actually rather simple at its core; the only confusing aspect of it is that everyone does it differently in some way. I’m of the opinion that the third finalization approach is always the right one, but changing implementations to use that once they’ve already settled on another technique would of course be a very breaking change…

Weak References

When memory is scarce, a common technique in garbage collected languages is to use weak references to refer to non-essential resources. A weak reference is much like a regular reference except that the collector is free to collect the pointed-to object. That may seem insane, but it’s important to note that when the collector deallocates a weakly referenced object, it sets weak references to a null value such that no lingering references will stick around. Objects that are strongly referenced and weakly referenced will not be collected.

When an allocation attempt happens and the GC is out of memory, it can either do a collection cycle, then check if some memory has become free and return that, or it can choose to allocate from the OS immediately. Either way, if no memory is available, it’ll have to consult the OS. Now, the point of weak references is that after having scanned the heap, the collector can choose to deallocate a few weakly referenced objects in order to make space, instead of consulting the OS for more memory.

A common use of weak references is to maintain in-memory caches that are allowed to lose data.

Moving and Copying

One problem that typical mark-sweep collectors face is heap fragmentation. GCs typically allocate and deallocate memory from the OS in pages (which usually translates to 4096 bytes). So, imagine that you allocate a bunch of objects which all get put into a page. All but one of these objects eventually become unreachable and are deallocated. Now you have one object, possibly not bigger than a couple of bytes, keeping an entire page (4096 bytes) alive! This is clearly bad when thousands (or even millions) of objects are allocated over time in a program: Not only is redundant memory reserved, the OS will eventually run out of memory and resort to swapping because it can’t allocate more pages – meanwhile the GC is holding onto lots of pages with plenty of space, but a couple of objects happen to keep them committed.

Most garbage collectors deal with this problem by using heap compaction. Heap compaction algorithms can largely be separated into two categories: Moving algorithms, which move objects around to maximize use of page space, and copying algorithms, which copy between two separate heaps on every collection cycle, automatically gaining compaction. The latter is the one typically used in production garbage collectors today (especially in the young generation of generational collectors).

Describing all of the various compaction algorithms is out of scope for this blog post, but I’ll describe the fundamental ideas behind each category.

Moving compaction is based on simply pushing objects back into the beginning of the heap on every collection cycle. The goal is to make as efficient use of page space as possible to (try to) eliminate gaps. Once done, the collector can choose to either return excess empty pages to the OS or keep them around to serve allocation requests.

The purpose of copying compaction is to be faster than moving compaction at the expense of space. The heap is split up into two so-called semispaces called from-space and to-space. All allocation requests happen into to-space. When no more memory is available and the collector has to do a collection cycle, it flips the roles of the two semispaces; that is, the to-space is now from-space and vice versa. Then, the collector starts scanning the from-space. All objects that are found to be live are copied over to to-space in a simple bump pointer style. All remaining objects in from-space are effectively dead and can be deallocated in bulk. The advantage of this algorithm lies in the copying being very fast due to simply incrementing a pointer in to-space, while the obvious disadvantage is that two equally-sized heaps must be maintained even though one is largely unused until a collection cycle occurs.

It’s clear that both moving and copying algorithms have overhead compared to simple mark-sweep algorithms, but they are essential in long-running applications that must not go down due to a fragmented heap.

The above seems simple enough, but consider that when we copy and move objects, we effectively invalidate all pointers that point to them! This is not an unsolvable problem: The collector must simply update all pointers accordingly. This, as it turns out, is very simple in a type precise environment, while in an uncooperative environment (e.g. in C), it is practically impossible. More on that when we get to type precision.

Pinning

When interoperating with external, non-garbage collected code, it is essential to be able to tell the garbage collector that an object must not be collected even if it appears to be dead. This process is called pinning; that is, the object is pinned in place. The term actually originates from moving and copying collectors, because there it also means that the collector is not allowed to move or copy the object.

Let’s go with an example in pseudocode:

work()
    managed_func_1()
    managed_func_2()

managed_func_1()
    Object = new()

    // Pin the object since it will
    // be hidden in a global variable
    // not scanned by the GC.
    pin(Object)

    unmanaged_func_1(Object)

unmanaged_func_1(Object)
    // The object is logically live
    // because a reference exists,
    // but the GC can't see it, hence
    // the pinning above.
    GlobalVariable = Object

managed_func_2()
    Object = unmanaged_func_2()

    // Unpin the object so the GC
    // knows the object is eligible
    // for collection again.
    unpin(Object)

unmanaged_func_2()
    Object = GlobalVariable

    // Null out the global variable
    // so we don't have a lingering
    // reference hanging around.
    GlobalVariable = null

    return Object

This is a little pathological, but it illustrates the point of pinning: Keeping an object alive when the GC can’t see it.

Type Precision

Type precision is probably one of the most important aspects of modern garbage collectors. Without precise type information, it is impossible to tell whether something in memory is a plain integer or a pointer. Conservative garbage collectors (such as the Boehm-Demers-Weiser collector for C and C++) have no type information to work with, so they must conservatively assume that any integer that could be a pointer to a managed object is a pointer to a managed object. This can result in so-called false pointers that keep dead objects alive. This isn’t a huge problem on 64-bit systems due to the significantly larger address space, but on 32-bit systems, it can be a showstopper for some types of applications.

But that’s not the only use for type information. Moving and copying collectors are, in practice, impossible to implement without type information because they don’t know where to update pointers in the program. If no type information is available, the collector could easily end up overwriting a plain integer, actively altering program semantics. This would obviously (also) be a Bad Thing (TM).

Type information does not have to be completely precise about exactly what kind of data type is in every possible memory location – it suffices to know what memory locations hold managed references (and should thus be scanned and possibly rewritten by the collector). A collector isn’t concerned with any other data types (such as integers and floats), so those can be left out in the information provided by the compiler or virtual machine.

Now, type information makes precise heap scanning trivial, but what about the stack and machine registers? In practice, keeping track of when a reference is in a register is very hard and often not worth the effort, so scanning registers conservatively is a universally acceptable approach. The same goes for all roots (global variables, TLS data, and so on), though providing type information for those would be trivial too.

The stack is where things get interesting: The compiler can work out what slots in a stack frame can contain references when it generates code. It (or a virtual machine, for that matter) can then emit a so-called stack map that the garbage collector can use. The garbage collector will walk the call stack of stopped threads, working out exactly what’s present in the various slots. This makes stack scanning much more precise than usual – consider that, given a 2 MB stack, there are 524288 unique storage slots (on a 32-bit machine). That’s quite a few false pointers if the garbage collector is not precise.

On the other hand, the likelihood of these slots holding values that would actually keep objects alive is relatively low. It’s also worth considering that stack maps consume a lot of space by themselves, so using them may not always be desirable. This is particularly true in a JIT-compiling virtual machine where a lot of data structures already have to be maintained, eating quite some memory.

Concurrency and Barriers

In applications that run on workstations, it is desirable for the program not to have long pause times. This is not a problem for server applications and the like, but a user doesn’t want to sit and wait for an application to react to their input. Concurrent garbage collectors solve this problem by performing garbage collection while the program is running and never actually pausing any threads. This does mean that some processing time is lost due to synchronization between program threads and GC threads, but on the other hand, the application will be much smoother to work with.

Since concurrent GCs run on one or several separate threads from the main program, it is clear that there would not be much to gain if the machine the program is running on doesn’t have SMP. While that wouldn’t prevent a concurrent GC from working with the same semantics as on an SMP machine, it would be significantly slower than just using a plain STW collector. Thankfully, lack of SMP isn’t much of a problem in this day and age.

Probably the biggest problem with concurrent garbage collectors is maintaining collector invariants (that is, conditions that must hold throughout garbage collector logic). Since the program runs concurrently with the collector, it could easily race against the collector, resulting in lingering references and other nasty things. The solution to this problem is GC barriers: These are very small code stubs that are inserted into the program at locations where various memory operations are performed. They perform small, usually atomic, operations that help maintain invariants in the face of concurrent mutation and scanning of the heap.

Barriers require compiler or virtual machine support. One problem with barriers and static compilers is that they essentially form a very strict ABI between the program and the garbage collector, giving little flexibility to run the same program with different collectors without recompiling. JIT-compiling virtual machines are more flexible in this regard because they can simply mutate the intermediate representation of the program to insert barriers just before emitting native code.

What barriers do varies a lot depending on the GC implementation, and where they are inserted varies even more. Most commonly, they can be inserted for memory loads/stores, array loads/stores, field loads/stores, and so on.

To see an example of a real world barrier implementation, take a look at this paper.

Optimization

A lot of people are skeptical of the performance of garbage collectors – and rightfully so. Given all of the above, it’s not hard to imagine how much work an actual GC implementation has to do when performing collections. But, as with any other technology, you must optimize for the technology you’re using, not the technology you wish you were using.

A few guidelines to go by:

  • If your data structure is smaller than 16 (or so) bytes, it probably shouldn’t be an object.
  • Recycle objects that you create many instances of (object pooling).
  • Never allocate during intensive work where throughput is important.
  • Avoid boxing like the plague if you can.
  • Prefer mutable data structures over immutable ones when performance is essential.
  • Prefer a concurrent GC for workstation programs; an STW GC for server programs.
  • Use weak references for non-essential data.
  • Avoid pinning too many objects as it can severely inhibit the garbage collector’s work.
  • Don’t use finalizers if you don’t have to.

Conclusion

While this post is fairly abstract in comparison to actual GC implementations, I hope it has shed some light on how garbage collectors work. If you understand the things in this post, that’s enough to understand how to work with a garbage collector and optimize for it, in practice – you don’t have to know every little implementation detail.

If there’s anything you find unclear in this post or perhaps completely unanswered, please leave a comment! It’s entirely possible that I forgot something due to the lengthy nature of this post.

But, But, Internals?!

Sorry, but actual garbage collector implementations are complex beasts full of clever and subtle optimizations and algorithms. Even describing the algorithms behind a single implementation would have made this blog post — quite literally — 10 times as long as it is now.

If you’re interested in garbage collector implementations, I recommend taking a look at some open source code:

Many others exist out there. These are just the ones I happen to be familiar with.

Also, again, I really recommend the Garbage Collection Handbook.

.NET: Thoughts on Portability

I was recently having a discussion with a few folks over at #wcell on QuakeNet about portability of programming languages (specifically the classic C/C++ versus .NET languages like C# and F#).

One could argue that C/C++ is the most portable because it will compile anywhere as long as it adheres to the standard. But then, C# and .NET are standardized too, are they not? And the standards are open, are they not? Just like C/C++! Now, of course, Microsoft, who came up with the standards, did implement the C# compiler and the .NET runtime on Windows, which then became the “de facto” standard implementations.

But, what hinders anyone from implementing the .NET runtime on any platform? Think about it; the only reason C/C++ is cross-platform is because the standards are implemented on different platforms (albeit with slight deviations at some points).

In the end, you’re still relying on tools native to the platform for your code to work on it. The only difference is that .NET programs can be compiled on Windows and still execute on Linux (for instance), while C/C++ programs need to be compiled on the specific platform. And honestly, which process is the most convenient? I don’t need to point that out, do I? Personally, I absolutely hate having to cross-compile.

Just some food for thought.