Array Slices and Interior Pointers

The D programming language has the concept of a slice. Slices are a neat way to do data processing with arrays but have a fundamental flaw, related to garbage collection, that I’ll explain here. I’ll also present a way that this flaw can be fixed.

This post assumes some familiarity with garbage collection.

Crash Course on Slices

I’m not going to explain slices in all their glory – see this instead. The short story is this: A slice is simply a length + pointer pair. The pointer points to some arbitrary memory and the length specifies how many contiguous elements are at that location in memory. D does not really have arrays in the traditional sense. Consider this code:

int[] xs = [1, 2, 3, 4, 5];

The xs variable is not actually an array. It’s a slice with a length of 5 and a pointer to somewhere in memory. We can now do this:

int[] ys = xs[1 .. 4];

This is what slicing means and where the name slice comes from. The ys variable is now a slice with a length of 3 and a pointer to the second element in xs. Slices are especially beautiful in string processing code since you can pick off parts of a string easily.

The Problem with Slices

First, a thing to note: Slices are passed by value. This means that they are effectively a two-word struct being passed around. This is of course for efficiency reasons, since allocating a heap object for every slice would be unreasonably wasteful.

Now, consider that the pointer part of a slice can point to GC-managed memory. This means that the garbage collector has to scan it. This is not bad in itself, but let’s review the code above:

int[] ys = xs[1 .. 4];

As said, ys’s pointer is now pointing inside xs’s memory block. We have an interior pointer. This forces the garbage collector to not only scan slices conservatively, but also to assume that their pointer part can be an interior pointer. And this is for every single array/slice in a D program.

This is clearly a huge problem for type-precise garbage collection.

The Proposed Solution

Given an arbitrary slice pointing to any non-null memory, we want to be able to immediately tell whether that block of memory is GC-managed without having to do a heap range check on the slice’s pointer.

There is fundamentally only one way to solve this problem (without introducing unacceptable overhead) with two variations. It boils down to storing a pointer to the original memory block in the slice. This means that slices will become 3 words long. This is a significant change because it means an extra word to copy around when slices are passed between functions. On the other hand, I think the advantages it gives in garbage collection complexity (or rather, reduction thereof) outweigh that.

So, previously, slices look like this (assuming a word size of 8):

------  ------------
Offset  Value
------  ------------
0       Length
8       Pointer
------  ------------

In my scheme, slices look like this:

------  ------------
Offset  Value
------  ------------
0       Length
8       Base Pointer
16      Pointer
------  ------------

The base pointer here is the pointer to the start of the memory block. This ensures that the GC can trivially figured out that the memory is live. The pointer is then the (possibly interior) pointer to the relevant part of the memory block as in the old slice representation.

The other variation looks like this:

------  ------------
Offset  Value
------  ------------
8       Base Pointer
16      Offset
------  ------------

Here, offset means the offset into the base pointer to get the actual region of memory that the slice points to. This does mean a little extra work when doing indexing. It also assumes that the location the base pointer points to stores the memory block’s length. This scheme allows slices to work with arrays in the Common Language Infrastructure and other virtual machines. It would not work for plain D that compiles to native code, however.

I hope one of these variations can be worked into the D language. The current situation is highly problematic for garbage collection — and also prevents a D port to the .NET ecosystem — and needs to be addressed.

Comments very welcome – if you spot a flaw in the proposed solution, do say so. If something about the new scheme seems confusing or unsound, speak up too – it probably indicates a flaw. Suggestions for a better solution are also appreciated!

12 thoughts on “Array Slices and Interior Pointers

    • Yes, but those interior pointers are only allowed to exist on the stack or in registers – not in the heap.

      That’s the idea, anyway. If you don’t set forth that rule, precise GC cannot be implemented in a way that brings any gain at all.

      • You are confusing interior pointers with reference uncertainty. Pointers and therefore interior pointers may point inside anything in D, not just heap allocated arrays, irrespective of the slice representation. And while changing the slice representation would certainly reduce the number of interior pointers, it would not eliminate the interior pointers from the heap to arrays, let alone all interior pointers. And tracing interior pointers only represents a possible performance hit, depending on the GC.

        Getting back to reference uncertainty, this is caused by the union of a value and reference type. Take this example:
        union UncertainPointer {
        int myInt;
        int* myPtr;
        }
        which is stored on the heap as:
        —— ————
        Offset Value
        —— ————
        0 myPtr
        0 myInt
        —— ————
        Now, the GC has no idea if the value stored at the 0 offset is actually an int or an pointer and so it must conservatively trace the uncertain reference. If the value is actually an int, it is a false pointer and may keep dead objects alive, a.k.a. a memory leak. In either case the object pointer to can’t be moved (i.e. it becomes pinned), which reduces the effectiveness of a moving GC. Note that the entire stack has to be traced conservatively.

        And regarding porting D to .Net, the issue lies not in D itself, but in the fact D arrays are D slices (and vice versa), while .Net arrays are not slices. This wouldn’t be an issue, except that Microsoft wrote the majority of.Net library functions to take arrays instead of slices. Sadly, this means that slices have become secondary citizens in .Net. So if you want D arrays/slices to inter-operate with .Net libraries you end up with lots of performance killing conversions under the hood. This is a major design flaw with .Net, not D.

        Lastly, I’d point out that 2-words is the limit of in-register data passing for x86, so moving to 3-words has a larger performance hit than you’d normally expect.

      • @Sandford: I’m not confusing anything. I know the difference, but the two concepts are closely related. And yes, interior pointers are allowed in D today, both on the stack, in registers, in globals/TLS, and in the heap, but that’s a terrible design that will effectively block precise GC.

        By the way, if you know of any efficient algorithms for handling interior pointers, do tell. I’m actually interested to see if any possible algorithms can beat the need for precise GC.

        Anyway, the entire point of storing the base pointer in the slice representation is to eliminate the need to scan for interior pointers in the *heap*. Like you said, the stack and registers must still be scanned conservatively and with regard for interior pointers (even Mono does this with its precise GC by default).

        Yes, unions are problematic, but not showstoppers. The compiler simply needs to emit a type info bit saying that the union must be scanned conservatively. No big deal – it’s the exception rather than the rule.

        (By the way, keeping a dead object alive is not necessarily a memory leak; dead objects will be reclaimed once that int’s bit representation changes or its containing object is collected.)

        I think you are quite mistaken in saying that D does not need a change to its slice representation to be ported to the CLI. In an ideal world, the CLI would support slices, but it doesn’t (because they have the interior pointer problems I pointed out here – the CLI does not permit interior pointers in the heap). Regardless of our opinions on that being a design flaw in the CLI (I don’t think it is), the fact of the matter is that D needs to change to be ported to the CLI. You are not likely to convince the CLI designers that arbitrary interior pointers in the heap is a good idea.

  1. (I really hate how I can’t reply to deeply nested comments.)

    Yes, it’d essentially be a “this is undefined behavior” thing. ‘Course, it could come with a language change and some compiler heuristics to try to prevent it.

    • Ah, okay. I was curious because I had read a lot about Rust’s borrowed pointers and was wondering if you were planning something similar :) Though with a type system as messy as D’s, it’d probably be difficult or impossible to integrate something like that.

  2. Why not do it like (0: length, 8: pointer, 16: base pointer), i.e. different order.
    That might easy transition in a number of places as old code can still use slices exactly the same way as before (only you need to make sure you copy the extra pointer now)

  3. D has already been ported to the CLI: http://dnet.codeplex.com/ . Besides, _if_ the CLI doesn’t support interior pointers, then _all_ pointers, including object/class references, must be fattened to include a base pointer. There is nothing special about slices in that regard.

    To generalize, some GCs hide the block info inside the object/memory block itself.Others, like the DMD GC, use internal data structures to store the memory block info. In the latter case, the cost of interior pointers is between negligible and zero. However, in the former case, interior pointers are almost impossible to support.

    • That project has been stalled for years _due_ to slices. And whether the CLI supports interior pointers is not an “if” thing; it’s specifically disallowed in I.8.2.1.1 of Ecma 335. I don’t know why you think that means class pointers must be fat pointers, but it doesn’t.

      • Thank you for clarifying the CLI specification. I did not know it, hence the conditional if. But you can understand my confusion: C++/CLI, the managed version of C++ supports interior pointers. ( See http://www.codeproject.com/Articles/17817/C-CLI-in-Action-Using-interior-and-pinning-pointer for examples).
        As for why class pointers must be fat pointers, I will direct your attention to std.conv.emplace and the casting pointers to objects in general. It is quite possible in D to have a class reference that points to an interior of a memory block. As for pointers in general, its trivial to generate an interior pointer.
        As for D.Net, at the time that the project stalled, you’re right that slices were the cause, although their representation wasn’t the issue: .Net has a slice type which was used to represent all D arrays. However, the .Net libraries were not designed to use them, so under the hood conversions had to be made, This sparked a large and long discussion on adding a separate array type in D. The T[new] syntax was settled upon, but like a lot of ideas it never made into the language.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s