Arrays, Part II: Possible solution: copy-on-write

Array of satellite dishes

Part I is here.

In part I, we established that we want to be able to use arrays in the functional paradigm, and we want doing so to be efficient; but we ran into a problem, because efficient array algorithms require in-place modification. That conflicts with the functional style, and particularly with functional purity.

In this part, I want to start laying out what I think is the solution.

The problem with in-place modification of arrays is one of aliasing. The functional paradigm is based around <a href=""persistent data structures, which are semantically immutable values. Because they are immutable, the distinction between call-by-reference (fast) and call-by-value (easy to reason about) gets collapsed. Persistent data structures can be passed around as pointers, but the semantics is the same as if they were copied. But, unfortunately for us, that means you get pointers to the same array scattered around program memory. Modifying the array in-place through one of them changes the value seen by every other pointer, which breaks the value semantics.

The traditional solution to a problem like this in imperative languages is copy-on-write. You add a copy-on-write bit to the array header, and turn it on when you copy a pointer to the array. Then, before modifying the array in place, you check the copy-on-write bit and, if it’s set, you copy the array to new memory and modify that instead.

This is particularly natural in functional languages like Global Script, because an expression like xn @@ x is specified as returning a new array value. There’s no expectation that it returns the same array pointer that was passed in, but, as long as the xn argument was the only pointer to the array, there’s no violation of the semantics if it does, since there aren’t any pointers left to the old array value. Copy-on-write is an optimization, reusing the storage for xn when there are no other pointers to it.

This works nicely with another array optimization, related to appending. “Traditional” arrays, implemented in languages like Fortran and C, have a fixed size; you have to allocate a new array yourself if you need something longer than the array you have. There’s a simple optimization for growing arrays in-place, which gives us amortized O(1) append behavior, at the cost of occasionally moving the array in memory. (The copies associated to these moves are amortized O(1) because we space them out so that each item added costs us an O(1) portion of a future copy operation.) For mutable arrays – where identity is important, and you can have multiple points to the same mutable array – this requires a memory layout like this:
where every pointer to the same mutable array points at the same header, and moving the data in memory changes the pointer to the data inside the header.

By contrast, with persistent but copy-on-write arrays, we can use this memory layout:
where the data is always stored immediately after the header. If we want, we can still have extra blank space at the end of the data, for the array to grow into, but we can handle that space being missing or exhausted, too. The @@ (or ) function can modify the array in-place provided two conditions are met: the copy-on-write bit is turned off, and the array has extra capacity to grow into. In that case, it uses the first slot in the extra capacity to store the new element, and returns a pointer to the same array that was passed in. If the array is marked as copy-on-write, or it needs to be re-allocated before it can grow, we just allocate a new array (ensuring the new array has extra capacity for the next call to @@ or ), copy the old array into it along with the new element(s), and return a pointer to it. So we get the ability to grow arrays for free, using the same code path as copy-on-write.

(There’s no point in having extra capacity in an array that has the copy-on-write bit turned on, so the garbage collector can remove extra capacity from such an array, or add extra capacity to any array it notices is only the target of one pointer.)

(This ‘nicer’ representation isn’t 100% compatible with the slice optimization proposed in Part I for view (@), but maybe we can pick a representation at compile-time – or even dynamically – depending on how the array variable is used.)

Note that the array returned by @@ never has the copy-on-write bit turned on. The only ‘array constructor’ function that ever returns an array with that bit turned on is nil, and copying that array takes O(1) time anyway. So we can guarantee that a loop that just calls @@ or repeatedly on ‘the same’ array will copy the input array at most once.

In Part III, we will consider when to actually turn on the copy-on-write bit, and propose a flag to allow programmers to get compile-time errors when the copy-on-write strategy will fail for their code.