Arrays, Part II: Possible solution: copy-on-write

Part I is here.

In part I, we established that we want to be able to use arrays in the functional paradigm, and we want doing so to be efficient; but we ran into a problem, because efficient array algorithms require in-place modification. That conflicts with the functional style, and particularly with functional purity.

In this part, I want to start laying out what I think is the solution.

The problem with in-place modification of arrays is one of aliasing. The functional paradigm is built around <a href="https://en.wikipedia.org/wiki/Persistent_data_structure">persistent data structures</a>, which are semantically immutable values. Because they are immutable, the distinction between call-by-reference (fast) and call-by-value (easy to reason about) collapses: persistent data structures can be passed around as pointers, but the semantics are the same as if they had been copied. Unfortunately for us, that means pointers to the same array get scattered around program memory. Modifying the array in place through one of them changes the value seen through every other pointer, which breaks the value semantics.

The traditional solution to a problem like this in imperative languages is copy-on-write. You add a copy-on-write bit to the array header, and turn it on when you copy a pointer to the array. Then, before modifying the array in place, you check the copy-on-write bit and, if it’s set, you copy the array to new memory and modify that instead.
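As a concrete illustration, here is roughly what that check-before-write looks like in C. The struct layout and names are my own sketch, not Global Script’s actual runtime representation:

#include <stdlib.h>
#include <string.h>

/* Hypothetical array header with a copy-on-write bit. */
struct array {
    size_t length;
    int cow;        /* set whenever a pointer to this array is copied */
    double *data;   /* element storage */
};

/* Honor the copy-on-write bit before writing in place: if the
 * array may be shared, copy it and write to the copy instead. */
struct array *array_set(struct array *a, size_t i, double x) {
    if (a->cow) {
        struct array *b = malloc(sizeof *b);
        b->length = a->length;
        b->cow = 0;
        b->data = malloc(a->length * sizeof *b->data);
        memcpy(b->data, a->data, a->length * sizeof *b->data);
        a = b;          /* the original array value is left intact */
    }
    a->data[i] = x;     /* safe: no other pointer sees this storage */
    return a;
}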

This is particularly natural in functional languages like Global Script, because an expression like xn @@ x is specified as returning a new array value. There’s no expectation that it returns the same array pointer that was passed in, but, as long as the xn argument was the only pointer to the array, there’s no violation of the semantics if it does, since there aren’t any pointers left to the old array value. Copy-on-write is an optimization, reusing the storage for xn when there are no other pointers to it.

This works nicely with another array optimization, related to appending. “Traditional” arrays, as implemented in languages like Fortran and C, have a fixed size; you have to allocate a new array yourself if you need something longer than the array you have. There’s a simple optimization for growing arrays in place, which gives us amortized O(1) append behavior at the cost of occasionally moving the array in memory. (The copies associated with these moves are amortized O(1) because we space them out – say, by doubling the capacity each time – so that each item added pays for an O(1) portion of a future copy operation.) For mutable arrays – where identity is important, and you can have multiple pointers to the same mutable array – this requires a memory layout like this:

[figure: dynamic-array]

where every pointer to the same mutable array points at the same header, and moving the data in memory changes the pointer to the data inside the header.
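In C terms, that is a header that all aliases share, with the element storage allocated separately so it can move. A sketch, with names of my own choosing:

#include <stdlib.h>

/* Hypothetical mutable dynamic array: every alias points at this
 * header, so the data block can move without breaking anyone. */
struct dyn_array {
    size_t length;
    size_t capacity;
    double *data;   /* separately allocated; realloc may move it */
};

void dyn_append(struct dyn_array *a, double x) {
    if (a->length == a->capacity) {
        /* Double the capacity, so each element added pays an
         * amortized O(1) share of future copies. */
        a->capacity = a->capacity ? 2 * a->capacity : 4;
        a->data = realloc(a->data, a->capacity * sizeof *a->data);
    }
    a->data[a->length++] = x;   /* every alias sees the update */
}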

By contrast, with persistent but copy-on-write arrays, we can use this memory layout:

[figure: copy-on-write-array]

where the data is always stored immediately after the header. If we want, we can still leave extra blank space at the end of the data for the array to grow into, but we can handle that space being missing or exhausted, too. The @@ function can modify the array in place provided two conditions are met: the copy-on-write bit is turned off, and the array has extra capacity to grow into. In that case, it uses the first slot in the extra capacity to store the new element and returns a pointer to the same array that was passed in. If the array is marked as copy-on-write, or it needs to be re-allocated before it can grow, we just allocate a new array (ensuring the new array has extra capacity for the next call to @@), copy the old array into it along with the new element(s), and return a pointer to it. So we get the ability to grow arrays for free, using the same code path as copy-on-write.
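Here is a sketch of that code path in C, assuming a layout where the elements follow the header directly (a flexible array member); the names, the fixed element type, and the doubling policy are my assumptions:

#include <stdlib.h>
#include <string.h>

/* Hypothetical copy-on-write array: elements are stored inline,
 * immediately after the header. */
struct cow_array {
    size_t length;
    size_t capacity;  /* length plus blank space to grow into */
    int cow;          /* copy-on-write bit */
    double data[];    /* inline element storage */
};

/* The @@ code path: append in place when it's safe,
 * otherwise copy-and-grow – one path handles both cases. */
struct cow_array *cow_append(struct cow_array *a, double x) {
    if (!a->cow && a->length < a->capacity) {
        /* Sole owner with spare capacity: reuse the storage. */
        a->data[a->length++] = x;
        return a;
    }
    /* Shared, or out of capacity: allocate a fresh array with
     * room for future appends, and copy everything across. */
    size_t cap = 2 * (a->length + 1);
    struct cow_array *b = malloc(sizeof *b + cap * sizeof b->data[0]);
    b->length = a->length + 1;
    b->capacity = cap;
    b->cow = 0;  /* the result never comes back copy-on-write */
    memcpy(b->data, a->data, a->length * sizeof b->data[0]);
    b->data[a->length] = x;
    return b;
}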

(There’s no point in having extra capacity in an array that has the copy-on-write bit turned on, so the garbage collector can remove extra capacity from such an array, or add extra capacity to any array it notices is only the target of one pointer.)

(This ‘nicer’ representation isn’t 100% compatible with the slice optimization proposed in Part I for view (@), but maybe we can pick a representation at compile time – or even dynamically – depending on how the array variable is used.)

Note that the array returned by @@ never has the copy-on-write bit turned on. The only ‘array constructor’ function that ever returns an array with that bit turned on is nil, and copying that (empty) array takes O(1) time anyway. So we can guarantee that a loop that just calls @@ repeatedly on ‘the same’ array will copy the input array at most once.
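Using the hypothetical cow_append sketch from above, that guarantee falls out directly: only the first iteration can take the copy path, because the copy it produces comes back with the bit cleared:

/* Builds on the cow_array / cow_append sketch above. If 'out'
 * arrives shared (cow set), the first append copies it; every
 * later append sees cow == 0 and runs in place. */
struct cow_array *append_doubled(struct cow_array *out,
                                 const struct cow_array *in) {
    for (size_t i = 0; i < in->length; i++)
        out = cow_append(out, 2.0 * in->data[i]);
    return out;
}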

In Part III, we will consider when to actually turn on the copy-on-write bit, and propose a flag to allow programmers to get compile-time errors when the copy-on-write strategy will fail for their code.

Arrays, Part I: the problem

The way to write map on arrays in the functional style is

'map :: ∀ 'α. ∀ 'β. (α → β) → array.t α → array.t β;
'map 'f 'a = w nil a, where:
    'w 'b nil :- b;
    'w 'b ('x@'a1) :- w (b @@ f x) a1;
;

While this is natural, it’s unfortunately O(n²): the @ view and the @@ function in the last line are both O(n) in the array they’re applied to, because each needs to make a (nearly) complete copy of its input; doing an O(n) copy on each of the n iterations adds up to O(n²) work overall.

One response functional programmers take to that is to give up on arrays and use linked lists instead. Linked lists are easy to build as algebraic data types and fit quite nicely with the functional paradigm:

'map :: ∀ 'α. ∀ 'β. (α → β) → list.t α → list.t β;
'map 'f nil :- nil;
'map 'f ('x@'xn) :- f x @ map f xn;

On the other hand, while lazy linked lists are quite nice, arrays are (arguably) a better translation of the concept of the free monoid into functional programming. An array.t α is either an element of the free monoid over α or non-terminating, mirroring the way record types, function types, and sum types behave, whereas a list.t α could be a finite sequence of αs, an infinite sequence, or a finite sequence of αs followed by undefined.

Furthermore, linked lists use substantially more memory than arrays do (not asymptotically, but a linked list will consume something like 3n pointer words for n elements – roughly a header word, an element pointer, and a tail pointer per cell – vs. something like n + 4 pointer words for an n-element array) and have (potentially) worse cache behavior – something that is becoming more important over time, not less.

This leads some functional programmers to use arrays in spite of their inefficiency. The most popular technique (AFAIK) is to use mutable arrays within the ST monad, which means abandoning the functional paradigm entirely in favor of the imperative one for array manipulations. Fortunately, that can be cleverly hidden within array combinators, but, for Global Script, it would be nice if there were a better approach: some way to do array programming within the functional paradigm.

For view (@), the O(n) problem is easy to fix: replace arrays with slices. Essentially, rather than storing just a pointer to a region of memory, store the region together with an offset (and length) into it. Then view (@) can return a slice whose start offset points one element further into the array than the input’s does.
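A minimal sketch of the slice idea in C (the struct and names are illustrative assumptions, not Global Script’s representation):

#include <stddef.h>

/* A slice: a reference to a region of memory plus an offset and
 * length, instead of a bare pointer to the region. */
struct slice {
    const double *base;  /* underlying storage, never modified */
    size_t offset;       /* index of the slice's first element */
    size_t length;       /* number of elements in the slice */
};

/* The 'rest' part of view (@): O(1), because it just bumps the
 * offset rather than copying n - 1 elements. */
struct slice slice_tail(struct slice s) {
    return (struct slice){ s.base, s.offset + 1, s.length - 1 };
}

double slice_head(struct slice s) {
    return s.base[s.offset];
}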

The @@ function is a harder nut to crack. The only way to implement it in O(1) is to modify the input array in place, but this is obviously impossible in a pure-functional language.

That leaves arrays in an awkward place in functional programming languages: basically a foreign data structure, used for efficiency but not really belonging.

Part II, on copy-on-write, is here.