Jack Morris
Efficient Collections with Copy on Write
4 Jan 2020

Value Types

In Swift, structs are value types. What does that mean in practice? Let's consider the following:

struct Recipe {
  var name: String
  var instructions: String
  var image: UIImage
}

let a = Recipe(
  name: "Omelette", 
  instructions: "Break some eggs",
  image: UIImage(named: "omelette")!)
var b = a  // Copy!

When initializing b from a, copying a, the struct is copied in its entirety, resulting in a copy of each of its stored properties. A full, recursive, copy for any properties that are structs (name and instructions), and a quick copy of the reference for reference-type properties (image, which is an instance of a UIImage, a class).

Now, a and b have their own name/instructions, but share the same underlying UIImage instance for image.

b.name = "A different omelette"
print(a.name) // "Omelette"
print(b.name) // "A different omelette"
print(a.image === b.image) // "true"

Copy on Write

What about collections? Array is a struct, and the idea of copying it in its entirety frequently doesn't sound particularly efficient.

let values1 = Array(repeating: 0, count: 1000000)
var values2 = values1  // Is this copying _everything_ ?? 😱😱

Thankfully, that's not the case. Whilst they still follow the rules above, standard library collections also use an optimization pattern called copy on write.

Internally, Array is backed by a reference-type buffer, so the initialization of values2 just copies this reference. A deep copy of this buffer is only performed when the Array is actually mutated (and only if the buffer is actually shared).

var values1 = Array(repeating: 0, count: 1000000)
values1.append(1)      // No copy here, the buffer isn't shared
var values2 = values1  // values1, values2 share the same buffer
values2.append(2)      // Entire buffer copied before appending!
print(values1)         // "[0,0,...,1]"
print(values2)         // "[0,0,...,2]"

A similar optimisation is implemented for Set and Dictionary.

Implementing Copy on Write

It's important to note that copy on write does not come for free. There's no magic in Swift that delays copying the contents of your struct until a mutation is performed. On a struct copy, all stored properties are copied. But for stored properties that are reference types, copying the reference is very quick).

So how can we implement this ourselves? Let's imagine that we're implementing our own struct to store a collection of Ints. In practice you'd use an Array<Int> for this (and you definitely wouldn't just wrap an NSMutableArray) but bear with me.

struct IntCollection {
  /// Private backing store for our values.
  private let values = NSMutableArray()

  /// Appends a value to this `IntCollection`.
  func append(_ value: Int) {
    values.add(value)
  }

  /// Subscript access, similarly to `Array`.
  subscript(index: Int) -> Int {
    get { return values[index] as! Int }
    set { values[index] = newValue }
  }
}

Right now, our IntCollection doesn't even properly implement value semantics. Performing a copy just copies the values reference, so any copies always share the same underlying NSMutableArray. This isn't the expected behaviour for a struct; each copy should give us a semantically unique instance to work with.

var collection = IntCollection()
collection.append(1)
print(collection[0])  // "1"

var copy = collection
copy[0] = 999
print(copy[0])        // "999"
print(collection[0])  // Also "999"!

One way to fix this would be to copy our values into a fresh NSMutableArray on every mutation, however this results in a load of wasteful copy operations. For our copy on write implementation, we want to copy all of our values into a fresh NSMutableArray, but only if that NSMutableArray is currently being shared by multiple IntCollections. How do we keep track of that sharing?

Thankfully, the standard library gives us isKnownUniquelyReferenced(_:), which serves this exact purpose. It'll return true if it's argument only has a single strong reference. 🧙‍♂️

Let's modify IntCollection to take advantage:

struct IntCollection {
  /// Private backing store for our values.
  private var values = NSMutableArray()

  /// Appends a value to our collection.
  /// Must now be mutating, as it may modify values.
  mutating func append(_ value: Int) {
    copyValuesIfNecessary()
    values.add(value)
  }

  /// Subscript semantics, similar to Array.
  subscript(index: Int) -> Int {
    get { return values[index] as! Int }
    set {
      copyValuesIfNecessary()
      values[index] = newValue
    }
  }

  /// Copies values if it's shared. Must be done before mutation
  private mutating func copyValuesIfNecessary() {
    if !isKnownUniquelyReferenced(&values) {
      values = values.mutableCopy() as! NSMutableArray
    }
  }
}

Now, we get the best of both worlds. Efficient copies (since only our reference to values is copied routinely), and the ability to mutate our copies separately (since a full copy of values will be performed at this point).

var collection = IntCollection()
collection.append(1)
print(collection[0])  // "1"

var copy = collection // A quick operation.
copy[0] = 999         // Full copy performed here.
print(collection[0])  // Still "1".

inout Parameters

There's one interesting point when passing copy on write types as inout parameters. If you check out the Swift docs, inout parameter passing is described as follows:

- When the function is called, the value of the argument is 
  copied.
- In the body of the function, the copy is modified.
- When the function returns, the copy’s value is assigned to 
  the original argument.

This would seem to imply that the following performs a full copy.

func mutate(_ inoutCollection: inout IntCollection) {
  inoutCollection.append(5) // Does this perform a full copy?
}

var collection = IntCollection()
collection.append(1)
mutate(&collection)

Inside mutate(_:), the copy is mutated, and at this point the underlying NSMutableArray is referenced twice. Once by collection, and once by inoutCollection (the copy). Wouldn't copy on write kick in here?

Turns out that the model described in the Apple docs is more of a mental model, rather than a description of the actual implementation. Further docs state:

As an optimization, when the argument is a value stored at a 
physical address in memory, the same memory location is used 
both inside and outside the function body.

Therefore, the initial copy is skipped in certain cases. No copy, no multiple references, and no full copy of the underlying NSMutableArray. Neat!

If you test this scenario, printing out when a full copy takes place, you'll find that the call to mutate will not copy the underlying NSMutableArray. It's as if our IntCollection were truly passed by reference.