Say you have a collection of elements, and you want to group them into a number of sets (that are each disjoint). One example could be grouping nodes within a graph that are connected by a path of edges.
There are a number of different data structures you could use to do this, but one which I find particularly elegant is a disjoint set.
Our data structure supports three operations:
insert(element)
: Inserts element
, putting it in its own set. You may see this called "makeSet
" elsewhere.union(lhs, rhs)
: Union the two sets containing lhs
and rhs
(which should have been previously inserted).sameSet(lhs, rhs) -> Bool
: Determine whether lhs
and rhs
(which should have been previously inserted) are in the same set. You may see other implementations have a "findSet
" method instead, that returns a "representative object" for the set containing an element (which can be compared with another to see if two elements are in the same set). I prefer going with sameSet
directly.Each set will be represented internally as its own tree. With this representation, the three operations are pretty simple to implement.
insert(element)
element
.union(lhs, rhs)
lhs
.rhs
.sameSet(lhs, rhs) -> Bool
lhs
.rhs
.true
if they're the same, otherwise false
.The above is all we need to get something working, but there are two heuristics that we can use to improve performance.
union
, if one tree is shorter than the other, then move that one. This makes sense if you think about it; we want to try and reduce the length of the path we may need to travel when finding the root of a tree, so we want our trees to be as short as possible.The right tree is much shorter than the left, so root it under the left tree on union.
union
, or sameSet
operation), move the nodes in the path to point at the root. This way, the next time you need to travel the path, there's only be a single hop to go for each node. The rationale here is similar to that for union by rank: we want our trees to be as short as possible.After performing an operation such as sameSet(4, other), all nodes on the path from 4 to the root are moved directly under the root. This makes the paths shorter next time around.
(Here comes the punchline.)
With those two heuristics in place, it can be shown (using a proof that I admittedly don't understand) that any sequence of m
insert
, union
and sameSet
operations can be performed in O(mα(n))
time (where n
is the number of inserted elements). α(n)
is the inverse of the Ackermann function, which grows so slowly that α(<number of atoms in the universe>) < 5
, so this term is effectively a constant.
Therefore, the amortized complexity of each operation is O(1)
; each operation is effectively performed in constant time.
How cool is that? I love how such a simple data structure with a few optimizations results in a time complexity based on a function so weird as the Ackermann function.
An example implementation in Swift is below (or in a Gist, if you prefer). I've tried to thoroughly document the implementation but feel free to reach out if something doesn't make sense.
/// Maintains a number of sets, each containing a number of
/// `Element`s.
///
/// Supports union-ing two sets (using an `Element` contained in
/// each set), as well as determining whether two `Element`s are
/// contained within the same set.
///
/// Note that this is a reference type, rather than a struct
/// with value semantics. This is due to the complexity of
/// copying the `Node`s on write, which whilst possible,
/// detracts from the actual behaviour of the type in this
/// example.
public final class DisjointSet<Element: Hashable> {
// MARK: - Types
/// `Error` thrown when an operation expects than an element
/// has already been inserted with `insert(_:)`.
public struct MissingElementError: Error {}
/// A single node in a set.
///
/// Note that this is a reference type, since each `Node` will
/// be stored in the `nodes` dictionary (to lookup `Node`s by
/// an `Element`), and will be referred to by other `Node`s in
/// its set.
private final class Node {
/// The parent `Node`.
///
/// Will be `nil` if this `Node` is the root node of a set.
var parent: Node?
/// An upper-bound on the number of child nodes of this
/// `Node`.
var rank: Int = 0
}
// MARK: - Properties
/// A mapping of `Element`s to `Node`s, containing a `Node`
/// for each `Element` that has been inserted by `insert(_:)`.
private var nodes: [Element: Node] = [:]
// MARK: - Initializers
/// Initializes a `DisjointSet` that initially contains 0
/// elements.
public init() {}
// MARK: - Public
/// Inserts the specified `element`.
///
/// - Returns: `true` if `element` was inserted, or `false`
/// otherwise (e.g. if `element` has already been inserted).
@discardableResult public func insert(
_ element: Element
) -> Bool {
guard nodes[element] == nil else {
// Element already inserted.
return false
}
nodes[element] = Node()
return true
}
/// Unions the two sets containing `lhs` and `rhs`.
///
/// - Parameters:
/// - lhs: An element contained in the first set.
/// - rhs: An element contained in the second set.
/// - Throws: A `MissingElementError` if either `lhs` or `rhs`
/// has not previously been inserted.
public func union(_ lhs: Element, _ rhs: Element) throws {
guard
let lhsNode = nodes[lhs],
let rhsNode = nodes[rhs]
else {
throw MissingElementError()
}
let lhsRoot = findRoot(node: lhsNode)
let rhsRoot = findRoot(node: rhsNode)
guard lhsRoot !== rhsRoot else {
// `lhs` and `rhs` are already in the same set.
return
}
// Union by rank heuristic.
if lhsRoot.rank > rhsRoot.rank {
// Note: `lhsRoot.rank` is intentionally not incremented
// here. `lhsRoot.rank` is still valid, since it is just
// an upper bound (and the new child, `rhsRoot`, has a
// lower rank).
rhsRoot.parent = lhsRoot
} else if lhsRoot.rank < rhsRoot.rank {
// Note: `rhsRoot.rank` is intentionally not incremented
// here (same reasoning as above).
lhsRoot.parent = rhsRoot
} else {
// `lhsRoot.rank == rhsRoot.rank`. Use `rhsRoot` as the
// new root arbitrarily.
lhsRoot.parent = rhsRoot
// `rhsRoot.rank` needs to be updated here to ensure that
// the upper bound is still valid.
rhsRoot.rank += 1
}
}
/// Returns `true` if `lhs` and `rhs` are members of the same
/// set.
///
/// - Parameters:
/// - lhs: The first `Element` to check for co-membership.
/// - rhs: The second `Element` to check for co-membership.
/// - Throws: `true` if `lhs` and `rhs` are contained in the
/// same set (as a result of a `union(_:_:)` operation.
/// - Returns: A `MissingElementError` if either `lhs` or
/// `rhs` has not previously been inserted.
public func sameSet(
_ lhs: Element,
_ rhs: Element
) throws -> Bool {
guard
let lhsNode = nodes[lhs],
let rhsNode = nodes[rhs]
else {
throw MissingElementError()
}
/// Use `===` to check for instance equality, rather than
/// value equality.
return findRoot(node: lhsNode) === findRoot(node: rhsNode)
}
// MARK: - Private
/// Returns the root `Node` pointed to be `node`, or `node`
/// itself if it is a root node.
private func findRoot(node: Node) -> Node {
if let parent = node.parent {
// Path compression heuristic.
node.parent = findRoot(node: parent)
}
// If `node.parent` is `nil`, `node` is a root node.
// Therefore, return it instead.
return node.parent ?? node
}
}