Articles

Serialize a binary tree in minimal space

Natural serialization

The natural way to serialize an arbitrary binary tree is to simply traverse the tree: when a node is encountered, store the value of the node; when a branch is encountered, store one bit that records whether the branch leads to a leaf.

Theorem: If a binary tree has n nodes, and v bits are needed to store the value of each node, then the size of the natural serialization is s(n) = nv + 2n bits.

Proof by Induction: Let s(n) be the number of bits needed to serialize an arbitrary binary tree with n nodes. Let v be the number of bits required to store the value of a single node. For a tree with n = 1 node, s(1) = v + 2, since the tree has one value and two leaves. For any binary tree with 1 <= n nodes, if s(n) = nv + 2n, then s(n+1) = s(n) + s(1), because adding another node doesn’t remove any existing nodes or branches, but does add one node. Consequently, s(n+1) = (n+1)v + 2(n+1). Therefore, by induction, s(n) = nv + 2n for all natural numbers 1 <= n.

The natural serialization algorithm requires n v + 2 n bits of storage space to serialize an arbitrary tree with n nodes. However, this is not the best we can do. It is possible to reduce the serialization size further, with the help of the Catalan numbers!

Indexed serialization

What if, instead of using 2n bits to store the structure of each binary tree with n nodes, we instead store an index value that uniquely represents each tree? If we start counting trees at i = 0, then this index value is guaranteed to be the smallest possible serialization of an arbitrary tree structure. But is this index smaller than the natural serialization—or merely equivalent to it? And, if the index is smaller, then how much smaller is it?

Theorem: The number of unique binary tree structures that can be constructed using n nodes is C(n), where C(n) is the nth Catalan number.

Proof by Induction: Let f(n) be the number of unique tree structures with n nodes, and let C(n) be the nth Catalan number. For n = 0 nodes, only one tree is possible (the tree with a leaf as the root), so f(0) = 1 = C(0). For any integer 0 <= n, we can build all possible tree structures with n + 1 nodes by placing one node at the root, and then using the remaining nodes to build all possible subtrees. Consequently, f(n+1) = Sum[i=0..n] f(i)*f(n-i) = C(n+1). If f(m) = C(m) for all integers 0 <= m <= n, then this is a recurrence relation for the Catalan numbers, so f(n+1) = C(n+1). Therefore, by induction, f(n) = C(n) for all integers 0 <= n.

There are C(n) possible structural configurations of a binary tree with n nodes, so the largest index value that we might need to store is C(n) - 1. That means storing the index value could require up to 1 bit for n <= 2, or ceil (log_2 (C(n) - 1)) bits for n > 2.

Asymptotically, the Catalan numbers grow as:
C(n) ~ (4^n)/(n^(3/2)*sqrt(pi))

So the number of bits needed to serialize a tree structure with n nodes is approximately:
log_2(C(n)) ~ 2n - (3/2)*(log_2 n) - (1/2)*(log_2 pi)

This is a significant improvement over the natural serialization algorithm! The natural serialization algorithm consumes a full 2n bits of storage space, whereas the index serialization uses the theoretical minimum amount of storage space, which is roughly (3*(log_2 n) + (log_2 pi)) / 2 bits smaller than the natural serialization.

This allows us to take one binary tree structure out of 4.3 billion, and store it uniquely as a 32-bit unsigned integer value!

Implementation

In order to implement this indexed serialization, we need to define a specific bijection from the whole numbers to the tree structures with n nodes. We can do this by creating a generating algorithm that outputs each possible tree structure exactly once—and then halts. We can label the trees sequentially, as they are produced by the generating algorithm, starting with i = 0 for the first tree and ending at i = C(n) - 1 for the last tree.

To make the generating algorithm clearer, we’ll establish some useful definitions first:

Definition: A binary tree is either a leaf, or an ordered pair of trees (left_tree, right_tree).

According to this definition, a leaf is an empty tree with no children. In many programming languages, a leaf is implemented as a null pointer.

Definition: A list is an ordered sequence of elements, A = [a_0, a_1, ..., a_n].

Definition: The sum, +, of two lists is given by simple concatenation: [a_0, ..., a_m] + [b_0, ..., b_n] = [a_0, ..., a_m, b_0, ..., b_n].

Definition: The cross product, X, of an element and a list is e X [a_0, ..., a_n] = [(e, a_0), ..., (e, a_n)], or [a_0, ..., a_n] X e = [(a_0, e), ..., (a_n, e)], and the cross product of two lists is [a_0, ..., a_n] X B = Sum[i=0..n] a_i x B.

Note: This cross product of lists, A X B, is a left cross product, because it pairs the first element of A with each element of B before moving on to the next element of A.

If the elements of the list are trees, then (a_i, b_j) is a new tree, with a_i as its left child, and b_j as its right child.

Armed with these definitions, we can now define a generating algorithm:

Definition: A generating algorithm, for n nodes, is any algorithm that generates the list of trees T_n given by:

Each T_n is an ordered list of every possible tree that can be built with n nodes. The index serialization of any tree with n nodes is its zero-based index in the list T_n.

Here are the first few T_n, for n <= 3. You can see that the cross products and summations are actually constructing every possible tree.

T_0= [leaf]
T_1= T_0 X T_0
= [leaf] X [leaf]
= [(leaf, leaf)]
T_2= Sum[i=0..1] T_i X T_(1-i)
= T_0 X T_1 + T_1 X T_0
= [leaf] X [(leaf, leaf)] + [(leaf, leaf)] X [leaf]
= [(leaf, (leaf, leaf))] + [((leaf, leaf), leaf)]
= [(leaf, (leaf, leaf)), ((leaf, leaf), leaf)]
T_3= Sum[i=0..2] T_i X T_(2-i)
= T_0 X T2 + T_1 X T_1 + T_2 X T_0
= [leaf] X [(leaf, (leaf, leaf)), ((leaf, leaf), leaf)] + [(leaf, leaf)] X [(leaf, leaf)] + [(leaf, (leaf, leaf)), ((leaf, leaf), leaf)] X [leaf]
= [(leaf, (leaf, (leaf, leaf))), (leaf, ((leaf, leaf), leaf))] + [((leaf, leaf), (leaf, leaf))] + [((leaf, (leaf, leaf)), leaf), (((leaf, leaf), leaf), leaf)]
= [(leaf, (leaf, (leaf, leaf))), (leaf, ((leaf, leaf), leaf)), ((leaf, leaf), (leaf, leaf)), ((leaf, (leaf, leaf)), leaf), (((leaf, leaf), leaf), leaf)]

Problem: Serialize the tree ((leaf, leaf), (leaf, leaf)).

Solution: The tree ((leaf, leaf), (leaf, leaf)) has n = 3 nodes, so it must be present in the list T_3. Scanning through this list, we find the tree at position i = 2. Therefore, the serialization of the tree ((leaf, leaf), (leaf, leaf)) is the value 2.

We can always serialize a tree with n nodes by generating T_n and searching for our tree in the list. However, this is a rather inefficient process. If possible, we’d like a fast serialization algorithm, that can compute an index directly—without having to construct or manipulate the enormous T_n lists.

Here’s an example of a direct serialization algorithm, written in PHP:

// Catalan numbers: C(0), C(1), ..., C(19)
$C = array(1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012,
           742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190);

// Usage: list($nodes, $index) = serialize_tree($tree);
function serialize_tree($tree) {
	if ($tree === null) {
		return array(0, 0);
	}

	global $C; // Catalan numbers

	list($left_nodes, $left_index) = serialize_tree($tree['left']);
	list($right_nodes, $right_index) = serialize_tree($tree['right']);

	$index = 0;
	for ($i = 0, $n = $left_nodes + $right_nodes; $i < $left_nodes; ++$i) {
		$index += $C[$i] * $C[$n - $i];
	}

	$index += ($left_index * $C[$right_nodes]) + $right_index;
	return array($n + 1, $index);
}

// Usage: $tree = deserialize_tree($nodes, $index);
function deserialize_tree($nodes, $index) {
	if ($nodes < 1) {
		return null;
	}

	global $C; // Catalan numbers

	for ($i = 0, $n = $nodes - 1; $index >= $prior_trees = $C[$i] * $C[$n - $i]; ++$i) {
		$index -= $prior_trees;
	}

	$tree['left'] = deserialize_tree($i, (int)floor($index / $C[$n - $i]));
	$tree['right'] = deserialize_tree($n - $i, $index % $C[$n - $i]);

	return $tree;
}

This is not an example of the most efficient coding style in PHP; this code is merely intended to illustrate the algorithm. In actual practice, this PHP code would be modified to perform well and to handle large trees.