Serialize a binary tree in minimal space

Natural serialization

The natural way to serialize an arbitrary binary tree is to simply traverse the tree: when a node is encountered, store the value of the node; when a branch is encountered, store one bit that records whether the branch leads to a leaf.

Theorem: If a binary tree has nodes, and bits are needed to store the value of each node, then the size of the natural serialization is bits.

Proof by Induction: Let be the number of bits needed to serialize an arbitrary binary tree with nodes. Let be the number of bits required to store the value of a single node. For a tree with node, , since the tree has one value and two leaves. For any binary tree with nodes, if , then , because adding another node doesn’t remove any existing nodes or branches, but does add one node. Consequently, . Therefore, by induction, for all natural numbers .

The natural serialization algorithm requires bits of storage space to serialize an arbitrary tree with nodes. However, this is not the best we can do. It is possible to reduce the serialization size further, with the help of the Catalan numbers!

Indexed serialization

What if, instead of using bits to store the structure of each binary tree with nodes, we instead store an index value that uniquely represents each tree? If we start counting trees at , then this index value is guaranteed to be the smallest possible serialization of an arbitrary tree structure. But is this index smaller than the natural serialization—or merely equivalent to it? And, if the index is smaller, then how much smaller is it?

Theorem: The number of unique binary tree structures that can be constructed using nodes is , where is the th Catalan number.

Proof by Induction: Let be the number of unique tree structures with nodes, and let be the th Catalan number. For nodes, only one tree is possible (the tree with a leaf as the root), so . For any integer , we can build all possible tree structures with nodes by placing one node at the root, and then using the remaining nodes to build all possible subtrees. Consequently, . If for all integers , then this is a recurrence relation for the Catalan numbers, so . Therefore, by induction, for all integers .

There are possible structural configurations of a binary tree with nodes, so the largest index value that we might need to store is . That means storing the index value could require up to bit for , or bits for .

Asymptotically, the Catalan numbers grow as:

So the number of bits needed to serialize a tree structure with nodes is approximately:

This is a significant improvement over the natural serialization algorithm! The natural serialization algorithm consumes a full bits of storage space, whereas the index serialization uses the theoretical minimum amount of storage space, which is roughly bits smaller than the natural serialization.

This allows us to take one binary tree structure out of 4.3 billion, and store it uniquely as a 32-bit unsigned integer!

Implementation

In order to implement this indexed serialization, we need to define a specific bijection from the whole numbers to the tree structures with nodes. We can do this by creating a generating algorithm that outputs each possible tree structure exactly once—and then halts. We can label the trees sequentially, as they are produced by the generating algorithm, starting with for the first tree and ending at for the last tree.

To make the generating algorithm clearer, we’ll establish some useful definitions first:

Definition: A binary tree is either a leaf, or an ordered pair of trees (leftTree, rightTree).

According to this definition, a leaf is an empty tree with no children. In many programming languages, a leaf is implemented as a null pointer.

Definition: A list is an ordered sequence of elements, .

Definition: The sum, , of two lists is given by simple concatenation: .

Definition: The cross product, , of an element and a list is , or , and the cross product of two lists is .

Note: This cross product of lists, , is a left cross product, because it pairs the first element of with each element of before moving on to the next element of .

If the elements of the list are trees, then is a new tree, with as its left child, and as its right child.

Armed with these definitions, we can now define a generating algorithm:

Definition: A generating algorithm, for nodes, is any algorithm that generates the list of trees given by:

, for all integers .

Each is an ordered list of every possible tree that can be built with nodes. The index serialization of any tree with nodes is its zero-based index in the list .

Here are the first few , for . You can see that the cross products and summations are actually constructing every possible tree.



		, , , ,

Problem: Serialize the tree .

Solution: The tree has nodes, so it must be present in the list . Scanning through this list, we find the tree at position . Therefore, the serialization of the tree is the value .

We can always serialize a tree with nodes by generating and searching for our tree in the list. However, this is a rather inefficient process. If possible, we’d like a fast serialization algorithm, that can compute an index directly—without having to construct or manipulate the enormous lists.

Here’s an example of a direct serialization algorithm, written in PHP:

// Catalan numbers: C(0), C(1), ..., C(19)
$C = array(1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796,
           58786, 208012, 742900, 2674440, 9694845,
           35357670, 129644790, 477638700, 1767263190);

// Usage: list($nodes, $index) = serialize($tree);
function serialize($tree) {
	if ($tree === null) {
		return array(0, 0);
	}

	global $C; // Catalan numbers

	list($leftNodes, $leftIndex) = serialize($tree['left']);
	list($rightNodes, $rightIndex) = serialize($tree['right']);

	$index = 0;
	$n = $leftNodes + $rightNodes;

	for ($i = 0; $i < $leftNodes; ++$i) {
		$index += $C[$i] * $C[$n - $i];
	}

	$index += ($leftIndex * $C[$rightNodes]) + $rightIndex;
	return array($n + 1, $index);
}

// Usage: $tree = deserialize($nodes, $index);
function deserialize($nodes, $index) {
	if ($nodes < 1) {
		return null;
	}

	global $C; // Catalan numbers

	$n = $nodes - 1;

	for ($i = 0; $index >= $priorTrees = $C[$i] * $C[$n - $i]; ++$i) {
		$index -= $priorTrees;
	}

	$tree['left'] = deserialize($i, (int)floor($index / $C[$n - $i]));
	$tree['right'] = deserialize($n - $i, $index % $C[$n - $i]);

	return $tree;
}

This is not an example of the most efficient coding style in PHP; this code is merely intended to illustrate the algorithm. In actual practice, this PHP code would be modified to perform well and to handle large trees.