# Serialize a binary tree in minimal space

## Natural serialization

The *natural* way to serialize an arbitrary binary tree is to simply traverse the tree: when a node is encountered, store the value of the node; when a branch is encountered, store one bit that records whether the branch leads to a leaf.

**Theorem:** If a binary tree has \(n\) nodes, and \(v\) bits are needed to store the value of each node, then the size of the natural serialization is \(s(n) = n v + 2 n\) bits.

**Proof by Induction:** Let \(s(n)\) be the number of bits needed to serialize an arbitrary binary tree with \(n\) nodes. Let \(v\) be the number of bits required to store the value of a single node.
For a tree with \(n = 1\) node, \(s(1) = v + 2\), since the tree has one value and two leaves.
For any binary tree with \(1 \le n\) nodes, if \(s(n) = n v + 2 n\), then
\(s(n+1) = s(n) + s(1)\), because adding another node doesn’t remove any existing nodes or branches, but does add one node.
Consequently, \(s(n+1) = (n + 1) v + 2 (n + 1)\).
Therefore, by induction, \(s(n) = n v + 2 n\) for all natural numbers \(1 \le n\).

The natural serialization algorithm requires \(n v + 2 n\) bits of storage space to serialize an arbitrary tree with \(n\) nodes.
However, this is *not* the best we can do. It is possible to reduce the serialization size further, with the help of the Catalan numbers!

## Indexed serialization

What if, instead of using \(2 n\) bits to store the structure of each binary tree with \(n\) nodes, we instead store an *index value* that uniquely represents each tree?
If we start counting trees at \(i = 0\), then this index value is guaranteed to be the smallest possible serialization of an arbitrary tree structure.
But is this index smaller than the natural serialization—or merely equivalent to it? And, if the index is smaller, then how much smaller is it?

**Theorem:** The number of unique binary tree structures that can be constructed using \(n\) nodes is \(C(n)\), where \(C(n)\) is the \(n\)th Catalan number.

**Proof by Induction:** Let \(f(n)\) be the number of unique tree structures with \(n\) nodes, and let \(C(n)\) be the \(n\)th Catalan number.
For \(n = 0\) nodes, only one tree is possible (the tree with a leaf as the root), so \(f(0) = 1 = C(0)\).
For any integer \(0 \le n\), we can build all possible tree structures with \(n + 1\) nodes by placing one node at the root, and then using the remaining nodes to build all possible subtrees.
Consequently, \(f(n+1) = \sum_{i=0}^n f(i) f(n-i)\).
If \(f(m) = C(m)\) for all integers \(0 \le m \le n\), then this is a recurrence relation for the Catalan numbers, so \(f(n+1) = C(n+1)\).
Therefore, by induction, \(f(n) = C(n)\) for all integers \(0 \le n\).

There are \(C(n)\) possible structural configurations of a binary tree with \(n\) nodes, so the largest index value that we might need to store is \(C(n) - 1\). That means storing the index value could require up to \(1\) bit for \(n \le 2\), or \(\lceil \log_2 \left( C(n) - 1 \right) \rceil\) bits for \(n > 2\).

Asymptotically, the Catalan numbers grow as:

$$C(n) \sim \frac{4^n}{n^{\frac{3}{2}}\sqrt{\pi}}$$

So the number of bits needed to serialize a tree structure with \(n\) nodes is approximately:

$$\log_2 C(n) \sim 2 n - \frac{3}{2} \log_2 n - \frac{1}{2} \log_2 \pi$$

This is a significant improvement over the natural serialization algorithm!
The natural serialization algorithm consumes a full \(2n\) bits of storage space, whereas the index serialization uses the *theoretical minimum* amount of storage space, which is roughly
\(\frac{3}{2} \log_2 n + \frac{1}{2} \log_2 \pi\) bits smaller than the natural serialization.

This allows us to take one binary tree structure out of 4.3 billion, and store it uniquely as a 32-bit unsigned integer!

### Implementation

In order to implement this indexed serialization, we need to define a specific bijection from the whole numbers to the tree structures with \(n\) nodes.
We can do this by creating a *generating algorithm* that outputs each possible tree structure exactly once—and then halts.
We can label the trees sequentially, as they are produced by the generating algorithm, starting with \(i = 0\) for the first tree and ending at \(i = C(n) - 1\) for the last tree.

To make the generating algorithm clearer, we’ll establish some useful definitions first:

**Definition:** A binary tree is either a `leaf`, or an ordered pair of trees `(leftTree, rightTree)`.

According to this definition, a `leaf` is an empty tree with no children.
In many programming languages, a `leaf` is implemented as a `null` pointer.

**Definition:** A list is an ordered sequence of elements, \(A = [a_0, a_1, \ldots, a_n]\).

**Definition:** The sum, \(+\), of two lists is given by simple concatenation:
\([a_0, \ldots, a_m] + [b_0, \ldots, b_n]\) \(=\) \([a_0, \ldots, a_m, b_0, \ldots, b_n]\).

**Definition:** The cross product, \(\times\), of an element and a list is
\(e \times [a_0, \ldots, a_n] = [(e, a_0), \ldots, (e, a_n)]\), or
\([a_0, \ldots, a_n] \times e = [(a_0, e), \ldots, (a_n, e)]\), and
the cross product of two lists is \([a_0, \ldots, a_n] \times B = \sum_{i=0}^n a_i \times B\).

**Note:** This cross product of lists, \(A \times B\), is a *left* cross product, because it pairs the first element of \(A\) with each element of \(B\) before moving on to the next element of \(A\).

If the elements of the list are trees, then \((a_i, b_j)\) is a *new* tree,
with \(a_i\) as its left child, and \(b_j\) as its right child.

Armed with these definitions, we can now define a generating algorithm:

**Definition:** A generating algorithm, for \(n\) nodes, is any algorithm that generates the list of trees \(T_n\) given by:

- \(T_0 = [\text{leaf}]\)
- \(T_{n+1} = \sum_{i=0}^n T_i \times T_{n-i}\), for all integers \(0 \le n\).

Each \(T_n\) is an ordered list of every possible tree that can be built with \(n\) nodes. The index serialization of any tree with \(n\) nodes is its zero-based index in the list \(T_n\).

Here are the first few \(T_n\), for \(n \le 3\).
You can see that the cross products and summations are actually *constructing* every possible tree.

\(T_0\) | \( = \) | \([\text{leaf}]\) |

\(T_1\) | \( = \) | \(T_0 \times T_0\) |

\( = \) | \([\text{leaf}] \times [\text{leaf}]\) | |

\( = \) | \([(\text{leaf}, \text{leaf})]\) |

\(T_2\) | \( = \) | \(\sum_{i=0}^1 T_i \times T_{1-i}\) |

\( = \) | \((T_0 \times T_1) + (T_1 \times T_0)\) | |

\( = \) | \([(\text{leaf}, (\text{leaf}, \text{leaf})), ((\text{leaf}, \text{leaf}), \text{leaf})]\) |

\(T_3\) | \( = \) | \(\sum_{i=0}^2 T_i \times T_{2-i}\) |

\( = \) | \((T_0 \times T_2) + (T_1 \times T_1) + (T_2 \times T_0)\) | |

\( = \) | \([\)\((\text{leaf}, (\text{leaf}, (\text{leaf}, \text{leaf})))\), \((\text{leaf}, ((\text{leaf}, \text{leaf}), \text{leaf}))\), \(((\text{leaf}, \text{leaf}), (\text{leaf}, \text{leaf}))\), \(((\text{leaf}, (\text{leaf}, \text{leaf})), \text{leaf})\), \((((\text{leaf}, \text{leaf}), \text{leaf}), \text{leaf})\)\(]\) |

**Problem:** Serialize the tree \(((\text{leaf}, \text{leaf}), (\text{leaf}), \text{leaf}))\).

**Solution:** The tree \(((\text{leaf}, \text{leaf}), (\text{leaf}), \text{leaf}))\) has \(n = 3\) nodes, so it must be present in the list \(T_3\).
Scanning through this list, we find the tree at position \(i = 2\).
Therefore, the serialization of the tree \(((\text{leaf}, \text{leaf}), (\text{leaf}), \text{leaf}))\) is the value \(2\).

We can always serialize a tree with \(n\) nodes by generating \(T_n\) and searching for our tree in the list.
However, this is a rather inefficient process. If possible, we’d like a *fast* serialization algorithm, that can compute an index directly—*without* having to construct or manipulate the enormous \(T_n\) lists.

Here’s an example of a direct serialization algorithm, written in PHP:

```
// Catalan numbers: C(0), C(1), ..., C(19)
$C = array(1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796,
58786, 208012, 742900, 2674440, 9694845,
35357670, 129644790, 477638700, 1767263190);
// Usage: list($nodes, $index) = serialize($tree);
function serialize($tree) {
if ($tree === null) {
return array(0, 0);
}
global $C; // Catalan numbers
list($leftNodes, $leftIndex) = serialize($tree['left']);
list($rightNodes, $rightIndex) = serialize($tree['right']);
$index = 0;
$n = $leftNodes + $rightNodes;
for ($i = 0; $i < $leftNodes; ++$i) {
$index += $C[$i] * $C[$n - $i];
}
$index += ($leftIndex * $C[$rightNodes]) + $rightIndex;
return array($n + 1, $index);
}
// Usage: $tree = deserialize($nodes, $index);
function deserialize($nodes, $index) {
if ($nodes < 1) {
return null;
}
global $C; // Catalan numbers
$n = $nodes - 1;
for ($i = 0; $index >= $priorTrees = $C[$i] * $C[$n - $i]; ++$i) {
$index -= $priorTrees;
}
$tree['left'] = deserialize($i, (int)floor($index / $C[$n - $i]));
$tree['right'] = deserialize($n - $i, $index % $C[$n - $i]);
return $tree;
}
```

This is *not* an example of the most efficient coding style in PHP; this code is merely intended to illustrate the algorithm.
In actual practice, this PHP code would be modified to perform well and to handle large trees.