Lab note #030 The Trick to Unicity of Prolly Trees

In my previous lab note #028, I was worried that Prolly Trees were not unicit: when I inserted keys in different orders, there was a small chance that splits would propagate to the end of the dataset. I was mistaken. Prolly Trees are unicit (the same data yields the same tree structure, regardless of insertion order).

The advantage of working in public is that the public will point out inconsistencies. In this case, it was the inventor of Prolly Trees.

Twitter isn't a great medium for untangling technical misunderstandings, so Aaron just suggested I check out Dolt's codebase. So where was the disconnect?

I had mistakenly thought chunking started at the beginning of the leaf node that received the new key on insertion. Because of that mistaken assumption, I couldn't square two opposing forces: the chunker depends on the number of examined keys since the last split to control node sizes, but depending on that count meant previous insertions affected the decision to split on the current insertion. Because the splits from previous inserts weren't re-evaluated on every insert, they could affect future splits.

In retrospect, I think I assumed this because in most other types of trees, balancing during insertion mostly happens locally and propagates up to the root. That's not the case here.

What I did understand were two facts:

  1. The chunker is a pure function of the number of examined keys since the last split and the hash of the currently examined key.
  2. The tree contains unique totally ordered keys. Hence, there is only a single way to order a given set of keys.
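As a sketch, such a chunker might look like the following. This is a hypothetical scheme (the constant, the hash, and the threshold schedule are illustrative, not Dolt's actual ones); the important property is that the split decision is a pure function of exactly those two inputs:

```python
import hashlib

TARGET_SIZE = 4096  # illustrative target node size, not Dolt's actual constant

def key_hash(key: bytes) -> int:
    """Deterministic 64-bit hash of a single key."""
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def should_split(keys_since_last_split: int, key: bytes) -> bool:
    """Pure function of (count since last split, hash of the current key).

    Nothing else feeds the decision, so replaying the same ordered key
    stream always reproduces the same split boundaries.
    """
    # Hypothetical threshold schedule: the longer we go without a split,
    # the easier it becomes to split, which concentrates node sizes
    # around TARGET_SIZE (and forces a split once the limit is reached).
    threshold = (2**64) * keys_since_last_split // TARGET_SIZE
    return key_hash(key) < threshold
```

Any schedule works for the argument here, as long as the function consumes only the count and the current key's hash.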

Taken together, this means that for a given set of keys, there is only a single way to split them into leaf nodes. Conceptually, I understood that if I ran the chunker from the beginning of the data in the tree, I'd get a unicit tree.

But surely you don't run the chunker from the beginning of the dataset on every insert? That'd take too long and be too much work. I had summarily dismissed this as a possibility because it seemed wrong. But in fact, that's exactly what happens, at least conceptually.
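A toy demonstration of the conceptual claim (hypothetical chunker, small target size, nothing like the real implementation): because the tree stores unique keys in sorted order, two different insertion orders produce the identical key stream, and a pure chunker therefore produces identical boundaries.

```python
import hashlib
import random

def boundaries(sorted_keys):
    """Chunk an ordered key stream with a pure, hypothetical chunker."""
    out, count = [], 0
    for key in sorted_keys:
        count += 1
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        if h < (2**64) * count // 64:  # toy target node size of 64
            out.append(key)
            count = 0
    return out

keys = [f"key-{i:05d}".encode() for i in range(1000)]
a, b = keys[:], keys[:]
random.shuffle(b)
# The insertion orders differ, but the tree stores keys sorted, so the
# chunker sees the same stream either way and splits in the same places.
assert boundaries(sorted(a)) == boundaries(sorted(b))
```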

However, several observations allow us to skip over doing a lot of work.

  1. The chunker has no memory after a split. Every examined key makes a split more probable. But once the chunker decides to split, the keys examined before that split don't affect how future splits are determined.
  2. We can run independent chunkers at every level of the tree to determine whether we are splitting in the same places as before.

Taken together, this means we can skip over chunking entire sub-trees. If our chunker is making the same split decisions in the left sibling nodes before the insertion path, then our chunker at this level is synchronized. That also means any sub-trees in the left sibling nodes are also synchronized. So we recurse up and do it again, until we find a branch node where we aren't synchronized, and then recurse back down.

As we recurse back down, we only need to feed the chunker keys or entries from the beginning of the last synchronized split, because a chunker has no memory after a split! Hence, we can skip chunking the keys of entire sub-trees that we already know are synced.

The cost seems to be O(M log N), roughly 2M log N operations, where M is the average node size and N is the number of keys. You do need extra allocations for chunkers at every level, but this seems minor. Trees aren't supposed to be too tall, and chunkers don't retain much state.

Another way to think about it: instead of comparing Merkle hashes of sub-trees, you compare their splits to determine whether something has changed.

All of this is obvious in hindsight, but I was missing a key piece of information that would have helped to bridge the gap. How could I have reached this conclusion faster? It's not obvious to me this would be a key part of the design from the high-level posts, and this fact disappears in the details of the code.

I did try using Cursor.io to leverage LLMs to help explain parts of the code to me, but it didn't help with the core insight. Honestly, I think it's helpful to have an IDE that can jump to definitions quickly, so you don't lose the context in your head while reading code. But probably the most helpful thing is to get the code up and running so you can see what it's doing in a debugger.