Lab note #029 Balancing nodes in a unicit tree
Last time, I found balancing nodes seemed to be in conflict with the property of a unicit tree. A balanced tree is desirable for efficient queries. A unicit tree is desirable for Merkle Trees for fast equality comparisons. Was there a way to get both? I noticed this problem was similar to how CRDTs could handle insertions in any order. Could we leverage the properties of a join-semilattice to achieve an unicit, balanced tree?
I'm wary of wasting time to satisfy technical curiosity, as I'm currently deep in a rabbit hole and I'm itching to pop everything off the stack. That said, my gut thought it might be fruitful if it worked.
Using math to constrain the behaviour of software as a design principle is a tool I've reached for before, but only using high school math. I wasn't sure even how to start thinking about approaching the problem other than "join-semilattice". My mind wandered in circles thinking about it in the shower. Putting pen to paper, I wasn't sure where to start.
So I elicited GPT-4's help. It did get some things wrong, which I had to catch and correct, and the solutions it suggests aren't elegant. That said, it was immensely valuable to rubber duck with it. GPT-4 helped me gather the requirements I needed, structured the way to think about the problem, suggested solutions, and clarified questions. It helped me reach for tools and ecosystems I wouldn't have otherwise tried to leverage because the initial time investment would have been too much to be worth it.
I created two GPTs for the task, Jeff Dean Aura, a senior engineer with Jeff Dean's acumen to bounce things off of [^1], and a simpler MathJax Writer, to help translate ASCII renditions of formulas to MathJax without having to comb through documentation. I was able to leverage the properties of join-semilattices as a framework for thinking about the problem to come up with a solution. And I could leverage MathJax without ever having written it before.
Rather than exploding math onto the page, I'll leave the gory details to a full post. But to give a glimpse of the result so far, we
- Defined a multi-split description (MSD) to describe how the entire totally ordered set of keys in the tree is split between leaf nodes.
- Defined a partially ordered set (poset) of MSDs by defining a binary relation (\(\leq\)) to compare two MSDs. Proved this relation is reflexive, antisymmetric, and transitive.
- Defined a merge function to traverse the join-semilattice. Proved this function is commutative, associative, and idempotent. The key to getting this to work is to notice that the nature of a totally ordered set of unique keys takes care of insertion order, and functional balancing will take it home.
There are still large loose ends to tie up. What happens on deletions? Would we need to keep tombstones? Would we need to keep bookkeeping around like in CRDTs or can we throw them away? Can we use hashes as a way to reduce work done during a merge?
With an eye on the time, I'll look at all this and more next week. Merry Christmas!
[^1] Not sure how much like Jeff Dean it is, since I've never worked with him. It was just one of the most effective engineers by reputation that I know about.