Lab notes #025 Split condition for Merkle Search Trees
I started outlining how to compose different CRDTs together to make an Immutable Conflict-free Resolution Relations [1] construction, based on the Merkle Search Tree. When designing it, I first tried sketching out the entire construction by drawing, but I was left confused. I wasn't sure if all the pieces actually fit and the larger construction still maintained all the desirable properties.
Taking a different approach, I started writing out the types of the structures and type signatures of the methods that operated on them. This time, I was confident of the details as well as the larger picture.
I've never really designed by types first, though I understood its advantage. The difference this time was that CRDTs had an algebra with properties that I wanted to compose and maintain. Most user-facing software doesn't have that shape, so I never did it before.
When I started construction of the Merkle Search Tree [1], I ran into an underspecified question: What is the condition of a split? The paper glosses over this. Do I take the midpoint between values, or half of the elements? Neither seemed correct, since it would violate the unicity property that was guaranteed by using an MST. The paper wasn't cited much, so maybe this was an oversight? Did I reach an edge?
Asking ChatGPT wasn't much help. It mostly had vague suggestions about using the hash to consistently split a node. Intuitively, I didn't see how this would work. The nodes order the elements by their value, and I couldn't see how this could be maintained while finding a consistent property of the hash to split upon.
Looking at the only nearby reference in the paper, it pointed to Persistent Authenticated Dictionaries [2]. While it opened up an entire category of data structures, it also didn't describe a coherent split strategy with its treap implementation. Perhaps there's something in the tuple implementation to borrow that answers this question.
Searching online for existing implementations, I found two, the first being relatively well-documented.
A cursory glance at the former didn't yield an answer to my question, and it introduced concepts that weren't in the original paper, like pages. Will put a pin in this and dive deeper into reading the code base this next week.
I tried to see if there was anything written on Merkle Search Trees. There frankly isn't a lot, but it was a trodden path. JZhao had already written a survey on it, which led a trail to what seems to be an answer to my question of a split condition.
It proposes that we split based on the hash value. Given an ordered list of keys, we also get a corresponding list of hash of the keys. Every hash in layer 1 has a leading zero. Every hash in layer 2 has two leading zeros, etc. Then with the 2k non-zero bits in this layer of hashes, we should have 2k/B bins available in this layer. So in an ordered list of keys, we simply find the first value with a hash that's less than 2k/B. Then on average, the expected value for the number of keys in a node is B.
Hence, the node to insert in a layer is independent of the key, and only dependent on its hash. This is a bit of a sketch, there are a bunch of details missing. My main concern is that this method doesn't retain unicity. I'll have to work this out to see if it's viable.
[1] Conflict-Free Replicated Relations for Multi-Synchronous Database Management at Edge
[2] Merkle Search Trees: Efficient State-Based CRDTs in Open Networks
[3] Super-Efficient Aggregating History-Independent Persistent Authenticated Dictionaries
[4] Efficient Algorithm for Sorting and Synchronization