Lab notes #043 Merkle CRDT protocol

Lab notes #043 Merkle CRDT protocol

My adherence to immutability has put me on the hard path for everything so far. Not just for the unicit Prolly tree, but also for the CRDT. Most practical CRDT libraries use an op-based CRDT or a delta-based CRDT, but many do not keep the entire history around. I think it should, but also leverage the history for immutability. To support that, I settled on a kind of Merkle CRDT which uses the Merkle-DAG to maintain a partial order.

But what should the protocol of the Merkle CRDT be? Should it be deltas, diffs, events, or commands?

Naïve state-based CRDTs aren't practical, because they assume the entire state is sent over the wire. If a CRDT is state-based at all, they send deltas or diffs. I distinguish between a delta (δ) and a diff (Δ) because there are actually two flavors of CRDTs here. The δ-state-based CRDT sends composable deltas of between its last sent state and the new state. By contrast, the Δ-state-based CRDTs send diffs tailored to each remote replica.

Most CRDT implementations are op-based, so they send either events or commands. Events are declarative facts about what has happened in the system, so different observers can take different actions based on an observed fact. By contrast, commands are explicit messages to the remote replica to do something.

It seems events are favored when history is a message bus between various clients and you don't control all the clients now and in the future. Thus, you won't know their requirements and what they'll do. Commands are usually chosen when you control all the clients and know their requirements up front.

I can choose either Op-based-like protocols (events/commands) or State-based-like protocols (deltas/diffs). Given that a Merkle DAG is a history of state, I had a piece of the design puzzle that didn't fit together.

The advantage of delta/diffs over events/commands is when there is a large gap between versions to sync, there's no need to recount the entire history of events/commands to converge on the same version. A delta/diff can be sent instead. But if we send a delta/diff to skip to the latest version, how will we get all the intermediate states in the history of a Merkle DAG?

This history of commits is important to the way a Merkle CRDT would work. If there are commits missing, we wouldn't be able to sync with any replica at that commit, because we can't do partial order comparisons.

We have several options:

  • Run a background process to fill them in later.
  • Create placeholders for unfetched commits.
  • Use a reference to skip over missing commits, and lazily fetch them when needed.

It didn't seem like there were many design options, so I landed on a hybrid approach. Just like how accounts close the books every month or every quarter, there should be a threshold beyond which replicas can't be expected to sync within a reasonable amount of time. Once a replica lags behind the lead replica beyond the threshold, it'll use delta/diffs to sync. But if it's within the threshold, it'll use commands to sync.

There are two reasons a replica has to look back into history. One is to bring other replicas up to sync, and the other is as a basic for comparison for a query. Given that I intend to have a centralized server, clients have less of a responsibility to keep history for sync. Hence, once a replica catches up via a delta/diff sync, it's unlikely to need the historic commit states. This is true for client replicas on browsers.

But for client replicas on servers, I think it would make sense to replicate this history to enable syncing between servers. It'd be a sync backwards in history, but only until the most lagged replica. After that, it'd be a low-priority sync to store as much history as possible, as a way to do backups.

The current open question is how to do efficient Merkle Clock comparisons? How can I tell if one node is an ancestor of another without walking the DAG? I haven't looked into this deeply, but as I was learning about cache-oblivious b-trees, I learned about ordered file maintenance problem and the list labeling problem, and they seem to be related. I'll fill in this detail later.