A simple way to understand CRDTs

A simple way to understand CRDTs

There's a simple way to understand CRDTs: It leverages algebra to unmix the inevitable mixing of data when syncing over an unreliable network.

Why are networks unreliable? For one, it'd be really expensive to build a network that assured 100% only-once delivery. But also, the internet was designed to allow variable routing paths and dynamic topologies to adapt to failures and congestion. This flexibility allows the network to stay up. The downside of this robustness is the variability of time delays and mixing of data when sending it from node to node.

There are three ways a network can mix up your data, and CRDTs constrain operations on the data to be algebraic to address each of them when syncing data.

  1. Two pieces of data can arrive out of sequential order.
  2. Two pieces of data can arrive and we can't tell which one came first.
  3. A piece of data could be repeated more than once.

CRDTs neutralize this if an operation has three algebraic properties.

As an aside, For state-based CRDTs, the operation is called a merge, where given a data packet coming in, we can merge it with the existing state. For op-based CRDTs, the operation is called an apply, but it's effectively a merge with one less required property. We'll get to that.

When data comes in out of sequential order, we can neutralize this mixing if our merge op is associative. I'll denote our operation as a dot.

(A . B) . C = A . (B . C)

means that it doesn't matter whether B is merged first or C is merged first. The result will be the same.

When data comes in, and we can't tell which one came first, we can neutralize this mixing if our merge op is commutative.

A . B = B . A

means it doesn't matter whether they occurred at the same time. We can apply them in either order and get the same result.

And when data gets repeated more than once, we can neutralize it if our merge op is idempotent.

A . B = A . B . B

means we can apply B as much as we want, but it'll remain the same as applying it once. Not all operations we do on data will have these properties. But if we can constrain ourselves to only operations that have these algebraic properties over the data, then we have a much easier time syncing the data.

State-based CRDTs rely on their merge operation to have all three properties. But they aren't used in practice because their protocol requires sending the entire replica state, which can be prohibitive. That is unless they send deltas or diffs.

Op-based CRDTs sends cmds/events instead. The apply/merge operation here only needs to be commutative and idempotent. The associativity isn't required because Op-based CRDTs requires the protocol to enforce causal delivery of data. It simply shifted complexity from merge op to protocol.

All the details around CRDTs are just how to do this effectively. There's lots of details for sure, but just as a way to wrap your head around what it's doing, this is the core. The main idea is that by constraining the power of our operations so that we get the same answer regardless of how they're applied, we can neutralize the unreliable nature of the network. I've seen lots of CRDT intros, but no one ever spells this out.