Lab note #056: Using version vectors
I'm back, after a little hiatus from writing the notes. Last time, I discovered that I should have looked more deeply into other reactive models before embarking on a functional component (React-like) model for data pipelines. The impedance mismatch should have been a sign that something wasn't quite right.
I spent the last couple of weeks just digging into signals and autotracking, with some light toe dips into differential dataflow. During that time, I felt guilty about not writing the lab notes, because all I had were questions rather than answers. But that might be illuminating for those following along. Unlike looking backward from a solution, the path forward is always full of dead ends and confusion. I'll try to set aside my ego and lean into more of that. Going forward, expect the lab notes to be rawer first drafts, filled with questions.
I ended up embarking on a rewrite using the version vector method that autotracking uses. It has a few advantages.
- Version vector reactivity doesn't have a push phase, unlike typical hybrid push-pull signal reactivity. Push-pull signals have to propagate dirty bits down the computational graph during the push phase. Version vectors skip this: when a signal changes, it just bumps a global version vector. Then, during the pull phase, each node in the computational graph compares the global version vector against the version vector it last saw to decide whether to re-execute (see the first sketch after this list).
- Version vectors at each node seem to form a partially ordered set from source to sink. So I think we can detect sibling nodes by checking whether two nodes' version vectors are incomparable (see the second sketch after this list). This is a nice segue into concurrent execution in the future.
- The computational graph can be seen as an extension of a causal log of changes synced over the network. I'm far less sure whether this makes sense, but the motivation is to find a single representation that can support both reactivity and syncing at the same time.
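To make the first point concrete, here's a minimal sketch of the pull-only idea in TypeScript. This is my own illustration, not autotracking's actual implementation: the names (`globalVersion`, `Source`, `Derived`) are made up, dependencies are declared explicitly rather than discovered while the function runs, and each node carries a single version number instead of a full vector.

```typescript
// Hypothetical sketch: one global clock, bumped on every write.
let globalVersion = 0;

interface Versioned {
  readonly version: number;
}

class Source<T> implements Versioned {
  version = 0;
  constructor(private value: T) {}
  get(): T {
    return this.value;
  }
  set(value: T): void {
    this.value = value;
    // No push phase: a write only bumps version numbers. No dirty
    // bits get propagated through the graph.
    this.version = ++globalVersion;
  }
}

class Derived<T> implements Versioned {
  private lastSeen = -1; // version this node saw at its last recompute
  private cached!: T;
  constructor(
    private deps: Versioned[],
    private compute: () => T,
  ) {}

  // A derived node is as new as its newest dependency.
  get version(): number {
    return Math.max(0, ...this.deps.map((d) => d.version));
  }

  get(): T {
    // Pull phase: re-execute only if a dependency moved past what
    // this node last saw.
    if (this.version > this.lastSeen) {
      this.cached = this.compute();
      this.lastSeen = this.version;
    }
    return this.cached;
  }
}

const a = new Source(1);
const b = new Source(2);
const sum = new Derived([a, b], () => a.get() + b.get());
sum.get(); // 3, computed
sum.get(); // 3, cached; nothing re-executes
a.set(10); // O(1): bumps versions, touches no other node
sum.get(); // 12, recomputed because the versions diverged
```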
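For the second point, here's a sketch of what incomparability would look like if each node carried a full version vector, i.e. a map from a source's id to the last version of that source the node observed. Again, the shape is hypothetical, not code from any library:

```typescript
// Hypothetical shape: a version vector maps each source's id to the
// last version of that source a node observed.
type VersionVector = Map<string, number>;

type Ordering = "equal" | "before" | "after" | "incomparable";

// Pointwise comparison: a precedes b iff every entry of a is <= the
// matching entry of b (a missing entry counts as 0). When neither
// vector dominates the other, they're incomparable.
function compare(a: VersionVector, b: VersionVector): Ordering {
  let aLeB = true;
  let bLeA = true;
  for (const key of new Set([...a.keys(), ...b.keys()])) {
    const av = a.get(key) ?? 0;
    const bv = b.get(key) ?? 0;
    if (av > bv) aLeB = false;
    if (bv > av) bLeA = false;
  }
  if (aLeB && bLeA) return "equal";
  if (aLeB) return "before";
  if (bLeA) return "after";
  return "incomparable";
}

// Sibling nodes that observed disjoint sources come out incomparable,
// which is the hint that neither depends on the other and they could
// run concurrently.
compare(new Map([["x", 3]]), new Map([["y", 5]])); // "incomparable"
compare(new Map([["x", 3]]), new Map([["x", 4]])); // "before"
```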
The other thing that became clearer in my mind after looking into this is how the different parts of a reactive system are employed to minimize work. If the relationship between source state and derived state is modeled as a computational graph, we can minimize work through increasing levels of filtering (sketched in code after this list), by:
- Assuming derived node calculations are pure functions. With pure functions, we can reason about whether to re-execute, without actually executing the function.
- Only recomputing the dependencies that have changed. Any dependency unaffected by a source change doesn't need to re-execute, and we can tell which ones those are by comparing their version vectors with the global version vector.
- Only recomputing the current node if any of its dependencies have changed. This is akin to pruning branches during the pull phase, from sink to source.
- Short-circuiting downstream execution if a re-execution produces the same value as the previous execution. This is akin to pruning downstream computations.
- Partially recomputing compound data, like a list or a map, when we must recompute the node and the resulting value will differ.
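Here's how I picture the first four filters composing in a single pull, with the same caveats as the earlier sketches: made-up names, static dependencies, and a single version number standing in for a full vector. The fifth filter, partial recomputation of compound data, is the differential dataflow end of the spectrum and isn't shown.

```typescript
// Hypothetical sketch of filters 1-4 layered into one pull.
let clock = 0;

interface Node<T> {
  get(): T;                 // bring the node up to date, return value
  readonly version: number; // moves only when the value changes
}

class Source<T> implements Node<T> {
  version = 0;
  constructor(private value: T) {}
  get(): T {
    return this.value;
  }
  set(value: T): void {
    this.value = value;
    this.version = ++clock;
  }
}

class Memo<T> implements Node<T> {
  private lastSeen = -1;
  private cached!: T;
  private hasValue = false;
  version = 0;

  constructor(
    private deps: Node<unknown>[],
    private compute: () => T, // filter 1: assumed pure
    private equals: (a: T, b: T) => boolean = Object.is,
  ) {}

  get(): T {
    // Filter 2: bring dependencies up to date first. Each one
    // re-executes only if it is itself stale, and bumps its version
    // only if its value actually changed.
    for (const d of this.deps) d.get();
    const depVersion = Math.max(0, ...this.deps.map((d) => d.version));

    // Filter 3: skip this node if no dependency's value changed.
    if (this.hasValue && depVersion <= this.lastSeen) return this.cached;

    const next = this.compute();
    this.lastSeen = depVersion;
    if (!this.hasValue || !this.equals(next, this.cached)) {
      this.cached = next;
      this.hasValue = true;
      this.version = depVersion; // advertise the change downstream
    }
    // Filter 4: if `next` equaled the old value, `version` stayed
    // put, so downstream nodes prune themselves on their own
    // filter-3 check.
    return this.cached;
  }
}

// `parity` changes less often than `n`, so `label` usually prunes.
const n = new Source(1);
const parity = new Memo([n], () => n.get() % 2);
const label = new Memo([parity], () => (parity.get() === 0 ? "even" : "odd"));
label.get(); // "odd": both memos compute
n.set(3);    // n changed...
label.get(); // ...but parity is still 1, so `label` never re-executes
```

The key move here is that a memo only bumps its advertised version when its value actually changes, so the equality short-circuit upstream becomes the pruning check downstream.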
There are a few fuzzy parts, but I think they can be cleared up soon with an implementation. The big question mark right now is how the effect system will fit into this. Without a good effect system, the whole exercise is a bit moot. So I'll be turning over the questions I have about that next time.