Lab note #068 Dealing with collections incrementally

Had a bunch of distractions this week, like doing taxes, so I fell behind on the weekly update. However, there's been progress these past two weeks.
- Generated a Deep Research report on the strengths and weaknesses of different notebooks
- Learned about E-graphs and Interaction Nets
- Read papers on Incremental Lambda Calculus
- Read the DBSP paper
- Started implementing DBSP
Strengths and Weaknesses
One of the worries hanging over my head is that no one will care about what I build, even after I've built it. I've spent my entire career building things that nobody wanted. To buttress against that, I've made sure that I really want the resulting product, and that I'm not just having fun doing the building. So I did a little bit of work to test that out.
I built a little sample RAG app that downloads the Zig documentation, chunks it, converts the chunks into embeddings, and then stuffs them into a PostgreSQL database with vector support. Then I used some vibe-coding to generate a simple Flask app on top of it. It was actually quite fun to see it working, even though I've built this kind of thing by hand before. To me, the fun was seeing the end result and not hitting too many blockers on the AI pipeline side. So that tells me I'm on the right track.
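For concreteness, here's a rough sketch of that pipeline shape. This is not the actual app: it assumes the OpenAI embeddings API, psycopg2, and a Postgres instance with the pgvector extension, and the URL, table name, and chunk size are stand-ins for illustration.

```python
# Sketch of a download -> chunk -> embed -> store pipeline (assumed libraries/names).
import requests
import psycopg2
from openai import OpenAI

client = OpenAI()

def fetch_docs(url="https://ziglang.org/documentation/master/"):
    # Grab the raw documentation page (assumed URL).
    return requests.get(url).text

def chunk(text, size=1000):
    # Naive fixed-width chunking; a real app might split on headings instead.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks):
    # One embedding per chunk via the OpenAI embeddings endpoint.
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [d.embedding for d in resp.data]

def store(chunks, vectors):
    # Assumes a table: CREATE TABLE zig_chunks (content text, embedding vector(1536));
    conn = psycopg2.connect("dbname=rag")
    with conn, conn.cursor() as cur:
        for text, vec in zip(chunks, vectors):
            cur.execute(
                "INSERT INTO zig_chunks (content, embedding) VALUES (%s, %s::vector)",
                (text, "[" + ",".join(map(str, vec)) + "]"),
            )

if __name__ == "__main__":
    chunks = chunk(fetch_docs())
    store(chunks, embed(chunks))
```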
Additional buttressing was to use OpenAI Deep Research to understand the current state-of-the-art competitors out there, both their strengths and their weaknesses. Jupyter is the dominant one out there by far, but Google Colab runs a tight second, albeit for different reasons. There are others, but these two are the poster children.
There were no surprises for me on the weaknesses of a Jupyter notebook. But an important question I had was: "Despite these weaknesses, why does it still remain so popular?"
The short answer is that momentum goes a long way toward keeping winners on top. It's also validation that building on Python is the right call, despite all its limitations. But at the end of the day, it's confirmation that the startup truism still holds: to win against an incumbent, you need to be 10x better along some dimension that customers care about, but that the incumbent either doesn't see or can't address (due to some structural barrier).
For Google Colab, that dimension was access to GPUs for running heavy ML training workloads. That was something Jupyter notebooks couldn't follow, because the project isn't set up to provide that kind of cloud infrastructure.
My answer to this is probably deployability, with collaboration as a more distant second. There's a wide gap between doing experimental work in a notebook and tossing it over the wall for engineering to put into production. Never mind the logistics of converting the notebook to code; there's no organizing philosophy to begin with. And without that, it's harder to guarantee runtime properties like security, performance, and affordances, all of which matter to a product beyond raw capability.
Rewrite Graphs
Diving back into the applied research, I figured it was time to take a day to dip a toe into E-graphs and Interaction Nets.
- E-graphs: A compact way of representing the results of applying rewrite rules, as equivalence classes of expressions in a single graph data structure.
- Interaction Nets: A representation of computation whose evaluation rules are constrained so that reduction is inherently parallelizable.
Of the two, Interaction Nets seem more interesting to me. They're another instance of how we gain superpowers when we enforce very specific constraints. They're also an intriguing fit with the wrapped types I was talking about in previous lab notes. If I get to the point where I need to think about parallelizing, I might turn to them again.
Incremental Lambda Calculus
I wanted to give incremental lambda calculus (ILC) a good shake as a possible solution for the incremental collections problem I'm looking at right now. Having GPT around to explain papers has again been immensely helpful. And over the last year, I've found the later models to be much less prone to hallucinating facts about the paper.
ILC is basically a way of structuring incremental computation where all the operators a user can use have derivatives. That way, if a new piece of data comes in, we can compute the incremental update. Importantly, users won't have to re-derive the derivatives if they only use the given operators. With composition, the derivatives compose too, and they can be derived automatically without user intervention.
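To make that concrete, here's a minimal sketch of the idea in Python. The names and the change representation (modeling a change as a list of appended elements) are my own simplifications, not the paper's change structures.

```python
# Each operator comes paired with a "derivative": given the old input and a change,
# it produces the change in the output without recomputing from scratch.

def total(xs):                      # base operator: sum of a collection
    return sum(xs)

def d_total(xs, dxs):               # its derivative: change in the sum = sum of the change
    return sum(dxs)

def scale(x, k=2):                  # another operator
    return k * x

def d_scale(x, dx, k=2):            # derivative of a linear operator is linear in the change
    return k * dx

# Derivatives compose by the chain rule: d(g . f)(x, dx) = dg(f(x), df(x, dx)),
# so the user never has to derive the composite derivative by hand.
def pipeline(xs):
    return scale(total(xs))

def d_pipeline(xs, dxs):
    return d_scale(total(xs), d_total(xs, dxs))

xs  = [1, 2, 3]
dxs = [4]                           # a new piece of data arrives
old = pipeline(xs)                  # 12
new = old + d_pipeline(xs, dxs)     # 20, computed incrementally
assert new == pipeline(xs + dxs)    # same result as recomputing from scratch
```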
ILC is more generic than DBSP, giving the user flexibility to write functional code. However, its implementation also requires a wrapped type. Earlier, I had decided not to go down this route, but evidence is mounting that there are lots of advantages to being able to control your own AST and how to evaluate it.
But for my specific problem at the moment, which is how to deal with incremental computation when collections are passed around, I think DBSP is the way to go. The paper conceives of it as building blocks for an incremental SQL, but it's not strictly true that it won't let you write functionally. It's just that I can slot DBSP inside of something else without asking the user to adopt a slightly different execution mental model for functions, although it's getting there.
DBSP
DBSP doesn't actually stand for anything. It's a play on DB (database) and DSP (digital signal processing). It's an incremental view maintenance technique from Frank McSherry (the only author of the group I know) and his collaborators. It's simpler and less general than Differential Dataflow, but that simplicity and its basis in DSP give its mechanics a clarity that Differential Dataflow wasn't able to achieve.
DBSP is a way to do incremental view maintenance by building a dataflow computation out of only a few operators, namely `delay` and `map`. With `delay`, you can construct `differentiate` and `integrate`. Note how the derivative crops up here again. But unlike ILC, DBSP doesn't have a general function for derivatives; it can only compare the current value of the data against its previously known value to compute one.
So from this formulation in terms of `map`, `differentiate`, and `integrate`, you can mechanically rewrite a computation into an incremental one using rewrite rules. Those rewrite rules are constrained by the algebra of the underlying data (the values have to form an abelian group). And with that, you can build all the basic parts of SQL, from `join` to `filter`, and it's all incrementalized.
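And here's a rough sketch of the collection side as I understand it, using Z-sets (rows mapped to integer weights) as the abelian group. The helper names are mine; the point is that a linear operator like `filter` is its own incremental version, so each delta can update the materialized view without rescanning it.

```python
# Z-sets: rows mapped to integer weights (+1 insert, -1 delete), an abelian group
# under pointwise addition. A lifted `filter` is linear, so feeding it deltas and
# integrating the output maintains the view incrementally.

from collections import Counter

def zset_add(a, b):
    """Group operation: add weights pointwise, dropping zero-weight rows."""
    out = Counter(a)
    for row, w in b.items():
        out[row] += w
    return {row: w for row, w in out.items() if w != 0}

def zfilter(pred, z):
    """`filter` lifted to Z-sets: keep rows (with their weights) matching pred."""
    return {row: w for row, w in z.items() if pred(row)}

# Deltas arriving over time.
deltas = [
    {("zig", "comptime"): 1, ("zig", "allocator"): 1},
    {("zig", "comptime"): -1, ("rust", "borrow"): 1},
]

pred = lambda row: row[0] == "zig"

view = {}                                   # integrated (materialized) output
for d in deltas:
    view = zset_add(view, zfilter(pred, d)) # incremental update per delta

assert view == {("zig", "allocator"): 1}    # same as filtering the fully integrated input
```

Bilinear operators like `join` need more machinery, since their incremental form has to consult the integrated inputs as well; that's where the rewrite rules do the heavy lifting.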
I started on an implementation this week. It's harder than I thought, but I think it's doable. Once again, without GPT to explain a lot of the details and nuances I'd skipped over in my first two readings, this would either have taken me a long time or just been really hard.
I don't intend to implement all the SQL operators, just enough to cover the typical stuff I use in the Zig RAG pipeline. I hope to make enough progress to be able to move on.
This week's links:
- Incremental λ-Calculus. The overview page for incremental lambda calculus. It's full of links.
- A Theory of Changes for Higher-Order Languages. The first paper on ILC. Full of little goodies.
- DBSP: Automatic Incremental View Maintenance for Query Languages. The DBSP paper. Pretty clearly written, but a little scant on some details.
- Database Protocols Are Underwhelming | byroot’s blog. I've often felt databases don't consider their design far enough into the application layer. As a result, there's a complete impedance mismatch that application devs can't resolve by themselves with bad ORMs.
- Were React Hooks a Mistake? | jakelazaroff.com. I'm always checking my understanding of the design of React. This was another good perspective on its design decisions.
- Black Mirror’s pessimism porn won’t lead us to a better future | Louis Anslow | The Guardian. We need more Star Trek. Dystopian sci-fi has failed to warn us anyway, as we're living in one of the worst possible timelines. So I think we need better visions of the future, instead of emo navel-gazing for cool points.
- Query Engines: Push vs. Pull
- Mechanical Sympathy: Coding for CPU Performance. Seph Gentle has a piece called 3 Tribes of Programmers, which I relate to often. I think programmers need to hold both functional arches in their heads, along with mechanical sympathy. Most seem to only have one or the other.
- Sleep-time Compute: Beyond Inference Scaling at Test-time. An interesting take on having the LLM use compute while not active to learn stuff.