Lab note #032 Pair programming with GPT-4 and Isabelle

Lab note #032 Pair programming with GPT-4 and Isabelle

This past week, I'd been working on exploring a new construction for a unicit B+tree. I wanted to know if the mechanisms for uncoordinated synchronization of CRDT could be applied here. Given the research nature of the task, I indulged myself with a limited timebox to the task. First using GPT to reason about it, and then with Isabelle, a theorem prover, for about 3 weeks total.

I didn't complete the proof and construction, but I now understand what would be involved. The experience was a little bit like living in the future. A couple of thoughts about it, and then I'll summarize my incomplete results.

GPT-4 as a Star Trek computer

As amazing as GPT-4 is today, working with it is more like "computer" from Star Trek or Jarvis from Iron Man than it is like working with Data from Star Trek. It's errie how prescient this scene with Jordi LaForge is from The Next Generation.

While the interface is not at all spatial or embodied like the holodeck, Star Trek nailed two things about the workflow.

First, the iterative nature of the workflow is spot on. The refinement process is unavoidable. One might forget to mention implied constraints initially or not even understand the problem space fully. In addition, even bad suggestions have value. Explicitly faced with bad solutions forces clarity as to why we'd prune those suggestions from the solution space. Just like writing, exploring a problem space with GPT-4 has a way of clarifying muddy concepts.

Second, your intuition for guiding the direction is as important as ever. In the scene above it's still the human that's driving the conversation and the direction of exploration. The times when I tried to hand off all the thinking to GPT-4 was when I was the least successful in making progress. But when I took a step back to formulate a plan of attack, that's when I'd make the most headway.

Life with theorem-provers

As for using Isabelle, it was mostly positive, though at times frustrating. GPT-4 enabled me to skip reading tutorials and documentation and just jump right into a problem I cared about. Also, it generates recursive functions pretty easily. I rarely had to write the functions myself, but I just told it what I wanted and tweaked some corrections.

My initial strategy was to implement insert and balancing on a tree data structure and then prove its unicity. Balancing required applying a metric on the tree after insert, then picking an operation to balance the node sizes. I spent a couple of days on it, but I got stuck on proofs. Turns out, proofs of even mildly complicated functions are hard.

After thinking about it, I simplified the problem to just a list of node sizes for a single layer and modelled each of the operations on this list. Isabelle was able to quickly find counterexamples where unicity did not hold unless I included a no-op.

Working with trees, most of the proofs are inductive. GPT-4 isn't very good at inductive proofs, [^0] and I'm not great at them myself. I got to a point where I got a sense of how to formulate a strategy to attack the proof with GPT-4's help.

I'd like to keep going, but I can now see this would take up a significant amount of time, even though a solution feels close. I've simply run out of time. So for now, I'll need to table this and go back to Prolly Trees.

All in all

Theorem provers have a place in the programmer's toolkit, but it's something that you'd use to hit high notes, rather than something you'd employ daily. I think it'd be very helpful for designing and reasoning about concurrent and distributed systems.

But to be even more effective, I think GPT needs to be able to use Isabelle itself without a human cutting and pasting between the two. The combination of reasoning for a high-level strategy and fact-checking with the tool use of a theorem-prover should be quite powerful. Just as Retrieval Augmented Generation (RAG) grounds LLMs with data to find the truth, Theorem-prover Augmented Generation (TAG) grounds LLMs with reason to find the truth. But we needn't go that far for every problem. I suspect pairing LLMs with just datalog, where we can generate implied facts from existing facts, would be pretty powerful.

I'm working on a blog post about exercising optimism to see into the future, and a partial order of time. Let me know which topic you're interested in more.


[^0] I haven't tried to make a custom GPT to only do inductive proofs. That should be a low-hanging fruit to see if it yields results.