Lab note #036 Bottom-up Breadth-first Traversal
This last week has been a bit hectic, as I didn't get time management under control. I started at the Recurse Center, an enclave of programmers who program for its own sake, whether on sabbatical or to relearn the love of programming. On the personal front, we are starting to look at preschools, as well as embarking on the long journey that is potty training. I wasn't going to mention much about my personal life in a technical update, but I now think it gives context and color to the journey.
My intention for these lab notes is for them to be a little bit messy, and to detail not just results but also setbacks. That keeps me honest, and it keeps me from shying away from writing consistently.
Because honestly, I've been frustrated with my lack of progress over the last two weeks, adjusting to the new schedule, meeting new people, and dealing with changes at home. But having skipped a week, I didn't want it to become a habit, since I'd been pretty good about maintaining a steady pace and cadence with these updates. While clearing out my tabs, I ran across one about blogging which argues that being consistent matters more than worrying about quality. Like that anecdote about making pottery: by writing a lot and listening to feedback, I'll get better.
Even with all the goings-on, I've made some progress on incremental insert in a Prolly Tree. I feel like I have a much better understanding of the nuances of the problem now. There are basically two tricky parts:
- An efficient bottom-up breadth-first traversal.
- Replacing nodes in the layer above after grouping them differently.
Bottom-up Breadth-first Traversal
I ended up writing the bottom-up breadth-first traversal three times. My first implementation was pretty simplistic: do a regular iterative top-down breadth-first search, and at the moment a node would normally be processed, push it onto a stack instead. After the traversal, pop everything off the stack to get a bottom-up breadth-first order.
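A minimal sketch of this first attempt, assuming a simple in-memory Node with ordered children (a stand-in for the real content-addressed Prolly Tree nodes):

```python
from collections import deque

class Node:
    """Hypothetical in-memory node; children are ordered left-to-right."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def bottom_up_bfs_reversed(root):
    """Top-down BFS that pushes every node onto a stack, then pops the
    stack to yield a bottom-up breadth-first order. Note it holds the
    entire tree on the stack before emitting anything."""
    queue = deque([root])
    stack = []
    while queue:
        node = queue.popleft()
        stack.append(node)
        # Enqueue children in reverse so each layer pops out left-to-right.
        queue.extend(reversed(node.children))
    while stack:
        yield stack.pop()
```

The `reversed(...)` is exactly the backward child loop mentioned below, and the `stack` list is the whole-tree memory cost that made this a dead end.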
A problem was that traversal within a layer came out backward. I needed to traverse left-to-right to match the build process used when constructing the Prolly Tree from scratch. No problem, I thought, I'll just loop through the children backward when processing top-down. However, I found this confusing, and it often led to off-by-one errors when determining boundaries.
But the deal breaker was simply that I was holding the entire tree in memory (on the stack) before emitting any of it, which is untenable for the large trees typical in databases. In fact, it made me appreciate the standard iterative depth-first and breadth-first searches more: neither holds the entire tree in its stack or queue to traverse it. I had to find a different way.
The second time, I figured I could just attach sibling hashes to each node. That way, I can just do a depth-first traversal along the left side of the tree, and just follow sibling pointers all the way to the end of the layer before doing the same as I recursed back up.
In some ways, this did simplify the traversal code a lot. It was easy to tell when the end of the layer was (important for closing out a group of nodes, even if we didn't see a boundary), and we no longer had to hold the entire tree in memory.
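A sketch of that second attempt, assuming each node carries a pointer to its next sibling within the layer (standing in for the sibling hash), and that sibling links cross parent boundaries:

```python
class SibNode:
    """Hypothetical node with an explicit next-sibling link
    (a stand-in for a sibling hash in the real tree)."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []
        self.next_sibling = None

def bottom_up_by_siblings(root):
    """Walk the leftmost edge to find the first node of each layer,
    then emit layers bottom-up by chasing sibling links. Traversal
    only needs O(depth) extra space; the fatal problem was that
    *writing* the sibling links requires buffering whole layers."""
    leftmost = []
    node = root
    while node is not None:
        leftmost.append(node)
        node = node.children[0] if node.children else None
    for layer_head in reversed(leftmost):
        n = layer_head
        while n is not None:
            yield n
            n = n.next_sibling
```

The simplicity here is real: the end of a layer is simply `next_sibling is None`. The write-time dependency chain described next is what killed it.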
Yet this didn't work. When I went to write the incremental insertion with sibling hashes in the nodes, it raised an awkward question: should the sibling hashes affect the content hash of the node? But the deal breaker was the realization that to write a node's sibling hash, I needed to create the next sibling first. And to write the sibling hash for that sibling, I needed the sibling after it, and so on until the end of the layer, where the next sibling hash would be null. Only then could I write the sibling hashes from the end of the layer all the way back to the node at the seam. This is terrible: it means keeping all of these nodes in memory before being able to write any of them.
The method I ended up choosing was to keep a cursor that's represented by a seam through the tree, from root to the currently emitted node. We would do a depth-first search down to the left-most node, then start emitting the parent's children, the immediate siblings. Once we reach the end of the immediate siblings, we go up to the parent level, find the next parent, and come back down to keep getting the next child of the parent. If there is no next parent because we reached the end of the parent's immediate siblings, then we recursively go up to the grandparent level to get the next grandparent, come down to find the next parent, then come back down to find the next child. We do this up to the root, and we repeat this until the end of the layer.
The space complexity is O(log n) for the cursor. As for computational complexity, most of the time getting the next sibling is just looking at the next element in an array, and the number of times you have to traverse up the seam decreases exponentially with each layer you go up. In some hand-wavy fashion, it seems to be on the order of O(n log n).
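The seam-cursor approach described above can be sketched as follows. The names (`layer_left_to_right`, `bottom_up_bfs`) and the in-memory Node are hypothetical, and the tree is assumed balanced with all leaves at the same depth, as Prolly Trees are:

```python
class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def layer_left_to_right(root, depth):
    """Yield one layer left-to-right, keeping only the O(log n) seam
    of (node, child index) pairs from the root to the current node."""
    seam = []          # seam[i] = [node at level i, index of child being visited]
    node = root
    for _ in range(depth):
        if not node.children:
            return
        seam.append([node, 0])
        node = node.children[0]
    while True:
        yield node
        # Climb the seam to the lowest ancestor with another child to the right.
        level = len(seam) - 1
        while level >= 0:
            parent, idx = seam[level]
            if idx + 1 < len(parent.children):
                seam[level][1] = idx + 1
                break
            level -= 1
        else:
            return  # climbed past the root: end of the layer
        # Descend back down the leftmost path under the new branch.
        node = seam[level][0].children[seam[level][1]]
        for l in range(level + 1, len(seam)):
            seam[l] = [node, 0]
            node = node.children[0]

def bottom_up_bfs(root, height):
    """Bottom-up breadth-first traversal: emit the deepest layer first
    (height = number of levels below the root), then each layer above."""
    for depth in range(height, -1, -1):
        yield from layer_left_to_right(root, depth)
```

The inner `while level >= 0` loop is the recursive climb from the description: go up to the parent, then grandparent, until some ancestor has a next child, then come back down to the current layer.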
Getting this to work was surprisingly hard. I ended up writing pseudocode with the recursive algorithm unwound to figure out how to write it. I think the difficulty was that there were additional steps to take after the recursion, which made it awkward to express.
Oh, and ChatGPT was of no help. At least for now, programmer jobs are safe. If you find that ChatGPT can write the code that you often write, it's time to level up.
This is getting long, so I'll stop here and talk about replacing the nodes next time.