Lab notes #004 Index construction with trees
Watched lectures and read papers on persistent trees, and come away feeling like it's a rabbit hole. I initially found it difficult to implement a B+-tree in Rust, due to ownership rules, so I stopped. As much of an advantage it would be to have these things implemented, it's a commodity.
Decided to go ahead with Rust. Compared to Zig, there's just a lot of machinery, and they don't do double-duty like concepts in Haskell. However, given that I'm a new systems programmer, I need the guard rails. Once I get better habits, then perhaps I can switch to something with less guard rails.
Watched lectures and read papers on persistent trees, and come away feeling like it's a rabbit hole. I initially found it difficult to implement a B+-tree in Rust, due to ownership rules, so I stopped. As much of an advantage it would be to have these things implemented, it's a commodity. The decision was to move forward with im crate, since it implements much of what I'm reading about.
When reading the source for im, I found that ranged queries aren't implemented as a linked list of leaf nodes for ranged queries. It's actually a vec of paths being generated from an iterator. Normally, this might be considered slow, because we'd have to traverse up and down the trees, but given that the height of the tree will be low, it's probably negligible.
However, as I started to piece together the architecture, the detail of what to put as an element of the leaf came to head. If it's an entire segment of datoms that's stored at the leaves, that means that multiple keys are pointing to the same segment. And in order to store that entire segment into some external KV store, I'd need access to it. As far as I can tell, the im crate provides access to the values pointed to by the leaf node, but not the leaf node itself.
The data structure of the index will need the following properties
- B+-tree like, with high fan out to reduce height and access times
- a fast way to do merges to get live index and storage-backed index together quickly to do queries.
- A way to do ranged queries across the leaves.
- Access to leaf node to ship its contents to the KV store.
Am starting to learn about Relaxed Radix Balanced Trees, and hopefully, it'll have all these properties intact. im crate uses RRB vectors for the vector/arrays inside a node of its B-Tree implementation, but doesn't actually use RRB trees.
Also found out that RRB trees are also useful in writing text editors. So it seems like a fast persistent tree is foundational to getting this thing working.
One possible route was to use iterators to return the leaf nodes of the persistent tree. But that looks like an impossibility because access to the Node<A>
data structure is pub (crate)
, so it's not accessible from the outside. Looks like I'll just have to re-implement it.