Lab notes #008 Details are in the ID
I'm usually not hung up about the programming language I use, but given Jamie's take, I did some deliberation about whether to use Zig or Rust. I ended up going with Rust, because I'm not a seasoned systems programmer, so I figured having the compiler guide me to writing safe code would help me become better. However, it is pretty discouraging when it's yelling at you all the time, and as good as the error messages are compared to other languages, it's still sometimes cryptic.
I was chugging along with the basic implementation of the RRB trie when I panicked about the details of the ID. Given a Radix descent, the ID has to be a bounded number of bytes. But the ID in a Datom can be any length. Do I need to hash it? If it's hashed, then how would iteration work? And also, I somehow missed that all the RRB papers are implementing vectors. What I really want is a Map. Wait, is a Hash Array Map Trie more appropriate?
After some desperate re-reading of some of the papers, I determined that RRB was on the right track. While the implementation could be for a vector, it's sparse and self-balanced that it could be used as a Map. And in order to keep some semblance of order for IDs to be next to each other for better cache locality, just generate sequential UUIDs, as well as an entry from the UUID to the full string representation of the ID. That way, similar keys will be stored in leaf nodes close to each other, which groups similar keys together, as well as making iteration a lot easier.
Now, it's just time to write some tests, and also work on concatenation and slicing.
I picked Persistent Relaxed Radix Balanced Trie because it had a bunch of properties I was looking for.
- The data structure allowed for persistence. Descriptions of a B+tree usually involve a doubly linked list between all the leaf nodes. Not only does this wreck with Rust's ownership model, but it's also not conducive to path copying for persistence.
- Its Big-O complexity seems pretty good across the board. Either effective constant time or amortized constant time.
- Relaxed nodes allow for sparsity, which makes it ok to treat the vector indices as keys in a map.
- Concatenation is an O(M log n) operation, where M is at most 32. This should be useful for merging the storage-backed index with the live index.
- It allowed for ranged queries and iterating over similar keys, rather than iterating over insertion order, like most ordered maps.
- For ranged queries to work, IDs have to the generated in a somewhat monotonically increasing way. UUIDs based on time fit the bill. And then an extra fact tying the UUID to the full string version of the ID can be used to resolve it.
- To differentiate between facts with the same entry ID in the EAVT index, I think we should be able to use an ID in the attribute as well.
So it seems like all the pieces are in place, and the rest of the time should just be on implementation.