Lab notes #014: Revisiting Requirements

The persistent data structure in Datomic is used as a live index for querying. However, it's unique in that it saves itself to a storage-backed key/value store in segments: contiguous pieces of data akin to the data in the leaf nodes of a trie.

Fetching a single key/value pair also fetches the entire set of key/value pairs around it. This is done for efficiency: a typical query retrieves a record with its attributes, and hence reads the values adjacent to a single key in the covering index.

Therefore, I needed a persistent ordered map where I:

  • Can find a segment to send to storage, given a key
  • Can refer to a segment of data by a truncated key
  • Can retrieve a segment from storage into a cache index, given the truncated key
  • Can merge the cached storage index into the live index
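The last two requirements can be sketched in Rust. This is a hypothetical sketch rather than the real implementation: `u64` keys and `String` values stand in for the actual datom types, and plain `BTreeMap`s stand in for the persistent indexes.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of requirements 3 and 4: a segment retrieved
// from storage is loaded into a cache index, and the cache is later
// merged into the live index.
fn cache_segment(cache: &mut BTreeMap<u64, String>, segment: Vec<(u64, String)>) {
    // Load the retrieved segment's entries into the cache index.
    cache.extend(segment);
}

fn merge_cache(live: &mut BTreeMap<u64, String>, cache: BTreeMap<u64, String>) {
    // Merge cached storage entries into the live index. Entries already
    // in the live index win, on the assumption that the live index
    // holds the newest data.
    for (k, v) in cache {
        live.entry(k).or_insert(v);
    }
}
```

The merge policy (live entries win) is my assumption here; the real rule depends on how the live index accumulates novelty.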

Originally, I thought I needed to implement my own persistent data structure. As a result, I went down a rabbit hole on Hash Array Mapped Tries and Relaxed Radix Balanced Tries for weeks. This week, I revisited the subject and asked myself, "So why am I not using an off-the-shelf OrderedMap?" The answer stung a bit: I didn't need to write my own.

Why did I make this decision? There seemed to be a couple of reasons. This is the retrospective.

I was too restrictive in my analysis at the time

In my research on how Datomic worked, the implementation details were fully exposed to me, and I was thinking about the problem as if I had full control of and access to the persistent data structure. Hence, if I used im's off-the-shelf OrderedMap, I would need access to its internal leaf data structure in order to:

  1. Get the representation of the key all the way down to the internal node right before the leaf node in order to reference a segment.
  2. Get the segment as represented by the OrderedMap internally to insert it back into an index after retrieval from storage.

Thinking about it this week, an alternative hack would have been serviceable enough to move everything forward:

  1. Just truncate the bottom 32 bits of each key and treat the result as a truncated key that references a segment in the storage-backed key/value store. There's no need to know how OrderedMap represents the key internally at the node right before the leaf node.
  2. Therefore, there's no need to know the boundaries of what a leaf node stores internally. Just use an iterator to walk the range 0x00 to 0x1F under a truncated key, and there's our segment.
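The hack above can be sketched with nothing but public range queries. A hypothetical sketch: I use `std::collections::BTreeMap` as a stand-in for im's OrderedMap (both expose ordered range scans), and I assume a 5-bit suffix (32 keys per segment, matching the 0x00 to 0x1F walk) to keep the example small.

```rust
use std::collections::BTreeMap;

// Assumption: 5 low bits per segment, i.e. 32 keys, matching the
// 0x00..=0x1F walk above. The real suffix width would differ.
const SEG_BITS: u32 = 5;
const SEG_MASK: u64 = !((1u64 << SEG_BITS) - 1);

/// The truncated key: zero out the low bits, and the result names the
/// segment the key belongs to.
fn truncate(key: u64) -> u64 {
    key & SEG_MASK
}

/// All entries in `key`'s segment, without peeking at the map's
/// internal leaf layout: just a range scan from the truncated key to
/// the last key sharing that prefix.
fn segment(index: &BTreeMap<u64, String>, key: u64) -> Vec<(u64, String)> {
    let start = truncate(key);
    let end = start | !SEG_MASK; // start + 0x1F, inclusive
    index
        .range(start..=end)
        .map(|(k, v)| (*k, v.clone()))
        .collect()
}
```

The point of the sketch is that `segment` only needs the map's public iterator, so any off-the-shelf ordered map with range queries would do.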

I think the desire to do something technically hard and the desire to match the implementation details was blinding me to the fact that I could leverage someone else's work.

Buying a bit too much into Handmade ideals

A recency bias from watching a little too much Casey Muratori. He's a better programmer than I am, so he can say things like, "let's rebuild everything from scratch." In addition, I found myself nodding along to claims like: reuse can be hard when you use libraries, because a library does a whole bunch of stuff you don't need or want, which ends up bloating your system.

But on the other hand, it takes a long time to build everything from scratch, especially if you've never done this sort of work before. The time required is far greater than the usual 3x padding of estimates. That can be time well spent if it's well established that the benefit will materialize after completion. That's not the case here, because there are additional risks on top of completing this component.

The important thing is to identify the core assumption and do what I can to validate or invalidate it, and only after validation, go back and fill out the components. Otherwise, I may have learned a lot about something, but I'd end up having nothing to show for it.

In more of a research and learning mindset

At the time I started this, I was doing too much reading, less writing, and no coding. I had thought that it was a good opportunity to learn Rust and dig into how these more advanced data structures worked.

While I did learn Rust and Relaxed Radix Balanced Tries, my mind turned back to the nagging thought: and then? What happens after that? I've been down this road before, and while I learn a lot, I don't end up with any visible impact or increase in leverage. So if I'm going to change my trajectory, I'll have to do something different.

Going forward

Going forward, I'll need to find a way to state a hypothesis and validate/invalidate it, before coming back to build out the rest of the database. That'll be the work this next week.