Lab notes #013 Inserting keys beyond capacity

Lab notes #013 Inserting keys beyond capacity
programmer inserting keys beyond the capacity of a relaxed radix balanced trie

Spent the week fixing issues with insertion, mostly with different edge cases. Was able to solve an issue with inserting at the beginning and the end of the trie. The edge case was that take_left and take_right should take into account when we take everything or take nothing. It also exposed a bug with the height returned from concatenation. As implemented, merge_nodes for the left, center, and right tries when zipping up a concatenation always added a height to the resulting merged result. This may not be optimal, but for now, it's good enough, and returning the height should future-proof against any future changes.

There was an oversight that blew up the time to finish the insert_at implementation. Most of the papers were focused on using RRBTs to implement a Vector, which is an ordered list of elements with sequential indexes.

But if the keys need to support hashes, then we'd need to be able to insert keys beyond the current capacity of the trie. Can we, and do we need to?

Let's look at the requirements for the in-memory store for the data. The keys of each triplet in the covering index need to be:

  • Equivalent O(1) access time given a key.
  • Able to iterate over a range of keys in O(n) time.

Should the key be a hash or an index? The first requirement suggests the former, and the second requirement suggests the latter. In Lab note #008, I stumbled upon this issue with the ID already. Sequential UUIDs should suffice, and I would need a way to insert keys beyond capacity.

Following the papers carefully, I assumed that any reference outside of the capacity of an internal or leaf node would be an Err. This includes iterations over the node path when following a key.

Instead, the node path iterator should never error out when being created, but instead, return as much of the node path from the root as possible given the key. It's then up to the caller to detect whether the first node is a leaf or an internal node. If it's a leaf, then the full node path was found for the key. Otherwise, the key is looking for nodes outside of its current capacity, and the caller will have to then start creating room to hold the new value, hopefully using the already written append.

This shouldn't affect lookup terribly, but it's not clear to me that it will still work with insert. insert consists of take_left, take_right, append, and concat_sub_trie. Of the four, concat_sub_trie is the most complicated, and it's not clear to me that it can handle keys beyond its current capacity, because now the slot size table takes on a different meaning now.

Instead of being a list of the cumulative number of elements in the child sub-tries, it now will be a list of max keys in each child sub-trie. This breaks len(), which we rely on heavily.

The first alternative is to traverse down into each leaf node of a child sub-trie to determine its length. This is untenable since we'd have to traverse the entire triangle of internal nodes for every leaf node.

The second alternative is to store both the cumulative max key and the slot sizes of each child sub-trie. However, that doubles the amount of storage needed.

This bears some thought. Off the top of my head, perhaps just don't use Relaxation with the Map implementation. Length should have no meaning in a map, and without relaxation, we can just traverse down specific child tries, knowing that's where we need to be. If the child trie is missing, we just keep creating nodes until we reach zero height and just write a leaf node and then build a tower of internal nodes as we recurse back up and attach it to the child trie.

If we keep Relaxation, we can dispose of alternative two, if we find the correct child index given the current key's truncated segment. The other children would have to be empty, and we'd also have to update the slot size table for the entire node path back up to the root, which may consist of M writes per internal node.

The reason why we want Relaxation is that it allows concat in O(log n). But I just realized that I conflated concat with merge. In the database, we want to be able to merge the live index with the remote index. Concat is not going to cut it because it'll shift all the keys. It'll only work in the special case of non-overlapping tries that are concatenated in order.

This is a major setback. I'll need to go back and review my notes on Datomic and see where I went wrong with my evaluation of data structure to use, and see if there's a way to salvage the work.