As part of a recent personal journey to better understand databases and better learn Rust, I have recently took on the project of writing a simple key-value storage engine. Crazy, right? Lets get started!
B-Trees vs LSM Trees
The first thing one must think of when writing a key-value storage engine is -”How should I store my data so that I can perform reads and writes efficiently?”; The two most common approaches for storing data on disk are B-Trees and LSM Trees.
BTrees have been around for five decades and are used in “traditional” SQL databases such as PostgreSQL as an index on tables. A BTree is essentially a generalization of a Binary Search Tree, by allowing a “branching factor” larger than two (parameterized as t). Thus, BTrees are well suited for reads; reads run in O(tlogₜ(N)) for a tree with N keys (This can be improved using binary search), but less for writes which at worst case require O(tlogₜ(N)).
On the other hand, LSM-Trees (Log Structured Merge Trees) are a seemingly newer invention (at least, newer to me) usually used by newer “NoSQL” databases for example: “Scyalla”, “RocksDB”, “BigTable” etc. An LSM-Tree essentially takes a simple log and extends it by dumping the log to disk once it reaches a certain size (Commonly referred to as a “segment”). Concurrently a designated process is responsible for merging the previously dumped segments. Thus, depending on the implementation, a write is as simple as appending to a file and runs in O(1) while reads can be as slow as reading all the data in all the segment from the latest to the oldest, i.e. O(N). The segments themselves can be kept sorted which might slow down writes but can speed up reads and merges drastically — mergers can be done efficiently using a procedure similar to the one used in “merge-sort”.
A Persistent B-Tree
After considering the alternatives I’ve decided to write a “traditional” BTree.
To make things simpler we are going to store the tree in a single file and the leaf nodes will contain our keys and values (This variation is called a B⁺Tree). Thus we will need to restrict the size of our keys and values as they need to somehow fit in a single page.
A BTree stores its nodes to disk in “pages” (Traditionally of size 4096 bytes), thus, minimizing the number of I/O to disk as the operating system fetches chunks of memory in “pages” — For example; a BTree with t=100 can lookup a key in a tree with 1 million keys in just three I/Os. Lets look at a naive yet effective design of a page representing a node —
An internal node -
| IS-ROOT 1-byte| NODE-TYPE 1-byte | PARENT OFFSET - 8 bytes | | Number of children - 8 bytes |
| Key #0 - ? bytes | Key #2 - ? bytes | ...
| Child Offset #0 - 8 bytes| Child offset #1 - 8 bytes | ...
A leaf node-
| IS-ROOT 1-byte| NODE-TYPE 1-byte | PARENT OFFSET - 8 bytes | | Number of pairs - 8 bytes |
| Key #0 - ? bytes | Value #0 - ? bytes | ...
| Key #N - ? bytes | Value #N - ? bytes |
How long should we limit our keys given the above layout? Lets do some “over the napkin” arithmetic; Lets say we would like to support a maximum branching factor of 200 — That means that a leaf node needs to accommodate up to 200 key-value pairs:
SPACE_FOR_PAIRS = PAGE_SIZE — HEADER_SIZE = 4096–18 = 4078 bytes.MAX_SIZE_OF_PAIR = SPACE_FOR_PAIRS / MAX_BRANCHING_FACTOR =
4078 / 200 ~= 20 bytes. (78 bytes remain as junk)
Thus each pair is twenty bytes long — For simplicity we will keep our values in the leaf nodes; alternatively, we could have allowed for variable sized values by keeping a pointer to the values in the leafs with the added cost of another I/O to disk. Since the keys and values should take up to twenty bytes we will simply split that between them: ten bytes for keys and ten bytes for values. To keep things simple we will treat our keys and values as UTF-8 encoded strings.
A Node struct:
Lets look at a Node struct capturing the design we have thus far-
Given the design we have we can now write the conversion of Pages to Nodes and Nodes to Pages-
A Page struct:
This is a simple wrapper for a page of memory.
In this first chapter we have designed our on disk page (i.e. node) structure and written methods to serialize nodes to pages and back. In the next chapter we will start writing the BTree.
The entire source code including tests and more can be found at - https://github.com/nimrodshn/btree, PRs are most welcome!