System Design: Designing a Local-First Key-Value Store (LevelDB/RocksDB)

System Design: Designing a Local-First Key-Value Store

While distributed databases get the industry spotlight, their underlying performance is built upon the foundation of their local storage engines. Modern highly scalable databases—including Apache Cassandra, RocksDB (Meta), LevelDB (Google), and ClickHouse—abandon traditional B-Tree configurations in favor of a Log-Structured Merge-Tree (LSM-Tree) architecture. LSM-Trees optimize random writes by converting them into sequential memory and disk appends, leveraging background merge-sorting to maintain read performance.

1. Requirements & Core Constraints

Functional Requirements

Key-Value API: Expose deterministic interfaces to Put(key, value), Get(key), and Delete(key).
Range Iteration: Support sequential range queries utilizing iterators (Seek(key), Next(), Prev(), Valid()).
Transaction Consistency: Support atomic atomic batch writing groups (all-or-nothing writes).

Non-Functional SLAs

High Write Throughput: Sustain 50,000 write operations per second on standard commodity NVMe SSDs.
Low Read Latency: Limit p99 read operations to < 5 milliseconds under active write workloads.
Crash Durability: Guarantee zero data loss for committed writes under sudden power outages or software crashes.
Resource Limits: Maintain a strictly bounded, user-configurable memory footprint for local buffers and block caches.

Back-of-the-Envelope Capacity Estimations & Math

To optimize physical layout allocations, we size our key-value engine for 1 Billion Keys:

1. Bloom Filter Memory Allocation Math

To bypass expensive disk seeks, each disk table (SSTable) carries an in-memory Bloom filter. We design for a target $1%$ false positive rate ($p = 0.01$):

The mathematical optimal bit-allocation formula is: $$m = - \frac{n \ln(p)}{(\ln(2))^2}$$ Where:
- $n$ is the number of keys.
- $p$ is the false positive rate ($0.01$).
- For $p = 0.01$, this reduces to approximately $9.56\text{ bits per key}$ (we round up to $10\text{ bits/key}$).
RAM Sizing for 1 Billion Keys: $$\text{Total Bloom Filter Bits} = 1,000,000,000 \times 10\text{ bits} = 10\text{ Billion Bits} \approx 1.25\text{ Gigabytes (GB)}$$

Operational Rule: Allocating a mere $1.25\text{ GB}$ of RAM allows us to skip 99% of unnecessary disk files for non-existent keys.

2. Write-Ahead Log (WAL) Bandwidth Sizing

Write Rate: $50,000\text{ writes/sec}$.
Average Entry Size: $1\text{ KB}$ ($256\text{ Bytes}$ Key, $768\text{ Bytes}$ Value).
WAL Ingestion Load: $$\text{Sequential Write Load} = 50,000\text{ QPS} \times 1\text{ KB} = 50\text{ MB/sec} \approx 400\text{ Mbps}$$ This sequential write rate is highly sustainable by standard SSDs.

3. Write & Space Amplification Budgets

Space Amplification Factor (SAF): Ratio of total disk file footprint to raw user data size. Levelled compaction maintains SAF $< 1.2\times$, while Size-Tiered compaction can spike SAF up to $> 2.0\times$ due to redundant key versions.
Write Amplification Factor (WAF): Ratio of total bytes written to storage by compaction compared to raw bytes written by user applications. We estimate average WAF around $10\times$ to $30\times$ for Leveled compaction models, which must be budgeted against SSD write endurance limits.

2. API Design & Core Contracts

Since a local key-value engine is implemented as an embedded library (like LevelDB), the core contract is expressed as programming interface signatures.

Embedded Storage Engine Interface (Java/C++ Style)

package com.codesprintpro.kvstore;

import java.io.IOException;

public interface LocalKeyValueStore {
    
    /**
     * Writes a key-value pair to the store.
     * Thread-safe. Performs WAL append and MemTable insertion.
     */
    void put(byte[] key, byte[] value) throws IOException;

    /**
     * Retrieves the value associated with a key.
     * Returns null if key is not found.
     */
    byte[] get(byte[] key) throws IOException;

    /**
     * Logically deletes a key-value pair.
     * Appends a special marker (tombstone) to the store.
     */
    void delete(byte[] key) throws IOException;

    /**
     * Executes atomic batch writes to prevent partial failures.
     */
    void writeBatch(WriteBatch batch) throws IOException;

    /**
     * Returns a new Iterator instance.
     */
    KVIterator iterator();

    interface KVIterator {
        void seek(byte[] targetKey);
        void next();
        void prev();
        boolean valid();
        byte[] key();
        byte[] value();
        void close();
    }
}

3. High-Level Design (HLD)

The LSM-Tree architecture relies on splitting data into an active in-memory buffer (MemTable), an append-only commit log (WAL), and hierarchically partitioned immutable disk tables (SSTables).

LSM Write Path Architecture

Incoming writes are written sequentially to the Write-Ahead Log (WAL) on disk for durability, and simultaneously inserted into the in-memory MemTable (a sorted data structure like a SkipList). Writes return immediately, completely bypassing random disk I/O.

graph TD
    %% Styling
    classDef client fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef memory fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
    classDef disk fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
    classDef worker fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;

    Client[User Application]:::client -->|Put/Delete Event| API[Store Ingestion API]:::memory
    
    subgraph RAM Buffer
        API -->|1. Insert Key-Value| MemTable[Active MemTable Skiplist]:::memory
    end

    subgraph Durable Disk Log
        API -->|2. Append-Only Write| WAL[(Write-Ahead Log)]:::disk
    end

    subgraph Background Flush Loop
        MemTable -->|When Full: Mark Read-Only| ImmutableMemTable[Immutable MemTable]:::memory
        ImmutableMemTable -->|Flush Worker Thread| Flush[SSTable Flush Processor]:::worker
        Flush -->|Sequential Flush Write| L0[Level 0 SSTables on Disk]:::disk
        Flush -.->|Truncate / Clear Log| WAL
    end

LSM Read Path Architecture

Reads perform a hierarchical search starting from the lowest latency memory tier and falling back to increasingly structured disk files.

graph TD
    classDef memory fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
    classDef disk fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
    classDef filter fill:#efebe9,stroke:#5d4037,stroke-width:2px;

    Client[Get Request] -->|1. Probe RAM| MemTable[Active MemTable]:::memory
    MemTable -->|Key Miss| ImmutableMemTable[Immutable MemTables]:::memory
    
    ImmutableMemTable -->|Key Miss| Cache[Block Cache RAM]:::memory
    
    subgraph Level 0 SSTables (Disk - Overlapping Keys)
        Cache -->|Key Miss| L0Files[Scan All L0 SSTables]:::disk
    end

    subgraph Level 1..N SSTables (Disk - Partitioned, Non-Overlapping Keys)
        L0Files -->|Key Miss| L1Bloom[Check Level 1 Bloom Filters]:::filter
        L1Bloom -->|Filter Hit: Search Binary Index| L1SST[Read L1 SSTable Block]:::disk
        L1Bloom -->|Filter Miss: Skip File| L2Bloom[Check Level 2 Bloom Filters]:::filter
    end

4. Low-Level Design (LLD) & Data Models

SSTable File Binary Format

Once flushed to disk, an SSTable is completely immutable. The physical file is divided into blocks:

+-----------------------------------+
|          Data Block 0             |
+-----------------------------------+
|          Data Block 1             |
+-----------------------------------+
|               ...                 |
+-----------------------------------+
|          Filter Block             |  <-- Bloom Filters data
+-----------------------------------+
|          Meta Index Block         |  <-- Pointers to filters
+-----------------------------------+
|          Index Block              |  <-- Pointers to Data Blocks
+-----------------------------------+
|          Footer Block             |  <-- Fixed 48-byte offset references
+-----------------------------------+

Footer Block: Resides at the absolute end of the file. It is a fixed-size ($48\text{ Bytes}$) block containing index block metadata, meta index offset limits, and a specific magic number byte sequence for validation.

Low-Level Compilable Code: Concurrent SkipList (MemTable Engine)

The MemTable requires an in-memory data structure that supports concurrent thread access and keeps keys sorted. RocksDB utilizes a lock-free SkipList because it avoids the massive rebalancing lock contention associated with AVL or Red-Black trees.

package com.codesprintpro.kvstore;

import java.util.Random;

/**
 * Thread-safe SkipList implementation demonstrating the core mechanics
 * of an LSM-Tree MemTable engine.
 */
public class ConcurrentSkipList {
    private static final int MAX_LEVEL = 16;
    private final Node head = new Node(null, null, MAX_LEVEL);
    private final Random random = new Random();
    private int levelCount = 1;

    public static class Node {
        public String key;
        public String value;
        public Node[] next;

        public Node(String key, String value, int level) {
            this.key = key;
            this.value = value;
            this.next = new Node[level];
        }
    }

    /**
     * Synchronized retrieval of key value.
     */
    public synchronized String get(String key) {
        if (key == null) return null;
        Node curr = head;
        // Traverse skip levels from top down
        for (int i = levelCount - 1; i >= 0; i--) {
            while (curr.next[i] != null && curr.next[i].key.compareTo(key) < 0) {
                curr = curr.next[i];
            }
        }
        if (curr.next[0] != null && curr.next[0].key.equals(key)) {
            return curr.next[0].value;
        }
        return null;
    }

    /**
     * Inserts or updates a key-value pair in sorted skip lists.
     */
    public synchronized void put(String key, String value) {
        if (key == null) return;
        int level = randomLevel();
        Node newNode = new Node(key, value, level);
        Node[] update = new Node[MAX_LEVEL];
        Node curr = head;

        // Find update points at each layer
        for (int i = levelCount - 1; i >= 0; i--) {
            while (curr.next[i] != null && curr.next[i].key.compareTo(key) < 0) {
                curr = curr.next[i];
            }
            update[i] = curr;
        }

        // Adjust node linkage
        for (int i = 0; i < level; i++) {
            if (update[i] == null) {
                newNode.next[i] = null;
                head.next[i] = newNode;
            } else {
                newNode.next[i] = update[i].next[i];
                update[i].next[i] = newNode;
            }
        }

        if (level > levelCount) {
            levelCount = level;
        }
    }

    /**
     * Generates a random insertion height using geometric distribution.
     */
    private int randomLevel() {
        int level = 1;
        while (random.nextInt(2) == 1 && level < MAX_LEVEL) {
            level++;
        }
        return level;
    }
}

5. System Trade-offs & CAP Posture

1. LSM-Tree vs B-Tree (Write vs Read Optimized)

B-Trees (e.g. MySQL InnoDB): Update pages in-place on disk. Leads to massive random disk writes and high WAF. The trade-off is that reads are fast because a key exists in exactly one leaf page, eliminating multiple file seeks.
LSM-Trees (e.g. RocksDB): Append all writes sequentially in memory/logs. Achieves 10x-50x higher write throughput. The trade-off is high read amplification; a key may exist across multiple SSTables, requiring Bloom filters and caches to keep reads viable.

2. Compaction Trade-offs: Size-Tiered vs Leveled Compaction

Size-Tiered Compaction (STCS): Focuses smaller SSTables of similar sizes and merges them into larger files.
- Pros: Low write amplification.
- Cons: Extremely high Space Amplification ($>2.0\times$) and unpredictable disk consumption peaks.
Leveled Compaction (LCS): Organizes disk files into distinct levels ($0, 1, 2...N$). Level $L$ holds at most $10^L\text{ MB}$. Levels $1..N$ guarantee that key ranges do not overlap.
- Pros: Minimal Space Amplification ($<1.2\times$) and fast point lookups.
- Cons: Severe Write Amplification due to background cascade merge-sorting.

6. Failure Scenarios & High-Availability Resilience

1. MemTable Crash & Recovery Playbook

If the server loses power, all keys inside volatile RAM MemTables are instantly lost.

Resilience: The Write-Ahead Log (WAL) acts as a persistent log. Every batch write is flushed sequentially to SSD before the MemTable is updated. During reboot, the database replays WAL logs, recreating active MemTables to recover up-to-the-millisecond transaction consistency.

2. Write Stall Cascades

When the write traffic QPS is consistently higher than the background disk compaction throughput, Level 0 files will pile up.

Resilience: If Level 0 files exceed a safe threshold (e.g. RocksDB default of 20 files), the storage engine triggers Write Stalling. It intentionally introduces micro-delays (rate-limiting incoming requests) or temporarily pauses writes entirely, preventing disk space depletion or p99 read latency spikes.

7. Scaling Challenges & System Bottlenecks

1. Bloom Filter False Positive Tuning

If our Bloom filters are sized poorly (e.g. 5 bits per key instead of 10), false positive rates climb from $1%$ to $10%$. This causes the search path to execute unnecessary disk reads, hitting physical drive boundaries and increasing P99 search latencies.

Mitigation: Implement dynamic tuning of Bloom filters based on SSTable level. Give lower levels (which hold the most frequently read, fresh keys) larger bit allocations (e.g. 12-14 bits per key) to minimize read amplification where it matters most.

2. Disk I/O Saturation during Compaction Spikes

During heavy compaction cycles, background merge-sort worker threads consume massive I/O bandwidth, starving client-facing read threads.

Mitigation: Implement Compaction Rate Limiting. Bounded maximum I/O read/write thresholds are allocated strictly to compaction tasks (e.g. max $50\text{ MB/s}$ compaction bandwidth), preserving NVMe disk channels for customer queries.

8. Staff Engineer Perspective (Operational Deep Dive)

9. Candidate Verbal Script (Interview Guide)

Part 1: Explaining LSM Write Performance

Interviewer: "Why would I choose an LSM-tree based database like RocksDB over a traditional B-Tree database like MySQL InnoDB for a high-write analytics log engine?"

Candidate: "In a write-heavy platform, B-Trees become severely throttled by disk bottlenecks. B-Trees update database pages in-place on disk. When a thread modifies a row, the database must write that specific random page to storage. This results in heavy random disk I/O, which saturates SSD controller channels and increases WAF due to writing entire $16\text{ KB}$ pages for single row changes. An LSM-Tree converts random writes into highly efficient sequential writes. Incoming operations are appended to an in-memory sorted MemTable and sequentially written to a Write-Ahead Log. Because writes are logged sequentially rather than scattered randomly across disk pages, we achieve up to a 10x-50x increase in write throughput. Background threads subsequently merge-sort these sorted tables into structured, read-optimized layers, shifting the performance cost away from client write paths."

Part 2: Negotiating the Compaction Latency Trade-offs

Interviewer: "If Leveled Compaction causes high Write Amplification, why don't we just use Size-Tiered Compaction to save SSD lifespan?"

Candidate: "While Size-Tiered Compaction reduces Write Amplification and prolongs SSD life, it introduces severe operational trade-offs: high Space Amplification and unpredictable read latencies. Because Size-Tiered merges files of similar sizes, a single key can exist concurrently across multiple large files. This means a point lookup must probe several SSTables, which degrades p99 read latencies. Furthermore, Size-Tiered compaction requires keeping up to $50%$ of your disk space empty. If you merge four $100\text{ GB}$ files, you need $400\text{ GB}$ of free space during the merge. By choosing Leveled Compaction, we guarantee that key ranges do not overlap within levels $1$ through $N$. A point lookup only needs to probe a single file per level, bound by Bloom filters. This reduces our Space Amplification from $100%$ down to $10%$-$20%$. In a production infrastructure at MANG scale, the cost of extra SSD wear is often outweighed by the significant savings in disk capacity and reliable read latencies."

System Design: Designing a Local-First Key-Value Store (LevelDB/RocksDB)

From vague architecture answers to staff-level trade-off thinking.