System Design: Designing a Distributed Cache (Redis)

Case Study: Designing a Distributed Cache (Redis Scale)

In high-scale web architectures, the primary bottleneck is almost always the persistent storage engine (e.g., PostgreSQL, MongoDB). A Distributed Cache acts as a low-latency, high-throughput in-memory buffer, pooling the RAM of hundreds of servers to handle millions of queries per second (QPS) with sub-millisecond latencies. In a system design interview, this question evaluates your mastery of memory management, networking protocols, consistency models, and distributed state coordination.

1. Requirements & Core Constraints

Functional Requirements

Core Key-Value Store API: Support get(key), set(key, value, ttl_seconds), and delete(key).
Arbitrary Binary Payloads: Safely store string keys and binary blob values up to 10MB (e.g., serialized HTML fragments, JSON API payloads, session tokens).
Time-To-Live (TTL): Support automatic key expiration based on standard TTL configurations.
Cache Eviction: Automatically evict keys when the system runs out of physical memory.

Non-Functional Requirements (SLAs)

Sub-Millisecond Read Latency: P99 read latencies must remain under 1 millisecond.
High Egress Throughput: Scale horizontally to handle 10 Million QPS total.
High Availability: Provide 99.999% uptime via automatic failover and master-replica architectures.
Minimal Data Loss (Optional): Provide persistence strategies (like Redis RDB or AOF) as optional configurations, prioritizing latency over strict storage persistence.

Back-of-the-Envelope Capacity Estimates

Let's size a production cluster operating at 10 Million QPS scale:

Total Read QPS: 8,000,000 (80%).
Total Write QPS: 2,000,000 (20%).
Average Key Size: 100 Bytes.
Average Value Size: 1 Kilobyte (KB).
Target Cache Storage Capacity: 50 Terabytes (TB) of hot in-memory state.
Average RAM per Server: 128 Gigabytes (GB).

Server Count Calculation:

To avoid thrashing and allow OS operations, we utilize 80% of a cache node's RAM for raw data storage. $$\text{Usable RAM per Server} = 128 \text{ GB} \times 0.80 \approx 102.4 \text{ GB.}$$ $$\text{Primary Cache Servers Needed} = \frac{50 \text{ TB}}{102.4 \text{ GB}} \approx \mathbf{489 \text{ physical servers.}}$$ Adding a 1:1 Master-Replica layout for high availability and load distribution, we require a minimum of 978 physical servers in the cluster.

Network Throughput Sizing:

Egress Bandwidth (Reads): $$8,000,000 \text{ reads/sec} \times 1.1 \text{ KB/key-value} \approx 8.8 \text{ GB/sec} \approx \mathbf{70.4 \text{ Gbps continuous egress.}}$$
Ingress Bandwidth (Writes): $$2,000,000 \text{ writes/sec} \times 1.1 \text{ KB} \approx 2.2 \text{ GB/sec} \approx \mathbf{17.6 \text{ Gbps continuous ingress.}}$$ This heavy network load requires cache nodes to utilize 10/25GbE network interfaces and non-blocking I/O multiplexing.

2. API Design & Core Contracts

While REST APIs can act as a fallback, distributed caches utilize lightweight TCP-level binary protocols or simple text protocols (e.g., Redis RESP or Memcached ASCII) to avoid HTTP parsing overhead.

A. Memcached-Style Socket Command Set (TCP Level)

# Command syntax: set <key> <flags> <exptime> <bytes>\r\n<data_block>\r\n
set session_98234 0 3600 52\r\n
{"user_id":1920,"roles":["premium"],"tier":"enterprise"}\r\n

# Response:
STORED\r\n

# Command syntax: get <key>\r\n
get session_98234\r\n

# Response:
VALUE session_98234 0 52\r\n
{"user_id":1920,"roles":["premium"],"tier":"enterprise"}\r\n
END\r\n

B. Fallback JSON REST HTTP API

POST /v1/cache/keys/session_98234?ttl=3600
Authorization: Bearer <TOKEN>
Content-Type: application/octet-stream

{"user_id":1920,"roles":["premium"],"tier":"enterprise"}

3. High-Level Design (HLD)

The architecture splits operations across the Client Library Layer (routing and serialization) and the Cache Server Cluster (storage, replication, and eviction).

graph TD
    %% Client Stack
    subgraph Client Application Instance
        App[Application Logic] -->|1. Read Key 'A'| ClientLib[Cache Client Library]
        ClientLib -->|2. Hash Key & Resolve Node| HashingResolver[Consistent Hashing Resolver]
    end
    
    %% Cluster State Metadata
    Zookeeper[ZooKeeper / Raft Consensus Cluster] <--->|3. Gossip / Cluster Topology Updates| ClientLib
    
    %% Cache Servers (Consistent Hashing Ring mapping)
    subgraph Cache Cluster Rings
        HashingResolver -->|4. Route to S1 Master (Hash Ring Match)| Node1_M[Cache Server 1 - Master]
        HashingResolver -.->|Failover Read| Node1_R[Cache Server 1 - Replica]
        
        HashingResolver -->|Route to S2 Master| Node2_M[Cache Server 2 - Master]
        HashingResolver -.->|Failover Read| Node2_R[Cache Server 2 - Replica]
    end
    
    %% Replication Pipelines
    Node1_M -->|5. Async Replication| Node1_R
    Node2_M -->|5. Async Replication| Node2_R
    
    %% Persistence Layer (Optional)
    Node1_M -->|6. Async File Dump| Disk1[(RDB / AOF Persistent File)]

Consistent Hashing Ring with Virtual Nodes

In static sharding (node_index = hash(key) % N), adding or removing a cache node shifts key locations, causing a 100% cache miss storm that crashes the database origin.

Consistent Hashing maps both cache servers and keys onto a circular $360^\circ$ ring using a shared hash function (e.g., MurmurHash3).

Virtual Nodes: To prevent "Hot Spotting" (where one physical server gets a disproportionate share of the keys), each physical server is mapped to multiple logical "Virtual Nodes" (e.g., S1_v1, S1_v2, S1_v3) distributed evenly across the ring.

graph TD
    subgraph Consistent Hashing Ring Map
        v1_1["S1_v1 (0 degrees)"] --- v2_1["S2_v1 (90 degrees)"]
        v2_1 --- v1_2["S1_v2 (180 degrees)"]
        v1_2 --- v2_2["S2_v2 (270 degrees)"]
        v2_2 --- v1_1
    end
    
    KeyA["Key 'session_98' (hash maps to 45 degrees)"] -->|Clockwise Next Node| v2_1
    KeyB["Key 'user_882' (hash maps to 210 degrees)"] -->|Clockwise Next Node| v2_2

4. Low-Level Design (LLD) & Data Models

Eviction Data Structures

To implement an LRU (Least Recently Used) Cache with $O(1)$ operations, we combine two data structures:

Doubly Linked List: Keeps track of access recency. The most recently accessed node is placed at the head; the least recently accessed is at the tail.
Hash Map: Maps keys to Linked List nodes, enabling constant-time index lookups.

Compilable Low-Level Design (Java Concurrent LRU Implementation)

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ConcurrentLRUCache<K, V> {
    private final int capacity;
    private final ConcurrentHashMap<K, Node<K, V>> map;
    private final DoublyLinkedList<K, V> list;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public ConcurrentLRUCache(int capacity) {
        this.capacity = capacity;
        this.map = new ConcurrentHashMap<>();
        this.list = new DoublyLinkedList<>();
    }

    public V get(K key) {
        lock.writeLock().lock(); // Lock required to adjust list node ordering
        try {
            if (!map.containsKey(key)) {
                return null;
            }
            Node<K, V> node = map.get(key);
            list.moveToHead(node);
            return node.value;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void put(K key, V value) {
        lock.writeLock().lock();
        try {
            if (map.containsKey(key)) {
                Node<K, V> node = map.get(key);
                node.value = value;
                list.moveToHead(node);
            } else {
                if (map.size() >= capacity) {
                    Node<K, V> tail = list.removeTail();
                    if (tail != null) {
                        map.remove(tail.key);
                    }
                }
                Node<K, V> newNode = new Node<>(key, value);
                list.addToHead(newNode);
                map.put(key, newNode);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    private static class Node<K, V> {
        K key;
        V value;
        Node<K, V> prev;
        Node<K, V> next;

        Node(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private static class DoublyLinkedList<K, V> {
        private Node<K, V> head;
        private Node<K, V> tail;

        void addToHead(Node<K, V> node) {
            if (head == null) {
                head = node;
                tail = node;
            } else {
                node.next = head;
                head.prev = node;
                head = node;
            }
        }

        void moveToHead(Node<K, V> node) {
            if (node == head) return;
            
            if (node == tail) {
                tail = tail.prev;
                tail.next = null;
            } else {
                node.prev.next = node.next;
                node.next.prev = node.prev;
            }
            
            node.next = head;
            node.prev = null;
            head.prev = node;
            head = node;
        }

        Node<K, V> removeTail() {
            if (tail == null) return null;
            Node<K, V> removed = tail;
            if (head == tail) {
                head = null;
                tail = null;
            } else {
                tail = tail.prev;
                tail.next = null;
            }
            return removed;
        }
    }
}

5. Scaling Challenges & Bottlenecks

A. The Cache Stampede (Thundering Herd / Cache Meltdown)

When a highly active key (e.g., trending_video_id) expires, thousands of parallel threads experience a cache miss simultaneously. They all query the heavy PostgreSQL database at once, causing the database load to spike, leading to cascading application timeouts.

Mitigation Strategies:

Mutex Gating (Locking): When a client misses the cache, it acquires a distributed mutex (e.g., using Redis SETNX). Only the singular worker acquiring the lock is permitted to query the database and write back the cache. All subsequent threads wait or read a transient stale value.
Probabilistic Early Expiration (XFetch): Instead of a hard TTL expiry, clients utilize a probabilistic formula to recalculate the key expiry locally as it nears expiration. If a background check returns true, a background worker proactively refreshes the key before it physically expires, maintaining 100% warm read hits for hot keys.

B. Hot Spot Sharding & Gossip Management

If a global celebrity (like a football star) creates an entry, the corresponding key maps to a single physical cache node on the Consistent Hashing ring. This node saturates its network interfaces while other ring nodes remain idle.

Mitigation Strategy:

We implement a Dual-Tier Cache Topology:

Layer 1 Local Memory Cache: The client libraries maintain a highly restricted in-memory cache (e.g., Guava Cache with a 5-second TTL) for recognized high-traffic keys. This intercepts the reads before they traverse the network, preventing cache node saturation.

6. Real-World Trade-offs

A. Eviction Algorithm Options: LRU vs. LFU vs. TinyLFU

LRU (Least Recently Used): Easy to implement, but vulnerable to Scan Pollution. If an offline batch job scans all keys in the DB, it pollutes the cache with useless data, evicting warm, frequently-used keys.
LFU (Least Frequently Used): Protects against scans, but keys that were highly active in the past (e.g., a Black Friday banner) retain high frequency scores and remain in memory indefinitely, blocking new hot keys.
TinyLFU: Solves this by utilizing a Bloom-Filter-like count-min sketch to maintain a decay factor. It asks: "Is the incoming key more likely to be requested than the least popular resident?" It yields excellent hit ratios at the expense of higher CPU memory tracking overhead.

B. Write Consistency Strategies

Cache-Aside (Lazy Loading): Application coordinates cache updates. Standard, but introduces race conditions if writing to DB and invalidating cache are not atomic.
Write-Through: Cache writes synchronous payloads to DB. Eliminates stale data, but adds write latency penalty.
Write-Back (Write-Behind): Client writes strictly to memory; background worker flushes writes to database asynchronously.
- Trade-off: Insanely fast write throughput, but risks data loss if the cache server experiences a hardware crash before flushing.

7. Failure Scenarios & Fault Tolerance

A. Automatic Node Failover (Master-Replica Election)

To handle physical server crashes:

We group servers into Replication Hubs managed via consensus layers (like Redis Sentinel or ZooKeeper).
Sentinel nodes execute gossip-based polling. If a Master node stops responding to heartbeats, the sentinels elect the Replica node with the highest offset (least replication lag) to become the new Master.
The GTM (Global Traffic Manager) updates the Consistent Hashing Ring configurations on all active client instances.

B. Consistent Hashing Split-Brain

In the event of a network partition, half the client libraries might lose connection to Sentinel nodes, resulting in diverging hash ring topography configurations.

Solution: Sentinel consensus must establish a quorum ($> N/2$) to coordinate ring updates. If a partitioned client cannot reach the majority quorum registry, it restricts operations to Read-Only mode to avoid writing key updates to diverging nodes.

8. Staff Engineer Perspective (Operational Deep Dive)

Trade-off

Single-Threaded Event Loop (Redis) vs. Multi-Threaded Architecture (Memcached)

A common design choice is picking between a single-threaded architecture (like Redis) and a multi-threaded architecture (like Memcached).

A single-threaded event loop utilizing non-blocking multiplexed I/O (via epoll or kqueue) is highly elegant. It eliminates CPU thread context switching, avoids lock overhead, and guarantees absolute atomicity for custom Lua scripts.

However, on massive bare-metal servers featuring 64 or 128 cores, a single-threaded cache only utilizes a single CPU core, leaving 98% of the processor idle.

As a Staff Engineer, to scale a single-threaded in-memory engine across large servers, you must run Multiple Independent Cache Instances on distinct port ranges on the same hardware, utilizing CPU pin-affinity (taskset) to bind each instance to its own physical core, maximizing silicon efficiency.

9. Candidate Verbal Script (Mock Interview Guide)

Below is an authentic verbal delivery demonstrating elite systems mastery during a live technical design loop:

Candidate: "To design a distributed cache capable of handling 10 Million QPS, I must protect the database origin from cache miss storms and eliminate Single Points of Failure. My primary routing engine will utilize Consistent Hashing with Virtual Nodes. Each physical server will map to 256 virtual points on the ring, ensuring uniform hash key distribution and preventing hot spots. By using consistent hashing, adding or removing nodes only affects a fraction ($1/N$) of our keys, preventing cascading database failures.

For low-level execution, if I need simple, extreme multi-threaded throughput, I'll lean on a Memcached-style architecture. However, if we require complex data structures and atomicity, I'll implement a Redis-style single-threaded event loop utilizing non-blocking multiplexed epoll sockets. To maximize the 128GB of RAM on our 64-core bare-metal servers, I'll deploy multiple independent cache daemons pinned to separate CPU cores via thread affinity.

To mitigate Cache Stampedes on viral keys, I will implement Probabilistic Early Expiration (the XFetch algorithm) within the client library. As a hot key approaches expiration, the client library uses a probabilistic decay formula to trigger an asynchronous background fetch to warm the cache before it physically expires. For other keys, I'll employ mutex-locked database fetching. This two-pronged strategy completely shields our database origin from ever experiencing thundering herd crashes."