System Design: Designing a Content Delivery Network (CDN)

Case Study: Designing a Content Delivery Network (CDN)

Mental Model

A Content Delivery Network is not just a bunch of distributed file servers. It is a global, intelligent reverse-proxy fabric that uses IP Anycast routing, hierarchical caching tiers, and edge computing to move the internet's static and dynamic state to within a few milliseconds of the user.

Requirements & System Constraints

A global CDN must serve static files (images, JS, CSS, video segments) and dynamic web pages at a massive scale.

Functional Requirements

Edge Static Caching: Cache static files at the network edge and serve them with ultra-low latency.
Dynamic Content Optimization: Enable routing of dynamic content (e.g., API requests) through optimized routing paths back to the origin server.
Active Cache Purging: Support immediate programmatic cache invalidation across the entire global network within 5 seconds of an API request.
Cache-Key Personalization: Allow customizing cache keys based on headers (e.g., device type, compression formats like gzip/brotli).
Edge Security (WAF): Mitigate Layer 7 attacks (SQL Injection, XSS) and block volumetric DDoS attacks at the edge.

Non-Functional SLAs

Latency: P99 cache hit latency must be under 20ms. P99 cache miss latency must be under 150ms (origin fetch).
Scale: The system must sustain 1 Million requests per second (QPS) globally with peak capacity up to 3 Million QPS.
Bandwidth: Maintain a massive concurrent network egress capacity of 800 Gigabits per second (Gbps).
High Availability: Enforce 99.999% availability for request routing and edge POP delivery.

Back-of-the-Envelope Capacity Estimates

Let's estimate the storage, memory, and bandwidth required to support a global CDN at 1 Million QPS.

1. Network Bandwidth & Egress

Average Object Size: $100\text{ KB}$ (representing images, JS/CSS files, and short media clips).
Average Throughput: $1,000,000\text{ QPS}$
Total Egress Bandwidth: $1,000,000 \times 100\text{ KB} = 100,000,000\text{ KB/s} = 100\text{ Gigabytes per second (GB/s)}$
In bits: $100\text{ GB/s} \times 8 = 800\text{ Gbps}$ of egress capacity required at the sustained load of 1 Million QPS. At the 3 Million QPS peak, the same arithmetic gives roughly $2.4\text{ Tbps}$.
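The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the constant and function names are illustrative, and it assumes decimal units (1 KB = 1,000 bytes) and that every request returns an average-sized object.

```python
# Back-of-the-envelope egress estimate (decimal units: 1 KB = 1,000 bytes).
AVG_OBJECT_BYTES = 100 * 1_000   # 100 KB average object
SUSTAINED_QPS = 1_000_000        # sustained load from the SLA
PEAK_QPS = 3_000_000             # peak load from the SLA


def egress_gbps(qps: int, object_bytes: int = AVG_OBJECT_BYTES) -> float:
    """Return the egress needed, in gigabits per second, for a request rate."""
    bytes_per_second = qps * object_bytes
    return bytes_per_second * 8 / 1e9


print(f"Sustained: {egress_gbps(SUSTAINED_QPS):,.0f} Gbps")
print(f"Peak:      {egress_gbps(PEAK_QPS):,.0f} Gbps")
```

The sustained figure comes out at 800 Gbps and the peak at 2,400 Gbps, so the 800 Gbps target covers the average case only.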

2. Storage Capacity (Edge POP cache)

Total Catalog Size at Origin: $10\text{ Petabytes (PB)}$ of content.
Active Working Set (10% standard cache target): $1\text{ PB}$ of popular assets.
Number of Global Point of Presence (POP) locations: $50$ major cities worldwide.
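Storage per POP: Spreading the working set evenly across POPs gives $1\text{ PB} / 50 = 20\text{ Terabytes (TB)}$ of high-speed NVMe SSD storage per POP. This assumes each POP only needs the roughly 2% slice of the working set that is popular in its region; a POP that had to hold the entire working set would need the full $1\text{ PB}$.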
DRAM Index Size: To keep lookups fast, cache keys are indexed in RAM.
- Average Cache-Key size (SHA-256 hash + metadata): $64\text{ bytes}$.
- Total cached items per POP: $20\text{ TB} / 100\text{ KB} = 200\text{ Million items}$.
- RAM required for index: $200,000,000 \times 64\text{ bytes} \approx 12.8\text{ GB}$ of high-speed DRAM per POP.
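The storage and index figures follow from the same kind of arithmetic. The sketch below is illustrative only: `pop_sizing` is a hypothetical helper, sizes use decimal units, and it assumes the working set is split evenly across POPs.

```python
TB = 10**12  # decimal terabyte
KB = 10**3   # decimal kilobyte


def pop_sizing(working_set_bytes, pop_count, avg_object_bytes, index_entry_bytes):
    """Return (storage bytes, cached item count, index bytes) for one POP."""
    storage = working_set_bytes // pop_count   # even split across POPs
    items = storage // avg_object_bytes        # objects that fit on disk
    index = items * index_entry_bytes          # DRAM needed for the key index
    return storage, items, index


storage, items, index = pop_sizing(
    working_set_bytes=1_000 * TB,   # 1 PB active working set
    pop_count=50,
    avg_object_bytes=100 * KB,
    index_entry_bytes=64,
)
print(f"NVMe per POP:  {storage / TB:.0f} TB")
print(f"Items per POP: {items:,}")
print(f"Index per POP: {index / 10**9:.1f} GB")
```

Changing `pop_count` or `avg_object_bytes` shows how sensitive the DRAM index is to object size: halving the average object doubles the number of index entries.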

API Design & Core Contracts

The CDN exposes developer APIs for content purge (invalidation) and prefetching, as well as handling client requests.

1. Invalidate (Purge) Cached Content

Allows developers to immediately invalidate one or more cached files across all edge locations.

POST /api/v1/purge

Request Headers:

Authorization: Bearer cdn_token_983274981

Request Payload:

{
  "paths": [
    "/static/assets/logo.png",
    "/static/styles/*.css"
  ],
  "purge_type": "wildcard",
  "async": true
}

Response Payload (Success):

{
  "status": "success",
  "purge_job_id": "job_purge_8932478",
  "estimated_propagation_ms": 3500,
  "created_at": 1779435420000
}
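A client for this contract might look like the sketch below. It only builds the request and does not send it. The host `api.cdn.example.com` and the `"exact"` purge type are assumptions made for illustration; the contract above only shows `"wildcard"`.

```python
import json
import urllib.request

# Hypothetical API host; only the /api/v1/purge path comes from the contract.
PURGE_ENDPOINT = "https://api.cdn.example.com/api/v1/purge"


def build_purge_request(paths, token, run_async=True):
    """Build (but do not send) a POST request for the purge endpoint."""
    if not paths:
        raise ValueError("at least one path is required")
    # Assumption: any path containing '*' makes the job a wildcard purge,
    # otherwise a hypothetical "exact" type is used.
    purge_type = "wildcard" if any("*" in p for p in paths) else "exact"
    body = json.dumps(
        {"paths": list(paths), "purge_type": purge_type, "async": run_async}
    )
    return urllib.request.Request(
        PURGE_ENDPOINT,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_purge_request(
    ["/static/assets/logo.png", "/static/styles/*.css"],
    token="cdn_token_983274981",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request, timeout=5)`, followed by reading `purge_job_id` from the JSON response.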

2. Cache Warm-up (Prefetch)

Allows developers to pre-warm the cache by pulling assets into edge POPs before a major launch event.

POST /api/v1/prefetch

Request Payload:

{
  "urls": [
    "https://example.com/assets/banner_hero.webp",
    "https://example.com/assets/intro_video.mp4"
  ],
  "target_pops": ["US-EAST", "EU-WEST", "AP-SOUTH"]
}

Response Payload (Success):

{
  "status": "in_progress",
  "prefetch_job_id": "job_warm_0923147",
  "total_urls_queued": 2
}

High-Level Design (HLD)

To achieve low-latency edge delivery, we split the architecture into a Global Request Router and a Hierarchical Cache Infrastructure.

1. Request Routing and BGP Anycast Flow

This flow details how a user request is directed to the nearest physical Edge POP.

graph TD
    Client[User Client] -->|1. DNS Query: static.example.com| AnycastDNS[Anycast DNS Server]
    AnycastDNS -->|2. Resolve nearest POP IP| Client
    Client -->|3. HTTP GET request| EdgeRouter[BGP Anycast Edge Router]
    
    subgraph EdgePOP ["Edge POP (Nearest Location)"]
        EdgeRouter -->|4. Route to| LoadBalancer[Edge L4 Load Balancer]
        LoadBalancer -->|5. Check Cache| L1Cache["Edge L1 Cache (DRAM/SSD)"]
    end
    
    L1Cache -->|"Cache Hit (Under 20ms)"| Client
    L1Cache -->|Cache Miss| L2Cache[Regional L2 Cache]

2. Hierarchical Cache Miss Resolution Flow

When a file is missing from the local L1 cache, the CDN queries regional layers to avoid saturating the primary origin server.

graph TD
    L1[Edge POP L1 Cache] -->|1. Cache Miss| L2[Regional L2 Cache POP]
    L2 -->|2. Cache Miss| OriginShield[Regional Origin Shield Server]
    OriginShield -->|3. Check SSD Cache| OSStore[(Origin Shield SSD Store)]
    
    OSStore -->|4. Cache Miss| Origin[Origin S3 Storage / Application Server]
    Origin -->|5. Return File + Cache-Control Headers| OriginShield
    OriginShield -->|6. Populate Shield Cache| OSStore
    OriginShield -->|7. Return File| L2
    L2 -->|8. Populate L2 Cache and Return File| L1
    L1 -->|9. Populate L1 Cache and Serve to User Client| Client[User Client]
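The miss-resolution flow can be sketched as a walk over an ordered list of tiers. This is a simplified model, not a proxy implementation: `CacheTier` and `fetch_through_tiers` are hypothetical names, each tier is a plain dictionary, and TTLs and Cache-Control handling are left out.

```python
from typing import Callable, Dict, List, Tuple


class CacheTier:
    """One cache layer (edge L1, regional L2, origin shield) as a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}


def fetch_through_tiers(
    key: str,
    tiers: List[CacheTier],
    origin_fetch: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Look up key nearest-tier-first and fill every tier that missed."""
    missed: List[CacheTier] = []
    for tier in tiers:
        if key in tier.store:
            body, served_by = tier.store[key], tier.name
            break
        missed.append(tier)
    else:
        # Every tier missed, so the origin is contacted exactly once.
        body, served_by = origin_fetch(key), "origin"
    # Populate the tiers that missed on the way back to the client.
    for tier in missed:
        tier.store[key] = body
    return body, served_by


tiers = [CacheTier("L1"), CacheTier("L2"), CacheTier("shield")]
print(fetch_through_tiers("/logo.png", tiers, lambda k: b"png-bytes")[1])
print(fetch_through_tiers("/logo.png", tiers, lambda k: b"png-bytes")[1])
```

The first call is served by the origin and fills all three tiers; the second is served by L1 without touching the upstream layers.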

Low-Level Design (LLD) & Core Components

Let's dissect what happens inside a single Edge Point of Presence (POP) when a request arrives.

The Anatomy of an Edge POP

L4 Load Balancer (Maglev/IPVS): Distributes inbound TCP packets across a pool of L7 reverse proxies.
L7 Reverse Proxy & Cache Engine (Varnish/Nginx): Terminates TLS connections, parses HTTP request headers, and looks up the asset in the local cache.
Edge Cache-Key Builder: Enforces query parameter normalization, sorting, and header hashing to locate the exact cached resource.
Local Storage Tier:
- L1 DRAM Cache: Stores extremely popular hot assets (metadata, tiny text assets).
- L1 SSD Cache: Stores larger objects (images, webp files, small video chunks) on fast NVMe drives.

Edge Cache-Key Builder Implementation

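Two recurring problems at the edge are Cache Poisoning and low cache-hit ratios. Hit ratios suffer when equivalent requests map to different keys, for example through unsorted query parameters (e.g., ?a=1&b=2 vs ?b=2&a=1) or redundant tracking parameters (e.g., ?utm_source=twitter). A deterministic, normalized cache key removes this fragmentation and narrows the inputs an attacker can vary.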

Below is a runnable Python sketch of a normalized CDN Edge Cache-Key Builder. It lowercases only the scheme and host, because URL paths are case-sensitive on many origins.

import hashlib
import urllib.parse
from typing import Dict, List, Optional

class CDNCacheKeyBuilder:
    def __init__(self, ignored_params: Optional[List[str]] = None, whitelisted_headers: Optional[List[str]] = None):
        """
        Initialize the Edge Cache-Key builder.
        
        :param ignored_params: List of query parameters to strip (e.g., tracking tags)
        :param whitelisted_headers: List of headers to include in cache key (e.g., Accept-Encoding)
        """
        self.ignored_params = set(ignored_params or ["utm_source", "utm_medium", "utm_campaign", "fbclid"])
        self.whitelisted_headers = [h.lower() for h in (whitelisted_headers or ["accept-encoding", "user-agent-device"])]

    def normalize_url(self, raw_url: str) -> str:
        """
        Normalizes a URL by lowercasing the scheme and host.
        """
        parsed = urllib.parse.urlparse(raw_url)
        normalized_host = parsed.netloc.lower()
        normalized_path = parsed.path
        
        # Ensure path trailing slash consistency
        if len(normalized_path) > 1 and normalized_path.endswith("/"):
            normalized_path = normalized_path[:-1]
            
        return f"{parsed.scheme.lower()}://{normalized_host}{normalized_path}"

    def normalize_query_params(self, query_string: str) -> str:
        """
        Normalizes query parameters by sorting them alphabetically and removing ignored parameters.
        """
        if not query_string:
            return ""
            
        parsed_params = urllib.parse.parse_qsl(query_string, keep_blank_values=True)
        filtered_params = [
            (k, v) for k, v in parsed_params 
            if k not in self.ignored_params
        ]
        
        # Sort alphabetically by key, then by value
        sorted_params = sorted(filtered_params, key=lambda x: (x[0], x[1]))
        
        if not sorted_params:
            return ""
            
        return urllib.parse.urlencode(sorted_params)

    def extract_whitelisted_headers(self, headers: Dict[str, str]) -> str:
        """
        Extracts and normalizes whitelisted headers to allow content negotiation (e.g. gzip vs brotli).
        """
        header_fingerprints = []
        normalized_headers = {k.lower(): v for k, v in headers.items()}
        
        for header_name in self.whitelisted_headers:
            if header_name in normalized_headers:
                header_value = normalized_headers[header_name]
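                # Normalize Accept-Encoding to prevent cache fragmentation.
                # Compare whole tokens (q-values are ignored) so a substring
                # of another encoding name cannot match by accident.
                if header_name == "accept-encoding":
                    encodings = {
                        token.split(";")[0].strip().lower()
                        for token in header_value.split(",")
                    }
                    if "br" in encodings:
                        header_value = "br"
                    elif "gzip" in encodings:
                        header_value = "gzip"
                    else:
                        header_value = "identity"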
                header_fingerprints.append(f"{header_name}:{header_value}")
                
        return ";".join(sorted(header_fingerprints))

    def generate_cache_key(self, raw_url: str, query_string: str, headers: Dict[str, str]) -> str:
        """
        Generates a secure, cryptographically hashed Cache-Key for edge lookup.
        """
        normalized_url = self.normalize_url(raw_url)
        normalized_query = self.normalize_query_params(query_string)
        header_fingerprint = self.extract_whitelisted_headers(headers)
        
        # Compose final signature string
        signature_base = f"{normalized_url}|{normalized_query}|{header_fingerprint}"
        
        # Compute SHA-256 Hash
        hash_digest = hashlib.sha256(signature_base.encode("utf-8")).hexdigest()
        
        return f"cdn_key_{hash_digest}"

# Example Usage & Verification:
if __name__ == "__main__":
    builder = CDNCacheKeyBuilder(
        ignored_params=["utm_source", "gclid"],
        whitelisted_headers=["Accept-Encoding", "X-Device-Type"]
    )
    
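    # Example 1: Mixed-case host, trailing slash, and a tracking parameter.
    # The path casing matches Example 2 because paths are case-sensitive
    # and the builder deliberately leaves them untouched.
    url_1 = "https://Example.com/static/images/banner.png/"
    query_1 = "utm_source=facebook&b=2&a=1"
    headers_1 = {"Accept-Encoding": "br, gzip, deflate", "X-Device-Type": "mobile"}
    
    # Example 2: Equivalent URL with reordered parameters and different header casing
    url_2 = "https://example.com/static/images/banner.png"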
    query_2 = "a=1&b=2&utm_source=newsletter"
    headers_2 = {"accept-encoding": "gzip, br", "X-Device-Type": "mobile"}
    
    key_1 = builder.generate_cache_key(url_1, query_1, headers_1)
    key_2 = builder.generate_cache_key(url_2, query_2, headers_2)
    
    print(f"Generated Cache Key 1: {key_1}")
    print(f"Generated Cache Key 2: {key_2}")
    
    # Verification assertion
    assert key_1 == key_2, "Cache key normalization failed to map equivalent requests to the same key!"
    print("Success: Normalized request variations resolved to the identical cache key.")

Scaling Nuances & Cache Optimization

Operating a CDN at a 1 Million QPS scale introduces massive cache management challenges.

1. Invalidation Propagation at Scale

When an asset is purged, sending a synchronous HTTP request to 10,000 edge servers is slow and error-prone.

Our Design: We use a two-tiered invalidation pipeline:
- Global Pub/Sub (Kafka + Redis Cluster): The purge API publishes a message to a global Kafka cluster. Each Edge POP runs a lightweight daemon that subscribes with its own consumer group, so every POP receives every purge message rather than a single partition's share.
- Tombstoning: Instead of immediately deleting the physical files from disk (which incurs high I/O latency), the Varnish proxy marks the cache key with a "Tombstone" flag in memory. Subsequent client requests treat the tombstoned asset as a cache miss, fetching fresh content from the origin while the old file is lazily garbage-collected in the background.
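Tombstoning can be sketched as an in-memory map from cache key to purge time. The names `TombstoneIndex` and `lookup` are hypothetical, and storing a timestamp instead of a boolean flag is an assumption of this sketch: it lets a freshly refetched object become valid again without having to clear the marker first.

```python
import time
from typing import Dict, Optional, Tuple


class TombstoneIndex:
    """In-memory purge markers keyed by cache key."""

    def __init__(self) -> None:
        self._purged_at: Dict[str, float] = {}

    def apply_purge(self, cache_key: str, purged_at: Optional[float] = None) -> None:
        """Record a purge message. No disk I/O happens here."""
        ts = time.time() if purged_at is None else purged_at
        previous = self._purged_at.get(cache_key, 0.0)
        self._purged_at[cache_key] = max(ts, previous)

    def is_stale(self, cache_key: str, stored_at: float) -> bool:
        """True if the object was cached at or before its latest purge."""
        purged_at = self._purged_at.get(cache_key)
        return purged_at is not None and stored_at <= purged_at


def lookup(
    cache: Dict[str, Tuple[float, bytes]],
    tombstones: TombstoneIndex,
    key: str,
) -> Optional[bytes]:
    """Return the cached body, or None when the entry is missing or tombstoned."""
    entry = cache.get(key)
    if entry is None:
        return None
    stored_at, body = entry
    if tombstones.is_stale(key, stored_at):
        return None  # treated as a miss; the caller refetches from upstream
    return body
```

A background task can later delete files whose `stored_at` is older than their tombstone, which is the lazy garbage collection described above.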

2. Cache Eviction Policies

An Edge POP with 20 TB of NVMe capacity will eventually fill up. Choosing what to evict is critical for maintaining a high hit ratio.

Standard LRU (Least Recently Used): Prone to cache pollution from "one-hit wonders" (a file requested once and never again).
Segmented LRU (SLRU) (Selected): The cache is split into two segments: a Probationary Segment and a Protected Segment.
- Incoming files on cache miss are placed in the Probationary Segment.
- If a probationary item is requested a second time before eviction, it is promoted to the Protected Segment.
- This protects the hot working set from being flushed by sporadic, low-popularity requests.
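The two-segment behaviour can be sketched with a pair of ordered dictionaries. This is a minimal illustration, assuming capacities are counted in items rather than bytes; the class name `SegmentedLRU` and the choice to demote overflow from the Protected Segment back into the Probationary Segment are assumptions of the sketch.

```python
from collections import OrderedDict


class SegmentedLRU:
    """Two-segment LRU: new keys start on probation, repeat hits are protected."""

    def __init__(self, probation_size: int, protected_size: int) -> None:
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation: OrderedDict = OrderedDict()
        self.protected: OrderedDict = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)       # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)       # second request: promote
            self._insert_protected(key, value)
            return value
        return None                               # cache miss

    def put(self, key, value) -> None:
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation[key] = value
            self.probation.move_to_end(key)
        else:
            self.probation[key] = value           # first sighting
            self._trim_probation()

    def _insert_protected(self, key, value) -> None:
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the coldest protected item instead of dropping it.
            old_key, old_value = self.protected.popitem(last=False)
            self.probation[old_key] = old_value
        self._trim_probation()

    def _trim_probation(self) -> None:
        while len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)    # evict least recently used
```

With this structure a burst of one-hit wonders only churns the Probationary Segment, while items that have been requested twice stay in the Protected Segment.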

3. Dynamic Request Collapsing (Request Coalescing)

Under peak load, when a popular video segment expires, 10,000 concurrent client requests might hit the same Edge POP within the same few milliseconds. If all 10,000 requests are forwarded upstream, the origin can be overwhelmed (a cache stampede).

Mitigation: The Edge reverse proxy implements Request Collapsing (Coalescing). The first request for a cache key acquires a per-key lock and is forwarded to the origin, while the other 9,999 requests wait on that in-flight fetch. Once the response arrives, the proxy stores it in cache and serves it to all waiting clients, reducing origin traffic for that key to a single request.
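A thread-based sketch of request collapsing follows. It is illustrative rather than how any particular proxy implements it: `CoalescingFetcher` is a hypothetical name, the cache is a plain dictionary with no expiry, and a `concurrent.futures.Future` stands in for the per-key lock so that waiting requests block until the leader's fetch completes.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class CoalescingFetcher:
    """Collapse concurrent misses for one key into a single origin fetch."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._in_flight: Dict[str, Future] = {}
        self._cache: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        with self._lock:
            if key in self._cache:
                return self._cache[key]            # cache hit
            future = self._in_flight.get(key)
            is_leader = future is None
            if is_leader:
                future = Future()                  # first miss leads the fetch
                self._in_flight[key] = future
        if not is_leader:
            return future.result()                 # wait for the leader
        try:
            value = self._fetch(key)
        except Exception as exc:
            with self._lock:
                del self._in_flight[key]
            future.set_exception(exc)              # wake followers with the error
            raise
        with self._lock:
            self._cache[key] = value               # publish before releasing
            del self._in_flight[key]
        future.set_result(value)
        return value
```

Because the cache entry is written under the same lock that removes the in-flight marker, a request arriving at any moment sees either the cached value, the pending future, or becomes the single leader.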

Trade-offs & Architectural Decisions

Designing a CDN involves strategic trade-offs depending on budget, operational complexity, and performance goals.

1. Routing Model: IP Anycast vs. Geo-DNS

Geo-DNS:
- Pros: Easy to implement using standard DNS servers. Can route traffic based on business rules or complex DNS weight configurations.
- Cons: DNS caches respect Time-To-Live (TTL) values. If an edge location goes down, client browsers will continue sending traffic to the failed POP for minutes or hours until the DNS cache expires.
IP Anycast (Selected):
- Pros: Fast automatic failover, typically within seconds and bounded by BGP convergence time. When a failed POP's BGP announcement is withdrawn, routers along the path send traffic to the next closest POP without waiting for any DNS TTL to expire.
- Cons: Highly complex to configure and manage. Requires owning public IP space (for IPv4, a /24 is the smallest prefix most networks will accept) and peering directly with Tier-1 transit providers.

2. Cache Filling Strategy: Push vs. Pull Model

Push Model: The origin server pushes new assets to all edge locations proactively upon upload.
- Best For: Popular media platforms (e.g., Netflix pushing an anticipated new release to all regional edge storage nodes).
Pull Model (Selected): The edge server fetches content from the origin on-demand only on a cache miss.
- Best For: Generic web hosting, e-commerce, and long-tail platforms where caching every single user upload at all edge POPs is cost-prohibitive.

Failure Scenarios & Mitigation Strategies

At a global scale, network failures, routing failures, and attacks are continuous events.

1. BGP Routing Flaps and Anycast Blackholing

Under unstable fiber connections, a POP's Anycast BGP router may rapidly establish and drop BGP sessions, causing traffic to oscillate between different POPs, breaking active TCP connections.

Mitigation: We implement Route Dampening at our BGP routers. If a route flaps more than 3 times in 5 minutes, we temporarily withdraw the prefix advertisement for 30 minutes, directing all traffic to secondary backup POPs.

2. Cache Poisoning Attacks

An attacker sends a request with a malicious header that is not part of the cache key (e.g., X-Forwarded-Host: evil.com). If the origin reflects that header into the response, for example in a script URL, the CDN caches the poisoned response under the default key, and future clients are served the malicious content.

Mitigation:
1. We enforce strict cache-key normalization as shown in our CDNCacheKeyBuilder class.
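2. We strip non-whitelisted headers from the request before forwarding it to the origin. Excluding a header from the cache key is not enough on its own: an unkeyed header that still reaches the origin is exactly what makes poisoning possible.
3. The WAF at the edge POP filters incoming HTTP headers before the request reaches the Varnish cache engine.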
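One way to keep unkeyed headers from influencing a cached response is to drop them before the origin fetch. The sketch below assumes a hypothetical allowlist, `FORWARDABLE_HEADERS`; a real deployment would derive it from the headers that are part of the cache key plus those known to be safe.

```python
from typing import Dict, FrozenSet

# Hypothetical allowlist: headers that are keyed or known to be safe to forward.
FORWARDABLE_HEADERS: FrozenSet[str] = frozenset(
    {"host", "accept", "accept-encoding", "x-device-type"}
)


def sanitize_for_origin(
    headers: Dict[str, str],
    allowlist: FrozenSet[str] = FORWARDABLE_HEADERS,
) -> Dict[str, str]:
    """Drop every header not on the allowlist before the request leaves the edge."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() in allowlist
    }


incoming = {
    "Host": "example.com",
    "Accept-Encoding": "br, gzip",
    "X-Forwarded-Host": "evil.com",   # unkeyed and attacker-controlled
}
print(sanitize_for_origin(incoming))
```

With `X-Forwarded-Host` removed at the edge, the origin never sees the attacker's value, so there is nothing to reflect into a response that would then be cached.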

3. Volumetric DDoS Attacks at the Edge

A botnet launches a 10 Million QPS flood of requests targeting randomized query parameters to bypass the cache and hit the origin.

Mitigation:
- Edge Load Balancer SYN-Flood Protection: Use TCP SYN cookies at the Maglev load balancing layer to absorb SYN floods without allocating socket state.
- Edge Rate Limiting: Identify client IP patterns using a sliding-window count in shared memory and return HTTP 429 status codes instantly at the edge before forwarding to the cache lookup layer.
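A sliding-window counter per client IP can be sketched as follows. This is a single-process illustration with hypothetical names; it keeps timestamps in ordinary Python memory rather than the shared memory segment an edge proxy would use, and the `now` parameter exists only to make the behaviour easy to test.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        cutoff = now - self.window_seconds
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.limit:
            return False          # caller should answer with HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter(limit=3, window_seconds=1.0)
decisions = [limiter.allow("203.0.113.7", now=t) for t in (0.0, 0.1, 0.2, 0.3)]
print(decisions)
```

The fourth request inside the same second is rejected, and the client is admitted again once the earliest timestamp leaves the window.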

Staff Engineer Perspective

Operating a CDN requires a deep understanding of bare-metal networking and operating system kernel tuning.

Candidate Verbal Script & Mock Interview Guide

Here is a step-by-step walkthrough of how to articulate this design during an actual System Design interview.

1. Requirements & Scaling Phase (Minutes 0 - 5)

Candidate: "I will design a highly resilient, global CDN. First, I will clarify scope. Do we support static or dynamic assets? Both, but static caching is our main focus. For non-functional SLAs, I will design for a global scale of 1 Million QPS with an average file size of 100 KB, demanding 800 Gbps of egress capacity. Our target cache-hit latency must remain under 20ms."

2. Global Request Routing & Anycast (Minutes 5 - 15)

Candidate: "To achieve under 20ms latency, we cannot route all users to a central region. We must route requests to the nearest edge Point of Presence (POP). I will use IP Anycast routing over BGP. Multiple global POPs will advertise the exact same IP address. Routers on the internet will automatically forward the user's TCP packets to the topologically nearest POP. For dynamic routing, if a request cannot be cached, the Edge POP will tunnel the TCP connection back to the origin over a pre-warmed persistent connection pool to optimize latency."

3. Hierarchical Caching Topology (Minutes 15 - 25)

Candidate: "Caching should not be flat. If a file misses at the Edge L1 cache, querying the origin directly on every miss would overload the origin servers. I will implement a hierarchical caching architecture. We will have Edge L1 caches (DRAM + NVMe SSDs), pointing to Regional L2 caches, which in turn point to a Regional Origin Shield. The origin shield is located near the primary origin server. This multi-tier fallback minimizes origin traffic."

4. Cache Poisoning & Invalidation Deep-Dive (Minutes 25 - 40)

Candidate: "At a 1M QPS scale, security and cache hit optimization are crucial. I will implement an Edge Cache-Key Builder that normalizes all request fields. It will lowercase the scheme and host, normalize trailing slashes, sort query parameters alphabetically, and strip tracking parameters (like utm_source). For content invalidation, I will build an asynchronous pipeline. When a developer triggers a purge, the system will publish a tombstone message to a global Kafka cluster, which each Edge POP consumes. Edge servers will instantly mark the asset in-memory as expired without incurring disk I/O costs, replacing it lazily."