Lesson 2 of 38 17 minDesign Track

Introduction to High-Level Design

Introduction to High-Level Design deep dive for HLD & LLD interview preparation.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • [Multi-Region Architecture: Active-Active, Active-Passive, and Consistency Trade-Offs](/blog/multi-r
  • [System Design: Designing an Online Auction System (eBay Scale)](/blog/system-design-online-auction-
  • [The Saga Pattern: Managing Distributed Transactions in NoSQL](/blog/saga-pattern-distributed-transa
Recommended Prerequisites
Introduction to High-Level Design

Premium outcome

Bridge the gap between architecture diagrams and implementation details.

Engineers preparing for LLD rounds or leveling up their software design depth.

What you unlock

  • Cleaner reasoning around SOLID, patterns, responsibilities, and schema design
  • A usable bridge between HLD whiteboard thinking and concrete Java classes
  • Case-study practice across common interview-style design systems

High-Level Design (HLD) is the primary blueprint of system architecture. While Low-Level Design (LLD) focuses on modular class relationships, object-oriented encapsulation, and local thread mechanics within a single process, HLD operates at the macroscopic level. It defines the global services, network topologies, storage models, consistency boundaries, and communication protocols that form a resilient and highly scalable enterprise platform.

Mastering HLD requires transitioning from a local code-centric mindset to a global system-centric paradigm. In high-scale production systems, single-server implementations are completely obsolete. Modern systems are distributed by default, composed of hundreds of decoupled nodes that must coordinate state across geographic zones. This comprehensive guide deconstructs the foundational pillars of HLD, constructs a global multi-tier architecture, details sizing capacity math, and explores real-world scaling bottlenecks, trade-offs, and disaster recovery strategies.


System Requirements and Goals

To design a high-scale global application (such as a modern e-commerce platform or a real-time messaging engine), we must establish rigid functional boundaries and ambitious operational goals. Decoupling functional expectations from system reliability boundaries is the first step in successful High-Level Design.

1. Functional Requirements

  • Decoupled Service Mesh: Deconstruct monolithic codebases into independent, single-responsibility microservices that communicate via lightweight, contract-driven interfaces.
  • State Management: Isolate transactional processing (such as payment checkouts) from read-heavy catalog search operations.
  • Real-time Event Ingestion: Ingest millions of incoming user interaction events per second from clients distributed globally.
  • Omnichannel Notification Deliveries: Route transaction updates and status changes to mobile push gateways and email servers asynchronously.

2. Non-Functional Constraints

  • High Availability ("Five Nines"): Target a $99.999%$ availability SLA, bounding total annual unplanned downtime to less than $5.26\text{ minutes}$. This requires active-active multi-region deployment, automatic DNS failover routing, and zero-downtime database replication strategies.
  • Horizontal Scalability: Ensure the system can scale out horizontally by adding stateless computation nodes to handle $10\times$ traffic surges without manual architectural rewrites.
  • Low Latency Boundaries: Enforce strict P99 latency bounds: less than $100\text{ms}$ for read requests and less than $200\text{ms}$ for transactional writes.
  • Strong Durability: Protect transactional data against physical server failures through multi-region synchronous replication.

3. Sizing and Capacity Math

Let's conduct a back-of-the-envelope capacity estimation for a global system design interview target: a social media catalog service handling $100\text{ Million}$ Daily Active Users (DAU).

  • Read/Write Ratio: $100:1$ (extremely read-heavy, typical of social catalog apps).
  • Write Throughput (TPS): If each user writes an average of $2$ items per day: $$\text{Daily Writes} = 100,000,000 \text{ users} \times 2 = 200,000,000 \text{ writes/day}$$ $$\text{Average Write TPS} = \frac{200,000,000 \text{ writes}}{86,400 \text{ seconds}} \approx 2,315 \text{ TPS}$$ $$\text{Peak Write TPS (3x surge multiplier)} = 2,315 \times 3 \approx 6,945 \text{ Peak TPS}$$
  • Read Throughput (TPS): Using our $100:1$ ratio: $$\text{Average Read TPS} = 2,315 \times 100 = 231,500 \text{ TPS}$$ $$\text{Peak Read TPS} = 231,500 \times 3 \approx 694,500 \text{ Peak TPS}$$
  • Network Egress Bandwidth: If the average read payload (including metadata, user info, and asset URLs) size is $50\text{ KB}$: $$\text{Peak Egress Bandwidth} = 694,500 \text{ TPS} \times 50 \text{ KB} = 34,725,000 \text{ KB/s} \approx 34.72 \text{ Gigabytes/second}$$
  • Long-Term Storage Volumes: If each write payload consumes $1\text{ KB}$ of metadata database space, then daily metadata generation is: $$\text{Daily Storage} = 200,000,000 \text{ writes} \times 1 \text{ KB} \approx 200 \text{ GB/day}$$ Over a standard 5-year data retention compliance lifecycle, this scales to: $$\text{5-Year Storage} = 200 \text{ GB/day} \times 365 \text{ days/year} \times 5 \text{ years} \approx 365 \text{ Terabytes}$$ Adding indexes, replication factors (3x for high durability), and transaction logs, the system will actually require over $1.5\text{ Petabytes}$ of enterprise SSD storage.

This capacity math demonstrates that a single server or database instance will immediately collapse under the peak bandwidth of $34.72\text{ GB/s}$ and the petabyte-scale storage footprint. To survive, the High-Level Design must incorporate a multi-tier Content Delivery Network (CDN) cache layer, read replicas, horizontal load balancing, and database sharding partitions.

4. Regulatory and Compliance Sizing

At a global scale, physical HLD must also account for regulatory data residency boundaries. Laws like the European Union's General Data Protection Regulation (GDPR) and California's Consumer Privacy Act (CCPA) prohibit moving personal identifiable information (PII) across specific physical borders.

Consequently, we must size our multi-region databases not just for sheer throughput, but also for geographical partitioning. We implement localized database shards hosted exclusively within regional boundaries (e.g. EU West region for European users). The Geo-DNS router utilizes IP-based routing metrics to ensure a user's write request is physically processed and persisted only within their domestic cloud jurisdiction, maintaining absolute compliance at the network edge.


API Design and Interface Contracts

High-Level Design specifies the protocols and payload structures that allow services to interact securely and efficiently. Decoupling communication paths is critical for distributed system performance.

1. Ingest Transaction Request

POST /v2/orders

Request Headers:

  • X-Correlation-ID: corr-4433-8888-abc (for tracing request flows across microservices)
  • X-Idempotency-Key: idem-order-998877 (to prevent duplicate processing on retries)
  • Content-Type: application/json

Request Payload:

{
  "buyer_id": "usr-8877-xyz",
  "items": [
    { "product_id": "prod-101", "quantity": 2, "price_cents": 4900 }
  ],
  "shipping_address": {
    "street": "100 Broadway",
    "city": "New York",
    "zip_code": "10001"
  }
}

Response Payload (202 Accepted):

{
  "order_id": "ord-776655-aabb",
  "status": "PENDING_VALIDATION",
  "created_at": "2026-05-23T08:24:12Z",
  "estimated_delivery": "2026-05-26T12:00:00Z"
}

2. Error Response Payload (429 Too Many Requests)

If a client violates rate limits during peak events:

{
  "error_code": "RATE_LIMIT_EXCEEDED",
  "message": "Too many requests. Please try again after 60 seconds.",
  "retry_after_seconds": 60,
  "timestamp": "2026-05-23T08:24:15Z"
}

3. Protocol Selection Trade-offs

Stateless backend services use a mix of communication protocols. We utilize gRPC (HTTP/2 Protobuf) for internal service-to-service communication because its binary serialization reduces network payload size by up to 70%, and HTTP/2 multiplexing allows hundreds of requests to flow over a single TCP connection, eliminating TCP handshake bottlenecks.

Conversely, we expose RESTful HTTP (JSON) to public clients. JSON is universally supported by web browsers and third-party developers, and RESTful endpoints are easily cached by standard proxy servers and CDN edges, maximizing global read performance.


High-Level Design Architecture

To survive millions of requests, our architecture employs a completely decoupled, multi-tier layout.

1. Global Multi-Tier System Topology

The system leverages edge routing, caching layers, and asynchronous event buses to handle read and write traffic independently. Traffic is captured at the edge by an Anycast CDN, serving static assets locally. Cache misses are routed via Geo-DNS to the closest active Availability Zone (AZ) API Gateway. The API Gateway manages rate limiting and forwards valid calls via gRPC to stateless microservices. Writes are pushed asynchronously to Kafka to decouple checkout execution from database persistence.

graph TD
    %% Client Tier
    Client[Web/Mobile Clients] -->|1. Route Traffic| CDN[Anycast CDN Edge]
    CDN -->|Cache Miss| DNS[Geo-DNS Route Gateway]
    
    %% Gateway Tier
    DNS -->|2. Route to Active AZ| APIGateway[Distributed API Gateway]
    
    subgraph "Stateless Application Tier"
        APIGateway -->|3. Route via gRPC| OrderService[Order Validation Service]
        APIGateway -->|3. Read Catalog| CatalogService[Catalog Read Service]
    end

    subgraph "Caching & Queue Tier"
        CatalogService -->|4. Query Cache| RedisCache[(Redis Cluster Cache)]
        OrderService -->|5. Write Event| Kafka[Apache Kafka Message Broker]
    end

    subgraph "Distributed Data Tier"
        Kafka -->|6. Consume Asynchronously| PaymentWorker[Payment Processing Engine]
        OrderService -->|7. Persist Transaction| PrimaryDB[(Primary Relational DB)]
        PrimaryDB -->|8. Async Replication| ReadReplica[(Read Replica Database)]
        RedisCache -- Cache Reload --> ReadReplica
    end

    %% Styles
    style CDN fill:#1a1c23,stroke:#3b82f6,stroke-width:2px,color:#fff
    style APIGateway fill:#1e3a8a,stroke:#3b82f6,stroke-width:2px,color:#fff
    style Kafka fill:#1a1c23,stroke:#f59e0b,stroke-width:2px,color:#fff
    style PrimaryDB fill:#1a1c23,stroke:#ef4444,stroke-width:2px,color:#fff

2. Event-Driven Write Sequence Flow

The diagram below details the sequence of a user submitting an order, illustrating how we achieve sub-millisecond API response times by handling payment and notification tasks asynchronously via message brokers.

sequenceDiagram
    autonumber
    actor Client as User Browser
    participant Gateway as API Gateway
    participant Order as OrderService
    participant DB as SQL DB (Primary)
    participant Bus as Kafka Broker
    participant Worker as PaymentWorker

    Client->>Gateway: POST /v2/orders
    Gateway->>Order: processOrder(OrderData)
    Order->>DB: Check Inventory (Read)
    DB-->>Order: Inventory Available
    Order->>DB: Write Order (PENDING)
    Order->>Bus: Publish OrderCreatedEvent
    Order-->>Gateway: 202 Accepted (Order ID)
    Gateway-->>Client: Display Order Received Page
    
    Note over Bus,Worker: Asynchronous Processing Loop
    Bus->>Worker: Consume OrderCreatedEvent
    Worker->>Worker: Process Payment Gateway Charge
    Worker->>DB: Update Order Status (PAID)

Low-Level Design & Component Mechanics

High-level architectures must ultimately translate to concrete low-level components executing within physical server runtimes. Below is the LLD mechanics mapping how stateless web threads interact with database connection pools.

1. Connection Pool & Thread Pool Mechanics (Java LLD)

Stateless application servers must manage resources carefully to prevent CPU starvation. We configure a highly optimized HikariCP Database Connection Pool paired with a bounded ThreadPoolExecutor to ensure fast SQL execution. The ThreadPoolExecutor utilizes a LinkedBlockingQueue to bound memory, preventing memory exhaustion under traffic spikes.

package com.codesprintpro.hld.mechanics;

import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class HighScaleOrderProcessor {
    private final BoundedExecutor executor;
    private final MockDatabaseConnectionPool connectionPool;

    public HighScaleOrderProcessor(int threadCount, int maxQueueSize, int dbPoolSize) {
        this.executor = new BoundedExecutor(threadCount, maxQueueSize);
        this.connectionPool = new MockDatabaseConnectionPool(dbPoolSize);
    }

    public void handleOrderRequest(final String orderJson) {
        executor.submit(() -> {
            Connection conn = null;
            try {
                conn = connectionPool.getConnection();
                // Execute SQL validation and order persistence
                System.out.println("Processing order on thread " + Thread.currentThread().getName() + " using connection " + conn.hashCode());
                Thread.sleep(50); // Simulate database write network overhead
            } catch (SQLException | InterruptedException e) {
                System.err.println("Failed to process order: " + e.getMessage());
            } finally {
                if (conn != null) {
                    connectionPool.releaseConnection(conn);
                }
            }
        });
    }
}

// Bounded Executor prevents memory exhaustion under sudden traffic spikes
class BoundedExecutor {
    private final ExecutorService executor;

    public BoundedExecutor(int poolSize, int maxQueueSize) {
        this.executor = new ThreadPoolExecutor(
            poolSize, poolSize,
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(maxQueueSize),
            new ThreadFactory() {
                private final AtomicInteger counter = new AtomicInteger(0);
                @Override
                public Thread newThread(Runnable r) {
                    return new Thread(r, "OrderProcessorThread-" + counter.incrementAndGet());
                }
            },
            new ThreadPoolExecutor.CallerRunsPolicy() // Backpressure: caller thread executes task if queue is full
        );
    }

    public void submit(Runnable task) {
        executor.submit(task);
    }
}

// HikariCP replica mock database connection pool
class MockDatabaseConnectionPool {
    private final BlockingQueue<Connection> connections;

    public MockDatabaseConnectionPool(int poolSize) {
        this.connections = new LinkedBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            connections.add(new MockSQLConnection());
        }
    }

    public Connection getConnection() throws SQLException {
        try {
            Connection conn = connections.poll(1000, TimeUnit.MILLISECONDS); // Bounded wait
            if (conn == null) {
                throw new SQLException("Database connection timeout. Pool exhausted.");
            }
            return conn;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new SQLException("Connection pool interrupted.");
        }
    }

    public void releaseConnection(Connection conn) {
        if (conn != null) {
            connections.offer(conn);
        }
    }
}

class MockSQLConnection implements Connection {
    // Stub implementation of java.sql.Connection for LLD compilation validation
    @Override public void close() throws SQLException {}
    @Override public boolean isClosed() throws SQLException { return false; }
    // Overrides remaining 200+ Connection interface methods as stubs...
    @Override public <T> T unwrap(Class<T> iface) throws SQLException { return null; }
    @Override public boolean isWrapperFor(Class<?> iface) throws SQLException { return false; }
}

2. Thread Modulations and Backpressure Mechanics

The BoundedExecutor class utilizes a strict CallerRunsPolicy. In standard JVM designs, if the thread pool queue is unbounded, a sudden traffic spike will stack millions of tasks in heap memory, triggering Garbage Collection (GC) pauses and eventually crashing the container via OutOfMemory (OOM) failures.

By enforcing a bounded queue (e.g. maxQueueSize = 1000), when the queue is saturated, the CallerRunsPolicy forces the initiating caller thread (the thread running the API Gateway socket) to execute the task itself. This naturally slows down socket reading at the network layer, propagating an elegant backpressure signal upstream to the API Gateway. The API Gateway then sheds load or returns HTTP 429 responses, protecting the application server's memory boundaries.

Additionally, to optimize JVM performance at extreme scale, we configure our JVM nodes to run the Z Garbage Collector (ZGC). Traditional JVM garbage collectors (like G1GC) pause all application execution threads (Stop-The-World) to clean heap memory, which spikes P99 latency up to several seconds during heavy allocation waves. ZGC performs all compaction phases concurrently in the background, bounding pause times strictly under $1\text{ millisecond}$ regardless of heap size, preserving our low-latency non-functional constraints.


Scaling Challenges & Production Bottlenecks

Scaling horizontal stateless services is straightforward, but data storage and network layers expose major physical bottlenecks:

1. Database Write Saturation

As horizontal application instances scale from 10 to 1000 pods, the centralized relational database becomes the primary bottleneck due to locking conflicts and disk write saturation.

  • Database Sharding: Partition the database horizontally using a sharding key (e.g. user_id). By routing different users to independent physical databases (e.g. Shard A for IDs 1-1000, Shard B for 1001-2000), we distribute the write throughput across multiple CPU nodes.
  • CQRS (Command Query Responsibility Segregation): Decouple read traffic completely from write traffic. Write queries are processed only by the primary database, while read catalog requests are directed to read replicas or fast Redis caches, protecting primary instances from read-induced write resource starvation.

2. Hotspot Shards (Key Skewing)

When sharding by a key (e.g., influencer_id or product_id), a viral event (such as a celebrity post or a flash sale item) will route millions of requests to the exact same database shard, crashing the node.

  • Consistent Hashing: Map sharded keys and physical nodes onto a circular Hash Ring. Consistent hashing minimizes partition rebalancing overhead when scaling nodes out, preventing massive data migrations.
  • Key Salting: Append a random numeric suffix (e.g., prod-101_1, prod-101_5) to extremely hot keys during write operations, spreading the load across multiple physical nodes. During read operations, the API reads from all salted locations and consolidates the counts, ensuring no single shard collapses under concentrated traffic pressure.

Technical Trade-offs & Strategic Compromises

High-Level architectural decisions require balancing consistency, latency, write speed, and system complexity.

1. Relational (ACID) vs. NoSQL (Wide-Column)

Relational databases enforce strict ACID transactions, ensuring that all users see the exact same data immediately. This is required for financial ledgers where negative balances are unacceptable. However, this CP model requires distributed locks and synchronous coordination, resulting in high latency and low throughput.

NoSQL databases adopt eventual consistency. Writes return immediately, and updates propagate asynchronously to other replicas. This AP model offers ultra-low latencies and virtually unlimited throughput, but users may temporarily read stale data (e.g. a social media post showing 100 likes instead of 105). This is the optimal trade-off for social feeds where immediate absolute accuracy is secondary to latency.

2. Transport Protocol Options

Selecting communication protocols requires balancing structural flexibility against processing throughput boundaries.

Protocol Option Payload Format Network Overhead Browser Support Best System Use Case
gRPC (HTTP/2) Binary Protobuf Ultra-Low (Compact) Poor (Needs Proxy) Internal Microservice Mesh
REST (JSON) Textual JSON Medium (Verbose) Excellent (Native) Public API Gateway
WebSockets Custom Frames Low (Persistent socket) Excellent Real-Time Chat & Lobbies

Failure Scenarios and Fault Tolerance

Distributed systems operate on the assumption that components will fail. Our HLD incorporates robust resilience patterns.

1. Cascading Downstream Outages (Thundering Herds)

If a downstream microservice (such as a payment processor) crashes or slows down, the upstream caller service will hold network sockets open while waiting for timeouts, quickly saturating its thread pools and crashing the upstream service too. This creates a chain reaction that can crash the entire platform.

  • Circuit Breakers (Resilience4j): Monitor call success rates. If failures exceed a threshold (e.g., 50%), the breaker trips, immediately returning a local fallback response (or an error) without calling the failing downstream service, allowing it to recover.
  • Rate Limiters (Token Bucket): Throttle incoming client requests at the API Gateway to prevent traffic surges from overwhelming backend databases.
  • Graceful Degradation: If the database layer is slow, the Catalog Service falls back to returning static cached recommendations, keeping the platform available.

2. Network Partitions and Split-Brain

In multi-region active-active deployments, a network partition can isolate two database datacenters. If both centers continue accepting writes without communication, their states will drift, creating conflicting transactions (split-brain).

  • Consensus Protocols: We utilize consensus algorithms (such as Paxos or Raft) to coordinate write updates. In the event of a partition, only the majority side of the cluster is allowed to accept writes, while the isolated minority side transitions to read-only mode, protecting the system from split-brain data corruption.

Staff Engineer Perspective


Verbal Script & Mock Interview

Mock Interview Dialogue

Interviewer: "Welcome! Today, we want to design a highly scalable, real-time e-commerce order system capable of handling 100,000 requests per second. How would you approach the high-level architecture?"

Candidate: *"To design a resilient system of this scale, I would focus on three core architectural principles: complete decoupling of read and write paths using CQRS, an event-driven write pipeline to minimize synchronous latency, and horizontal sharding to prevent database write saturation.

At the edge, clients connect through a Geo-DNS route gateway and anycast CDN to a distributed API Gateway. The gateway acts as our entry shield, performing rate limiting, TLS termination, and authentication.

For the write path, when a user submits an order, the API Gateway forwards the request to our stateless Order Ingestion Service. Rather than performing a heavy database write synchronously, the ingestion service performs a fast memory check on cached inventory in Redis, writes a validated order record to our primary SQL database, and immediately publishes an OrderCreatedEvent to an Apache Kafka message broker. The server then immediately returns a 202 Accepted response to the client.

Asynchronously, a dedicated fleet of payment and shipping worker microservices consume the event from Kafka and process the actual transaction downstream. By decoupling these slow I/O operations from our API request thread, we keep order ingestion latency below 50 milliseconds."*

Interviewer: "That is a very fast ingestion pipeline. However, how do you handle the bottleneck of write saturation on your primary SQL database under peak load?"

Candidate: *"To scale the data tier, we apply Horizontal Database Sharding and CQRS.

We partition our relational databases horizontally using the buyer_id as the sharding key, distributing order records across multiple independent database nodes.

Furthermore, to protect the primary shards from read traffic, we replicate all transactions asynchronously to a cluster of read replicas. Catalog searches and order history queries are served exclusively from these read replicas or from our Redis cache clusters. This ensures that read queries never starve write operations of critical system resources."*

Interviewer: "Excellent. What if a network partition isolates one of your database shards? How do you prevent split-brain data conflict?"

Candidate: *"To prevent data corruption during network partitions, we prioritize Consistency over Availability (CP) at the database layer.

We deploy our primary database shards in a multi-region configuration coordinated by a consensus protocol like Raft.

If a network partition isolates a region, the database nodes will evaluate if they form a majority. Only the side that holds the majority of votes is permitted to elect a leader and accept writes. The isolated minority region is automatically transitioned to read-only mode.

Once the network partition is resolved, the minority region replays the log updates from the leader, safely merging state without any split-brain data conflicts."*

Interviewer: "Outstanding. Your master-level grasp of distributed patterns, latency trade-offs, and data integrity boundaries is exceptional."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.