System Design: Designing a Stock Trading Platform and Matching Engine

System Design: Designing a High-Performance Trading Platform

Designing a stock or cryptocurrency trading platform is the ultimate test of low-latency distributed systems engineering. Platforms like NASDAQ or Binance process millions of orders per second, maintain strict state consistency across active Order Books, and guarantee that trades are executed in the exact order they were received. Achieving sub-millisecond latency while maintaining high availability requires a deep understanding of mechanical sympathy, high-performance concurrency models, and hardware limits.

1. Requirements & Core Constraints

Functional Requirements

Order Ingestion: Support Limit, Market, and Stop-Loss order placements.
Deterministic Matching Engine: Match buy and sell orders deterministically based on Price-Time priority logic.
Durable Ledger Audit Trail: Persist a transactional execution record of every matching event and order balance adjustment.
Market Data Streaming: Stream real-time Level-2 (L2) order book depth updates (price levels and volumes) to millions of clients.
Order Lifecycle Management: Enable clients to cancel outstanding limit orders securely prior to matching execution.

Non-Functional Requirements (SLAs)

Ultra-Low Latency: Guarantee a p99 order execution latency of < 1 millisecond (from gateway ingress to match completion).
Extreme Write Throughput: Support a peak throughput of 100,000 order operations per second (QPS).
Strict Data Consistency: Maintain a CP (Consistency/Partition-tolerance) posture for the order ledger. Double-matching, out-of-order execution, or orphan trades are unacceptable.
High Availability: Maintain 99.999% ("Five Nines") availability via active-passive warm replicas and write-ahead log replay patterns.

Back-of-the-Envelope Capacity Estimations & Scale

To design our capacity allocations, we model a platform with 10 Million Active Accounts:

Ingestion Traffic Sizing

Peak Concurrency: Assume $10\%$ of active users ($1\text{ Million}$ accounts) are actively trading at peak hours.
Peak Ingestion Throughput: $100,000\text{ order operations/sec}$.
Payload Sizings: Each order event message (containing order_id, user_id, symbol, price, quantity, side, and timestamp) averages $256\text{ Bytes}$ in raw protobuf binary encoding.
Network Ingress Load: $$\text{Ingress Traffic} = 100,000\text{ QPS} \times 256\text{ Bytes} \approx 25.6\text{ MB/sec} \approx 204.8\text{ Mbps}$$

WebSocket Market Data Egress Sizing

Active WebSocket Connections: $1,000,000\text{ users}$ viewing active tickers.
Delta update frequency: Assume L2 order book updates are aggregated down to $50\text{ updates/sec}$ per trading pair (so client-side rendering is not overwhelmed), each client subscribes to a single pair, and each update averages $128\text{ Bytes}$:
Total Network Egress Load: $$\text{Egress Bandwidth} = 1,000,000\text{ users} \times 50\text{ updates/sec} \times 128\text{ Bytes} = 6.4\text{ GB/sec} \approx 51.2\text{ Gbps}$$

Operational Note: This extreme bandwidth footprint requires a geo-distributed edge WebSocket fleet leveraging delta-only push caches to bypass core network saturation.

Persistent Ledger Audit Capacity

Daily Ingest Volume: Assume an average of $1\text{ Billion}$ total orders processed daily.
Durable Storage Requirements (Daily): $$\text{Daily Audit Storage} = 1,000,000,000\text{ orders} \times 256\text{ Bytes} \approx 256\text{ GB/day}$$
5-Year Persistent Archive Footprint: $$\text{5-Year Storage} = 256\text{ GB/day} \times 365\times 5 \approx 467.2\text{ TB}$$ This raw audit trail requires indexing in a high-throughput analytical or wide-column store (e.g., ClickHouse or ScyllaDB), with cold snapshots compressed and archived to AWS S3 Glacier.
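
The estimates above can be reproduced with a few lines of integer arithmetic. The sketch below uses only the figures stated in this section, with decimal units (1 MB = 10^6 bytes); the class and method names are illustrative.

```java
// Back-of-the-envelope sizing using only the figures stated in this section.
class CapacityEstimates {
    static final long PEAK_ORDERS_PER_SEC = 100_000L;
    static final long ORDER_BYTES = 256L;
    static final long WS_CLIENTS = 1_000_000L;
    static final long UPDATES_PER_SEC = 50L;
    static final long UPDATE_BYTES = 128L;
    static final long ORDERS_PER_DAY = 1_000_000_000L;

    // Gateway ingress at peak, in bytes per second.
    static long ingressBytesPerSec() {
        return PEAK_ORDERS_PER_SEC * ORDER_BYTES;
    }

    // WebSocket egress when every client receives every aggregated update.
    static long egressBytesPerSec() {
        return WS_CLIENTS * UPDATES_PER_SEC * UPDATE_BYTES;
    }

    // Raw audit-trail volume per day and over five years (365-day years).
    static long dailyAuditBytes() {
        return ORDERS_PER_DAY * ORDER_BYTES;
    }

    static long fiveYearAuditBytes() {
        return dailyAuditBytes() * 365L * 5L;
    }

    public static void main(String[] args) {
        System.out.printf("Ingress: %.1f MB/s, %.1f Mbps%n",
                ingressBytesPerSec() / 1e6, ingressBytesPerSec() * 8 / 1e6);
        System.out.printf("Egress: %.1f GB/s, %.1f Gbps%n",
                egressBytesPerSec() / 1e9, egressBytesPerSec() * 8 / 1e9);
        System.out.printf("Audit: %.0f GB/day, %.1f TB over 5 years%n",
                dailyAuditBytes() / 1e9, fiveYearAuditBytes() / 1e12);
    }
}
```

Keeping the inputs as named constants makes it easy to re-run the sizing when an assumption changes, for example a different update rate per trading pair.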

2. API Design & Core Contracts

Low-latency trading architectures commonly use gRPC (HTTP/2 stream multiplexing over a single TCP connection, with Protobuf binary serialization) for execution gateways, and raw WebSockets for low-overhead real-time price dissemination.

Order Placement gRPC Interface

syntax = "proto3";

package codesprintpro.matching;

option java_multiple_files = true;
option java_package = "com.codesprintpro.matching.api";

enum OrderSide {
  ORDER_SIDE_UNSPECIFIED = 0;
  ORDER_SIDE_BUY = 1;
  ORDER_SIDE_SELL = 2;
}

enum OrderType {
  ORDER_TYPE_UNSPECIFIED = 0;
  ORDER_TYPE_LIMIT = 1;
  ORDER_TYPE_MARKET = 2;
  ORDER_TYPE_STOP_LOSS = 3;
}

message OrderPlacementRequest {
  string client_order_id = 1;  // Unique ID generated by client SDK
  string symbol = 2;           // E.g., "BTC-USD", "AAPL"
  OrderSide side = 3;          // BUY or SELL
  OrderType type = 4;          // LIMIT, MARKET, etc.
  uint64 price_in_cents = 5;   // Integer price in cents to avoid floating-point drift (e.g., 101.25 USD -> 10125)
  uint64 quantity_in_satoshis = 6; // Asset unit scaled as integer (e.g., base unit * 10^8)
  uint64 timestamp_ns = 7;     // Client-side epoch nanoseconds timestamp
}

message OrderPlacementResponse {
  string order_id = 1;         // Exchange-allocated unique sequential order ID
  string client_order_id = 2;
  enum OrderStatus {
    STATUS_UNSPECIFIED = 0;
    STATUS_ACCEPTED = 1;
    STATUS_REJECTED = 2;
    STATUS_PARTIALLY_FILLED = 3;
    STATUS_FILLED = 4;
  }
  OrderStatus status = 3;
  string rejection_reason = 4; // Populated only if STATUS_REJECTED
  uint64 executed_quantity = 5;
  uint64 remaining_quantity = 6;
  uint64 matched_timestamp_ns = 7;
}

service TradingExecutionService {
  rpc SubmitOrder(OrderPlacementRequest) returns (OrderPlacementResponse);
}
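
The integer fields in the contract above imply a client-side conversion from decimal strings to scaled units. A minimal sketch of that conversion follows, assuming prices scale by 10^2 (cents) and quantities by 10^8; `FixedPointScaling` and `toScaledUnits` are hypothetical names, not part of any SDK described here.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Converts decimal strings into the scaled integers carried by the order message.
class FixedPointScaling {
    static final int PRICE_SCALE = 2;     // cents
    static final int QUANTITY_SCALE = 8;  // 10^-8 of a base unit

    // Returns value * 10^scale as a long.
    // Throws ArithmeticException if the input has more decimal places than
    // the scale allows (no silent rounding) or does not fit in a long.
    static long toScaledUnits(String decimal, int scale) {
        BigDecimal value = new BigDecimal(decimal);
        if (value.signum() < 0) {
            throw new IllegalArgumentException("negative amounts are not allowed: " + decimal);
        }
        return value.setScale(scale, RoundingMode.UNNECESSARY)
                    .unscaledValue()
                    .longValueExact();
    }

    public static void main(String[] args) {
        // Binary floating point cannot represent 0.1 or 0.2 exactly.
        System.out.println(0.1 + 0.2 == 0.3);
        System.out.println(toScaledUnits("101.25", PRICE_SCALE));
        System.out.println(toScaledUnits("0.5", QUANTITY_SCALE));
    }
}
```

Rejecting inputs that would need rounding, rather than rounding them, keeps the gateway from silently changing the price or quantity a client asked for.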

3. High-Level Design (HLD)

Achieving sub-millisecond latency requires decoupling the ingestion/network layer, the matching execution layer, and the market data/journaling processes.

Core Order Execution Pathway

The matching engine operates as a single-threaded, in-memory execution processor to avoid locking overheads and thread context switches. The input queue uses an LMAX Disruptor RingBuffer to buffer incoming network events, while a durable Write-Ahead Log (WAL) Sequencer guarantees crash recovery.

graph TD
    %% Styling
    classDef client fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef gateway fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
    classDef ring fill:#efebe9,stroke:#5d4037,stroke-width:2px;
    classDef engine fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px;
    classDef database fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
    
    Client[Institutional Trading Client]:::client -->|gRPC Submissions| Gateway[gRPC Execution Gateway Cluster]:::gateway
    
    subgraph Execution Node [Single Ultra-Low Latency Core Server]
        Gateway -->|Append Order| Sequencer[Kafka / Local Raft Ingress Sequencer]:::gateway
        Sequencer -->|Sequence Committed Order| InputDisruptor[LMAX Input RingBuffer]:::ring
        
        InputDisruptor -->|Poll Next Event| MatchingEngine[In-Memory Matching Engine]:::engine
        
        MatchingEngine -->|Engine Update State| OutputDisruptor[LMAX Output RingBuffer]:::ring
    end

    subgraph Durability & Distribution Layer
        OutputDisruptor -->|Worker A: Async Journaling| WAL[(SSD Write-Ahead Log)]:::database
        OutputDisruptor -->|Worker B: State Sync| DB[(ScyllaDB Execution Ledger)]:::database
        OutputDisruptor -->|Worker C: Pub/Sub Publish| EventBus[Kafka Match Event Topic]:::database
    end

Real-Time Market Data Broadcast Pipeline

To serve millions of users requesting live L2 Order Book depth updates, we route match confirmation events away from the matching thread to decoupled, distributed edge cache hubs that push updates to clients via WebSockets.

graph TD
    classDef edge fill:#efebe9,stroke:#5d4037,stroke-width:2px;
    classDef database fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
    classDef gateway fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;

    EventBus[Kafka Match Event Topic]:::database -->|Consumer Stream| Analytics[Real-Time Flink Aggregator]:::gateway
    Analytics -->|Aggregated L2 Book Changes| EdgeCache[(Redis Cluster Local Caches)]:::edge
    
    EdgeCache -->|Local State Sync| WSGateway[Edge WebSocket Gateways Fleet]:::gateway
    WSGateway -->|Aggregated L2 Delta Diffs| ClientA[Web Ticker Dashboard]
    WSGateway -->|Aggregated L2 Delta Diffs| ClientB[Mobile Trading App]

4. Low-Level Design (LLD) & Data Models

In-Memory Order Book Structure

The core Order Book manages two primary datasets per trading pair: Bids (Buy Orders) and Asks (Sell Orders).

Sort Priority:
- Bids (Buy): Sorted by price descending (highest bid gets priority), then by timestamp ascending (earlier orders get matched first).
- Asks (Sell): Sorted by price ascending (lowest sell gets priority), then by timestamp ascending.
Java Structure: A HashMap tracks orderId -> Order references for $O(1)$ lookup on updates and cancellations, combined with one TreeMap per side keyed by price, where each entry holds the FIFO queue of orders resting at that price level.

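import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.TreeMap;

public class OrderBook {
    private final String symbol;

    // Asks sorted by price ascending, so firstKey() is the best (lowest) ask
    private final TreeMap<Long, PriceLevel> askLevels = new TreeMap<>();
    // Bids sorted by price descending, so firstKey() is the best (highest) bid
    private final TreeMap<Long, PriceLevel> bidLevels = new TreeMap<>(Collections.reverseOrder());
    // orderId -> Order index for O(1) lookup on updates and cancellations
    private final HashMap<Long, Order> orderMap = new HashMap<>();

    public OrderBook(String symbol) {
        this.symbol = symbol;
    }

    // Minimal order record; the fields are assumed here for illustration
    public static class Order {
        public long orderId;
        public long priceInCents;
        public long remainingQuantity;
        public boolean isBuy;
    }

    // All resting orders at one price, kept in arrival (FIFO) order
    public static class PriceLevel {
        public long priceInCents;
        public LinkedList<Order> orders = new LinkedList<>();
    }
}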
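
To make the price-time priority rule concrete, here is a minimal matching sketch for limit orders. It is an illustration under simplifying assumptions, not the engine's actual implementation: `PriceTimeMatcher`, `RestingOrder`, and `Fill` are hypothetical names, trades execute at the resting order's price, and the code allocates freely for readability rather than following a zero-allocation style.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PriceTimeMatcher {
    static final class RestingOrder {
        final long orderId;
        long remaining;
        RestingOrder(long orderId, long remaining) {
            this.orderId = orderId;
            this.remaining = remaining;
        }
    }

    // One execution against a resting order.
    static final class Fill {
        final long restingOrderId;
        final long priceInCents;
        final long quantity;
        Fill(long restingOrderId, long priceInCents, long quantity) {
            this.restingOrderId = restingOrderId;
            this.priceInCents = priceInCents;
            this.quantity = quantity;
        }
    }

    // Best price first on both sides; each level is a FIFO queue (time priority).
    private final TreeMap<Long, ArrayDeque<RestingOrder>> asks = new TreeMap<>();
    private final TreeMap<Long, ArrayDeque<RestingOrder>> bids =
            new TreeMap<>(Collections.reverseOrder());

    // Matches an incoming limit order, then rests any unfilled remainder.
    List<Fill> submitLimit(long orderId, boolean isBuy, long limitPrice, long quantity) {
        TreeMap<Long, ArrayDeque<RestingOrder>> opposite = isBuy ? asks : bids;
        List<Fill> fills = new ArrayList<>();
        long remaining = quantity;

        while (remaining > 0 && !opposite.isEmpty()) {
            Map.Entry<Long, ArrayDeque<RestingOrder>> best = opposite.firstEntry();
            long bestPrice = best.getKey();
            boolean crosses = isBuy ? bestPrice <= limitPrice : bestPrice >= limitPrice;
            if (!crosses) {
                break;
            }
            ArrayDeque<RestingOrder> queue = best.getValue();
            while (remaining > 0 && !queue.isEmpty()) {
                RestingOrder head = queue.peekFirst(); // earliest order at this price
                long traded = Math.min(remaining, head.remaining);
                head.remaining -= traded;
                remaining -= traded;
                fills.add(new Fill(head.orderId, bestPrice, traded));
                if (head.remaining == 0) {
                    queue.pollFirst();
                }
            }
            if (queue.isEmpty()) {
                opposite.remove(bestPrice); // drop the exhausted price level
            }
        }

        if (remaining > 0) {
            TreeMap<Long, ArrayDeque<RestingOrder>> own = isBuy ? bids : asks;
            own.computeIfAbsent(limitPrice, p -> new ArrayDeque<>())
               .addLast(new RestingOrder(orderId, remaining));
        }
        return fills;
    }
}
```

For example, with asks resting at 10000 (orders 2 and 3, five units each) and 10100 (order 1, five units), a buy for 12 units at 10100 fills order 2, then order 3, then two units of order 1.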

Low-Level Compilable Code: High-Performance LMAX RingBuffer

To feed the matching engine at high rates without thread contention, we avoid lock-based queues (such as ArrayBlockingQueue) in favor of a lock-free ring buffer modeled on the LMAX Disruptor, which pre-allocates its events and pads them to reduce cache-line false sharing.

package com.codesprintpro.matching;

import java.util.concurrent.atomic.AtomicLong;

/**
 * Custom high-performance, lock-free RingBuffer demonstrating the core mechanics
 * of the LMAX Disruptor pattern for ultra-low latency order matching ingestion.
 */
public class HighPerformanceRingBuffer {
    
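    // Padding fields intended to reduce cache-line false sharing between adjacent
    // events (64-byte lines). The JVM may reorder fields, so this is best-effort.
    public static class OrderEvent {
        protected long p1, p2, p3, p4, p5, p6, p7;
        public long orderId;
        public String symbol;
        public long priceInCents; // integer cents, matching the API contract; avoids floating-point drift
        public long quantity;
        public String side; // "BUY" or "SELL"
        protected long p8, p9, p10, p11, p12, p13, p14;

        // Resets the slot so it can be reused without allocating a new event
        public void clear() {
            this.orderId = 0L;
            this.symbol = null;
            this.priceInCents = 0L;
            this.quantity = 0L;
            this.side = null;
        }
    }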

    private final OrderEvent[] ringBuffer;
    private final int bufferSize;
    private final int mask;
    
    // RingBuffer write sequence tracker
    private final AtomicLong sequence = new AtomicLong(-1L);

    public HighPerformanceRingBuffer(int bufferSize) {
        // Enforce power-of-two size to optimize modulo operations to fast bitwise ANDs
        if (Integer.bitCount(bufferSize) != 1) {
            throw new IllegalArgumentException("Buffer size must be a power of 2");
        }
        this.bufferSize = bufferSize;
        this.mask = bufferSize - 1;
        this.ringBuffer = new OrderEvent[bufferSize];
        for (int i = 0; i < bufferSize; i++) {
            this.ringBuffer[i] = new OrderEvent();
        }
    }

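    /**
     * Claims the next sequence slot in the RingBuffer (producer threads).
     * Note: this sketch does not check whether the slot has been consumed yet,
     * so a producer that laps the consumer will overwrite unread events.
     */
    public long next() {
        return sequence.incrementAndGet();
    }

    /**
     * Fetches the pre-allocated event stored at the specified sequence.
     */
    public OrderEvent get(long seq) {
        return ringBuffer[(int) (seq & mask)];
    }

    // Highest sequence published so far; -1 means nothing is visible to consumers yet.
    private final AtomicLong publishedCursor = new AtomicLong(-1L);

    /**
     * Publishes the event, making it visible to consumers.
     * Sequences are published strictly in order: each producer waits for its predecessor.
     */
    public void publish(long seq) {
        // The successful CAS is a volatile write, so the event-field writes made
        // before this call are visible to any consumer that reads the cursor.
        while (!publishedCursor.compareAndSet(seq - 1, seq)) {
            Thread.onSpinWait();
        }
    }

    /** Returns the highest published sequence, or -1 if nothing has been published. */
    public long getPublishedCursor() {
        return publishedCursor.get();
    }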

    /**
     * EventHandler interface implemented by the single-threaded matching processor.
     */
    public interface EventHandler {
        void onEvent(OrderEvent event, long sequence, boolean endOfBatch);
    }

    /**
     * Barrier protecting consumer from overrunning the active producers.
     */
    public static class SequenceBarrier {
        private final AtomicLong cursor = new AtomicLong(-1L);

        public void update(long seq) {
            cursor.lazySet(seq); // Ordered (release) store: cheaper than a full volatile write, prior writes stay ordered before it
        }

        public long getCursor() {
            return cursor.get();
        }
    }
}
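
The handoff between a producer and the matching thread can be seen end to end in a smaller, self-contained sketch. The version below assumes exactly one producer and one consumer and carries plain long values instead of order events; `SpscRingDemo` and its methods are illustrative names, and it is a simplification of the Disruptor pattern rather than the library itself.

```java
import java.util.concurrent.atomic.AtomicLong;

// Single-producer / single-consumer ring buffer with wrap-around protection.
class SpscRingDemo {
    private final long[] slots;
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1L); // last sequence visible to the consumer
    private final AtomicLong consumed = new AtomicLong(-1L);  // last sequence the consumer has finished

    SpscRingDemo(int size) {
        if (Integer.bitCount(size) != 1) {
            throw new IllegalArgumentException("size must be a power of 2");
        }
        this.slots = new long[size];
        this.mask = size - 1;
    }

    // Producer thread only.
    void offer(long value) {
        long seq = published.get() + 1;
        // The slot for seq was last used by (seq - size); wait until that one is consumed.
        while (seq - slots.length > consumed.get()) {
            Thread.onSpinWait();
        }
        slots[(int) (seq & mask)] = value;
        published.set(seq); // volatile write: makes the slot write visible to the consumer
    }

    // Consumer thread only.
    long take() {
        long seq = consumed.get() + 1;
        while (published.get() < seq) {
            Thread.onSpinWait();
        }
        long value = slots[(int) (seq & mask)];
        consumed.set(seq); // frees the slot for the producer
        return value;
    }

    // Pushes 1..count through an 8-slot ring on a second thread and returns the consumed sum.
    static long runDemo(int count) {
        SpscRingDemo ring = new SpscRingDemo(8);
        Thread producer = new Thread(() -> {
            for (long i = 1; i <= count; i++) {
                ring.offer(i);
            }
        });
        producer.start();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += ring.take();
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum;
    }
}
```

Both sides busy-spin instead of blocking, which trades CPU for latency; that is a reasonable fit when each thread is pinned to its own core, as the matching thread is in this design.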

Database DDL Schema (Ledger Audit Trail)

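We store historical execution matches durably in PostgreSQL, partitioned by day. CHECK constraints reject non-positive prices and quantities. Account and order references are stored as plain identifiers in this schema, so referential integrity has to be enforced upstream or with additional foreign keys.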

-- Core Table capturing matched executions (Trades ledger)
CREATE TABLE trades_history (
    trade_id UUID NOT NULL DEFAULT gen_random_uuid(),
    buyer_order_id VARCHAR(64) NOT NULL,
    seller_order_id VARCHAR(64) NOT NULL,
    buyer_account_id VARCHAR(64) NOT NULL,
    seller_account_id VARCHAR(64) NOT NULL,
    symbol VARCHAR(16) NOT NULL,
    executed_price_cents BIGINT NOT NULL CHECK (executed_price_cents > 0),
    executed_quantity_satoshis BIGINT NOT NULL CHECK (executed_quantity_satoshis > 0),
    trade_value_cents BIGINT GENERATED ALWAYS AS ((executed_price_cents * executed_quantity_satoshis) / 100000000) STORED,
    executed_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
    -- On a partitioned table, the primary key must include the partition key column
    PRIMARY KEY (trade_id, executed_at)
) PARTITION BY RANGE (executed_at);

-- Example Partition definition for active day
CREATE TABLE trades_history_y2026m05d22 PARTITION OF trades_history
    FOR VALUES FROM ('2026-05-22 00:00:00+00') TO ('2026-05-23 00:00:00+00');

-- Indexes supporting per-account and per-symbol reporting queries, newest first
CREATE INDEX idx_trades_buyer ON trades_history (buyer_account_id, executed_at DESC);
CREATE INDEX idx_trades_seller ON trades_history (seller_account_id, executed_at DESC);
CREATE INDEX idx_trades_symbol ON trades_history (symbol, executed_at DESC);

5. System Trade-offs & CAP Posture

1. CAP Theorem Decisions: Strong Consistency vs Latency

Many web applications favor Availability and Partition Tolerance (AP) and accept eventual consistency. For an exchange this is unacceptable: an order matched twice, or the same position sold to two separate accounts, leaves the ledger with deficits that cannot be settled.

Decision: CP Posture. The system must enforce immediate consistency. If a network partition separates a matching engine from its sequencer or replicas, the affected trading pair's book must freeze rather than risk split-brain matching.

2. Matching Thread Models: Single-Threaded CPU Pinning vs Multi-Threaded Locking

Multi-Threaded Model: Uses synchronized blocks or read-write locks around the order books. Lock contention leads to thread blocking and context switches, which show up as tail-latency spikes on the order of 100 microseconds or more.
Single-Threaded Model (LMAX): Pinning order book matching for a single trading pair to a single core guarantees lock-free determinism, sustaining sub-millisecond execution times. The trade-off is that a single trading pair's throughput is capped by the single-core frequency of the server.

6. Failure Scenarios & High-Availability Resilience

1. Engine Crash & Memory Recovery

Since the order book operates entirely in-memory, a physical power outage or server crash will completely wipe the live state.

Resilience Playbook: Every submitted order is synchronously appended to the write-ahead log before it enters the input queue. On recovery, the standby machine loads the most recent nightly state snapshot and replays the log sequentially from the snapshot's sequence number, rebuilding an identical state.
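
The recovery procedure relies on replay being deterministic and idempotent. The sketch below illustrates that idea with a deliberately tiny state (one running open quantity instead of a full order book); `ReplayRecovery`, `LogEntry`, and `BookState` are hypothetical names, and sequence numbers are assumed to start at 1 with no gaps.

```java
import java.util.List;

class ReplayRecovery {
    // One sequenced log record; a positive delta adds resting quantity, a negative one removes it.
    static final class LogEntry {
        final long sequence;
        final long quantityDelta;
        LogEntry(long sequence, long quantityDelta) {
            this.sequence = sequence;
            this.quantityDelta = quantityDelta;
        }
    }

    // Stand-in for the in-memory book: the state plus the last sequence applied to it.
    static final class BookState {
        long lastAppliedSequence;
        long openQuantity;
        BookState(long lastAppliedSequence, long openQuantity) {
            this.lastAppliedSequence = lastAppliedSequence;
            this.openQuantity = openQuantity;
        }

        void apply(LogEntry entry) {
            if (entry.sequence <= lastAppliedSequence) {
                return; // already reflected in the snapshot, skip it
            }
            if (entry.sequence != lastAppliedSequence + 1) {
                throw new IllegalStateException("gap in log at sequence " + entry.sequence);
            }
            openQuantity += entry.quantityDelta;
            lastAppliedSequence = entry.sequence;
        }
    }

    // Rebuilds state from a snapshot by replaying the log in sequence order.
    static BookState recover(BookState snapshot, List<LogEntry> log) {
        BookState state = new BookState(snapshot.lastAppliedSequence, snapshot.openQuantity);
        for (LogEntry entry : log) {
            state.apply(entry);
        }
        return state;
    }
}
```

Replaying the full log from an empty state and replaying only the tail on top of a snapshot must produce the same result; refusing to continue past a gap keeps a damaged log from silently producing a divergent book.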

2. High-Frequency Market Hotspots (Hot Keys & Shard Saturation)

If Apple releases an earnings report, the order rate on AAPL alone can spike to several times the platform's design peak (for example, $500,000\text{ operations/sec}$), saturating the core that runs that book.

Resilience Playbook: Leverage LMAX RingBuffer Batching. If the input Disruptor queues up a backlog, the matching engine processes all pending items in a single sequential execution sweep, updating in-memory models before issuing a single aggregated batch confirmation. This buffers extreme micro-spikes cleanly.
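
The batching behavior can be sketched as a drain loop: the consumer reads the published cursor once, processes everything up to it in one sweep, and flags the last event so downstream work (such as sending confirmations) happens once per batch. `BatchDrain` and its `Handler` interface are illustrative names; the handler signature mirrors the EventHandler shown earlier, with a plain long standing in for the event.

```java
class BatchDrain {
    interface Handler {
        void onEvent(long orderId, long sequence, boolean endOfBatch);
    }

    // Processes every sequence in (lastProcessed, cursor] in a single sweep.
    // Returns the new lastProcessed value. The ring length is assumed to be a power of two.
    static long drain(long[] ring, long lastProcessed, long cursor, Handler handler) {
        int mask = ring.length - 1;
        for (long seq = lastProcessed + 1; seq <= cursor; seq++) {
            boolean endOfBatch = (seq == cursor);
            handler.onEvent(ring[(int) (seq & mask)], seq, endOfBatch);
        }
        return Math.max(lastProcessed, cursor);
    }

    public static void main(String[] args) {
        long[] ring = {10, 11, 12, 13, 14, 15, 16, 17};
        // Five events are pending (sequences 0..4); they are handled in one sweep.
        long last = drain(ring, -1L, 4L, (orderId, seq, endOfBatch) -> {
            System.out.println("match order " + orderId + " at seq " + seq);
            if (endOfBatch) {
                System.out.println("flush one aggregated confirmation");
            }
        });
        System.out.println("last processed = " + last);
    }
}
```

The larger the backlog, the more events share one flush, so per-event overhead drops exactly when load is highest.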

7. Scaling Challenges & System Bottlenecks

1. The Partitioning Paradox

Standard systems can be sharded horizontally using standard partition keys (like user_id). However, an Order Book for a single trading pair (e.g., BTC/USD) cannot be partitioned. Matching orders requires a globally ordered sequence of bids and asks.

Mitigation: Implement Symmetric Sharding, i.e. shard by symbol. Run a dedicated single-threaded matching instance per trading pair: BTC/USD runs on Node A, ETH/USD on Node B, AAPL on Node C.
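
A minimal routing sketch for this per-symbol layout follows, using one single-threaded executor per shard inside a single process. The hash-based assignment and the `SymbolRouter` name are assumptions for illustration; a production deployment would more likely use an explicit symbol-to-node table so that hot symbols can be isolated on their own hardware.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Routes every order for a symbol to the same single-threaded shard,
// so each book sees its orders in one serialized stream.
class SymbolRouter {
    private final ExecutorService[] shards;

    SymbolRouter(int shardCount) {
        shards = new ExecutorService[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Stable mapping: the same symbol always lands on the same shard.
    int shardFor(String symbol) {
        return Math.floorMod(symbol.hashCode(), shards.length);
    }

    // Enqueues work for the shard that owns the symbol's order book.
    void submit(String symbol, Runnable task) {
        shards[shardFor(symbol)].execute(task);
    }

    void shutdown() {
        for (ExecutorService shard : shards) {
            shard.shutdown();
        }
    }
}
```

`Math.floorMod` is used instead of `%` because `hashCode()` can be negative, which would otherwise yield a negative array index.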

2. Cache Stampede and Egress Saturation

When millions of WebSocket clients view the order book, streaming every minor liquidity delta will instantly saturate downstream node connections.

Mitigation: Delta Compression & Level-2 Bucket Consolidation.
- WebSocket proxies push only changes, rather than full book state.
- Price points are consolidated into defined buckets (e.g., $10\text{-cent}$ increments instead of individual decimal levels) prior to network broadcast.
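
Both mitigations can be sketched in a few lines. The example below consolidates price levels into fixed-size buckets and then computes the delta between two consecutive snapshots; `L2Aggregation` is a hypothetical name, a volume of 0 in the delta is assumed to mean "level removed", and prices are rounded down to the bucket boundary for simplicity (a real feed might round asks up instead).

```java
import java.util.Map;
import java.util.TreeMap;

class L2Aggregation {
    // Merges price levels (price in cents -> volume) into buckets of bucketSizeCents.
    static TreeMap<Long, Long> bucketize(Map<Long, Long> levels, long bucketSizeCents) {
        TreeMap<Long, Long> buckets = new TreeMap<>();
        for (Map.Entry<Long, Long> level : levels.entrySet()) {
            long bucket = Math.floorDiv(level.getKey(), bucketSizeCents) * bucketSizeCents;
            buckets.merge(bucket, level.getValue(), Long::sum);
        }
        return buckets;
    }

    // Returns only the buckets whose volume changed; removed buckets map to 0.
    static TreeMap<Long, Long> delta(Map<Long, Long> previous, Map<Long, Long> current) {
        TreeMap<Long, Long> changes = new TreeMap<>();
        for (Map.Entry<Long, Long> level : current.entrySet()) {
            if (!level.getValue().equals(previous.get(level.getKey()))) {
                changes.put(level.getKey(), level.getValue());
            }
        }
        for (Long price : previous.keySet()) {
            if (!current.containsKey(price)) {
                changes.put(price, 0L);
            }
        }
        return changes;
    }
}
```

A gateway would keep the last bucketized snapshot per symbol, push only the delta on each tick, and send a full snapshot when a client first subscribes or reconnects.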

8. Staff Engineer Perspective (Operational Deep Dive)

Pitfall

JVM Garbage Collection Latency Hazards: A standard Java runtime relies on automatic Garbage Collection (GC). Even with low-pause collectors such as G1GC or ZGC, a pause of a few hundred microseconds consumes a large share of a sub-millisecond p99 budget, and longer pauses break the SLA outright. High-performance matching engines address this with a Zero-Allocation Architecture:

Avoid creating new objects inside the execution path: Reuse event pools via the Flyweight design pattern.
Avoid primitive auto-boxing: Leverage primitive-specialized collections (e.g. Eclipse Collections) instead of HashMap<Integer, Order>.
Allocate transient queue memory off-heap via direct ByteBuffer allocation to hide data from GC scanning sweeps.
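
As a small illustration of the first guideline, the sketch below pre-allocates a fixed pool of order objects and recycles them, so the hot path never calls `new`. `OrderEventPool` and `PooledOrder` are hypothetical names, and the pool assumes it is only touched by the single matching thread, so it uses no synchronization.

```java
// Fixed-size object pool: all allocation happens once, at construction time.
class OrderEventPool {
    static final class PooledOrder {
        long orderId;
        long priceInCents;
        long quantity;

        void reset() {
            orderId = 0L;
            priceInCents = 0L;
            quantity = 0L;
        }
    }

    private final PooledOrder[] free;
    private int freeCount;

    OrderEventPool(int capacity) {
        free = new PooledOrder[capacity];
        for (int i = 0; i < capacity; i++) {
            free[i] = new PooledOrder();
        }
        freeCount = capacity;
    }

    // Hands out a pre-allocated object; fails fast instead of allocating when empty.
    PooledOrder acquire() {
        if (freeCount == 0) {
            throw new IllegalStateException("pool exhausted");
        }
        return free[--freeCount];
    }

    // Clears the object and returns it to the pool for reuse.
    void release(PooledOrder order) {
        if (freeCount == free.length) {
            throw new IllegalStateException("released more objects than were acquired");
        }
        order.reset();
        free[freeCount++] = order;
    }
}
```

Failing fast on exhaustion is a deliberate choice in this sketch: silently falling back to `new` would reintroduce the garbage the pool exists to avoid.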

[!INSIGHT] Multicast vs. TCP Unicast: In traditional financial exchanges (like NYSE), market data is broadcast using IP Multicast (UDP). This allows the exchange to send a single packet that is duplicated at the network switch level, delivering updates to all co-located institutional clients simultaneously. Cloud-hosted exchanges (such as most crypto platforms) generally cannot rely on IP Multicast: public cloud networks offer limited multicast support, and retail clients connect over the public internet, where multicast is not routed. They must instead build robust, high-fanout TCP Unicast WebSocket networks backed by regional Edge CDNs to approximate the same real-time price dissemination.

9. Candidate Verbal Script (Interview Guide)

Part 1: Establishing Architectural Alignment

Interviewer: "We need you to design a high-throughput, low-latency stock trading platform. How do you approach the core architecture?"

Candidate: "To achieve sub-millisecond latencies at scale, my system must run the core matching engine in-memory as a single-threaded execution loop. The core matching algorithms themselves are CPU-bound; introducing locks, mutexes, or multiple threads would cause severe context-switching overhead and cache-line invalidation. By keeping the entire order book in RAM and matching sequentially on a core pinned strictly to that task, I can process millions of transactions per second. To bridge the gap between high-throughput multi-threaded network layers and the single-threaded matching core, I will use an LMAX Disruptor RingBuffer pattern, allowing lock-free event sequencing."

Part 2: Addressing Durability & Crash Recovery

Interviewer: "If the matching engine is single-threaded and runs purely in-memory, what happens when that node loses power? How do you prevent data loss without blocking matching latency?"

Candidate: "We enforce durability using Event Sourcing via a Write-Ahead Log (WAL) Sequencer. Before an order is submitted to the matching engine's input RingBuffer, it must first be appended to a high-speed sequential log: either a local Raft log running on NVMe SSDs or a Kafka cluster configured with acks=all. The sequencer assigns a globally monotonic sequence number to every order. When a matching engine node crashes, it reboots, loads the latest nightly snapshot of the order book, and replays the log starting from the sequence number recorded in that snapshot. Because order matching is completely deterministic, the engine reconstructs exactly the same state without losing a single transaction. The log append happens in the sequencer, ahead of the matching thread, so the matching loop itself never blocks on disk I/O; downstream journaling of match results is asynchronous and stays off the critical path."