Lesson 62 of 105 17 minFlagship

System Design: Designing an Event Mesh (Pub/Sub at Global Scale)

How do you route millions of events across multi-cloud and multi-region deployments with sub-millisecond latencies? Designing a global Event Mesh.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • An Event Mesh dynamically routes pub/sub events across heterogeneous clouds and geographic regions without a centralized broker.
  • Dynamic Subscription Routing allows brokers to build routing tables on the fly based on active subscriber interests.
  • To survive regional link failures, implement backpressure queues and deduplication ledgers at mesh broker boundaries.
Recommended Prerequisites
System Design Interview FrameworkWebSocket Fleet Management: Handling Millions of Connections

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

In modern distributed architectures, applications are rarely confined to a single geographic data center or cloud provider. A global enterprise might run consumer-facing web services in AWS (US East), inventory databases in Google Cloud (Europe), and legacy fulfillment systems on-premises. Connecting these isolated platforms using a centralized message broker (such as Kafka or RabbitMQ) introduces major performance bottlenecks: WAN latency spikes, single-point-of-failure risks, and complex cross-cloud security configurations.

An Event Mesh is the industry standard for routing real-time events across multi-cloud and multi-region deployments. Unlike a traditional centralized broker, an Event Mesh is a decentralized network of interconnected event brokers that dynamically calculate routing paths. Events are forwarded only to the nodes where active subscribers exist, keeping cross-region network traffic low and optimizing latency.

This system design guide details the architectural blueprint for designing a global Event Mesh capable of routing 1,000,000 events per second with sub-millisecond local processing latencies.


System Requirements

To design a global Event Mesh, we divide our requirements into functional capabilities, non-functional operational constraints, and explicit scale parameters.

Functional Requirements

  • Dynamic Subscription Routing: Automatically propagate subscriber topics across the mesh, building routing paths dynamically without manual broker configuration.
  • Multi-Protocol Support: Ingest and deliver events using standard protocols (such as AMQP, MQTT, WebSockets, gRPC, and REST).
  • Decentralized Event Distribution: Publish an event in one region and have the mesh route it only to brokers containing active subscribers.
  • Message Filtering at the Edge: Support SQL-like selector filters on event headers to drop or route events before they traverse cross-region links.
  • Traceability & Monitoring: Track message propagation paths across broker hops for debugging and auditing.

Non-Functional Requirements

  • Ultra-Low Local Latency: Local event routing (publisher-to-subscriber in the same data center) must execute in less than 5 milliseconds.
  • WAN Bandwidth Optimization: Restrict WAN link usage by avoiding broadcast patterns; only replicate messages across regions if active remote subscribers exist.
  • Fault-Tolerant Link Routing: Automatically route around failed broker nodes or slow WAN links using dynamic path recalculation.
  • High Concurrency Connection Handling: Support millions of persistent subscriber connections (e.g., WebSockets, MQTT) across global edge nodes.

Scale Assumptions

  • Global Ingestion Rate: 1,000,000 events per second at peak.
  • Geographic Coverage: 3 primary regions: Americas (us-east-1), Europe (eu-west-1), and Asia-Pacific (ap-southeast-1).
  • Average Event Size: 500 bytes.
  • WAN Replication Rate: Average of 20% of events require cross-region replication.
  • Active Edge Connections: 5,000,000 concurrent client subscribers.

API Design and Interface Contracts

The Event Mesh exposes ingestion endpoints for publishers and WebSockets/gRPC streams for active subscribers.

1. Ingest Event Payload (HTTP POST /v1/events/publish)

Used by HTTP-compatible clients to publish events to the local edge broker node.

Request Headers:

Content-Type: application/json
CE-SpecVersion: 1.0
CE-Type: com.codesprintpro.order.created
CE-Source: /billing/checkout
CE-ID: evt_uuid_88192ab_99

Request Payload:

{
  "orderId": "ord_88271a2",
  "customerId": "cust_331902",
  "amountCents": 4900,
  "currency": "USD"
}

Response Payload (202 Accepted):

{
  "eventId": "evt_uuid_88192ab_99",
  "status": "ACCEPTED",
  "brokerTimestamp": 1770289945
}

2. Client Subscription Stream (gRPC Protocol)

Subscriber applications open bi-directional gRPC streams to establish persistent connections and register topic filters dynamically.

syntax = "proto3";

package codesprintpro.eventmesh.v1;

service SubscriptionService {
  rpc Subscribe (stream SubscriptionRequest) returns (stream EventEnvelope);
}

message SubscriptionRequest {
  string client_id = 1;
  enum Action {
    SUBSCRIBE = 0;
    UNSUBSCRIBE = 1;
  }
  Action action = 2;
  string topic_pattern = 3; -- e.g., "orders/+/billing"
  string filter_expression = 4; -- e.g., "headers.amountCents > 10000"
}

message EventEnvelope {
  string event_id = 1;
  string topic = 2;
  map<string, string> headers = 3;
  bytes payload_bytes = 4;
  int64 publish_timestamp_ms = 5;
  repeated string routing_path = 6; -- List of broker IDs traversed
}

High-Level Architecture

The Event Mesh consists of a network of local brokers that communicate using a peer-to-peer routing protocol (such as Link State Routing).

Global Event Mesh Topology

This diagram illustrates the decentralized routing topology across three primary geographic regions.

graph LR
    subgraph Americas Region
        PubA[Publisher A] -->|Publish| BrokerA1[Broker Node US-1]
        BrokerA1 <-->|Internal Link| BrokerA2[Broker Node US-2]
        SubA[Subscriber A] <-->|WS Connection| BrokerA2
    end

    subgraph Europe Region
        BrokerE1[Broker Node EU-1] <--> SubE[Subscriber E]
    end

    subgraph Asia Region
        BrokerAP1[Broker Node AP-1] <--> SubAP[Subscriber AP]
    end

    BrokerA1 <-->|WAN Link: 70ms RTT| BrokerE1
    BrokerE1 <-->|WAN Link: 120ms RTT| BrokerAP1
    BrokerA2 <-->|WAN Link: 180ms RTT| BrokerAP1

    note over BrokerA1, BrokerE1: Dynamic Routing Protocol:<br/>Propagates active topic subscriptions<br/>across regions

Cross-Region Message Replication Flow

When a message is published, the local broker checks the dynamic routing table and replicates the packet to remote regions only if active subscribers exist there.

sequenceDiagram
    autonumber
    participant Pub as Publisher in US
    participant US_Node as Broker Node US-1
    participant EU_Node as Broker Node EU-1
    participant AP_Node as Broker Node AP-1
    participant Sub as Subscriber in AP

    note over AP_Node, Sub: Subscriber registers interest for "orders/ap/#"
    Sub->>AP_Node: gRPC Stream: Subscribe("orders/ap/#")
    AP_Node->>US_Node: Routing Protocol Update: Add route for "orders/ap/#" via AP-1
    AP_Node->>EU_Node: Routing Protocol Update: Add route for "orders/ap/#" via AP-1

    Pub->>US_Node: Publish message on topic "orders/ap/checkout"
    US_Node->>US_Node: Query Routing Table for "orders/ap/checkout"
    
    note over US_Node, EU_Node: No subscribers exist in Europe
    note over US_Node, AP_Node: Subscriber exists in Asia (AP-1)
    
    US_Node->>AP_Node: Forward Encrypted Packet (WAN)
    AP_Node->>Sub: Deliver message over active gRPC connection

Low-Level Design and Schema

While event routing is processed in memory by the brokers, the mesh control plane manages routing rules, tracing logs, and broker node directories in a PostgreSQL relational database.

-- Tracks all registered broker nodes in the mesh and their statuses
CREATE TABLE broker_nodes (
    broker_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    broker_name VARCHAR(128) NOT NULL UNIQUE,
    ip_address VARCHAR(45) NOT NULL,
    datacenter_region VARCHAR(64) NOT NULL,
    broker_status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, DRAINING, OFFLINE
    active_connections_count INT NOT NULL DEFAULT 0,
    cpu_utilization DECIMAL(5, 2) NOT NULL DEFAULT 0.00,
    last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_brokers_region_status ON broker_nodes (datacenter_region, broker_status);

-- Stores dynamic routing rules configured across the mesh
CREATE TABLE event_routing_rules (
    rule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_topic_pattern VARCHAR(256) NOT NULL,
    destination_broker_id UUID NOT NULL REFERENCES broker_nodes(broker_id) ON DELETE CASCADE,
    filter_expression TEXT, -- Optional SQL-like payload selector filter
    rule_status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, SUSPENDED
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT uk_topic_broker UNIQUE (source_topic_pattern, destination_broker_id)
);

CREATE INDEX idx_routing_rules_topic ON event_routing_rules (source_topic_pattern);

-- Tracks message propagation hops for tracing and audits (sampled writes)
CREATE TABLE message_trace_logs (
    trace_id BIGSERIAL PRIMARY KEY,
    message_id VARCHAR(128) NOT NULL,
    origin_broker_id UUID NOT NULL REFERENCES broker_nodes(broker_id),
    topic VARCHAR(256) NOT NULL,
    payload_size_bytes INT NOT NULL,
    hop_count INT NOT NULL DEFAULT 1,
    routing_path VARCHAR(256)[] NOT NULL, -- Array of broker names traversed
    processing_latency_ms INT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_trace_lookup ON message_trace_logs (message_id);
CREATE INDEX idx_trace_created ON message_trace_logs (created_at DESC);

Schema Rationale & Index Optimization

  1. idx_brokers_region_status: Optimizes the broker allocation queries. The control plane regularly scans this index to identify healthy brokers in a specific region to handle new client WebSocket connections.
  2. idx_trace_created: Message trace logs grow rapidly. By index-partitioning the table monthly based on created_at, we ensure support teams can query recent transactions without scanning billions of older records.
  3. uk_topic_broker: Enforces strict unique constraints on topic-to-broker rules, preventing duplicate route creation during routing table updates.

Scaling Challenges and Capacity Estimation

Routing 1,000,000 events per second globally requires evaluating network bandwidth, memory allocation for routing tables, and edge connection CPU capacities.

1. WAN Bandwidth Capacity Calculation

  • Assumptions:

    • Global peak rate = $1,000,000$ events/second
    • Average event size = $500$ bytes
    • WAN replication rate = 20% ($200,000$ events/second)
    • Envelope framing overhead (routing path arrays, trace headers) = 1.2x multiplier
  • Calculations: $$\text{WAN Throughput in Bytes} = 200,000\text{ events/s} \times 500\text{ bytes} \times 1.2 = 120,000\text{ KB/second} = 120\text{ MB/second}$$ $$\text{WAN Bandwidth Required} = 120\text{ MB/second} \times 8 \approx 960\text{ Mbps}$$

To route $960$ Mbps of event traffic reliably across regions without packet loss, the network must use dedicated cloud transit networks (such as AWS Transit Gateway or Cloudflare Magic WAN). Additionally, brokers must implement compression algorithms (like ZSTD or Snappy) on cross-region connections to reduce payload sizes by up to 50%, bringing the bandwidth requirement down to approximately $480$ Mbps.

2. In-Memory Routing Table Scale

  • Assumptions:

    • Active subscribers across the mesh = $5,000,000$
    • Unique subscription topic patterns = $500,000$
    • Each routing entry in a broker's memory uses a Radix Tree node structure consuming $128$ bytes.
  • Calculations: $$\text{Memory for Routing Radix Tree} = 500,000\text{ nodes} \times 128\text{ bytes} = 64,000,000\text{ bytes} = 64\text{ MB}$$

Since the routing table requires only $64$ MB of memory, brokers can cache the entire routing table in RAM. This allows routing decisions to execute in less than 10 microseconds, satisfying the sub-millisecond local processing latency requirement.

3. Edge WebSocket Connection Memory Footprint

  • Assumptions:

    • Total concurrent subscribers = $5,000,000$
    • Number of edge broker nodes = $50$ instances ($100,000$ connections per node)
    • Each persistent socket connection consumes $40$ KB of RAM in the OS kernel buffer.
  • Calculations: $$\text{RAM per Edge Node} = 100,000 \times 40\text{ KB} = 4,000,000\text{ KB} \approx 4\text{ GB}$$

An edge broker requires approximately 4 GB of memory to maintain persistent TCP sockets, leaving ample memory on a standard 16 GB server instance for CPU thread pools, TLS encryption buffers, and message queues.


Failure Scenarios and Resilience

A global Event Mesh must maintain operational continuity even when cross-region WAN links drop or subscribers fail to consume data.

1. Network Partition between Regions (WAN Split-Brain Routing)

The network link between us-east-1 and ap-southeast-1 drops, but both regions remain online locally.

  • The Threat: The brokers in Asia assume the US region has crashed, while the US region assumes Asia is dead. If they attempt to rebuild routing tables independently without coordination, they will create circular routing paths when the link is restored.
  • Resilience Design:
    • We use a Link-State routing protocol (like OSPF adapted for application messaging) with monotonic sequence numbers.
    • When the link fails, brokers update their local routing topology map, flag the link as UNREACHABLE, and route traffic through an alternative path (e.g., US routing through Europe to reach Asia).
    • If no path exists, messages are buffered in a local disk-backed Spool Queue (up to a configured memory limit, e.g., 50 GB). When the link recovers, the routing engines exchange topology sync packets, resolve the routing tables, and drain the spooled messages sequentially.

2. Subscriber Slow-Consumer Backpressure Cascades

A subscriber client consuming real-time market data experiences thread pool exhaustion, slowing down message reads.

  • The Threat: The edge broker's memory buffers fill up waiting for the client to acknowledge packets. This memory pressure spreads to upstream brokers, stalling the ingestion pipeline for other healthy subscribers.
  • Resilience Design:
    • We configure Consumer Isolation and Lease Limits.
    • If a subscriber's socket TCP buffer fills up and the client cannot consume data fast enough, the edge broker drops the connection or drops non-critical packets.
    • For critical queues, we implement Dynamic Throttle Webhooks. The broker monitors consumer queue depth: $$\text{Queue Depth} \ge \text{threshold_limit}$$ When reached, it triggers backpressure notifications back to the publishing source, slowing down ingestion at the edge.

3. Regional Broker Node Crash

An edge broker node in the Europe cluster crashes under high CPU load.

  • The Threat: Active client connections are dropped, and events routed to that node are lost.
  • Resilience Design:
    • We deploy brokers in High-Availability Pairs (Primary-Backup) within each local cluster using virtual Anycast IP routing.
    • The backup node monitors the primary node using keep-alive pings. If the primary node fails to respond within 500 milliseconds, the backup node takes over the Anycast IP address.
    • Clients reconnect automatically using their WebSocket fallback logic, and the backup broker rebuilds the active subscription table in less than 1 second.

4. Message Duplication during WAN Retries

A WAN link drops packets, causing a sending broker in Europe to retry sending a block of events to Asia before receiving confirmation ACKs.

  • The Threat: Asia processes and delivers the same events multiple times to subscribers, corrupting downstream transactional states.
  • Resilience Design:
    • Every event envelope includes a unique UUID and a monotonically increasing sequence vector.
    • Receiving brokers maintain a rolling Deduplication Ledger (in-memory Key-Value store with a 15-minute TTL).
    • If an incoming message ID already exists in the ledger, the broker discards the duplicate payload and immediately returns the ACK confirmation to the sender, ensuring at-least-once delivery.

Architectural Trade-offs

Choosing the routing model and link protocol requires balancing delivery latencies against system complexity.

Trade-off 1: Decentralized Event Mesh vs. Centralized Message Broker

An Event Mesh routes events dynamically across peers; a centralized broker requires all publishers and subscribers to connect to a single cluster.

Feature / Metric Decentralized Event Mesh Centralized Message Broker (Kafka)
Geographic Latency Low. Local traffic stays local; WAN is only used when remote subscribers exist. High. A client in Asia must traverse the WAN to reach a cluster in the US.
Operational Overhead High. Requires managing routing tables and link state networks. Low. A single cluster is managed centrally.
WAN Bandwidth Usage Low. Messages are filtered at the edge and routed selectively. High. All messages are replicated to the central cluster.
Single Point of Failure None. Node or link failures are bypassed automatically. High. If the central cluster is unreachable, all clients are blocked.

Brokers can communicate using reliable TCP sockets or high-speed, loss-tolerant UDP datagrams.

Feature / Metric TCP Mesh Links UDP Mesh Links
Reliability High. Guarantees packet ordering and delivery confirmation. Low. Packets can be dropped or arrive out of order.
Latency Profile High. Packet loss triggers retransmissions, blocking the link. Low. Real-time packets are sent without transmission delays.
Congestion Control High. Adapts to link capacity limits. Low. Requires application-layer rate-limiting.

Staff Engineer Perspective

Operating a global Event Mesh requires monitoring connection and resource limits across all edge instances.


Verbal Script

Interviewer: "How does an Event Mesh route messages dynamically without a centralized broker, and how does it prevent WAN bottlenecks?"

Candidate: "An Event Mesh routes messages dynamically using Dynamic Subscription Propagation combined with local routing trees at each broker node.

Unlike a centralized broker where all messages are sent to a single location, brokers in an Event Mesh are connected in a peer-to-peer network.

When a subscriber connects to a local broker in Tokyo and subscribes to a topic pattern (like orders/ap/#), the Tokyo broker registers this subscription locally and propagates a routing update to its neighboring brokers in Europe and the Americas.

These brokers update their internal routing tables to indicate that messages matching orders/ap/# should be forwarded to Tokyo.

If a publisher in New York publishes a message to orders/us/checkout, the local US broker checks its routing table. Finding no active subscribers in Europe or Asia for US checkout events, it delivers the message locally, avoiding WAN usage.

If the publisher writes to orders/ap/checkout, the US broker sees the route to Tokyo and forwards the packet across the WAN.

This edge filtering prevents WAN bottlenecks by ensuring that messages are only replicated across geographic regions when active remote subscribers exist."


Interviewer: "How would you handle a slow consumer in the Event Mesh to prevent it from stalling the ingestion pipeline for other users?"

Candidate: "We handle slow consumers by implementing TCP window-size checks, broker-side discard policies, and consumer group isolation.

First, when a subscriber's connection slows down (due to resource exhaustion or client lag), its OS TCP buffer fills up. The broker detects this when the socket write calls return EWOULDBLOCK or when the TCP send window shrinks.

Second, to prevent this lag from consuming broker memory, we configure a Discard Policy based on the message type. For real-time updates (like stock tickers), the broker drops older unsent packets in the queue, preferring to deliver the latest state.

For transactional events that cannot be dropped, the broker writes the overflow to a disk-backed spool queue and flags the client as BACKPRESSURED.

Third, we isolate subscribers using Consumer Groups.

Multiple instances of a service share the consumer load for a topic.

If one subscriber instance lags, the broker stops routing messages to its socket and redistributes the load to healthy instances in the group, protecting the ingestion pipeline from local client failures."


Interviewer: "What is your strategy for handling routing loops in a multi-path Event Mesh when network links fail and recover?"

Candidate: "We prevent routing loops using Split-Horizon routing protocols combined with monotonically increasing sequence vectors and loop-detection flags in our message envelopes.

First, the mesh brokers run a Link-State routing protocol. Each broker maintains a topology map of the network.

When a link fails, brokers update their maps and calculate the shortest path using Dijkstra's algorithm.

The split-horizon rule ensures that a broker never advertises a route back to the neighbor it received it from, preventing basic routing loops.

Second, during rapid network failures and recoveries, routing tables can become temporarily inconsistent.

To prevent packets from looping, every event envelope includes a routing_path header (an array of broker IDs).

When a broker receives a packet, it checks the routing path array:

  • If its own broker ID is already in the list, it detects a loop.
  • The broker drops the message, logs a routing anomaly, and updates the local routing table to mark the path as invalid.

Third, we set a strict Time-To-Live (TTL) or hop limit header on all event envelopes (typically 16 hops).

If a packet bounces between nodes during a routing convergence, the hop counter decrements on each node.

When the hop count reaches zero, the message is discarded, preventing packets from looping indefinitely and consuming network resources."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.