System Design: Designing a Pub/Sub Messaging Platform

Introduction & The Role of Pub/Sub

In modern distributed systems, services must communicate asynchronously to remain resilient and scalable. Synchronous HTTP/REST requests couple services, forcing the upstream caller to wait for the downstream service to complete processing. If a downstream service fails or experiences a latency spike, the failure propagates upstream, causing a cascading outage.

A Publish/Subscribe (Pub/Sub) messaging platform decouples these components by introducing an asynchronous event bus. Publishers send events to named channels called Topics without knowing who will consume them. Subscribers register interest by creating Subscriptions to these topics. The Pub/Sub platform ensures that every message published to a topic is copied and delivered to all active subscriptions.

Designing a Pub/Sub platform at internet scale (similar to Google Cloud Pub/Sub or Apache Pulsar) presents major distributed systems challenges: handling millions of messages per second, guaranteeing durable persistence, preventing message loss, and managing slow consumers.

Requirements and System Goals

Functional Requirements

Topic and Subscription Management: Publishers must be able to create, delete, and list topics. Subscribers must be able to create and configure subscriptions attached to a parent topic.
Multi-Subscriber Fanout: Every message published to a topic must be delivered to all active subscriptions attached to that topic. If a topic has three subscriptions, each subscription must receive its own copy of the message stream.
Flexible Delivery Modes: Support both Push delivery (the platform pushes messages to client HTTP webhooks or gRPC endpoints) and Pull delivery (subscribers poll the platform for messages).
Message Acknowledgment (ACK): Subscribers must explicitly acknowledge processed messages. Unacknowledged messages must be redelivered after a configurable acknowledgment deadline (the visibility timeout) expires.
Dead-Letter Queues (DLQ): Messages that fail processing repeatedly (after a max retry limit) must be automatically routed to a Dead-Letter Queue for inspection.

Non-Functional Requirements

High Write/Read Throughput: The system must scale horizontally to handle millions of publish requests per second, matching the volume of massive clickstream or telemetry pipelines.
At-Least-Once Delivery Guarantee: The platform must guarantee that every message is delivered to every active subscription at least once. No data loss is tolerated for durable subscriptions.
Horizontal Scalability: Topic storage and message routing capacity must scale out linearly by sharding topic data across multiple partitions and brokers.
Predictable Backlog Latency: The system must isolate performance so that a slow consumer on one subscription does not degrade publishing performance or delay delivery to other healthy subscriptions on the same topic.

API Interfaces and Service Contracts

We specify gRPC service contracts for publisher and subscriber interactions.

1. Publisher API (Publishing Messages)

Publishers write batches of messages to a specific topic.

syntax = "proto3";

package messaging.pubsub.v1;

service PublisherService {
  // Publish a batch of messages to a topic
  rpc Publish(PublishRequest) returns (PublishResponse);
}

message PubsubMessage {
  string message_id = 1;      // Unique message UUID generated by client or server
  bytes data = 2;             // Binary payload
  map<string, string> attributes = 3; // Custom metadata headers
  int64 publish_time_epoch_ms = 4;
}

message PublishRequest {
  string topic = 1;           // Format: projects/{project}/topics/{topic}
  repeated PubsubMessage messages = 2;
}

message PublishResponse {
  repeated string message_ids = 1; // Server-acknowledged message IDs
}

2. Subscriber API (Pull Delivery Model)

Subscribers call Pull to retrieve messages and Acknowledge to mark them as processed.

service SubscriberService {
  // Pull a batch of messages from a subscription
  rpc Pull(PullRequest) returns (PullResponse);
  
  // Acknowledge receipt and processing of messages
  rpc Acknowledge(AcknowledgeRequest) returns (AcknowledgeResponse);
}

message PullRequest {
  string subscription = 1;     // Format: projects/{project}/subscriptions/{sub}
  int32 max_messages = 2;      // Maximum batch size requested
  bool return_immediately = 3;  // Long-polling behavior toggle
}

message ReceivedMessage {
  string ack_id = 1;           // Temporary ID representing this specific delivery
  PubsubMessage message = 2;
  int32 delivery_attempt = 3;  // Number of delivery attempts so far, including this one
}

message PullResponse {
  repeated ReceivedMessage received_messages = 1;
}

message AcknowledgeRequest {
  string subscription = 1;
  repeated string ack_ids = 2; // IDs to confirm successful processing
}

message AcknowledgeResponse {
  bool success = 1;
}

High-Level Design and Visualizations

The platform decouples message ingestion (writing to a persistent partitioned log) from message delivery (routing and tracking client reads).

Pub/Sub System Architecture Topology

graph TD
    Publishers[Publisher Clients] -->|1. HTTP POST / gRPC Publish| FE[Frontend API Gateways]
    FE -->|2. Check metadata / route| MS[Metadata Service: etcd/ZooKeeper]
    
    subgraph Message Broker & Storage Cluster
        FE -->|3. Append Log Batch| B1[Broker 1: Leader Topic A - Part 1]
        FE -->|3. Append Log Batch| B2[Broker 2: Leader Topic A - Part 2]
        
        B1 -->|Sync log replication| B1_Rep1[(Replica Broker Node)]
        B2 -->|Sync log replication| B2_Rep2[(Replica Broker Node)]
    end
    
    subgraph Delivery Engine
        B1 -.->|4. Fetch Log Segments| D1[Delivery Agent 1]
        B2 -.->|4. Fetch Log Segments| D2[Delivery Agent 2]
        
        D1 -->|5a. Push Event / Webhook| Sub_Push[Push Subscriber Endpoint]
        D2 -->|5b. Long-Poll Response| Sub_Pull[Pull Subscriber Worker]
        
        Sub_Pull -->|6. Acknowledge message| FE
    end

    style Publishers fill:#f8f9fa,stroke:#343a40
    style Sub_Push fill:#d4edda,stroke:#28a745
    style Sub_Pull fill:#d4edda,stroke:#28a745
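
Step 3 in the diagram requires the gateway to choose a partition leader for each batch. The source does not specify the routing rule, so the sketch below assumes one common approach: a hypothetical `ordering_key` is hashed to a stable partition, and messages without a key are spread round-robin. The function name, the key, and the module-level counter are illustrative assumptions, not part of the API defined above.

```python
import hashlib
from itertools import count
from typing import Optional

# Shared counter for keyless messages (assumption: one counter per gateway process).
_round_robin = count()


def choose_partition(partition_count: int, ordering_key: Optional[str] = None) -> int:
    """Return a partition index in the range [0, partition_count)."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    if ordering_key is None:
        # No key: spread load evenly across partitions.
        return next(_round_robin) % partition_count
    # Keyed: a stable hash keeps every message for one key on one partition,
    # which preserves per-key ordering within that partition's log.
    digest = hashlib.sha256(ordering_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count
```

A cryptographic hash is used here only because it is stable across processes and restarts, unlike Python's built-in `hash()` for strings. Note that changing `partition_count` remaps keys, so a real system would need a resharding strategy.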

Partition Replica Replication Loop

To ensure durability, brokers replicate topic partitions using a Raft or Paxos-based consensus loop.

sequenceDiagram
    participant P as Publisher Client
    participant L as Broker Leader (Node 1)
    participant F1 as Broker Follower (Node 2)
    participant F2 as Broker Follower (Node 3)

    P->>L: Publish (Message Batch)
    L->>L: Append to local uncommitted WAL segment
    par Leader to Node 2
        L->>F1: AppendEntries (Log payload)
    and Leader to Node 3
        L->>F2: AppendEntries (Log payload)
    end
    F1-->>L: AppendEntries ACK
    F2-->>L: AppendEntries ACK
    Note over L: Consensus Reached (Majority Write 2/3)
    L->>L: Commit Log Index
    L-->>P: PublishResponse (Msg IDs Acknowledged)
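
The "Commit Log Index" step in the sequence above can be sketched as follows. This is a simplified illustration of majority-based commit, assuming the leader tracks the highest log index each follower has acknowledged; the function and parameter names are hypothetical, and details of a full protocol (for example, Raft's rule that only entries from the leader's current term are committed by counting replicas) are omitted.

```python
from typing import Sequence


def quorum_commit_index(leader_last_index: int, follower_match_indexes: Sequence[int]) -> int:
    """Return the highest log index stored on a majority of replicas.

    The leader counts as one replica. With 3 replicas the majority is 2,
    so an entry is committed once the leader and any one follower hold it.
    """
    indexes = sorted([leader_last_index, *follower_match_indexes], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest index is held by at least `majority` replicas.
    return indexes[majority - 1]
```

For example, if the leader is at index 12 and the followers have acknowledged 10 and 9, the commit index is 10: the leader and one follower both hold every entry up to 10, so the publisher can be acknowledged for those entries without waiting on the slowest replica.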

Low-Level Design and Schema Strategies

To track which subscriber has received which messages without scanning millions of rows, the platform needs metadata tracking schemas.

Metadata Storage Schema: Subscription Registry

For configuration and subscription metadata, we use a strongly consistent distributed SQL database (e.g., CockroachDB or Spanner):

CREATE TABLE topics (
    topic_path VARCHAR(255) PRIMARY KEY,
    retention_period_seconds INT NOT NULL DEFAULT 604800, -- 7 days default
    partition_count INT NOT NULL DEFAULT 4,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE subscriptions (
    subscription_path VARCHAR(255) PRIMARY KEY,
    topic_path VARCHAR(255) NOT NULL REFERENCES topics(topic_path),
    delivery_type VARCHAR(10) NOT NULL CHECK (delivery_type IN ('PUSH', 'PULL')),
    push_endpoint VARCHAR(1024), -- NULL if PULL
    ack_deadline_seconds INT NOT NULL DEFAULT 10,
    dead_letter_topic VARCHAR(255),
    max_delivery_attempts INT NOT NULL DEFAULT 5,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

In-Memory Ack State Tracking Map (Broker Delivery Engine)

For active subscriptions, persisting every individual message ACK as a database write does not scale. Instead, the delivery agents keep ack state in memory using a Sliding Window Offset Range Map coupled with individual ACK tracking.

For partitioned logs (like Kafka), we store a single monotonic offset pointer representing the point before which all messages are confirmed. For unordered messaging platforms (like Google Cloud Pub/Sub), the broker additionally tracks a sparse state map of unacknowledged message IDs:

Subscription Path	Topic Partition	Monotonic Read Offset	Unacknowledged Message IDs (Leased List)	Lease Expiry Epoch
`sub-billing-processing`	`topic-orders-part-0`	`10080`	`[10082, 10085, 10089]`	`1774834810000`
`sub-analytics-ingest`	`topic-orders-part-0`	`10089`	`[]`	N/A

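If a message ID in the leased list (e.g., 10085) passes its lease expiry epoch before an ACK is received, the delivery agent releases the lease and schedules the message for redelivery. The message remains unacknowledged, so the monotonic read offset cannot advance past it until a later delivery is acknowledged.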
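
The state described above can be sketched as a small in-memory tracker. This is a minimal illustration under assumptions, not the platform's actual data structure: the class name `AckTracker`, its methods, and the use of log offsets as message IDs are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import time
from typing import Dict, List, Optional, Set


class AckTracker:
    """Ack state for one subscription on one partition (illustrative only)."""

    def __init__(self, start_offset: int, ack_deadline_seconds: float = 10.0):
        self.ack_offset = start_offset       # every offset below this is acknowledged
        self.next_offset = start_offset      # next offset that was never delivered
        self.ack_deadline = ack_deadline_seconds
        self.leases: Dict[int, float] = {}   # offset -> lease expiry (epoch seconds)
        self.acked: Set[int] = set()         # acked offsets at or above ack_offset
        self.redeliver: List[int] = []       # expired offsets waiting for redelivery

    def lease_next(self, now: Optional[float] = None) -> int:
        """Hand out the next message, preferring expired ones, and start its lease."""
        now = time.time() if now is None else now
        if self.redeliver:
            offset = self.redeliver.pop(0)
        else:
            offset = self.next_offset
            self.next_offset += 1
        self.leases[offset] = now + self.ack_deadline
        return offset

    def ack(self, offset: int, now: Optional[float] = None) -> bool:
        """Record an ACK; reject it if the lease is unknown or already expired."""
        now = time.time() if now is None else now
        expiry = self.leases.get(offset)
        if expiry is None or expiry <= now:
            return False
        del self.leases[offset]
        self.acked.add(offset)
        # Slide the window: advance past every contiguous acknowledged offset.
        while self.ack_offset in self.acked:
            self.acked.discard(self.ack_offset)
            self.ack_offset += 1
        return True

    def expire(self, now: Optional[float] = None) -> List[int]:
        """Release leases past their deadline and queue them for redelivery."""
        now = time.time() if now is None else now
        expired = sorted(o for o, exp in self.leases.items() if exp <= now)
        for offset in expired:
            del self.leases[offset]
            self.redeliver.append(offset)
        return expired
```

Only `ack_offset` would need to be persisted periodically; the sparse maps can be rebuilt after a crash by redelivering everything at or above that offset, which is consistent with at-least-once delivery.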

Scaling and Operational Challenges: Calculations & Formulations

Pub/Sub platforms are highly network-bound. Let us calculate the network bandwidth required to handle a high-volume fanout scenario.

Back-of-the-Envelope Bandwidth Calculation

Let us define:

$R_{\text{msg}}$: Ingestion rate of unique messages (e.g., 1,000,000 messages/sec).
$S_{\text{msg}}$: Average message size (e.g., 200 bytes).
$N_{\text{sub}}$: Number of independent subscriptions attached to the topic (e.g., 10 subscriptions).

Step 1: Calculate Ingress and Replication Traffic (Write Path)

The publishers send the messages to the frontend gateways. Gateways write to partition leaders, and each leader replicates to 2 followers, so every message is stored on 3 nodes.

$$\text{Ingress Network Rate} = R_{\text{msg}} \times S_{\text{msg}} = 1,000,000 \text{ msg/s} \times 200 \text{ bytes} = 200,000,000 \text{ bytes/sec} = 200 \text{ MB/sec}$$

With 3-way partition replication inside the broker cluster:

$$\text{Internal Replication Traffic} = 200 \text{ MB/sec} \times 2 \text{ replicas} = 400 \text{ MB/sec}$$

Step 2: Calculate Delivery Egress (Read Path / Fanout)

The delivery service must read the message log and push it to all subscriptions. Each subscription receives a copy of the stream:

$$\text{Fanout Rate} = R_{\text{msg}} \times N_{\text{sub}} = 1,000,000 \text{ msg/s} \times 10 = 10,000,000 \text{ msg/s}$$

$$\text{Egress Network Rate} = \text{Fanout Rate} \times S_{\text{msg}} = 10,000,000 \text{ msg/s} \times 200 \text{ bytes} = 2,000,000,000 \text{ bytes/sec} = 2.0 \text{ GB/sec}$$

Step 3: Total Broker Network Load

The network load on the broker cluster is:

$$\text{Total Network Load} = \text{Ingress} + \text{Replication} + \text{Egress}$$

$$\text{Total Network Load} = 200 \text{ MB/sec} + 400 \text{ MB/sec} + 2,000 \text{ MB/sec} = 2.6 \text{ GB/sec}$$

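To carry $2.6$ gigabytes per second of network I/O without saturation, the broker cluster must be sharded across at least 11 broker nodes, assuming each node has a 2.5 Gbps network interface card (NIC) run at no more than 80 percent of capacity. Each node then provides $2.5 \times 0.8 = 2$ Gbps, or 250 MB/sec, of usable bandwidth, and $2{,}600 / 250 = 10.4$, which rounds up to 11 nodes.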
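
The three steps above can be reproduced with a short script, which makes it easy to vary the inputs. The function name and parameter names are illustrative assumptions; the formulas and default values are the ones used in this section, and like the text the sketch treats ingress, replication, and egress as sharing one bandwidth budget per node.

```python
import math
from typing import Tuple


def broker_nodes_needed(
    rate_msgs_per_sec: int,
    msg_bytes: int,
    subscriptions: int,
    replication_factor: int = 3,
    nic_gbps: float = 2.5,
    max_utilization: float = 0.8,
) -> Tuple[int, int]:
    """Return (total network load in bytes/sec, minimum broker node count)."""
    ingress = rate_msgs_per_sec * msg_bytes            # publishers -> brokers
    replication = ingress * (replication_factor - 1)   # leader -> followers
    egress = ingress * subscriptions                   # fanout to every subscription
    total = ingress + replication + egress
    # Convert NIC speed from gigabits/sec to usable bytes/sec.
    usable_per_node = nic_gbps * 1e9 / 8 * max_utilization
    return total, math.ceil(total / usable_per_node)


total, nodes = broker_nodes_needed(1_000_000, 200, 10)
print(total, nodes)  # 2600000000 11
```

Egress dominates: with 10 subscriptions it is ten times the ingress rate, so the subscription count, not the publish rate, is the main driver of cluster size in this scenario.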

Zero-Copy Optimization

To scale the delivery service egress to gigabytes per second, delivery nodes must avoid copy operations between kernel space and user space.

The Optimization: The broker uses the sendfile system call. This allows the OS kernel to transmit data bytes directly from the page cache to the network socket descriptor, bypassing user-space buffer copies entirely.

$$\text{Zero-Copy Egress Path: } \text{Disk} \longrightarrow \text{Page Cache} \xrightarrow{\text{sendfile}} \text{NIC Socket Buffer} \longrightarrow \text{Network}$$
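
A minimal sketch of this path in Python, using the standard library's `socket.sendfile`, which uses `os.sendfile` where the platform provides it and otherwise falls back to ordinary `send` calls. The helper names `send_segment` and `demo` and the idea of addressing a log segment by byte offset and length are assumptions for illustration; a production broker would typically do this in its own runtime, and the zero-copy benefit is lost if payloads must be encrypted in user space before sending.

```python
import os
import socket
import tempfile
from typing import Optional


def send_segment(sock: socket.socket, path: str, offset: int = 0,
                 count: Optional[int] = None) -> int:
    """Stream a byte range of a log segment file to a connected, blocking socket.

    Returns the number of bytes sent. The file contents are not read into
    this process's buffers when the kernel sendfile path is available.
    """
    with open(path, "rb") as segment:
        return sock.sendfile(segment, offset=offset, count=count)


def demo() -> bytes:
    """Send bytes 2 through 6 of a temporary file across a local socket pair."""
    reader, writer = socket.socketpair()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
        path = tmp.name
    try:
        send_segment(writer, path, offset=2, count=5)
        return reader.recv(16)
    finally:
        reader.close()
        writer.close()
        os.remove(path)
```

Calling `demo()` returns the five bytes starting at offset 2 of the temporary file, which mirrors how a delivery node could serve a subscriber's read from a known position in a segment.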

Trade-offs and Architectural Alternatives

Designing a Pub/Sub platform requires choosing between log-based partitioned stream brokers and memory-buffered message brokers.

Message Platform Comparison Table

Dimension / Choice	Log-Based Partitioned Broker (e.g., Kafka)	Memory-Buffered Broker (e.g., RabbitMQ)	Cloud-Native Decoupled Broker (e.g., Pulsar, GC PubSub)
Storage Engine	Append-only files on disk	In-memory queues (flushed to disk under memory pressure)	Split: Stateless brokers + segment-based bookies (BookKeeper)
Consumption Model	Pull-based offset tracking	Push-based routing via exchanges	Hybrid: Push and Pull via independent subscriptions
Order Guarantees	Strict per partition	Relaxed (Ordering lost on message redeliveries)	Relaxed or strict depending on subscription type
Message Lifecycle	Retained until expiration TTL (Immutable Log)	Deleted immediately upon successful ACK	Retained until all subscriptions have acknowledged
Slow Consumer Impact	Low (Consumers read at their own pace; deep backlog reads can compete for disk I/O and page cache)	High (Buffers grow, consuming broker memory and slowing writes)	Low (Broker storage decoupled from delivery worker resources)

Key Architectural Trade-offs

Pull (Client Polling) vs. Push (HTTP Webhook):
- Push: Ideal for serverless architectures (e.g., AWS Lambda, GCP Cloud Run) because subscribers do not need to maintain active polling connections. However, the platform must handle backpressure. If the subscriber is overloaded, the push delivery worker must back off or it will crush the subscriber endpoint.
- Pull: Excellent for high-throughput batch workers. Subscribers poll for batches when they have spare processing capacity, providing natural client-side flow control. The trade-off is higher idle poll connection overhead.
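
The push-side backoff can be sketched as follows. This is one possible policy (exponential backoff with full jitter), not necessarily the one a given platform uses; the function names, the base and cap values, and the `send` callable standing in for an HTTP or gRPC delivery attempt are all assumptions.

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(attempts: int, base_seconds: float = 1.0, cap_seconds: float = 60.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay per retry; the ceiling doubles up to the cap."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so retries do not synchronize.
        yield rng.uniform(0, ceiling)


def push_with_backoff(send: Callable[[], bool], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> Optional[int]:
    """Call send() until it succeeds; return the attempt number, or None if all fail."""
    delays = backoff_delays(max_attempts - 1, rng=rng)
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
        delay = next(delays, None)
        if delay is not None:
            sleep(delay)  # wait before the next attempt; no wait after the last one
    return None
```

A `None` result is the point at which a delivery worker would stop retrying and, per the functional requirements, route the message to the subscription's dead-letter topic.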

Failure Modes and Fault Tolerance Strategies

Operating at scale exposes the platform to routine failures at the node, network, and subscriber layers.

1. Slow Consumer Backlog (Cascading Broker Out of Memory)

If a subscription’s workers slow down, messages accumulate in the backlog. In a memory-buffered system, this backlog fills RAM, causing the broker to throttle publishers or crash.

Mitigation: Implement Tiered Log Storage. The broker keeps the most recent 10 minutes of log segments in RAM/Page Cache for active subscribers. If a subscriber falls behind, its reads are directed to disk segments, keeping RAM free for fast path publishes.

2. Message Duplication (At-Least-Once Replay Scenarios)

If worker A on a subscription retrieves a message and processes it, but the network connection drops before it can send the Acknowledge RPC, the platform's visibility lease expires. The platform assumes the delivery failed and redelivers the message, possibly to another worker B on the same subscription, resulting in duplicate processing.

Mitigation: The platform cannot prevent these duplicates under at-least-once delivery. The application tier must implement Idempotency Checkpoints: every subscriber write to database tables must include a uniqueness check on a transaction key (the message_id, which stays the same across redeliveries) so that duplicate deliveries are filtered out.
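
One way to implement such a checkpoint is to record the message_id and the business write in the same database transaction, relying on a primary-key constraint to reject duplicates. The sketch below uses SQLite from the Python standard library; the table names, the `payments` example, and the helper names are hypothetical and stand in for whatever the subscriber actually writes.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a dedup table and an example business table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE payments (message_id TEXT, amount_cents INTEGER)")
    return conn


def handle_once(conn: sqlite3.Connection, message_id: str, amount_cents: int) -> bool:
    """Apply the side effect only for unseen message IDs; return True if applied."""
    try:
        # One transaction: the dedup marker and the business write commit together,
        # or both roll back if the marker already exists.
        with conn:
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "INSERT INTO payments (message_id, amount_cents) VALUES (?, ?)",
                (message_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery: the work was already done, so it is safe to ACK again.
        return False
```

In either case the subscriber should acknowledge the message afterwards; a `False` result means an earlier delivery already committed the write, so only the ACK was lost.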

3. Broker Leader Crash & Partition Rebalancing

If a broker node hosting the leader replica of Topic Partition 1 crashes, the cluster coordinator (e.g., etcd or ZooKeeper) detects the lost heartbeat.

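Mitigation: The coordinator advances the partition epoch and elects one of the in-sync followers as the new partition leader. Frontend gateways (and any clients that cache partition metadata) re-resolve the partition metadata and reconnect to the new leader. Failover time is bounded mainly by the heartbeat timeout, which is typically on the order of seconds rather than milliseconds.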
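
The epoch advancement can be sketched as a pure function over partition metadata. This is an illustration under assumptions rather than a specific coordinator's algorithm: the `PartitionState` type, the rule of preferring the in-sync replica with the longest log, and the `accept_write` fencing check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class PartitionState:
    leader: str
    epoch: int
    in_sync: frozenset  # brokers caught up with the leader, leader included


def fail_over(state: PartitionState, dead_broker: str,
              last_log_index: Dict[str, int]) -> PartitionState:
    """Return the partition state after dead_broker stops heartbeating."""
    survivors = state.in_sync - {dead_broker}
    if dead_broker != state.leader:
        # A follower died: shrink the in-sync set, keep the leader and epoch.
        return PartitionState(state.leader, state.epoch, survivors)
    if not survivors:
        raise RuntimeError("no in-sync replica available; partition is offline")
    # Prefer the survivor with the longest log; sorting makes ties deterministic.
    new_leader = max(sorted(survivors), key=lambda broker: last_log_index[broker])
    return PartitionState(new_leader, state.epoch + 1, survivors)


def accept_write(state: PartitionState, writer_epoch: int) -> bool:
    """Fence stale leaders: only writes stamped with the current epoch are accepted."""
    return writer_epoch == state.epoch
```

The epoch check is what prevents a partitioned former leader from accepting writes after a new leader has been elected: its requests carry the old epoch and are rejected by replicas that have seen the new one.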

Verbal Script

Interviewer: "How would you design a distributed Pub/Sub platform like Google Cloud Pub/Sub that supports high throughput and handles slow consumers gracefully?"

Candidate: "To design a high-throughput Pub/Sub system, I would build a decoupled architecture comprising four main components: a stateless Frontend API Gateway layer, a distributed consensus metadata store, a broker storage cluster, and a separate delivery engine.

To scale the write path, I would partition topics into multiple logical logs and distribute them across broker nodes. The frontend writes to these partition leaders, which replicate logs synchronously to followers using a quorum consensus protocol before acknowledging the write.

To handle slow consumers without affecting write performance, I would decouple the message delivery logic from the ingestion path.

The storage layer acts as an append-only log. When a subscriber falls behind, its backlog is stored on disk segments. Healthy subscribers read their messages from the OS page cache (the fast path), while slow subscribers read from disk. This prevents a slow consumer from consuming broker memory or blocking publishers.

For the delivery model, I would support gRPC long-polling so pull subscribers can retrieve messages in batches, and run a pool of push delivery agents that route messages to subscriber webhooks, using exponential backoff to apply backpressure when an endpoint is overloaded."

Interviewer: "What happens when a subscriber crashes mid-processing? How does the platform guarantee the message is not lost?"

Candidate: "The platform guarantees durability by combining message persistence with a leased acknowledgment model.

When a subscriber retrieves a message via a pull call or a push event, the delivery agent does not delete the message. Instead, it marks the message state as 'leased' and starts an internal timer based on the subscription's ack_deadline_seconds.

If the subscriber crashes mid-processing, it will fail to send an Acknowledge call containing the ack_id before the lease timer expires.

Once the lease deadline passes, the delivery agent moves the message back to the active queue. It becomes visible to subsequent pull requests or push worker runs, ensuring it is delivered to another healthy subscriber worker.

The message is only deleted or marked as acknowledged when a client returns a valid AcknowledgeRequest before the lease timer expires."