Designing a High-Throughput Notification System for 100K Events per Second

End-to-end architecture for a notification system handling 100,000 events per second: capacity planning, Kafka partition sizing, fan-out strategy, rate limiting, idempotency, and incident simulation.

Key Takeaways

  • **Hybrid Fan-Out Strategy:** Dynamically partitioning delivery routes using Eager Push for regular users and Lazy Pull for high-follower celebrity profiles.
  • **Kafka Partition Sizing:** Sizing partitions and isolating per-channel queues so that a blocked downstream provider cannot stall the rest of the pipeline.
  • **Sliding-Window Rate Limiting:** Enforcing per-user frequency limits with an atomic sliding-window check in Redis.

Recommended Prerequisites

  • System Design Interview Framework
  • Kafka Internals: Deep Dive into Distributed Messaging

Mental Model

Designing a notification system that handles 100,000 events per second is primarily a high-volume ingestion and fan-out routing problem. If a single upstream event (e.g., a viral post or a global promotion) must alert millions of followers, the system faces massive write amplification. The architecture in this lesson decouples ingestion from delivery: raw events are buffered in Kafka, consumed by stateless event processors, expanded by a hybrid push/pull fan-out engine, written to partitioned dispatch topics, and sent asynchronously to downstream channel providers (FCM, APNs, SES, and Twilio) through rate-limited dispatcher pools.


Requirements and System Goals

To architect an enterprise notification engine of this magnitude, we must establish quantitative scaling bounds and latency budgets.

1. Functional Requirements

  • Multi-Channel Delivery Support: Route messages across diverse endpoints: push notifications (FCM/APNs), email (AWS SES), SMS (Twilio), and in-app feeds.
  • Quiet Hours & User Preferences: Enforce per-user channel preferences, blocking deliveries during user-defined quiet windows.
  • Asynchronous Delivery Receipts: Track each message through its delivery states (Pending $\rightarrow$ Sent $\rightarrow$ Delivered, or Failed).

2. Non-Functional Requirements & Performance Budgets

  • Ultra-High Ingestion Throughput: Support up to 100,000 raw events per second at peak.
  • Low Latency Delivery SLA: Deliver push notifications to active clients in less than 5 seconds end-to-end, and email/SMS in less than 60 seconds.
  • At-Least-Once Delivery Guarantee: Avoid lost alerts by committing Kafka consumer offsets only after a message has been processed, and deduplicate the resulting redeliveries with idempotency keys.
  • Strict Per-User Rate Limiting: Prevent notification spam by capping push notifications to a maximum of 10 alerts per user per hour.
  • Active-Active Ingestion Architecture: Run active-active deployments across geographic regions to survive regional outages.

API Interfaces and Service Contracts

Stateless producer services submit events to the notification engine as JSON over HTTP.

1. Submit Notification Event API Contract

This endpoint is called by internal services (e.g., a Social service when a user publishes a post) to ingest a raw event.

POST /api/v1/events

Request Payload:

{
  "event_id": "evt_9a8b7c6d-4433-2211-bb00-eeddccbbaa99",
  "sender_id": "usr_celebrity_888",
  "event_type": "USER_POSTED",
  "payload": {
    "post_id": "post_5f8a2c1d3e",
    "post_preview": "Exciting system design updates coming!"
  },
  "submitted_at": "2026-06-01T11:08:00Z"
}

Response Payload (202 Accepted):

{
  "status": "ACCEPTED",
  "event_id": "evt_9a8b7c6d-4433-2211-bb00-eeddccbbaa99",
  "ingested_timestamp_ms": 1780312080250,
  "estimated_delivery_count": 48200
}
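
Because producers may retry a timed-out POST, the endpoint benefits from being idempotent on event_id. The following is a minimal Python sketch under stated assumptions, not the lesson's implementation: the IngestHandler class, the in-memory dedupe dictionary (a stand-in for a shared store), and the 400 response shape are all hypothetical, and estimated_delivery_count is omitted because computing it requires a follower lookup.

```python
import time
from typing import Dict, Optional, Tuple

REQUIRED_FIELDS = ("event_id", "sender_id", "event_type", "payload", "submitted_at")


class IngestHandler:
    """Hypothetical handler for POST /api/v1/events (illustrative only)."""

    def __init__(self) -> None:
        # Assumption: stand-in for a shared dedupe store with a TTL.
        self._receipts: Dict[str, dict] = {}

    def handle(self, event: dict, now_ms: Optional[int] = None) -> Tuple[int, dict]:
        missing = [f for f in REQUIRED_FIELDS if f not in event]
        if missing:
            return 400, {"status": "REJECTED", "missing_fields": missing}

        event_id = event["event_id"]
        if event_id in self._receipts:
            # A retry of an already accepted event returns the original
            # receipt instead of enqueueing the event a second time.
            return 202, self._receipts[event_id]

        receipt = {
            "status": "ACCEPTED",
            "event_id": event_id,
            "ingested_timestamp_ms": (
                now_ms if now_ms is not None else int(time.time() * 1000)
            ),
        }
        self._receipts[event_id] = receipt
        return 202, receipt
```

Returning the stored receipt on a duplicate keeps retries safe for producers while the pipeline behind the endpoint sees each event_id once.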

High-Level Design and Visualizations

Decoupling the high-frequency ingestion queue from the resource-heavy fan-out and delivery pipelines keeps a slow downstream stage from starving ingestion of threads.

1. High-Throughput Stream Ingestion Pipeline

This layout details how raw events are buffered in Kafka, processed by stateless microservices, and distributed across distinct channel queues.

graph TD
    subgraph Event Ingress
        Producers[Stateless Producer Services] -->|1. 100K events/sec| IngestTopic[raw-events Kafka Topic - 300 Partitions]
    end

    subgraph Processing & Fan-Out
        IngestTopic -->|2. Ingest Batch| ProcessorPool[Event Processor Cluster - 200 Pods]
        ProcessorPool -->|3. Fetch Preferences| UserDB[(Cassandra User Database)]
        ProcessorPool -->|4. Parallel Fan-Out| DispatchTopic[notifications-to-send Topic - 1500 Partitions]
    end

    subgraph Dispatch Layer
        DispatchTopic -->|5. Push Stream| PushDisp[Push Dispatcher Pool - FCM/APNs]
        DispatchTopic -->|5. Email Stream| EmailDisp[Email Dispatcher Pool - SES]
        
        PushDisp -->|6. Asynchronous Status| StatusTopic[notification-status Kafka Topic]
        EmailDisp -->|6. Asynchronous Status| StatusTopic
    end

    subgraph Persistence Store
        StatusTopic -->|7. Write Batch| Storage[(DynamoDB/Cassandra Status Store)]
    end

2. End-to-End Fan-Out Sequence Workflow

The diagram below traces one event from raw ingestion through preference resolution and rate limiting to final delivery.

sequenceDiagram
    autonumber
    participant Prod as Producer Service
    participant Ingest as raw-events Kafka
    participant Proc as Event Processor
    participant Redis as Redis Rate Limiter
    participant Queue as notifications-to-send
    participant Disp as Push Dispatcher (FCM)
    participant Client as User Device

    Prod->>Ingest: Publish event (USER_POSTED)
    Ingest->>Proc: Batch pull event
    Proc->>Proc: Resolve followers and preferences
    
    rect rgb(240, 255, 240)
        Note over Proc, Queue: Fan-Out Loop & Rate Limiting Checks
        Proc->>Redis: sliding_window_check (usr_123, limit 10/hour)
        Redis-->>Proc: Return Allowed (Count = 4)
        Proc->>Queue: Publish delivery task (usr_123, message_bytes)
    end

    Queue->>Disp: Consume delivery task
    Disp->>Disp: Acquire FCM channel token permit (Max 500/s)
    Disp->>Client: Send push message (APNs/FCM channel)
    Client-->>Disp: Acknowledge delivery
    participant Status as notification-status Kafka
    Disp->>Status: Publish status event (DELIVERED)
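
The permit step in the sequence above (Max 500/s) can be modeled with a token bucket in front of the provider call. This is a minimal single-threaded Python sketch, assuming a hypothetical TokenBucket class with an injectable clock; a real dispatcher pool would need a thread-safe or shared limiter.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical per-channel limiter; names are illustrative."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, permits: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= permits:
            self._tokens -= permits
            return True
        return False
```

A dispatcher would call try_acquire() before each send and back off briefly when it returns False, which smooths bursts instead of pushing them onto the provider.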

Low-Level Design and Schema Strategies

To support rapid sequential writes and scale storage cost-effectively, message histories are mapped to wide-column tables.

1. Cassandra Delivery History Schema

This schema partitions notification records by user_id so that a user's inbox is read from a single partition, already sorted newest-first by the clustering order.

-- User Notification History Ledger
CREATE TABLE user_notifications_ledger (
    user_id UUID,
    created_at TIMESTAMP,
    notification_id UUID,
    channel_type TEXT,        -- 'push', 'email', 'sms', 'in_app'
    title TEXT,
    body TEXT,
    delivery_status TEXT,     -- 'PENDING', 'SENT', 'DELIVERED', 'FAILED'
    gateway_message_id TEXT,  -- External ID from FCM/SES
    idempotency_key TEXT,
    metadata MAP<TEXT, TEXT>,
    PRIMARY KEY ((user_id), created_at, notification_id)
) WITH CLUSTERING ORDER BY (created_at DESC, notification_id ASC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1
  };

-- Expire rows automatically after 90 days (7,776,000 seconds)
ALTER TABLE user_notifications_ledger WITH default_time_to_live = 7776000;

2. Redis Sliding Window Rate Limiting Layout

To protect users from notification storms, the event processor runs an atomic sliding-window rate check against Redis sorted sets before publishing each delivery task.

  • Key Schema: rate:push:<user_id> (e.g. rate:push:usr_alpha_123).
  • Data Type: Sorted Set (ZSET).
  • Algorithm:
    1. Score = Timestamp of the event in milliseconds.
    2. Element = Unique string per event (e.g., a UUID), so that two events in the same millisecond do not overwrite each other.
    3. When an event arrives, execute a Redis Lua script to:
      • Remove elements older than 1 hour: ZREMRANGEBYSCORE key -inf (now - 3600000).
      • Count the remaining elements: ZCARD key.
      • If the count is less than 10, add the current event (ZADD key now element) and return success.
      • Otherwise, reject the event as rate limited.
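
The steps above can be sketched as follows. The Lua string is one plausible form of the script (the argument order and the PEXPIRE call that lets idle keys expire are assumptions, and it has not been run against a live Redis here). The InMemorySlidingWindow class is a hypothetical pure-Python model of the same logic, useful for unit tests, and it assumes timestamps arrive in non-decreasing order per user.

```python
import uuid
from collections import deque
from typing import Deque, Dict, Tuple

WINDOW_MS = 3_600_000  # 1 hour
LIMIT = 10             # max push notifications per user per window

# Assumed argument layout: KEYS[1]=rate key, ARGV=now_ms, window_ms, limit, member
SLIDING_WINDOW_LUA = """
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', now - window)
if redis.call('ZCARD', KEYS[1]) < limit then
  redis.call('ZADD', KEYS[1], now, ARGV[4])
  redis.call('PEXPIRE', KEYS[1], window)
  return 1
end
return 0
"""


class InMemorySlidingWindow:
    """Pure-Python model of the script above (illustrative, single process)."""

    def __init__(self, limit: int = LIMIT, window_ms: int = WINDOW_MS) -> None:
        self.limit = limit
        self.window_ms = window_ms
        self._log: Dict[str, Deque[Tuple[int, str]]] = {}

    def allow(self, user_id: str, now_ms: int) -> bool:
        entries = self._log.setdefault(f"rate:push:{user_id}", deque())
        # Mirrors ZREMRANGEBYSCORE key -inf (now - window); the bound is inclusive.
        while entries and entries[0][0] <= now_ms - self.window_ms:
            entries.popleft()
        # Mirrors ZCARD followed by a conditional ZADD.
        if len(entries) < self.limit:
            entries.append((now_ms, uuid.uuid4().hex))
            return True
        return False
```

Rejected events are not recorded, so a user who keeps hitting the limit is not penalized beyond the ten notifications actually delivered in the window.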

Scaling and Operational Challenges

1. Hybrid Fan-Out Design (Celebrity Outages)

If a user with $N = 100$ followers posts an update, fanning out 100 database writes is trivial. However, if a celebrity with 10,000,000 followers posts, an eager push fan-out attempts to write 10 million rows to Cassandra at once, saturating write throughput and starving the rest of the cluster of CPU and I/O.

graph LR
    subgraph Hybrid Fan-Out Routing
        Event[Raw Ingestion Event] -->|Check Follower Count| Router{Router Engine}
        Router -->|Followers < 100K| Eager[Eager Push Path]
        Router -->|Followers >= 100K| Lazy[Lazy Pull Path]
        
        Eager -->|1. Write to all follower feeds| DB_E[(Eager Cassandra DB)]
        Lazy -->|1. Write single celebrity feed| DB_L[(Lazy Cassandra DB)]
        Lazy -->|2. Pull & merge on user read| Client[User Feed Fetch]
    end

  • The Strategy:
    • We enforce a Follower Threshold ($T_{\text{followers}}$) of 100,000.
    • Eager Push Path: If the sender has fewer than 100,000 followers, we execute eager fan-out immediately, writing notifications to each follower's feed.
    • Lazy Pull Path: If the sender has 100,000 or more followers, we write exactly one record representing the celebrity update. When a follower opens their feed, the feed service reads the follower's eager notification partition and merges it with the lazy feed partitions of the celebrities they follow, keeping write amplification off the hot path.
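
The routing decision and the read-time merge can be sketched in a few lines of Python. The function names choose_path and read_feed and the (created_at_ms, notification_id) tuple layout are hypothetical; the sketch assumes every input list is already sorted newest-first, as the clustering order on created_at would provide.

```python
import heapq
from itertools import islice
from typing import List, Tuple

FOLLOWER_THRESHOLD = 100_000
Item = Tuple[int, str]  # (created_at_ms, notification_id)


def choose_path(follower_count: int, threshold: int = FOLLOWER_THRESHOLD) -> str:
    """Return 'eager' below the threshold and 'lazy' at or above it."""
    return "eager" if follower_count < threshold else "lazy"


def read_feed(eager_items: List[Item],
              celebrity_feeds: List[List[Item]],
              limit: int = 20) -> List[Item]:
    """Merge the user's eager feed with followed celebrity feeds, newest first."""
    merged = heapq.merge(eager_items, *celebrity_feeds,
                         key=lambda item: item[0], reverse=True)
    # heapq.merge is lazy, so only the first `limit` items are materialized.
    return list(islice(merged, limit))
```

Because the merge is a k-way merge over pre-sorted inputs, the read cost grows with the number of celebrities a user follows rather than with any celebrity's follower count.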

2. Kafka Partition Sizing and Consumer Alignment

In Kafka, a topic's consumer parallelism is capped by its partition count, and its write throughput by the capacity of the brokers hosting those partitions.

  • Ingestion Sizing:
    • Peak event rate: 100,000 events/sec.
    • Average payload size: 1 KB.
    • Throughput: 100 MB/sec.
    • Assume a single Kafka broker sustains roughly 50 MB/sec of writes.
    • With $3\times$ replication, the cluster absorbs about 300 MB/sec of replicated writes, so we deploy at least 6 brokers (plus headroom for failures) and configure the topic with 300 partitions. Each partition then carries roughly 333 events/sec, and up to 300 consumer threads can pull events in parallel.
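
The sizing arithmetic can be checked with a small Python helper. This is a back-of-the-envelope sketch: the function name size_cluster is hypothetical, 1 KB is treated as 1,000 bytes, and it assumes replica writes count against the same 50 MB/sec per-broker budget, which is why it arrives at six brokers before any failure headroom.

```python
import math


def size_cluster(events_per_sec: int, payload_bytes: int,
                 replication_factor: int, broker_mb_per_sec: float,
                 partitions: int) -> dict:
    """Rough capacity estimate; ignores headroom, compression, and batching."""
    ingest_mb = events_per_sec * payload_bytes / 1_000_000
    replicated_mb = ingest_mb * replication_factor
    return {
        "ingest_mb_per_sec": ingest_mb,
        "replicated_write_mb_per_sec": replicated_mb,
        # Never fewer brokers than replicas, even at low throughput.
        "min_brokers": max(replication_factor,
                           math.ceil(replicated_mb / broker_mb_per_sec)),
        "events_per_partition_per_sec": events_per_sec / partitions,
    }


plan = size_cluster(events_per_sec=100_000, payload_bytes=1_000,
                    replication_factor=3, broker_mb_per_sec=50,
                    partitions=300)
```

With these inputs the estimate is 100 MB/sec of ingest, 300 MB/sec of replicated writes, a floor of 6 brokers, and about 333 events/sec per partition.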

Notification Messaging Systems Trade-offs

Choosing a delivery model means trading write amplification and storage against read latency and operational complexity.

| Architectural Dimension | Eager Push Fan-Out | Lazy Pull Fan-Out | Hybrid Approach |
|---|---|---|---|
| Write Amplification | High (Multiplies writes by follower count) | Zero (One write per event) | Optimized (Low for regular, Zero for celebs) |
| Read/Fetch Latency | Ultra-Low (Reads pre-compiled feed directly) | High (Requires merging feeds at read time) | Low (Fast page loads for all user feeds) |
| Memory Utilization | High (Stores millions of redundant alert copies) | Ultra-Low (Single copy stored) | Medium (Well-balanced storage scale) |
| Operational Complexity | Low (Standard SQL/NoSQL insert sequences) | High (Requires active read-merge queries) | High (Requires routing logic and thresholds) |
| Best Use Case | Platforms with low follower counts. | Enterprise archives, slow read dashboards. | High-scale global social web systems. |

Failure Modes and Fault Tolerance Strategies

1. Consumer Backpressure Pause Under Downstream Failure

If FCM (Firebase Cloud Messaging) experiences a partial outage and begins rate-limiting our requests, our dispatch threads will block, raising latency and causing consumer queues to back up in Kafka.

  • The Resilience Strategy:
    • The dispatch consumer monitors downstream health metrics.
    • If FCM's error rate exceeds 20%, the consumer calls consumer.pause() on its assigned partitions.
    • The dispatcher stops fetching new messages while continuing to call poll() so it keeps its group membership, letting lag build up safely inside Kafka (which acts as a durable buffer) rather than overwhelming the dispatcher or the gateway.
    • A scheduler thread executes lightweight health checks. Once FCM health returns to normal, the thread triggers consumer.resume(), draining the buffered backlog without dropping notifications.
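
The pause and resume decision can be sketched as a small controller around the consumer. This Python sketch is illustrative: BackpressureController, the 5% resume threshold, the rolling window size, and the minimum sample count are all assumptions not taken from the lesson. The consumer is duck-typed and only needs assignment(), pause(*partitions), and resume(*partitions), which is the shape kafka-python's KafkaConsumer exposes.

```python
from collections import deque


class BackpressureController:
    """Pauses a consumer when the rolling error rate is high (illustrative)."""

    def __init__(self, consumer, pause_above: float = 0.20,
                 resume_below: float = 0.05, window: int = 100,
                 min_samples: int = 20) -> None:
        self.consumer = consumer
        self.pause_above = pause_above
        self.resume_below = resume_below  # hysteresis avoids rapid flapping
        self.min_samples = min_samples
        self._results = deque(maxlen=window)
        self.paused = False

    def error_rate(self) -> float:
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def record(self, success: bool) -> None:
        """Record a send result (or, while paused, a health-check result)."""
        self._results.append(success)
        rate = self.error_rate()
        partitions = self.consumer.assignment()
        enough = len(self._results) >= self.min_samples
        if not self.paused and enough and rate > self.pause_above:
            self.consumer.pause(*partitions)
            self.paused = True
        elif self.paused and rate < self.resume_below:
            self.consumer.resume(*partitions)
            self.paused = False
```

Using a lower resume threshold than the pause threshold keeps the dispatcher from oscillating while the provider is only partially recovered.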

2. Dead Letter Queue (DLQ) Routing Topologies

If a notification contains an invalid email address or a corrupted token, downstream gateways return non-retryable errors. Retrying these messages can never succeed and only blocks the messages queued behind them.

  • The DLQ Blueprint:
    • When a dispatcher catches a non-retryable exception (e.g. InvalidRegistration), it stops retrying immediately.
    • The message is routed to a dedicated notifications-dlq Kafka topic.
    • A DLQ consumer logs the diagnostic error, marks the user's push token or address as invalid in the database, and thereby prevents further deliveries to it, which protects our sender reputation with the gateways.
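
The routing rules above can be sketched as follows. All names in this Python sketch are hypothetical (Dispatcher, the two exception classes, the device_token field, mark_invalid_tokens), the DLQ is modeled as a list rather than a Kafka topic, and sending exhausted retries to the DLQ is an added assumption rather than something the lesson specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


class NonRetryableError(Exception):
    """Permanent failure, e.g. an invalid token or address."""


class RetryableError(Exception):
    """Transient failure, e.g. a timeout or a provider 5xx."""


@dataclass
class Dispatcher:
    send: Callable[[dict], None]
    max_attempts: int = 3
    dlq: List[dict] = field(default_factory=list)  # stand-in for the DLQ topic

    def dispatch(self, task: dict) -> str:
        for _ in range(self.max_attempts):
            try:
                self.send(task)
                return "SENT"
            except NonRetryableError as exc:
                # Permanent errors skip all remaining retries.
                self.dlq.append({"task": task, "reason": str(exc)})
                return "DEAD_LETTERED"
            except RetryableError:
                continue
        # Assumption: exhausted retries are also dead-lettered.
        self.dlq.append({"task": task, "reason": "retries exhausted"})
        return "DEAD_LETTERED"


def mark_invalid_tokens(dlq: List[dict]) -> Set[str]:
    """DLQ consumer side: collect tokens that must not be used again."""
    return {entry["task"]["device_token"] for entry in dlq
            if entry["reason"] == "InvalidRegistration"}
```

Keeping the token cleanup in the DLQ consumer rather than the dispatcher keeps the hot delivery path free of database writes.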


Production Readiness Checklist

Ensure these checks are satisfied before putting your high-throughput notification system into active production:

  • Consumer Partition Matching: Confirm that the number of active dispatcher consumers in each group does not exceed the number of Kafka topic partitions; any extra consumers sit idle.
  • Cassandra Compaction Settings: Check that the status tables use TimeWindowCompactionStrategy so that TTL-expired data can be dropped as whole SSTables.
  • Redis Rate Limiting Scripts: Verify that the Lua sliding-window scripts are loaded and execute atomically.
  • Independent DLQ Topics: Ensure separate DLQ topics are configured for push, email, and SMS channels to prevent cascading cross-channel blocks.


Verbal Script

Interviewer: "How would you design a high-throughput notification system capable of handling 100,000 events per second, and how do you manage the scaling challenges of fan-out and downstream failures?"

Candidate: "To design a high-throughput notification system handling 100,000 events per second, I would build an asynchronous, decoupled pipeline centered around Kafka for stream buffering, stateless consumer clusters for processing, and a hybrid fan-out model to manage write amplification.

The pipeline begins when upstream microservices publish raw events to a Kafka topic called raw-events, configured with 300 partitions to support highly parallel writes and reads. A stateless pool of Event Processors pulls these events. The processor's primary job is to resolve user notification preferences and compile the target distribution list.

This brings us to the fan-out problem, which is the core scaling challenge. If a viral user posts an update, an eager push to millions of followers at once will saturate write capacity on our Cassandra cluster. To resolve this, I would implement a Hybrid Fan-Out Strategy. If the sender has fewer than 100,000 followers, we execute eager fan-out, writing individual notifications directly to their followers' feed partitions in Apache Cassandra. However, if the sender has 100,000 or more followers, we bypass the write storm: we write the event exactly once to a celebrity log partition, and when followers open their feeds, the feed service pulls and merges the celebrity log at read time, collapsing millions of fan-out writes into one.

Once target notifications are compiled, they are written to a partitioned notifications-to-send Kafka topic. We run isolated dispatcher consumer groups for each delivery channel: Push, Email, and SMS. This isolation is critical; if our email gateway experiences latency, its consumer group lags independently, ensuring our time-sensitive push notifications continue to deliver in less than 5 seconds.

To protect users from notification spam, we enforce user-level rate limiting using Redis sorted sets (ZSET). The event processor executes an atomic Lua script that maintains a sliding window and blocks any request beyond 10 push notifications per user per hour.

Finally, to handle downstream outages, our consumers implement backpressure control. If Firebase Cloud Messaging error rates exceed 20%, the dispatcher calls consumer.pause(). This stops fetching from the assigned partitions and lets messages buffer safely in Kafka's durable log. Once FCM recovers, we call consumer.resume(), draining the backlog without dropping a notification or overwhelming the gateway."