System Design: Building a Webhook Delivery Platform

Design a production webhook delivery platform with event ingestion, outbox persistence, retries, exponential backoff, signing, endpoint secrets, idempotency, rate limits, dead-letter queues, replay, observability, and tenant isolation.

Key Takeaways

  • customers can create webhook endpoints
  • customers can subscribe endpoints to event types
  • product services can publish webhook events
Recommended Prerequisites
System Design Interview Framework

Webhooks represent a simple contract on the surface: "When an event occurs in our system, send an HTTP POST request with a JSON payload to a customer-configured URL."

However, running this at production scale introduces hard distributed systems problems. The platform must absorb customer-side fragility (slow endpoints, expired TLS certificates, flaky DNS, rate limits, and network dropouts) while preventing any noisy neighbor from starving other tenants. At the same time, slow webhook deliveries must never affect your own core transaction flows, and no committed event may be lost.

This guide designs a highly available, fault-tolerant, and secure webhook delivery platform sized for billions of deliveries per day.


System Requirements and Goals

To design a production-grade webhook delivery platform, we must partition our system goals into strict functional and non-functional requirements, backed by rigorous capacity estimations.

1. Functional Requirements

  • Endpoint Management (CRUD): Customers can register, modify, pause, and delete webhook endpoint URLs through an API or dashboard.
  • Subscription Configuration: Customers can subscribe specific endpoints to a subset of event types (e.g., order.created, invoice.payment_failed).
  • Secure Delivery: The platform delivers events via HTTP POST with cryptographic signatures (HMAC-SHA256) to guarantee payload integrity and authenticity.
  • Reliable Retry Mechanism: Failed deliveries must automatically retry using exponential backoff with jitter.
  • Observability & Logs: Customers can inspect the delivery history of their endpoints, complete with status codes, request/response headers, latency, and payloads.
  • Manual Replay: Customers can trigger manual replays for specific failed events within a 30-day retention window.
  • Administrative Controls: System operators can pause or globally disable chronically failing endpoints.

2. Non-Functional Requirements

  • At-Least-Once Delivery Guarantee: Every committed webhook event must be delivered at least once. We prioritize durability and availability over zero-duplicate constraints.
  • Decoupled Architecture (Low Latency Ingestion): Generating and publishing a webhook must never block or slow down the core product transaction path (e.g., checking out an order).
  • Tenant Isolation & Fair Scheduling: A sudden surge of events from a single high-volume client, or a slow endpoint belonging to one customer, must not delay webhook deliveries for other customers.
  • Strict Security Controls: The system must actively prevent Server-Side Request Forgery (SSRF) and DNS rebinding attacks since it executes HTTP requests to arbitrary user-controlled URLs.
  • High Scale & Throughput: Support ingestion of tens of thousands of events per second with graceful horizontal scaling.

3. Capacity Estimation & Scalability Math

Let's establish a concrete mathematical baseline for our capacity planning:

  • Average Event Ingestion Rate: $10,000$ events per second (eps).
  • Peak Event Ingestion Rate: $30,000$ eps.
  • Subscription Fan-Out Factor: On average, each event is subscribed to by $3$ distinct endpoints.
  • Peak Delivery Rate: $30,000 \times 3 = 90,000$ HTTP POST deliveries per second.
  • Average Payload Size: $2 \text{ KB}$ (compressed JSON).
  • Ingestion Network Bandwidth: $$\text{Bandwidth} = 30,000 \text{ eps} \times 2 \text{ KB} = 60 \text{ MB/s} = 480 \text{ Mbps}$$
  • Metadata Logging Volume: Each delivery attempt writes a structured metadata record of approximately $1 \text{ KB}$ (headers, status codes, latency, errors) to our analytical store. Sizing for sustained peak load gives an upper bound (retries add further writes on top): $$\text{Storage Rate} = 90,000 \text{ deliveries/sec} \times 1 \text{ KB} = 90 \text{ MB/s}$$ $$\text{Daily Storage Requirement} = 90 \text{ MB/s} \times 86,400 \text{ seconds} \approx 7.78 \text{ TB/day}$$
  • Retention Policy: Metadata stays in hot storage for 14 days ($14 \times 7.78 \approx 108.9 \text{ TB}$), then is compressed and archived to cold object storage (S3/GCS) for 365 days.
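
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming decimal units (1 MB = 1,000 KB, 1 TB = 1,000,000 MB); the function and field names are illustrative and not part of the platform.

```typescript
interface CapacityInputs {
  peakEventsPerSec: number;
  fanOut: number;
  payloadKB: number;
  logRecordKB: number;
  hotRetentionDays: number;
}

interface CapacityEstimate {
  peakDeliveriesPerSec: number;
  ingestMBps: number;
  ingestMbps: number;
  logMBps: number;
  logTBPerDay: number;
  hotStorageTB: number;
}

// Upper-bound sizing: assumes the peak rate is sustained for the whole day.
function estimateCapacity(input: CapacityInputs): CapacityEstimate {
  const peakDeliveriesPerSec = input.peakEventsPerSec * input.fanOut;
  const ingestMBps = (input.peakEventsPerSec * input.payloadKB) / 1000;
  const logMBps = (peakDeliveriesPerSec * input.logRecordKB) / 1000;
  const logTBPerDay = (logMBps * 86_400) / 1_000_000;
  return {
    peakDeliveriesPerSec,
    ingestMBps,
    ingestMbps: ingestMBps * 8,
    logMBps,
    logTBPerDay,
    hotStorageTB: logTBPerDay * input.hotRetentionDays,
  };
}

const estimate = estimateCapacity({
  peakEventsPerSec: 30_000,
  fanOut: 3,
  payloadKB: 2,
  logRecordKB: 1,
  hotRetentionDays: 14,
});
```

Swapping in the average rate of 10,000 eps instead of the peak gives the expected steady-state volume rather than the worst case.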

API Design and Interface Contracts

The platform exposes external REST APIs for customers to manage endpoints, as well as an internal ingestion contract for microservices.

1. External Endpoint Management API

Create a Webhook Endpoint

POST /v1/webhook_endpoints

Request Headers:

Authorization: Bearer secret_live_abc123
Content-Type: application/json

Request Payload:

{
  "url": "https://api.customer.com/webhooks/receiver",
  "subscribed_events": ["order.created", "order.fulfilled", "payment.failed"],
  "metadata": {
    "environment": "production"
  },
  "config": {
    "timeout_ms": 5000,
    "max_attempts": 5
  }
}

Response Payload (201 Created):

{
  "id": "wh_end_8f2d9c4e1a",
  "url": "https://api.customer.com/webhooks/receiver",
  "subscribed_events": ["order.created", "order.fulfilled", "payment.failed"],
  "status": "ACTIVE",
  "secret": "whsec_p9K2xR8qLmN7sJ4wB3tY1vZ6c",
  "created_at": "2026-05-23T08:00:00Z",
  "config": {
    "timeout_ms": 5000,
    "max_attempts": 5
  }
}

2. Delivery Payload Envelope (Sent to Customer URLs)

When a webhook is delivered, the payload is wrapped in a standard envelope.

POST https://api.customer.com/webhooks/receiver

Delivery Headers:

User-Agent: CodeSprintPro-Webhook-Deliverer/2.0
Content-Type: application/json
X-Webhook-Event-Id: evt_9a1b8c7d6e
X-Webhook-Delivery-Id: del_3f4e5d6c7b
X-Webhook-Timestamp: 1779523574
X-Webhook-Signature: t=1779523574,v1=9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08

Delivery Payload:

{
  "id": "evt_9a1b8c7d6e",
  "event_type": "order.created",
  "created_at": "2026-05-23T08:06:14Z",
  "tenant_id": "tenant_prod_xyz987",
  "data": {
    "order_id": "ord_88776655",
    "amount": 25000,
    "currency": "USD",
    "customer": {
      "id": "cust_4433",
      "email": "user@example.com"
    }
  }
}
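
On the receiving side, a customer would typically verify the X-Webhook-Signature header before trusting the payload. The sketch below assumes the signed string is `<timestamp>.<raw body>`, as in the delivery worker later in this lesson. The 300-second tolerance and the function name are assumptions for illustration, not values defined by the platform.

```typescript
import { createHmac, timingSafeEqual } from 'crypto';

// Assumed replay tolerance; the lesson does not specify one.
const TOLERANCE_SECONDS = 300;

function verifyWebhookSignature(
  secret: string,
  signatureHeader: string,
  rawBody: string,
  nowSeconds: number
): boolean {
  // Parse "t=<unix seconds>,v1=<hex digest>" into key/value pairs.
  const parts = new Map<string, string>();
  for (const piece of signatureHeader.split(',')) {
    const idx = piece.indexOf('=');
    if (idx > 0) parts.set(piece.slice(0, idx).trim(), piece.slice(idx + 1).trim());
  }

  const timestamp = Number(parts.get('t'));
  const provided = parts.get('v1');
  if (!Number.isInteger(timestamp) || !provided) return false;

  // Reject stale or future-dated deliveries to limit replay of captured requests.
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return false;

  const expected = createHmac('sha256', secret).update(`${timestamp}.${rawBody}`).digest();
  const given = Buffer.from(provided, 'hex');

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

The check must run against the raw request bytes; re-serializing parsed JSON can change key order or whitespace and break the digest.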

High-Level Design Architecture

The webhook delivery platform is partitioned into decoupled ingestion, resolution, scheduling, and execution components. This isolates the synchronous core transaction path from the highly variable, unpredictable nature of external network calls.

graph TD
    %% Define System Actors
    subgraph "Core Business Layer"
        ProdService[Order/Payment Service] -->|1. Write Transaction & Outbox| PrimaryDB[(PostgreSQL DB)]
        OutboxProcessor[Outbox Poller / CDC] -->|2. Read & Publish| IngestAPI[Webhook Ingestion API]
    end

    subgraph "Ingestion & Brokerage"
        IngestAPI -->|3. Route to Partition| KafkaBroker[Kafka Event Broker]
        KafkaBroker -->|4. Consumer Group Group-Resolve| FanoutEngine[Subscription Fan-Out Engine]
        MetadataStore[(Subscription & Endpoint DB)] <-->|Query Rules| FanoutEngine
    end

    subgraph "Distribution & Delivery"
        FanoutEngine -->|5. Create Delivery Tasks| RedisCache[(Redis Rate-Limits & Metadata)]
        FanoutEngine -->|6. Enqueue Tasks| DeliveryQueue[Kafka Delayed Delivery Queue]
        DeliveryQueue -->|7. Consume Task| WorkerPool[Async HTTP Delivery Workers]
        WorkerPool -->|8. Fetch Decrypted Secret| KMS[Key Management Service]
        WorkerPool -->|9. Safe HTTP POST via Proxy| SecureProxy[SSRF Sanitizing Outbound Proxy]
    end

    subgraph "Customer Ecosystem"
        SecureProxy -->|10. HMAC Signed Delivery| CustomerEndpoint[Customer Server URL]
    end

    subgraph "Observability & Recovery"
        WorkerPool -->|11. Attempt Logs| TimescaleDB[(ClickHouse/Timescale Log DB)]
        WorkerPool -->|12. Exhausted Attempts| DLQQueue[Kafka Dead-Letter Queue]
        DLQQueue -->|13. Persist DLQ| PrimaryDB
    end

    %% Styles
    style PrimaryDB fill:#1a1c23,stroke:#3b82f6,stroke-width:2px,color:#fff
    style KafkaBroker fill:#1a1c23,stroke:#f59e0b,stroke-width:2px,color:#fff
    style WorkerPool fill:#1e3a8a,stroke:#3b82f6,stroke-width:2px,color:#fff
    style SecureProxy fill:#0f172a,stroke:#10b981,stroke-width:2px,color:#fff
    style CustomerEndpoint fill:#2e1065,stroke:#a855f7,stroke-width:2px,color:#fff

Core Architecture Components

  1. Transactional Outbox Engine: Rather than pushing directly to the Webhook Ingestion API from application code, services write the event payload to an outbox table in the local PostgreSQL DB within the same ACID transaction as the business change. An asynchronous poller or Change Data Capture (CDC) connector (e.g., Debezium) then streams these rows to the Ingestion API.
  2. Webhook Ingestion API: A high-throughput, stateless microservice that validates the payload envelope, checks authentication, and appends the raw event to a log-structured message broker (Kafka).
  3. Subscription Fan-Out Engine: Consumes raw events from the main Kafka topic. For each event, it queries subscription metadata, identifies matching active endpoints, and resolves the target configuration. It then creates individual delivery records and enqueues them into partition-specific delivery queues.
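  4. Partitioned Delayed Delivery Queues: Managed via Kafka with messages keyed by tenant_id or endpoint_id, so deliveries for one key stay in order on a single partition. Because many keys hash to the same partition, keying alone does not isolate tenants from each other; that comes from the rate limiting and tiered topics covered under Fair Scheduling. Kafka has no native per-message delay, so retry timing relies on the next_attempt_at column in webhook_deliveries.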
  5. SSRF Sanitizing Outbound Proxy: An essential security buffer (e.g., Squid configured with strict IP filters, or a custom internal proxy) that resolves target DNS records, validates that they do not belong to private or local loopback CIDRs, and conducts the actual HTTP delivery.
  6. Analytical Log Database (ClickHouse): Stores chronological delivery attempt logs for customer dashboards, keeping transactional databases clear of heavy write-once, query-rarely logging workloads.
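
The fan-out step (component 3) can be sketched as a pure function over subscription data already loaded in memory. The type names and the deterministic delivery ID scheme are assumptions for illustration. A deterministic ID is one way to make reprocessing after a consumer restart detectable, not something the design above prescribes.

```typescript
interface WebhookEvent {
  id: string;
  tenantId: string;
  eventType: string;
}

interface EndpointConfig {
  endpointId: string;
  tenantId: string;
  status: string; // only 'ACTIVE' endpoints receive deliveries
  subscribedEvents: string[];
}

interface DeliveryTask {
  deliveryId: string;
  eventId: string;
  endpointId: string;
  tenantId: string;
  partitionKey: string;
}

function fanOut(event: WebhookEvent, endpoints: EndpointConfig[]): DeliveryTask[] {
  return endpoints
    .filter(
      (ep) =>
        ep.tenantId === event.tenantId &&
        ep.status === 'ACTIVE' &&
        ep.subscribedEvents.includes(event.eventType)
    )
    .map((ep) => ({
      // Same event + endpoint always yields the same ID, so a replayed
      // Kafka message does not create a second logical delivery.
      deliveryId: `del_${event.id}_${ep.endpointId}`,
      eventId: event.id,
      endpointId: ep.endpointId,
      tenantId: event.tenantId,
      // Keying by endpoint keeps one endpoint's deliveries on one partition.
      partitionKey: ep.endpointId,
    }));
}
```

In a real consumer the endpoint list would come from the subscription store (likely cached), and each task would be written to webhook_deliveries before being enqueued.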

Low-Level Design & Component Mechanics

To ensure high-performance execution and maximum reliability, we translate our architectural concepts into highly optimized database structures and scalable concurrent worker code.

1. Database Schema (PostgreSQL DDL)

The transactional store holds configuration state, subscriptions, and in-flight deliveries. High-frequency write tables use partitioning and compound indexes to optimize task scheduling.

-- Webhook Endpoints
CREATE TABLE webhook_endpoints (
    endpoint_id VARCHAR(64) PRIMARY KEY,
    tenant_id VARCHAR(64) NOT NULL,
    url VARCHAR(2048) NOT NULL,
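    secret_ciphertext TEXT NOT NULL, -- KMS-encrypted signing secret; HMAC signing needs the raw secret, so a one-way hash cannot be used
    status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, PAUSED, DISABLED_BY_SYSTEM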
    max_attempts INT NOT NULL DEFAULT 5,
    timeout_ms INT NOT NULL DEFAULT 5000,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT chk_timeout CHECK (timeout_ms BETWEEN 500 AND 30000),
    CONSTRAINT chk_attempts CHECK (max_attempts BETWEEN 1 AND 20)
);

CREATE INDEX idx_endpoints_tenant ON webhook_endpoints (tenant_id, status);

-- Webhook Subscriptions (Event-Type Mappings)
CREATE TABLE webhook_subscriptions (
    subscription_id VARCHAR(64) PRIMARY KEY,
    endpoint_id VARCHAR(64) NOT NULL REFERENCES webhook_endpoints(endpoint_id) ON DELETE CASCADE,
    event_type VARCHAR(128) NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT uq_endpoint_event UNIQUE (endpoint_id, event_type)
);

CREATE INDEX idx_subscriptions_lookup ON webhook_subscriptions (event_type) WHERE enabled = TRUE;

-- Durable Webhook Events Store
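CREATE TABLE webhook_events (
    event_id VARCHAR(64) NOT NULL,
    tenant_id VARCHAR(64) NOT NULL,
    event_type VARCHAR(128) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- PostgreSQL requires the partition key in every primary/unique key of a partitioned table
    PRIMARY KEY (event_id, created_at)
) PARTITION BY RANGE (created_at);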

-- Active Webhook Deliveries (Task Queue Manager)
CREATE TABLE webhook_deliveries (
    delivery_id VARCHAR(64) PRIMARY KEY,
    event_id VARCHAR(64) NOT NULL,
    endpoint_id VARCHAR(64) NOT NULL,
    tenant_id VARCHAR(64) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'PENDING', -- PENDING, RETRYING, SUCCESS, FAILED, DEAD_LETTER
    attempt_count INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_attempt_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Optimize Scheduler Task-Pull Queries
CREATE INDEX idx_deliveries_schedule ON webhook_deliveries (status, next_attempt_at) 
WHERE status IN ('PENDING', 'RETRYING');

CREATE INDEX idx_deliveries_lookup ON webhook_deliveries (endpoint_id, status, created_at DESC);

2. Delivery Worker Core Logic (TypeScript & Node.js)

The worker consumes tasks, constructs an HMAC signature using the endpoint secret, and triggers a secure outbound request with an explicit timeout.

import crypto from 'crypto';
import dns from 'dns/promises';
import https from 'https';
import type { LookupFunction } from 'net';
import ipRangeCheck from 'ip-range-check';

interface WebhookPayload {
  eventId: string;
  endpointId: string;
  url: string;
  secret: string;
  payload: Record<string, any>;
  timeoutMs: number;
}

interface DeliveryResult {
  success: boolean;
  statusCode?: number;
  latencyMs: number;
  error?: string;
}

// Security: Prevent SSRF by rejecting loopback, private, link-local, and other non-public ranges
const PRIVATE_CIDRS = [
  '0.0.0.0/8',
  '127.0.0.0/8',
  '10.0.0.0/8',
  '100.64.0.0/10',
  '172.16.0.0/12',
  '192.168.0.0/16',
  '169.254.0.0/16', // link-local, includes the 169.254.169.254 cloud metadata address
  '::1/128',
  'fc00::/7',
  'fe80::/10'
];

async function validateURL(targetUrl: string): Promise<string> {
  const parsed = new URL(targetUrl);
  if (parsed.protocol !== 'https:') {
    throw new Error('SSRF Blocked: Only HTTPS destinations are permitted in production.');
  }

  // Resolve A records before sending the request; resolve4 rejects when the lookup fails.
  // IPv6-only hosts are not handled by this check.
  const addresses = await dns.resolve4(parsed.hostname);
  if (addresses.length === 0) {
    throw new Error(`DNS Resolution failed for ${parsed.hostname}`);
  }

  // Check every returned address: a single response can mix public and internal IPs
  for (const ip of addresses) {
    if (ipRangeCheck(ip, PRIVATE_CIDRS)) {
      throw new Error(`SSRF Blocked: Destination IP ${ip} belongs to a restricted subnet.`);
    }
  }

  return addresses[0];
}

// Generate cryptographic signature to ensure payload authenticity
export function generateHMACSignature(secret: string, timestamp: number, body: string): string {
  const signedPayload = `${timestamp}.${body}`;
  const hmac = crypto.createHmac('sha256', secret);
  hmac.update(signedPayload);
  return `t=${timestamp},v1=${hmac.digest('hex')}`;
}

// POST over TLS to the pre-validated IP. The URL keeps the original hostname so SNI,
// certificate verification, and the Host header all use the real domain; only the
// socket's DNS lookup is overridden, which closes the rebinding window.
function postToPinnedIp(
  url: URL,
  ip: string,
  headers: Record<string, string>,
  body: string,
  signal: AbortSignal
): Promise<number> {
  // Newer Node versions may ask for all addresses (options.all); support both callback shapes
  const pinnedLookup: LookupFunction = (_hostname, options, callback) => {
    if (options.all) callback(null, [{ address: ip, family: 4 }]);
    else callback(null, ip, 4);
  };

  return new Promise((resolve, reject) => {
    const req = https.request(
      {
        hostname: url.hostname,
        port: url.port || 443,
        path: `${url.pathname}${url.search}`,
        method: 'POST',
        headers: { ...headers, 'Content-Length': String(Buffer.byteLength(body)) },
        lookup: pinnedLookup,
        signal
      },
      (res) => {
        res.resume(); // discard the response body; only the status code matters
        resolve(res.statusCode ?? 0);
      }
    );
    req.on('error', reject);
    req.end(body);
  });
}

// Non-blocking Webhook Deliverer Core Method
export async function deliverWebhook(task: WebhookPayload): Promise<DeliveryResult> {
  const startTime = Date.now();
  const stringifiedBody = JSON.stringify(task.payload);
  const timestamp = Math.floor(startTime / 1000);

  // Explicit timeout covering the whole attempt
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), task.timeoutMs);

  try {
    // 1. SSRF check: resolve and validate the destination once
    const resolvedIp = await validateURL(task.url);

    // 2. Compute signature
    const signature = generateHMACSignature(task.secret, timestamp, stringifiedBody);

    const requestHeaders: Record<string, string> = {
      'Content-Type': 'application/json',
      'User-Agent': 'CodeSprintPro-Webhook-Deliverer/2.0',
      'X-Webhook-Event-Id': task.eventId,
      'X-Webhook-Timestamp': timestamp.toString(),
      'X-Webhook-Signature': signature
    };

    // 3. Send the request to the validated IP (DNS rebinding protection)
    const statusCode = await postToPinnedIp(
      new URL(task.url),
      resolvedIp,
      requestHeaders,
      stringifiedBody,
      controller.signal
    );

    return {
      success: statusCode >= 200 && statusCode < 300,
      statusCode,
      latencyMs: Date.now() - startTime
    };
  } catch (err: any) {
    return {
      success: false,
      latencyMs: Date.now() - startTime,
      error: err.name === 'AbortError' ? 'REQUEST_TIMEOUT' : err.message
    };
  } finally {
    // Clear the timer on every path so it cannot fire after the attempt ends
    clearTimeout(timeoutId);
  }
}

Scaling Challenges & Production Bottlenecks

Building a system to trigger hundreds of thousands of concurrent external network calls requires careful engineering around network protocols, system capacity, and security.

1. High-Concurrency Worker Thread Exhaustion & Slow Clients

If you use a simple worker pool where each worker blocks on an HTTP POST request, a sudden surge in slow customer endpoints (taking 10 seconds to respond) will immediately saturate your worker thread pools. This completely halts deliveries to healthy clients.

Mitigation: We must adopt non-blocking asynchronous I/O combined with connection reuse (keep-alive HTTP agents). Node.js, Go (goroutines), or Java (virtual threads from Project Loom) are good fits: a worker waiting on a slow endpoint yields execution instead of blocking an OS-level thread. We also enforce strict timeouts: 5 seconds by default (the timeout_ms default), with the schema's 30-second ceiling as the hard upper bound.
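
To illustrate the idea that a slow endpoint should cost a pending promise rather than a blocked thread, here is a minimal bounded-concurrency runner. The name runWithConcurrency and the limit values are assumptions. A production worker would also apply per-endpoint limits and reuse connections through a keep-alive agent.

```typescript
// Runs async tasks with at most `limit` in flight at once, on a single thread.
// Tasks are expected to resolve with a result object (as a delivery function
// that catches its own errors would) rather than reject.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let nextIndex = 0;

  async function worker(): Promise<void> {
    // Claiming an index is safe without locks: JavaScript runs this
    // synchronously between awaits.
    while (nextIndex < tasks.length) {
      const index = nextIndex++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

While one task awaits a slow response, the event loop keeps servicing the others, so a handful of slow endpoints occupies slots but never an OS thread.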


2. Fair Scheduling & Noisy Neighbor Prevention

A single large tenant generating 5,000 events/second can monopolize consumer resources and starve smaller tenants that trigger only 1 event/minute.

Mitigation (Token Bucket Rate-Limiting + Sharded Virtual Queues):

  • Distributed Rate Limiting: We implement a Redis-backed Token Bucket rate limiter at the partition worker stage. If an endpoint exceeds its dynamic throughput allocation, its deliveries are pushed to a dedicated "Cool Down" low-priority queue.
  • Virtual Sharding: Instead of a single delivery topic, we distribute deliveries across multiple Kafka topics grouped by tenant tier (e.g., delivery.high-priority, delivery.default, delivery.backfill). Within the worker group, fair scheduling reads a round-robin rotation of tasks.
graph TD
    subgraph "Fair Scheduler Topologies"
        KafkaInbound[Inbound Delivery Streams] --> IngestFilter{Evaluate Ingestion Rate}
        IngestFilter -->|Under Limit| FastQueue[Tier A: Fast Virtual Queue]
        IngestFilter -->|Over Limit / Spiky| SlowQueue[Tier B: Backlogged / Spikes]
        
        FastQueue --> WorkerGroup[Fair Round-Robin Consumer]
        SlowQueue --> WorkerGroup
        
        WorkerGroup --> OutboundWorker{Execute Delivery}
    end
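
The token bucket decision can be sketched in memory as follows. The design above keeps bucket state in Redis so that all workers share it; this single-process version only illustrates the refill and routing logic. The class name, the queue names, and the injected clock are assumptions.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Refill lazily based on elapsed time, then try to spend `cost` tokens.
  tryConsume(nowMs: number, cost: number = 1): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// Hypothetical queue names: over-limit deliveries go to a cool-down queue.
type TargetQueue = 'delivery.default' | 'delivery.cool-down';

function routeDelivery(bucket: TokenBucket, nowMs: number): TargetQueue {
  return bucket.tryConsume(nowMs) ? 'delivery.default' : 'delivery.cool-down';
}
```

Passing the clock in as a parameter keeps the logic deterministic and easy to test. A Redis-backed version would need the refill and consume steps to execute atomically.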

3. DNS Rebinding Attacks: The Ultimate Webhook Vulnerability

A sophisticated attacker registers a malicious domain with a near-zero TTL that resolves to a public IP on the first lookup (when our validator checks the address) and then to 127.0.0.1 or 169.254.169.254 when the HTTP client resolves the hostname again to open the connection.

Mitigation: Resolve once, validate, and connect to exactly the address you validated:

  1. Conduct the DNS lookup first explicitly in your application code.
  2. Validate every resolved IP address against your private IP address blocklist.
  3. Open the connection to the validated IP address instead of letting the HTTP client resolve the hostname a second time.
  4. Keep the original domain name as the TLS server name (SNI) and Host header (e.g., Host: api.attacker.com). Setting only the Host header on a request to https://192.0.2.1/path is not enough: the certificate is then verified against the IP address and validation fails, so pin the address at the DNS-lookup layer of the HTTP client rather than rewriting the URL.
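
Step 2 can also be done without a third-party dependency by using Node's built-in net.BlockList. The sketch below is one possible variant. The list of ranges is an assumption about what counts as restricted and should be reviewed against your own network layout.

```typescript
import { BlockList, isIP } from 'net';

const restricted = new BlockList();
restricted.addSubnet('0.0.0.0', 8, 'ipv4');
restricted.addSubnet('10.0.0.0', 8, 'ipv4');
restricted.addSubnet('100.64.0.0', 10, 'ipv4');
restricted.addSubnet('127.0.0.0', 8, 'ipv4');
restricted.addSubnet('169.254.0.0', 16, 'ipv4'); // includes 169.254.169.254
restricted.addSubnet('172.16.0.0', 12, 'ipv4');
restricted.addSubnet('192.168.0.0', 16, 'ipv4');
restricted.addSubnet('::1', 128, 'ipv6');
restricted.addSubnet('fc00::', 7, 'ipv6');
restricted.addSubnet('fe80::', 10, 'ipv6');

function isBlockedAddress(address: string): boolean {
  const family = isIP(address); // 4, 6, or 0 when not an IP literal
  if (family === 0) return true; // fail closed on anything unparseable
  return restricted.check(address, family === 6 ? 'ipv6' : 'ipv4');
}

// Throws unless every resolved address is outside the restricted ranges.
function assertAllPublic(addresses: string[]): void {
  if (addresses.length === 0) throw new Error('No addresses resolved');
  for (const address of addresses) {
    if (isBlockedAddress(address)) {
      throw new Error(`SSRF Blocked: ${address} is in a restricted range`);
    }
  }
}
```

Checking the full list matters because a single DNS answer can contain both a public and an internal address.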

Technical Trade-offs & Strategic Compromises

Architectural choices in high-throughput messaging always represent an explicit trade-off between system complexity, operating costs, and consistency guarantees.

| Architectural Pattern | Delivery Ordering | Network Latency | Fault Durability | Operational Complexity | Cost Efficiency |
| --- | --- | --- | --- | --- | --- |
| Log-Based Streaming (Kafka Buffer) | Strict (partition keys) | Medium (~20ms) | Very High (Replicated Disk) | High (Zookeeper/KRaft, Cluster scale) | Medium (Disk capacity required) |
| Memory-Backed (Redis Streams/BullMQ) | Partial | Ultra-Low (<1ms) | Medium (Requires AOF persistence) | Low (Single-instance/Cluster) | High (Requires RAM scaling) |
| Push Queue (AWS SQS / RabbitMQ) | Relaxed (Unless FIFO) | Low (~10ms) | High (Distributed replication) | Low (SaaS Managed) | Low at scale (High API charges) |

Consistency Trade-Off: At-Least-Once vs. Exactly-Once Delivery

Achieving exactly-once HTTP webhook delivery is theoretically impossible without distributed transaction coordination (two-phase commit) extending to the customer's database, which is out of our control.

If the customer's server processes our webhook successfully but their network drops before sending the 200 OK confirmation, our system must retry, generating a duplicate. Thus, our strategic compromise is:

  • Deliver at-least-once: Retain reliable logs and publish stable message headers (X-Webhook-Event-Id).
  • Enforce Client-Side Idempotency: Push the burden of deduplication to the client, providing documentation and SDK wrappers that verify the event ID has not been processed before.
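
A client-side deduplication wrapper might look like the sketch below. The in-memory Set stands in for what would normally be a unique-key table written in the same transaction as the business change. The class and method names are hypothetical.

```typescript
type HandleOutcome = 'processed' | 'duplicate';

class IdempotentReceiver {
  private readonly processedEventIds = new Set<string>();

  // Runs `apply` at most once per event ID (the X-Webhook-Event-Id value).
  handle(eventId: string, apply: () => void): HandleOutcome {
    if (this.processedEventIds.has(eventId)) return 'duplicate';
    apply();
    // Record the ID only after the side effect succeeds, so a failed
    // attempt can still be retried by the platform.
    this.processedEventIds.add(eventId);
    return 'processed';
  }
}
```

Either outcome should map to a 2xx response, since a duplicate means the event was already handled successfully.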

Failure Scenarios and Fault Tolerance

In a distributed environment, external resources will fail continuously. The webhook engine must be designed to withstand failures at every level.

1. Exponential Backoff with Jitter (Avoiding Thundering Herds)

If a major customer platform goes down for 30 minutes, thousands of queued deliveries will fail simultaneously. If we retry them all at the same fixed interval (e.g., every 5 minutes), the retries arrive in synchronized waves, and each wave can knock the recovering system over again.

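Implementation Formula: We apply exponential backoff with bounded additive jitter to spread retry traffic: $$T_{\text{retry}} = T_{\text{base}} \times 2^{\text{attempt}} + \text{random\_jitter}$$

Where $\text{random\_jitter}$ is a uniform random value between $0$ and $0.2 \times (T_{\text{base}} \times 2^{\text{attempt}})$. This adds up to 20% on top of the exponential delay; it is not "full jitter", which would instead draw the entire delay uniformly from $0$ to $T_{\text{base}} \times 2^{\text{attempt}}$.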
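
The formula translates directly into code. In this sketch the random source is injectable so the result is deterministic under test. The maxMs cap is an added assumption (the formula itself has no upper bound), and the function name is illustrative.

```typescript
function computeRetryDelayMs(
  attempt: number,
  baseMs: number,
  random: () => number = Math.random,
  maxMs: number = Number.POSITIVE_INFINITY
): number {
  // T_base * 2^attempt, optionally capped (the cap is an assumption).
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  // Additive jitter drawn uniformly from [0, 0.2 * exponential).
  const jitter = random() * 0.2 * exponential;
  return Math.round(exponential + jitter);
}
```

With a 1-second base, attempt 3 lands between 8,000 ms and 9,600 ms, so deliveries that failed together spread out instead of retrying in lockstep.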


2. Auto-Pausing / Circuit Breaking Broken Endpoints

If a customer's endpoint consistently returns 5xx errors or times out across $1,000$ consecutive deliveries in a 24-hour period, further attempts waste worker capacity and outbound bandwidth.

Circuit Breaker Rules:

  • Trigger: If the failure rate exceeds 95% over the last 1,000 deliveries, the endpoint is transitioned to DISABLED_BY_SYSTEM (the system-initiated status in the webhook_endpoints schema).
  • Action: Stop all real-time delivery attempts to this endpoint. New events generated during this period are immediately cataloged in a local table as PENDING_RESUME or moved straight to the Dead-Letter Queue (DLQ).
  • Notification: Automatically send an email/Slack warning notification to the customer requesting manual inspection. Once they reactivate the endpoint via the dashboard, a retry task executes a historical bulk drain.
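
The trigger rule can be sketched as a sliding window over the most recent outcomes. The defaults mirror the rule above (window of 1,000, threshold of 95%). The class name and the choice to wait for a full window before tripping are assumptions.

```typescript
class EndpointCircuitBreaker {
  private readonly outcomes: boolean[] = []; // true = failed attempt
  private failures = 0;
  private readonly windowSize: number;
  private readonly failureThreshold: number;

  constructor(windowSize: number = 1000, failureThreshold: number = 0.95) {
    this.windowSize = windowSize;
    this.failureThreshold = failureThreshold;
  }

  // Records one delivery outcome and returns whether the endpoint should be paused.
  record(failed: boolean): boolean {
    this.outcomes.push(failed);
    if (failed) this.failures++;
    if (this.outcomes.length > this.windowSize) {
      // Drop the oldest outcome so only the last `windowSize` count.
      const evicted = this.outcomes.shift();
      if (evicted) this.failures--;
    }
    return this.shouldPause();
  }

  shouldPause(): boolean {
    // Require a full window so a new endpoint is not paused after one failure.
    return (
      this.outcomes.length === this.windowSize &&
      this.failures / this.windowSize > this.failureThreshold
    );
  }
}
```

Array.shift is linear in the window size, which is acceptable for 1,000 entries; a ring buffer would avoid that cost if the window grew much larger.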

Staff Engineer Perspective


Verbal Script & Mock Interview

Mock Interview Dialogue

Interviewer: "I see you've designed a highly decoupled, asynchronous webhook delivery system. How would you design a robust replay mechanism that allows customers to recover from a major outage on their server without overwhelming our event database?"

Candidate: "To implement a secure and scalable replay mechanism, we must first establish that a 'replay' request should never mutate the historical status log of our original delivery attempts. Doing so ruins auditing and compliance trails. Instead, a replay must be treated as a brand new delivery task that points to the original, immutable event_id."

Interviewer: "Good. And where do you fetch the event data for this replay?"

Candidate: "We partition our database storage. Our events are retained in our main PostgreSQL DB in time-range partitions that cover the 30-day replay window. The customer requests a replay through the API or dashboard by providing a time range or a list of specific event_ids. The Replay Service queries the partitioned webhook_events table, verifies the tenant has permissions to access those event resources, and fetches the payloads.

To prevent this manual process from knocking over our workers, the Replay Service does not write tasks to the high-priority real-time worker queues. Instead, it pushes the replay requests into a dedicated low-priority delivery.backfill Kafka topic. This topic has a restricted consumer pool concurrency level, ensuring that even if a customer replays 5 million events, real-time webhooks for other tenants continue to deliver with sub-second latencies."

Interviewer: "Excellent. How does the customer's server know that this is a replayed event?"

Candidate: "We include a specific header: X-Webhook-Replay: true. We also keep the original X-Webhook-Event-Id header and the id field in the envelope payload unchanged. This is critical because the customer's server should have idempotency logic. If their database successfully processed that event ID before their server crashed, they can immediately de-duplicate and return a 200 OK without running the business logic a second time."

Interviewer: "How would you handle a customer endpoint that is returning a 429 Too Many Requests?"

Candidate: "A 429 status code indicates a temporary rate-limit saturation on their end. Unlike a 400 Bad Request or 404 Not Found, which represent permanent configuration errors and should not be retried, a 429 is treated as a retryable failure.

However, we must respect their rate limit. When we receive a 429, our worker parses the Retry-After HTTP header, which carries either a number of seconds (e.g., Retry-After: 30) or an HTTP date. If it is present, we schedule the next delivery attempt no earlier than that time, plus a small randomized jitter to prevent synchronization. If no Retry-After header is present, we fall back to our standard exponential backoff with jitter, and we temporarily reduce the token refill rate in our Redis token-bucket rate limiter for that specific endpoint to throttle throughput automatically."

Interviewer: "Perfect. Let's wrap up here. This shows an excellent grasp of high-scale system design!"
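
The retry classification discussed in the dialogue can be sketched as follows. Treating 408 as retryable is an added assumption, the function names are illustrative, and jitter is left out for brevity.

```typescript
type RetryDecision = { retry: false } | { retry: true; delayMs: number };

// Retry-After is either delta-seconds ("30") or an HTTP date.
function parseRetryAfterMs(header: string | undefined, nowMs: number): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? undefined : Math.max(0, dateMs - nowMs);
}

function decideRetry(
  status: number,
  retryAfter: string | undefined,
  fallbackDelayMs: number,
  nowMs: number
): RetryDecision {
  if (status >= 200 && status < 300) return { retry: false };
  if (status === 429) {
    // Honor the customer's hint when present; otherwise use normal backoff.
    return { retry: true, delayMs: parseRetryAfterMs(retryAfter, nowMs) ?? fallbackDelayMs };
  }
  // 5xx is transient; 408 is assumed transient as well.
  if (status === 408 || status >= 500) return { retry: true, delayMs: fallbackDelayMs };
  // Remaining statuses (e.g. 400, 404) are treated as permanent.
  return { retry: false };
}
```

The fallbackDelayMs argument is where the platform's normal backoff delay would be passed in.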

