System Design: Building an Email Delivery Platform

Design a production email delivery platform with queues, templates, provider failover, idempotency, suppression lists, bounce handling, unsubscribe flows, rate limits, and observability.

Key Takeaways

  • Separate ingestion from delivery using decoupled asynchronous priority queues to isolate bulk campaigns from critical transactional emails.
  • Enforce deliverability compliance using hashed suppression lists that automatically catch hard bounces and spam complaints.
  • Track provider health using rolling error rates to initiate automated failover when primary delivery APIs time out or return errors.

Sending a single email is straightforward. However, sending millions of emails daily with low latency, high delivery rates, and full compliance is a system design challenge.

An enterprise-grade email delivery platform must handle product notifications (such as password resets and login alerts), billing notices, and bulk marketing campaigns without letting one class of traffic starve the others. Because email delivery depends on external providers (such as SendGrid, Mailgun, or AWS SES) and on the sender-reputation rules of internet service providers (ISPs), the platform must implement rate limits, suppression filters, and fallback routing. If these components are poorly designed, critical emails disappear, sending IP addresses are blocklisted by ISPs, and support teams cannot trace what went wrong.

This system design case study details the architectural blueprint for designing a multi-tenant email delivery platform capable of processing 50 million dispatches per month with 500 emails per second peak ingestion.


System Requirements

To build a reliable email platform, we separate requirements into functional capabilities, non-functional operational constraints, and explicit scale parameters.

Functional Requirements

  • Multi-Tenant Ingestion: Accept email send requests from various internal services, validating payloads and parameters.
  • Priority-Based Dispatch: Route incoming requests into separate execution lanes based on priority (e.g., security OTPs vs. marketing digests).
  • Template Management: Store, version, compile, and render HTML/plaintext templates with dynamic user parameters.
  • Suppression List Filtering: Intercept requests targeting addresses that have previously hard-bounced or reported a message as spam.
  • Provider Failover & Routing: Intelligently distribute dispatches across multiple external delivery providers, shifting traffic if a provider fails.
  • Webhook Feedback Processing: Consume real-time callbacks from providers to track email lifecycle states (delivered, bounced, opened).
  • Unsubscribe Management: Provide a secure, one-click unsubscribe mechanism for bulk dispatches.

Non-Functional Requirements

  • High Ingestion Availability: Ensure the send API accepts requests with low latency, buffering them for processing.
  • Strict Queue Isolation: Guarantee that bulk marketing queues never delay critical security OTP deliveries.
  • Rate-Limiting Compliance: Throttle sends to respect provider quotas and ISP-specific domain ingestion limits.
  • Observability: Expose real-time delivery tracing metadata for search and customer support audits.
  • Privacy & Security: Protect sensitive user data by encrypting recipient email addresses in database storage and logs.

Scale Assumptions

  • Monthly Output Volume: 50,000,000 emails per month.
  • Peak Ingestion Rate ($R_{\text{in}}$): 500 requests per second.
  • Webhook Events Rate ($R_{\text{webhook}}$): 1,500 callback requests per second at peak (averaging 3 lifecycle events per email).
  • Metadata Log Retention: 90 days of searchable transaction audit records.

API Design and Interface Contracts

The email platform uses RESTful ingestion endpoints, signed token unsubscribe mechanisms, and internal gRPC definitions to coordinate microservice communication.
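
The signed unsubscribe token mentioned above is not specified further in this lesson. As a minimal sketch, assuming an HMAC-SHA256 signature over the tenant, the recipient's email hash, and the category, it could look like the following. The secret, the field layout, and both function names are hypothetical, and the fields are assumed not to contain the `|` separator.

```python
import base64
import hashlib
import hmac

# Hypothetical secret for illustration; a real deployment would load it from a secrets manager.
UNSUBSCRIBE_SECRET = b"example-secret-do-not-use"


def make_unsubscribe_token(tenant_id: str, email_hash: str, category: str) -> str:
    """Return 'payload.signature', both URL-safe base64 encoded."""
    payload = f"{tenant_id}|{email_hash}|{category}".encode("utf-8")
    signature = hmac.new(UNSUBSCRIBE_SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(signature).decode("ascii")
    )


def verify_unsubscribe_token(token: str):
    """Return (tenant_id, email_hash, category) if the signature is valid, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        # Wrong number of parts or malformed base64.
        return None
    expected = hmac.new(UNSUBSCRIBE_SECRET, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        return None
    tenant_id, email_hash, category = payload.decode("utf-8").split("|")
    return tenant_id, email_hash, category
```

Because the token carries everything needed to record the suppression, the unsubscribe endpoint can act on a single click without requiring the recipient to log in.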

1. Ingest Email Send Request (HTTP POST /v1/emails/send)

Invoked by microservices to enqueue an email dispatch.

Request Headers:

Idempotency-Key: idemp_email_99812_ac
Content-Type: application/json

Request Payload:

{
  "tenantId": "tenant_auth_prod_42",
  "recipient": {
    "email": "user@example.com",
    "userId": "usr_uuid_10292ab"
  },
  "templateKey": "mfa_login_otp",
  "variables": {
    "otpCode": "882019",
    "expirationMinutes": "5"
  },
  "priority": "CRITICAL",
  "category": "security_alerts"
}

Response Payload (202 Accepted):

{
  "messageId": "msg_uuid_99218ab44",
  "status": "QUEUED",
  "queuedAt": "2026-06-07T12:15:00Z"
}

2. Inbound Provider Webhook Callback (HTTP POST /v1/webhooks/sendgrid)

Consumes asynchronous delivery events pushed by SendGrid.

[
  {
    "email": "user@example.com",
    "timestamp": 1780834503,
    "event": "delivered",
    "smtp-id": "<442812.99812.prod@sendgrid.com>",
    "sg_message_id": "sg_msg_55162ab",
    "messageId": "msg_uuid_99218ab44"
  }
]

3. Ingestion and Status Service Contract (gRPC)

Stateless backend microservices interact with the core email engine using gRPC.

syntax = "proto3";

package codesprintpro.email.dispatch.v1;

service EmailDispatcher {
  rpc QueueEmail (QueueEmailRequest) returns (QueueEmailResponse);
  rpc QueryMessageStatus (MessageStatusRequest) returns (MessageStatusResponse);
}

message QueueEmailRequest {
  string tenant_id = 1;
  string recipient_email = 2;
  string template_key = 3;
  map<string, string> variables = 4;
  string priority = 5; // CRITICAL, TRANSACTIONAL, STANDARD, BULK
  string category = 6;
  string idempotency_key = 7;
}

message QueueEmailResponse {
  string message_id = 1;
  string queue_status = 2; // ENQUEUED, SUPPRESSED, REJECTED
  int64 timestamp_ms = 3;
}

message MessageStatusRequest {
  string message_id = 1;
}

message MessageStatusResponse {
  string message_id = 1;
  string status = 2; // QUEUED, RENDERED, SENT, DELIVERED, BOUNCED, COMPLAINED
  string assigned_provider = 3;
  string error_code = 4;
  int64 last_updated_ms = 5;
}

High-Level Architecture

The platform architecture decouples ingestion from delivery using priority-based queues and background worker pools.

Email Ingestion and Priority Dispatch Pipeline

The API Gateway receives requests, checks the suppression cache, and routes validated emails into designated priority queues.

graph TD
    Client[Product Service] -->|POST /v1/emails/send| Gate[API Gateway]
    Gate -->|1. Validate Schema| Validator[Payload Validator]
    Validator -->|2. Check Suppression List| BloomFilter[Redis Bloom Filter Cache]
    
    BloomFilter -->|If Blocked| Err[Return HTTP 422: Suppressed]
    BloomFilter -->|If Allowed| Router[Queue Traffic Router]
    
    Router -->|Critical Lane| QCri[Queue: email.critical]
    Router -->|Transactional Lane| QTx[Queue: email.transactional]
    Router -->|Standard Lane| QStd[Queue: email.standard]
    Router -->|Bulk Lane| QBulk[Queue: email.bulk]
    
    subgraph Worker Pool Instances
        QCri --> WCri[Critical Workers: Dedicated CPU]
        QTx --> WTx[Transactional Workers]
        QStd --> WStd[Standard Workers]
        QBulk --> WBulk[Bulk Workers: Scaled down]
    end
    
    WCri --> Render[Template Compiler & Renderer]
    WTx --> Render
    WStd --> Render
    WBulk --> Render
    
    Render --> Limiter[Redis Rate Limiter]
    Limiter --> Sender[Provider Selector & API Client]
    
    Sender -->|HTTP POST| SES[AWS SES]
    Sender -->|HTTP POST| SGrid[SendGrid]

Webhook Feedback and Suppression Loop

Webhook handlers receive callback events from delivery providers, verify their signatures, update message states, and populate the suppression lists if hard bounces or spam complaints occur.

graph TD
    SGridWeb[SendGrid Event Webhook] -->|POST Callback| WebHook[Webhook Ingress Receiver]
    SESWeb[AWS SES SNS Webhook] -->|POST Callback| WebHook
    
    WebHook -->|1. Verify Provider Signatures| SigVal[Signature Validator]
    SigVal -->|2. Queue Event Log| EventBus[Kafka Topic: email.webhooks]
    
    EventBus --> EventHandler[Event Processing Engine]
    EventHandler -->|3. Update status to DELIVERED / BOUNCED| DB[(Postgres Audit Logs)]
    
    EventHandler -->|4. If Hard Bounce or Spam Complaint| Suppress[Suppression Manager]
    Suppress -->|5. Write Permanent Block| SQLSuppress[(Postgres Suppression Table)]
    Suppress -->|6. Refresh cache| BloomFilterCache[Redis Bloom Filter Cache]

Low-Level Design and Schema

Raw message metadata, templates, suppressions, and webhook logs are modeled inside a PostgreSQL relational database.

-- Tracks every email transaction record through the pipeline
CREATE TABLE email_messages (
    message_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    message_key VARCHAR(128) NOT NULL,
    recipient_email_hash VARCHAR(64) NOT NULL, -- SHA-256 hash for fast index lookups
    recipient_email_encrypted BYTEA NOT NULL, -- AES-256 encrypted address for privacy
    template_key VARCHAR(128) NOT NULL,
    template_version INT NOT NULL,
    priority VARCHAR(32) NOT NULL DEFAULT 'STANDARD', -- CRITICAL, TRANSACTIONAL, STANDARD, BULK
    status VARCHAR(32) NOT NULL DEFAULT 'QUEUED', -- QUEUED, RENDERED, SENT, DELIVERED, BOUNCED, COMPLAINED, SUPPRESSED
    idempotency_key VARCHAR(256) NOT NULL,
    provider_name VARCHAR(64), -- SENDGRID, MAILGUN, SES
    provider_message_id VARCHAR(256),
    error_code VARCHAR(64),
    error_message TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    queued_at TIMESTAMPTZ,
    sent_at TIMESTAMPTZ,
    delivered_at TIMESTAMPTZ,
    failed_at TIMESTAMPTZ,
    CONSTRAINT uk_tenant_idempotency UNIQUE (tenant_id, idempotency_key)
);

CREATE INDEX idx_messages_lookup ON email_messages (recipient_email_hash, created_at DESC);
CREATE INDEX idx_messages_provider ON email_messages (provider_name, provider_message_id) WHERE provider_message_id IS NOT NULL;

-- Dynamic template storage version control
CREATE TABLE email_templates (
    tenant_id VARCHAR(64) NOT NULL,
    template_key VARCHAR(128) NOT NULL,
    version INT NOT NULL,
    subject_template TEXT NOT NULL,
    html_template TEXT NOT NULL,
    text_template TEXT NOT NULL,
    template_status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, DEPRECATED
    created_by VARCHAR(128) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, template_key, version)
);

-- Recipient blocklist rules to preserve IP sender reputation
CREATE TABLE email_suppressions (
    tenant_id VARCHAR(64) NOT NULL,
    email_hash VARCHAR(64) NOT NULL, -- SHA-256 hash of email
    suppression_reason VARCHAR(64) NOT NULL, -- HARD_BOUNCE, COMPLAINT, UNSUBSCRIBE
    source_event_id VARCHAR(256),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMPTZ, -- NULL = permanent block; set only for temporary entries (e.g., repeated soft bounces)
    PRIMARY KEY (tenant_id, email_hash, suppression_reason)
);

CREATE INDEX idx_suppression_lookup ON email_suppressions (tenant_id, email_hash);

-- Webhook deduplication log to prevent status double processing
CREATE TABLE webhook_deduplication_logs (
    event_fingerprint VARCHAR(128) PRIMARY KEY, -- Hash of provider event ID + target status
    provider_name VARCHAR(64) NOT NULL,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Schema Rationale & Index Optimization

  1. idx_messages_lookup: Uses the hashed email value, so lookups never need to decrypt the email address. This composite index lets support teams locate all historic transactions for a specific recipient, newest first, with a single index range scan, typically within a few milliseconds.
  2. recipient_email_encrypted (AES-256): Relational databases storing raw emails represent a significant security risk. We encrypt the email addresses at the application layer using AES-256 before writing to disk, keeping only the SHA-256 hash unencrypted for index searches.
  3. uk_tenant_idempotency: Enforces strict uniqueness of idempotency keys within each tenant boundary, preventing duplicate processing if a client retries requests.
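
To make the idempotency guarantee concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The trimmed table and the queue_email helper are illustrative assumptions; in PostgreSQL the equivalent statement would be INSERT ... ON CONFLICT DO NOTHING against uk_tenant_idempotency.

```python
import sqlite3
import uuid

# In-memory SQLite stands in for PostgreSQL; only the columns needed here are kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE email_messages (
        message_id TEXT PRIMARY KEY,
        tenant_id TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'QUEUED',
        UNIQUE (tenant_id, idempotency_key)
    )
    """
)


def queue_email(tenant_id: str, idempotency_key: str) -> tuple[str, bool]:
    """Return (message_id, created). A client retry gets the original message_id back."""
    new_id = f"msg_{uuid.uuid4().hex}"
    cursor = conn.execute(
        "INSERT OR IGNORE INTO email_messages (message_id, tenant_id, idempotency_key) "
        "VALUES (?, ?, ?)",
        (new_id, tenant_id, idempotency_key),
    )
    conn.commit()
    if cursor.rowcount == 1:
        return new_id, True  # first time this key was seen for the tenant
    # The unique constraint rejected the insert, so return the existing row instead.
    row = conn.execute(
        "SELECT message_id FROM email_messages WHERE tenant_id = ? AND idempotency_key = ?",
        (tenant_id, idempotency_key),
    ).fetchone()
    return row[0], False
```

The same key used by a different tenant creates a new message, which is why the constraint covers both columns rather than the key alone.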

Scaling Challenges and Capacity Estimation

Ingesting 500 emails per second and processing 1,500 webhook events per second requires evaluating storage requirements, network configurations, and database growth.

1. High-Volume Ingest and Metadata Storage Footprint

  • Assumptions:

    • Monthly Volume = $50,000,000$ emails
    • Retention Window = $90$ days ($150,000,000$ messages)
    • Average row storage size per email (including metadata, encrypted variables, logs) = $2$ KB
    • Index overhead = 1.4x multiplier
  • Calculations:

$$\text{Storage Per Month} = 50,000,000\text{ records} \times 2\text{ KB} = 100,000,000\text{ KB} \approx 100\text{ GB/month}$$

$$\text{Raw 90-Day Storage} = 100\text{ GB/month} \times 3\text{ months} = 300\text{ GB}$$

$$\text{Total Storage with Indexing} = 300\text{ GB} \times 1.4 = 420\text{ GB}$$

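Keeping $420$ GB of data in a single relational database table degrades index performance over time. To scale this, we partition the email_messages table monthly on the created_at timestamp. At the end of the 90-day retention window, older partitions are detached and archived to Amazon S3 Glacier, which keeps reads and writes fast on the active partitions. One caveat: PostgreSQL requires primary keys and unique constraints on a partitioned table to include the partition key, so message_id and uk_tenant_idempotency must either include created_at or be enforced outside the partitioned table (for example, in a separate idempotency-key table).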
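
The storage estimate above is easy to re-run under different assumptions. The helper below is a small sketch (the function name and return shape are illustrative) that uses the same decimal units as the calculation, where 1 GB = 1,000,000 KB.

```python
def estimate_storage_gb(
    monthly_emails: int,
    row_kb: float,
    retention_months: int,
    index_multiplier: float,
) -> dict:
    """Estimate table size over the retention window, in decimal gigabytes."""
    gb_per_month = monthly_emails * row_kb / 1_000_000  # KB -> GB
    raw_gb = gb_per_month * retention_months
    total_gb = raw_gb * index_multiplier
    return {"gb_per_month": gb_per_month, "raw_gb": raw_gb, "total_gb": total_gb}


# The lesson's assumptions: 50M emails/month, 2 KB rows, 3 months retained, 1.4x index overhead.
estimate = estimate_storage_gb(50_000_000, 2, 3, 1.4)
print(estimate)
```

Doubling the average row size or extending retention to 180 days each doubles the footprint, which is usually the first sensitivity worth checking.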

2. Webhook Callback Processing Lag

  • Assumptions:

    • Peak Webhook Ingress ($W$) = $1,500$ callbacks/second
    • Average write time to update database status = $5$ milliseconds per write
  • Calculations: If the webhook receiver attempts to update the database synchronously:

$$\text{Database Connections Required} = 1,500 \text{ writes/s} \times 0.005\text{ seconds/write} = 7.5 \text{ concurrent connections}$$

While 7.5 connections is low, network bottlenecks or lock contention on the email_messages table can cause database response times to spike (e.g., to 500 ms):

$$\text{Degraded Database Connections} = 1,500 \times 0.5 = 750 \text{ concurrent connections}$$

This connection spike will exhaust the database connection pool, crashing both the webhook listener and the ingestion API.

To prevent this, the webhook receivers are designed as stateless services. When a callback lands, the receiver validates the signature, writes the raw event to a Kafka topic (email.webhooks), and returns an immediate HTTP 200 OK to the provider. Stateless consumer groups read from the topic and update the database in batches, protecting the database from request spikes.
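
The batching step can be sketched as follows. This is an illustrative outline only: the events iterable stands in for records polled from the Kafka topic, and write_batch is a hypothetical callback that would apply one multi-row database update per call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(events: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def apply_batches(
    events: Iterable[dict],
    batch_size: int,
    write_batch: Callable[[list[dict]], None],
) -> int:
    """Call write_batch once per batch and return the number of database round-trips."""
    round_trips = 0
    for batch in batched(events, batch_size):
        write_batch(batch)  # e.g. one multi-row UPDATE inside a single transaction
        round_trips += 1
    return round_trips
```

With a batch size of 100, one second of peak traffic (1,500 events) becomes 15 database round-trips instead of 1,500, and the consumer count caps how many connections can be open at once.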

3. Redis Rate-Limiting Overhead

  • Assumptions:

    • Send Throttle Rate = $500$ emails/second
    • Rate limits are tracked across 3 dimensions: tenant limits, provider limits, and recipient domain throttling (e.g., maximum 50 sends/second to yahoo.com).
    • Average operations per send = 3 checks
  • Calculations: $$\text{Redis Queries Per Second} = 500\text{ sends/s} \times 3\text{ checks} = 1,500\text{ QPS}$$

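A Redis cluster can process 1,500 QPS easily. We write a custom Lua script that evaluates the token buckets for the tenant, provider, and domain atomically in a single round-trip instead of three separate calls. In Redis Cluster, a script can only touch keys that live in the same hash slot, so the three bucket keys must share a hash tag, or the limiter must run on a non-clustered instance.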
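
The all-or-nothing check that the Lua script performs can be illustrated in plain Python. This is an in-memory sketch rather than the Redis script itself; the class name, function name, and bucket sizes are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_second: float
    tokens: float
    last_refill: float

    def refill(self, now: float) -> None:
        """Add tokens for the time elapsed since the last refill, capped at capacity."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now


def try_acquire_all(buckets: list[TokenBucket], now: float) -> bool:
    """Take one token from every bucket, or from none if any bucket is empty."""
    for bucket in buckets:
        bucket.refill(now)
    if any(bucket.tokens < 1.0 for bucket in buckets):
        return False  # rejected: no bucket is charged
    for bucket in buckets:
        bucket.tokens -= 1.0
    return True


# Illustrative limits: tenant and provider allow 100/s, the recipient domain only 2/s.
tenant = TokenBucket(100, 100, 100, 0.0)
provider = TokenBucket(100, 100, 100, 0.0)
domain = TokenBucket(2, 2, 2, 0.0)
```

Checking all three buckets before consuming from any of them matters: if the domain bucket is empty, the tenant and provider budgets should not be charged for a send that never happens.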


Failure Scenarios and Resilience

Email delivery architectures must handle external gateway outages and protect deliverability reputation metrics.

1. Delivery Provider API Outage

A primary provider (e.g., SendGrid) experiences an outage, timing out or returning HTTP 500 Internal Server Errors.

  • The Threat: Outgoing emails fail, blocking critical dispatches like OTPs.
  • Resilience Design:
    • We use Circuit Breakers combined with Priority Fallback Routing.
    • We configure a fallback chain for critical dispatches: SENDGRID -> AWS_SES.
    • If a worker encounters 5 consecutive timeout errors or 5xx responses from SendGrid, the circuit breaker trips.
    • For the next 5 minutes, workers bypass SendGrid and route all critical and transactional emails to AWS SES. During this window, the provider selector still sends a small share (1%) of low-priority bulk traffic to SendGrid as a health probe; once those probes succeed, the circuit breaker closes and normal routing resumes.
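
The breaker behavior described above can be sketched as a small state holder plus a routing function. The class, the choose_provider function, and the probe_roll parameter (a random number in [0, 1) supplied by the caller so the logic stays deterministic) are illustrative assumptions, not a prescribed implementation.

```python
from typing import Optional


class ProviderCircuitBreaker:
    """Opens after N consecutive failures and stays open for a fixed window."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 300.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self, now: float) -> bool:
        return self.opened_at is not None and now - self.opened_at < self.open_seconds

    def record_failure(self, now: float) -> None:
        # Called for each timeout or 5xx response from the primary provider.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = now

    def record_success(self) -> None:
        # Any success (including a probe) resets the counter and closes the breaker.
        self.consecutive_failures = 0
        self.opened_at = None


def choose_provider(
    breaker: ProviderCircuitBreaker,
    now: float,
    priority: str,
    probe_roll: float,
    probe_fraction: float = 0.01,
) -> str:
    """Return the provider for one send; probe_roll is a caller-supplied value in [0, 1)."""
    if not breaker.is_open(now):
        return "SENDGRID"
    # While open, only a small share of bulk traffic probes the primary provider.
    if priority == "BULK" and probe_roll < probe_fraction:
        return "SENDGRID"
    return "AWS_SES"
```

In this sketch, a failure after the window expires reopens the breaker immediately because the failure counter is only reset by a success.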

2. Queue Starvation (Bulk Marketing Digests Blocking OTPs)

An internal marketing team initiates a newsletter campaign targeting 10 million users, filling the message queue.

  • The Threat: Critical password resets and login verification OTPs are queued behind the 10 million marketing emails, causing them to arrive hours late.
  • Resilience Design:
    • We enforce Strict Queue Isolation using dedicated queue topics and worker groups.
    • Ingestion endpoints assign a priority to each email (CRITICAL, TRANSACTIONAL, STANDARD, BULK) based on the template type.
    • Critical emails are routed to the email.critical queue, while marketing digests are routed to email.bulk.
    • We assign dedicated worker nodes to process the email.critical queue. These workers are configured to never read from the email.bulk queue, ensuring that marketing backlogs have no impact on critical delivery times.
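
A minimal sketch of the routing and subscription rules follows. The queue names and priorities come from this lesson, as does the mfa_login_otp template; the other template keys and the worker group names are hypothetical.

```python
QUEUE_BY_PRIORITY = {
    "CRITICAL": "email.critical",
    "TRANSACTIONAL": "email.transactional",
    "STANDARD": "email.standard",
    "BULK": "email.bulk",
}

# Hypothetical template-to-priority map; only mfa_login_otp appears in the lesson.
PRIORITY_BY_TEMPLATE = {
    "mfa_login_otp": "CRITICAL",
    "invoice_receipt": "TRANSACTIONAL",
    "weekly_digest": "BULK",
}

# Each worker group lists the only queues it is allowed to consume.
WORKER_SUBSCRIPTIONS = {
    "critical-workers": frozenset({"email.critical"}),
    "transactional-workers": frozenset({"email.transactional"}),
    "standard-workers": frozenset({"email.standard"}),
    "bulk-workers": frozenset({"email.bulk"}),
}


def queue_for_template(template_key: str) -> str:
    """Map a template to its queue; unknown templates default to STANDARD."""
    priority = PRIORITY_BY_TEMPLATE.get(template_key, "STANDARD")
    return QUEUE_BY_PRIORITY[priority]


def may_consume(worker_group: str, queue_name: str) -> bool:
    """Return True only if the worker group is subscribed to the queue."""
    return queue_name in WORKER_SUBSCRIPTIONS.get(worker_group, frozenset())
```

Deriving priority from the template on the server side, rather than trusting a caller-supplied value alone, keeps a misconfigured marketing job from placing itself in the critical lane.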

3. Webhook Delivery Failures and Duplicate Events

A delivery provider experiences a network split, sending the same delivery confirmation webhook event multiple times.

  • The Threat: Duplicate events trigger redundant database updates and duplicate suppression entries.
  • Resilience Design:
    • We use Webhook Deduplication.
    • When the webhook receiver ingests an event, it generates a unique fingerprint by hashing the provider event ID and target status.
    • The receiver attempts to insert this fingerprint into the webhook_deduplication_logs table. If the database returns a key constraint violation, the event is discarded as a duplicate.
    • To prevent status regressions (e.g., processing a late SENT webhook after a DELIVERED webhook has already been logged), the status transition engine uses a deterministic state matrix, discarding updates that attempt to transition a message to an earlier state.
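
The fingerprint check and the state matrix can be combined in a short sketch. The in-memory set stands in for the webhook_deduplication_logs primary key, and the numeric ordering in STATUS_RANK is an assumption about which transitions count as moving forward.

```python
import hashlib

# Assumed ordering: a transition is applied only if it moves to a strictly higher rank.
STATUS_RANK = {
    "QUEUED": 0,
    "RENDERED": 1,
    "SENT": 2,
    "DELIVERED": 3,
    "BOUNCED": 3,
    "COMPLAINED": 4,
}


def event_fingerprint(provider_event_id: str, target_status: str) -> str:
    """Hash the provider event ID and target status into a fixed-length key."""
    raw = f"{provider_event_id}:{target_status}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class EventProcessor:
    def __init__(self) -> None:
        self.seen: set[str] = set()        # stands in for the deduplication table
        self.status: dict[str, str] = {}   # message_id -> current status

    def apply(self, message_id: str, provider_event_id: str, target_status: str) -> str:
        fingerprint = event_fingerprint(provider_event_id, target_status)
        if fingerprint in self.seen:
            return "DUPLICATE"  # same event delivered more than once
        self.seen.add(fingerprint)
        current = self.status.get(message_id, "QUEUED")
        if STATUS_RANK[target_status] <= STATUS_RANK[current]:
            return "STALE"  # late or out-of-order event; keep the newer status
        self.status[message_id] = target_status
        return "APPLIED"
```

In SQL, the same rule is typically expressed as a conditional UPDATE whose WHERE clause only matches rows in an earlier status.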

4. Suppression List Lookup Latency

Checking if an email is suppressed against a database table containing 20 million blocked addresses slows down the ingestion API.

  • The Threat: High database read latency blocks API threads, decreasing system throughput.
  • Resilience Design:
    • We deploy an in-memory Redis Bloom Filter in front of the database.
    • The Bloom filter stores the set of all suppressed email hashes.
    • When an ingestion request arrives, the system queries the Bloom filter:
      • If the Bloom filter returns false, the recipient is guaranteed to be clean, and the request bypasses the database check.
      • If the Bloom filter returns true (indicating the email might be suppressed), the system queries the email_suppressions table to verify.
    • This avoids database read operations for greater than 95% of incoming requests.
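
The two-step lookup can be illustrated with a small pure-Python Bloom filter. This is a teaching sketch, not the Redis module: the class, the double-hashing scheme, and the plain set standing in for the email_suppressions table are all assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size bit array; may report false positives, never false negatives."""

    def __init__(self, size_bits: int, num_hashes: int) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def is_suppressed(email_hash: str, bloom: BloomFilter, suppression_table: set[str]) -> bool:
    """Skip the table lookup when the filter says the hash is definitely absent."""
    if not bloom.might_contain(email_hash):
        return False
    # A positive may be a false positive, so confirm against the source of truth.
    return email_hash in suppression_table
```

The guarantee only holds while the filter is at least as complete as the table, so new suppressions must be added to the filter as part of the same write path that inserts the row.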

Architectural Trade-offs

Choosing the template rendering model and worker dispatch strategy requires balancing latency against resource consumption.

Trade-off 1: Synchronous vs. Asynchronous Template Rendering

Synchronous rendering compiles and evaluates templates at the API Gateway before queueing; asynchronous rendering delegates compilation to background workers right before dispatch.

| Feature / Metric | Synchronous Rendering (At Ingestion) | Asynchronous Rendering (In Worker) |
| --- | --- | --- |
| Ingestion Latency | High. Gateway must parse and compile HTML templates before returning. | Low. Gateway only writes raw JSON variables to the queue. |
| Queue Payload Size | High. The queue holds the fully rendered HTML payload (often 50 KB). | Low. The queue only holds the raw template variables (typically 1 KB). |
| Template Editing Safety | High. An edit to a template does not change emails that are already queued. | Low. Changing a template can alter the layout of emails currently in the queue. |
| Resource Efficiency | Low. Requires compute resources at the gateway layer. | High. Workers render templates asynchronously in batches. |

Trade-off 2: Push-Based Broker vs. Pull-Based Worker Dispatch

Push-based models stream messages to workers using persistent sockets; pull-based models require workers to fetch messages from the broker manually.

| Feature / Metric | Push-Based Broker (RabbitMQ) | Pull-Based Worker (Kafka) |
| --- | --- | --- |
| Delivery Latency | Low. Messages are pushed to available workers immediately. | Medium. Workers poll the broker at configured intervals. |
| Backpressure Control | Depends on configuration. Without a consumer prefetch limit, the broker keeps pushing messages to overloaded workers. | Excellent. Workers only pull new messages when they have spare capacity. |
| Scale Capacity | Medium. Requires managing persistent connection sockets. | High. Scales horizontally to millions of messages using partitions. |

Staff Engineer Perspective

Maintaining an email delivery infrastructure requires implementing deliverability protections and database optimizations.


Verbal Script

Interviewer: "How do you protect transactional email delivery from being delayed by bulk marketing campaigns?"

Candidate: "We protect transactional email delivery by implementing Strict Queue Isolation and Worker Pool Isolation.

If we routed all email dispatches to a single message queue, a marketing campaign targeting 10 million users would add 10 million messages to the queue.

A critical password reset request generated during the campaign would sit at the back of the queue, taking hours to deliver.

To solve this:

  • We set up separate queue topics: email.critical (for OTPs and security alerts), email.transactional (for billing and order confirmations), email.standard (for routine product notifications), and email.bulk (for newsletters).
  • The API Gateway identifies the message type and routes it to the matching queue.
  • We deploy separate worker groups. The critical worker pool is dedicated exclusively to the email.critical queue. These workers are configured with a strict concurrency limit and never read from the bulk queues.
  • This means that even if the bulk queue has a backlog of millions of messages, the critical workers keep draining their own queue, so OTPs can still be handed to the provider within a target such as 500 milliseconds."

Interviewer: "How would you handle a massive spike in duplicate webhook feedback events from a provider like SendGrid?"

Candidate: "We handle duplicate webhook events with layered defenses: stateless webhook buffering, idempotency checks, and a database state transition matrix.

First, we decouple webhook ingestion from processing. Webhook events can spike to thousands of requests per second, and trying to process them synchronously would exhaust our database connection pools.

The webhook receiver validates the provider's signature, writes the raw event payload to a Kafka topic (email.webhooks), and returns an immediate HTTP 200 OK to the provider. This ensures we ingest the data quickly without blocking the provider's connection.

Second, the event consumers read from the Kafka topic and deduplicate events before writing to the database. We generate a unique fingerprint for each event by hashing the provider event ID and target status.

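We write this fingerprint to a Redis cache with a 24-hour TTL using a SET NX operation. If the key already exists, SET NX does not write (it returns nil), and the event is discarded as a duplicate. The primary key on the webhook_deduplication_logs table remains the durable backstop for duplicates that arrive after the TTL expires.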

Finally, to handle out-of-order webhooks (e.g., receiving a SENT event after a DELIVERED event has already been processed), our database update uses a state transition matrix.

The update query only modifies the message status if the transition is valid (e.g., transitioning from SENT to DELIVERED), preventing late-arriving events from overwriting newer statuses."


Interviewer: "Why is email address hashing critical for suppression lists, and how does it affect database performance and compliance?"

Candidate: "Email address hashing is critical for suppression lists because it addresses both data privacy compliance (such as GDPR and HIPAA) and database search performance.

From a privacy perspective, storing raw email addresses in cleartext on a suppression list is a risk. If the database is compromised, the email list is exposed.

Under GDPR, users have the 'right to be forgotten'. However, if a user unsubscribes or hard-bounces, we must retain their record on a suppression list to prevent sending them future emails.

By hashing the email address using SHA-256 with a tenant-specific salt, we avoid storing the address in cleartext. A salted hash is pseudonymized rather than anonymous, so it should still be handled as personal data, but it sharply limits exposure if the table leaks. We can verify whether an address is suppressed by hashing the incoming request and matching it against the hash list without storing the email in cleartext.

From a performance perspective, searching B-Tree indexes for fixed-length 64-character SHA-256 hashes is significantly faster than searching variable-length string emails (which can range up to 254 characters).

This reduces index sizes, fits more index pages into memory, and ensures sub-millisecond lookups during ingestion."
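
As a minimal sketch of the hashing step, assuming the salt is simply prepended to a normalized address before SHA-256 is applied: the function names and the trim-and-lowercase normalization are illustrative choices, not something this lesson specifies.

```python
import hashlib


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so equivalent spellings hash identically (an assumption)."""
    return email.strip().lower()


def hash_email(email: str, tenant_salt: bytes) -> str:
    """Return the 64-character hex SHA-256 digest of salt + normalized address."""
    normalized = normalize_email(email).encode("utf-8")
    return hashlib.sha256(tenant_salt + b":" + normalized).hexdigest()
```

The ingestion path and the suppression writer must use the same salt and the same normalization, otherwise a suppressed address will hash to a different value and slip through the filter.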

