System Design: Designing a Real-Time Ad Click Aggregator

Ad click aggregation represents one of the most high-throughput data engineering problems in modern advertising platforms. When billions of users click on ads across search engines, social media networks, and mobile applications, these clicks must be instantly captured, deduplicated, aggregated, and stored.

This case study analyzes the architecture required to scale a real-time ad click aggregator to handle billions of daily events, focusing on Exactly-Once Semantics (EOS), stateful stream processing, dynamic click fraud filtering, and optimized OLAP databases.

1. Requirements & Core Constraints

To design an ad click aggregator operating at the scale of tech giants, we must establish functional boundaries and evaluate performance expectations mathematically.

Functional Constraints

Real-Time Dashboards: Advertisers must see click counts, click-through rates (CTR), and budget spends aggregated in 1-minute windows on their dashboards within $\le 5$ seconds of clicks occurring.
Strict Billing Accuracy: Every single click representing a financial transaction must be accounted for. The billing system must never bill advertisers twice for the same click (deduplication) and must never drop valid click transactions.
Ad Fraud & Bot Detection: The platform must identify and filter out fraudulent clicks (e.g., botnets, repeated clicks from the same IP, click farms) in real-time, preventing invalid charges.
Audit Trails: Raw click data must be persisted for long-term historical analysis, compliance audits, and advertiser dispute resolutions.

Non-Functional SLAs

High Throughput Scaling: The system must scale to handle $1,000,000,000$ (1 Billion) clicks per day, accommodating extreme traffic spikes (e.g., Super Bowl ads, global holiday shopping events).
Sub-Second Streaming Latency: Ingest tracking nodes must acknowledge requests in $< 10\text{ms}$. Stream processing pipelines must propagate aggregated windows to OLAP storage within $1\text{ second}$.
Durability & Fault Tolerance: Zero-loss data guarantees. Ingested clicks must be written to disk in a persistent write-ahead log prior to acknowledgement.
Availability: High availability ($99.999%$ uptime SLA) for tracking gateway clusters to prevent advertiser revenue loss.

Back-of-the-Envelope Estimation

1. Ingest Throughput

Daily Click Volume: $1,000,000,000$ clicks per day.
Average Ingest QPS: $$\text{Average QPS} = \frac{1,000,000,000 \text{ clicks}}{86,400 \text{ seconds}} \approx 11,574 \text{ clicks/sec}$$
Peak Scaling QPS: We must design for a peak factor of $5\times$ average load to support massive, concurrent ad events: $$\text{Peak Ingest QPS} = 11,574 \times 5 \approx 57,870 \text{ clicks/sec}$$

2. Network Bandwidth

Raw Event Payload: Each tracking request contains headers, IP addresses, cookie hashes, user-agent details, ad identifiers, and browser parameters, averaging $500 \text{ bytes}$: $$\text{Ingress Traffic (Average)} = 11,574 \text{ clicks/sec} \times 500 \text{ bytes} \approx 5.78 \text{ MB/s} \approx 46.24 \text{ Mbps}$$ $$\text{Ingress Traffic (Peak)} = 57,870 \text{ clicks/sec} \times 500 \text{ bytes} \approx 28.93 \text{ MB/s} \approx 231.44 \text{ Mbps}$$

3. Storage Sizing (5-Year Plan)

Raw Log Storage (AWS S3): For historical audit trails, raw click events are compressed and saved in Apache Parquet format, reducing file sizes by $80%$ (to $100 \text{ bytes}$ per click event): $$\text{Compressed Daily Storage} = 1,000,000,000 \text{ clicks} \times 100 \text{ bytes} \approx 100 \text{ GB/day}$$ $$\text{5-Year Audit Storage} = 100 \text{ GB/day} \times 365 \text{ days} \times 5 \text{ years} \approx 182.5 \text{ TB}$$
Aggregated Storage (ClickHouse): Instead of storing every raw click in the OLAP layer, events are pre-aggregated into 1-minute intervals grouped by ad_id, advertiser_id, geo, and device.
- Assume the platform hosts $1,000,000$ active ads.
- On average, $200,000$ distinct ads are clicked in any given minute.
- Each aggregated entry (containing counts, sum of spend, and timestamp) is $64 \text{ bytes}$: $$\text{Aggregated Ingest Rate} = 200,000 \text{ entries/min} \times 64 \text{ bytes} = 12.8 \text{ MB/min}$$ $$\text{Aggregated Daily Storage} = 12.8 \text{ MB/min} \times 1,440 \text{ mins/day} \approx 18.43 \text{ GB/day}$$ $$\text{5-Year OLAP Storage} = 18.43 \text{ GB/day} \times 365 \text{ days} \times 5 \text{ years} \approx 33.63 \text{ TB}$$

2. API Design & Core Contracts

Ad click ingestion requires a low-overhead tracking API alongside internal REST endpoints for configuration.

1. Client Click Tracking Request

GET /api/v1/clicks Executed when a user clicks on an ad, routing the client request through the tracking gateway fleet.

Query Parameters:

ad_id=ad_99a2c3d4
advertiser_id=adv_88f7e6d5
campaign_id=cmp_77b6a5d4
click_id=clk_abc123xyz456
user_id=usr_55c4b3a2
cost_per_click=0.45
ip_address=192.168.1.1
timestamp=1782236500
redirect_url=https://www.advertisersite.com/landing

Response Headers (302 Found):

Location: https://www.advertisersite.com/landing
Cache-Control: no-cache, no-store, must-revalidate
Set-Cookie: csp_tracking_id=usr_55c4b3a2; Path=/; Secure; HttpOnly; SameSite=Strict

2. Stream Ingest Protobuf Contract (Kafka Payload)

To maximize throughput and minimize network I/O, the stateless click tracking servers serialize the raw query string parameters into structured protocol buffer payloads before writing to Apache Kafka:

syntax = "proto3";

package csp.adclick.v1;

message ClickEvent {
  string click_id = 1;
  string ad_id = 2;
  string advertiser_id = 3;
  string campaign_id = 4;
  string user_id = 5;
  double cost_per_click = 6;
  string ip_address = 7;
  string user_agent = 8;
  int64 timestamp = 9;
  string referrer = 10;
}

3. Advertiser Aggregation Query Endpoint (REST)

GET /api/v1/advertisers/analytics Invoked by the advertiser dashboard frontend to load aggregated performance charts.

Request Headers:

Authorization: Bearer <JWT_TOKEN>
Accept: application/json

Query Parameters:

advertiser_id=adv_88f7e6d5
start_time=1782236400
end_time=1782237000
granularity=1m

Response Payload (200 OK):

{
  "advertiser_id": "adv_88f7e6d5",
  "granularity": "1m",
  "series": [
    {
      "timestamp": 1782236400,
      "ad_id": "ad_99a2c3d4",
      "clicks": 540,
      "total_spend": 243.00,
      "impressions": 12000,
      "ctr": 0.045
    },
    {
      "timestamp": 1782236460,
      "ad_id": "ad_99a2c3d4",
      "clicks": 620,
      "total_spend": 279.00,
      "impressions": 14500,
      "ctr": 0.042
    }
  ]
}

3. High-Level Design (HLD)

The architecture leverages a hybrid Kappa Architecture design. The "Speed Path" streams click events in real-time to Flink and ClickHouse for interactive reporting, while the "Batch/Audit Path" captures raw events into cold storage for offline processing.

graph TD
    %% User Action
    User[User Clicks Ad] -->|1. HTTP GET Tracking Request| LB[DNS Load Balancer]
    
    %% Ingest Tier
    LB -->|Geo Routing| TrackingFleet[Stateless Tracking Server Fleet]
    TrackingFleet -->|2. Redirect User to Landing Page| User
    
    %% Ingest Buffer
    TrackingFleet -->|3. Serialize & Write Protobuf| KafkaCluster[("Apache Kafka Ingestion Broker")]
    
    %% Ingestion Decoupling / Speed vs Batch Paths
    subgraph SpeedPath["Speed Path (Real-time Reporting)"]
        KafkaCluster -->|4a. Stream Events| FlinkCluster["Apache Flink Stream Processor"]
        
        %% In-Memory State & Deduplication
        FlinkCluster <-->|4b. Query / Set TTL| RedisDeduplicate[("Redis Cluster Deduplication Cache")]
        
        %% OLAP Storage
        FlinkCluster -->|5a. Batch Ingest (1-Min Windows)| ClickHouseCluster[("ClickHouse OLAP Database")]
    end

    subgraph BatchPath["Batch / Audit Path (Billing & Reconciliation)"]
        KafkaCluster -->|4c. Pull Events| KafkaConnect["Kafka Connect S3 Sink"]
        KafkaConnect -->|5b. Write Structured Parquet| S3Store[("AWS S3 Cold Audit Store")]
        S3Store -->|6. SQL Query| Athena[("AWS Athena Query Engine")]
        Athena -->|7. Reconcile Balances| BillingEngine["Monthly Billing Reconciliation Worker"]
    end

    %% Dashboard Query Loop
    Advertiser[Advertiser Dashboard] -->|8. Request Metrics| APIGateway["API Gateway"]
    APIGateway -->|9. Read aggregated metrics| ClickHouseCluster
    BillingEngine -->|10. Update Advertiser Balance| DBPrimary[("PostgreSQL Ledger DB")]

    classDef database fill:#0d3b66,stroke:#f4d35e,stroke-width:2px,color:#fff;
    classDef cluster fill:#2e0f38,stroke:#f4d35e,stroke-width:2px,color:#fff;
    classDef client fill:#3d5a80,stroke:#293241,stroke-width:2px,color:#fff;
    classDef loadbalancer fill:#ee6c4d,stroke:#293241,stroke-width:2px,color:#fff;
    class ClickHouseCluster,S3Store,RedisDeduplicate,Athena,DBPrimary database;
    class FlinkCluster,TrackingFleet,KafkaCluster,KafkaConnect,BillingEngine cluster;
    class User,Advertiser client;
    class LB loadbalancer;

End-to-End Architectural Workflows

1. Ingestion and Speed Path (Real-Time Dashboards)

Click Event Capture: When a user clicks an ad, their browser issues a request to the tracking gateway. The DNS Load Balancer routes the request to the nearest geographical Stateless Tracking Server Node.
Immediate Redirection: The Tracking Server records the HTTP parameters and immediately responds with an HTTP 302 Redirect to route the user to the advertiser's landing page, keeping user-facing latencies under $10\text{ms}$.
Protobuf Logging: The Tracking Server serializes the event fields into binary Protobuf format and appends it to Apache Kafka in a partition keyed by click_id.
Deduplication & Stream Aggregation: Apache Flink consumes the click events from Kafka. It deduplicates incoming clicks in real-time by checking a high-speed Redis Cluster configured with an active 10-minute TTL.
Windowed Writes: Every minute, Flink outputs the aggregated metrics (ad_id, advertiser_id, clicks_count, total_spend) and batch-writes them to ClickHouse using bulk insert APIs, avoiding the database write-amplification bottleneck.
Dashboard Querying: When an advertiser loads their dashboard, the frontend queries the API Gateway, which pulls the aggregated metrics directly from ClickHouse in under $100\text{ms}$.

2. Batch and Audit Path (Billing & Fraud Auditing)

Parquet Exporting: Simultaneously, Kafka Connect pulls the raw events from Kafka, bundles them into large structured blocks, and writes them to AWS S3 as compressed Apache Parquet files.
Batch Auditing: A daily Billing Reconciliation Worker runs large analytical queries over S3 using AWS Athena to check for fraudulent activity and audit billing records.
Ledger Commit: Reconciled spends are processed, and the final balances are written to the primary, transactional PostgreSQL Ledger Database.

4. Low-Level Design (LLD) & Data Models

Database Selection Rationale

Database	Architecture Model	Primary Role	System Justification
PostgreSQL	Relational SQL	Financial Ledger & Account Info	Advertiser accounts, billing invoices, payment profiles, and balances require strict ACID transactions, schema safety, and relational indices.
ClickHouse	Columnar OLAP	Real-Time Dashboards	advertiser analytical queries require scanning millions of aggregate records. Columnar databases allow extremely fast analytical aggregations (`SUM`, `COUNT`) without index scans.
Redis Cluster	In-Memory Key-Value	10-Min Click Deduplication	Key-value lookups must occur within $< 1\text{ms}$ to prevent stream bottlenecks. Redis hashes with TTL allow high-frequency deduplication.
AWS S3 (Parquet)	Distributed Object Store	Long-term Audit Storage	Storing raw historical click logs cheaply and durably. Columnar Parquet format minimizes storage space and query scan costs.

SQL DDL Database Schemas

PostgreSQL Ledger Database (Advertiser Account Balance)

-- Advertiser Account Registry
CREATE TABLE advertisers (
    id VARCHAR(64) PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    current_balance DECIMAL(15, 4) NOT NULL DEFAULT 0.0000,
    credit_limit DECIMAL(15, 2) NOT NULL DEFAULT 1000.00,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Transaction Ledger (Double-Entry Financial Audits)
CREATE TABLE billing_ledger (
    transaction_id VARCHAR(64) PRIMARY KEY,
    advertiser_id VARCHAR(64) NOT NULL REFERENCES advertisers(id) ON DELETE RESTRICT,
    amount DECIMAL(15, 4) NOT NULL, -- Negative for charges, positive for payments
    version_id INT NOT NULL DEFAULT 1,
    description TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_billing_ledger_advertiser ON billing_ledger(advertiser_id, created_at);

ClickHouse OLAP Table Schema (Aggregated Dashboards)

ClickHouse utilizes the specialized ReplacingMergeTree engine. This engine automatically deduplicates late-arriving click events at merge time, guaranteeing eventual consistency even during pipeline retries.

CREATE DATABASE IF NOT EXISTS csp_analytics;

-- Aggregate Minute Table for Advertisers
CREATE TABLE csp_analytics.ad_clicks_hourly (
    click_minute DateTime,
    ad_id String,
    advertiser_id String,
    campaign_id String,
    device_type LowCardinality(String),
    country_code LowCardinality(String),
    raw_clicks UInt32,
    unique_clicks UInt32,
    total_spend Decimal64(4),
    version UInt64 -- Incremented value for ReplacingMergeTree reconciliation
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(click_minute)
PRIMARY KEY (advertiser_id, click_minute)
ORDER BY (advertiser_id, click_minute, campaign_id, ad_id, device_type, country_code);

5. Streaming Deduplication & Exactly-Once Semantics

Billing systems cannot tolerate click double-counting. We must ensure every click is processed Exactly-Once.

graph LR
    Producer[Producer Client] -->|1. Write with PID & SeqNo| Kafka[Kafka Log]
    Kafka -->|2. Event Checkpoint| FlinkState["Flink RocksDB State"]
    FlinkState -->|3. Query Deduplication Key| Redis[Redis TTL Cache]
    Redis -->|4. If unique -> Process & Commit| Sink[ClickHouse]

1. Ingestion Tier: Kafka Idempotent Producers

To prevent duplicate writes caused by network glitches during retries, producers are configured with enable.idempotence=true.
The Kafka broker assigns each producer a unique Producer ID (PID) and tracks internal Sequence Numbers for every message. If a producer retries writing a message the broker has already appended, the duplicate write is silently discarded, ensuring exactly-once ingestion.

2. Processing Tier: Stateful Stream Deduplication (Flink)

Flink maintains local, distributed state (configured with a RocksDB state backend) mapped by click_id.
For fast, global, multi-node checking, Flink validates incoming click_id keys against a high-speed Redis Cluster using a 10-minute sliding TTL. If the key exists in Redis, Flink flags the event as a duplicate and discards it.
Checkpointing (Chandy-Lamport): Flink periodically snapshots its internal offsets and states to S3. If a streaming worker crashes, the cluster rolls back to the latest validated checkpoint, resuming consumption without data loss.

Production-Grade Flink Streaming Java Engine

Below is a complete, production-ready Apache Flink streaming engine implementation designed to consume, deduplicate, and aggregate ad click streams:

package com.codesprintpro.adclicks.streaming;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.io.Serializable;
import java.math.BigDecimal;

public class AdClickAggregatorJob {

    // Input Event Definition
    public static class RawClickEvent implements Serializable {
        public String clickId;
        public String adId;
        public String advertiserId;
        public double costPerClick;
        public long timestamp;
    }

    // Output Aggregate Definition
    public static class ClickAggregate implements Serializable {
        public String adId;
        public String advertiserId;
        public long windowEnd;
        public long clicksCount;
        public double totalSpend;

        public ClickAggregate() {}
        public ClickAggregate(String adId, String advertiserId, long windowEnd, long clicksCount, double totalSpend) {
            this.adId = adId;
            this.advertiserId = advertiserId;
            this.windowEnd = windowEnd;
            this.clicksCount = clicksCount;
            this.totalSpend = totalSpend;
        }
    }

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        // Enable Chandy-Lamport distributed snapshot checkpointing every 10 seconds
        env.enableCheckpointing(10000);

        // Define a mock stream source (would be KafkaSource in production)
        DataStream<RawClickEvent> clickStream = env.fromElements(new RawClickEvent());

        DataStream<ClickAggregate> aggregatedStream = clickStream
            // Group stream partition by clickId to isolate deduplication locally
            .keyBy(event -> event.clickId)
            .flatMap(new StatefulDeduplicationFilter())
            // Re-partition by adId to distribute window aggregation workloads
            .keyBy(event -> event.adId)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .reduce(
                (val1, val2) -> {
                    val1.costPerClick += val2.costPerClick;
                    return val1;
                },
                (key, window, input, out) -> {
                    RawClickEvent finalAgg = input.iterator().next();
                    long count = 0;
                    double spend = 0.0;
                    for (RawClickEvent ignored : input) {
                        count++;
                        spend += finalAgg.costPerClick;
                    }
                    out.collect(new ClickAggregate(finalAgg.adId, finalAgg.advertiserId, window.getEnd(), count, spend));
                }
            );

        aggregatedStream.print();
        env.execute("CSP-RealTime-AdClickAggregator");
    }

    // Stateful Deduplication Filter running inside Flink's local RocksDB State Store
    public static class StatefulDeduplicationFilter extends RichFlatMapFunction<RawClickEvent, RawClickEvent> {
        private transient ValueState<Boolean> isAlreadyProcessed;

        @Override
        public void open(Configuration parameters) {
            ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>(
                "dedup-state",
                TypeInformation.of(Boolean.class)
            );
            isAlreadyProcessed = getRuntimeContext().getState(desc);
        }

        @Override
        public void flatMap(RawClickEvent value, Collector<RawClickEvent> out) throws Exception {
            if (isAlreadyProcessed.value() == null) {
                // First time seeing this clickId; process it and update local state
                isAlreadyProcessed.update(true);
                out.collect(value);
            }
            // If state is present, this is a duplicate event; discard it
        }
    }
}

6. Scaling Challenges & System Bottlenecks

Operating a system at a baseline of 1 Billion daily events with a peak scaling capacity of 57,870 clicks/sec introduces intense physical limitations on the storage and compute tiers. Here is how we engineer around these hard limits:

1. The Database Write Saturation Bottleneck

The Bottleneck: Writing 57,870 records/sec directly to a relational store like PostgreSQL or even a wide-column store like Cassandra triggers immediate disk I/O write-amplification failures, lock contention, and high CPU usage due to excessive commit operations.
The Mitigation: Flink Pre-Aggregation. We avoid writing raw click events directly into the database. Instead, Apache Flink acts as an in-memory windowed buffer, grouping and summing clicks in 1-minute tumbling windows locally on streaming nodes using RocksDB state stores. Rather than writing 60,000 distinct clicks for a single popular ad over a minute, Flink issues a single aggregated write (e.g., updating the click counter by 60,000 in ClickHouse). This reduces database write loads by a factor of up to $60,000\times$, maintaining low IOPS on storage nodes.

2. Kafka Partition Hotspotting (Celebrity/Viral Ads)

The Bottleneck: To ensure click events for a specific ad are aggregated sequentially, the default Kafka partitioning strategy hashes payloads based on the ad_id to route them to the same partition. However, if a massive global brand launches a viral ad campaign (such as a Super Bowl event or a celebrity endorsement), millions of users click the same ad simultaneously. This concentrates the incoming QPS onto a single Kafka partition broker, causing severe CPU spikes, network buffer saturation, and extreme consumer lag, while the remaining brokers sit idle.
The Mitigation: Two-Stage Key Hashing with Dynamic Salting.
- Stage 1 (Salting & Local Aggregation): The Kafka tracking producers attach a random integer salt from a range of [0, N-1] (where $N$ is the number of parallel processing subtasks, e.g., 10) to the partition key, formatting it as ad_id_salt. This distributes the traffic for the same ad_id evenly across multiple partitions and broker nodes.
- Stage 2 (Local Reduction and Desalting): In Flink, we perform a two-stage aggregation. The stream is first keyed by ad_id_salt to locally aggregate clicks in a short window. This initial reduction strips the salt and emits a partially aggregated event. Finally, Flink re-keys the stream by the true ad_id to perform the global 1-minute aggregation. Since the volume of events entering the second-stage aggregation is already pre-reduced by a factor of 10, the hot partition issue is fully neutralized.

3. Mitigating Click Fraud (Botnets & Velocity Filters)

The Threat: Malicious actors or automated botnets can programmatically trigger millions of false click events. This is designed to rapidly deplete an advertiser's budget, distort advertising metrics, and load system components with garbage data.
The Mitigation: Stateful Real-Time Velocity Engines.
- We deploy an in-memory sliding window filter inside Flink mapped by the composite key (ip_address, ad_id) and (user_id, ad_id).
- Flink tracks click frequency with a sliding window of 10 seconds. If a specific IP or User ID exceeds a strict velocity threshold (e.g., $> 3$ clicks per second), Flink dynamically flags all subsequent click payloads from that identity within a 1-hour cooling window.
- The flagged fraudulent clicks are routed to a separate Kafka "Dead Letter Queue" (DLQ) and saved in cheap cold storage (S3) for historical audit trials and billing dispute resolutions. However, they are strictly excluded from the real-time ClickHouse aggregated counts and the advertiser's primary billing ledger, preserving advertiser trust.

7. Technical Trade-offs & Consistency Models

1. Kappa vs. Lambda Architectural Blueprints

Trade-off

Kappa vs. Lambda Architectures Historically, high-throughput systems utilized a Lambda Architecture, which ran a speed path (Storm/Flink) for real-time dashboards and a completely separate batch path (Hadoop/Spark) for highly accurate nightly billing runs. This required writing, debugging, and maintaining two separate codebases in two different frameworks, often leading to subtle, frustrating discrepancy errors when syncing metrics. To prevent this operational nightmare, we implement a pure Kappa Architecture. Apache Flink serves as the single stateful processing framework for both real-time streams and batch-like historical audits. By utilizing Kafka's long-term log retention, Flink can replay historical click streams using the exact same application logic, ensuring absolute parity between advertiser dashboards and the final financial ledgers.

2. High-Frequency State Store Performance

3. ClickHouse Write Optimization

8. Resilience & Failure Scenarios

1. Ingestion Broker Failures (Kafka Partition Splits)

If a primary Kafka partition broker node crashes, write paths to that node are blocked, which can lead to data loss and client tracking failures.

Recovery Protocol: We configure Kafka topics with a replication factor of 3 and set the producer property acks=all (ensuring acknowledgment from the leader and all in-sync replicas). If a broker node crashes, Kafka's controller node immediately promotes one of the in-sync replicas (ISR) to leader. The producer automatically shifts writes to the new leader node without dropping a single click, providing seamless fault tolerance.

2. Checkpoint Storage Downtime (Flink Failures)

If Flink experiences a cluster-wide crash, the active stream state is lost, risking duplicate aggregates.

Recovery Protocol: If Flink experiences a cluster-wide crash, it immediately restarts the job from the last successfully validated checkpoint stored in S3. Flink queries Kafka for the offset coordinates logged during that snapshot and re-consumes downstream. ClickHouse utilizes the ReplacingMergeTree engine, which uses the click_minute timestamp and version columns to overwrite stale records, ensuring eventual consistency during recovery.

9. Candidate Verbal Script (Interview Guide)

Below is an exhaustive, verbatim transcript showing how a Staff Engineer candidate navigates the system design interview:

Interviewer: "How would you design a system to capture, aggregate, and store ad clicks for billing and dashboards, handling billions of clicks per day?"

Candidate: "I will architect a real-time ingestion and aggregation system utilizing a hybrid Kappa Architecture with Exactly-Once Semantics. To handle our peak load of 57,800 clicks/sec with sub-10ms response times, I will deploy stateless tracking servers behind a DNS Anycast Load Balancer. When a user clicks an ad, the tracking server immediately responds with an HTTP 302 Redirect to route the user to the landing page. Asynchronously, it serializes the click event into Protobuf format and writes it to Apache Kafka using idempotent producers. To process these clicks, I will use Apache Flink as our stream processing engine. Flink will aggregate events in 1-minute tumbling windows grouped by ad_id and advertiser_id. Finally, these aggregated windows are batch-written to ClickHouse using bulk insert APIs, avoiding the database write-amplification bottleneck. When an advertiser loads their dashboard, the frontend queries ClickHouse, which retrieves the aggregated metrics in under 100ms."

Interviewer: "How do you ensure billing accuracy? What if the network fails and Flink re-processes a batch of clicks? Will the advertiser be double-billed?"

Candidate: "No, we guarantee exactly-once processing at every layer of the pipeline:

At Ingestion: The Kafka producer is configured with enable.idempotence=true. The Kafka broker tracks unique sequence numbers for each producer, silently discarding any duplicate writes caused by client retries.
At Processing: Flink deduplicates clicks in real-time by checking incoming click_id keys against a high-speed Redis Cluster configured with an active 10-minute TTL. Flink also leverages the Chandy-Lamport algorithm to periodically snapshot state to S3.
At Storage: ClickHouse uses the ReplacingMergeTree engine. This engine uses the click_minute and version columns to automatically overwrite duplicate entries during database background merges, ensuring eventual consistency."