Lesson 27 of 105 15 minFlagship

System Design: Designing a Real-Time Analytics Dashboard

How to visualize millions of events per second on a dashboard. Deep dive into stream aggregation, TSDBs, and WebSocket push architectures.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **High-volume Ingestion:** Capture event streams from various sources.
  • **Aggregations:** Support sliding/tumbling window aggregates (e.g., Clicks in the last 1 minute).
  • **Low-latency Visualization:** Dashboards update in seconds.
Recommended Prerequisites
System Design Interview FrameworkSystem Design Module 13: Observability

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Real-time analytics dashboards power everything from ride-sharing surge pricing to video-streaming trending lists and real-time ad click monitoring. These platforms ingest millions of continuous events per second, run rolling time-window aggregations, and push the resulting metrics down to thousands of active client web browsers in under a second.

Building such a system is a major engineering challenge. You cannot simply write raw events to a standard relational database and query them on-demand; the sheer write load will lock tables, and analytical group-by queries over millions of rows will choke database CPUs. Instead, you must build a streaming architecture that separates the write-heavy event ingestion from the read-heavy aggregate visualizations.

This system design guide details the architectural blueprint for a highly scalable, low-latency real-time analytics system capable of ingesting 1,000,000 events per second and broadcasting updates to 50,000 active clients in real time.


System Requirements

To build a production-grade real-time analytics platform, we divide our system requirements into functional, non-functional, and scale specifications.

Functional Requirements

  • High-Volume Ingestion: Ingest raw event streams (such as user clickstreams, system metrics, and transaction logs) from multiple client and server sources.
  • Windowed Aggregations: Support calculations (such as count, sum, average, and unique counts) over tumbling (non-overlapping) and sliding (overlapping) time windows.
  • Real-Time Push Visualizations: Push updated window metrics to client dashboards in real time with sub-second propagation latency.
  • Durable Historical Storage: Store raw and aggregated data for long-term historical analysis and custom period queries.
  • Metric Filtering & Grouping: Allow dashboards to filter and group metrics dynamically by dimensions (such as country, device, and campaign ID).

Non-Functional Requirements

  • High Ingest Availability: Ingest paths must be highly available and buffered to absorb sudden traffic spikes (e.g., during major sales or advertising campaigns).
  • Sub-Second Dashboard Latency: The delay between an event occurring and its representation on the dashboard must be less than 1 second.
  • Fault-Tolerant Processing: Aggregations must be calculated with exactly-once processing guarantees, even if stream processing nodes restart.
  • Low Query Latency: Historical queries for custom time ranges on the dashboard must return in less than 500 milliseconds.

Scale Assumptions

  • Ingestion Throughput: 1,000,000 events per second.
  • Dashboard Audience: 50,000 concurrent active users viewing the real-time dashboard.

API Design and Service Contracts

The analytics system exposes HTTPS REST endpoints for ingestion, HTTP/gRPC for querying historical metrics, and WebSocket connections for real-time update delivery.

1. Ingest Event (POST /v1/events)

Invoked by client SDKs and backend agents to stream event telemetry.

Request Payload:

{
  "eventId": "evt_uuid_00981a",
  "eventType": "ad_click",
  "timestamp": 1780821960,
  "dimensions": {
    "campaignId": "camp_5541",
    "country": "US",
    "deviceType": "MOBILE",
    "userId": "usr_9981a2"
  }
}

Response Payload (202 Accepted):

{
  "status": "ACCEPTED",
  "receivedAt": "2026-06-07T11:26:00.102Z"
}

2. Query Historical Metrics (GET /v1/metrics)

Used by dashboards to retrieve past trend data during page loads.

Query Parameters:

  • metric: "ad_clicks"
  • campaignId: "camp_5541"
  • start: "2026-06-07T10:00:00Z"
  • end: "2026-06-07T11:00:00Z"
  • resolution: "1m"

Response Payload (200 OK):

{
  "metric": "ad_clicks",
  "dimensions": { "campaignId": "camp_5541" },
  "dataPoints": [
    { "time": "2026-06-07T10:00:00Z", "value": 1205 },
    { "time": "2026-06-07T10:01:00Z", "value": 1450 }
  ]
}

3. WebSocket Real-Time Subscription (ws://stream.analytics.com/v1/live)

Clients open a persistent connection to receive real-time metric updates.

Subscription Request (Client to Server):

{
  "action": "subscribe",
  "topic": "metrics:ad_clicks:campaignId:camp_5541"
}

Metric Update Broadcast (Server to Client):

{
  "topic": "metrics:ad_clicks:campaignId:camp_5541",
  "windowStart": "2026-06-07T11:25:00Z",
  "windowEnd": "2026-06-07T11:26:00Z",
  "value": 15890,
  "timestamp": "2026-06-07T11:26:00.005Z"
}

High-Level Architecture

The architecture separates the write-heavy streaming path from the read-heavy query and real-time push pathways.

The Client SDK sends events to the Collector Service. The collector acts as a lightweight buffer proxy, writing events to partitioned Kafka Topics. The Stream Processor (Apache Flink) consumes raw events from Kafka, runs windowed calculations, and updates the ClickHouse Columnar Database.

For real-time updates, Flink writes aggregates to a Redis Pub/Sub broker. WebSocket Gateway Server nodes subscribe to Redis channels and push metrics down to Client Browsers.

Streaming Ingestion and Aggregation Pipeline

This sequence diagram tracks how telemetry events travel through Kafka and Flink before being saved to ClickHouse and routed to Redis.

sequenceDiagram
    autonumber
    participant SDK as Client SDK
    participant Coll as Event Collector
    participant Kafka as Apache Kafka Buffer
    participant Flink as Flink Stream Processor
    participant Click as ClickHouse OLAP DB
    participant Redis as Redis Pub/Sub Broker
    
    SDK->>Coll: POST /v1/events (raw click event)
    Coll->>Kafka: Write Event to Partitioned Topic
    Coll-->>SDK: 202 Accepted
    
    Flink->>Kafka: Pull Batch Event Stream
    Flink->>Flink: Aggregate count inside 60s tumbling window
    Flink->>Click: Batch Insert Aggregated metrics
    Flink->>Redis: Publish Aggregate: {clicks: 15890}

WebSocket Gateway Fan-Out Topology

This diagram illustrates how metric updates are fanned out from the stream processor to thousands of connected browser dashboards.

flowchart TD
    Flink[Flink Processor] -->|1. Publish window result| Redis[Redis Pub/Sub Cluster]
    
    subgraph Gateways [WebSocket Gateway Cluster]
        WS1[WebSocket Server Node 1]
        WS2[WebSocket Server Node 2]
        WS3[WebSocket Server Node 3]
    end
    
    Redis -->|2. Fan-out events| WS1 & WS2 & WS3
    
    WS1 -->|3. Push updates| BrowserA[User Browser A]
    WS2 -->|3. Push updates| BrowserB[User Browser B]
    WS3 -->|3. Push updates| BrowserC[User Browser C]

Low-Level Design and Schema

For high-speed analytical queries, we utilize ClickHouse, a column-oriented OLAP database. This design defines our raw event log and pre-aggregated rollup schema.

ClickHouse Raw Events Table

-- Tracks raw clickstream logs
CREATE TABLE raw_clickstream_events (
    event_id UUID,
    event_type LowCardinality(String),
    event_timestamp DateTime64(3, 'UTC'),
    campaign_id String,
    country LowCardinality(String),
    device_type LowCardinality(String),
    user_id String,
    created_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_timestamp)
ORDER BY (campaign_id, event_type, event_timestamp);

ClickHouse Hourly Aggregations Table

-- Stores pre-aggregated metrics for fast dashboard queries
CREATE TABLE hourly_analytics_metrics (
    metric_hour DateTime,
    event_type LowCardinality(String),
    campaign_id String,
    country LowCardinality(String),
    click_count UInt64,
    unique_users AggregateFunction(uniq, String)
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(metric_hour)
ORDER BY (campaign_id, event_type, metric_hour, country);

Schema Rationale & Storage Optimization

  1. Columnar Database Engine: Standard databases store data in row format. If you execute SELECT SUM(clicks) on a table of 100 columns, a row database reads all 100 columns from disk, creating massive I/O. ClickHouse (a column store) only reads the specific metric column, yielding 10x to 100x faster execution speeds.
  2. SummingMergeTree Engine: ClickHouse automatically aggregates rows with the same sorting keys during background merge passes. This reduces storage footprints and guarantees that historical dashboard queries execute in milliseconds.
  3. LowCardinality(String): Dimensions like country and device_type have a small set of unique values. Representing them as dictionary-encoded values reduces storage size and accelerates filtering scans.

Scaling Challenges and Capacity Estimation

A platform ingesting 1,000,000 events per second requires significant compute, network, and storage capacity.

1. Ingress Network Bandwidth

  • Assumptions:

    • Event Ingress Rate ($R$) = $1,000,000$ events/second
    • Average Raw Event Size ($S$) = $200$ bytes
  • Calculations: $$\text{Ingress Volume} = R \times S = 1,000,000 \times 200\text{ bytes} = 200,000,000\text{ B/sec} = 200\text{ MB/sec}$$ $$\text{Required Network Speed} = 200\text{ MB/sec} \times 8 = 1.6\text{ Gbps}$$

To manage this traffic, the ingress layer uses a cluster of 20 Event Collectors behind a Layer-4 Load Balancer. The load balancer splits requests evenly among the collectors, limiting network ingestion on each node to a manageable 80 Mbps.

2. Daily Database Storage Footprint

  • Calculations: $$\text{Daily Event Count} = 1,000,000\text{ events/sec} \times 86,400\text{ seconds/day} = 86,400,000,000\text{ events/day}$$ $$\text{Raw Storage Size} = 86,400,000,000 \times 200\text{ bytes} = 17,280,000,000,000\text{ bytes} = 17.28\text{ TB/day}$$

Over a year, this accumulates 6.3 PB of raw data. To make this manageable:

  • We utilize ClickHouse's built-in compression (e.g., LZ4 or ZSTD), which reduces the footprint by 5x to $\approx 3.45\text{ TB/day}$.
  • We write raw events to a Materialized View that outputs aggregated hourly summaries to the hourly_analytics_metrics table.
  • Raw events are expired and deleted after $7$ days using ClickHouse TTL definitions, while pre-aggregated metrics are retained indefinitely.

3. WebSocket Broadcast Egress Bandwidth

  • Assumptions:

    • Concurrent active dashboards ($C$) = $50,000$
    • Metric update frequency = $1$ update/second per dashboard
    • Broadcast payload size = $100$ bytes
  • Calculations: $$\text{Total Updates Broadcasted} = 50,000\text{ clients} \times 1\text{ update/sec} = 50,000\text{ updates/sec}$$ $$\text{Egress Bandwidth} = 50,000 \times 100\text{ bytes} = 5,000,000\text{ B/sec} = 5\text{ MB/sec} \approx 40\text{ Mbps}$$

This egress rate is easily handled. If the update payload size increases to 1 KB or if the update frequency increases (e.g., broadcasting multiple metric topics), the egress bandwidth can scale to 400 Mbps. We distribute this load across a cluster of WebSocket gateways to prevent single-node interface saturation.


Failure Scenarios and Resilience

Real-time streaming platforms must maintain accuracy and availability even during partial outages.

1. Stream Processor Backpressure Cascade

If the ClickHouse database slows down due to a heavy query, Flink cannot write window aggregates fast enough.

  • The Threat: Flink's internal queues fill up, triggering backpressure. The ingestion pipeline backs up to Kafka, and if buffers overflow, events are lost.
  • Resilience Design:
    • We configure Kafka with a $72$-hour retention window, allowing it to act as an elastic buffer.
    • Flink uses credit-based flow control to propagate backpressure back to Kafka consumers.
    • The collector service writes to Kafka asynchronously and remains unaffected by database slowdowns. Once ClickHouse recovers, Flink catches up by reading the buffered events from Kafka at maximum speed.

2. Stream Processor Node Crashes

A Flink processing node crashes while computing a sliding window aggregate.

  • The Threat: The state of in-progress windows is lost, leading to inaccurate metrics when the node restarts.
  • Resilience Design:
    • Flink is configured with RocksDB State Store and periodic checkpointing (e.g., every 10 seconds).
    • Checkpoints are saved to an external, distributed filesystem (such as HDFS or Amazon S3).
    • When a node crashes, the cluster manager restarts the task, restores the state from the last successful checkpoint, and replays the Kafka events since that checkpoint, achieving exactly-once processing guarantees.

3. Late-Arriving Events in Sliding Windows

Due to network delays, a mobile client sends click events 2 minutes after they actually occurred.

  • The Threat: If the window is already closed, these late events will be ignored, resulting undercounted metrics.
  • Resilience Design:
    • We use Flink Watermarks to manage late-arriving data. A watermark represents the progression of event time.
    • We allow a configurable lateness window (e.g., 30 seconds). Events arriving within this window are processed.
    • If an event arrives after the watermark, it is routed to a "dead-letter" Kafka topic for asynchronous correction or logged to ClickHouse as a late adjustment.

4. WebSocket Gateway Host Outages

A WebSocket gateway node hosting 10,000 active client connections crashes.

  • The Threat: 10,000 clients lose their streams simultaneously, causing dashboards to display stale data.
  • Resilience Design:
    • The client dashboard SDK implements automated reconnection logic.
    • Upon disconnection, clients wait for a randomized delay (to prevent a thundering herd on the remaining servers) and attempt to connect to the load balancer again.
    • The load balancer routes the connections to the remaining healthy servers in the gateway pool.

Architectural Trade-offs

Choosing the components and data structures for real-time analytics requires balancing data accuracy against infrastructure costs.

Trade-off 1: Tumbling Windows vs. Sliding Windows

Tumbling windows group events in fixed, non-overlapping intervals (e.g., 9:00 to 9:05). Sliding windows group events in overlapping intervals (e.g., the last 5 minutes, moving forward every 10 seconds).

Aspect Tumbling Windows Sliding Windows
Compute Overhead Low (Each event is processed once) High (Each event falls into multiple overlapping windows)
Memory State size Low (State is cleared when the window closes) High (Must store events for the duration of the slide)
User Experience Steppy (Dashboard metrics update in block chunks) Smooth (Metrics drift smoothly over time)

Trade-off 2: Columnar ClickHouse vs. Key-Value InfluxDB

We compare ClickHouse (a general-purpose columnar database) with InfluxDB (a specialized key-value time-series database).

Aspect Columnar ClickHouse InfluxDB
Aggregation Speed Extremely High on massive datasets High on simple queries; struggles on large scans
Write Compression High (Column structure optimizes compression) Medium (Index overhead limits compression)
Metadata Tag cardinality High (Can handle millions of unique tags) Low (High cardinality tags crash the index)
SQL Support High (Supports standard SQL queries and joins) Low (Uses custom query languages)

Staff Engineer Perspective

Operating real-time analytics at scale requires managing streaming states and network egress.


Verbal Script

Interviewer: "How would you design a real-time analytics system that can calculate the count of unique users in the last 15 minutes, with a 10-second update interval, at a scale of 1,000,000 events per second?"

Candidate: "Calculating exact unique counts (cardinality) over sliding windows at this scale is a hard problem because we cannot store the list of user IDs in memory. Doing so would consume terabytes of RAM and crash the stream processor. Instead, we use HyperLogLog (HLL) sketches to estimate cardinality.

First, Flink consumes events from Kafka and groups them into 10-second tumbling windows. For each window, Flink maps user IDs to an HLL sketch (which consumes only a few kilobytes of state).

Second, we write these 10-second HLL sketches to ClickHouse. To query the unique user count for the last 15 minutes, the dashboard runs a query that aggregates the 90 HLL sketches from that window using ClickHouse's HLL merge function. This estimates unique counts with less than 1% error, consumes minimal CPU, and returns results in milliseconds."

Interviewer: "What happens if a Flink node experiences a garbage collection pause of 10 seconds, causing it to fall behind Kafka?"

Candidate: "When Flink experiences a pause, it falls behind the Kafka partition offset. Once the pause ends, the Flink coordinator detects the lag and increases consumer read speeds, consuming buffered events at maximum throughput.

To handle the burst of events, Flink allocates dynamic buffer sizes and uses credit-based flow control. If the lag continues, Flink allocates more task slots to the affected partitions to balance the load, catching up to the real-time stream without dropping events."

Interviewer: "How would you prevent a slow network connection on one client from slowing down the entire WebSocket gateway server?"

Candidate: "This is a classic Slow Consumer problem. If a client has a poor connection, the TCP socket buffer on the gateway fills up, and the gateway's write buffer blocks, consuming memory and thread resources.

To protect the gateway, we isolate client connection channels. The gateway writes updates to an internal ring buffer for each client. If a client's buffer fills up (e.g., it holds more than 50 unsent updates), the gateway drops the connection, releases the socket, and frees the resources, ensuring a slow client does not impact other users on the server."

Interviewer: "What is your strategy for backfilling historical data if the Flink aggregation code is updated with a new business rule?"

Candidate: "We use a Kappa Architecture with dual-reading paths.

When the business logic changes, we deploy the new Flink job side-by-side with the active job. We configure the new job to start reading from the beginning of the Kafka retention window (e.g., 7 days ago).

The new job writes its output to a temporary ClickHouse table. Once the new job catches up to the real-time stream, we swap the active table name alias in ClickHouse to point to the new table, and terminate the old Flink job. This backfills historical data with zero downtime and no database lock contention."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.