System Design: Building a Metrics Platform Like Prometheus

Metrics platforms answer operational questions quickly:

Is error rate increasing?
Which service owns the latency spike?
Did a deployment change request volume?
Which tenant is causing the queue backlog?
Are we about to run out of disk, memory, or connections?

At small scale, Prometheus plus Grafana is enough. At larger scale, the hard parts are not the line charts. The hard parts are ingestion fanout, high-cardinality labels, retention, downsampling, alert evaluation, multi-tenancy, query cost, and keeping the platform alive during the exact incidents it is supposed to debug.

This guide designs a production metrics platform inspired by Prometheus-style systems. It covers ingestion, scraping versus push, storage, label indexing, rollups, retention, alerting, multi-tenancy, query APIs, and operational guardrails.

Requirements and System Goals

A metrics platform is a specialized time-series data system. It must ingest massive streams of numeric data and support low-latency querying, even during production outages when usage spikes.

Functional Requirements

Multi-Format Collection: Ingest metrics from services using both pull-based scraping (pull) and push-based remote writing (push).
Core Metric Types: Support standard telemetry metrics including Counters, Gauges, and Histograms.
Dynamic Filtering & Grouping: Allow filtering and grouping by arbitrary key-value label tags (e.g., service="payment-api", status="500").
Query Engine & API: Expose a query engine (e.g., supporting PromQL-style expressions) for dashboard visualization.
Scheduled Alert Evaluations: Evaluate rules at configured intervals and forward firing alerts to notification targets.
Data Retention & Downsampling: Store raw high-resolution samples for short-term debugging and automatically roll them up into downsampled datasets for long-term historical trends.
Multi-Tenant Isolation: Isolate metrics, ingestion budgets, and query namespaces between distinct teams or organizations.

Non-Functional Requirements

High Write Throughput: Support ingestion of millions of samples per second from a fleet of microservices.
Sub-Second Query Latency: Dashboards and alerts must load aggregate query results in under 500 milliseconds.
Ingestion Sharding & Availability: Scale ingestion horizontally. A failure in one scraper or ingestion worker must not affect the write throughput of others.
Cardinality Guardrails: Enforce strict cardinality budgets per tenant to prevent memory exhaustion from runaway label allocation.
Isolate Alerting from Query Spikes: Ensure alert evaluation continues running on dedicated resources even if ad-hoc query traffic spikes.

API Interfaces and Service Contracts

We expose REST endpoints for remote write ingestion and PromQL-style queries. Scraper services rely on lightweight gRPC target scraping definitions.

Push Ingestion API (Remote Write)

Workloads push metrics in batches using a binary protobuf format or compressed JSON.

Endpoint: POST /v1/metrics/push
Headers:
- Content-Type: application/json
- X-Tenant-ID: tenant_payments_01
Request Payload:

{
  "samples": [
    {
      "metric": "http_requests_total",
      "labels": {
        "service": "checkout-api",
        "method": "POST",
        "route": "/v1/orders",
        "status": "200"
      },
      "timestampMs": 1775643330000,
      "value": 1842.0
    }
  ]
}

Response Payload (HTTP 202 Accepted):

{
  "ingestedCount": 1,
  "droppedCount": 0
}

Metrics Query API

Endpoint: GET /v1/metrics/query_range
Query Parameters:
- query=sum(rate(http_requests_total{service="checkout-api"}[5m])) by (status)
- start=2026-06-06T14:00:00Z
- end=2026-06-06T14:30:00Z
- step=15s
Response Payload (HTTP 200 OK):

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "status": "200"
        },
        "values": [
          [1775643300, "154.2"],
          [1775643315, "158.5"]
        ]
      }
    ]
  }
}

Internal Scraper Target Service Contract (gRPC)

Scraper agents retrieve list target definitions via gRPC:

syntax = "proto3";

package codesprintpro.metrics.v1;

service ScrapeTargetCoordinator {
  rpc GetScrapeTargets (ScrapeTargetsRequest) returns (ScrapeTargetsResponse);
  rpc ReportScrapeStatus (ScrapeStatusRequest) returns (ScrapeStatusResponse);
}

message ScrapeTargetsRequest {
  string collector_id = 1;
  string cluster_id = 2;
}

message ScrapeTargetsResponse {
  repeated ScrapeTarget targets = 1;
}

message ScrapeTarget {
  string target_url = 1;         // e.g. "http://10.0.12.92:9090/metrics"
  int32 scrape_interval_sec = 2;
  map<string, string> labels = 3;
}

message ScrapeStatusRequest {
  string collector_id = 1;
  string target_url = 2;
  bool scrape_success = 3;
  int64 duration_ms = 4;
}

message ScrapeStatusResponse {
  bool acknowledge = 1;
}

High-Level Design and Visualizations

Our metrics platform separates pull-based scraping and push remote write ingestion from the heavy query and alert evaluation engines.

End-to-End Metrics Lifecycle

flowchart TD
    subgraph Targets [Workload Fleet]
        App1[App Instance A - Exposes /metrics]
        App2[App Instance B - Exposes /metrics]
        Serverless[Serverless Job - Remote Writes]
    end

    subgraph Collection [Ingestion & Sharding]
        Scraper[Scraper Agent Pool] -->|1. Pull Scrape /metrics| App1
        Scraper -->|2. Pull Scrape /metrics| App2
        PushGateway[Remote Write Push Gateway] -->|3. POST JSON / Proto| Serverless
        
        Scraper -->|4. Push Samples| IngestQueue[Kafka Ingestion Queue]
        PushGateway -->|5. Push Samples| IngestQueue
        
        IngestQueue -->|6. Dequeue & Hash Shard| Workers[Ingestion Worker Pool]
    end

    subgraph Storage [TSDB Tiered Storage]
        Workers -->|7. Write Chunk Files| MemStore[In-Memory TSDB Buffer]
        MemStore -->|8. Flush 2h Blocks| SSDStore[Local SSD Store]
        SSDStore -->|9. Compress & Ship| S3[Long-Term Object Storage - AWS S3]
        Compactor[Background Compactor & Rollup Engine] -->|10. Read/Write Rollups| S3
    end

    subgraph QueryEngine [Query & Alert Control Plane]
        Alertor[Alert Evaluation Service] -->|11. Fetch PromQL expression| MetadataDB[(Metadata PostgreSQL)]
        Alertor -->|12. Run Evaluation Query| QueryGateway[Query Execution Gateway]
        Grafana[Grafana Dashboard UI] -->|13. Refresh Charts| QueryGateway
        
        QueryGateway -->|14. Query Memory / SSD| SSDStore
        QueryGateway -->|15. Query Archive| S3
        
        Alertor -->|16. Forward firing alert| Alertmanager[Alert Router & Notification Engine]
        Alertmanager -->|17. Page On-Call| PagerDuty[PagerDuty / Slack]
    end

Scheduled Alert Evaluation and Isolation Loop

Alert evaluation runs on dedicated resources isolated from Grafana dashboard traffic. This ensures alerts are evaluated on schedule even during query storms.

flowchart TD
    Cron[Evaluation Loop - Every 15s] --> Dequeue[Fetch Active Alert Rules]
    Dequeue --> Engine[Isolated PromQL Evaluation Engine]
    
    Engine --> CheckCache{Check Active Metrics Cache}
    CheckCache -->|Cache Hit| Process[Evaluate Expression Values]
    CheckCache -->|Cache Miss| QueryGate[Query Execution Engine - SSD / S3]
    QueryGate --> Process
    
    Process --> CheckThreshold{Is Value > Threshold?}
    CheckThreshold -->|No| Clear[Resolve Pending / Firing State in MetaDB]
    CheckThreshold -->|Yes| SetPending[Mark Alert State PENDING in MetaDB]
    
    SetPending --> CheckDuration{Has Alert been active for 'for' window?}
    CheckDuration -->|No| Loop[Wait for Next Loop]
    CheckDuration -->|Yes| FireAlert[Mark Alert FIRING & Dispatch to Alertmanager]

Low-Level Design and Schema Strategies

Our metadata store is a PostgreSQL database. It stores tenant budgets, metric definitions, active targets, and alert configurations. The raw metrics samples are stored in a TSDB layout.

PostgreSQL Metadata DDLs

-- Track registered metrics configurations, owner details, and cardinality metrics
CREATE TABLE metric_definitions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    metric_name VARCHAR(256) NOT NULL,
    metric_type VARCHAR(32) NOT NULL,             -- 'COUNTER', 'GAUGE', 'HISTOGRAM'
    description TEXT,
    unit VARCHAR(64),
    active_series_count BIGINT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uk_tenant_metric UNIQUE (tenant_id, metric_name)
);

CREATE INDEX idx_metrics_tenant_lookup 
ON metric_definitions (tenant_id);

-- Enforce metric cardinality limits per tenant
CREATE TABLE tenant_budgets (
    tenant_id VARCHAR(64) PRIMARY KEY,
    max_active_series BIGINT NOT NULL DEFAULT 500000,
    max_samples_per_second BIGINT NOT NULL DEFAULT 50000,
    allowed_custom_labels_count INT NOT NULL DEFAULT 15,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Manage configuration of alert evaluations
CREATE TABLE alert_rules (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    rule_name VARCHAR(128) NOT NULL,
    promql_expression TEXT NOT NULL,
    duration_interval_sec INT NOT NULL DEFAULT 15,
    for_duration_sec INT NOT NULL DEFAULT 300,    -- Alert must be active for this long to fire
    severity VARCHAR(32) NOT NULL,                -- 'CRITICAL', 'WARNING', 'INFO'
    status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- 'ACTIVE', 'PAUSED'
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_alert_rules_tenant 
ON alert_rules (tenant_id, status);

-- Track alert evaluation states and pending windows
CREATE TABLE alert_evaluation_states (
    rule_id UUID NOT NULL REFERENCES alert_rules(id) ON DELETE CASCADE,
    tenant_id VARCHAR(64) NOT NULL,
    current_state VARCHAR(32) NOT NULL,           -- 'OK', 'PENDING', 'FIRING'
    first_triggered_at TIMESTAMPTZ,
    last_evaluated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_evaluated_value DOUBLE PRECISION,
    PRIMARY KEY (rule_id)
);

Relational Representation of Time-Series Storage (For Conceptual Reference)

-- Unique time series identity table (Metric Name + sorted Labels hash)
CREATE TABLE metric_series (
    series_id BIGSERIAL PRIMARY KEY,
    tenant_id VARCHAR(64) NOT NULL,
    metric_name VARCHAR(256) NOT NULL,
    labels_hash CHAR(64) NOT NULL,                -- SHA-256 hash of sorted label array
    labels JSONB NOT NULL,                        -- Store label key-values: {"service":"checkout","status":"500"}
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uk_tenant_series UNIQUE (tenant_id, metric_name, labels_hash)
);

CREATE INDEX idx_metric_series_hash ON metric_series (labels_hash);

-- Granular metrics samples
CREATE TABLE metric_samples (
    series_id BIGINT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    PRIMARY KEY (series_id, timestamp)
) PARTITION BY RANGE (timestamp);

Scaling and Operational Challenges

To design a system that can process millions of samples per second, we must evaluate write bandwidth, sharding logic, and memory structures.

Back-of-the-Envelope Capacity Estimations

Let us estimate capacity requirements for a large-scale system:

Active Time-Series: 10,000,000 unique series monitored across all tenants.
Collection Interval: Scraped or pushed every 10 seconds.
Ingestion Sample Throughput: $$\text{Throughput} = \frac{10,000,000\text{ series}}{10\text{ seconds}} = 1,000,000\text{ samples/second}$$
Compressed Sample Size: Using Gorilla time-series compression, we compress timestamp-value pairs to 1.5 bytes per sample on average.
Storage Ingestion Write Rate: $$\text{Storage ingestion} = 1,000,000\text{ samples/sec} \times 1.5\text{ bytes} = 1,500,000\text{ bytes/sec} \approx 1.5\text{ MB/sec}$$ This is low due to compression. However, the raw incoming JSON payload is much larger: $$\text{JSON Sample Size} \approx 150\text{ bytes (uncompressed JSON)}$$ $$\text{Raw network write ingest} = 1,000,000 \times 150\text{ bytes} = 150\text{ MB/sec} \approx 1.2\text{ Gbps}$$
Sharding & Memory Buffers: To handle 1.2 Gbps network ingress, we deploy a pool of 50 ingestion workers. Each worker buffers samples in memory before flushing:
- We buffer 2 hours of raw data in memory to support efficient compression.
- Number of samples per 2 hours: $$\text{Samples per 2h} = 1,000,000\text{ samples/sec} \times 7,200\text{ seconds} = 7.2 \times 10^9\text{ samples}$$
- Compressed memory footprint: $$\text{Memory required} = 7.2 \times 10^9 \times 1.5\text{ bytes} \approx 10.8\text{ GB}$$ This memory footprint is easily sharded across the 50 workers, requiring less than 220 MB RAM per node.

Trade-offs and Architectural Alternatives

Pull Scraping vs. Push Remote-Write

Dimension	Pull-based (Scraping)	Push-based (Remote Write)
Control Plane	Scraper determines rate and intervals; easy to detect when a target is down.	Client determines write rates; requires ingestion gateways to rate-limit clients.
Network Topology	Requires direct path access from scraper to service ports (difficult behind NAT).	Easy to integrate across Kubernetes namespaces, edge nodes, and cloud environments.
Workload Overhead	Zero client-side push queues or buffer logic.	Client must buffer and retry failed writes, consuming application CPU and memory.

Time-Series Storage Format Alternatives

Prometheus TSDB Block Layout:
- Pros: High compression ratios (Gorilla/Double-delta); optimized for date-range range scans on metrics.
- Cons: Difficult to update or delete single records; does not support relational SQL queries.
Wide-Column NoSQL (Cassandra / InfluxDB):
- Pros: Scales writes easily; supports flexible schemas and custom tags.
- Cons: High memory usage for indexes; index cardinality limit can cause tombstone failures.
General-Purpose Relational DB (PostgreSQL + TimescaleDB):
- Pros: Relational joins, transactional guarantees, standard SQL query support.
- Cons: Higher storage cost per sample; slower write path compared to custom TSDB files.

Failure Modes and Fault Tolerance Strategies

High-Cardinality Label Explosion Outage

If a developer deploys a change that adds a high-cardinality label (e.g., user_id or request_id) to a metric, every request creates a new time series. This cause index bloat and can crash the TSDB indexing service.

Mitigation: We enforce an ingestion cardinality filter. We track active series hashes in Redis using HyperLogLog. If the new series rate for a tenant exceeds 1,000/minute, we drop the high-cardinality samples and raise an alert.

Alert Evaluation Starvation During Query Storms

If dashboard queries (Grafana refreshes) and alert evaluations share the same query execution pool, a sudden spike in dashboard usage during an incident can block alert evaluation.

Mitigation: We dedicate separate read paths. Alert evaluation queries run against a read-replica database cluster, while Grafana query execution is routed to a separate, rate-limited query pool.

Scrape Target Discoverability Outage

During Kubernetes cluster scaling, scrapers may fail to resolve scrape target paths, leading to missing metrics.

Mitigation: We run local node scrapers as DaemonSets. Each DaemonSet scraper targets pods running on the same VM node using localhost addresses, eliminating dependency on central service discovery networks.

Staff Engineer Perspective

[!WARNING] Enforcing Strict Cardinality Budgets per Tenant Unbounded cardinality can crash a metrics platform. If a client injects a high-cardinality tag (like an email address or request ID), the storage index size grows exponentially. We prevent this by checking new series requests against a Redis-backed cardinality tracker. If a tenant exceeds their active series budget, we drop the new series and return a rate-limit error:
async function checkCardinalityLimit(tenantId: string, seriesHash: string): Promise<boolean> {
  const budget = await db.getTenantBudget(tenantId);
  const activeSeries = await redis.sismember(`series:${tenantId}`, seriesHash);
  
  if (!activeSeries) {
    const currentCount = await redis.scard(`series:${tenantId}`);
    if (currentCount >= budget.maxActiveSeries) {
      metricsLogger.increment("cardinality_rejections_total");
      return false; // Reject ingestion
    }
    await redis.sadd(`series:${tenantId}`, seriesHash);
  }
  return true; // Allow ingestion
}

[!NOTE] Implementing Downsampling and Rollup Policies Keeping raw telemetry data forever is expensive. We implement rollup policies to downsample data over time, saving storage costs while preserving historical trends:

Raw data (10s interval) is stored for 14 days.

Data is downsampled to 5-minute intervals for 90 days.

Data is downsampled to 1-hour intervals for 2 years. Counters must be aggregated using sums, while gauges must be aggregated using averages, minimums, and maximums.

Verbal Script

Interviewer: "How would you design a metrics platform to handle a cardinality explosion caused by a developer adding a user ID label to an HTTP request metric?"

Candidate: "I would implement guardrails at the ingestion layer. A cardinality explosion occurs when unique, high-cardinality labels create millions of time series, bloating the TSDB index and degrading query performance.

First, I would enforce an blocklist of known high-cardinality labels (e.g., user_id, request_id, email) and reject any metric containing them.

Second, I would enforce active series limits per tenant. We hash each metric's labels and register the hash in a Redis set. If the set's size exceeds the tenant's configured budget, we drop the new series and return a rate-limit error.

Finally, we track the rate of new series creation. If a tenant creates series at an unusual rate (e.g., more than 1,000 new series per minute), we throttle their ingestion and send an alert to the platform team."

Interviewer: "How do you ensure alerts continue to evaluate on schedule when users are heavily querying Grafana dashboards during an incident?"

Candidate: "We isolate query execution paths. Alert evaluation and dashboard queries should never share resources.

We dedicate a separate read gateway and replica database nodes for alert evaluations. Dashboard queries are routed to a separate query gateway that is rate-limited.

If dashboard queries surge, they are throttled at the gateway, ensuring the alert evaluator has sufficient read capacity.

We also cache dashboard query results. Since multiple users often look at the same dashboards during an incident, caching prevents redundant database scans and keeps query load stable."