System Design: Building a Usage Metering and Billing Platform

Usage Metering and Billing Platform

At first glance, usage-based billing looks like a straightforward engineering task: count what a customer consumed (e.g., api calls, server hours, active users) and multiply it by a unit price.

In production, however, a usage metering and billing system is one of the most sensitive and operationally complex systems in an enterprise.

If your platform undercounts usage, you experience direct revenue loss. If you overcount, you break customer trust, violate service level agreements (SLAs), and generate support overhead. If your billing metrics drift or are non-reproducible, finance teams cannot close monthly books. Managing these challenges requires designing a highly consistent, auditable, and idempotent event aggregation engine.

Requirements and System Goals

A production-grade usage metering and billing platform must satisfy strict operational and consistency goals:

Functional Requirements

Usage Event Ingestion: Ingest structured usage events (meters) from various application services at high write throughput.
Ingestion and Aggregation Idempotency: Guarantee exactly-once aggregation processing, preventing duplicate request retries from double-billing customers.
Flexible Aggregation Windows: Aggregate raw usage events into hourly, daily, and monthly billing buckets categorized by customer, product, and specific meter type.
Multi-Tenant Versioned Pricing: Apply complex pricing configurations (including free tiers, tiered usage, volume-based pricing, and custom negotiated contracts).
Continuous Reconciliation and Audits: Provide an automated, queryable system to audit and reconcile invoices back to individual raw usage events.

Non-Functional Requirements

Zero-Loss Data Integrity: Never drop a committed usage event. The platform must prioritize data durability over immediate availability.
Lateness Tolerance: Gracefully handle late-arriving events (e.g., from offline clients or delayed message queues) without corrupting closed billing cycles.
Auditable Ledger State: Maintain an append-only transaction history. Direct deletions or mutations of usage logs are strictly prohibited.
Operational Isolation: Isolate high-volume indexing backfills and contract recalculations from the hot real-time event ingestion path.

API Interfaces and Service Contracts

The billing platform exposes APIs for event ingestion, dynamic pricing contract configurations, and invoice generation.

graph TD
    Services[Product Services] -->|1. POST /v1/usage/events| Ingestion[Ingestion API Gateways]
    Ingestion -->|2. Buffer Events| Kafka[Kafka Event Bus]
    Billing[Billing Engine] -->|3. POST /v1/invoices| Invoice[Invoice Generator]

1. Ingest Usage Event

Application services emit usage metrics immediately after executing an operation.

Endpoint: POST /v1/usage/events
Headers:
- Content-Type: application/json
- X-Tenant-ID: tenant_aws_cloud_prod

Request Payload:

{
  "event_id": "evt_meter_20260605_abc123",
  "customer_id": "cus_enterprise_netflix",
  "product": "ai-translation-engine",
  "meter": "characters_translated",
  "quantity": 145000,
  "unit": "character",
  "occurred_at": "2026-06-05T12:40:00Z",
  "source": {
    "service": "translation-worker-pool",
    "request_id": "req_api_9921a"
  },
  "attributes": {
    "source_language": "en",
    "target_language": "es"
  }
}

Response: 202 Accepted (Event queued for processing)

2. Configure Pricing Plan Version

Finance teams define contracts using versioned tier templates:

Endpoint: POST /v1/pricing/plans
Request Payload:

{
  "plan_id": "plan_enterprise_tier_v3",
  "plan_name": "Enterprise AI Translation Tier",
  "version": 3,
  "effective_from": "2026-06-01T00:00:00Z",
  "effective_to": "2026-12-31T23:59:59Z",
  "currency": "USD",
  "meters": {
    "characters_translated": {
      "included_allowance": 1000000,
      "aggregation_model": "SUM",
      "tiers": [
        { "up_to": 5000000, "price_per_unit": 0.00005 },
        { "up_to": null, "price_per_unit": 0.00004 }
      ]
    }
  }
}

Response: 201 Created

High-Level Design and Visualizations

To scale billing operations, we separate Real-Time Event Aggregation from the Batch Invoicing and Reconciliation paths.

Billing and Aggregation Data Pipeline

sequenceDiagram
    autonumber
    participant App as Application Service
    participant Ingest as Ingestion Gateway (Deduplicator)
    participant Kafka as Kafka Event Broker
    participant Flink as Flink Stream Processor
    participant DB as Postgres Billing DB (Buckets)
    participant Raw as S3 Immutable Event Store

    App->>Ingest: POST /v1/usage/events (event_id: evt_123)
    Note over Ingest: Verify event format,<br/>check Redis unique filter
    Ingest->>Kafka: Publish event to 'usage.events' (Key: customer_id)
    Ingest-->>App: Return HTTP 202 Accepted
    
    par Stream Processing
        Kafka->>Flink: Stream events
        Note over Flink: Aggregate events in sliding windows<br/>(Deduplicate via state backend)
        Flink->>DB: UPSERT hourly usage buckets (ON CONFLICT DO UPDATE)
    and Raw Archiving
        Kafka->>Raw: Batch dump raw JSON events to S3 partition
    end

Late-Event Reconciliation Workflow

Usage events can arrive late due to client-side offline buffering or network partitions. We process these using adjustment logs rather than modifying closed database buckets.

sequenceDiagram
    autonumber
    participant App as Reconnected Client
    participant Ingest as Ingestion API
    participant DB as Postgres Billing DB

    App->>Ingest: Ingest usage event (occurred_at: June 1st, Ingested on: June 15th)
    Note over Ingest: Ingested date (June 15th) is AFTER June 10th (Billing Cycle Lock date)
    Ingest->>DB: Write to usage_events table (flagged as late-arriving)
    Note over DB: Aggregation engine bypasses June 1st closed buckets
    Ingest->>DB: Write to usage_adjustments ledger table (linked to June invoice)
    Note over DB: Finance processes late adjustment credit/charge on July invoice

Low-Level Design and Schema Strategies

To guarantee absolute auditability, all database tables are designed to support append-only entries and strict version control.

Database Schema (PostgreSQL)

-- Raw Usage Events table (Source of Truth)
CREATE TABLE usage_events (
    event_id VARCHAR(128) PRIMARY KEY,
    tenant_id VARCHAR(64) NOT NULL,
    customer_id VARCHAR(64) NOT NULL,
    product VARCHAR(128) NOT NULL,
    meter VARCHAR(128) NOT NULL,
    quantity NUMERIC(18, 4) NOT NULL,
    unit VARCHAR(32) NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL,
    ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    source JSONB NOT NULL,
    attributes JSONB NOT NULL DEFAULT '{}'
);

-- Index customer usage queries and partitioned time checks
CREATE INDEX idx_usage_events_reconciliation 
ON usage_events (customer_id, occurred_at DESC, meter);

-- Aggregated Hourly Buckets table (Derived cache)
CREATE TABLE usage_hourly_buckets (
    tenant_id VARCHAR(64) NOT NULL,
    customer_id VARCHAR(64) NOT NULL,
    product VARCHAR(128) NOT NULL,
    meter VARCHAR(128) NOT NULL,
    bucket_start TIMESTAMPTZ NOT NULL,
    quantity NUMERIC(18, 4) NOT NULL DEFAULT 0.0000,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (tenant_id, customer_id, product, meter, bucket_start)
);

-- Processed events log to avoid double-aggregation during worker retries
CREATE TABLE processed_aggregation_logs (
    event_id VARCHAR(128) PRIMARY KEY,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Usage adjustments ledger (Never delete/modify records)
CREATE TABLE usage_adjustments (
    adjustment_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id VARCHAR(64) NOT NULL,
    meter VARCHAR(128) NOT NULL,
    quantity NUMERIC(18, 4) NOT NULL,
    reason TEXT NOT NULL,
    related_invoice_id VARCHAR(128),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Idempotent Aggregation Database Transaction

When aggregating consumed events from Kafka, the processing thread must atomically mark the event as processed and update the hourly buckets.

-- 1. Begin atomic transaction
BEGIN;

-- 2. Attempt to register the event ID in the processed aggregation log
INSERT INTO processed_aggregation_logs (event_id)
VALUES ('evt_meter_20260605_abc123')
ON CONFLICT DO NOTHING;

-- Note: The application checks the affected rows count. 
-- If affected rows is 0, the event is a duplicate. Commit and skip aggregation!

-- 3. Upsert aggregate bucket increment
INSERT INTO usage_hourly_buckets (tenant_id, customer_id, product, meter, bucket_start, quantity)
VALUES (
    'tenant_aws_cloud_prod',
    'cus_enterprise_netflix',
    'ai-translation-engine',
    'characters_translated',
    DATE_TRUNC('hour', TIMESTAMPTZ '2026-06-05T12:40:00Z'),
    145000
)
ON CONFLICT (tenant_id, customer_id, product, meter, bucket_start)
DO UPDATE SET
    quantity = usage_hourly_buckets.quantity + EXCLUDED.quantity,
    updated_at = NOW();

-- 4. Commit transaction
COMMIT;

Scaling and Operational Challenges

Managing usage billing for large SaaS companies requires handling high write traffic during concurrent event spikes without causing database connection pool starvation.

Stream Processing IOPS Scaling Calculations

Let us estimate the write IOPS (Input/Output Operations Per Second) reductions achieved when implementing a stream aggregation layer (e.g., Apache Flink) compared to writing every raw event directly to a relational database.

Let:

$E$ = Total incoming usage events = $100,000,000$ events/day.
$C$ = Number of active customers = $10,000$ customers.
$M$ = Average number of meters tracked per customer = $5$ meters.
$T$ = Event distribution = Uniformly distributed over 24 hours.

First, calculate the average raw events per second (Ingestion QPS):

$$\text{QPS}_{\text{raw}} = \frac{100,000,000 \text{ events}}{24 \text{ hours} \times 3600 \text{ seconds}} \approx 1,157 \text{ requests/second}$$

Without Stream Aggregation: Writing every event directly to a relational database requires $1,157$ write IOPS. During peak hours, this can spike by a factor of 10, requiring over 11,500 write IOPS, which causes high index locking overhead on PostgreSQL.
With Stream Aggregation (Apache Flink): Flink buffers events in memory, groups them by {customer_id, meter}, and flushes aggregated totals to PostgreSQL in 1-minute micro-batches.

Calculate the maximum database writes per minute with Flink:

Total unique combinations of customers and meters = $C \times M = 10,000 \times 5 = 50,000$ possible active metric streams.
Even if all 50,000 streams are active in a given minute, Flink writes at most 1 update per stream per minute.

$$\text{Writes}{\text{minute}} = 50,000 \text{ updates/minute}$$ $$\text{IOPS}{\text{database}} = \frac{50,000}{60 \text{ seconds}} \approx 833 \text{ write IOPS}$$

Scaling Insights

The database load becomes bounded by the cluster size (number of metric streams) rather than scaling linearly with raw request traffic. Even if ingestion QPS increases from 1,157 to 50,000 events/second (a 43x increase), the database write IOPS remains flat at approximately 833 IOPS, protecting your transactional database.

Trade-offs and Architectural Alternatives

Backend engineers must evaluate trade-offs in aggregation models and storage architectures:

Dimension	Stream Processing Aggregation (Apache Flink)	Batch Aggregation (Apache Spark / SQL)	Real-Time Cache Tracking (Redis)
Data Freshness	Near Real-Time (Aggregates updated in sub-seconds/minutes).	Delayed (Typically run as hourly or nightly batch cron jobs).	Instantaneous (Sub-millisecond updates).
Compute Overhead	High (Requires running and maintaining a Stateful Flink cluster).	Low (Compute resources are only spun up during scheduled runs).	Low (Simple key-value operations).
Durable Precision	Excellent (Utilizes RocksDB checkpointing state backends).	Excellent (Re-reads immutable raw storage logs).	Low (Subject to data loss if Redis nodes experience un-replicated crashes).
Auditability	High (Streams to both database buckets and cold storage archives).	High (Reconstructs data by querying raw source files).	Poor (Only maintains the current running total value; deletes historical sequences).
Typical Use Case	Real-time dashboards, SaaS overage alerts, quota check gates.	Monthly invoice runs, financial audits, historical analytics.	Short-term rate-limiting counters.

Failure Modes and Fault Tolerance Strategies

1. Ingestion Deduplication Failures

If an application retry publishes the same event twice because of a network timeout, the customer is double-billed.

Resolution Strategy: Use Distributed Bloom Filters combined with database primary key constraints. At the API Gateway, we check if the incoming event_id exists in a high-speed Redis Bloom filter (configured with a 24-hour TTL). If the filter returns a positive match, we verify against the raw database. If confirmed, we drop the duplicate event and return HTTP 200 OK, ensuring safe client retries.

2. Dropped Events during Message Ingestion

If the Kafka cluster goes offline or experiences a partition split, usage events can be lost.

Resolution Strategy: Apply the Transactional Outbox Pattern on all product services. Application nodes never write to Kafka synchronously. Instead, they write usage events to a local database transactional_outbox table in the same transaction as the user action. A separate, resilient relay process reads this table and publishes to Kafka with an at-least-once guarantee.

3. Closed Billing Period Adjustments

An offline device reconnects after 15 days, emitting usage events that occurred during the previous month's billing cycle, which has already been closed, paid, and invoiced.

Resolution Strategy: Enforce a Lockout Window (e.g., the 5th day of the next month). Any event with an occurred_at timestamp older than the current closed billing period is routed to the usage_adjustments ledger. The billing engine calculates the value of these late events and appends them as a credit or charge line item on the current active invoice, preserving the integrity of historical paid invoices.

Staff Engineer Perspective

The Golden Rule: Never Delete or Mutate Billing Logs

One of the most critical security and compliance guidelines for billing platforms is to treat raw usage logs as an immutable ledger:

If a customer is incorrectly billed because of a bug, never execute UPDATE usage_events or DELETE FROM usage_hourly_buckets to fix the numbers.
Doing so destroys the audit trail required by corporate accounting and SOC2 compliance.
Best Practice: Always execute an append-only adjustment. Write an explicit negative usage record to the usage_adjustments table linking back to the original event ID. The billing engine applies the adjustment dynamically during invoice runs, ensuring the entire correction process remains transparent and auditable.

Continuous Reconciliation Audits

To detect data drift or processing anomalies before they reach the user's invoice:

Deploy a background cron engine that runs daily Reconciliation Audits.
This engine runs a query summing all raw usage events in usage_events for a customer and compares the total against the aggregates stored in usage_hourly_buckets.
If a delta is detected, the engine raises a high-priority alert to the operations team, preventing corrupt invoices from being sent out at the end of the month.

Verbal Script

Interviewer: "How would you design a usage-based billing and metering platform for a high-volume SaaS platform, and how do you guarantee data consistency?"

Candidate:

"To design a highly consistent and audit-safe usage metering and billing platform, I would model the system as an event-driven aggregation engine based on the principle of immutable raw event storage.

When product services record consumption, they use the Transactional Outbox Pattern to write usage events containing a unique event_id to their local database outbox table. This guarantees that we never drop an event because of network failures. A relay engine reads the outbox and publishes the events to a partitioned Kafka topic, keying messages by customer ID to maintain strict ordering.

An API Gateway interceptor deduplicates incoming events using a Redis Bloom filter. To ensure data consistency and prevent double-billing, deduplication must occur at two distinct stages:

At Ingestion: We use event_id as the primary key on our Postgres usage_events table. Any duplicate network retry triggers a unique constraint violation, which we catch and safely ignore.
At Aggregation: We execute a gRPC consumer fleet or stream processor like Apache Flink. The Flink workers aggregate events in memory, writing to Postgres hourly buckets. Within the update transaction, we atomically insert the event_id into a processed_aggregation_logs table using ON CONFLICT DO NOTHING. If the row insert is skipped, we bypass the bucket increment, preventing double-aggregation.

To scale the data layer, we use Flink to write metrics in 1-minute micro-batches. This reduces database write IOPS from a linear scale of 11,500 peak requests down to a flat 833 IOPS, protecting our database.

Finally, we enforce a strict append-only policy. We never delete records to fix billing bugs; instead, we write adjustment records to a ledger. Daily reconciliation crons constantly verify that the sum of raw events matches our hourly buckets, ensuring that any drift is detected and resolved before monthly invoices are generated."