System DesignAdvancedguide

HyperLogLog at Scale: Billion-Cardinality Estimation

How to count a billion unique users using only 12KB of memory. Deep dive into the mathematics, hashing, register sizing, and merge algorithms of HyperLogLog.

Sachin SarawgiApril 20, 202614 min read14 minute lesson

Reading Mode

Reduce distractions and widen the article focus for long-form reading.

Key Takeaways

What you will learn

**Billion Scale Memory:** Storing 1 billion unique IDs in a standard Set consumes gigabytes of RAM; HyperLogLog reduces this footprint down to exactly 12KB.

**The Math Matrix:** Uses hash bits to select a register bucket and counts the position of the first leading 1, estimating cardinality via the harmonic mean.

**Merge Property:** HyperLogLog data structures are highly mergeable, enabling parallel map-reduce aggregates across regional datasets.

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

Counting unique items (such as Daily Active Users - DAUs, unique page views, or IP addresses) is a classic problem in high-scale data engineering. In distributed architectures, maintaining a standard unique Set across billions of events is computationally impossible, as memory consumption grows linearly ($O(N)$). HyperLogLog (HLL) is a probabilistic algorithm that resolves this, estimating cardinality with an accuracy of 99% while utilizing a fixed, tiny memory footprint of just 12KB.


System Requirements

To estimate cardinality at massive scale, we establish the following requirements for a global analytics engine:

Functional Requirements

  • Element Insertion: The system must accept incoming user IDs or event keys and dynamically update the cardinality estimation registers.
  • Cardinality Retrieval: The system must return the estimated count of unique elements with a defined standard error.
  • Vector Merging: The system must support merging multiple separate regional HLL data vectors into a single global estimate.
  • Batch Update Operations: The ingestion pipeline must support bulk additions of elements to allow stream buffering.

Non-Functional Requirements

  • Accuracy Bounds (Standard Error): The estimated cardinality must maintain a standard error ($\sigma$) of less than 1.04%. The standard error is mathematically defined as $\sigma = \frac{1.04}{\sqrt{m}}$, where $m$ is the number of register buckets.
  • Memory Constraints: The total memory consumption per HLL data structure must not exceed 12KB, independent of the number of elements inserted (even when counting billions of unique items).
  • Throughput SLA: The ingestion path must sustain write operations under 1ms, enabling live event-stream counting.
  • High-Concurrency Support: The query engine must be able to resolve merged cardinalities across hundreds of regional vectors in less than 50 milliseconds.

API Design and Interface Contracts

To coordinate cardinality estimations across streaming architectures, we expose HLL command structures via HTTP REST and gRPC endpoints. Below is a structured JSON API payload representing the request to retrieve merged cardinality estimations across regional event logs, as well as an ingestion endpoint.

1. Ingestion Endpoint (Client to Event Ingestion Proxy)

POST /api/v1/events/track

{
  "event_id": "evt_908127391823",
  "stream_name": "user_signups:2026-06-16",
  "user_identifier": "user_usr_01jk9888az",
  "timestamp": "2026-06-16T18:15:57Z"
}

2. HLL Merge Request Payload (Analytics Service to HLL Aggregator)

POST /api/v1/cardinality/merge

{
  "operation": "MERGE_CARDINALITY",
  "target_key": "global:dau:2026-05-23",
  "source_keys": [
    "region-us:dau:2026-05-23",
    "region-eu:dau:2026-05-23",
    "region-ap:dau:2026-05-23"
  ],
  "precision_bits": 14,
  "options": {
    "enable_bias_correction": true,
    "fallback_linear_counting": true
  }
}

3. API Response (HLL Aggregator to Client)

{
  "target_key": "global:dau:2026-05-23",
  "estimated_cardinality": 120455980,
  "standard_error": 0.0081,
  "merged_vectors_count": 3,
  "completed_at": "2026-05-23T10:00:05.123Z"
}

High-Level Architecture

The core of HyperLogLog relies on the mathematical properties of hashing: if you hash random inputs, the maximum number of leading zeros in the binary representation indicates the scale of the dataset.

Our real-time ingestion pipeline consists of an Event Producer forwarding clicks to a Kafka Topic, which is consumed by Apache Flink. Flink maintains the local state registers of the HLL estimator in memory. Flink periodically flushes the register arrays to a Redis cluster, which supports native bit-level updates and vector merges.

1. Register Bucketing (Estimator Path)

When an element is inserted, it is hashed into a 64-bit integer. The first $b$ bits select a target register bucket ($m = 2^b$). The remaining bits are evaluated to count the position of the first leading 1 bit ($\rho$). The target register's value is updated to hold the maximum observed $\rho$.

graph TD
    Input[Incoming ID: "user-9988"] -->|Hash: MurmurHash3| Binary["64-bit Binary: 01001100 0000000000000000000000001001"]
    
    subgraph Slicing["Bit Slicing"]
        Binary -->|First b bits| Register["Register Index Bucket (m)"]
        Binary -->|Remaining bits| Zeros["Position of first 1 bit (\u03c1)"]
    end
    
    Register -->|Select| RegArray["Register Array [0 .. m-1]"]
    Zeros -->|Compare & Update Max| RegArray
    
    %% Style annotations
    classDef hashColor fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    class Binary,RegArray hashColor;

2. Distributed Map-Reduce HLL Merge

Because HLL registers only store the maximum observed zero counts, merging two HLL structures is mathematically simple: we perform an element-wise maximum comparison across the register arrays. This property enables parallel map-reduce aggregates without sharing raw user IDs.

sequenceDiagram
    autonumber
    participant US as US-East HLL (Array A)
    participant EU as EU-West HLL (Array B)
    participant Aggregator as HLL Aggregator
    
    US->>Aggregator: Push HLL Vector (12KB Array)
    EU->>Aggregator: Push HLL Vector (12KB Array)
    Note over Aggregator: For i in 0..m-1:<br/>Global[i] = Max(A[i], B[i])
    Aggregator-->>Aggregator: Calculate Harmonic Mean
    Aggregator-->>US: Return Global Estimated Cardinality

Low-Level Design and Schema

Below is a production-ready, compilable Java class implementing the HyperLogLog Estimator. It uses MurmurHash3, bitwise operators, and harmonic mean calculations combined with linear counting fallbacks for small datasets:

package com.codesprintpro.performance;

import java.nio.charset.StandardCharsets;

public class HyperLogLogEstimator {
    private final int b; // Number of precision bits
    private final int m; // Number of registers (m = 2^b)
    private final byte[] registers;
    private final double alphaM;

    public HyperLogLogEstimator(int precisionBits) {
        if (precisionBits < 4 || precisionBits > 16) {
            throw new IllegalArgumentException("Precision bits must be between 4 and 16");
        }
        this.b = precisionBits;
        this.m = 1 << b;
        this.registers = new byte[m];
        
        // Calculate constant alpha_m based on register scale
        if (m == 16) this.alphaM = 0.673;
        else if (m == 32) this.alphaM = 0.697;
        else if (m == 64) this.alphaM = 0.709;
        else this.alphaM = 0.7213 / (1.0 + 1.079 / m);
    }

    /**
     * Inserts an element into the HyperLogLog registers.
     */
    public void add(String element) {
        long hash = murmurHash64(element.getBytes(StandardCharsets.UTF_8));
        
        // 1. Extract register index using first b bits
        int index = (int) (hash >>> (64 - b));
        
        // 2. Count leading zeros in the remaining bits
        long remaining = (hash << b) | (1L << (b - 1)); // Ensure sentinel bit
        int rho = Long.numberOfLeadingZeros(remaining) + 1;
        
        // 3. Update register if the new zero-count is greater than the old value
        if (rho > this.registers[index]) {
            this.registers[index] = (byte) rho;
        }
    }

    /**
     * Computes the estimated cardinality of the dataset.
     */
    public long estimate() {
        double sum = 0.0;
        int zeroCount = 0;
        
        for (int i = 0; i < m; i++) {
            sum += Math.pow(2.0, -this.registers[i]);
            if (this.registers[i] == 0) {
                zeroCount++;
            }
        }
        
        // Harmonic mean estimator formula
        double rawEstimate = this.alphaM * m * m / sum;
        
        // Linear Counting fallback for small datasets to reduce bias
        if (rawEstimate <= 2.5 * m) {
            if (zeroCount > 0) {
                return Math.round(m * Math.log((double) m / zeroCount));
            }
        }
        return Math.round(rawEstimate);
    }

    private static long murmurHash64(byte[] data) {
        // Simple MurmurHash3 64-bit placeholder for Java compilation
        long h = 0xe17a1465L;
        for (byte b : data) {
            h ^= b;
            h *= 0xc6a4a7935bd1e995L;
            h >>>= 47;
        }
        return h;
    }
}

Relational Database Storage Schema

If we need to persist these registers inside a relational database (e.g. PostgreSQL), we avoid storing each register as a separate column or row. Instead, we use a single binary BYTEA column or BLOB column, indexing the record by the stream name and temporal partition (e.g., date).

CREATE TABLE hll_cardinality_snapshots (
    stream_id VARCHAR(255) NOT NULL,
    snapshot_date DATE NOT NULL,
    precision_bits INT NOT NULL DEFAULT 14,
    register_data BYTEA NOT NULL, -- The m-size (16,384 bytes for b=14) binary array
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (stream_id, snapshot_date)
);

CREATE INDEX idx_hll_snapshots_date ON hll_cardinality_snapshots(snapshot_date);

Scaling Challenges and Capacity Estimation

Probabilistic scaling patterns present distinct low-level bottlenecks when moving from prototypes to billion-cardinality streams:

1. Hash Collision Bias and Hashing Performance

If the hashing function does not distribute keys uniformly across the binary spectrum, registers will cluster. This clustering skews the harmonic mean and violates our standard error SLA.

  • Capacity Impact: To estimate 1 billion unique users with $b=14$ bits, the register count $m$ is $2^{14} = 16,384$. Using a 32-bit hash function risks collisions due to the Birthday Paradox when the cardinality exceeds $2^{16} = 65,536$.
  • Mitigation: Utilize a 64-bit or 128-bit hash function like MurmurHash3 (128-bit) or xxHash, which guarantees uniform bit distribution across all register slices and avoids hash collision bias completely.

2. Register Cache-Line Misses

Under massive ingestion streams, writing to random indexes across the 12KB registers array creates CPU cache invalidations, forcing memory reloads and degrading write speeds.

  • Hardware Sizing: A typical CPU L1 cache line is 64 bytes. A 12KB registers array spans 192 cache lines. Random writes to these registers will miss the L1 and L2 caches once the write concurrency exceeds the cache capacity.
  • Mitigation: Deploy local thread-local register buffers. Batch writes into small arrays before performing bulk register merges, keeping execution within CPU L1/L2 caches.

3. Redis Memory Optimization (Sparse vs. Dense Representation)

Redis HLL clusters optimize memory using a dynamic representation. For low cardinalities, Redis structures HLL data in a sparse format, which consumes much less memory (often less than 1KB). As unique elements are added and more registers are updated, Redis automatically converts this representation into a dense format, which consumes exactly 12KB.

  • Mitigation: When designing cardinality aggregators, ensure that client applications do not attempt to read registers in real-time. Instead, delegate updates to PFADD and read cardinalities with PFCOUNT.

Architectural Trade-offs

Selecting the optimal cardinality strategy requires balancing resource costs:

Strategy Memory Profile Time Complexity Accuracy Bounds Key Limitations
Standard Set (e.g. HashSet) Linear ($O(N)$) $O(1)$ Absolute (100% correct) Memory footprint grows to gigabytes at scale.
Bloom Filter Fixed ($O(M)$) $O(K)$ Probabilistic Cannot return exact count (only membership).
HyperLogLog Fixed (12KB max) $O(1)$ Probabilistic ($\pm 1.04%$) Does not store elements (cannot list IDs).
Count-Min Sketch Fixed $O(1)$ Probabilistic Tracks frequency of elements, not unique counts.

Architectural Evaluation

  1. HashSet vs. HyperLogLog: A HashSet of 1 billion IDs (8-byte longs) takes at least 8GB of memory. HyperLogLog does not keep the actual elements in memory; it only updates zero counters. The memory requirement drops from 8GB to 12KB, representing a reduction factor of greater than 600,000.
  2. HLL vs. Bloom Filter: A Bloom Filter can tell you if a user has visited, but you cannot query it for the total unique visitors count. Calculating cardinality from a Bloom Filter requires counting set bits and applying probabilistic correction formulas, which becomes highly inaccurate as the filter fills up.

Failure Scenarios and Resilience

Probabilistic structures require defensive configurations to survive data anomalies and unexpected system restarts:

Scenario A: Zero Ingestion Bias (Small Cardinality Skew)

At low element counts (e.g., less than 50 unique items), the harmonic mean estimator is highly inaccurate. It returns skewed estimations that break dashboard views due to extreme variance in empty registers.

  • Resiliency Mitigation: Implement Linear Counting Fallback. The estimator tracks the count of empty registers ($V$) and uses logarithmic empty-space ratios to calculate cardinality when estimates fall below $2.5 \times m$. The formula used is $E = m \ln(m / V)$.

Scenario B: Dynamic Precision Drift and Folding

If regional database structures utilize different precision bits ($b=12$ in Region A and $b=14$ in Region B), attempting to merge their register arrays directly will corrupt the math vector because the bucket indexes represent different hash prefixes.

  • Resiliency Mitigation: If a merge must occur between mismatched precisions, the aggregator must down-sample the higher-precision register (Region B) to match the lower-precision bounds. This is accomplished by "folding" the registers: dividing the high-precision array into groups of size $2^{14-12} = 4$ and taking the maximum register value in each group.

Scenario C: Out-of-Order Event Updates

In high-throughput stream processing (e.g. Apache Flink), out-of-order events can arrive late. If the HLL register state is updated from an out-of-order stream, there is a risk of updating registers with stale values or losing events.

  • Resiliency Mitigation: Because the HLL update operation is monotonic (using the Math.max operator), the order of insertion does not affect the final state of the registers. This makes the HLL algorithm naturally resilient to network latency, duplicate events, and out-of-order delivery.

Staff Engineer Perspective

Precision Tuning Rules

When choosing $b$, note that each register takes 6 bits. Since 6 bits can hold values up to 64, this fits the maximum number of leading zeros in a 64-bit hash. Choosing $b=14$ gives $m = 16,384$ registers. The memory consumed is exactly: $$\text{Memory} = \frac{16,384 \times 6 \text{ bits}}{8 \text{ bits/byte}} = 12,288 \text{ bytes} \approx 12\text{KB}$$ This provides a standard error of $\frac{1.04}{\sqrt{16,384}} = 0.008125$ or $0.81%$, which is ideal for almost all DAU dashboards. Increasing precision to $b=16$ gives $m=65,536$, consuming 48KB of memory for a standard error of $0.41%$. Always assess if the accuracy improvement justifies a fourfold increase in cache-line utilization.


Verbal Script

Verbal Script: Cardinality Estimation at Scale

Interviewer: "How would you design a real-time analytics system to track the unique Daily Active Users (DAUs) for a platform with 1 billion active profiles, using minimal memory?"

Candidate: "If we attempted to track 1 billion unique user IDs using a standard HashSet or sorted database index, the memory footprint would grow linearly. Assuming an 8-byte UUID, storing 1 billion IDs requires at least 8GB of RAM, which is completely impractical for real-time memory caches. To resolve this, I would implement HyperLogLog (HLL). HLL maps incoming IDs to a 64-bit hash, slices the hash to select a register bucket, and tracks the maximum observed leading zeros. By allocating $2^{14}$ (16,384) 6-bit registers, we can estimate cardinality up to billions with a standard error of exactly 0.81% using only 12KB of memory."

Interviewer: "Excellent. How would you handle merging DAUs across multiple regional databases without sharing user IDs over WAN?"

Candidate: "This is one of the most powerful properties of HyperLogLog. Because HLL registers only store the maximum observed leading zero counts, the data vectors are highly mergeable. Each regional database only needs to generate and maintain its own 12KB HLL register array locally. When we need a global count, the regions push their 12KB arrays to a central aggregator. The aggregator performs an element-wise maximum comparison across the arrays in $O(1)$ time, returning the global cardinality without exposing any raw user identifiers. If the precision bits vary between regions, we can fold the higher-precision register arrays by taking the maximum value of adjacent buckets, matching the lower-precision bounds."

Interviewer: "What are the limitations of this approach, and how would you handle low-cardinality streams?"

Candidate: "The primary limitation is that HyperLogLog is a lossy, probabilistic estimator; it cannot list the actual user IDs. If that is required, we must use a cold storage data warehouse. For low-cardinality streams, the raw HLL harmonic mean estimator has a high bias because most registers are still zero. To resolve this, I would implement a Linear Counting fallback. When the raw estimate is less than or equal to 2.5 times the number of registers $m$, we count the number of zero-value registers $V$ and use a logarithmic empty-bucket ratio to calculate cardinality. This ensures our accuracy remains high even for streams with very few active users."

Practical engineering notes

Get the next backend guide in your inbox

One useful note when a new deep dive is published: system design tradeoffs, Java production lessons, Kafka debugging, database patterns, and AI infrastructure.

No spam. Just practical notes you can use at work.

Sachin Sarawgi

Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

Keep Learning

Move through the archive without losing the thread.

Related Articles

More deep dives chosen from shared tags, category overlap, and reading difficulty.

System DesignAdvanced

API Pagination at Scale: Why OFFSET 100,000 is a Database Killer

Designing a paginated API seems simple. Standard frameworks make it trivial: just use LIMIT 20 OFFSET 100. This works perfectly during development and for the first few pages of small tables. However, once your data scal…

Apr 20, 202611 min read
Deep DiveBackend Systems Mastery
#databases#java#performance
System DesignAdvanced

Event-Driven Architecture: CQRS and Event Sourcing in Practice

Mental Model In traditional CRUD (Create, Read, Update, Delete) architectures, the same database model is used for both writing and reading data. Under high traffic, this creates locking contention and complex SQL joins…

Apr 20, 202610 min read
Deep Dive
#performance#system-design
System DesignAdvanced

Bypassing the Kernel: User-Space Networking for Sub-Microsecond Performance

Mental Model For ultra-low-latency distributed systems—such as high-frequency trading (HFT) matching engines, real-time telemetry filters, and high-performance packet routers—even the optimized Linux kernel is too slow.…

Apr 20, 202611 min read
Deep DivePerformance & Optimization Mastery
#performance#system-design
System DesignAdvanced

gRPC Schema Evolution: Avoiding Breaking Changes

Mental Model > Connecting isolated components into a resilient, scalable, and observable distributed web. In globally distributed microservice architectures, deploying every service simultaneously to update an API is imp…

Apr 20, 202612 min read
Deep Dive
#performance#system-design

More in System Design

Category-based suggestions if you want to stay in the same domain.