Lesson 2 of 25 13 minDeep Systems

Designing a Database Sharding Strategy for 100 Million Users

A practical guide to horizontal sharding at scale: shard key selection, hot shard prevention, consistent hashing, cross-shard queries, and zero-downtime data migration with real fintech architecture examples.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • A viral merchant has 10M transactions, all landing on Shard 4
  • A batch job processes all users in `user_id` range 0–1M sequentially
  • A celebrity user account is read millions of times per day
Recommended Prerequisites
Database Sharding Part 1: The Vertical Ceiling

Premium outcome

Distributed systems mechanics for engineers building serious backend platforms.

Engineers who want stronger distributed-systems fundamentals for platform work.

You leave with

  • More confidence with consistency, causality, locking, and time in distributed systems
  • A stronger sense of which backend guarantees are expensive and why
  • The systems-level foundation needed for difficult architecture trade-offs

Vertical scaling has a ceiling. For most high-transaction applications, that ceiling arrives somewhere between 1 million and 10 million users, depending on write patterns and data size. At 100 million users, the question is not whether to shard — it's how to shard without destroying query capabilities, operational sanity, and transactional guarantees.

This article provides a production-grade playbook for horizontal database sharding at massive scale, drawing from architectures deployed at leading global fintech institutions.


System Requirements and Goals

Before diving into physical database architecture, we must define the scaling bounds and functional constraints. A system handling 100 million users cannot treat all data equally. We categorize requirements to establish a clear architectural target:

Functional Requirements

  1. User Accounts & Profiles: Storage and retrieval of user profile data.
  2. Transaction Ledger: High-throughput writes for credit/debit records.
  3. Merchant Payments: Support high-concurrency payment processing for merchant partners.
  4. Reporting Queries: Enable lookup of payment history filtered by user, date, or merchant.

Non-Functional Requirements

  1. High Write Throughput: The system must sustain a baseline of 5,000 writes per second, with peak spikes of 50,000 writes/sec (e.g., during promotional events).
  2. Low Latency: Read and write transactions must resolve within 50ms (P99) at the database tier.
  3. Strong Consistency for Ledgers: Financial account balances cannot tolerate eventual consistency; double-spending must be mathematically impossible.
  4. High Availability: 99.999% uptime (less than 5.26 minutes of unscheduled downtime per year). A single database server outage must not take down the entire system.

High-Level Design Architecture

To scale horizontally, the application tier must bypass the single-database bottleneck. This is achieved by introducing a routing layer that intercepts queries and directs them to specific database instances (shards) based on the sharding metadata.

Below is the high-level architecture showing how clients connect through a high-performance routing topology to access isolated database shards:

graph TD
    Client[Web/Mobile Clients] -->|HTTPS Requests| LB[Global Load Balancer]
    LB -->|Route| APIGW[API Gateway]
    APIGW -->|Authenticate & Forward| App[Application Microservices]
    App -->|SQL Query / Shard Key| Router[Application Shard Router / Proxy]
    Router -->|Consistent Hashing / Hash Ring| ShardMap{Shard Mapper}
    ShardMap -->|Write / Read Primary| Shard0P[(Shard 0 Primary)]
    ShardMap -->|Write / Read Primary| Shard1P[(Shard 1 Primary)]
    ShardMap -->|Write / Read Primary| Shard2P[(Shard 2 Primary)]
    
    Shard0P -->|Sync Replication| Shard0R[(Shard 0 Read Replica)]
    Shard1P -->|Sync Replication| Shard1R[(Shard 1 Read Replica)]
    Shard2P -->|Sync Replication| Shard2R[(Shard 2 Read Replica)]

In this architecture:

  • Application Microservices run statelessly and scale horizontally.
  • Shard Router (Proxy) uses consistent hashing to resolve the physical database endpoint (jdbc_url) dynamically.
  • Master-Replica Shard Pairs provide high availability. Writes always target the Primary, while heavy read traffic (such as batch reporting) can be offloaded to Read Replicas.

API Design and Endpoints

An horizontal sharded database requires API contracts to include parameters that allow the routing tier to resolve shards efficiently without parsing raw SQL queries. Below are the core API endpoints designed for this architecture:

1. Initiate a Payment Transaction

  • Endpoint: POST /v1/payments
  • Headers:
    • Content-Type: application/json
    • X-Idempotency-Key: pay_c1b2c3d4e5f6
  • Request Body:
{
  "user_id": 9876543210,
  "merchant_id": 11223344,
  "amount": 250.00,
  "currency": "USD",
  "payment_method": {
    "type": "credit_card",
    "token": "tok_visa_887766"
  },
  "description": "Premium Subscription Purchase"
}
  • Response (201 Created):
{
  "payment_id": "84c8a2b5-1234-4bc6-9477-d6b998246e7f",
  "user_id": 9876543210,
  "status": "processing",
  "created_at": "2026-05-22T17:31:00Z"
}

2. Retrieve Payment History for a User

  • Endpoint: GET /v1/users/{user_id}/payments
  • Query Parameters:
    • limit: 20
    • cursor: "eyJjcmVhdGVkX2F0IjoiMjAyNi0wNS0yMlQxNzoyMDowMFoiLCJpZCI6MTAwfQ=="
  • Response (200 OK):
{
  "data": [
    {
      "payment_id": "84c8a2b5-1234-4bc6-9477-d6b998246e7f",
      "amount": 250.00,
      "currency": "USD",
      "status": "completed",
      "created_at": "2026-05-22T17:31:00Z"
    }
  ],
  "paging": {
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNi0wNS0yMlQxNzoxNTowMFoiLCJpZCI6OTl9",
    "has_more": true
  }
}

Low-Level Design & Database Schema

In a sharded relational environment, schemas must be designed with extreme care. Cross-shard joins are structurally impossible at the database layer; therefore, related entities must either be co-located on the same shard or denormalized.

Here is the concrete PostgreSQL schema design for the core payment tables. The composite primary key includes both the shard key (user_id) and the resource ID (payment_id).

-- Create payments table, partitioned physically across 64 database hosts
CREATE TABLE payments (
    payment_id      UUID DEFAULT gen_random_uuid(),
    user_id         BIGINT NOT NULL,          -- Shard key (High Cardinality)
    merchant_id     BIGINT NOT NULL,          -- Secondary entity
    amount          DECIMAL(19,4) NOT NULL,
    currency        CHAR(3) NOT NULL,
    status          VARCHAR(20) NOT NULL,
    idempotency_key VARCHAR(255) NOT NULL,
    payment_method  JSONB NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, payment_id)         -- Composite primary key co-locates rows by user_id
);

-- Index created_at on each shard for efficient cursor-based pagination
CREATE INDEX idx_payments_user_created ON payments (user_id, created_at DESC);

-- Unique index for idempotency keys within a single shard context
CREATE UNIQUE INDEX idx_payments_idempotency ON payments (user_id, idempotency_key);

-- Schema design: ledger table to store accounting entries (co-located on same shard as user)
CREATE TABLE account_ledgers (
    ledger_id       UUID DEFAULT gen_random_uuid(),
    user_id         BIGINT NOT NULL,
    account_type    VARCHAR(30) NOT NULL,
    balance_change  DECIMAL(19,4) NOT NULL,
    running_balance DECIMAL(19,4) NOT NULL,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, ledger_id),
    FOREIGN KEY (user_id) REFERENCES payments(user_id) ON DELETE CASCADE
);

Node-level Shard Routing Code (Low-Level Router Implementation)

Here is how the application resolves the active database connection pool programmatically in TypeScript:

import { createHash } from 'crypto';

interface ShardConfig {
  shardId: number;
  jdbcUrl: string;
}

export class ShardRouter {
  private shardRing: Map<number, ShardConfig> = new Map();
  private numVirtualNodes = 150;

  constructor(private shards: ShardConfig[]) {
    this.initializeHashRing();
  }

  private hashKey(key: string): number {
    const hash = createHash('md5').update(key).digest();
    // Read first 4 bytes as unsigned 32-bit integer
    return hash.readUInt32BE(0);
  }

  private initializeHashRing(): void {
    for (const shard of this.shards) {
      for (let i = 0; i < this.numVirtualNodes; i++) {
        // vnode hash key
        const vnodeKey = `SHARD-${shard.shardId}-VNODE-${i}`;
        const hashVal = this.hashKey(vnodeKey);
        this.shardRing.set(hashVal, shard);
      }
    }
    // Sort keys to form a logical ring
    this.shardRing = new Map([...this.shardRing.entries()].sort((a, b) => a[0] - b[0]));
  }

  public getShard(userId: bigint): ShardConfig {
    const hashVal = this.hashKey(userId.toString());
    const ringKeys = Array.from(this.shardRing.keys());
    
    // Find the first vnode with hash >= userId hash
    const targetKey = ringKeys.find(key => key >= hashVal) ?? ringKeys[0];
    return this.shardRing.get(targetKey)!;
  }
}

Consistent Hashing

A naive hash-based sharding strategy (e.g., shard_id = user_id % num_shards) is catastrophic when the database needs to scale. Adding a single database node changes the modulus, requiring roughly 99% of historical data to be migrated across different physical nodes.

Consistent Hashing solves this operational nightmare by placing both database shards (via virtual nodes) and user IDs onto a logical $2^{32} - 1$ ring:

graph TD
    subgraph Consistent Hashing Ring
        Ring0[0 / 2^32] --> Ring1[Shard 0 - VNode 1]
        Ring1 --> Ring2[User A Hash]
        Ring2 --> Ring3[Shard 1 - VNode 1]
        Ring3 --> Ring4[User B Hash]
        Ring4 --> Ring5[Shard 2 - VNode 1]
        Ring5 --> Ring0
    end
    
    Ring2 -->|Mapped To| Ring3
    Ring4 -->|Mapped To| Ring5
  • When a new database instance is added to the ring, it intercepts a subset of the hashing space.
  • Only data belonging to that specific segment of the ring is migrated.
  • The average data migration load is reduced from $99%$ to exactly $1/N$, where $N$ is the new total number of shards.

To prevent skewed distributions (hot spots in the ring), each physical node is projected onto the ring multiple times as virtual nodes (vnodes). This ensures an even, statistically balanced distribution of records across all database clusters.


Scaling Challenges and Key Strategies

When sharding at 100 million users, several structural bottlenecks emerge at the application and infrastructure boundaries.

1. The Scatter-Gather Query Problem

If an administrator runs a query without the shard key, the application must execute the query across all 64 database shards in parallel and merge the results.

-- EXTREMELY EXPENSIVE: Must execute on all 64 shards
SELECT * FROM payments WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10;

Mitigation Strategy:

  • CQRS (Command Query Responsibility Segregation): Direct transactional writes to sharded databases. Push updates asynchronously via change data capture (CDC) tools (like Debezium) into a unified indexing store like Elasticsearch or a columnar database like ClickHouse. All search and analytics queries are served from the unified engine, protecting primary transactional shards from complex analytical workloads.

2. The Hot Shard / Celebrity Problem

If a massive merchant (e.g., a viral e-commerce platform) processes millions of transactions, a single shard might experience $90%$ of the write throughput while the remaining shards sit idle. Mitigation Strategy:

  • Salted Sharding Keys: Append a pseudo-random suffix (a "salt") to the shard key for hot records. For example, store merchant_12345 transactions under merchant_12345_0, merchant_12345_1, up to merchant_12345_7. During reads, scatter-gather query only those 8 known sub-keys instead of all 64 shards.

Technical Trade-offs and Modulus Modulo

Choosing a sharding dimension involves deep architectural compromises. Below is a staff-level trade-off matrix analyzing sharding keys:

Shard Key Dimension Data Distribution Query Performance (Single Record) Operational Complexity Cross-Shard Joins
user_id (Hash) Excellent (Evenly balanced) Fast (Single shard lookup) Low (Static routing) High (Requires application layer joins)
merchant_id (Hash) Prone to Hot Shards Fast for Merchant Ledgers Low High (Users span multiple shards)
created_at (Range) Poor (Creates massive write hotspots on new shards) Poor (Requires searching multiple ranges) High (Requires dynamic range updates) Low (Co-locates chronological data)

Failure Scenarios and Resilience Strategy

Operating 64 independent database shards increases the probability of a database instance failure by a factor of 64. The platform must be resilient to partial database degradations.

1. Active-Standby Failover

Each database shard must be provisioned as a multi-AZ deployment with synchronous replication. In the event of primary hardware failure, the cloud provider's DNS endpoints automatically pivot to the healthy replica within 30 seconds.

2. Zero-Downtime Data Rebalancing (The Double-Write Pattern)

When adding shards, data must be re-distributed live. The application router executes a 4-phase transition:

sequenceDiagram
    autonumber
    participant App as Application Router
    participant S_Old as Old Shard
    participant S_New as New Shard (Added)

    Note over App, S_New: Phase 1: Double Write Activation
    App->>S_Old: Write active payment record
    App->>S_New: Write active payment record (Double Write)
    App->>S_Old: Read queries always target Old Shard

    Note over App, S_New: Phase 2: Historical Data Backfill
    Note right of S_Old: CDC / Background Worker copies historical data to New Shard
    S_Old-->>S_New: Backfill data rows with checksum validation

    Note over App, S_New: Phase 3: Switch Reads
    App->>S_New: Execute Read Queries (Verify consistency)
    App->>S_Old: Write active payment record
    App->>S_New: Write active payment record

    Note over App, S_New: Phase 4: Cutover
    App->>S_New: Read & Write authoritative
    Note over S_Old: Decommission historical range

3. Graceful Degradation (Partial Availability)

If Shard 12 becomes completely unavailable due to network partition:

  • Users mapped to Shard 12 will receive a graceful error or temporary fallback page.
  • Critical Outcome: The remaining $98.4%$ of the 100M user base mapped to the other 63 healthy shards continue transacting without experiencing any latency spikes or system failures.

Staff Engineer Perspective


Verbal Script & Mock Interview

Here is an architectural walkthrough simulating a Senior Staff Engineer design interview:

Interviewer: "How do you design a database sharding strategy for 100 million users on a transactional payment platform?"

Candidate: "To scale a database to 100 million users with high write throughput, I would implement horizontal database sharding using Consistent Hashing with virtual nodes. First, I would establish the correct sharding dimension. For a payment system, user_id is the ideal shard key because it has extremely high cardinality, allowing for a highly balanced distribution of rows across shards, and it naturally co-locates all of a user's transaction history on a single physical node, enabling rapid local query execution.

To handle database routing dynamically, I would build or leverage an application-layer Shard Router. This router uses an MD5 hash ring with 150 virtual nodes per physical host to map user_id hashes to specific database connections. This shields our stateless application nodes from knowing where data physically resides.

To mitigate the Scatter-Gather problem for analytics queries, I would decouple reads from writes using a CQRS pattern. We would streams transactional inserts from our PostgreSQL shards using Kafka CDC into a ClickHouse database dedicated to analytical queries, ensuring administrators can run cross-user aggregates without touching primary OLTP hosts.

Finally, for high availability, each shard is structured as a Primary-Standby pair in a multi-availability zone configuration. If a shard fails, standard failover occurs, while a network partition on a single shard ensures that only a tiny fraction of our users are impacted, ensuring the overall payment system remains highly available and robust."

Interviewer: "Excellent. How would you handle a zero-downtime database migration if we need to double the shard count?"

Candidate: "I would execute a zero-downtime migration using the Double-Write Pattern. In Phase 1, we update the shard router to double-write all incoming transactions to both the active shard and the new shard, while still reading exclusively from the active shard. In Phase 2, a background data backfiller copies historical records to the new shard, performing cryptographic checksum checks on rows. Once backfilled, in Phase 3, we switch reads to the new shard to verify data integrity. Finally, in Phase 4, we cut off writes to the old shard, safely decommissioning the migrated range without a single second of system downtime."


Production Readiness Checklist

Ensure the following system components are fully verified before promoting this sharded architecture to production:

  • High Availability: Multi-AZ failovers verified under simulated partition conditions.
  • Consistent Hash Ring: Distribution variance verified to be under 2% across vnodes.
  • Circuit Breaking: Routed connections configured with rapid fail-fast timeouts to prevent cascading connection exhaustion.
  • CQRS Ingestion: CDC pipelines to ClickHouse verified with lag checks under peak simulated transaction workloads.
  • Double-Write Validation: Backfill synchronization scripts load-tested with zero race-condition failures.

Knowledge Check

MySQL · 3 Questions

Test Your Understanding

Ready to test yourself?

Answer 3 quick questions to reinforce what you just learned. Takes under 2 minutes.

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.