System Design: Designing a Database Proxy for Sharding (Vitess Style)

Mental Model

A relational database proxy abstracts the complexity of database sharding, connection management, and routing from the application layer. Rather than forcing application developers to write manual routing logic, manage dynamic database connection handshakes, or orchestrate multi-shard joins, the proxy acts as a single logical database endpoint. It intercepts incoming SQL, parses the abstract syntax tree in real-time, target-routes queries to specific database nodes based on shard range directory maps, and coordinates connection pooling to scale databases to millions of requests per second.

Requirements and System Goals

When designing an enterprise-grade database proxy for sharding (modeled after Youtube's Vitess), we must establish clear operational boundaries, latency budgets, and scale metrics.

1. Functional Requirements

Transparent SQL Routing: Intercept standard SQL queries and automatically direct them to the appropriate physical database shard based on a designated sharding key (e.g. user_id or tenant_id) without altering application code.
Read-Write Splitting: Direct write traffic (INSERT, UPDATE, DELETE) to the primary database master and route read traffic (SELECT) to read-only replicas.
Multi-Shard Orchestration: Perform transparent "Scatter-Gather" execution across multiple physical database partitions when queries do not include a sharding key.

2. Non-Functional Requirements & Performance Budgets

Ultra-Low Latency Overhead: The database proxy's internal processing overhead—including SQL parsing, routing lookup, and connection multiplexing—must introduce less than 2ms of latency per request.
Massive Connection Multiplexing: Support up to 100,000 concurrent application client connections while holding a small, fixed pool of less than 1,000 persistent connections to the underlying MySQL/Postgres shards.
High Availability & Fault Tolerance: Ensure automatic failure detection. If a primary database node crashes, the proxy must route traffic to a newly promoted master in less than 5 seconds without dropping active transaction sequences.

API Interfaces and Service Contracts

To manage a dynamic database proxy cluster, we must define routing and configuration contracts.

1. Shard Range Routing API Contract

This endpoint is used by the control plane (e.g. VtGate coordinating with the Topology Store) to define or update range-based sharding keys across physical database hosts.

POST /api/v1/shards/routes

Request Payload:

{
  "keyspace": "user_db",
  "sharding_scheme": "range_based",
  "shard_key": "user_id",
  "routes": [
    {
      "range_start": "00000000",
      "range_end": "7FFFFFFF",
      "physical_shard_id": "shard_00_7f",
      "primary_host": "mysql-shard-1-primary.corp.internal:3306",
      "replica_hosts": [
        "mysql-shard-1-replica-1.corp.internal:3306",
        "mysql-shard-1-replica-2.corp.internal:3306"
      ]
    },
    {
      "range_start": "80000000",
      "range_end": "FFFFFFFF",
      "physical_shard_id": "shard_80_ff",
      "primary_host": "mysql-shard-2-primary.corp.internal:3306",
      "replica_hosts": [
        "mysql-shard-2-replica-1.corp.internal:3306"
      ]
    }
  ]
}

Response Payload (200 OK):

{
  "status": "success",
  "message": "Shard routing table updated successfully",
  "version": 42,
  "synchronized_nodes": 8
}

2. SQL Parsing Representation (AST Planning Contract)

When the application client executes a query, the proxy's SQL engine parses the query into a structured Abstract Syntax Tree (AST) to evaluate the sharding key location.

Client Query:

SELECT username, email FROM users WHERE user_id = 987654;

Proxy Internal Planning JSON:

{
  "query_type": "SELECT",
  "target_tables": ["users"],
  "projection_fields": ["username", "email"],
  "routing_evaluation": {
    "shard_key_found": true,
    "shard_key_name": "user_id",
    "shard_key_value": 987654,
    "evaluated_hash": "a4f8b9e2",
    "target_shard": "shard_80_ff",
    "target_node": "mysql-shard-2-primary.corp.internal:3306"
  }
}

High-Level Design and Visualizations

Decoupling the SQL proxy gateway, global configuration registry, and the physical database engines is the key to scaling relational stores to massive workloads.

1. System Architecture Layout

The following diagram illustrates how the application client interacts with the stateless proxies (VtGate equivalent), which query the consensus topology store (etcd) to target-route requests to primary or replica databases managed by database controller sidecars (VtTablet).

graph TD
    subgraph Client Layer
        App1[App Server Pool A] -->|1. Client Connection| Proxy1[Database Proxy Node 1]
        App2[App Server Pool B] -->|1. Client Connection| Proxy2[Database Proxy Node 2]
    end

    subgraph Topology Configuration State
        Proxy1 -->|2. Fetch Routing Map| TopoStore[etcd Topology Cluster]
        Proxy2 -->|2. Fetch Routing Map| TopoStore
    end

    subgraph Shard 1 (Keys 00000000 to 7FFFFFFF)
        Proxy1 -->|3. Route Query| Sidecar1[Tablet Controller 1]
        Sidecar1 -->|4. Manage Connections| MySQL_S1_P[(MySQL Shard 1 Primary)]
        Sidecar1 -->|4. Health Checks| MySQL_S1_R[(MySQL Shard 1 Replica)]
    end

    subgraph Shard 2 (Keys 80000000 to FFFFFFFF)
        Proxy2 -->|3. Route Query| Sidecar2[Tablet Controller 2]
        Sidecar2 -->|4. Manage Connections| MySQL_S2_P[(MySQL Shard 2 Primary)]
        Sidecar2 -->|4. Health Checks| MySQL_S2_R[(MySQL Shard 2 Replica)]
    end

2. Query Routing Sequence Diagram

Below is the sequence showing how the proxy handles single-shard routing (using user_id) vs. scatter-gather queries (queries omitting the shard key).

sequenceDiagram
    autonumber
    participant App as Application Server
    participant Proxy as Database Proxy (VtGate)
    participant Topo as etcd Topo Store
    participant DB_S1 as Shard 1 Tablet
    participant DB_S2 as Shard 2 Tablet

    rect rgb(240, 248, 255)
        Note over App, DB_S2: Case A: Target-Routed Query (With Shard Key)
        App->>Proxy: SELECT * FROM users WHERE user_id = 123
        Proxy->>Proxy: Parse AST & Hash user_id 123
        Proxy->>Proxy: Resolve hash range to Shard 1
        Proxy->>DB_S1: Route targeted query (optimized connection)
        DB_S1-->>Proxy: Return SQL result set
        Proxy-->>App: Return database rows
    end

    rect rgb(255, 240, 240)
        Note over App, DB_S2: Case B: Scatter-Gather Query (Without Shard Key)
        App->>Proxy: SELECT * FROM users WHERE age = 30
        Proxy->>Proxy: Parse AST (No shard key found!)
        Par over DB_S1, DB_S2
            Proxy->>DB_S1: Parallel query broadcast
            Proxy->>DB_S2: Parallel query broadcast
        end
        DB_S1-->>Proxy: Return partial results (Shard 1)
        DB_S2-->>Proxy: Return partial results (Shard 2)
        Proxy->>Proxy: Merge, Sort, and deduplicate streams
        Proxy-->>App: Return compiled rows
    end

Low-Level Design and Schema Strategies

To coordinate range routing, the proxy leverages a global routing directory schema. This directory mapping is kept in memory inside the proxy and refreshed asynchronously.

1. Database Proxy Routing Directory Schema

This layout defines how the control plane tracks shards, keyspace definitions, and the state of physical cluster mappings.

-- Core Keyspace Registration Table
CREATE TABLE database_keyspaces (
    keyspace_name VARCHAR(64) PRIMARY KEY, -- e.g. 'user_db'
    is_sharded BOOLEAN NOT NULL DEFAULT TRUE,
    sharding_type VARCHAR(32) NOT NULL, -- 'range', 'hash', 'directory'
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Shard Definition and Ranges
CREATE TABLE keyspace_shards (
    shard_id VARCHAR(64) PRIMARY KEY, -- e.g. 'user_db:shard_00_7f'
    keyspace_name VARCHAR(64) REFERENCES database_keyspaces(keyspace_name),
    range_start BYTEA NOT NULL, -- Binary hex boundary representation
    range_end BYTEA NOT NULL,
    current_primary_node VARCHAR(255) NOT NULL, -- host:port address
    is_active BOOLEAN DEFAULT TRUE,
    last_heartbeat TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Tablet Controller Node Registration
CREATE TABLE shard_tablets (
    tablet_id VARCHAR(64) PRIMARY KEY,
    shard_id VARCHAR(64) REFERENCES keyspace_shards(shard_id),
    host_address VARCHAR(255) NOT NULL,
    tablet_type VARCHAR(16) NOT NULL, -- 'PRIMARY', 'REPLICA', 'BATCH', 'SPARE'
    replication_lag_seconds INT DEFAULT 0,
    health_status VARCHAR(16) DEFAULT 'HEALTHY',
    last_updated TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Indexes for lightning fast lookups
CREATE INDEX idx_keyspace_lookup ON keyspace_shards(keyspace_name, is_active);
CREATE INDEX idx_tablet_lookup ON shard_tablets(shard_id, tablet_type, health_status);

2. Connection Multiplexing State Machine

The database proxy intercepts high-volume connections from the application and maps them onto a fixed pool of persistent database sockets.

Client Connection State	Proxy Action	Database Pool State
New Connection	Authenticates client at proxy edge; returns success instantly without hitting MySQL.	Idle connection pool remains untouched.
Active Query Received	Parses AST, selects connection from the target tablet's connection pool.	Leases a socket from the pool.
Transaction Started	Binds the client connection to a specific database socket until commit or rollback.	Sockets locked exclusively to the client.
Query Completed (No Tx)	Returns data stream to client; immediately recycles connection to pool.	Connection released back to the idle pool.
Client Disconnects	Closes client session; keeps physical database sockets open.	Physical pool size remains constant.

Scaling and Operational Challenges

1. Scatter-Gather Query Complexity Math

A scatter-gather query is triggered when the application runs a query without specifying the sharding key. For example: SELECT * FROM users ORDER BY creation_time DESC LIMIT 10;

The Math Behind the Complexity:
- Suppose the cluster has $N$ physical shards.
- The proxy must broadcast the query to all $N$ shards simultaneously.
- Each shard must compile, execute, and sort the rows locally to return its top 10 items.
- The database proxy receives $10 \times N$ total rows.
- To return the final top 10 rows to the user, the proxy's CPU must merge-sort the $10 \times N$ rows in memory.
- The computational complexity at the proxy is $O(M \times N \log(M \times N))$ where $M$ is the LIMIT size.
- The Bottleneck: As $N$ scales from 10 to 1,000 shards, a single un-sharded query consumes 1,000 physical connections across the database cluster and saturates the proxy's CPU.
Mitigation: The proxy enforces strict static query analysis. If a scatter-gather query exceeds a maximum allowed limit size ($M$ greater than 50) or is executed too frequently, the proxy blocks it instantly and throws an error recommending the developer utilize a search index (e.g. Elasticsearch) or specify the sharding key.

2. Dynamic Resharding and Split-Buffering

When a physical shard runs out of storage capacity or CPU limits, we must split it (e.g., splitting a single shard covering key-range 00-FF into two shards: 00-7F and 80-FF).

sequenceDiagram
    autonumber
    participant Proxy as Database Proxy (VtGate)
    participant Topo as etcd Topo Store
    participant OldShard as Shard [00-FF] (Primary)
    participant NewS1 as New Shard [00-7F]
    participant NewS2 as New Shard [80-FF]

    Note over OldShard, NewS2: Phase 1: Setup and Async Catchup
    NewS1->>OldShard: Replicate from [00-FF] (Filter keys 00-7F)
    NewS2->>OldShard: Replicate from [00-FF] (Filter keys 80-FF)
    Note over NewS1, NewS2: Wait until replication lag is near zero (< 1s)

    Note over Proxy, OldShard: Phase 2: Cutover and Split Buffering
    Proxy->>OldShard: Buffer incoming writes for keyspace (Hold lock)
    Note over Proxy: Requests are queued, NOT rejected (less than 1s)
    
    Proxy->>Topo: Update Routing Table (Point ranges to New Shards)
    Proxy->>NewS1: Promote to Primary (Range 00-7F)
    Proxy->>NewS2: Promote to Primary (Range 80-FF)
    
    Proxy->>Proxy: Flush buffered writes to New Shards
    Proxy->>OldShard: Decommission Old Shard

The Buffering Challenge: During the final cutover, we must avoid rejecting user writes. The proxy handles this by implementing Split-Buffering. When the routing switch happens, the proxy pauses write execution and buffers all client SQL queries inside its memory buffers for up to 2 seconds. Once the new target primary nodes are promoted, the proxy flushes the buffered queries to the new primary nodes, ensuring zero-downtime resharding.

Database Proxy Trade-offs and Architectural Decisions

Every architectural choice is a trade-off. Deciding whether to utilize a centralized Database Proxy (like Vitess) versus integrating client-side sharding drivers (such as JDBC ShardingSphere) inside your microservice code dictates global server resource efficiency and deployment complexity.

Architectural Dimension	Centralized Database Proxy (Vitess Style)	Client-Side Sharding Drivers
Connection Pooling Efficiency	Excellent (Globally multiplexed across a static pool)	Poor (Every application instance must maintain active database sockets)
Network Hop Overhead	Medium (Adds an extra hop of less than 2ms for query parsing)	Low (Direct TCP socket access directly to physical databases)
Operational and Infra Complexity	High (Requires provisioning stateless proxies and an etcd cluster)	Low (Simply import a client library dependency into app servers)
Language & Platform Agnostic	Universal (Speaks standard MySQL/PostgreSQL protocols)	Restricted (Requires writing client library implementations for each language)

Failure Modes and Fault Tolerance Strategies

1. Split-Brain Mitigation During Primary Tablet Promotion

If a network partition isolates a primary database node, a naive monitor might assume it is dead and promote a replica to primary. If the old primary is actually still alive, you get a Split-Brain condition where two primary nodes accept conflicting writes.

The Resilience Strategy: The proxy integrates with consensus-backed leases in etcd. Every database primary tablet must maintain a short-duration lease (e.g., 5 seconds) inside etcd.
If a primary tablet fails to renew its lease within the 5-second deadline, its lease expires.
The proxy detects lease expiration, blocks all query routing to the old primary, and initiates replica promotion.
Even if the old primary is still alive, it cannot accept writes because the proxy refuses to route traffic to it, ensuring absolute consistency.

2. Replication Lag Read Safety

In highly sharded environments, routing all reads to replicas is crucial to protect the primary node's CPU. However, if a replica has high replication lag, applications will suffer from stale reads.

The Solution: The tablet controllers constantly measure replica replication lag in milliseconds.
The proxy monitors these metrics. We define a maximum allowed replication lag threshold (e.g., 1,000ms).
If a replica's lag is greater than 1,000ms, the proxy dynamically marks the node as unhealthy and redirects all read requests back to the primary node.
Once the replica catches up and lag drops less than 500ms, it is safely re-introduced into the read pool, guaranteeing data freshness.

Staff Engineer Perspective

Production Readiness Checklist

Before rolling out your database sharding proxy to production, ensure these checks are passed:

Query Depth & Limit Rules: Ensure all non-sharded queries are verified and restricted to a max limit size of less than 50 rows.
Connection Pool Sizes: Set proxy-to-database connection pools to static values (e.g. 50 connections per shard) to prevent database resource starvation.
Topology Heartbeats: Configure etcd routing table heartbeat checks to run every 1,000ms.
Split-Buffering Latency Check: Test that dynamic shard-splitting buffering does not exceed a p99 threshold of 1,500ms.

Verbal Script

Interviewer: "How would you design a highly scalable database proxy for sharding a relational database, and how do you handle cross-shard queries and failovers?"

Candidate: "To design a database proxy for database sharding at scale, I would follow a decoupled architecture inspired by Vitess. The core architecture consists of three components: stateless proxy gateways called VtGate, consensus-backed configuration stores like etcd to manage the routing directory, and local tablet controllers running alongside every MySQL instance.

The stateless proxies act as standard MySQL endpoints for the application. When a client executes a query, the proxy parses the SQL Abstract Syntax Tree (AST) in less than 2ms. It looks for a designated sharding key, such as user_id.

If the sharding key is present, the proxy hashes it, compares the hash against the range boundaries stored in the cached topology routing directory, and target-routes the query directly to the specific primary or replica database node. This targeted routing ensures $O(1)$ networking latency.

If the query is a cross-shard query that lacks the sharding key—for example, a SELECT searching by user age—the proxy must coordinate a Scatter-Gather workflow. It broadcasts the query to all database shards in parallel, receives the partial datasets, and then executes in-memory sorting and deduplication before returning the rows to the application client. Because scatter-gather operations scale at $O(N \log N)$ complexity and consume active connections across the entire cluster, the proxy statically monitors queries and rejects any scatter-gather operation where the request limit is greater than 50 rows, protecting the database cluster from resource starvation.

To ensure high availability during database failovers, the proxy integrates with consensus-based leases. The primary node's controller must renew its lease in the topology store every 5 seconds. If a node fails, its lease expires, and the proxy immediately blocks all query routing to the dead primary. A replica node is promoted to primary, the topology routing table is updated in etcd, and the proxies receive the updated routing map. During this quick failover window, the proxy leverages Split-Buffering to queue incoming writes inside memory buffers for up to 2 seconds instead of rejecting them, guaranteeing a seamless failover experience without dropping active database connections."