System Design: Building a Distributed Configuration Platform

Design a production distributed configuration platform with versioned config, rollout safety, snapshots, watchers, audit logs, multi-environment isolation, and safe client-side caching.

Most systems begin with configuration as a local flat file.

That works for a while.

Then one service needs a database URL rotation without a full code redeploy. Another wants per-environment rate limits. A third wants tenant-specific processing limits. A fourth wants a kill switch for an external payment provider. Suddenly config is no longer a simple startup file loaded on boot. It becomes a dynamic runtime platform.

When configuration is handled poorly, you get the worst sort of production outages:

  • One bad config value instantly crashes every service instance across the fleet.
  • Different hosts run different configuration versions without the operators realizing.
  • Secrets leak into logging systems or general configuration stores.
  • Operators hotfix production variables through SSH or direct database writes, creating permanent drift.

This guide designs a production-grade distributed configuration platform.


Requirements and System Goals

A distributed configuration platform behaves as a distributed control plane. While config writes occur infrequently, read availability and client-side access speeds are critical path dependencies.

Functional Requirements

  • Scoped & Versioned Config: The system must store configuration keys scoped by namespace, environment (dev, staging, prod), service name, and tenant override. Every single commit must generate a new, monotonically increasing version ID.
  • Canary & Staged Rollouts: Support progressive configuration release (e.g., publish version 418 to 1% of instances, then 10%, then 100%).
  • Watcher & Subscription API: Clients must receive dynamic update notifications within seconds of activation without continuously polling the database.
  • Merge-Time Schema Validation: Config values must be validated against registered JSON Schemas before they can be activated.
  • Rollback Tooling: Operators must be able to roll back configuration changes to a previous version with a single API call.
  • Audit Logging: Every config change must write an audit entry recording who initiated it, when, and the exact JSON value diff.
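
The audit requirement above asks for an exact value diff. The following is a minimal sketch of how such a diff might be computed, assuming a top-level (non-recursive) comparison; `DiffEntry` and `diffConfig` are hypothetical names introduced here, and a real platform might emit a standard patch format instead.

```typescript
// Hypothetical shape of one audit diff entry; not part of the lesson's API.
type DiffEntry = { key: string; before: unknown; after: unknown };

// Compare two config values key by key (top level only) and list what changed.
function diffConfig(
  before: Record<string, unknown>,
  after: Record<string, unknown>
): DiffEntry[] {
  // Union of keys from both versions, de-duplicated and sorted for stable output.
  const keys = Object.keys(before)
    .concat(Object.keys(after))
    .filter((key, index, all) => all.indexOf(key) === index)
    .sort();

  const changes: DiffEntry[] = [];
  for (const key of keys) {
    // JSON.stringify is a simple, key-order-sensitive structural comparison.
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes.push({ key, before: before[key], after: after[key] });
    }
  }
  return changes;
}

const auditDiff = diffConfig(
  { connectTimeoutMs: 100, readTimeoutMs: 450, maxRetries: 3 },
  { connectTimeoutMs: 150, readTimeoutMs: 450, maxRetries: 3 }
);
// auditDiff contains one entry: connectTimeoutMs changed from 100 to 150.
```

Added and removed keys show up with `undefined` on the missing side, which is usually enough for a human-readable audit trail.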

Non-Functional Requirements

  • Sub-Millisecond Read Latency: Client lookups must run from in-memory maps in the application process, yielding sub-millisecond response times.
  • Fail-Soft Local Snapshots: If the configuration control plane goes offline, client SDKs must continue operating using their last-known-good local disk snapshot.
  • Fleet Consistency Bounds: Updates must propagate to 99% of a global fleet of 100,000 instances in less than 5 seconds.
  • Strict Environment Isolation: Production configuration control planes and network boundaries must be completely isolated from staging and development networks.

API Interfaces and Service Contracts

To allow automated pipelines and operators to modify configuration, we expose a REST API. Client sidecars and libraries use gRPC to pull configuration and stream updates.

Propose a Configuration Draft

  • Endpoint: POST /v1/config/drafts
  • Request Payload:
{
  "namespace": "checkout",
  "environment": "prod",
  "service": "payment-api",
  "configKey": "gateway_timeouts",
  "value": {
    "connectTimeoutMs": 150,
    "readTimeoutMs": 450,
    "maxRetries": 3
  }
}
  • Response Payload (HTTP 201 Created):
{
  "draftId": "dr_00129a",
  "namespace": "checkout",
  "configKey": "gateway_timeouts",
  "proposedVersion": 418,
  "isValid": true,
  "validationMessage": "Schema validation passed."
}

Activate Configuration Version (Publish)

  • Endpoint: POST /v1/config/activate
  • Request Payload:
{
  "namespace": "checkout",
  "environment": "prod",
  "service": "payment-api",
  "configKey": "gateway_timeouts",
  "version": 418,
  "rolloutPercentage": 10
}
  • Response Payload (HTTP 200 OK):
{
  "configKey": "gateway_timeouts",
  "activeVersion": 418,
  "status": "ROLLOUT_IN_PROGRESS",
  "rolloutTargetInstances": 10000
}
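
The lesson does not specify how an instance learns whether it falls inside the `rolloutPercentage` cohort. One common approach is stable hash bucketing, sketched below under that assumption; `fnv1a`, `rolloutBucket`, and `isInRollout` are hypothetical helpers, not part of the API above.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. The multiply by the FNV
// prime (16777619) is written as shifts so it stays exact in JS numbers.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Map an instance to a stable bucket in [0, 100) for a given config key.
function rolloutBucket(instanceId: string, configKey: string): number {
  return fnv1a(configKey + ":" + instanceId) % 100;
}

// An instance adopts the new version once the rollout percentage exceeds its bucket.
function isInRollout(instanceId: string, configKey: string, rolloutPercentage: number): boolean {
  return rolloutBucket(instanceId, configKey) < rolloutPercentage;
}
```

Because the bucket depends only on the instance ID and key, raising the percentage from 1 to 10 to 100 only ever adds instances to the cohort; no instance flips back to the old version mid-rollout.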

Retrieve Full Resolved Configuration Snapshot

  • Endpoint: GET /v1/config/snapshot
  • Query Params: service=payment-api&environment=prod
  • Response Payload (HTTP 200 OK):
{
  "service": "payment-api",
  "environment": "prod",
  "snapshotVersion": 418,
  "updatedAt": "2026-06-06T14:20:00Z",
  "configurations": {
    "gateway_timeouts": {
      "connectTimeoutMs": 150,
      "readTimeoutMs": 450,
      "maxRetries": 3
    },
    "rate_limits": {
      "requestsPerSecond": 500,
      "burstCapacity": 1000
    }
  }
}

Client gRPC Service Contract

For efficient, low-overhead communication between client sidecars and the config server, we use gRPC:

syntax = "proto3";

package codesprintpro.config.v1;

service ConfigDeliveryService {
  rpc FetchConfigSnapshot (SnapshotRequest) returns (SnapshotResponse);
  rpc WatchConfigUpdates (WatchRequest) returns (stream ConfigUpdateEvent);
}

message SnapshotRequest {
  string environment = 1;
  string service_name = 2;
  int64 client_version = 3;
}

message SnapshotResponse {
  int64 snapshot_version = 1;
  string config_payload_json = 2;
  string sha256_hash = 3;
}

message WatchRequest {
  string environment = 1;
  string service_name = 2;
  int64 current_version = 3;
}

message ConfigUpdateEvent {
  string config_key = 1;
  int64 new_version = 2;
  string update_payload_json = 3;
}
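
On the client side, a `ConfigUpdateEvent` should only be applied if it is newer than what the client already holds. The sketch below illustrates that check, assuming versions are tracked per config key and fit in a JavaScript number; `LocalConfigView` and the camelCase field names are hypothetical stand-ins for the generated gRPC types.

```typescript
// Hypothetical client-side mirror of the ConfigUpdateEvent message.
type ConfigUpdate = { configKey: string; newVersion: number; updatePayloadJson: string };

class LocalConfigView {
  private values: Record<string, unknown> = {};
  private versions: Record<string, number> = {};

  // Returns true if the update was applied, false if it was stale or a duplicate.
  apply(update: ConfigUpdate): boolean {
    const currentVersion = this.versions[update.configKey] ?? 0;
    if (update.newVersion <= currentVersion) {
      return false;
    }
    // JSON.parse throws on malformed payloads, leaving the current state untouched.
    const parsed: unknown = JSON.parse(update.updatePayloadJson);
    // Copy-on-write: build new maps, then swap the references in one step.
    this.values = { ...this.values, [update.configKey]: parsed };
    this.versions = { ...this.versions, [update.configKey]: update.newVersion };
    return true;
  }

  get(configKey: string): unknown {
    return this.values[configKey];
  }

  versionOf(configKey: string): number {
    return this.versions[configKey] ?? 0;
  }
}
```

Ignoring stale versions makes redelivered or out-of-order events harmless, which matters once clients reconnect and replay.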

High-Level Design and Visualizations

Our architecture separates the management path (writes, validation, approvals) from the fast-path distribution network (reading configuration snapshots and updates).

Configuration Propagation Topology

flowchart TD
    Admin[Operator / CI-CD Pipeline] -->|1. Propose / Publish Config| AdminAPI[Config Admin Service]
    AdminAPI -->|2. Validate Schema & Authz| Validator[JSON Schema Validator]
    Validator -->|3. Store Versioned Records| MetadataDB[(Metadata Store - PostgreSQL)]
    AdminAPI -->|4. Push Active Event| RedisPub[Redis Pub-Sub Cluster]
    
    RedisPub -->|5. Broadcast Update Events| SidecarPool[Client Sidecar Pool]
    
    subgraph HostInstance [Application VM / Container]
        SidecarPool -->|6. Stream Updates & Write Snapshot| DiskStorage[(Local SSD Storage - JSON Snapshot)]
        SidecarPool -->|7. Push Hot Swap| MemoryCache[In-Memory Local Config Map]
        App[Application Code] -->|8. Sub-millisecond Key Read| MemoryCache
        App -.->|9. Fallback on startup failure| DiskStorage
    end
    
    SidecarPool -->|10. Fetch full fallback snapshot on boot| Gateway[Config Read Gateway]
    Gateway -->|11. Fetch current active snapshot| MetadataDB

Client-Side Fallback Resolution State Machine

Client SDKs and sidecars use a fallback resolution path on host startup to prevent control-plane outages from affecting service availability.

stateDiagram-v2
    [*] --> Init: Application Boot
    Init --> AttemptRemote: Try to fetch remote snapshot from Config Gateway
    
    AttemptRemote --> RemoteSuccess: Remote service responds OK (200)
    RemoteSuccess --> WriteLocalDisk: Write snapshot payload and SHA-256 hash to local SSD
    WriteLocalDisk --> LoadInMemory: Load configuration map in-process
    LoadInMemory --> ActiveState: SDK initialized, serving requests
    
    AttemptRemote --> RemoteFail: Gateway timeout / HTTP 5xx / Network Down
    RemoteFail --> CheckLocalDisk: Scan local SSD for previous configuration snapshot
    
    CheckLocalDisk --> LocalExists: Valid snapshot JSON and hash match found
    LocalExists --> LoadInMemoryFallback: Load local snapshot and emit warning alert
    LoadInMemoryFallback --> ActiveState
    
    CheckLocalDisk --> LocalMissing: Local disk empty or snapshot corrupt
    LocalMissing --> LoadCodeDefaults: Load fallback properties hardcoded in application binary
    LoadCodeDefaults --> ActiveStateDegraded: Service starts in safe degraded mode
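
The state machine above can be read as a three-step fallback chain. The sketch below expresses it as a single function, assuming synchronous I/O callbacks for brevity (a real SDK would be asynchronous); `resolveStartupConfig` and its parameter names are hypothetical.

```typescript
type ConfigSnapshot = { version: number; payload: Record<string, unknown> };
type StartupResult = {
  source: "remote" | "disk" | "defaults";
  degraded: boolean;
  snapshot: ConfigSnapshot;
};

function resolveStartupConfig(
  fetchRemote: () => ConfigSnapshot,             // throws on timeout, 5xx, or network failure
  readVerifiedDisk: () => ConfigSnapshot | null, // null if missing or the hash does not match
  writeDisk: (snapshot: ConfigSnapshot) => void,
  codeDefaults: ConfigSnapshot                   // values compiled into the binary
): StartupResult {
  let remote: ConfigSnapshot | null = null;
  try {
    remote = fetchRemote();
  } catch {
    remote = null; // RemoteFail: fall through to the local checks
  }
  if (remote !== null) {
    writeDisk(remote); // refresh the last-known-good snapshot
    return { source: "remote", degraded: false, snapshot: remote };
  }
  const local = readVerifiedDisk();
  if (local !== null) {
    // LocalExists: the caller should also emit a warning alert here.
    return { source: "disk", degraded: false, snapshot: local };
  }
  // LocalMissing: start in safe degraded mode on hardcoded defaults.
  return { source: "defaults", degraded: true, snapshot: codeDefaults };
}
```

Passing the I/O steps in as functions keeps the decision logic easy to unit test without a gateway or a disk.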

Low-Level Design and Schema Strategies

We use a PostgreSQL database to manage versioned configuration. We maintain separate tables for configuration history, currently active versions, and audit trails.

PostgreSQL Table DDLs

-- Versioned historical table holding config payloads for every commit
CREATE TABLE config_entries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    namespace VARCHAR(64) NOT NULL,
    environment VARCHAR(32) NOT NULL,
    service VARCHAR(128) NOT NULL,
    config_key VARCHAR(128) NOT NULL,
    config_value JSONB NOT NULL,
    schema_version INT NOT NULL DEFAULT 1,
    config_version BIGINT NOT NULL,              -- Monotonically increasing version counter
    status VARCHAR(32) NOT NULL,                 -- 'DRAFT', 'ACTIVE', 'ARCHIVED'
    created_by VARCHAR(128) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uk_config_key_version UNIQUE (namespace, environment, service, config_key, config_version)
);

-- Index to quickly scan historical version runs
CREATE INDEX idx_config_entries_lookup 
ON config_entries (namespace, environment, service, config_key, config_version DESC);

-- Pointers to the currently active version of each configuration key
CREATE TABLE config_current_versions (
    namespace VARCHAR(64) NOT NULL,
    environment VARCHAR(32) NOT NULL,
    service VARCHAR(128) NOT NULL,
    config_key VARCHAR(128) NOT NULL,
    active_version BIGINT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (namespace, environment, service, config_key)
);

-- Audit log recording diffs and operator changes
CREATE TABLE config_audit_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    namespace VARCHAR(64) NOT NULL,
    environment VARCHAR(32) NOT NULL,
    service VARCHAR(128) NOT NULL,
    config_key VARCHAR(128) NOT NULL,
    old_version BIGINT,
    new_version BIGINT NOT NULL,
    actor_id VARCHAR(128) NOT NULL,
    action_type VARCHAR(64) NOT NULL,            -- 'CREATED', 'ACTIVATED', 'ROLLED_BACK'
    change_diff JSONB NOT NULL,                  -- JsonDiff patch payload showing changes
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_config_audit_events_history 
ON config_audit_events (namespace, environment, config_key, created_at DESC);
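
With this schema, activation and rollback both reduce to moving the pointer in `config_current_versions` and appending an audit row. The in-memory model below is a sketch of that behavior for a single config key, not the service's actual implementation; `ConfigKeyHistory` and its methods are hypothetical.

```typescript
type AuditAction = "ACTIVATED" | "ROLLED_BACK";
type AuditRecord = { oldVersion: number | null; newVersion: number; actorId: string; actionType: AuditAction };

class ConfigKeyHistory {
  private entries: Record<number, unknown> = {};  // stands in for config_entries
  private nextVersion = 1;                        // monotonically increasing counter
  private activeVersion: number | null = null;    // stands in for config_current_versions
  private auditLog: AuditRecord[] = [];           // stands in for config_audit_events

  // Store a new immutable version and return its ID.
  commit(value: unknown): number {
    const version = this.nextVersion++;
    this.entries[version] = value;
    return version;
  }

  // Move the active pointer and record who did it.
  activate(version: number, actorId: string, actionType: AuditAction = "ACTIVATED"): void {
    if (!(version in this.entries)) {
      throw new Error("unknown version " + version);
    }
    this.auditLog.push({ oldVersion: this.activeVersion, newVersion: version, actorId, actionType });
    this.activeVersion = version;
  }

  // Roll back to whatever was active before the most recent change.
  rollback(actorId: string): void {
    const last = this.auditLog[this.auditLog.length - 1];
    if (last === undefined || last.oldVersion === null) {
      throw new Error("no previous version to roll back to");
    }
    this.activate(last.oldVersion, actorId, "ROLLED_BACK");
  }

  active(): { version: number | null; value: unknown } {
    const value = this.activeVersion === null ? undefined : this.entries[this.activeVersion];
    return { version: this.activeVersion, value };
  }

  audit(): AuditRecord[] {
    return this.auditLog.slice();
  }
}
```

Because history rows are never mutated, a rollback is as cheap as an activation, which is what makes the single-call rollback requirement realistic. In PostgreSQL the pointer update and the audit insert would belong in one transaction.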

Scaling and Operational Challenges

To support 100,000 client instances concurrently updating their local configurations, we must evaluate control plane egress network loads and broadcast behavior.

Back-of-the-Envelope Capacity Estimations

Let us estimate the capacity requirements for streaming a config change event to a fleet of 100,000 microservice instances.

  • Configuration Event Payload Size: Let us assume the JSON configuration metadata event is approximately 1 KB in size.
  • Full Snapshot Payload Size: The full resolved configuration snapshot for a service contains multiple keys and averages 50 KB.
  • Update Event Broadcast Bandwidth: When a config update is activated, we stream the 1 KB event via Redis Pub/Sub to all 100,000 listening client sidecars.
    $$\text{Event egress volume} = 100{,}000 \times 1\text{ KB} = 100{,}000\text{ KB} = 100\text{ MB}$$
    If the broadcast completes within 1 second, the required egress bandwidth is:
    $$\text{Egress rate} = 100\text{ MB/sec} = 800\text{ Mbps}$$
    A Redis or Kafka cluster can absorb this rate when subscriber connections are spread across several nodes.
  • Full Snapshot Fetch Stampede (The Thundering Herd): If a network blip causes all 100,000 clients to restart or re-fetch their full configuration snapshots concurrently, the config gateways experience a sudden load spike:
    $$\text{Thundering herd volume} = 100{,}000 \times 50\text{ KB} = 5{,}000{,}000\text{ KB} = 5\text{ GB}$$
    If these requests arrive within a 2-second window, the gateway servers must sustain an egress rate of:
    $$\text{Gateway egress rate} = \frac{5\text{ GB}}{2\text{ sec}} = 2.5\text{ GB/sec} = 20\text{ Gbps}$$
    To prevent gateway CPU and network saturation, we implement:
    • Jitter (random delays between 0 and 3 seconds) on client startup.
    • Edge caching of configuration snapshots on CDN or API Gateway proxies (e.g., NGINX / Cloudflare).
    • Client sidecars that prefer loading local SSD snapshots and verify the current version via a lightweight HEAD request (a header-only response with no body) instead of downloading the full 50 KB configuration.
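
The estimates above can be reproduced with a small helper. This is a sketch using the same decimal units as the text (1 MB = 1,000 KB, 8 bits per byte); `egressMbps` is a hypothetical name, and the 5-second figure assumes, roughly, that 0 to 3 seconds of jitter spreads the 2-second burst evenly over about 5 seconds.

```typescript
// Required egress in Mbps to deliver `payloadKB` to every instance within the window.
function egressMbps(instances: number, payloadKB: number, windowSeconds: number): number {
  const totalKilobits = instances * payloadKB * 8;
  return totalKilobits / 1000 / windowSeconds;
}

const eventBroadcastMbps = egressMbps(100_000, 1, 1);  // 800 Mbps
const herdMbps = egressMbps(100_000, 50, 2);           // 20,000 Mbps = 20 Gbps
const herdWithJitterMbps = egressMbps(100_000, 50, 5); // 8,000 Mbps = 8 Gbps (assumed 5 s spread)
```

Even under that optimistic assumption, jitter alone still leaves a multi-gigabit spike, which is why it is paired with edge caching and version checks against the local snapshot.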

Trade-offs and Architectural Alternatives

Event Broadcast Mechanism: Push vs. Pull Polling

| Pattern | Latency | Network Overhead | Complexity | Reliability |
| --- | --- | --- | --- | --- |
| Short Polling (e.g., every 5s) | High (up to 5s delay) | High (saturates database with read requests) | Low | High |
| Long Polling (HTTP Hang) | Medium (under 500ms) | Low (connections held open) | Medium | High |
| Push-Based (SSE / WebSocket) | Low (less than 50ms) | Extremely Low | High (requires persistent stateful gateways) | Medium (requires client-side reconnect handling) |

We choose a hybrid approach: Clients fetch their initial full snapshot on boot via standard HTTP GET (pull), then open a persistent Server-Sent Events (SSE) or gRPC streaming connection to receive lightweight update events (push). If the stream disconnects, the client falls back to long-polling.
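
When a stream drops for the whole fleet at once, the fallback requests themselves can stampede the gateways. A common remedy is exponential backoff with full jitter, sketched below; the 500 ms base and 30-second cap are illustrative assumptions, and `reconnectDelayMs` is a hypothetical helper.

```typescript
// Full-jitter backoff: a uniform random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30_000,
  random: () => number = Math.random
): number {
  const ceilingMs = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceilingMs);
}
```

A client would wait `reconnectDelayMs(attempt)` before each long-poll or stream reconnect attempt and reset `attempt` to 0 once a connection succeeds.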

Client integration: Inline SDK vs. Local Sidecar Cache

  • Inline SDK (embedded inside application code):
    • Pros: Simple deployment; no external processes to monitor; lowest intra-host communication overhead.
    • Cons: Requires language-specific SDK implementations; configuration caching logic shares heap memory with application code (risk of garbage collection pauses).
  • Local Sidecar Cache (running adjacent to the application process):
    • Pros: Completely decouples config management from the application; runs in a separate process space; provides language-agnostic integration (exposes config on a local port, e.g. at the path /v1/config); caches snapshots on local disk automatically.
    • Cons: Increases host resource usage; adds process orchestration complexity (Kubernetes sidecar containers).

Failure Modes and Fault Tolerance Strategies

Fleet Version Drift and Skew Detection

During rollouts, some instances may fail to receive the configuration update due to network partitions, leaving them running on an older version.

  • Mitigation: Every microservice client regularly exposes its active configuration version ID in its heartbeat metadata.
  • Alerting: The central monitoring service scans heartbeat metadata. If version skew persists for more than 5 minutes, it triggers an alert and flags the stale instances to be recycled (restarted or redeployed).
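
The skew check described above might look like the following sketch. It assumes the rollout has reached 100%, so every instance is expected to be on the target version; `InstanceHeartbeat` and `findStaleInstances` are hypothetical names.

```typescript
type InstanceHeartbeat = { instanceId: string; configVersion: number };

const SKEW_GRACE_MS = 5 * 60 * 1000; // the 5-minute threshold from the text

// Return the IDs of instances still behind the target version once the
// grace period since activation has elapsed.
function findStaleInstances(
  heartbeats: InstanceHeartbeat[],
  targetVersion: number,
  activatedAtMs: number,
  nowMs: number
): string[] {
  if (nowMs - activatedAtMs <= SKEW_GRACE_MS) {
    return []; // still within the propagation grace period
  }
  return heartbeats
    .filter((hb) => hb.configVersion < targetVersion)
    .map((hb) => hb.instanceId);
}
```

During a partial canary rollout the same check would need to consider only the instances inside the rollout cohort.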

Invalid Merged Configuration Crash

A user updates a configuration key that passes validation in isolation. However, when combined with service-level overrides, the merged configuration is invalid and crashes the application during startup.

  • Mitigation: The validation engine must check the final merged configuration representation, not just the isolated keys.
  • Testing: The gateway runs dry-run merge validations against the registered schemas before committing any drafts:
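// Assumed shapes: a config layer is a plain JSON object; a schema exposes validate().
type Config = Record<string, unknown>;
interface JsonSchema { validate(config: Config): boolean; }
// Later layers win (global < env < override); shallow merge shown for illustration.
function validateMerge(global: Config, env: Config, override: Config, schema: JsonSchema): boolean {
  const merged: Config = { ...global, ...env, ...override };
  return schema.validate(merged);
}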
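
To make the failure concrete, here is a small sketch in which an override looks reasonable on its own but produces an invalid merged result. The cross-field rule (connect timeout must not exceed read timeout) is an illustrative assumption, as are the names `TimeoutConfig` and `isMergedTimeoutConfigValid`.

```typescript
type TimeoutConfig = { connectTimeoutMs?: number; readTimeoutMs?: number; maxRetries?: number };

// A cross-field rule that can only be checked on the merged view.
function isMergedTimeoutConfigValid(merged: TimeoutConfig): boolean {
  const { connectTimeoutMs, readTimeoutMs } = merged;
  return (
    typeof connectTimeoutMs === "number" &&
    typeof readTimeoutMs === "number" &&
    connectTimeoutMs > 0 &&
    connectTimeoutMs <= readTimeoutMs
  );
}

const globalDefaults: TimeoutConfig = { connectTimeoutMs: 150, readTimeoutMs: 450, maxRetries: 3 };
// On its own, a 100 ms read timeout is a well-formed positive integer.
const serviceOverride: TimeoutConfig = { readTimeoutMs: 100 };

const mergedTimeouts: TimeoutConfig = { ...globalDefaults, ...serviceOverride };
const mergedIsValid = isMergedTimeoutConfigValid(mergedTimeouts);
// mergedIsValid is false: the inherited 150 ms connect timeout exceeds the 100 ms read timeout.
```

Validating only the override would have let this change through; validating the merged object rejects it before activation.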

Config Store Database Outage

If the main PostgreSQL metadata database goes offline, the control plane cannot write configuration changes. However, read availability must be preserved.

  • Mitigation: Config read gateways run on active-active nodes using Redis replicas. We cache all resolved snapshots in Redis with a 24-hour TTL. If the database goes offline, clients can still fetch configuration from the cache.
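
The read path during a database outage can be sketched as follows. This is a simplified single-process model with an in-memory object standing in for Redis, and it refreshes from the database on every read for brevity (a production gateway would more likely serve from the cache first); `SnapshotReadPath` is a hypothetical name.

```typescript
type CachedSnapshot = { json: string; storedAtMs: number };

class SnapshotReadPath {
  private cached: Record<string, CachedSnapshot> = {}; // stands in for Redis
  private loadFromDb: (snapshotKey: string) => string;
  private ttlMs: number;

  constructor(loadFromDb: (snapshotKey: string) => string, ttlMs: number = 24 * 60 * 60 * 1000) {
    this.loadFromDb = loadFromDb;
    this.ttlMs = ttlMs;
  }

  read(snapshotKey: string, nowMs: number): string {
    try {
      const fresh = this.loadFromDb(snapshotKey);
      this.cached[snapshotKey] = { json: fresh, storedAtMs: nowMs };
      return fresh;
    } catch (dbError) {
      // Database unavailable: serve the cached snapshot if it is still within the TTL.
      const entry = this.cached[snapshotKey];
      if (entry !== undefined && nowMs - entry.storedAtMs <= this.ttlMs) {
        return entry.json;
      }
      throw dbError;
    }
  }
}
```

The model also exposes the limit of this mitigation: once an outage outlasts the 24-hour TTL, the cache no longer answers and clients depend on their local disk snapshots.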

Staff Engineer Perspective


Verbal Script

Interviewer: "How would you design a distributed configuration platform that guarantees that a configuration change won't take down the entire system?"

Candidate: "I would implement safety guardrails across validation, rollout, and fallback paths.

First, I would enforce merge-time schema validation. Configuration changes cannot be activated unless they pass validation against a registered JSON Schema. This validation is run on the final merged representation (global defaults, environment overrides, and service-level configurations combined) to verify keys are valid.

Second, I would avoid rolling out updates to the entire fleet at once. Instead, I would use canary rollouts, updating configuration on 1% of instances first. The client SDK monitors host error rates. If the error rate spikes, the SDK rolls back to the previous version and notifies the control plane to abort the rollout.

Finally, I would isolate client reads from control-plane availability. Client SDKs cache configuration snapshots locally on disk. If the configuration server goes offline, applications can boot or run using their last-known-good local snapshot."

Interviewer: "How does the client SDK detect that a new configuration is available without overwhelming the server?"

Candidate: "We use a hybrid push-pull approach. On startup, the client pulls a full configuration snapshot from the server. Once initialized, the client opens a persistent gRPC stream or Server-Sent Events (SSE) connection to receive lightweight update events.

These events contain only the namespace, key, and new version ID, not the full configuration payload. If the version is newer than the client's local version, the client fetches the updated configuration from the gateway. If the connection drops, the client falls back to long-polling with randomized jitter to prevent thundering herd spikes on the gateways."

