System Design: Building a Feature Flag Platform

In modern software development, decoupling deployment from release is paramount. Feature flag systems (such as LaunchDarkly or Split.io) allow engineers to deploy code to production while keeping features completely dark until they are flipped. This case study designs a highly available, ultra-low latency, and resilient feature flag platform capable of serving billions of evaluations daily across millions of client devices and server nodes.

1. Requirements & Core Constraints

Functional Requirements

Flag Configuration Panel: Admins can create, edit, delete, and archive flags.
Targeting Rules (Cohorts): Target specific cohorts of users based on attributes (e.g., email matches *@company.com, country is "US", or plan tier is "Enterprise").
Percentage Rollouts: Smoothly roll out features from 0% to 100% using a deterministic hashing algorithm.
Real-Time Kill Switches: Instant propagation of flag changes (sub-second latency globally) to shut off misbehaving production features.
Audit Logs & Permissions: Detailed audit history logging who modified a flag, when, and what changed.

Non-Functional Requirements

Zero-Latency In-Memory Evaluation: To avoid an expensive API call on every if (flagEnabled), evaluations must execute inside the application process memory space in sub-microsecond time.
Scale: Manage up to 100,000 active feature flags evaluated by 1 Million connected server and client SDKs.
High Propagation Reliability: Flag updates must stream to connected SDKs globally within 1 second.
Robust Client SDK Isolation: A partition or outage in the feature flag backend database must never crash or block the host application.

Back-of-the-Envelope Capacity Estimation

1. Connection Scale (SSE Gateways)

Connected Clients: 1,000,000 concurrent server/client SDK sessions.
Ingress Protocol: Server-Sent Events (SSE) connections for real-time pushing.
Memory Overhead per Connection: Let's budget 30 Kilobytes per open socket on our Node.js/Go gateway servers.
Total RAM for SSE Gateways:
- 1,000,000 * 30,000 bytes = 30 Gigabytes.
- Distributed across 30 container nodes (each handling ~33,000 sessions) = ~1 Gigabyte per container. Extremely lightweight!

2. Streaming Bandwidth Sizing

Total Feature Flags: 100,000 active flags.
Rule Set Payload Size: Assume a typical rule set has about 100 rules active at any time and compresses to ~5 Kilobytes of JSON.
Initial Broadcast Network Burst:
- If 1,000,000 nodes restart and fetch rules concurrently:
- 1,000,000 * 5 Kilobytes = 5 Gigabytes of data.
- If they restart within a 10-second window, we need 500 Megabytes/second of egress bandwidth. A global CDN shield is mandatory.
Propagation Updates: Each published change is sent as a single delta of about 500 bytes.
- 1,000,000 active streams * 500 bytes = 500 Megabytes pushed across the fleet per change.
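
The arithmetic in both estimates can be checked with a few lines of Java. This is only a sanity-check sketch: the class and method names are illustrative, and it uses decimal units (1 Kilobyte = 1,000 bytes) to match the figures above.

```java
final class CapacityEstimate {
    // Total bytes when every client holds or receives the same number of bytes.
    static long totalBytes(long clients, long bytesPerClient) {
        return clients * bytesPerClient;
    }

    // Even split of a total across nodes, or across the seconds of a time window.
    static long perUnit(long total, long units) {
        return total / units;
    }

    public static void main(String[] args) {
        long clients = 1_000_000L;
        long gatewayRam = totalBytes(clients, 30_000L);  // 30 KB budgeted per open socket
        long initialSync = totalBytes(clients, 5_000L);  // 5 KB compressed rule set
        long deltaFanout = totalBytes(clients, 500L);    // 500-byte delta per change

        System.out.println("Gateway RAM (bytes): " + gatewayRam);
        System.out.println("RAM per node across 30 nodes (bytes): " + perUnit(gatewayRam, 30));
        System.out.println("Initial sync volume (bytes): " + initialSync);
        System.out.println("Egress over a 10 s window (bytes/s): " + perUnit(initialSync, 10));
        System.out.println("Delta fan-out per change (bytes): " + deltaFanout);
    }
}
```

Changing the per-socket budget or the restart window in one place makes it easy to see how sensitive the node count and CDN requirement are to those assumptions.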

2. API Design & Core Contracts

The platform exposes REST APIs for flag configuration and admin portals, while SDKs connect to dedicated streaming SSE endpoints.

API 1: Create a Feature Flag with Advanced Targeting

Creates a new flag definition including rules and cohort criteria.

HTTP Method: POST
Path: /api/v1/flags
Headers:
- Content-Type: application/json
- Authorization: Bearer adm_8e09f2

Request Payload

{
  "key": "payment_redesign_2026",
  "name": "New Payment Redesign Gateway",
  "description": "Beta test of the stripe payment flow",
  "variations": [
    { "id": "var_on", "value": true, "name": "Feature Enabled" },
    { "id": "var_off", "value": false, "name": "Feature Disabled" }
  ],
  "rules": [
    {
      "id": "rule_internal_testers",
      "cohort_keys": ["internal_employees"],
      "variation_id": "var_on"
    },
    {
      "id": "rule_us_premium",
      "conditions": [
        { "attribute": "country", "operator": "EQUALS", "values": ["US"] },
        { "attribute": "plan", "operator": "EQUALS", "values": ["premium"] }
      ],
      "rollout": {
        "salt": "rand_9281a",
        "distributions": [
          { "variation_id": "var_on", "weight_percentage": 25 },
          { "variation_id": "var_off", "weight_percentage": 75 }
        ]
      }
    }
  ],
  "default_off_variation": "var_off"
}
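
To make the "conditions" array concrete, here is a minimal sketch of how an SDK might match it against user attributes. It assumes that all conditions in a rule must hold (logical AND) and that EQUALS succeeds when the attribute equals any listed value; neither assumption is stated by the payload itself, and the class names are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class ConditionMatcher {
    // Mirrors one {"attribute", "operator", "values"} object from the request payload.
    static final class Condition {
        final String attribute;
        final String operator;
        final List<String> values;

        Condition(String attribute, String operator, List<String> values) {
            this.attribute = attribute;
            this.operator = operator;
            this.values = values;
        }
    }

    static boolean matches(Condition condition, Map<String, String> user) {
        String actual = user.get(condition.attribute);
        if (actual == null) {
            return false; // a missing attribute never matches
        }
        switch (condition.operator) {
            case "EQUALS":
                return condition.values.contains(actual);
            default:
                return false; // unknown operators fail closed
        }
    }

    // Assumption: every condition in a rule must match.
    static boolean matchesAll(List<Condition> conditions, Map<String, String> user) {
        return conditions.stream().allMatch(c -> matches(c, user));
    }

    public static void main(String[] args) {
        List<Condition> usPremium = List.of(
                new Condition("country", "EQUALS", List.of("US")),
                new Condition("plan", "EQUALS", List.of("premium")));
        System.out.println(matchesAll(usPremium, Map.of("country", "US", "plan", "premium")));
        System.out.println(matchesAll(usPremium, Map.of("country", "US", "plan", "free")));
    }
}
```

Only users that pass this check would proceed to the rule's percentage rollout.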

Response Payload

{
  "id": "flg_982347102",
  "key": "payment_redesign_2026",
  "status": "ACTIVE",
  "version": 1,
  "created_at": "2026-05-22T17:44:00Z",
  "updated_at": "2026-05-22T17:44:00Z"
}
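
For reference, an admin tool could assemble this call with the JDK's built-in java.net.http API roughly as follows. The host flags.example.com is a placeholder, the JSON body is abbreviated, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class CreateFlagRequest {
    // Builds the POST /api/v1/flags request described above.
    static HttpRequest build(String baseUrl, String adminToken, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/flags"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + adminToken)
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and abbreviated body, for illustration only.
        HttpRequest request = build("https://flags.example.com", "adm_8e09f2",
                "{\"key\":\"payment_redesign_2026\"}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```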

3. High-Level Design (HLD)

To achieve zero latency, our feature flag architecture uses Local SDK Evaluation. Instead of executing an API call for every check, the application server downloads rules on startup, establishes an SSE stream for updates, and evaluates flags locally in process memory.

Global Flag Ingestion & Serving Diagram

graph TD
    %% Admin Actions
    Admin[Admin Panel Dashboard] -->|Save Flag / CRUD| DB[(PostgreSQL Main DB)]
    DB -->|Trigger Event| RedisStream[Redis Pub/Sub / Kafka Streams]
    
    %% Real-time Streaming
    RedisStream -->|Publish Delta Event| Gateway[Streaming SSE Gateways]
    Gateway -->|Server-Sent Events| ClientSDK[Client & Server SDKs]
    
    %% Backup & Scaling CDN Shield
    DB -->|Cron compiler| S3[AWS S3 Bucket]
    S3 -->|Edge Replication| CDN[Global CDN / Edge Cache]
    CDN -->|Initial Sync / Fallback| ClientSDK
    
    %% SDK Local evaluation
    subgraph Host Application Process
        ClientSDK -->|Pre-loaded JSON rules| LocalCache[In-Memory SDK Registry]
        AppLogic[Application Code] -->|eval.evaluateVar| LocalCache
    end

Real-Time Update Sequence

sequenceDiagram
    autonumber
    actor Admin as Platform Operator
    participant API as Config Admin Service
    participant DB as Postgres Storage
    participant PubSub as Redis Pub/Sub Event Bus
    participant Gateway as SSE Streaming Gateway
    participant SDK as Application Client SDK

    Admin->>API: Save Targeting Rules (Set Payment Flag = 50%)
    API->>DB: Write Flag Config & Increment Version
    DB-->>API: Write Success
    API->>PubSub: Publish Update Event (payment_redesign_2026, Version: 2)
    PubSub-->>Gateway: Forward Delta Packet
    Gateway->>SDK: Push Event (Event: "patch", Data: "stripe_v2_payload")
    SDK->>SDK: Apply Delta to In-Memory Map
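
The final step of the sequence, applying a delta to the in-memory map, is worth guarding against out-of-order delivery. The sketch below is one possible approach, assuming each patch carries the flag key and the version from the Publish Update Event; the class names and the opaque payload string are illustrative.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class FlagRegistry {
    static final class VersionedFlag {
        final String key;
        final int version;
        final String payload; // opaque rule JSON in this sketch

        VersionedFlag(String key, int version, String payload) {
            this.key = key;
            this.version = version;
            this.payload = payload;
        }
    }

    private final ConcurrentMap<String, VersionedFlag> flags = new ConcurrentHashMap<>();

    // Stores the patch only if it is newer than the version already held.
    // Returns true if the registry holds the incoming patch afterwards.
    boolean applyPatch(VersionedFlag incoming) {
        VersionedFlag stored = flags.merge(incoming.key, incoming,
                (current, candidate) -> candidate.version > current.version ? candidate : current);
        return stored == incoming;
    }

    // Returns 0 when the flag is unknown.
    int versionOf(String key) {
        VersionedFlag flag = flags.get(key);
        return flag == null ? 0 : flag.version;
    }

    public static void main(String[] args) {
        FlagRegistry registry = new FlagRegistry();
        registry.applyPatch(new VersionedFlag("payment_redesign_2026", 2, "{...}"));
        // A delayed version 1 patch arrives late and is ignored.
        registry.applyPatch(new VersionedFlag("payment_redesign_2026", 1, "{...}"));
        System.out.println(registry.versionOf("payment_redesign_2026"));
    }
}
```

Because merge is atomic per key on ConcurrentHashMap, the stream thread can apply patches while application threads keep reading.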

4. Low-Level Design (LLD) & Data Models

Database Schema (PostgreSQL)

We use a relational database to store flag entities, keep each flag's nested rules in a JSONB column, and record every change in a separate audit log table.

-- Feature Flags Core Definition
CREATE TABLE feature_flags (
    id VARCHAR(64) PRIMARY KEY,
    key_name VARCHAR(64) UNIQUE NOT NULL,
    title VARCHAR(128) NOT NULL,
    description TEXT,
    is_enabled BOOLEAN NOT NULL DEFAULT FALSE,
    rules_json JSONB NOT NULL,
    default_on_variation VARCHAR(64) NOT NULL,
    default_off_variation VARCHAR(64) NOT NULL,
    version INT NOT NULL DEFAULT 1,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

-- Audit log for regulatory compliance and tracing bugs
CREATE TABLE feature_flag_audit_logs (
    id SERIAL PRIMARY KEY,
    flag_key VARCHAR(64) NOT NULL,
    actor_email VARCHAR(128) NOT NULL,
    action_type VARCHAR(16) NOT NULL, -- CREATE, UPDATE, TOGGLE, DELETE
    previous_value JSONB,
    new_value JSONB,
    version INT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_audit_flag_key ON feature_flag_audit_logs (flag_key, created_at DESC);

Compilable Java Implementation: Feature Flag Evaluator

Below is a thread-safe, local evaluator utilizing MurmurHash3 to achieve deterministic and fair percentage distributions of variations based on user identifiers.

package com.codesprintpro.featureflag;

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FeatureFlagEvaluator {

    // Simulates process memory rules registry
    private final Map<String, FlagConfig> activeFlags = new ConcurrentHashMap<>();

    public static class FlagConfig {
        public final String key;
        public final String salt;
        public final int rolloutPercentage; // e.g. 25 for 25%
        public final String varOn;
        public final String varOff;

        public FlagConfig(String key, String salt, int rolloutPercentage, String varOn, String varOff) {
            this.key = key;
            this.salt = salt;
            this.rolloutPercentage = rolloutPercentage;
            this.varOn = varOn;
            this.varOff = varOff;
        }
    }

    public void updateFlag(FlagConfig config) {
        activeFlags.put(config.key, config);
    }

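    public String evaluate(String flagKey, String userId, String defaultFallback) {
        // ConcurrentHashMap rejects null keys, and a null userId cannot be bucketed,
        // so fall back instead of throwing inside the host application
        if (flagKey == null || userId == null) {
            return defaultFallback;
        }
        FlagConfig config = activeFlags.get(flagKey);
        if (config == null) {
            return defaultFallback;
        }

        // Target override check (simulating dynamic cohorts); assumes userId is an email address
        if (userId.endsWith("@company.com")) {
            return config.varOn; // Force enable for internal emails
        }

        // Deterministic hashing for percentage rollouts
        String hashInput = flagKey + ":" + config.salt + ":" + userId;
        // The 32-bit hash is widened to long first, so Math.abs is safe even for Integer.MIN_VALUE
        long hashValue = getMurmurHash3(hashInput);
        long bucket = Math.abs(hashValue) % 100;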

        if (bucket < config.rolloutPercentage) {
            return config.varOn;
        }
        return config.varOff;
    }

    // MurmurHash3 32-bit implementation for speed and uniformity
    private static int getMurmurHash3(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int length = data.length;
        int c1 = 0xcc9e2d51;
        int c2 = 0x1b873593;
        int h1 = 0;

        for (int i = 0; i < length - 3; i += 4) {
            int k1 = (data[i] & 0xff) |
                    ((data[i + 1] & 0xff) << 8) |
                    ((data[i + 2] & 0xff) << 16) |
                    ((data[i + 3] & 0xff) << 24);

            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;

            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Remaining bytes
        int k1 = 0;
        int tailIndex = length - (length % 4);
        switch (length % 4) {
            case 3:
                k1 ^= (data[tailIndex + 2] & 0xff) << 16;
            case 2:
                k1 ^= (data[tailIndex + 1] & 0xff) << 8;
            case 1:
                k1 ^= (data[tailIndex] & 0xff);
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
        }

        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;

        return h1;
    }

    public static void main(String[] args) {
        FeatureFlagEvaluator evaluator = new FeatureFlagEvaluator();

        // Register 25% rollout flag configuration
        evaluator.updateFlag(new FlagConfig("beta_dashboard", "salt_xyz123", 25, "ON", "OFF"));

        // Verify deterministic behaviors
        int matches = 0;
        for (int i = 0; i < 10000; i++) {
            String result = evaluator.evaluate("beta_dashboard", "user_" + i, "OFF");
            if ("ON".equals(result)) {
                matches++;
            }
        }

        System.out.println("Rollout matches over 10,000 iterations: " + matches);
        // Expect roughly 2,500 matches (about 25%); the exact count depends on the hash distribution
    }
}
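
The evaluator above handles a single on/off percentage, while the API payload allows a "distributions" list with several weighted variations. A minimal sketch of mapping a bucket in the range 0 to 99 onto such a list follows; it assumes weights are whole percentages and that the order of the list defines the bucket ranges. The class name and the fallback parameter are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WeightedRollout {
    // weights: variation id -> percentage; iteration order defines the bucket ranges.
    static String pick(int bucket, Map<String, Integer> weights, String fallback) {
        int upperBound = 0;
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            upperBound += entry.getValue();
            if (bucket < upperBound) {
                return entry.getKey();
            }
        }
        return fallback; // reached only if the weights sum to less than bucket + 1
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so var_on owns buckets 0-24.
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("var_on", 25);
        weights.put("var_off", 75);
        System.out.println(pick(24, weights, "var_off"));
        System.out.println(pick(25, weights, "var_off"));
    }
}
```

Keeping the variation order stable matters: raising var_on from 25 to 50 then only moves buckets 25 to 49, so users who already had the feature keep it.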

5. Scaling Challenges & Bottlenecks

1. The Dynamic Thundering Herd (Cascading Reconnections)

Problem: When a network event disrupts our SSE gateways, thousands of connected servers and clients disconnect. Upon recovery, they simultaneously reconnect and fire initial Sync HTTP requests. This thundering herd crushes backend database layers.
Mitigation:
- SDKs must retry with exponential backoff and random jitter so that reconnections spread out over time.
- Implement a CDN Shield in front of initial sync endpoints. Compiled static JSON rule sets should be written to an AWS S3 bucket and served via a global CDN such as Cloudflare or CloudFront. Initial sync requests are answered from cached CDN objects first, before anything touches the database partitions.
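
One common way to implement the retry policy is "full jitter": pick a random delay between zero and an exponentially growing cap. The sketch below assumes a base delay and a maximum delay chosen by the SDK author; the 500 ms and 30 s values in main are illustrative, not prescribed by this design.

```java
import java.util.concurrent.ThreadLocalRandom;

final class ReconnectBackoff {
    // Upper bound for a given attempt: base * 2^attempt, capped at maxMillis.
    static long capDelayMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30); // bound the shift to avoid overflow
        return Math.min(maxMillis, baseMillis << shift);
    }

    // Full jitter: a uniformly random delay in [0, cap].
    static long nextDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = capDelayMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": wait up to "
                    + capDelayMillis(attempt, 500, 30_000) + " ms, chose "
                    + nextDelayMillis(attempt, 500, 30_000) + " ms");
        }
    }
}
```

The randomness is the important part: without it, every SDK that disconnected at the same moment would retry at the same moments too.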

2. Client Side Flag Leakage

Problem: For front-end SDKs (e.g. mobile applications), downloading a full rule JSON exposes confidential features or internal company configurations to inquisitive users examining local application network logs.
Mitigation: Provide a specialized Public Edge Broker. When mobile applications establish connection, they supply current user metadata. The Broker evaluates targeted variations at the edge, compiling a pruned, user-specific flag-value dictionary (e.g. {"beta_dashboard": "ON"}) so that full, nested targeting rules are never sent down to client-side hardware.
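
A minimal sketch of the broker's core step is shown below: evaluate every flag for one user and return only the resolved values. Each flag's targeting logic is modeled as a plain function from user attributes to a variation, which is a simplification for illustration; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class EdgeBroker {
    // Resolves every flag for one user and returns only flag key -> variation value.
    // The targeting rules themselves never leave the broker.
    static Map<String, String> resolve(
            Map<String, Function<Map<String, String>, String>> evaluators,
            Map<String, String> userAttributes) {
        Map<String, String> resolved = new TreeMap<>();
        evaluators.forEach((flagKey, evaluator) ->
                resolved.put(flagKey, evaluator.apply(userAttributes)));
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Function<Map<String, String>, String>> evaluators = new HashMap<>();
        // Stand-in for a real targeting rule.
        evaluators.put("beta_dashboard", user -> "US".equals(user.get("country")) ? "ON" : "OFF");
        System.out.println(resolve(evaluators, Map.of("country", "US")));
    }
}
```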

6. Technical Trade-offs & Compromises

WebSockets vs. Server-Sent Events (SSE)

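WebSockets Choice: Offers bi-directional channels, which is useful if the client also needs to send rapid telemetry up to the server. However, WebSocket connections are normally established through an HTTP/1.1 Upgrade handshake, so they usually cannot share a multiplexed HTTP/2 connection. Applications typically add their own keep-alive heartbeats, and proxies and load balancers need explicit support for long-lived upgraded connections.
SSE Choice: Unidirectional server-to-client push channel that runs over plain HTTP (HTTP/1.1 or HTTP/2). The browser EventSource API reconnects automatically and resends the Last-Event-ID, responses can use standard HTTP compression, and most proxies pass the stream through as an ordinary long-lived response, provided they do not buffer it.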
Decision: We select Server-Sent Events (SSE). The unidirectional push model matches our flag update criteria perfectly, keeping server complexity at a minimum.
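
Part of SSE's appeal is how simple the wire format is: newline-separated "field: value" lines, with a blank line ending each event. The sketch below parses a single event frame, which is what a non-browser SDK would need to do itself. It is a simplified reading of the format (one frame at a time, unknown fields such as retry are ignored), and the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class SseFrameParser {
    static final class SseEvent {
        final String id;
        final String event;
        final String data;

        SseEvent(String id, String event, String data) {
            this.id = id;
            this.event = event;
            this.data = data;
        }
    }

    // Parses the lines of one event frame (the text up to the terminating blank line).
    static SseEvent parse(String frame) {
        String id = null;
        String event = "message"; // default event type when no event field is present
        List<String> dataLines = new ArrayList<>();
        for (String line : frame.split("\r?\n")) {
            if (line.isEmpty() || line.startsWith(":")) {
                continue; // blank line or comment (often used as a heartbeat)
            }
            int colon = line.indexOf(':');
            String field = colon < 0 ? line : line.substring(0, colon);
            String value = colon < 0 ? "" : line.substring(colon + 1);
            if (value.startsWith(" ")) {
                value = value.substring(1); // a single leading space is not part of the value
            }
            switch (field) {
                case "id": id = value; break;
                case "event": event = value; break;
                case "data": dataLines.add(value); break;
                default: break; // ignore fields this sketch does not handle
            }
        }
        return new SseEvent(id, event, String.join("\n", dataLines));
    }

    public static void main(String[] args) {
        SseEvent e = parse(": heartbeat\nid: 2\nevent: patch\ndata: {\"key\":\"payment_redesign_2026\"}\n\n");
        System.out.println(e.id + " " + e.event + " " + e.data);
    }
}
```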

7. Failure Scenarios & Operational Resiliency

1. Complete CDN Outage

Scenario: Global CDN shield experiences a massive service interruption, blocking standard SDK initialization paths.
Resiliency Plan: Client SDKs embed a Bootstrap Fallback File inside application binaries during build time. If CDN and SSE hosts are unreachable, the SDK initializes utilizing these hardcoded fallback values to keep the core application up and running.
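
The initialization order implied here (remote sources first, embedded bootstrap last) can be sketched as a simple fallback chain. Sources are modeled as suppliers that either return a rule map, return null, or throw; the names are hypothetical, and a real SDK would also apply timeouts to each source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class RuleLoader {
    // Tries each source in order; any failure moves on to the next one.
    // The embedded bootstrap values are returned if every source fails.
    static Map<String, String> load(List<Supplier<Map<String, String>>> sources,
                                    Map<String, String> bootstrap) {
        for (Supplier<Map<String, String>> source : sources) {
            try {
                Map<String, String> rules = source.get();
                if (rules != null) {
                    return rules;
                }
            } catch (RuntimeException e) {
                // Source unreachable: fall through to the next one.
            }
        }
        return bootstrap;
    }

    public static void main(String[] args) {
        List<Supplier<Map<String, String>>> sources = new ArrayList<>();
        sources.add(() -> { throw new IllegalStateException("CDN unreachable"); });
        sources.add(() -> null); // e.g. the streaming host returned nothing
        System.out.println(load(sources, Map.of("beta_dashboard", "OFF")));
    }
}
```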

2. SSE Gateway Memory Exhaustion

Scenario: A spike in browser client connections causes high garbage collection overhead and out-of-memory crashes on SSE nodes.
Resiliency Plan: We implement strict connection limits per SSE container backed by standard Kubernetes HPA (Horizontal Pod Autoscaling) rules. When containers reach 80% RAM capacity, traffic is diverted to auxiliary replicas while active instances perform graceful connection drop-off.
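
Inside each gateway process, the per-container connection limit could be enforced with a small admission counter like the one below. This is a sketch of the in-process side only; the limit value and class name are illustrative, and autoscaling itself is configured in Kubernetes, not in this code.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ConnectionLimiter {
    private final int maxConnections;
    private final AtomicInteger open = new AtomicInteger();

    ConnectionLimiter(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    // Admits a new stream only while the container is under its limit.
    boolean tryAcquire() {
        while (true) {
            int current = open.get();
            if (current >= maxConnections) {
                return false; // caller should reject so the client retries elsewhere
            }
            if (open.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Called when a stream closes; never drops below zero.
    void release() {
        open.updateAndGet(n -> Math.max(0, n - 1));
    }

    int openConnections() {
        return open.get();
    }

    public static void main(String[] args) {
        ConnectionLimiter limiter = new ConnectionLimiter(2);
        System.out.println(limiter.tryAcquire());
        System.out.println(limiter.tryAcquire());
        System.out.println(limiter.tryAcquire()); // over the limit
    }
}
```

Rejecting early keeps memory bounded, and a rejected client simply falls back to its jittered retry against another replica.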

3. Infinite Rule Recursion

Scenario: A user defines a circular rule where Cohort A references Cohort B, which in turn references Cohort A, causing stack overflow inside the SDK during evaluation.
Resiliency Plan: Validate rules upon ingestion in the Admin dashboard by running a depth-first cycle check that confirms cohort references form a Directed Acyclic Graph (DAG). The SDK also enforces a hard limit on rule evaluation depth (max depth = 3), so a cycle that slips through validation cannot cause unbounded recursion.
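
The ingestion-time check can be sketched as a standard depth-first search that tracks which cohorts are on the current path. The input shape (cohort name mapped to the cohorts it references) is an assumption for illustration, as is the class name.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CohortCycleValidator {
    // references: cohort -> cohorts it refers to. Returns true if any cycle exists.
    static boolean hasCycle(Map<String, List<String>> references) {
        Set<String> finished = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String cohort : references.keySet()) {
            if (visit(cohort, references, finished, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String cohort, Map<String, List<String>> references,
                                 Set<String> finished, Set<String> onPath) {
        if (onPath.contains(cohort)) {
            return true; // reached a cohort that is already on the current path
        }
        if (finished.contains(cohort)) {
            return false; // already fully explored without finding a cycle
        }
        onPath.add(cohort);
        for (String next : references.getOrDefault(cohort, Collections.emptyList())) {
            if (visit(next, references, finished, onPath)) {
                return true;
            }
        }
        onPath.remove(cohort);
        finished.add(cohort);
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasCycle(Map.of("A", List.of("B"), "B", List.of("A"))));
        System.out.println(hasCycle(Map.of("A", List.of("B"), "B", List.of("C"))));
    }
}
```

The Admin API would reject a save whenever hasCycle returns true for the updated reference graph.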

8. Candidate Verbal Script

Below is a sample script showing how a candidate might walk through this system design interview.

Interviewer: "Design a highly available and real-time Feature Flag Platform at scale."

Candidate: "I will design this feature flag platform with two core philosophies: Zero-Latency In-Memory Evaluation and Resilient Unidirectional Streaming.

To avoid blocking the host application's runtime loop, we must never execute an API call during flag checks. Instead, our SDK will download compiled JSON rules at startup and run targeting evaluation entirely in local process memory using uniform hashing algorithms like MurmurHash3.

For real-time propagation (such as kill switches), we will set up Server-Sent Events (SSE) Gateways. When engineers modify flag configurations on our Admin Dashboard, these updates will trigger event signals inside a Redis Pub/Sub channel. The SSE gateways will immediately push these delta patches to connected client SDKs within 1 second.

To scale the initial synchronization of 1 Million concurrent client sessions without hammering our databases, I will put a CDN Shield in front of our sync endpoints. We will write compiled flag configurations directly to S3 and distribute them via CDNs.

Finally, to address mobile client security concerns, I will separate our SDK design into Server-Side (which receives the full, raw rule dictionary) and Client-Side (where a lightweight edge broker resolves targeting first, returning only resolved key-value results to client devices to prevent rule leakage)."

Interviewer: "What happens if a user gets disconnected from the SSE Gateway? How does the SDK retrieve missed flag updates?"

Candidate: "When the client reconnects, it includes the last seen flag configuration version header (Last-Event-ID). The SSE Gateway compares this version against the latest active configuration in the database. If there are missing updates, it sends a compacted patch payload. If the version gap is too wide, the gateway instructs the SDK to trigger a full refresh from the nearest CDN edge node."
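
The resume logic the candidate describes can be sketched as follows. This version assumes the Last-Event-ID is a single increasing sequence number for the whole stream and that the gateway keeps a bounded in-memory buffer of recent patches instead of querying the database on every reconnect; both are assumptions for illustration, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

final class ResumeBuffer {
    private final NavigableMap<Long, String> patchesBySequence = new TreeMap<>();
    private final int capacity;

    ResumeBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Records a published patch and evicts the oldest entries beyond capacity.
    synchronized void record(long sequence, String patch) {
        patchesBySequence.put(sequence, patch);
        while (patchesBySequence.size() > capacity) {
            patchesBySequence.pollFirstEntry();
        }
    }

    // Returns the patches the client missed, or Optional.empty() when the gap
    // is wider than the buffer and the SDK must do a full refresh from the CDN.
    synchronized Optional<List<String>> patchesSince(long lastEventId) {
        if (patchesBySequence.isEmpty() || lastEventId >= patchesBySequence.lastKey()) {
            return Optional.of(new ArrayList<>()); // client is already up to date
        }
        if (lastEventId + 1 < patchesBySequence.firstKey()) {
            return Optional.empty(); // some missed patches were already evicted
        }
        return Optional.of(new ArrayList<>(patchesBySequence.tailMap(lastEventId, false).values()));
    }

    public static void main(String[] args) {
        ResumeBuffer buffer = new ResumeBuffer(3);
        for (long seq = 1; seq <= 5; seq++) {
            buffer.record(seq, "p" + seq);
        }
        System.out.println(buffer.patchesSince(3)); // missed p4 and p5
        System.out.println(buffer.patchesSince(1)); // gap too wide
    }
}
```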

Key Takeaways

Create, update, and delete flags.
Support Percentage Rollouts (e.g., enable for 5% of users).
Support Targeted Rollouts (e.g., enable for internal testers).
