System Design: Building a Feature Flag Platform

Design a production feature flag platform: flag schemas, targeting rules, percentage rollouts, local SDK evaluation, streaming updates, audit logs, kill switches, experiments, consistency tradeoffs, and flag lifecycle management.

Case Study: Design a Feature Flag System

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

Feature flags allow you to decouple Code Deployment from Feature Release. You can deploy code to production but keep the feature disabled until you are ready to "flip the switch."

1. Requirement Clarification

graph LR
    Producer[Producer Service] -->|Publish Event| Kafka[Kafka / Event Bus]
    Kafka -->|Consume| Consumer1[Consumer Group A]
    Kafka -->|Consume| Consumer2[Consumer Group B]
    Consumer1 --> DB1[(Primary DB)]
    Consumer2 --> Cache[(Redis)]

Functional

  • Create, update, and delete flags.
  • Support Percentage Rollouts (e.g., enable for 5% of users).
  • Support Targeted Rollouts (e.g., enable for internal testers).
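These requirements suggest a small flag schema. The following is a minimal sketch, not a definitive design: the names `Flag`, `TargetingRule`, `rollout_percentage`, and the `group` attribute are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetingRule:
    """Enable the flag for users whose attribute matches one of the values."""
    attribute: str            # hypothetical attribute name, e.g. "group"
    values: frozenset[str]    # e.g. {"internal-testers"}


@dataclass(frozen=True)
class Flag:
    key: str
    enabled: bool = False         # master on/off state
    rollout_percentage: int = 0   # 0-100, share of users who see the feature
    targeting_rules: tuple[TargetingRule, ...] = ()

    def __post_init__(self) -> None:
        # Reject percentages that cannot be interpreted as a share of users.
        if not 0 <= self.rollout_percentage <= 100:
            raise ValueError("rollout_percentage must be between 0 and 100")


def matches_targeting(flag: Flag, user_attributes: dict[str, str]) -> bool:
    """Return True if any targeting rule matches the user's attributes."""
    return any(
        user_attributes.get(rule.attribute) in rule.values
        for rule in flag.targeting_rules
    )
```

For example, a flag with a rule on `group` = `internal-testers` would match a user whose attributes include that group and ignore everyone else, independent of the percentage rollout.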

Non-Functional

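  • Near-Zero Latency: Evaluation should happen in-process against an in-memory rule set. It cannot require an API call per flag check.
  • Availability: The control plane must be up to update flags, and SDKs should keep evaluating from their last-known rule set if it is not.
  • Scalability: Handle millions of concurrent SDK connections.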

2. High-Level Architecture

  1. Dashboard: Where admins define rules.
  2. CDN / Edge: Stores the compiled rule set and serves it to SDKs.
  3. SDK: A library embedded in the application. It fetches the rule set and evaluates it locally in RAM.
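To make the SDK side concrete, here is a hedged sketch of an in-memory rule store. The payload shape (a JSON document with `version` and `flags` fields) and the class name `RuleStore` are assumptions for illustration; the lesson does not specify a wire format.

```python
import json
import threading
from typing import Optional


class RuleStore:
    """Holds the most recent rule set in process memory."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = -1
        self._flags: dict = {}

    def apply(self, payload: str) -> bool:
        """Install a rule set if it is newer than the current one.

        Returns True if the payload was applied, False if it was stale.
        """
        doc = json.loads(payload)  # assumed shape: {"version": int, "flags": {...}}
        with self._lock:
            if doc["version"] <= self._version:
                return False  # ignore out-of-order or duplicate updates
            self._version = doc["version"]
            self._flags = doc["flags"]
            return True

    def get(self, flag_key: str) -> Optional[dict]:
        """Look up a flag definition without any network I/O."""
        with self._lock:
            return self._flags.get(flag_key)
```

The same `apply` method could serve both the initial download and later updates; the version check is one simple way to keep a late-arriving older rule set from overwriting a newer one.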

3. How Local Evaluation Works

To avoid an API call on every if (flag), the SDK:

  1. Downloads a JSON blob of rules on startup.
  2. Subscribes to a Stream (SSE or WebSockets) for real-time updates.
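  3. Executes the logic in-process: hash(userId + flagKey) % 100 < percentage decides if the user sees the feature. The hash must be deterministic across processes and restarts so that a given user always lands in the same bucket.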
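The bucketing step can be sketched as follows. This is one possible implementation, assuming SHA-256 as the stable hash and a `:` separator between the flag key and user ID; neither choice comes from the lesson. Python's built-in `hash()` is avoided because string hashes are randomized per process by default.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map a (user, flag) pair to a stable bucket in the range 0-99."""
    # The separator reduces accidental collisions such as ("ab", "c") vs ("a", "bc").
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to 100 buckets.
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, flag_key: str, percentage: int) -> bool:
    """Return True if the user falls inside the rollout percentage."""
    return bucket(user_id, flag_key) < percentage
```

Because the bucket depends only on the user ID and flag key, raising the percentage from 5 to 20 keeps every user who was already enabled and only adds new ones. Including the flag key in the hash means the same users are not always the first to receive every feature.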

4. The Kill Switch

A kill switch is a special type of flag that can immediately disable a broken feature globally, overriding any rollout or targeting rules. It must have very low propagation delay (under 1 second).
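One way to express this in the evaluation path is to check the kill switch before anything else. The sketch below assumes a flag represented as a dictionary with hypothetical `killed`, `enabled`, and `rollout_percentage` fields, and a precomputed user bucket in the range 0-99.

```python
def is_enabled(flag: dict, user_bucket: int) -> bool:
    """Evaluate a flag; a kill switch wins over every other rule."""
    if flag.get("killed", False):
        return False  # kill switch: off for everyone, regardless of rollout
    if not flag.get("enabled", False):
        return False  # flag has not been turned on yet
    # Percentage rollout: users in buckets below the threshold see the feature.
    return user_bucket < flag.get("rollout_percentage", 0)
```

Checking the kill switch first keeps the override independent of however complex the remaining rules become.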

Final Takeaway

Feature flags are about Risk Mitigation. They allow you to test in production safely and roll back instantly without a full deployment.

Technical Trade-offs: Messaging Systems

| Pattern | Ordering | Durability | Throughput | Complexity |
| Log-based (Kafka) | Strict (per partition) | High | Very High | High |
| Memory-based (Redis Pub/Sub) | None | Low | High | Very Low |
| Push-based (RabbitMQ) | Fair | Medium | Medium | Medium |

Production Readiness Checklist

Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:

  • High Availability: Have we eliminated single points of failure across all layers?
  • Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
  • Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
  • Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
  • Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
