Case Study: Design a Feature Flag System
Mental Model
Connecting isolated components into a resilient, scalable, and observable distributed web.
Feature flags allow you to decouple Code Deployment from Feature Release. You can deploy code to production but keep the feature disabled until you are ready to "flip the switch."
1. Requirement Clarification
graph LR
Producer[Producer Service] -->|Publish Event| Kafka[Kafka / Event Bus]
Kafka -->|Consume| Consumer1[Consumer Group A]
Kafka -->|Consume| Consumer2[Consumer Group B]
Consumer1 --> DB1[(Primary DB)]
Consumer2 --> Cache[(Redis)]
Functional
- Create, update, and delete flags.
- Support Percentage Rollouts (e.g., enable for 5% of users).
- Support Targeted Rollouts (e.g., enable for internal testers).
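The functional requirements above can be sketched as a small data model. This is a minimal illustration only: `FlagRule`, `FlagStore`, and every field name are hypothetical and do not come from any particular flag product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlagRule:
    """Hypothetical rule shape; all field names are illustrative assumptions."""
    key: str
    enabled: bool = True          # master on/off for the flag
    percentage: int = 0           # percentage rollout, 0..100
    allow_list: frozenset = field(default_factory=frozenset)  # targeted users

    def __post_init__(self):
        if not 0 <= self.percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")


class FlagStore:
    """In-memory create/update/delete, standing in for the dashboard backend."""

    def __init__(self):
        self._flags = {}

    def upsert(self, rule: FlagRule) -> None:
        # Create and update share one path: the latest rule for a key wins.
        self._flags[rule.key] = rule

    def delete(self, key: str) -> None:
        # Deleting a missing flag is treated as a no-op.
        self._flags.pop(key, None)

    def get(self, key: str):
        return self._flags.get(key)
```

A rule such as `FlagRule(key="new-checkout", percentage=5, allow_list=frozenset({"tester-1"}))` would cover both a 5% rollout and a targeted tester in one record.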
Non-Functional
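- Near-Zero Latency: Evaluation must happen in-process. A flag check cannot require a network call.
- Availability: The flag management service must be up whenever a flag needs to change, since an outage there blocks rollouts and rollbacks.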
- Scalability: Handle millions of SDK connections.
2. High-Level Architecture
- Dashboard: Where admins define rules.
- CDN / Edge: Stores the compiled rule set.
- SDK: A library embedded in the application. It fetches the rule set and evaluates it locally in RAM.
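As a rough sketch of the SDK side of this architecture, the class below keeps the fetched rule set in memory and answers flag checks without any network access. The JSON shape (`{"flag-key": {"enabled": true}}`) and the names `LocalFlagCache`, `load`, and `is_enabled` are assumptions made for illustration.

```python
import json
import threading


class LocalFlagCache:
    """Holds the latest rule set in RAM so reads never touch the network."""

    def __init__(self):
        self._rules = {}
        self._lock = threading.Lock()

    def load(self, blob: str) -> None:
        # Parse first, then swap the whole dict, so readers never
        # observe a half-applied rule set.
        parsed = json.loads(blob)
        with self._lock:
            self._rules = parsed

    def is_enabled(self, flag_key: str, default: bool = False) -> bool:
        rule = self._rules.get(flag_key)
        if rule is None:
            return default  # unknown flag: fall back to the caller's default
        return bool(rule.get("enabled", default))
```

Passing a `default` at the call site means the application still behaves predictably if the rule set has not been downloaded yet.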
3. How Local Evaluation Works
To avoid an API call on every if (flag), the SDK:
- Downloads a JSON blob of rules on startup.
- Subscribes to a Stream (SSE or WebSockets) for real-time updates.
- Executes the logic hash(userId + flagKey) % 100 < percentage to decide if the user sees the feature.
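The bucketing step can be sketched as follows. This assumes SHA-256 as the hash and a `:` separator between the two keys; neither choice comes from the text above. Python's built-in `hash()` is avoided because string hashing is randomized per process by default, which would place the same user in different buckets on different servers.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map a (user, flag) pair to a stable bucket in 0..99."""
    # Assumption: SHA-256 with a ":" separator; any stable hash would work.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, flag_key: str, percentage: int) -> bool:
    """Return True if the user's bucket falls inside the rollout percentage."""
    return bucket(user_id, flag_key) < percentage
```

Because the flag key is part of the hash input, a user who lands in the first 5% for one flag is not automatically in the first 5% for every other flag. Raising the percentage only adds users, since a given user's bucket never changes.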
4. The Kill Switch
A kill switch is a special type of flag that immediately disables a broken feature globally. Its updates must reach every SDK with ultra-low propagation delay (< 1 second).
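One way to picture a kill switch is as a check that runs before every other rule. The sketch below is an assumption-heavy illustration: the event shape (`{"flag": ..., "type": "kill"}`) and the names `FlagState`, `apply_update`, and `evaluate` are invented for this example, and the user's bucket is passed in rather than computed.

```python
class FlagState:
    """Per-flag state held in the SDK's memory (hypothetical shape)."""

    def __init__(self, percentage=0, allow_list=()):
        self.killed = False
        self.percentage = percentage
        self.allow_list = set(allow_list)


def apply_update(flags: dict, event: dict) -> None:
    """Apply one streamed update; only the made-up "kill" event is handled."""
    state = flags.get(event["flag"])
    if state is not None and event.get("type") == "kill":
        state.killed = True


def evaluate(flags: dict, flag_key: str, user_id: str, bucket: int) -> bool:
    state = flags.get(flag_key)
    if state is None or state.killed:
        return False  # the kill switch wins over every other rule
    if user_id in state.allow_list:
        return True   # targeted users bypass the percentage check
    return bucket < state.percentage
```

Checking `killed` first means that a single streamed event turns the feature off for targeted users and percentage-rollout users alike, with no redeploy.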
Final Takeaway
Feature flags are about Risk Mitigation. They allow you to test in production safely and roll back instantly without a full deployment.
Technical Trade-offs: Messaging Systems
| Pattern | Ordering | Durability | Throughput | Complexity |
|---|---|---|---|---|
| Log-based (Kafka) | Strict (per partition) | High | Very High | High |
| Memory-based (Redis Pub/Sub) | None | Low | High | Very Low |
| Push-based (RabbitMQ) | Fair | Medium | Medium | Medium |
Production Readiness Checklist
Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:
- High Availability: Have we eliminated single points of failure across all layers?
- Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
- Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
- Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
- Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?