Case Study: Design a Fraud Detection System
Mental Model
Connecting isolated components into a resilient, scalable, and observable distributed web.
Fraud detection is a low-latency, high-precision problem. You must decide whether a transaction is fraudulent before the payment is processed, within the latency budget set in the requirements below (< 100ms).
1. Requirement Clarification
graph LR
Producer[Producer Service] -->|Publish Event| Kafka[Kafka / Event Bus]
Kafka -->|Consume| Consumer1[Consumer Group A]
Kafka -->|Consume| Consumer2[Consumer Group B]
Consumer1 --> DB1[(Primary DB)]
Consumer2 --> Cache[(Redis)]
Functional
- Analyze a transaction and return a "Score" (0-100).
- Allow analysts to define rules (e.g., "Block if amount > $5000 and user is new").
- Support feedback loops (e.g., an analyst manually marking a transaction as fraud so the label can inform later rules and model training).
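The analyst-defined rule quoted above can be sketched as data plus a predicate. This is a minimal illustration, not a real rule-engine API: the `Rule` class, the `evaluate_rules` helper, the transaction fields `amount` and `account_age_days`, and the choice of "new" meaning an account younger than 7 days are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Rule:
    """One analyst-defined rule (hypothetical shape)."""
    name: str
    predicate: Callable[[dict], bool]
    action: str  # e.g. "BLOCK" or "REVIEW"


# "Block if amount > $5000 and user is new".
# Assumption: "new" means the account is less than 7 days old.
RULES = (
    Rule(
        name="large_amount_new_user",
        predicate=lambda tx: tx["amount"] > 5000 and tx["account_age_days"] < 7,
        action="BLOCK",
    ),
)


def evaluate_rules(tx: dict, rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the names of every rule the transaction triggers."""
    return [rule.name for rule in rules if rule.predicate(tx)]
```

Keeping rules as data rather than code paths is one way to let analysts add or change them without redeploying the service.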
Non-Functional
- Latency: Must respond in < 100ms.
- Accuracy: Minimize False Positives (don't block legitimate users).
- Availability: The system sits on the payment path, so it must stay up; an outage would stall or block payments.
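One way the latency and availability goals can be reconciled is to bound the scoring call with a deadline and fall back to a default when it is missed. The sketch below assumes a 100 ms budget (from the latency requirement) and a fallback score of 50; `score_with_deadline` and both defaults are hypothetical names and values, not part of the original design.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared pool; a per-call "with" block would wait for slow calls on exit
# and defeat the deadline.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def score_with_deadline(
    score_fn: Callable[[], int],
    fallback_score: int = 50,   # assumption: an uncertain, mid-range score
    timeout_s: float = 0.1,     # 100 ms budget from the requirements
) -> int:
    """Run score_fn, but return fallback_score if it is slow or fails."""
    future = _EXECUTOR.submit(score_fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already-running call is not interrupted
        return fallback_score
    except Exception:
        # The scoring dependency failed outright; degrade instead of erroring.
        return fallback_score
```

Whether the fallback should lean toward accepting or reviewing is a business decision; the value here is only a placeholder.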
2. High-Level Architecture
- Rule Engine: Evaluates deterministic, analyst-defined rules.
- ML Model Service: Generates a probabilistic score.
- Feature Store: Provides real-time data (e.g., "How many times did this user try to pay in the last 1 min?").
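A rough sketch of how the three components might be wired together for a single request. The `assess` function, its callable parameters, and the convention that a rule hit short-circuits to a score of 100 are assumptions for illustration; the text does not specify how rule and model outputs are combined.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def assess(
    tx: dict,
    fetch_features: Callable[[str], Features],       # Feature Store lookup
    rule_blocks: Callable[[dict, Features], bool],   # Rule Engine verdict
    model_score: Callable[[dict, Features], float],  # ML probability in [0, 1]
) -> int:
    """Return a fraud score from 0 to 100 for one transaction."""
    features = fetch_features(tx["user_id"])

    # Assumption: a blocking rule overrides the model entirely.
    if rule_blocks(tx, features):
        return 100

    probability = model_score(tx, features)
    # Clamp so a misbehaving model cannot produce an out-of-range score.
    return max(0, min(100, round(100 * probability)))
```

Passing the components in as callables keeps the sketch testable with stubs; in a real service they would be network clients.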
3. The Feature Store (Real-time Context)
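Models need context: a single transaction says little on its own, so the model also needs recent behavioral signals about the user.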
- Architecture: Transaction Event $\rightarrow$ Flink/Spark Streaming $\rightarrow$ Redis (Feature Store).
- Lookup: During a request, the Fraud Service pulls features from Redis in $O(1)$.
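The kind of feature the stream processor would maintain can be illustrated with a sliding-window counter. The sketch below is an in-memory stand-in, not Redis or Flink code; `SlidingWindowCounter` and its methods are hypothetical, and it assumes events for a given user arrive in timestamp order.

```python
import time
from collections import deque
from typing import Deque, Dict, Optional


class SlidingWindowCounter:
    """Counts events per user within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = {}

    def record(self, user_id: str, ts: Optional[float] = None) -> None:
        """Store one payment attempt (assumes in-order timestamps)."""
        stamp = time.time() if ts is None else ts
        self._events.setdefault(user_id, deque()).append(stamp)

    def count(self, user_id: str, now: Optional[float] = None) -> int:
        """Return how many attempts fall inside the window ending at `now`."""
        events = self._events.get(user_id)
        if events is None:
            return 0
        current = time.time() if now is None else now
        cutoff = current - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)
```

For example, after recording attempts at t=0, 30, and 59 seconds, `count(..., now=70.0)` sees only the last two, because the first has aged out of the 60-second window.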
4. Decision Workflow
- Score < 30: Accept.
- Score 30-70: Manual Review / Step-up Auth (MFA).
- Score > 70: Reject.
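The thresholds above map directly onto a small function. The `Decision` enum and `decide` name are illustrative; the boundary handling (30 and 70 both fall into review) follows the ranges as written.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REVIEW = "review"   # manual review or step-up auth (MFA)
    REJECT = "reject"


def decide(score: int, review_from: int = 30, reject_above: int = 70) -> Decision:
    """Map a 0-100 fraud score to an action using the thresholds in the text."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < review_from:
        return Decision.ACCEPT
    if score > reject_above:
        return Decision.REJECT
    return Decision.REVIEW
```

Keeping the thresholds as parameters makes it easier to tune the trade-off between false positives and missed fraud without touching the logic.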
Final Takeaway
Fraud detection is about Real-time Data Enrichment. The model is only as good as the features you can pull in the few milliseconds you have.
Technical Trade-offs: Messaging Systems
| Pattern | Ordering | Durability | Throughput | Complexity |
|---|---|---|---|---|
| Log-based (Kafka) | Strict (per partition) | High | Very High | High |
| Memory-based (Redis Pub/Sub) | None | Low | High | Very Low |
| Push-based (RabbitMQ) | Per queue (FIFO; can break on requeue) | Medium | Medium | Medium |
Production Readiness Checklist
Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:
- High Availability: Have we eliminated single points of failure across all layers?
- Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
- Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
- Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
- Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
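As an illustration of the idempotency item, a scoring endpoint can remember results by an idempotency key so that a retried request is answered from the stored result instead of being processed twice. This is a minimal in-memory sketch; `IdempotentScorer` is a hypothetical name, and a production version would presumably use a shared store with expiry rather than a local dict.

```python
import threading
from typing import Callable, Dict


class IdempotentScorer:
    """Wraps a scoring function so retries with the same key run it once."""

    def __init__(self, score_fn: Callable[[dict], int]) -> None:
        self._score_fn = score_fn
        self._results: Dict[str, int] = {}
        self._lock = threading.Lock()

    def score(self, idempotency_key: str, tx: dict) -> int:
        # The lock keeps two concurrent retries from both running score_fn.
        # Holding it during scoring is simple but serializes all requests;
        # acceptable for a sketch, not for a latency-sensitive service.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = self._score_fn(tx)
            self._results[idempotency_key] = result
            return result
```

The caller is assumed to supply a key that is stable across retries of the same payment attempt, such as a client-generated request ID.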