System Design: Building a Fraud Detection Platform

Design a production fraud detection platform with real-time scoring, rules, ML models, feature stores, case management, feedback loops, and safe decision workflows for payments and account abuse.

Case Study: Design a Fraud Detection System

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

Fraud detection is a low-latency, high-precision problem: the system must decide whether a transaction is fraudulent in under 100 ms, before the payment is processed.

1. Requirement Clarification

graph LR
    Producer[Producer Service] -->|Publish Event| Kafka[Kafka / Event Bus]
    Kafka -->|Consume| Consumer1[Consumer Group A]
    Kafka -->|Consume| Consumer2[Consumer Group B]
    Consumer1 --> DB1[(Primary DB)]
    Consumer2 --> Cache[(Redis)]

Functional

  • Analyze a transaction and return a "Score" (0-100).
  • Allow analysts to define rules (e.g., "Block if amount > $5000 and user is new").
  • Support feedback loops (Marking a transaction as fraud manually).
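
One way to let analysts define rules without redeploying code is to treat each rule as data: a name, an action, and a predicate over the transaction. The sketch below is a minimal illustration of the example rule above, not a prescribed design. The `Transaction` fields, the `Rule` class, and the 7-day cutoff for a "new" user are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields; a real payload would carry many more.
    amount_usd: float
    account_age_days: int


@dataclass(frozen=True)
class Rule:
    name: str
    action: str  # e.g. "block" or "review"
    predicate: Callable[[Transaction], bool]


# Assumption: "new" means the account is younger than 7 days.
NEW_ACCOUNT_DAYS = 7

RULES: List[Rule] = [
    Rule(
        name="large_amount_new_user",
        action="block",
        predicate=lambda t: t.amount_usd > 5000
        and t.account_age_days < NEW_ACCOUNT_DAYS,
    ),
]


def matched_rules(txn: Transaction, rules: List[Rule] = RULES) -> List[Rule]:
    """Return every rule whose predicate fires for this transaction."""
    return [rule for rule in rules if rule.predicate(txn)]
```

Returning all matched rules, rather than stopping at the first, keeps the reasons available for case management when an analyst later reviews the decision.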

Non-Functional

  • Latency: Must respond in < 100ms.
  • Accuracy: Minimize False Positives (don't block legitimate users).
  • Availability: The scoring path must stay highly available so that an outage in fraud detection does not block payments.

2. High-Level Architecture

  1. Rule Engine: Evaluates deterministic, analyst-defined rules.
  2. ML Model Service: Generates a probabilistic score.
  3. Feature Store: Provides real-time data (e.g., "How many times did this user try to pay in the last 1 min?").
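
The three components can be wired together in a single scoring call. The following is a rough sketch under stated assumptions: `fetch_features`, `rule_score`, and `model_score` are hypothetical callables standing in for the Feature Store, Rule Engine, and ML Model Service, the 50 ms model budget is an assumed slice of the overall latency budget, and the fallback score of 50 is an illustrative choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool for model calls (size is an arbitrary choice here).
_executor = ThreadPoolExecutor(max_workers=4)


def score_transaction(txn, fetch_features, rule_score, model_score,
                      budget_s=0.05, fallback_score=50):
    """Return a 0-100 score, falling back if the model misses its budget."""
    features = fetch_features(txn)

    # Rules run first; a rule may short-circuit with a hard score.
    hard = rule_score(txn, features)  # None means no rule fired
    if hard is not None:
        return hard

    # Call the model with a deadline so a slow model cannot stall the payment.
    future = _executor.submit(model_score, features)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        return fallback_score


def slow_model(features):
    """Hypothetical stand-in for a model service that is too slow."""
    time.sleep(0.2)
    return 90
```

With `slow_model` plugged in, the call returns the fallback score once the budget expires instead of waiting the full 200 ms. Whether a timeout should fall back to accepting, reviewing, or rejecting is a business decision; the value here is only a placeholder.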

3. The Feature Store (Real-time Context)

Models need context.

  • Architecture: Transaction Event $\rightarrow$ Flink/Spark Streaming $\rightarrow$ Redis (Feature Store).
  • Lookup: During a request, the Fraud Service reads precomputed features from Redis by key, an $O(1)$ operation.
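
To make the "payments in the last 1 min" feature concrete, here is a minimal in-process sketch of a sliding-window counter. It is a stand-in for illustration only: in the architecture above the streaming job would maintain the value and write it to Redis. The `VelocityCounter` class and its method names are hypothetical.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts events per user inside a trailing time window."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._events = defaultdict(deque)  # user_id -> timestamps, oldest first

    def record(self, user_id: str, ts: float) -> None:
        """Register one payment attempt at time ts (seconds)."""
        self._events[user_id].append(ts)

    def count(self, user_id: str, now: float) -> int:
        """Number of attempts with a timestamp in (now - window_s, now]."""
        q = self._events.get(user_id)
        if q is None:
            return 0
        # Evict timestamps that have aged out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)
```

Because timestamps arrive in order, eviction only ever touches the front of the deque, so both recording and counting stay cheap per event.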

4. Decision Workflow

  • Score < 30: Accept.
  • Score 30-70: Manual Review / Step-up Auth (MFA).
  • Score > 70: Reject.
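
The thresholds above map directly to a small decision function. This is a sketch of the mapping only; the action names are illustrative labels, and real thresholds would be tuned against false-positive and fraud-loss rates.

```python
def decide(score: float) -> str:
    """Map a 0-100 fraud score to an action using the thresholds above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < 30:
        return "accept"
    if score <= 70:
        return "review"  # manual review or step-up auth (MFA)
    return "reject"
```

Note that both boundary values, 30 and 70, fall in the review band, matching the ranges listed above.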

Final Takeaway

Fraud detection is about Real-time Data Enrichment. The model is only as good as the features you can pull in the few milliseconds you have.

Technical Trade-offs: Messaging Systems

| Pattern | Ordering | Durability | Throughput | Complexity |
| --- | --- | --- | --- | --- |
| Log-based (Kafka) | Strict (per partition) | High | Very High | High |
| Memory-based (Redis Pub/Sub) | None | Low | High | Very Low |
| Push-based (RabbitMQ) | Fair | Medium | Medium | Medium |

Production Readiness Checklist

Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:

  • High Availability: Have we eliminated single points of failure across all layers?
  • Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
  • Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
  • Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
  • Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
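
As an illustration of the idempotency item, the sketch below returns the stored decision when the same request is retried, so a retry never produces a second, possibly different, decision. It is a single-process approximation: the `DecisionCache` class and the idea of keying on a client-supplied idempotency key are assumptions, and a production system would more likely use a shared store with expiry.

```python
import threading


class DecisionCache:
    """Remembers the decision made for each idempotency key."""

    def __init__(self):
        self._decisions = {}
        self._lock = threading.Lock()

    def get_or_compute(self, idempotency_key: str, compute):
        """Run compute() once per key; later calls reuse the stored result."""
        with self._lock:
            if idempotency_key not in self._decisions:
                self._decisions[idempotency_key] = compute()
            return self._decisions[idempotency_key]
```

Holding one lock across `compute()` keeps the example short but serializes all requests; a per-key lock or an atomic set-if-absent in the backing store would avoid that.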
