In modern digital ecosystems, security and operational integrity are constantly challenged by sophisticated malicious actors. From account takeovers (ATO) and promo abuse to transaction fraud and card-not-present (CNP) theft, financial platforms lose billions of dollars annually. To block malicious activities, system architects must build highly resilient systems that can detect fraud in real time.
A production-grade Fraud Detection Platform operates at the intersection of streaming data, machine learning, and rule-based gateways. It is a classic system design topic because it forces developers to grapple with tight latency budgets (under 30 milliseconds), enormous write throughput, sliding-window feature aggregation, and active feedback loops.
1. Requirements & Core Constraints
A high-performance fraud platform must serve both automatic scoring engines and security operations workflows. We define our parameters across functional boundaries and strict non-functional constraints.
Functional Requirements
- Transaction Assessment: Analyze inbound transaction contexts in real time and return a dynamic risk score between 0 and 100.
- Declarative Rules Engine: Allow risk analysts to define, validate, and execute hot-deployable business rules (e.g., "Flag if transaction amount is greater than $5000 and the user account was created in the last 24 hours").
- Real-time Feature Hydration: Aggregate and query historical and sliding-window user features (e.g., "Spend velocity over the last 5 minutes", "Number of distinct payment cards tried today").
- Asynchronous Feedback Loop: Support back-channel ingestion of manually reported fraud events, banking chargeback files, and analyst overrides to continually retrain models.
- Analyst Workflows & Case Management: Route suspicious events (scores from 30 to 70) to an operations portal for manual review.
Non-Functional Requirements & SLAs
- Ultra-Low Latency: The synchronous risk evaluation path must return a decision in under 30 milliseconds. Any delay beyond this latency budget will slow down downstream credit card processing gates.
- Extreme Availability: The platform must target 99.999% uptime. If the fraud engine experiences an outage, it must fail-open safely (with reduced capabilities) rather than shutting down the checkout stream.
- High Ingestion Throughput: The system must handle a global peak load of 10,000 Transactions Per Second (TPS) with zero event loss.
- Consistency vs Latency: Read-path queries to the feature store must execute in under 2 milliseconds, requiring an in-memory key-value layout. Sliding-window streaming updates are processed asynchronously via stream processing to protect write speed.
Capacity & Back-of-the-Envelope Estimates
Let's evaluate the capacity demands of our fraud platform under peak parameters:
-
Transaction Workload: $$\text{Peak Transaction Throughput} = 10,000 \text{ TPS}$$ $$\text{Daily Volume} = 10,000 \text{ TPS} \times 86,400 \text{ seconds/day} = 864,000,000 \text{ transactions/day}$$ $$\text{Monthly Volume} \approx 25.92 \text{ billion transactions/month}$$
-
Real-Time Storage Footprint (Redis Feature Store): We track features for active users over a rolling 30-day window. Suppose there are 100 million active users per month. Each user profile contains 50 real-time features (mostly floating-point metrics and timestamp counters). $$\text{Record size per user} \approx 500 \text{ bytes}$$ $$\text{Raw Feature Storage Size} = 100,000,000 \text{ users} \times 500 \text{ bytes} = 50,000,000,000 \text{ bytes} \approx 50 \text{ GB}$$ Accounting for overhead (indexes, Redis hash structures, and replication factor of 3), the in-memory cluster requires: $$\text{Memory Overhead Estimate} = 50 \text{ GB} \times 3 \approx 150 \text{ GB of RAM}$$ This is highly efficient and easily distributed across a sharded Redis cluster.
-
Stream Ingest Bandwidth: Each raw transaction payload averages 1 Kilobyte of JSON data. $$\text{Ingest Bandwidth} = 10,000 \text{ TPS} \times 1 \text{ KB} = 10 \text{ Megabytes/second (MB/s)}$$ $$\text{Daily Ingress} = 10 \text{ MB/s} \times 86,400 \text{ seconds} \approx 864 \text{ Gigabytes (GB) of event logs/day}$$ This shows that a standard Kafka partition fleet can easily ingest the raw transaction logs.
2. API Design & Core Contracts
The platform exposes a synchronous JSON REST API for transaction assessments and asynchronous endpoints for feedback loops.
Synchronous Transaction Assessment Endpoint
- HTTP Method:
POST - Path:
/api/v1/transactions/assess - Request Headers:
Content-Type: application/json
Request Payload:
{
"transaction_id": "tx_9876543210_abc",
"user_id": "usr_456789_xyz",
"amount_usd": 1450.50,
"currency": "USD",
"timestamp_epoch_ms": 1779471461000,
"payment_method": {
"type": "credit_card",
"card_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"billing_zip": "10001",
"card_issuer": "Chase"
},
"device_context": {
"ip_address": "198.51.100.42",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
"session_id": "sess_88339944",
"device_fingerprint": "df_837492837498273"
},
"merchant_context": {
"merchant_id": "merch_9988",
"merchant_category_code": "5812",
"merchant_location": "New York, NY"
}
}
Response Payload:
{
"transaction_id": "tx_9876543210_abc",
"decision": "STEP_UP_MFA",
"fraud_score": 68.4,
"triggered_rules": [
{
"rule_id": "RULE_VELOCITY_5M_HIGH",
"description": "User has initiated more than 5 transactions in the last 5 minutes"
},
{
"rule_id": "RULE_GEOGRAPHIC_IMPOSSIBILITY",
"description": "Transaction location is physically impossible from the location of last transaction"
}
],
"fencing_token": 9923847293,
"recommender_duration_ms": 18
}
Asynchronous Feedback Loop Endpoint
Used to feed manual audits, analyst reviews, and chargeback notifications back into the platform for analytics and online feature adjustments.
- HTTP Method:
POST - Path:
/api/v1/fraud-feedback
Request Payload:
{
"feedback_id": "fb_77338844_qwe",
"transaction_id": "tx_9876543210_abc",
"user_id": "usr_456789_xyz",
"feedback_type": "CHARGEBACK",
"reported_at_epoch_ms": 1779475061000,
"source": "VISA_CLEARING_HOUSE",
"notes": "Cardholder reported unauthorized usage on statement."
}
Response Payload:
{
"feedback_id": "fb_77338844_qwe",
"status": "INGESTED",
"affected_user_flagged": true
}
3. High-Level Design (HLD)
To handle 10,000 TPS within 30ms, we partition the platform into two separate execution pathways:
- The Synchronous Assessment Path (Hot Path): Evaluates rules and scores the model in real time using pre-aggregated features from the in-memory store.
- The Asynchronous Streaming Path (Cold/Warm Path): Ingests transaction logs, calculates sliding-window velocity metrics, feeds the analytics data warehouse, and updates the Redis feature store.
Streaming Feature Hydration & Execution Flow
Below is the end-to-end event topology. The checkout system submits the event synchronously to the gateway. The gateway routes the event to the decision cluster while immediately emitting a copy to the event bus.
flowchart TD
Checkout[Client Checkout Application] -->|1. Sync POST Request| Gateway[API Gateway Fleet]
Gateway -->|2. Sync Call <30ms| FraudService[Fraud Decision Engine]
subgraph Synchronous Assessment Path
FraudService -->|3. Parallel Read| Redis[Redis Feature Store]
FraudService -->|4. Dynamic Score| ML[Model Scoring Service]
FraudService -->|5. Evaluate rules| Rules[Rules Gating Engine]
end
subgraph Asynchronous Streaming Analytics Pipeline
Gateway -->|6. Async Write Event| Kafka[Kafka Event Log]
Kafka -->|7. Stream Intake| Flink[Apache Flink Stream Processor]
Flink -->|8. Slide aggregations| Redis
Flink -->|9. Raw Log Archive| ColdStore[(Cassandra / Snowflake)]
end
Risk Decision Workflow
The scoring service checks the transaction and determines a risk path. Low scores bypass manual review, high scores reject immediately, and ambiguous ratings route to a step-up challenge before queuing for manual verification.
flowchart TD
Intake[Evaluate Transaction Score] --> DecisionGate{Fraud Score Gate}
DecisionGate -->|Score under 30: Low Risk| Accept[APPROVE TRANSACTION]
DecisionGate -->|30 <= Score <= 70: Medium Risk| StepUp{Step-up MFA Challenge}
DecisionGate -->|Score over 70: High Risk| Reject[REJECT TRANSACTION]
StepUp -->|MFA Verification Succeeded| ApproveMFA[APPROVE & UPDATE CACHE]
StepUp -->|MFA Verification Failed/Timeout| RejectMFA[REJECT & FLAG ACCOUNT]
Accept --> SyncQueue[Kafka Audit Logs]
Reject --> SyncQueue
ApproveMFA --> SyncQueue
RejectMFA --> SyncQueue
SyncQueue --> AnalystQueue[Analyst Case Management Portal]
AnalystQueue -->|Manual Override / Chargeback Feedback| FeedbackDB[(PostgreSQL Analytics)]
FeedbackDB --> Retrain[Retrain ML Model]
4. Low-Level Design (LLD) & Data Models
Database Schema (PostgreSQL): Fraud Metadata and Audits
While the in-memory feature store represents temporary hot states, we use a highly reliable relational schema for transactional audits, historical analysis, and chargebacks.
-- Represents raw transaction events and their structural ratings
CREATE TABLE payment_transactions (
transaction_id VARCHAR(64) PRIMARY KEY,
user_id VARCHAR(64) NOT NULL,
amount_usd NUMERIC(12, 2) NOT NULL,
currency CHAR(3) NOT NULL,
card_hash VARCHAR(64) NOT NULL,
ip_address VARCHAR(45) NOT NULL,
device_fingerprint VARCHAR(64) NOT NULL,
session_id VARCHAR(64),
merchant_id VARCHAR(64) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE NOT NULL,
fraud_score NUMERIC(5, 2) NOT NULL,
decision VARCHAR(32) NOT NULL,
recommender_duration_ms INTEGER NOT NULL
);
-- Index user transactions to accelerate operational aggregations and history lookups
CREATE INDEX idx_transactions_user_time ON payment_transactions (user_id, created_at DESC);
CREATE INDEX idx_transactions_card ON payment_transactions (card_hash);
-- Tracks chargebacks and manual analyst adjustments
CREATE TABLE chargeback_feedback (
feedback_id VARCHAR(64) PRIMARY KEY,
transaction_id VARCHAR(64) NOT NULL REFERENCES payment_transactions(transaction_id),
user_id VARCHAR(64) NOT NULL,
feedback_type VARCHAR(32) NOT NULL, -- 'CHARGEBACK', 'MANUAL_COMPLAINT', 'ANALYST_OVERRIDE'
reported_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
source VARCHAR(64) NOT NULL,
notes TEXT
);
CREATE INDEX idx_feedback_user ON chargeback_feedback (user_id);
Feature Store Layout (Redis)
To fetch user context at sub-millisecond rates, features are packed inside Redis hashes.
- Key Format:
features:user:{user_id} - Hash Fields:
tx_count_5m(Integer): Total transactions in the last 5 minutes.tx_sum_5m(Float): Total spend in the last 5 minutes.distinct_cards_24h(Integer): Count of unique card hashes processed today.last_transaction_country(String): ISO country code of the last approved charge.last_transaction_timestamp(Timestamp): Epoch millisecond value.
Compilable Python Implementation: Temporal Spend Velocity Aggregator
This compilable Python class represents the core state tracking logic executed inside our feature hydration processors. It manages in-memory sliding windows for transaction events, cleanses stale records, and provides thread-safe velocity aggregations.
import time
from collections import deque
from threading import Lock
from typing import Dict, Tuple
class TransactionEvent:
def __init__(self, user_id: str, amount: float, timestamp: float):
self.user_id = user_id
self.amount = amount
self.timestamp = timestamp
class SlidingWindowFeatureAggregator:
"""
A thread-safe sliding-window event aggregator for fraud feature generation.
Calculates velocity metrics (transaction count and sum) in real-time.
"""
def __init__(self, window_size_seconds: float):
self.window_size_seconds = window_size_seconds
# Maps user_id -> deque of TransactionEvents within the active window
self.user_windows: Dict[str, deque] = {}
self.lock = Lock()
def add_transaction(self, user_id: str, amount: float) -> Tuple[int, float]:
"""
Ingests a new transaction event, purges stale events,
and returns the active window features (count, sum_amount).
"""
now = time.time()
with self.lock:
if user_id not in self.user_windows:
self.user_windows[user_id] = deque()
user_deque = self.user_windows[user_id]
# Append current transaction
user_deque.append(TransactionEvent(user_id, amount, now))
# Prune stale events outside the sliding window
cutoff = now - self.window_size_seconds
while user_deque and user_deque[0].timestamp < cutoff:
user_deque.popleft()
# Compute real-time aggregations
tx_count = len(user_deque)
tx_sum = sum(tx.amount for tx in user_deque)
return tx_count, tx_sum
def get_features(self, user_id: str) -> Tuple[int, float]:
"""
Retrieves features without adding a new transaction, maintaining window consistency.
"""
now = time.time()
with self.lock:
if user_id not in self.user_windows:
return 0, 0.0
user_deque = self.user_windows[user_id]
cutoff = now - self.window_size_seconds
# Prune stale events
while user_deque and user_deque[0].timestamp < cutoff:
user_deque.popleft()
return len(user_deque), sum(tx.amount for tx in user_deque)
if __name__ == "__main__":
# Initialize a 5-minute sliding window aggregator (300 seconds)
aggregator = SlidingWindowFeatureAggregator(window_size_seconds=300)
# Simulate high-speed ingest for a target user
user = "usr_992384"
# Ingest transactions rapidly
cnt1, sum1 = aggregator.add_transaction(user, 150.0)
cnt2, sum2 = aggregator.add_transaction(user, 450.50)
cnt3, sum3 = aggregator.add_transaction(user, 200.0)
print(f"Test 1: Verification of active window metrics for {user}")
print(f"Aggregations: count={cnt3}, sum={sum3}")
# Validate count and sums match expected behavior
assert cnt3 == 3, f"Expected 3 transactions, got {cnt3}"
assert sum3 == 800.5, f"Expected 800.5 sum, got {sum3}"
print("Test 1 Passed successfully!")
# Test read-only feature inspection
cnt_check, sum_check = aggregator.get_features(user)
print(f"Test 2: Verification of read-only features. count={cnt_check}, sum={sum_check}")
assert cnt_check == 3 and sum_check == 800.5, "Read features mismatched"
print("Test 2 Passed successfully!")
5. Scaling Challenges & Bottlenecks
Scaling a platform to handle 10,000 TPS while maintaining tight latencies presents several key bottlenecks.
Apache Flink State Store Explosion
Flink uses RocksDB as a state backend to store sliding-window transactions. Under a peak load of 10,000 TPS and a long aggregation window (e.g., tracking a 30-day velocity check), Flink's local state size can swell to multiple terabytes.
- The Solution: We implement strict TTL policies on state parameters (e.g., StateTtlConfig). We configure state eviction to automatically prune inactive keys after 24 hours if we do not need larger windows. For multi-week trends, we calculate off-line batch features using Snowflake and load them daily into Redis as static baselines, completely freeing Flink from maintaining massive states.
Partition Hotspots in Redis
A botnet attacking a specific high-volume merchant or using a single compromised account can cause massive keyspace skew. This targets a single partition node in the sharded Redis cluster, driving CPU utilization to 100% and stalling transaction assessment latencies.
- The Solution: We implement Client-side L1 Caching for read-only static features. For high-volume merchants, we bypass the user-sharded feature lookup and use local in-memory caches (e.g., Caffeine caches inside the Fraud service memory) to store basic merchant thresholds with a short 5-second expiration. This protects the backend Redis cluster from extreme hotspot failures.
Feature Hydration Latency
If a user has been inactive for months and suddenly checks out, their features are absent from the hot Redis database. Fetching them synchronously from historical SQL databases inside the 30ms assessment loop is impossible.
- The Solution: We leverage a Pre-hydration Event Queue. When a user logs in, visits the checkout page, or adds items to their shopping cart, the frontend gateway emits a pre-hydration event to Kafka. A consumer consumes this event, queries PostgreSQL/Snowflake, and populates the user's features inside Redis before the user clicks the "Pay Now" button, reducing cache misses to under 1%.
6. Technical Trade-offs & Compromises
Synchronous Blocking vs. Asynchronous Post-Authorization Screening
┌─────────────────────────────┐
│ Fraud Assessment Workflow │
└──────────────┬──────────────┘
│
┌──────────────────────────┴──────────────────────────┐
▼ ▼
┌──────────────────────────────────────┐ ┌──────────────────────────────────────┐
│ Synchronous Blocking (Hot) │ │ Asynchronous Screening (Warm) │
├──────────────────────────────────────┤ ├──────────────────────────────────────┤
│ • Decides BEFORE authorization │ │ • Decides AFTER authorization │
│ • Latency-bound (30ms limit) │ │ • Asynchronous (No latency impact) │
│ • Higher protection against loss │ │ • Catches complex patterns post-tx │
│ • Risk of false positive friction │ │ • Risk of quick chargebacks │
└──────────────────────────────────────┘ └──────────────────────────────────────┘
- Synchronous Blocking: The payment process waits for a fraud verdict before communicating with the credit card network.
- Trade-off: Maximizes protection against immediate losses, but forces a strict 30ms latency budget. False positives can create friction and abandonments during checkout.
- Asynchronous Screening: The transaction is instantly approved, and the fraud model runs in the background. If fraud is detected post-authorization, the platform cancels the shipment or suspends the account.
- Trade-off: Eliminates any latency impact on checkout, but exposes the business to chargeback costs if the product is delivered instantly (e.g., digital gift cards or ride-hailing services).
- Staff Verdict: We implement a hybrid model. High-risk actions (e.g., purchase amounts greater than $1,000, new accounts, or digital delivery) execute synchronously. Standard transactions execute asynchronously, with a post-authorization check that completes in under 2 seconds.
Rules Engine (RETE/Drools) vs. Machine Learning Models
- Rules Engine (Declarative Rules):
- Trade-off: Highly transparent, fast (sub-2ms), and directly manageable by compliance analysts. However, rules are rigid, hard to maintain as fraud tactics adapt, and cannot easily identify complex, multi-dimensional correlations.
- Machine Learning Model (e.g., XGBoost, LightGBM):
- Trade-off: Excellent at detecting complex patterns across hundreds of variables. However, models operate as a black box, require extensive pipelines to prevent feature drift, and are slower to run (10ms to 20ms).
- Staff Verdict: We use dual-engine execution. The transaction is sent to both systems in parallel. If any high-priority rule triggers a block (e.g., an explicit country blacklist), the transaction is immediately rejected, short-circuiting the model scoring path and saving CPU cycles.
7. Failure Scenarios & Operational Resiliency
Distributed systems are inherently subject to network cuts, partition splits, and database crashes.
Safe Fallback: Fail-Open vs. Fail-Closed
If the ML scoring service or the Redis cluster goes completely dark, how should the checkout gateway respond?
- Resiliency Plan: We enforce a Fail-Open Strategy. If a synchronous call to the fraud engine times out or throws an exception, the payment system logs the incident, falls back to a static local rules gateway (running in-memory with basic checks), and permits the transaction. We route the transaction to a high-priority asynchronous queue for post-authorization screening. Blocking all legitimate transactions (fail-closed) during a fraud service outage is far more expensive than accepting a brief window of potential fraud risk.
Flink Pipeline Recovery and Kafka Offset Lag
If Flink experiences an outage, it stops aggregating sliding-window velocities. Once Flink recovers, it has to process millions of buffered events in Kafka, causing feature calculation lag and feeding stale metrics to the Redis feature store.
- Resiliency Plan: We use Checkpointing and RocksDB State Savepoints in Flink. We configure Flink to persist checkpoints every 10 seconds to an S3 object store. During a crash recovery, Flink resumes exactly from the last saved state, catching up rapidly via parallel partition threads. Furthermore, the Redis records contain an epoch timestamp of the last write. If the Fraud Service reads a feature and notices
last_write_timestampis older than 5 minutes, it ignores the stale Redis feature and falls back to baseline values.
8. Candidate Verbal Script
Mock Interview Sequence
Interviewer: How would you design a system to detect transaction fraud in real time under a strict 30ms latency SLA, while processing 10,000 TPS?
Candidate: "To build a fraud platform at this scale, I would decouple the real-time decision engine from the heavy streaming computations. We will use a dual hot-cold architecture.
The hot path evaluates transaction decisions synchronously. When a checkout request arrives, the Fraud Service executes a sub-2ms parallel read from an in-memory Redis Feature Store to grab pre-computed sliding-window features (e.g., transaction count and spend sums over the last 5 minutes). In parallel, it calls an optimized ML model scoring service (e.g., running a lightweight XGBoost model in C++) and a local declarative Rules Engine. The entire assessment completes in 15 to 20ms, well within the 30ms SLA.
The cold/warm path processes updates asynchronously. The checkout gateway logs transaction events to Apache Kafka. Apache Flink consumes these events and updates sliding-window aggregations, writing the results back to Redis. This keeps heavy processing completely out of the critical path of the checkout flow.
If the Redis cluster or ML service goes offline, the system falls back to a Fail-Open Strategy, running a local in-memory rule subset to quickly vet the transaction, ensuring checkout availability is never compromised."
Interviewer: How would your Flink pipeline handle out-of-order events or late-arriving events in Kafka without corrupting the sliding-window features stored in Redis?
Candidate: "We use Flink's Event-Time Processing combined with Bounded-OutOfOrderness Watermarks instead of relying on processing time. We establish a maximum allowable skew for late events (e.g., 5 seconds). Flink will buffer events and allow them to join the sliding window calculation if they arrive within this 5-second window.
If an event arrives even later due to a severe network disconnect, it is routed to a dead-letter queue (DLQ) for asynchronous reconciliation. We also perform a conditional atomic update on the Redis feature store. Redis will only accept the write if the event timestamp is greater than the last recorded update. This prevents out-of-order events from overwriting fresh features."