System Design: Building a Feature Store for Real-Time Machine Learning

Production machine learning systems fail when the model sees different data in training than it sees in production. A feature store helps solve this problem by providing a consistent way to define, compute, store, and serve features for both offline training and online inference. It acts as the central source of truth for ML features, bridging the gap between historical data lakes and real-time operational database systems.

Think of a feature as an individual measurable property or characteristic of a phenomenon being observed. In the context of a machine learning model, features are the inputs:

user_7d_transaction_count: Number of transactions by a user in the last 7 days.
merchant_chargeback_rate_30d: The chargeback rate of a merchant over the last 30 days.
device_seen_before: Boolean flag indicating if this device hash has been associated with the user profile.
avg_order_value_90d: The average dollar value of orders placed by the user in the past 90 days.

A real-time feature store is a mission-critical infrastructure component that guarantees feature values are computed with minimal freshness latency, served with single-digit millisecond retrieve times, cataloged for easy discoverability, and exported with mathematical correctness for offline historical training.

1. Requirements & Core Constraints

To build a production-grade real-time feature store at the scale of 1 Billion active users, we must establish rigorous functional and non-functional requirements.

Functional Requirements

Feature Schema Registry: A centralized registry where engineers can define feature metadata, data types, ownership, entity associations, and fresh SLAs.
Low-Latency Online Serving: Retrieve a feature vector (a set of feature values) for one or more entity IDs in real-time during model inference.
Offline Batch Ingestion: Ingest high-volume historical features computed via batch engines like Apache Spark or Snowflake into the offline store.
Streaming Feature Computation: Ingest real-time events from streams (e.g., Apache Kafka or AWS Kinesis) and compute features on-the-fly with Apache Flink.
Historical Point-in-Time Retrieval: Provide an SDK to retrieve historical feature values for training data labels without causing future-data leakage (point-in-time correctness).

Non-Functional Requirements

Scale: Support 1 Billion active user keys, each having up to 500 individual features.
Write Throughput: Handle up to 200,000 streaming feature updates per second during peak events.
Read Throughput: Support up to 500,000 feature vector retrievals per second for real-time model inference.
Latency: P99 read latency of less than 10 milliseconds for single-key retrievals, and less than 20 milliseconds for batch retrievals.
Consistency vs. Freshness: High availability and low serving latency are prioritized for the online store. Streaming feature values should achieve sub-second freshness, while batch features can update within a 24-hour cycle.
Zero Training-Serving Skew: Absolute guarantee that the feature computation logic used for offline training is identical to the one used for online serving.

Back-of-the-Envelope Capacity Estimation

1. Storage Calculations

Active Entities: 1,000,000,000 (1 Billion) unique user keys.
Features per Entity: 100 active features per user profile.
Average Size per Feature Entry:
- Key: user_id:feature_name (32 bytes user_id + 24 bytes feature_name = 56 bytes).
- Value: Float64 feature value (8 bytes) + timestamp (8 bytes) + status code (1 byte) = 17 bytes.
- Overall overhead: Let's assume 100 bytes per feature key-value pair including database index overhead.
Total Storage (Online Store):
- 1 Billion entities * 100 features/entity * 100 bytes = 10 Terabytes.
- Adding replication factor of 3 for high availability = 30 Terabytes of high-performance RAM/SSD storage.
Total Storage (Offline Store - 5 Years History):
- Assume we record updates once per day per feature for 1 Billion users:
- 1 Billion * 100 features * 365 days * 5 years * 17 bytes = 3.1 Petabytes.

2. Network and Throughput Calculations

Write Bandwidth:
- 200,000 updates/sec * 100 bytes = 20 Megabytes/second.
Read Bandwidth:
- 500,000 reads/sec * 10 features returned/inference * 17 bytes = 85 Megabytes/second.

2. API Design & Core Contracts

The feature store must expose reliable gRPC and REST endpoints. gRPC is preferred for ultra-low latency online retrieval, while REST endpoints serve client integrations and administrative panels.

REST API: Retrieve Features for Online Inference

Retrieves a set of specific feature values for one or more entity IDs.

HTTP Method: POST
Path: /api/v1/features/retrieve
Headers:
- Content-Type: application/json
- X-API-Key: fs_prod_8f2d9c4

Request Payload

{
  "entity_name": "user",
  "entity_keys": ["usr_98234812", "usr_01928374"],
  "features": [
    "user_7d_transaction_count",
    "avg_order_value_90d",
    "device_seen_before"
  ]
}

Response Payload

{
  "entity_name": "user",
  "results": [
    {
      "entity_key": "usr_98234812",
      "features": {
        "user_7d_transaction_count": {
          "value": 14.0,
          "timestamp": "2026-05-22T17:40:00Z",
          "status": "PRESENT"
        },
        "avg_order_value_90d": {
          "value": 124.50,
          "timestamp": "2026-05-22T12:00:00Z",
          "status": "PRESENT"
        },
        "device_seen_before": {
          "value": 1.0,
          "timestamp": "2026-05-21T08:30:00Z",
          "status": "PRESENT"
        }
      }
    },
    {
      "entity_key": "usr_01928374",
      "features": {
        "user_7d_transaction_count": {
          "value": 0.0,
          "timestamp": "2026-05-22T17:44:00Z",
          "status": "PRESENT"
        },
        "avg_order_value_90d": {
          "value": null,
          "timestamp": null,
          "status": "MISSING"
        },
        "device_seen_before": {
          "value": 0.0,
          "timestamp": "2026-05-22T17:44:00Z",
          "status": "DEFAULT"
        }
      }
    }
  ]
}

3. High-Level Design (HLD)

The feature store is designed as a dual-database architecture. The "Online Store" handles low-latency key-value lookups, while the "Offline Store" serves as the historical query engine.

Ingestion & Serving Architecture

graph TD
    %% Streaming Path
    subgraph Streaming Ingestion
        A[App Event Logs] -->|JSON/Protobuf| B[Apache Kafka Topic]
        B -->|Event Stream| C[Apache Flink Engine]
        C -->|Aggregations| D[(Redis Cluster - Online Store)]
        C -->|Raw & Windowed Logs| E[Delta Lake / S3 - Offline Store]
    end

    %% Batch Ingestion Path
    subgraph Batch Ingestion
        F[(Production DBs)] -->|CDC / Debezium| G[Snowflake / Spark]
        G -->|Daily Batch Computation| E
        G -->|Bulk Loader| D
    end

    %% Serving Path
    subgraph Client Retrieval
        H[Inference Service] -->|gRPC Feature Fetch| I[Feature Store Service]
        I -->|Cache Check| J{Local Cache}
        J -->|Hit| H
        J -->|Miss| D
        D -->|Feature Vector| I
    end

Ingestion Logic Flow

Streaming Flow: Actions like credit card swipes or web clicks write to Kafka. Flink monitors these topics, runs sliding windows (e.g., aggregate clicks over the last 10 minutes), and updates the online Redis cluster. Flink also writes parquet chunks to AWS S3.
Batch Flow: Heavy database snapshots and deep historical feature calculations run overnight on Spark or Snowflake. These are saved to S3. A dedicated synchronizer loads these batch features into the online Redis Cluster.
Serving Flow: The inference service requests features for model inference. The Feature Store service checks its internal LFU cache first. On a miss, it fires a parallelized batch read against the Redis Cluster.

Real-Time Query Serving Lifecycle

sequenceDiagram
    autonumber
    actor Client as Inference Client
    participant API as Feature Serving Service
    participant Cache as Local LFU Cache
    participant Redis as Redis Cluster (Online)
    participant Registry as Feature Schema Registry

    Client->>API: RetrieveFeaturesRequest(EntityKeys, FeatureNames)
    API->>Registry: Validate features & get schemas
    Registry-->>API: Active schema & types
    API->>Cache: Lookup cached keys
    alt All keys cached
        Cache-->>API: Return feature vectors
    else Cache Miss / Partial Miss
        Cache-->>API: Return missing keys
        API->>Redis: MGET features:user:{id}
        Redis-->>API: Return serialized JSON/Protobuf values
        API->>Cache: Update LFU cache with fetched vectors
    end
    API->>API: Hydrate default values for MISSING elements
    API-->>Client: RetrieveFeaturesResponse(FeatureVectors)

4. Low-Level Design (LLD) & Data Models

For 1 Billion active users, selecting the appropriate database storage model is crucial. The online store uses a Wide-Column or Redis Hash model, whereas the offline store relies on a Partitioned Parquet file design.

Online Database Choice: Redis vs. DynamoDB

Redis Cluster is selected for ultra-low latency requirements. Its cluster mode scales horizontally, and key-value lookups are executed entirely in-memory with sub-2ms retrieval.
To store features efficiently in Redis, we use Redis Hashes where the key is the entity identifier and fields are the feature names.

Redis Data Structure Setup

Key: features:user:usr_98234812
Fields:
  "user_7d_transaction_count" -> "14"
  "avg_order_value_90d" -> "124.50"
  "device_seen_before" -> "1"
  "last_updated" -> "1779471600"

Offline Data Model: PostgreSQL Schema for Training Logs

The offline store tracks historical snapshots. For auditing and point-in-time reconstruction, we use an event-sourced database design.

-- Schema for storing historical feature computations in S3/Warehouse (Simulated SQL)
CREATE TABLE offline_feature_store (
    entity_id VARCHAR(64) NOT NULL,
    entity_type VARCHAR(32) NOT NULL,
    feature_name VARCHAR(64) NOT NULL,
    feature_value DOUBLE PRECISION NOT NULL,
    event_timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
    computed_timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
    partition_date DATE NOT NULL,
    PRIMARY KEY (entity_id, feature_name, event_timestamp)
) PARTITION BY RANGE (partition_date);

-- Indexing for rapid historical time-travel joins
CREATE INDEX idx_feature_lookup 
ON offline_feature_store (entity_id, feature_name, event_timestamp DESC);

Compilable Python Implementation: Feature Cache Manager

This class implements a thread-safe, local LFU (Least Frequently Used) cache to prevent overloading Redis and reduce serving latencies during high-frequency inferencing.

import threading
import time
from typing import Dict, Any, List, Optional

class FeatureCacheEntry:
    def __init__(self, value: Any, ttl_seconds: float):
        self.value = value
        self.expire_at = time.time() + ttl_seconds
        self.frequency = 1

    def is_expired(self) -> bool:
        return time.time() > self.expire_at

    def increment_frequency(self):
        self.frequency += 1

class ThreadSafeFeatureCache:
    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self.cache: Dict[str, FeatureCacheEntry] = {}
        self.lock = threading.Lock()

    def get(self, key: str) -> Optional[Any]:
        """
        Retrieves an active cache entry. If expired, deletes it.
        """
        with self.lock:
            entry = self.cache.get(key)
            if entry is None:
                return None
            
            if entry.is_expired():
                del self.cache[key]
                return None
            
            entry.increment_frequency()
            return entry.value

    def put(self, key: str, value: Any, ttl_seconds: float = 60.0) -> None:
        """
        Inserts value. Evicts Least Frequently Used (LFU) entry if full.
        """
        with self.lock:
            if key in self.cache:
                self.cache[key] = FeatureCacheEntry(value, ttl_seconds)
                return

            if len(self.cache) >= self.max_size:
                self._evict_lfu()

            self.cache[key] = FeatureCacheEntry(value, ttl_seconds)

    def _evict_lfu(self) -> None:
        """
        Finds and removes the entry with the lowest access frequency.
        """
        lowest_freq = float('inf')
        evict_key = None

        for k, entry in self.cache.items():
            if entry.is_expired():
                evict_key = k
                break
            if entry.frequency < lowest_freq:
                lowest_freq = entry.frequency
                evict_key = k

        if evict_key is not None:
            del self.cache[evict_key]

# Simple simulation run
if __name__ == "__main__":
    cache = ThreadSafeFeatureCache(max_size=3)
    
    # Store some mock features
    cache.put("features:user:u1", {"txn_count": 5}, ttl_seconds=2.0)
    cache.put("features:user:u2", {"txn_count": 12}, ttl_seconds=5.0)
    cache.put("features:user:u3", {"txn_count": 0}, ttl_seconds=5.0)
    
    # Trigger reads to build up frequency
    _ = cache.get("features:user:u1")
    _ = cache.get("features:user:u1")
    _ = cache.get("features:user:u2")
    
    # Put another entry to trigger LFU eviction
    # u3 should be evicted since u1 (freq=3) and u2 (freq=2) are higher priority
    cache.put("features:user:u4", {"txn_count": 99}, ttl_seconds=5.0)
    
    print("User 1 Cache:", cache.get("features:user:u1"))  # Present
    print("User 3 Cache (Evicted LFU):", cache.get("features:user:u3"))  # None
    print("User 4 Cache:", cache.get("features:user:u4"))  # Present

5. Scaling Challenges & Bottlenecks

Scale targets of 1 Billion users and 500k RPS force us to resolve critical operational bottlenecks.

Hot Partition -> Redis Cluster Hash Tag Overload -> Mitigation: Virtual Shards

Hot Partitioning on Key Entity Identifiers

Problem: Certain popular users or merchants (e.g., highly active vendors in an e-commerce platform) receive highly skewed query volumes. This concentrates load on a single Redis cluster shard, exhausting CPU capacity.
Mitigation: Implement Virtual Sharding. Instead of saving a single key like features:merchant:m100, append a virtual shard suffix based on a hash of the current timestamp minute or client ID: features:merchant:m100_v1, features:merchant:m100_v2. During retrieval, load balancer clients fetch from these replica keys in round-robin mode.

Point-in-Time Joins & Feature Leakage Prevention

Problem: In ML training, retrieving historical features via a simple JOIN based on transaction timestamp risks data leakage. For example, if a fraud event occurred at 14:00, but a Flink streaming job updated a user profile at 14:02, the training set must not receive the 14:02 update.
Mitigation: Use an As-Of Join (Point-in-Time Join) query format. When matching features for training labels, select the latest feature version whose event_timestamp is less than or equal to the training event's timestamp.

-- Point-in-time correct lookup query
SELECT 
    t.transaction_id,
    t.user_id,
    t.transaction_timestamp,
    f.feature_value AS user_7d_transaction_count
FROM transactions t
LEFT JOIN LATERAL (
    SELECT feature_value 
    FROM offline_feature_store f
    WHERE f.entity_id = t.user_id 
      AND f.feature_name = 'user_7d_transaction_count'
      AND f.event_timestamp <= t.transaction_timestamp
    ORDER BY f.event_timestamp DESC
    LIMIT 1
) f ON TRUE;

6. Technical Trade-offs & Compromises

Redis in-memory Store vs. Cassandra/ScyllaDB Wide-Column Store

Redis Cluster Choice: Choosing Redis Cluster guarantees sub-5ms P99 latencies because data is stored in RAM. However, storing 30 Terabytes in RAM is extremely expensive, requiring specialized memory-optimized infrastructure.
Cassandra Choice: SSD-backed Cassandra reduces infrastructure cost by 5-10x but introduces a higher P99 latency overhead (15-30ms) due to garbage collection spikes and read-path disk seeks.
Decision: We opt for a hybrid solution: Redis Cluster for top 20% highest-velocity real-time features (user activity count, device risk flags) and DynamoDB/Cassandra with SSD caching for lower-priority static features (account age, user address metrics).

Streaming Computation: Event-Time vs. Processing-Time

Event-Time Processing: Computes windowed features (e.g., transactions in last 5 minutes) based on when the transaction actually occurred. It is highly accurate and handles out-of-order logs cleanly but requires Flink memory buffers and complex watermark configurations.
Processing-Time Processing: Ignores actual event timestamps and computes window aggregates based on when the server received the event. While faster and memory-efficient, it introduces severe training-serving skew if network lag delays Kafka log ingestion.
Decision: We mandate Event-Time Processing with a strict 30-second bounded out-of-orderness watermark policy. Events arriving beyond this window are rejected from real-time windows and handled via late-event reprocessing pipelines.

7. Failure Scenarios & Operational Resiliency

1. Flink Streaming Pipeline Failure

Scenario: A major network split causes the Apache Flink cluster to lose connection to the Kafka broker.
Resiliency Plan: Flink is configured with RocksDB state backend and periodic checkpointing to S3. Upon reconnection, Flink reprocesses events starting from the last verified checkpoint offset. For the online store, we implement a fallback to the last computed batch features in the Redis cluster until the streaming aggregation catch-up is completed.

2. Redis Cluster Shard Failure

Scenario: A master node in the Redis Cluster crashes due to an out-of-memory issue, leaving a subset of the hash slots temporarily unreachable.
Resiliency Plan: Configure Redis sentinel and replication. Slave replica nodes automatically promote to master within 3 seconds. During this transition, the Feature Store service returns pre-registered default feature values (such as 0.0 for transaction counters or system-wide averages for transactional ratios) to keep the inference service online with zero degradation to application availability.

3. Schema Evolution Incompatibility

Scenario: An engineer updates a feature type in the schema registry from Integer to String, breaking downstream ML parsing models.
Resiliency Plan: Programmatically enforce backward schema compatibility rules inside the central schema registry via CI/CD. The registry denies any type conversions or column deletion mutations. Feature changes must be registered as new feature names (e.g., user_activity_score_v2) to prevent runtime serialization exceptions.

8. Candidate Verbal Script

Below is an interactive mock interview walkthrough showing how a candidate should execute this system design scenario.

Interviewer: "Design a real-time feature store that serves 1 Billion active users at 500k RPS for ML inference."

Candidate: "I will start by defining the system's key metrics. We need to handle 1 Billion active user profiles. Our system must serve real-time model inference with extreme speed—a P99 read latency of less than 10 milliseconds. Our write-path handles up to 200,000 updates per second, driven by real-time streams like credit card swipes or clickstream logs.

To achieve this, I will design a dual-database architecture featuring an Online Store optimized for rapid key-value lookups (using Redis Cluster hashes) and an Offline Store designed for large-scale training logs (implemented on AWS S3 with Parquet partitions).

To prevent data leakage, the offline store must guarantee point-in-time correctness. When training a model for an event that occurred at 10:00 AM, our query must look up feature values as-of 10:00 AM. I will implement a lateral point-in-time SQL join to ensure that features computed after the target event are never leaked to the training pipeline.

For our real-time streaming pipeline, Kafka will capture clickstream and transactional logs. Apache Flink will consume these logs, run event-time window aggregations, and push the results directly to our online Redis Cluster.

To handle hot partitions—like highly active merchant IDs—I will implement virtual sharding by hashing entity IDs with replica suffixes. If a Redis master node goes offline, the system will automatically failover to replica nodes, while the application API serves pre-defined default feature fallback values to prevent any disruption to our client services."

Interviewer: "How do you guarantee that features computed online match the offline training set perfectly?"

Candidate: "Excellent question. This is the training-serving skew challenge. To resolve this, I will implement a centralized Feature Schema Registry. In this registry, features are defined once with their source logic. Both Flink real-time streams and Spark batch scripts utilize the exact same schema definition.

Furthermore, during online inference, the Feature Serving Service logs the exact feature vectors sent to the model along with their retrieval timestamps. We periodically compare these logged feature values against offline historical calculations for the same timestamps. If the difference exceeds a certain margin, we trigger automated alerts to notify engineers of potential drift or skew."

Key Takeaways

define feature schemas
compute batch features
compute streaming features