System Design: Designing a Real-time Recommendation Engine (TikTok / Netflix)

Case Study: Designing a Real-Time Recommendation Engine (TikTok / Netflix)

Mental Model

A high-throughput recommender is a delicate choreography of fast, low-latency approximate retrieval paired with heavy, GPU-accelerated deep ranking models.

How do TikTok, YouTube, and Netflix keep users engaged for hours? The answer lies in high-performance Real-Time Machine Learning Pipelines that process user interactions, run complex predictive scoring models, and return personalized content in under 100 milliseconds.

Requirements & Core Constraints

To build a world-class recommendation system, we must clarify our functional goals and strict scaling bounds.

Functional Constraints

Personalized Feed Generation: Provide users with a highly engaging, continuous feed of short-form videos.
Real-Time Interactions: Telemetry data (swipes, watch completion ratios, likes, shares, comments, skips) must affect recommendations within seconds.
Content Diversity & Deduplication: Avoid showing consecutive videos from the same creator, same audio track, or showing the same video twice in the same session.
Cold Start Support: Gracefully bootstrap new users with no prior interaction history and newly uploaded videos with no view history.

Non-Functional SLAs

Ultra-Low Latency: P99 read response latency for feed generation must be under 100ms to prevent UI stutter.
Massive Ingestion Scale: Ingest and process up to 10 Billion telemetry events per day.
Extreme Read Scale: Handle 100 Million Daily Active Users (DAU) with peak read loads of 70,000 requests per second (RPS).
High Availability: Target 99.99% system availability for feed serving.

Precise Latency Budget (100ms P99)

To meet our strict P99 latency SLA, we allocate the 100ms budget across our microservice mesh as follows:

API Gateway Routing & Authentication: 10ms
Stage 1: Candidate Generation (Retrieval): 15ms (parallel vector and cache lookup)
Stage 2: Filtering (Deduplication & Blacklist): 10ms (high-speed Redis Bloom evaluation)
Stage 3: Machine Learning Ranking (Inference): 45ms (batched Triton GPU scoring)
Stage 4: Re-ranking & Diversity Processing: 10ms (in-memory business rules)
Network Egress Serialization: 10ms

Back-of-the-Envelope Estimates

Let's formulate the exact physical system capacity demands:

1. Ingestion Bandwidth & Throughput

Daily Active Users (DAU): $100\text{ Million}$
Average Video Views per User per Day: $200\text{ videos}$
Total Daily Views: $100\text{M} \times 200 = 20\text{ Billion views/day}$
Telemetry Events per View: Average of 5 interactions (Impression, Watch Time telemetry, Like, Share, Skip).
Total Ingested Events: $20\text{B} \times 5 = 100\text{ Billion events/day}$
Average Ingest QPS: $100\text{B} / 86,400\text{s} \approx 1,157,400\text{ QPS}$
Peak Ingest QPS (3x Average): $\approx 3.5\text{ Million QPS}$
Telemetry Payload Size: $500\text{ bytes}$ per event.
Peak Ingest Bandwidth: $3.5\text{M QPS} \times 500\text{ bytes} \approx 1.75\text{ GB/sec}$ write throughput.

2. Vector Database & Embedding Memory Capacity

Total Active Videos in Catalog: $10\text{ Million}$ (sliding window of highly relevant/recent videos).
Vector Dimensions: $512$-dimensional floating-point vectors ($32$-bit / $4\text{ bytes}$ per dimension).
Raw Video Vector Embedding Size: $512 \times 4\text{ bytes} = 2\text{ KB}$
Total Raw Vectors Storage: $10\text{M} \times 2\text{ KB} = 20\text{ GB}$
Index Overhead (HNSW Graph Structure): Typically $2.5\times$ raw size.
Total Memory Requirement: $20\text{ GB} \times 2.5 = 50\text{ GB}$ RAM to hold active candidate embeddings in memory.

API Design & Core Contracts

The contract boundary separates the heavy recommendation core from edge clients and log collectors.

1. Feed Retrieval Endpoint

Fetches a personalized batch of recommendations.

GET /api/v1/feed

Headers:

X-User-ID: usr_9823471029 (Standard partition tracer)
X-Device-ID: dev_iphone15_2391 (Used to map device-specific capabilities like video format support)

Query Parameters:

limit: 20 (Number of items requested, validated on gateway to prevent cache exhaust)
last_item_id: vid_8374829 (Standard pagination/cursor tracking used for continuous scrolling alignment)

Response Payload:

{
  "status": "success",
  "data": {
    "items": [
      {
        "video_id": "vid_9083214",
        "video_url": "https://cdn.codesprintpro.com/videos/9083214.mp4",
        "creator_id": "creator_8372",
        "score": 0.9872,
        "recommendation_reason": "Based on your interest in distributed systems",
        "ad_tracking_token": null
      },
      {
        "video_id": "vid_1283741",
        "video_url": "https://cdn.codesprintpro.com/videos/1283741.mp4",
        "creator_id": "creator_1092",
        "score": 0.9541,
        "recommendation_reason": "Trending in your area",
        "ad_tracking_token": "token_ad_9823147"
      }
    ],
    "next_cursor": "vid_1283741"
  }
}

2. Telemetry Feedback Webhook

Receives streaming user interaction logs asynchronously to feed the real-time feature store.

POST /api/v1/telemetry

Request Payload:

{
  "user_id": "usr_9823471029",
  "device_id": "dev_iphone15_2391",
  "events": [
    {
      "event_id": "evt_73841029",
      "video_id": "vid_9083214",
      "interaction_type": "watch_time",
      "value": "12.4",
      "completed": true,
      "timestamp": 1779435420000
    },
    {
      "event_id": "evt_73841030",
      "video_id": "vid_9083214",
      "interaction_type": "like",
      "value": "1",
      "completed": false,
      "timestamp": 1779435422000
    }
  ]
}

High-Level Design (HLD)

The recommendation engine runs two core flows concurrently: the Real-Time Streaming Feature Pipeline and the Dual-Stage Inference Pipeline.

1. User Event Streaming Pipeline

This pipeline processes user telemetry events to dynamically update feature vectors in memory. The ingestion process must be non-blocking and separate from the critical rendering path.

graph LR
    Client[Mobile/Web Client] -->|Telemetry Batches| APIGW[API Gateway / Load Balancer]
    APIGW -->|Route Event| Ingestion[Telemetry Ingestion Service]
    Ingestion -->|Produce Stream| Kafka[Apache Kafka Cluster]
    
    Kafka -->|Raw Event Stream| Flink[Apache Flink Stateful Stream Processor]
    Flink -->|Write Sliding Aggregate Features| Redis[(Redis Feature Store / Feast)]
    
    Kafka -->|Durable Event Log| S3[(Object Storage / HDFS)]
    S3 -->|Batch Feature Processing| Spark[Apache Spark ML Pipeline]
    Spark -->|Update Offline Features| Cassandra[(Cassandra User Profile DB)]

Detailed Streaming Components:

Telemetry Ingestion Service: A highly scalable Go-based stateless service that receives payloads from clients, validates event signatures, and writes immediately to Apache Kafka. It utilizes client-side batching to minimize network roundtrips.
Apache Kafka Cluster: Serves as our durable backplane, sharded into hundreds of partitions based on user_id to guarantee message ordering per user.
Apache Flink: Performs low-latency stateful stream calculations. Using sliding time windows (e.g., 5-minute and 30-minute intervals), Flink computes click-through rates (CTR) and topic engagement vectors for each user, immediately updating the active feature store.
Apache Spark ML Pipeline: Runs scheduled daily batch jobs to process cold historical logs, training deep recommendation model weights and generating static historical user profiles.

2. Dual-Stage Inference Pipeline

When a user requests their feed, this system generates, ranks, and diversifies candidates in under 100ms.

graph TD
    UserRequest[User Requests Feed] --> FeedServ[Feed Serving Service]
    
    subgraph Retrieval [Stage 1: Retrieval / Candidate Generation]
        FeedServ --> VectorAnn[Vector DB ANN Lookup]
        FeedServ --> CollabFilter[Collaborative Filtering Cache]
        FeedServ --> TrendingCache[Trending / Popularity Index]
    end
    
    VectorAnn --> MergeCand[Merge Candidates ~1,200 items]
    CollabFilter --> MergeCand
    TrendingCache --> MergeCand
    
    subgraph Filtering [Stage 2: Filtering]
        MergeCand --> BloomFilter[Redis Bloom Filter - Remove Seen Items]
        BloomFilter --> BlacklistFilter[Exclude Blocked Creators/Flagged Content]
    end
    
    BlacklistFilter -->|~1,000 Clean Candidates| Ranking[Stage 3: Deep Ranking Service]
    
    subgraph Ranking [Stage 3: Heavy Machine Learning Ranking]
        Ranking --> Triton[Triton/TF Serving GPU Cluster]
        RedisFeat[(Redis Feature Store)] -->|Fetch Real-time User/Item Features| Triton
        Triton -->|MMoE Deep Neural Network Scoring| RankedList[Scored Video List]
    end
    
    RankedList --> ReRank[Stage 4: Re-ranking & Diversity]
    
    subgraph ReRanking [Stage 4: Business Rules & Re-ranking]
        ReRank --> Diversity[Creator & Audio Diversity Rules]
        ReRank --> AdInserter[Ad Injection Engine]
    end
    
    Diversity --> Return[Top 20 Personalized Videos]
    AdInserter --> Return
    Return --> UserRequest

Detailed Dual-Stage Pipeline Walkthrough:

Stage 1: Candidate Generation (Retrieval): Prunes the vast catalog from 10 Million videos down to a high-quality subset of approximately 1,200 items. To prevent any single retrieval algorithm from blinding recommendations, we execute three models in parallel:
- Vector ANN (Approximate Nearest Neighbors): Queries our Vector DB using the user's active preference embedding, retrieving videos whose content embeddings are closest in vector space.
- Collaborative Filtering Cache: Looks up precomputed user-collaborative recommendations (e.g., items viewed by users with similar historical consumption profiles).
- Trending/Popularity Index: Acts as a backup fallback, pulling global or geolocated trending content to aid discovery and capture viral surges.
Stage 2: Filtering: Cleans the candidate list by filtering out videos the user has already viewed in the last 15 days (using a space-efficient Redis Bloom Filter per user), blocked creators, or content containing flagged reports.
Stage 3: Scoring & Ranking: Evaluates the cleaned candidates using our heavy GPU-backed Deep Learning models. The serving service constructs comprehensive feature tensors by merging real-time Flink features from Redis and historical user data.
Stage 4: Re-ranking & Diversity Rules: Polishes the raw ML predictions to guarantee a high-quality user experience. It dynamically enforces category boundaries, creator separation, and audio track variety, inserting sponsored ad items before sending the payload.

Low-Level Design (LLD) & Data Models

We store raw event telemetry, user profiles, video metadata, and dynamic features across different databases depending on read/write requirements.

Database Selection & Schema DDL

1. Cassandra User-Item Telemetry Store (Offline Processing)

We choose Apache Cassandra for raw interaction logging due to its high-throughput linear write scaling. The partition key is user_id to enable fast offline batch reads per user, while timestamp is used as a clustering column.

CREATE KEYSPACE recommendation_store 
WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west': 3};

USE recommendation_store;

CREATE TABLE user_interactions (
    user_id text,
    timestamp timestamp,
    video_id text,
    interaction_type text, -- 'impression', 'watch_time', 'like', 'share', 'skip'
    value double,
    device_id text,
    PRIMARY KEY (user_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

2. Redis Feature Store Layout (Online Latency Boundary)

Online inference requires sub-millisecond retrieval of user and video features. We use Redis as our feature store.

User Dynamic Interaction Vector Key: user:feat:{user_id} (Type: Redis Hash)
Video Static Metadata Embeddings: video:vector:{video_id} (Type: Redis Hash or binary blob)

Redis Key-Value Schema Example:

HSET user:feat:usr_9823471029 \
  tag_distributed_systems_weight 0.88 \
  tag_ai_ml_weight 0.94 \
  average_completion_rate 0.82 \
  last_active_timestamp 1779435422

Core Vector Retrieval & Cosine Similarity Implementation

During Candidate Generation, we perform multi-threaded vector similarity lookups to retrieve items close to the user's interaction profile. Here is a production-grade Python implementation demonstrating the mathematical normalization and score extraction from raw database output.

import numpy as np
from typing import List, Dict, Tuple

class VectorRetrievalEngine:
    def __init__(self, dimension: int = 512):
        self.dimension = dimension
        self.epsilon = 1e-9  # Prevent division by zero

    def normalize_vector(self, vector: np.ndarray) -> np.ndarray:
        """
        L2 normalizes the input vector to speed up cosine similarity calculations.
        If normalization is precomputed, similarity is simplified to a dot product.
        """
        norm = np.linalg.norm(vector)
        if norm < self.epsilon:
            return np.zeros(self.dimension)
        return vector / norm

    def retrieve_top_candidates(
        self, 
        user_embedding: np.ndarray, 
        candidate_catalog: Dict[str, np.ndarray], 
        top_k: int = 100
    ) -> List[Tuple[str, float]]:
        """
        Computes cosine similarity between user embedding and candidate items.
        Returns the top_k candidate IDs paired with their similarity score.
        """
        if user_embedding.shape[0] != self.dimension:
            raise ValueError(f"User embedding dimension must be {self.dimension}")

        normalized_user = self.normalize_vector(user_embedding)
        scored_candidates = []

        for item_id, item_raw_vector in candidate_catalog.items():
            if item_raw_vector.shape[0] != self.dimension:
                continue
            
            # L2 normalize candidate vector
            normalized_item = self.normalize_vector(item_raw_vector)
            
            # Since vectors are L2 normalized, Cosine Similarity is the Dot Product
            similarity_score = float(np.dot(normalized_user, normalized_item))
            
            scored_candidates.append((item_id, similarity_score))

        # Sort candidates descending by similarity score
        scored_candidates.sort(key=lambda x: x[1], reverse=True)
        
        return scored_candidates[:top_k]

# Unit validation block
if __name__ == "__main__":
    engine = VectorRetrievalEngine(dimension=4)
    user_vector = np.array([0.5, 0.5, 0.0, 0.0])
    
    catalog = {
        "video_ml_1": np.array([0.9, 0.9, 0.0, 0.0]),
        "video_cooking_2": np.array([0.0, 0.0, 1.0, 1.0]),
        "video_ml_3": np.array([0.4, 0.5, 0.1, 0.0])
    }
    
    results = engine.retrieve_top_candidates(user_vector, catalog, top_k=2)
    print("Top matched candidates:", results)
    assert results[0][0] == "video_ml_1"
    print("Verification Successful!")

Scaling Challenges & System Bottlenecks

1. Vector Search Tail-Latency Bottlenecks

Linear scanning of 10M vectors takes several seconds, violating our sub-100ms SLA.

Mitigation (HNSW Indexing & Quantization): We shard our Vector DB by location and content tags. We implement Hierarchical Navigable Small World (HNSW) proximity graphs alongside Product Quantization (PQ) to compress vector payloads. This limits retrieval to evaluating a sparse subset of clusters, reducing search latency to under 5ms.

2. Cache Stampedes and Feature Store Saturation

During peak events, the Feature Store receives over 70,000 read requests per second, which can easily trigger cache stampedes.

Mitigation (Probabilistic Early Expiration & Lock Striping): We apply XFetch algorithm parameters (Probabilistic Early Expiration) on the user feature store, refreshing cache data in the background before keys expire. Dynamic lock-striping ensures that if a cache key is missing, only a single thread queries Cassandra while concurrent requests block or read a stale version.

3. Feedback Loop Bubbles & Filter Exhaustion

If users only see what they click, they get locked into echo chambers, increasing long-term user attrition.

Mitigation (Epsilon-Greedy Exploration & SimHash Distance Check): We inject exploration candidates (5-10% of the feed) chosen via contextual bandit algorithms (e.g., LinUCB). Additionally, we enforce mathematical SimHash diversity checks during the Re-ranking stage, filtering out candidates that are more than 90% syntactically similar to videos viewed in the current session.

System Trade-offs & CAP Posture

1. AP vs CP Posture on Ingest vs Inference

Inference Feed serving is strictly AP (Availability / Partition Tolerance): If a region's feature store falls out of sync, we do not fail the request. We fallback to cached static rankings or generalized trending lists. High availability and latency boundaries are prioritized over absolute accuracy.
Ingestion Pipeline is Eventual Consistency: As telemetry queues buffer inside Kafka, Flink computes features out-of-order. It is perfectly fine if a user's swipe takes 2-3 seconds to influence their next recommendations. The business values high-throughput ingestion over instant consistency.

Failure Scenarios & Resilience

1. Heavy Machine Learning Model Inaccessibility

If the Triton GPU cluster becomes overloaded or fails, candidate scoring halts, causing the feed request to fail.

Resilience (Decoupled Graceful Degradation): We deploy a Fallback Rules Engine utilizing a circuit breaker. If GPU clusters return 5xx errors or exceed 50ms processing times, the circuit breaker trips. The request degrades gracefully, skipping deep ranking entirely and serving items directly from the Collaborative Filtering Cache and Trending Cache processed by CPU clusters.

[ Inference Gateway ] 
       │
       ├──(Normal Path)──> [ Triton GPU Cluster (Heavy DNN Model) ] (If OK)
       │
       └──(Fallback / Circuit Broken)──> [ Collaborative Filtering Cache (CPU) ]

2. Kafka Network Partition & Telemetry Data Loss

If a Kafka broker partition becomes unavailable, event telemetry gets stuck, blinding Flink processors.

Resilience (Local Storage Buffering & Idempotency): Clients store telemetry local events in an encrypted SQLite database on the mobile device. In the event of network disruption or Gateway 503 errors, the client buffers data and retries with exponential backoff. Ingestion APIs enforce a uuid duplicate-key check to ensure delayed events are not double-counted.

Staff Engineer Perspective (Operational Deep Dive)

[!WARNING] The Real-World Pitfall of Model Drift and Concept Drift A highly optimized machine learning model can degrade dramatically in production. Concept Drift occurs when the underlying relation between user attributes and interaction changes over time (e.g., seasonal trends like holiday shopping or viral memes).

To prevent system degradation:

Set up real-time monitoring of the Population Stability Index (PSI) and Kolmogorov-Smirnov (KS) statistic between inference inputs and training data.

Implement continuous training loops that automatically trigger model retraining when PSI exceeds 0.2.

Mitigating the Thundering Herd in Feature Stores

In high-throughput recommendation systems, millions of users click "refresh" simultaneously. If a feature cache partition crashes, a massive queue of requests immediately cascades into backend relational databases. To resolve this, always implement Request Collapsing inside your serving layer. If 1,000 queries request the same trending category static features, collapse them into a single flight request using a single-flight cache wrapper before hitting the primary storage layer.

Verbal Script & Mock Interview Guide

An illustrative dialogue between an interviewer and a Principal Systems Architect.

Interviewer: "I see you opted for a dual-stage pipeline (Retrieval and Ranking). Why not just run the Deep Neural Network model over the entire database of 10 million videos to find the best recommendations?"

Candidate: "Running a complex Deep Ranking model (like an MMoE model with thousands of parameters) requires significant GPU processing power. Evaluating a single user-video pair might take 5ms. If we ran this over 10 million candidate videos, the inference phase would require 50,000 seconds of computation per user request. This would completely fail our 100ms P99 latency SLA. By decoupling this into a fast Retrieval stage, we can use simple, high-speed mathematical filters (like HNSW vector proximity search) to quickly prune the catalog from 10 million items down to 1,000 relevant candidates in under 5 milliseconds. We then reserve our computationally expensive Deep Neural Network for scoring only those 1,000 candidates, completing the entire process well within our strict latency SLA."

Interviewer: "How do you handle real-time feature updates without exhausting database connections?"

Candidate: "If every user interaction immediately triggered a database write to Cassandra, our storage engine would saturate. Instead, we divide the data path. Ingestion services stream raw telemetry events directly into Apache Kafka. Apache Flink consumes these events and aggregates user activities over sliding time-windows (e.g., tag weights in the last 10 minutes). Flink performs these aggregates in-memory and executes non-blocking batch updates directly to our Redis-backed online feature store. The inference API gateway only queries Redis to construct the model feature vectors, completely isolating our core databases from write-heavy telemetry peaks."

Interviewer: "Excellent. How would you handle the cold-start problem for a new video?"

Candidate: "For a new video, there are no historical interaction patterns. To solve this, we generate an initial content embedding using raw assets (audio transcripts, video tags, visual frames). We then inject these new videos into the retrieval pools of a small subset of random users (contextual bandits). Once the video obtains its first 1,000 telemetry views, we recalculate its engagement score and transition it into the main ranking pipeline."