Project Case Study: Designing YouTube (Video Streaming at Global Scale)

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

Designing a video streaming platform like YouTube or Netflix tests two things above all: handling high-bandwidth traffic and delivering content globally.

1. Requirement Clarification

Functional

Users can upload videos.
Users can watch videos on any device (Web, Mobile).
Users can search for videos.
The system tracks view counts and real-time analytics.

Non-Functional

Scalability: Handle 1M+ uploads/day.
Availability: 99.99%.
Reliability: No loss of uploaded videos.
Latency: Fast playback start and minimal rebuffering during playback.
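
To make these targets concrete, a quick back-of-the-envelope calculation helps. The sketch below converts the upload and availability targets above into an average upload rate and a yearly downtime budget. The function names are illustrative, and the result is an average only; real traffic has peaks well above it.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def average_uploads_per_second(uploads_per_day: int) -> float:
    """Average upload rate implied by a daily upload volume."""
    return uploads_per_day / SECONDS_PER_DAY


def downtime_budget_minutes_per_year(availability: float) -> float:
    """Minutes per year the service may be down at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # 1M uploads/day is roughly 11.6 uploads per second on average.
    print(f"{average_uploads_per_second(1_000_000):.1f} uploads/s")
    # 99.99% availability allows roughly 52.6 minutes of downtime per year.
    print(f"{downtime_budget_minutes_per_year(0.9999):.1f} min/year")
```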

2. High-Level Architecture

Ingestion: Receives the raw video file.
Transcoding: Converts video into multiple formats and resolutions (360p, 720p, 4K).
Storage: Metadata in SQL, Raw files in Blob Storage (S3).
CDN: Serves content from edge nodes near the user.

3. The Transcoding Pipeline

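Transcoding is CPU-intensive, so it must not block the upload request. We use an asynchronous pipeline in which the upload is acknowledged once the raw file is stored, and workers transcode it later: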

Raw Video → S3 → Kafka → Workers → Transcoded Segments → S3.
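
The worker stage of this pipeline can be sketched as follows. This is a minimal in-memory simulation, not a production implementation: a standard-library queue stands in for Kafka, a dict stands in for S3, and `transcode` is a hypothetical placeholder for a real encoder. The key idea shown is that a worker skips outputs that already exist, so a redelivered job is harmless.

```python
import queue
from dataclasses import dataclass

RENDITIONS = ("360p", "720p", "4K")


@dataclass(frozen=True)
class TranscodeJob:
    video_id: str
    raw_key: str  # key of the raw upload in blob storage


def transcode(raw_bytes: bytes, rendition: str) -> bytes:
    """Hypothetical placeholder for a real encoder invocation."""
    return rendition.encode() + b":" + raw_bytes


def run_worker(jobs: queue.Queue, blob_store: dict) -> int:
    """Drain the job queue and write one output per rendition.

    Returns the number of jobs consumed. Outputs that already exist are
    skipped, so processing the same job twice does not duplicate work.
    """
    processed = 0
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return processed
        raw = blob_store[job.raw_key]
        for rendition in RENDITIONS:
            out_key = f"transcoded/{job.video_id}/{rendition}"
            if out_key not in blob_store:
                blob_store[out_key] = transcode(raw, rendition)
        processed += 1


if __name__ == "__main__":
    store = {"raw/v1": b"frames"}
    q = queue.Queue()
    q.put(TranscodeJob("v1", "raw/v1"))
    run_worker(q, store)
    print(sorted(store))
```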

4. Adaptive Bitrate Streaming (DASH/HLS)

The system doesn't send one giant file. It breaks the video into 2-5 second segments. The player automatically switches between resolutions based on the user's network speed.
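
The switching decision can be sketched as a simple throughput-based rule: pick the highest rendition whose bitrate fits within a safety margin of the measured bandwidth. The bitrate values and the `safety_factor` below are illustrative assumptions, not figures from DASH or HLS, and real players typically also consider buffer level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int


# Illustrative bitrate ladder; real ladders are tuned per codec and content.
LADDER = (
    Rendition("360p", 800),
    Rendition("720p", 3000),
    Rendition("4K", 16000),
)


def choose_rendition(measured_kbps: float,
                     ladder=LADDER,
                     safety_factor: float = 0.8) -> Rendition:
    """Return the highest-bitrate rendition that fits the bandwidth budget.

    Falls back to the lowest rendition when nothing fits, so playback
    continues at reduced quality instead of stalling.
    """
    budget = measured_kbps * safety_factor
    affordable = [r for r in ladder if r.bitrate_kbps <= budget]
    if not affordable:
        return min(ladder, key=lambda r: r.bitrate_kbps)
    return max(affordable, key=lambda r: r.bitrate_kbps)


if __name__ == "__main__":
    for kbps in (500, 5000, 25000):
        print(kbps, "->", choose_rendition(kbps).name)
```

The player would re-run this choice before requesting each segment, which is what lets quality change every few seconds as network conditions shift.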

5. View Count (The Big Data Problem)

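Incrementing a single DB row on every view of a viral video creates a hot row: all writes contend for the same lock, and the database can stall or fall over under the load.

Fix: Increment counters in a fast in-memory store (Redis), sharded if needed, and periodically flush the aggregated deltas to the main DB.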
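
The buffering idea can be sketched in a few lines. In this simplified, single-process version an in-memory `Counter` stands in for Redis and a dict stands in for the main DB; both are assumptions for illustration. The point is that many views collapse into one database write per video per flush.

```python
from collections import Counter


class BufferedViewCounter:
    """Accumulates view increments and flushes aggregated deltas."""

    def __init__(self, db: dict):
        self._pending = Counter()  # stand-in for the in-memory store
        self._db = db              # stand-in for the main database

    def record_view(self, video_id: str) -> None:
        # Cheap in-memory increment; no database write happens here.
        self._pending[video_id] += 1

    def flush(self) -> int:
        """Apply pending deltas to the DB; return the number of DB writes."""
        batch, self._pending = self._pending, Counter()
        for video_id, delta in batch.items():
            self._db[video_id] = self._db.get(video_id, 0) + delta
        return len(batch)


if __name__ == "__main__":
    db = {}
    counter = BufferedViewCounter(db)
    for _ in range(1000):
        counter.record_view("viral-video")
    writes = counter.flush()
    print(db["viral-video"], "views recorded with", writes, "DB write")
```

The trade-off is that the count in the main DB lags by up to one flush interval, and pending increments are lost if the buffer fails before a flush.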

Final Takeaway

Video streaming is about Decoupling Ingestion from Delivery. Ingestion needs reliable pipelines; delivery needs a massive global CDN.

Technical Trade-offs: Messaging Systems

Pattern	Ordering	Durability	Throughput	Complexity
Log-based (Kafka)	Strict (per partition)	High	Very High	High
Memory-based (Redis Pub/Sub)	None	None (fire-and-forget)	High	Very Low
Push-based (RabbitMQ)	Fair	Medium	Medium	Medium

Production Readiness Checklist

Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:

High Availability: Have we eliminated single points of failure across all layers?
Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
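
As one way to picture the backpressure item, the sketch below caps the number of in-flight requests and rejects the excess with HTTP 429 instead of queueing without bound. The class and function names are hypothetical, and a real service would usually do this in its framework or gateway.

```python
import threading
from typing import Callable, Tuple


class AdmissionController:
    """Limits the number of requests processed concurrently."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle_request(controller: AdmissionController,
                   work: Callable[[], str]) -> Tuple[int, str]:
    """Run `work` if capacity allows; otherwise shed load with a 429."""
    if not controller.try_acquire():
        return 429, "Too Many Requests"
    try:
        return 200, work()
    finally:
        controller.release()


if __name__ == "__main__":
    controller = AdmissionController(max_in_flight=1)
    print(handle_request(controller, lambda: "ok"))
```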

Verbal Interview Script

Interviewer: "How would you ensure high availability and fault tolerance for this specific architecture?"

Candidate: "To meet the 99.99% availability target, we must eliminate every single point of failure (SPOF). I would deploy the API gateway and stateless microservices across multiple Availability Zones (AZs) behind an active-active load balancer. For the data layer, I would replicate asynchronously to a read replica in a different region for disaster recovery, accepting that a failover can lose the most recent writes. Redundant deployment alone is not enough; we must also protect the system from cascading failures. I would apply strict timeouts, retries with exponential backoff and jitter, and circuit breakers (using a library like Resilience4j) on all synchronous network calls between microservices."
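
The two protective mechanisms named in the answer can be sketched as follows. This is a simplified, single-threaded illustration written in Python and is not the Resilience4j API; the class, function, and parameter names are assumptions. `backoff_delays` uses the "full jitter" variant, where each delay is drawn uniformly between zero and a capped exponential bound.

```python
import random
import time


def backoff_delays(base_s: float, cap_s: float, attempts: int,
                   rng=random.random) -> list:
    """Full-jitter delays: uniform in [0, min(cap, base * 2**attempt)]."""
    return [rng() * min(cap_s, base_s * 2 ** attempt)
            for attempt in range(attempts)]


class CircuitBreaker:
    """Opens after repeated failures and short-circuits to a fallback."""

    def __init__(self, failure_threshold: int, reset_timeout_s: float,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                return fallback()  # open: do not touch the dependency
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    print(backoff_delays(base_s=0.1, cap_s=2.0, attempts=5))
```

The jitter matters because many clients retrying on the same schedule would otherwise hit a recovering service in synchronized waves.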