Case Study: Design YouTube (Video Streaming)
Mental Model
Connecting isolated components into a resilient, scalable, and observable distributed web.
Designing a video streaming platform like YouTube or Netflix tests your ability to handle two things at once: high-bandwidth workloads and global content delivery.
1. Requirement Clarification
Functional
- Users can upload videos.
- Users can watch videos on any device (Web, Mobile).
- Users can search for videos.
- View counts and real-time analytics.
Non-Functional
- Scalability: Handle 1M+ uploads/day.
- Availability: 99.99%.
- Reliability: No loss of uploaded videos.
- Latency: Fast playback start and minimal rebuffering during playback.
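To make these targets concrete, here is a minimal back-of-envelope sketch. The upload rate and availability figures come from the list above; the 300 MB average upload size is a hypothetical number chosen only to illustrate the arithmetic, not a figure from this case study.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def uploads_per_second(uploads_per_day: int) -> float:
    """Average upload rate implied by a daily upload count."""
    return uploads_per_day / SECONDS_PER_DAY


def raw_ingest_tb_per_day(uploads_per_day: int, avg_upload_mb: float) -> float:
    """Raw storage written per day in decimal terabytes (1 TB = 1,000,000 MB)."""
    return uploads_per_day * avg_upload_mb / 1_000_000


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget allowed by an availability target such as 0.9999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    print(f"{uploads_per_second(1_000_000):.1f} uploads/s on average")
    # avg_upload_mb=300 is an assumed value for illustration only.
    print(f"{raw_ingest_tb_per_day(1_000_000, 300):.0f} TB/day of raw video")
    print(f"{downtime_minutes_per_year(0.9999):.1f} min/year of allowed downtime")
```

Averages hide peaks, so a real capacity plan would size for a multiple of the average rate.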
2. High-Level Architecture
- Ingestion: Receives the raw video file.
- Transcoding: Converts video into multiple formats and resolutions (360p, 720p, 4K).
- Storage: Metadata in a SQL database; raw and transcoded files in blob storage (e.g., S3).
- CDN: Serves content from edge nodes near the user.
3. The Transcoding Pipeline
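Transcoding is CPU-intensive, so it should not run inside the upload request. We use an asynchronous pipeline, which lets the upload return as soon as the raw file is stored and lets the worker pool scale independently: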
- Raw Video $\rightarrow$ S3 $\rightarrow$ Kafka $\rightarrow$ Workers $\rightarrow$ Transcoded Segments $\rightarrow$ S3.
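The flow above can be sketched in a few lines. This is a minimal in-process illustration, not a production design: `queue.Queue` stands in for Kafka, a plain dict stands in for S3, and `transcode` is a hypothetical placeholder for the real encoding step.

```python
import queue
import threading

RENDITIONS = ("360p", "720p", "4K")  # resolutions named in the architecture section


def transcode(raw_key: str, rendition: str) -> str:
    """Hypothetical placeholder for real transcoding; returns the output key."""
    return f"transcoded/{raw_key}/{rendition}"


def worker(jobs: queue.Queue, object_store: dict, lock: threading.Lock) -> None:
    """Consume raw-video keys until a None sentinel arrives."""
    while True:
        raw_key = jobs.get()
        if raw_key is None:  # sentinel: shut this worker down
            jobs.task_done()
            return
        for rendition in RENDITIONS:
            out_key = transcode(raw_key, rendition)
            with lock:  # the dict is shared between worker threads
                object_store[out_key] = f"segments of {raw_key} at {rendition}"
        jobs.task_done()


def run_pipeline(raw_keys: list[str], num_workers: int = 2) -> dict:
    """Enqueue raw uploads, let workers transcode them, return the object store."""
    jobs: queue.Queue = queue.Queue()  # stand-in for a Kafka topic
    object_store: dict = {}            # stand-in for S3
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, object_store, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for key in raw_keys:
        jobs.put(key)  # the upload path only enqueues; it never waits on encoding
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return object_store
```

Calling `run_pipeline(["raw/a.mp4", "raw/b.mp4"])` produces one output key per video per rendition. The point of the queue is that a burst of uploads only lengthens the backlog; it does not slow down the upload endpoint.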
4. Adaptive Bitrate Streaming (DASH/HLS)
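The system doesn't send one giant file. It breaks the video into segments of 2 to 5 seconds, encodes each segment at several resolutions, and lists them in a manifest that the player downloads first. The player then switches between resolutions segment by segment based on the user's measured network speed.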
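The player-side decision can be sketched as a simple throughput-based rule. The bitrate ladder and the 0.8 safety factor below are hypothetical values for illustration; real players also weigh buffer level and recent throughput history.

```python
# Hypothetical bitrate ladder in kbps, lowest first; real ladders are tuned per codec.
LADDER = [("360p", 800), ("720p", 2500), ("4K", 16000)]


def pick_rendition(measured_kbps: float, safety_factor: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within a share of throughput.

    Falls back to the lowest rendition when nothing fits, so playback continues.
    """
    budget = measured_kbps * safety_factor
    chosen = LADDER[0][0]
    for name, bitrate_kbps in LADDER:
        if bitrate_kbps <= budget:
            chosen = name
    return chosen
```

With this ladder, a measured 5,000 kbps yields a 4,000 kbps budget and selects 720p, while 500 kbps falls back to 360p. The safety factor leaves headroom so a small dip in throughput does not immediately stall playback.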
5. View Count (The Big Data Problem)
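Incrementing a single database row on every view of a viral video creates a hot row: concurrent writes serialize on that row's lock, and the resulting contention can overwhelm the database.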
- Fix: Use a distributed counter (Redis) and periodically flush aggregates to the main DB.
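A minimal sketch of this fix follows. It is an in-memory model under stated assumptions: the `shards` dict stands in for Redis keys (where each increment would typically be an `INCR`), the `database` dict stands in for the main DB, and the class and method names are hypothetical.

```python
import random
from collections import defaultdict


class ShardedViewCounter:
    """Spread increments for one video across shards, then flush aggregates."""

    def __init__(self, num_shards: int = 8):
        self.num_shards = num_shards
        self.shards = defaultdict(int)    # stand-in for Redis counters
        self.database = defaultdict(int)  # stand-in for the main DB

    def record_view(self, video_id: str) -> None:
        """Hot path: bump a random shard so no single key takes every write."""
        shard = random.randrange(self.num_shards)
        self.shards[(video_id, shard)] += 1

    def flush(self) -> int:
        """Periodic job: sum shards per video and write one aggregate per video.

        Returns the number of database writes performed.
        """
        pending, self.shards = self.shards, defaultdict(int)
        totals = defaultdict(int)
        for (video_id, _shard), count in pending.items():
            totals[video_id] += count
        for video_id, total in totals.items():
            self.database[video_id] += total
        return len(totals)
```

Ten thousand views of one video become a single database write per flush interval. The trade-off is that the count shown from the main DB lags by up to one interval, and counts buffered in memory can be lost if the counter store fails before a flush.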
Final Takeaway
Video streaming is about Decoupling Ingestion from Delivery. Ingestion needs reliable pipelines; delivery needs a massive global CDN.
Technical Trade-offs: Messaging Systems
| Pattern | Ordering | Durability | Throughput | Complexity |
|---|---|---|---|---|
| Log-based (Kafka) | Strict (per partition) | High | Very High | High |
| Memory-based (Redis Pub/Sub) | None | None (fire-and-forget) | High | Very Low |
| Push-based (RabbitMQ) | FIFO per queue | Medium | Medium | Medium |
Production Readiness Checklist
Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:
- High Availability: Have we eliminated single points of failure across all layers?
- Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
- Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
- Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
- Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
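The idempotency and backpressure items can be illustrated together. This is a minimal single-process sketch: `UploadService`, its methods, and the `(status, video_id)` response tuples are hypothetical names, and a real service would keep idempotency keys in shared storage with an expiry.

```python
class UploadService:
    """Toy upload endpoint with idempotency keys and a concurrency limit."""

    def __init__(self, max_in_flight: int = 2):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.results = {}  # idempotency key -> response already returned

    def create_upload(self, idempotency_key: str, title: str):
        """Return (status, video_id); a retried key replays the stored response."""
        if idempotency_key in self.results:
            return self.results[idempotency_key]  # retry: no new side effect
        if self.in_flight >= self.max_in_flight:
            return (429, None)  # saturated: shed load instead of queueing forever
        self.in_flight += 1
        response = (201, f"video-{len(self.results) + 1}")
        self.results[idempotency_key] = response
        return response

    def finish_upload(self) -> None:
        """Release one in-flight slot when an upload completes."""
        self.in_flight = max(0, self.in_flight - 1)
```

A client that times out and retries with the same key gets the original response back rather than a second video, and a client that receives 429 is expected to back off before retrying.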
Verbal Interview Script
Interviewer: "How would you ensure high availability and fault tolerance for this specific architecture?"
Candidate: "To meet the 99.99% availability target, we must eliminate Single Points of Failure (SPOF). I would deploy the API Gateway and stateless microservices across multiple Availability Zones (AZs) behind an active-active load balancer. For the data layer, I would use asynchronous replication to a read replica in a different region for disaster recovery, accepting that a failover may lose the most recent writes. Redundancy alone is not enough, though; we must also protect the system from cascading failures. I would implement strict timeouts, retries with exponential backoff and jitter, and circuit breakers (using a library like Resilience4j) on all synchronous network calls between microservices."
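The last two mechanisms in that answer can be sketched briefly. This is a minimal Python illustration of the general patterns, not Resilience4j's API (which is a Java library); the class, function, and parameter names are hypothetical.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(attempts, base_s=0.1, cap_s=5.0, rng=random.random):
    """Full-jitter backoff: delay n is rng() * min(cap_s, base_s * 2**n)."""
    return [rng() * min(cap_s, base_s * 2 ** n) for n in range(attempts)]
```

The breaker stops a struggling dependency from being hammered by callers, and the jitter keeps many clients from retrying at the same instant after an outage.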