System Design: Designing a Video Conferencing System (Zoom / MS Teams)

Designing a real-time video conferencing system like Zoom or Microsoft Teams is fundamentally different from building a video streaming service like YouTube or Netflix. While streaming services prioritize video quality and tolerate seconds of buffering, video conferencing prioritizes latency above all else.

In human conversation, end-to-end latency above 200 milliseconds makes interactive communication difficult and leads to speakers talking over one another. Latency above 400 milliseconds renders the service practically unusable. To achieve sub-200ms latency globally, we must design a networking architecture that avoids TCP for media transport, handles varying network conditions, and scales to a thousand participants in a single meeting.

This system design guide details the architectural blueprint for designing a real-time, low-latency video conferencing platform capable of scaling to 1,000 participants per meeting.

System Requirements

To design a video conferencing platform, we divide the requirements into functional requirements, non-functional requirements, and explicit scale parameters.

Functional Requirements

Real-time Media Streaming: Deliver bi-directional video and audio streams between participants with low latency.
Dynamic Meeting Sign-in: Enable users to create, join, and leave meetings using unique meeting identifiers.
Screen Sharing: Support high-resolution, low-frame-rate screen capture broadcasting.
Active Speaker Detection: Automatically identify the dominant speaker and highlight their video feed in the user interface.
Roster & Metadata Management: Support in-call features like chat messages, hand-raising, and muting states.

Non-Functional Requirements

Sub-200ms End-to-End Latency: Maintain audio and video packet transmission delay below 200ms under standard network conditions.
Adaptive Bitrate Streaming: Smoothly adapt video resolution and frame rate when a participant's network bandwidth fluctuates.
High Availability & Fault Tolerance: Ensure meetings do not terminate if a media routing server crashes; the system must transition streams to a healthy server immediately.
Security & Encryption: Protect all media feeds with hop-by-hop encryption using Secure Real-time Transport Protocol (SRTP), optionally adding end-to-end encryption of the media payload on top so that the media servers cannot read it.

Scale Assumptions

Meeting Scale: Support up to 1,000 participants in a single high-profile call.
Concurrency: Support 100,000 concurrent meetings globally.
Active Video Feeds: Average of 100 participants with cameras enabled in a 1,000-user meeting; the remaining 900 act as passive viewers.

API Design and Interface Contracts

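The interface boundary of our video platform is divided into standard REST endpoints for administrative actions and a real-time, bidirectional signaling channel. The signaling contract is defined below as a gRPC stream; the same messages can also be carried over a WebSocket, as the architecture section assumes.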

1. Create Meeting Session (HTTP POST `/v1/meetings`)

Invoked by host clients to allocate a global meeting room.

Request Payload:

{
  "hostId": "usr_998273_alpha",
  "meetingTitle": "Weekly Architecture Review",
  "allowGuestJoin": true,
  "defaultMuted": true
}

Response Payload (201 Created):

{
  "meetingId": "mtg-qwerty-9012-zxc",
  "hostId": "usr_998273_alpha",
  "joinUrl": "https://meet.codesprintpro.com/j/mtg-qwerty-9012-zxc",
  "accessToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "createdAt": "2026-06-07T11:48:00Z"
}
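The `accessToken` in the response above begins with the standard header of an HS256-signed JWT. As a rough illustration of how a server could mint and check such a token, the sketch below uses only the Python standard library. The claim names (`sub`, `mtg`, `exp`), the secret, and the helper names are assumptions made for this example, not the platform's actual token format.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(secret: bytes, user_id: str, meeting_id: str, ttl_s: int = 3600) -> str:
    """Build header.claims.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    # Hypothetical claim names: subject, meeting, expiry (Unix seconds).
    claims = {"sub": user_id, "mtg": meeting_id, "exp": int(time.time()) + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(secret: bytes, token: str) -> dict:
    """Return the claims if the signature is valid and the token is unexpired."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(signing_input.split(".")[1]))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims


token = mint_token(b"demo-secret", "usr_998273_alpha", "mtg-qwerty-9012-zxc")
print(verify_token(b"demo-secret", token)["mtg"])
```

The signaling service can run the same check on the `token` field of the first `ClientMessage` before admitting a client to a meeting.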

2. Signaling Service gRPC Contract

Before establishing a direct media stream, clients must negotiate session attributes (codecs, resolutions, network paths) via a central signaling pathway. We define this streaming protocol using gRPC.

syntax = "proto3";

package codesprintpro.video.signaling.v1;

service SignalingService {
  rpc EstablishSession (stream ClientMessage) returns (stream ServerMessage);
}

message ClientMessage {
  string meeting_id = 1;
  string token = 2;
  oneof payload {
    SessionDescription sdp = 3;
    IceCandidate candidate = 4;
    Heartbeat ping = 5;
  }
}

message ServerMessage {
  oneof payload {
    SessionDescription sdp = 1;
    IceCandidate candidate = 2;
    ParticipantUpdate update = 3;
    ErrorDetails error = 4;
  }
}

message SessionDescription {
  enum Type {
    OFFER = 0;
    ANSWER = 1;
  }
  Type type = 1;
  string sdp_plaintext = 2;
}

message IceCandidate {
  string candidate_sdp = 1;
  string sdp_mid = 2;
  int32 sdp_m_line_index = 3;
}

message Heartbeat {
  int64 client_timestamp_ms = 1;
}

message ParticipantUpdate {
  string participant_id = 1;
  enum Action {
    JOINED = 0;
    LEFT = 1;
    MUTED = 2;
    UNMUTED = 3;
  }
  Action action = 2;
}

message ErrorDetails {
  int32 code = 1;
  string message = 2;
}

High-Level Architecture

The video conferencing system relies on two distinct operational planes: the Signaling Plane (which sets up connections) and the Media Plane (which routes encoded video/audio packets).

Signaling Plane (WebSocket Handshake and Allocation)

The signaling plane uses WebSockets or gRPC streams to negotiate network paths between client browsers and media servers. It assigns participants to media nodes based on geographic proximity.

graph TD
    Client1[Client App A] -->|1. Authenticate & Join| Gateway[API Gateway / Auth]
    Gateway -->|2. Route Join Request| SigService[Signaling Service]
    SigService -->|3. Get Assigned Server| Coord[Media Coordinator]
    Coord -->|4. Query Server Resource Loads| Consul[(Service Registry / Consul)]
    
    SigService -->|5. Store Session Map| Redis[(Redis Active Meetings DB)]
    SigService -->|6. Return Media Server IP| Client1
    
    Client1 -->|7. Exchange WebRTC SDP Offer| SigService
    SigService -->|8. Forward SDP to Media Node| SFU1[SFU Media Server Node A]
    SFU1 -->|9. Return SDP Answer| SigService
    SigService -->|10. Forward Answer| Client1

Media Plane (Selective Forwarding Unit Stream Fan-Out)

Once signaling is complete, the client establishes a direct WebRTC peer connection to the assigned Selective Forwarding Unit (SFU) using UDP. The SFU acts as a media router, receiving incoming streams and forwarding them to other participants.

graph TD
    subgraph Client Space
        PubClient[Publishing Client]
        RecvClient1[Receiving Client A]
        RecvClient2[Receiving Client B]
    end

    subgraph Media Infrastructure
        SFU[SFU Media Router Node]
    end

    PubClient -->|1. WebRTC Upload: UDP/SRTP| SFU
    PubNote["Sends 3 Simulcast Layers:<br/>High: 720p (1.5 Mbps)<br/>Med: 360p (500 Kbps)<br/>Low: 180p (150 Kbps)"] -.- PubClient

    SFU -->|2. Forward High Layer: 1.5 Mbps| RecvClient1
    NoteA["Client A has high bandwidth<br/>and uses Grid Layout"] -.- RecvClient1

    SFU -->|3. Forward Low Layer: 150 Kbps| RecvClient2
    NoteB["Client B has weak cellular network<br/>or active speaker is minimized"] -.- RecvClient2
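
The per-receiver layer choice in the diagram above can be sketched as a small selection function. This is a minimal illustration, assuming the SFU has a downlink bandwidth estimate and the rendered tile height for each subscriber; the `Subscriber` fields and the `pick_layer` name are hypothetical.

```python
from dataclasses import dataclass

# Simulcast layers from the diagram: (name, height in pixels, bitrate in kbps),
# ordered from best to worst quality.
LAYERS = [("high", 720, 1500), ("med", 360, 500), ("low", 180, 150)]


@dataclass
class Subscriber:
    est_downlink_kbps: int  # receiver bandwidth estimate (assumed input)
    tile_height_px: int     # height of the tile that shows this publisher


def pick_layer(sub: Subscriber, other_streams_kbps: int = 0) -> str:
    """Return the best layer that fits both the tile size and the bandwidth budget."""
    budget = sub.est_downlink_kbps - other_streams_kbps
    smallest_height = LAYERS[-1][1]
    for name, height, kbps in LAYERS:
        # Never send more pixels than the tile can show; the smallest
        # layer is always considered a fit for tiny tiles.
        fits_tile = height <= max(sub.tile_height_px, smallest_height)
        if fits_tile and kbps <= budget:
            return name
    # Nothing fits the budget: fall back to the smallest layer.
    return "low"


print(pick_layer(Subscriber(est_downlink_kbps=5000, tile_height_px=720)))
print(pick_layer(Subscriber(est_downlink_kbps=400, tile_height_px=720)))
```

Because the publisher already uploads every layer, switching a receiver between layers costs the SFU no transcoding; it only changes which incoming packets are forwarded.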

Low-Level Design and Schema

While the audio/video streams are stateless UDP packets, the system must track meeting allocations, participant connections, and active media servers in a SQL database.

-- Tracks global active and historic meeting sessions
CREATE TABLE meeting_sessions (
    meeting_id VARCHAR(64) PRIMARY KEY,
    title VARCHAR(256) NOT NULL,
    host_user_id UUID NOT NULL,
    session_status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, ENDED
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    ended_at TIMESTAMPTZ,
    max_participants_count INT NOT NULL DEFAULT 0
);

CREATE INDEX idx_meetings_status ON meeting_sessions (session_status, created_at DESC);

-- Tracks active media servers in the pool and their load status
CREATE TABLE media_servers (
    server_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    ip_address VARCHAR(45) NOT NULL UNIQUE,
    datacenter_region VARCHAR(64) NOT NULL,
    current_cpu_utilization DECIMAL(5, 2) NOT NULL DEFAULT 0.00,
    current_bandwidth_egress_mbps INT NOT NULL DEFAULT 0,
    active_connections_count INT NOT NULL DEFAULT 0,
    server_status VARCHAR(32) NOT NULL DEFAULT 'ONLINE', -- ONLINE, DRAINING, OFFLINE
    last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_servers_load ON media_servers (datacenter_region, server_status, active_connections_count ASC);

-- Maps active participants to their assigned media and signaling servers
CREATE TABLE meeting_participants (
    participant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    meeting_id VARCHAR(64) NOT NULL REFERENCES meeting_sessions(meeting_id) ON DELETE CASCADE,
    user_id UUID NOT NULL,
    connection_status VARCHAR(32) NOT NULL DEFAULT 'CONNECTING', -- CONNECTING, CONNECTED, DISCONNECTED
    assigned_media_server_id UUID NOT NULL REFERENCES media_servers(server_id),
    join_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    leave_time TIMESTAMPTZ,
    last_ping_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_participants_lookup ON meeting_participants (meeting_id, connection_status);
CREATE INDEX idx_participants_server ON meeting_participants (assigned_media_server_id) WHERE connection_status = 'CONNECTED';

-- Records connection quality issues for operations and support
CREATE TABLE participant_quality_logs (
    log_id BIGSERIAL PRIMARY KEY,
    meeting_id VARCHAR(64) NOT NULL,
    user_id UUID NOT NULL,
    event_type VARCHAR(64) NOT NULL, -- PACKET_LOSS_ALERT, ICE_RESTART, DISCONNECT
    packet_loss_percentage DECIMAL(5, 2) NOT NULL DEFAULT 0.00,
    latency_rtt_ms INT NOT NULL DEFAULT 0,
    codec_in_use VARCHAR(32),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_quality_meeting ON participant_quality_logs (meeting_id, created_at DESC);

Schema Rationale & Index Optimization

idx_servers_load: This composite index serves server allocation queries. When a new participant joins, the coordinator queries for the server with the lowest connection count in the target region (e.g., WHERE datacenter_region = 'us-east-1' AND server_status = 'ONLINE' ORDER BY active_connections_count ASC LIMIT 1). Because the index is already ordered by connection count within each region and status, the database can read the first matching entry instead of scanning and sorting the table.
idx_participants_server: A partial index restricted to connection_status = 'CONNECTED'. If a media server crashes, the coordinator uses this index to identify all currently connected participants on that server and trigger reconnect alerts.
ON DELETE CASCADE on meeting_participants: Removes the mapping rows automatically when a meeting record is deleted (for example, after it has been archived elsewhere), simplifying table maintenance.
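
The allocation query served by idx_servers_load can be tried end to end with a small script. This sketch uses an in-memory SQLite database as a stand-in for the production SQL database, with a trimmed-down `media_servers` table; the sample rows and the `pick_server` helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of the media_servers table: only the columns the query needs.
conn.execute("""
    CREATE TABLE media_servers (
        server_id TEXT PRIMARY KEY,
        datacenter_region TEXT NOT NULL,
        server_status TEXT NOT NULL DEFAULT 'ONLINE',
        active_connections_count INTEGER NOT NULL DEFAULT 0
    )
""")
conn.execute("""
    CREATE INDEX idx_servers_load
    ON media_servers (datacenter_region, server_status, active_connections_count ASC)
""")
conn.executemany(
    "INSERT INTO media_servers VALUES (?, ?, ?, ?)",
    [
        ("sfu-a", "us-east-1", "ONLINE", 420),
        ("sfu-b", "us-east-1", "ONLINE", 75),
        ("sfu-c", "us-east-1", "DRAINING", 3),   # draining nodes must be skipped
        ("sfu-d", "eu-west-1", "ONLINE", 10),
    ],
)


def pick_server(region: str):
    """Return the least-loaded ONLINE server in the region, or None."""
    row = conn.execute(
        """
        SELECT server_id FROM media_servers
        WHERE datacenter_region = ? AND server_status = 'ONLINE'
        ORDER BY active_connections_count ASC
        LIMIT 1
        """,
        (region,),
    ).fetchone()
    return row[0] if row else None


print(pick_server("us-east-1"))  # sfu-b: fewest connections among ONLINE nodes
```

Note that the filter columns come first in the index and the sort column last, which is what lets the database walk the index in order rather than sorting the result.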

Scaling Challenges and Capacity Estimation

Designing real-time media forwarding for 100 concurrent streams in a 1,000-user meeting requires evaluating downstream bandwidth limits, upstream bandwidth limits, CPU scheduling, and packet-routing limits. The estimates below use a smaller 100-user meeting as the unit of calculation.

1. Bandwidth Calculation for a 100-User Meeting

Assumptions:
- Total meeting participants = $100$
- Active video publishers (with cameras enabled) = $20$ users
- Audio-only participants = $80$ users
- Video streams are broadcast using three Simulcast layers:
  - High: 720p at 30fps (Bitrate = 1.5 Mbps)
  - Medium: 360p at 24fps (Bitrate = 500 Kbps)
  - Low: 180p at 15fps (Bitrate = 150 Kbps)
- Audio bitrate (Opus codec) = 50 Kbps

Upstream Bandwidth (Per Publishing Client): Each publishing client uploads all three quality layers simultaneously so that the SFU can choose the destination resolution dynamically:

$$\text{Upstream Bandwidth} = 1.5\text{ Mbps} + 0.5\text{ Mbps} + 0.15\text{ Mbps} + 0.05\text{ Mbps (Audio)} = 2.2\text{ Mbps}$$

Downstream Bandwidth (Per Client): A client interface cannot display 20 videos at high resolution simultaneously due to screen space and client CPU limits. The application displays a grid showing 4 dominant speakers in medium resolution and 8 other speakers in low resolution, while the remaining 8 streams are hidden or muted.

$$\text{Video Download} = (4 \times 500\text{ Kbps}) + (8 \times 150\text{ Kbps}) = 2{,}000\text{ Kbps} + 1{,}200\text{ Kbps} = 3.2\text{ Mbps}$$

To save downstream audio bandwidth, the SFU mixes the audio of the top 3 loudest speakers and forwards a single combined audio stream. Mixing requires the server to decode and re-encode audio, which is far cheaper than transcoding video.

$$\text{Audio Download} = 1 \times 50\text{ Kbps} = 50\text{ Kbps}$$

$$\text{Total Downstream Bandwidth} = 3.2\text{ Mbps} + 0.05\text{ Mbps} = 3.25\text{ Mbps}$$

2. Media Server Network Throughput (For a single 100-user meeting)

Calculations:

$$\text{Total Input Rate} = (20\text{ publishers} \times 2.2\text{ Mbps}) + (80\text{ audio-only} \times 0.05\text{ Mbps}) = 44\text{ Mbps} + 4\text{ Mbps} = 48\text{ Mbps}$$

$$\text{Total Output Rate} = 100\text{ clients} \times 3.25\text{ Mbps} = 325\text{ Mbps}$$

For 100 concurrent meetings on a single SFU server, the egress network card must support at least $32.5$ Gbps of throughput.
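
The bandwidth arithmetic in this section can be reproduced with a few lines of Python, which makes it easy to rerun with different assumptions. All inputs below are the figures stated above; only the variable names are new.

```python
# All bitrates in kbps, taken from the stated assumptions.
HIGH, MED, LOW, AUDIO = 1500, 500, 150, 50
PUBLISHERS, AUDIO_ONLY = 20, 80
MEETINGS_PER_SERVER = 100

# Each publisher uploads all three simulcast layers plus audio.
upstream_per_publisher = HIGH + MED + LOW + AUDIO

# Each client renders 4 medium tiles and 8 low tiles, plus one mixed audio stream.
video_download = 4 * MED + 8 * LOW
downstream_per_client = video_download + 1 * AUDIO

# SFU totals for one 100-user meeting.
sfu_ingress = PUBLISHERS * upstream_per_publisher + AUDIO_ONLY * AUDIO
sfu_egress = (PUBLISHERS + AUDIO_ONLY) * downstream_per_client

# Egress for a server hosting many such meetings, converted from kbps to Gbps.
server_egress_gbps = sfu_egress * MEETINGS_PER_SERVER / 1_000_000

print(f"upstream per publisher: {upstream_per_publisher / 1000} Mbps")
print(f"downstream per client:  {downstream_per_client / 1000} Mbps")
print(f"SFU ingress / egress:   {sfu_ingress / 1000} / {sfu_egress / 1000} Mbps")
print(f"server egress:          {server_egress_gbps} Gbps")
```

The output side dominates by almost 7 to 1, which is why SFU capacity planning is usually driven by egress rather than ingress.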

3. Server Packet Routing Capacity

Calculations: Standard Ethernet frames have a Maximum Transmission Unit (MTU) of 1,500 bytes. After the IP, UDP, and SRTP headers are subtracted, roughly 1,400 bytes remain for media payload in a full-size packet.

$$\text{Throughput in Bytes} = 325\text{ Mbps} = 40.625\text{ MB/second}$$

$$\text{Packets per Second} = \frac{40.625\text{ MB/s}}{1400\text{ bytes/packet}} \approx 29{,}017\text{ packets/second}$$

For a server hosting 50 concurrent meetings, the CPU must process:

$$\text{Packets to Route} = 29{,}017 \times 50 \approx 1{,}450{,}850\text{ packets/second}$$

This is a lower bound: it counts egress only and assumes every packet is full-size. Audio packets are far smaller, so the real packet rate is higher.

Standard Linux kernel network stacks incur per-packet system call and context-switching overhead at this scale. To keep performance stable, media servers can use a kernel-bypass framework like DPDK (Data Plane Development Kit), which processes packets directly in user space, or an eBPF (Extended Berkeley Packet Filter) program attached at the XDP hook, which handles packets inside the kernel before they reach the full network stack.
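
The packet-rate estimate can be checked the same way. The sketch below recomputes the figures above and adds one comparison for audio, assuming 20 ms Opus frames (an assumption for illustration; the frame duration is not stated in this design). The `packets_per_second` helper is hypothetical.

```python
def packets_per_second(bitrate_bps: float, payload_bytes: int) -> float:
    """Packets needed per second to carry a bitrate at a given payload size."""
    return bitrate_bps / 8 / payload_bytes


EGRESS_BPS = 325_000_000   # 325 Mbps egress for one 100-user meeting
PAYLOAD_BYTES = 1400       # full-size media payload per packet
MEETINGS = 50

pps_per_meeting = packets_per_second(EGRESS_BPS, PAYLOAD_BYTES)
pps_per_server = pps_per_meeting * MEETINGS

print(f"per meeting: {pps_per_meeting:,.0f} packets/s")
print(f"per server:  {pps_per_server:,.0f} packets/s")

# Audio comparison, assuming 20 ms frames: a 50 kbps stream produces
# 50 packets per second of about 125 payload bytes each.
audio_payload_bytes = 50_000 * 0.020 / 8
audio_pps = packets_per_second(50_000, int(audio_payload_bytes))
print(f"one audio stream: {audio_pps:.0f} packets/s at {audio_payload_bytes:.0f} bytes")
```

Without rounding the per-meeting figure first, the per-server total comes out at about 1,450,893 packets per second, within rounding of the estimate above. The audio line shows why the full-size-packet assumption understates the real packet count.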

Failure Scenarios and Resilience

Real-time video applications must handle variable network connections. Packets will drop, IP addresses will change, and hardware will fail.

1. High Packet Loss Recovery (Audio & Video degradation)

A user's network connection drops packets (e.g., 25% packet loss on a mobile network).

The Threat: Video frames freeze or become blocky, and audio cuts out, interrupting the conversation.
Resilience Design:
- We use Forward Error Correction (FEC). The Opus audio encoder can embed a low-bitrate redundant copy of the previous frame in each audio packet. If packet $N$ is lost, the receiver can reconstruct a lower-quality version of its audio from the redundant data carried in packet $N+1$.
- We use Negative Acknowledgment (NACK). For video streams, if a client detects a missing sequence number, it sends a NACK request via RTCP (Real-time Transport Control Protocol) to the SFU. The SFU retrieves the packet from its short-term memory buffer (e.g., 200ms cache) and retransmits it.
- To prevent delay cascades, NACK is disabled if the round-trip time (RTT) is greater than 150ms, because a retransmitted packet would arrive too late to be played. In high-latency scenarios, the client instead sends a PLI (Picture Loss Indication) to request a new keyframe and refresh the stream.
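
The NACK path described above needs two pieces on the SFU: a short-lived packet cache and a rule for when retransmission is still worthwhile. The following is a minimal sketch under the thresholds stated in this section (150ms RTT limit, 200ms cache); the class and function names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import OrderedDict

NACK_RTT_LIMIT_MS = 150   # above this RTT, a retransmission arrives too late
CACHE_WINDOW_MS = 200     # how long sent packets are kept for retransmission


def choose_recovery(rtt_ms: float) -> str:
    """Pick retransmission (NACK) on short paths, keyframe request (PLI) otherwise."""
    return "NACK" if rtt_ms <= NACK_RTT_LIMIT_MS else "PLI"


class RetransmissionCache:
    """Keeps recently forwarded packets, keyed by RTP sequence number."""

    def __init__(self, window_ms: int = CACHE_WINDOW_MS):
        self.window_ms = window_ms
        self._packets = OrderedDict()  # seq -> (sent_at_ms, payload), oldest first

    def store(self, seq: int, payload: bytes, now_ms: int) -> None:
        self._packets[seq] = (now_ms, payload)
        self._evict(now_ms)

    def lookup(self, seq: int, now_ms: int):
        """Return the cached payload for a NACKed packet, or None if it aged out."""
        self._evict(now_ms)
        entry = self._packets.get(seq)
        return entry[1] if entry else None

    def _evict(self, now_ms: int) -> None:
        # Drop entries from the front until the oldest is inside the window.
        while self._packets:
            _, (sent_at, _) = next(iter(self._packets.items()))
            if now_ms - sent_at <= self.window_ms:
                break
            self._packets.popitem(last=False)


cache = RetransmissionCache()
cache.store(1001, b"frame-part-a", now_ms=0)
cache.store(1002, b"frame-part-b", now_ms=100)
print(choose_recovery(80), cache.lookup(1001, now_ms=150))
print(choose_recovery(300), cache.lookup(1001, now_ms=250))
```

A production cache would also handle 16-bit sequence number wraparound and keep one cache per forwarded stream, both of which are omitted here.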

2. Media Server Node Crash Recovery

A media server hosting 50 active meetings crashes due to hardware failure.

The Threat: Media streams freeze for all participants in those meetings, and the call is disconnected.
Resilience Design:
- The Signaling Service monitors media servers using heartbeats. If a server fails to send a heartbeat within 2 seconds, the service marks it as dead.
- The Signaling Service identifies the affected meetings, queries the Media Coordinator for healthy servers, and sends a MEDIA_RECONNECT command to all affected clients over their existing WebSocket connection.
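- The clients perform an ICE Restart, negotiating a new WebRTC connection with the replacement server (including a fresh DTLS handshake, since the new server does not hold the old session keys). Once the failure has been detected, the transition is designed to complete in less than 2 seconds, resuming the call without terminating the user session.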
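
The detection and reassignment steps above can be sketched as follows. This is an illustrative outline, assuming a 2-second heartbeat timeout as stated; the `HeartbeatMonitor` class, the `plan_failover` function, and the round-robin placement policy are assumptions for the example, not the coordinator's actual algorithm.

```python
HEARTBEAT_TIMEOUT_S = 2.0


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each media server."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self._last_seen = {}  # server_id -> timestamp of last heartbeat

    def heartbeat(self, server_id: str, now: float) -> None:
        self._last_seen[server_id] = now

    def dead_servers(self, now: float) -> list:
        """Servers whose last heartbeat is older than the timeout."""
        return sorted(
            sid for sid, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


def plan_failover(dead: list, meetings_by_server: dict, healthy: list) -> dict:
    """Map each meeting on a dead server to a healthy replacement (round-robin).

    The result is what the signaling service would turn into MEDIA_RECONNECT
    commands, one per affected meeting.
    """
    if not healthy:
        return {}
    plan = {}
    i = 0
    for server_id in dead:
        for meeting_id in meetings_by_server.get(server_id, []):
            plan[meeting_id] = healthy[i % len(healthy)]
            i += 1
    return plan


monitor = HeartbeatMonitor()
monitor.heartbeat("sfu-a", now=0.0)
monitor.heartbeat("sfu-b", now=0.0)
monitor.heartbeat("sfu-b", now=1.5)   # sfu-b keeps reporting, sfu-a goes silent

dead = monitor.dead_servers(now=2.5)
print(dead)
print(plan_failover(dead, {"sfu-a": ["mtg-1", "mtg-2"]}, ["sfu-b", "sfu-c"]))
```

Moving a whole meeting to one replacement server, rather than scattering its participants, keeps every stream of that meeting on a single SFU after the failover.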

3. Cellular-to-WiFi Network Handover

A user starts a call on their mobile phone using a cellular connection and walks inside their home, where the device switches to the local WiFi network.

The Threat: The client's public IP address changes, causing the existing WebRTC UDP association to drop.
Resilience Design:
- We use Connection Migration via ICE Restart. When the client detects a change in the active network interface, it does not close the peer connection.
- It sends a new Session Description Protocol (SDP) offer to the SFU over the signaling WebSocket, containing fresh ICE credentials and candidates for the new network interface.
- While the new path is being tested, the SFU continues to send media to the old cellular IP address. Once the WiFi socket confirms connectivity (using STUN binding requests), the SFU switches the media routing to the new WiFi path, ensuring a seamless transition.

4. Signaling Reconnection Storms

A core signaling container restarts, dropping 20,000 WebSocket connections simultaneously.

The Threat: All 20,000 clients attempt to reconnect at the exact same moment, creating a thundering herd problem that overloads the database and API gateways.
Resilience Design:
- Clients must implement Exponential Backoff with Jitter for all reconnection attempts. The reconnect delay is calculated as follows: $$\text{Reconnect Delay} = \min(\text{max\_delay}, \text{base\_delay} \times 2^{\text{attempt}}) \pm \text{random\_jitter}$$
- The API Gateway uses rate-limiting token buckets to drop excess requests, protecting the signaling instances from overload while connections recover.
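
Both defenses above fit in a few lines each. The sketch below implements the reconnect delay formula on the client side and a token bucket on the gateway side. The default values, and the choice to express jitter as a fraction of the capped delay, are assumptions for illustration.

```python
import random


def reconnect_delay(attempt: int, base_delay: float = 0.5, max_delay: float = 30.0,
                    jitter: float = 0.5, rng=random) -> float:
    """min(max_delay, base_delay * 2^attempt), plus or minus random jitter.

    jitter is a fraction of the capped delay, so 0.5 spreads clients
    uniformly between 50% and 150% of the nominal delay.
    """
    capped = min(max_delay, base_delay * (2 ** attempt))
    offset = rng.uniform(-jitter, jitter) * capped
    return max(0.0, capped + offset)


class TokenBucket:
    """Admits at most `capacity` requests in a burst, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, capacity: int):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller rejects the request, e.g. with HTTP 429


for attempt in range(5):
    print(attempt, round(reconnect_delay(attempt), 2))

bucket = TokenBucket(rate_per_s=2, capacity=2)
print([bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 0.5)])
```

The jitter matters more than the exponent here: without it, all 20,000 clients would still retry in synchronized waves, only spaced further apart.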

Architectural Trade-offs

Choosing the routing model and transport protocols involves balancing server costs against client device limitations.

Trade-off 1: SFU vs. MCU vs. Peer-to-Peer (Mesh)

The media routing model determines how streams are distributed among participants.

| Feature / Metric | Peer-to-Peer (Mesh) | MCU (Multipoint Control Unit) | SFU (Selective Forwarding Unit) |
| --- | --- | --- | --- |
| Client Upload Bandwidth | High. Scales linearly with participants: $O(N)$. | Low. Only uploads 1 stream to the server: $O(1)$. | Low. Uploads 1 stream (or 3 simulcast layers): $O(1)$. |
| Client Download Bandwidth | High. Downloads from every peer: $O(N)$. | Low. Only downloads 1 mixed stream: $O(1)$. | Medium. Downloads selected active streams: $O(K)$. |
| Server CPU Utilization | Zero. All traffic is client-to-client. | High. Server must decode, resize, mix, and re-encode all streams. | Low. Server acts as a packet router without transcoding media. |
| Scaling Limit (Participants) | Very low. Breaks down with more than 4 users. | Medium. Limited by server GPU/CPU capacity. | High. Scales to 1,000+ users per meeting. |
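
The scaling behavior in the first two rows of the table can be made concrete by counting streams per client. This is a simplified model, assuming every participant publishes video; $K = 12$ matches the 4 + 8 tile grid used in the capacity estimate, and the function name is hypothetical.

```python
def per_client_streams(topology: str, n: int, k: int = 12, simulcast_layers: int = 3):
    """Return (uploaded_streams, downloaded_streams) for one client in an n-person call."""
    if topology == "mesh":
        # One copy sent to, and one stream received from, every other peer.
        return n - 1, n - 1
    if topology == "mcu":
        # One stream up, one server-mixed composite down.
        return 1, 1
    if topology == "sfu":
        # All simulcast layers up once; at most k selected streams down.
        return simulcast_layers, min(k, n - 1)
    raise ValueError(f"unknown topology: {topology}")


for n in (4, 100, 1000):
    row = {t: per_client_streams(t, n) for t in ("mesh", "mcu", "sfu")}
    print(n, row)
```

At 1,000 participants the mesh client would have to send and receive 999 streams, while the SFU client's load is the same as in a 13-person call.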

Trade-off 2: UDP vs. TCP for Media Delivery

The transport layer protocol determines how data packets are sent over the network.

| Feature / Metric | UDP (User Datagram Protocol) | TCP (Transmission Control Protocol) |
| --- | --- | --- |
| Error Correction | None. Packets are sent without confirmation. | Automatic. Missing packets are requested and retransmitted. |
| Head-of-Line Blocking | None. A lost packet does not stop subsequent packets. | High. If packet 1 is lost, packets 2 and 3 wait in the OS buffer. |
| Latency Profile | Low. Minimal delay; dropped packets are skipped. | High. Retransmission loops introduce latency spikes. |
| Application Suitability | High. Recommended for real-time video/audio. | Low. Limited to non-real-time actions (signaling, text chat). |

Staff Engineer Perspective

Operating real-time media systems requires managing network sockets and hardware limitations.

Verbal Script

Interviewer: "How does Zoom support 1,000 participants in a single meeting without crashing client devices?"

Candidate: "Zoom scales large meetings by combining a Selective Forwarding Unit (SFU) architecture with client-side active speaker detection and Simulcast video streams.

If a client had to download 1,000 video feeds, its download bandwidth would require gigabits per second, and its CPU would overheat trying to decode the streams.

To prevent this, the SFU does not forward all 1,000 streams. Instead, it tracks participant activity:

It only forwards video for the active speaker and the 12 most recent speakers, downscaling or hiding the rest.
For the non-speaking participants, the SFU only forwards a low-bitrate audio stream.
We use Simulcast. Every publisher uploads three resolutions of their video (180p, 360p, and 720p).
If a participant is viewed in a small grid, the SFU sends them the 180p stream. If they are highlighted as the active speaker, the SFU upgrades their stream to 720p.
This reduces the client download requirement from hundreds of megabits to about 3.5 Mbps, allowing the call to run on standard home networks."

Interviewer: "What is your strategy for handling a media server crash during an active meeting?"

Candidate: "Our strategy is to keep the signaling plane and the media plane completely decoupled, allowing us to perform an ICE Restart without dropping the user's call.

The signaling connections are handled by a stateless WebSockets cluster. This cluster stores the active meeting routing states in a shared Redis database.

The media servers (SFUs) are registered with a service registry like Consul. If a media server crashes:

The signaling service detects the loss of heartbeat from that server.
The service queries Redis to identify all active meetings and participants connected to the failed node.
The signaling service then selects a healthy SFU in the same region.
It sends a renegotiation event to the affected clients over their existing WebSocket connection.
The client receives the message and renegotiates its WebRTC connection against the new server, performing an ICE restart to establish the new media path.
Because the signaling connection remains active, the user only experiences a brief pause in audio and video (about 2 seconds to detect the failure plus 1 to 2 seconds to reconnect), and the call continues."

Interviewer: "How does WebRTC negotiate network paths through firewalls and NATs, and how would you optimize this for low connection latency?"

Candidate: "WebRTC negotiates network paths using the ICE (Interactive Connectivity Establishment) framework, which combines STUN and TURN servers.

When a client initiates a connection:

It queries a STUN (Session Traversal Utilities for NAT) server to discover its own public IP address and port mapping.
If a direct path is blocked (for example, by a symmetric NAT or a firewall that drops inbound UDP), the client falls back to a TURN (Traversal Using Relays around NAT) server, which acts as a media relay.
The client collects these path options (candidates) and exchanges them with the peer via the signaling channel to find the most direct connection path.

To optimize connection latency and reduce the time it takes to join a meeting:

We use Trickle ICE. Instead of waiting to gather all possible candidates before sending the SDP offer, the client sends the initial offer immediately.
As it discovers candidates, it sends them one by one over the WebSocket signaling channel. The receiver can begin testing connection paths immediately, reducing connection setup times by up to 3 seconds.
We also deploy TURN servers close to users (at the network edge) using Anycast routing, keeping relay latency low when direct routing is blocked."
