Case Study: Designing WhatsApp (Real-time Messaging at Scale)
Real-time instant messaging platforms like WhatsApp, WeChat, or Discord represent one of the most demanding networking and data storage challenges in computer science. These systems must support hundreds of millions of concurrent socket connections, guarantee low-latency delivery, order messages strictly, and persist data at a rate of petabytes per month.
This case study details the system design of a global real-time chat platform capable of scaling to 500M Daily Active Users (DAU) sending 50 billion messages daily, maintaining sub-100ms end-to-end latency.
1. Requirements & Core Constraints
Functional Requirements
- 1-on-1 Real-time Chat: Low-latency, full-duplex messaging between pairs of clients.
- Delivery & Read Receipts: Real-time delivery ticks (sent, delivered, read status).
- Presence Management: Real-time "online/last seen" status indicator.
- Group Chat Operations: Message delivery to small/medium group sizes (up to 1,000 members) with strict order guarantees.
Non-Functional Requirements (SLAs)
- Sub-100ms Latency: Total transit time of a message between two online users must remain under 100ms (P99).
- High Concurrency: The system must sustain up to 50 million concurrent persistent WebSocket connections globally.
- Strict Ordering: Messages within a specific conversation must be rendered in chronological sequence.
- Resilient Durability: Zero message loss. Undelivered messages must reside in persistent storage until successfully flushed.
Back-of-the-Envelope Estimation (500M DAU)
- Total Daily Messages: $50,000,000,000$ messages/day
- Average Message QPS: $$\text{Average QPS} = \frac{50,000,000,000}{86400} \approx 578,700 \text{ QPS}$$
- Peak Messaging QPS (3x Factor): $$\text{Peak QPS} = 578,700 \times 3 \approx 1,736,100 \text{ QPS}$$
- WebSocket Server Footprint:
- A standard Linux virtual server optimized for high network I/O can support approximately $1,000,000$ concurrent TCP connections.
- Active Connections: 50,000,000 peak concurrent connections.
- Servers Required: $$\text{Servers} = \frac{50,000,000}{1,000,000} = 50 \text{ optimized socket nodes}$$
- To account for redundancy, connection spikes, and multi-region deployment, we target a 100-node Gateway Cluster.
- Storage Sizing (Cassandra):
- Average message payload: 128 bytes of text + metadata.
- Daily Storage volume: $$\text{Storage/Day} = 50 \times 10^9 \text{ messages} \times 128 \text{ bytes} \approx 6.4 \text{ TB/day}$$
- Yearly raw storage (no replication): $$\text{Storage/Year} = 6.4 \text{ TB} \times 365 \approx 2.33 \text{ PB}$$
- With a 3x replication factor, this demands approximately 7 PB/year of physical disk capacity.
2. API Design & Core Contracts
Since HTTP adds considerable handshake overhead, real-time messaging is handled primarily through long-lived full-duplex WebSocket connections. However, traditional HTTPS endpoints are retained for metadata, group management, and user configuration.
HTTP Schema: Create Group Conversation
- Endpoint:
POST /v1/conversations/groups - Request Payload:
{ "group_name": "Distributed Systems Mastery", "member_user_ids": [ "usr_9912a", "usr_8830b", "usr_7721c" ] } - Response Payload (HTTP 201 Created):
{ "conversation_id": "group_conv_8892f3a", "created_at": 1774312860, "status": "active" }
WebSocket Frame Schema: Send Real-Time Message
When connected to the Gateway, clients submit compact JSON frames:
- Target Action:
SEND_MESSAGE - Payload:
{ "message_id": "msg_f89b1c20-a192", "conversation_id": "conv_usr_12_usr_99", "sender_id": "usr_12", "receiver_id": "usr_99", "content_type": "text", "payload": "Hi, are you online?", "timestamp": 1774312860 }
3. High-Level Design (HLD)
The architecture splits real-time persistent connections (Stateful Gateway Zone) from microservices and backing storage engines (Stateless Processing Zone).
graph TD
ClientA[Client A Browser / App] -->|1. WebSocket Connection| WS1[WebSocket Gateway Node 1]
ClientB[Client B Browser / App] -->|8. Push Message| WS2[WebSocket Gateway Node 2]
subgraph Stateful Gateway Cluster
WS1
WS2
end
subgraph Connection Registry
WS1 -->|2. Register Session| RedisRegistry[(Redis Session Store)]
WS2 -->|Register Session| RedisRegistry
end
subgraph Core Processing Zone
WS1 -->|3. Route Message| MsgBroker[Kafka Messaging Cluster]
MsgBroker -->|4. Push Event| MsgProcessor[Message Processing Service]
MsgProcessor -->|5. Persist| Cassandra[(ScyllaDB / Cassandra Message Store)]
MsgProcessor -->|6. Query Session| RedisRegistry
MsgProcessor -->|7. Route to Node 2| WS2
end
subgraph Metadata & Presence Zone
ClientA -->|Heartbeat Pin| PresenceWS[Presence Service WebSockets]
PresenceWS -->|Update Status| RedisPresence[(Redis Presence Store)]
end
style Stateful Gateway Cluster fill:#1e40af,stroke:#fff,stroke-width:2px,color:#fff
style Connection Registry fill:#047857,stroke:#fff,stroke-width:2px,color:#fff
style Cassandra fill:#b91c1c,stroke:#fff,stroke-width:2px,color:#fff
End-to-End Delivery Flow:
- Client A Sends Message: Client A pushes a JSON frame containing
msg_f89over its persistent WebSocket connection toWebSocket Gateway Node 1. - Session Registration: The gateway queries the
Redis Session Storeto record the connection's node mapping. - Kafka Buffer Routing: Gateway Node 1 writes the raw event directly to a partitioned
Kafka Messaging Cluster, separating logs byconversation_idto guarantee ordering. - Message Processor Consumer: A consumer process pulls the message from Kafka, performs spam checks, and writes the persistent record directly into Cassandra.
- Session Routing: The processor queries Redis to find Client B's location. Redis shows Client B is connected to Gateway Node 2. The processor forwards the message to Gateway Node 2.
- Active Delivery: Gateway Node 2 pushes the message down the socket to Client B's active session.
4. Low-Level Design (LLD) & Data Models
1. Database Selection Rationale: Cassandra vs. Relational
- High Write Volume: Instant messaging systems generate massive time-series write operations.
- LSM-Tree Rationale: Cassandra and ScyllaDB utilize LSM-Tree (Log-Structured Merge-tree) storage engines. LSM-trees convert random disk writes into high-speed sequential appends in memory (SSTables), providing far higher write performance than B-Trees (standard SQL DBs).
- Wide-Row Time Series: Cassandra partitions data naturally by
conversation_id, ordering records by timestamp on disk, which matches chat history querying patterns.
2. Cassandra DDL Schema
-- Represents persistent chat logs
CREATE KEYSPACE chat_platform WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'eu-west': 3
};
USE chat_platform;
-- Persists message history per unique conversation
CREATE TABLE conversation_messages (
conversation_id text,
message_time timestamp,
message_id uuid,
sender_id text,
content text,
content_type text,
delivery_status text, -- 'SENT', 'DELIVERED', 'READ'
PRIMARY KEY (conversation_id, message_time)
) WITH CLUSTERING ORDER BY (message_time DESC);
-- Offline message delivery queue (stores messages until target comes online)
CREATE TABLE offline_delivery_queue (
receiver_id text,
message_id uuid,
conversation_id text,
sender_id text,
content text,
message_time timestamp,
PRIMARY KEY (receiver_id, message_id)
);
5. Scaling Challenges & System Bottlenecks
1. Scaling the Stateful Connection Tier
Traditional web servers spin up one OS thread per HTTP connection. However, spinning up 50 million OS threads is physically impossible on standard system kernels.
- The Solution: Build stateful socket gateways using non-blocking I/O multiplexing (epoll on Linux) using light-weight user-space green-threads (Erlang/BEAM processes, Go goroutines, or Node event loops).
- Optimizing memory allocations per socket connection is critical. Reducing buffer sizes from 64KB to 4KB allows a 16GB memory server to host up to 1 million sockets concurrently.
2. Presence Heartbeat Congestion (The $O(N^2)$ Problem)
If 1 billion users check-in every 10 seconds via pings, the database would face 100M QPS of presence updates, crashing the network.
- The Solution: Limit status updates using a subscriber architecture instead of a global broadcast:
- The system stores status values in localized Redis aggregates.
- When User A opens a chat pane with User B, the client establishes an active subscription channel.
- Presence is only pushed to active conversation subscribers rather than a user's entire contact list.
sequenceDiagram
autonumber
Client A->>Gateway Node 1: WebSocket Frame: "Hi B"
Gateway Node 1-->>Client A: HTTP ACK (Status: SENT)
Gateway Node 1->>Kafka Broker: Push Message Event
Kafka Broker->>Message Processor: Consume Message
Message Processor->>Cassandra DB: Write Message Record
Message Processor->>Redis Registry: Query Client B Location
alt Client B is Online
Redis Registry-->>Message Processor: Returns "Gateway Node 2"
Message Processor->>Gateway Node 2: Forward Message
Gateway Node 2->>Client B: WebSocket Push
Client B-->>Gateway Node 2: Delivery ACK
Gateway Node 2->>Message Processor: Update Status to DELIVERED
Message Processor->>Cassandra DB: UPDATE delivery_status
Message Processor->>Gateway Node 1: Forward ACK to Node 1
Gateway Node 1->>Client A: WebSocket Push (Double Tick)
else Client B is Offline
Redis Registry-->>Message Processor: Offline
Message Processor->>Cassandra DB: Write to offline_delivery_queue
end
6. Resilience & Failure Scenarios
1. Packet Losses & Reliable Delivery ACKs
In mobile environments, connections frequently drop. If the socket reports successful writes but the phone actually has no service, messages will vanish.
- Resilience Plan: Implement client-to-server and server-to-client Application-Level ACKs:
- The sender assigns a local sequence key to the message.
- The Gateway processes the message, registers it in the DB, and sends a
SENT_ACKback to the sender. The sender shows a single check-mark. - Once the receiver confirms receiving the frame, it sends a
DELIVERED_ACKback to the gateway. - The server updates the database status and pushes a double-tick update to the sender.
2. WebSocket Node Crashes & Mass Reconnections
If a WebSocket gateway server hosting 1 million active users crashes, all 1 million clients will immediately attempt to reconnect simultaneously, triggering a massive thundering herd event that can easily crash load balancers and database backends.
- Resilience Plan: Configure clients to use Exponential Backoff with Jitter when reconnecting. Use Anycast DNS routers to shift connection profiles dynamically across other surviving nodes.
7. Staff Engineer Perspective & Key Technical Trade-offs
1. WebSockets vs. HTTP/2 Server-Sent Events (SSE)
- WebSockets (Full-Duplex):
- Pros: Symmetric communication. Low overhead for bidirectional messaging.
- Cons: Stateful, bypasses traditional HTTP firewalls, requires custom load-balancing strategies.
- SSE (HTTP/2 Multiplexed):
- Pros: Uni-directional push from server to client. Uses HTTP/2 channels, resolving proxy blockages.
- Cons: Clients must execute separate REST posts to submit messages, causing high client connection overhead.
- Trade-off Decision: We select WebSockets (or custom TCP/XMPP) for mobile messaging apps. WebSockets avoid HTTP header overhead, saving precious battery and data consumption on mobile devices.
8. Candidate Verbal Mock Interview Script
Interviewer: "You chose Cassandra to store messages. How do you handle group messaging where one user sends a message to a group of 1,000 members? Do you write 1,000 separate records to Cassandra?"
Candidate: "This is a classic 'Write Fan-Out' vs 'Read Fan-Out' trade-off decision.
For groups up to 1,000 members, writing 1,000 separate records (write fan-out) is extremely expensive and wastes database disk space. Instead, we implement a Read Fan-Out with Group Inbox pattern.
Instead of duplicating the message for every group member on write, we write a single record into our group_messages table:
CREATE TABLE group_messages (
group_id text,
message_time timestamp,
message_id uuid,
sender_id text,
content text,
PRIMARY KEY (group_id, message_time)
);
When a user opens a group chat, they query the single partition for group_id directly, fetching the conversation thread efficiently in chronological order on disk.
To handle offline indicators (which messages are read by whom), we maintain a tiny metadata registry in a separate wide-column table:
CREATE TABLE group_read_status (
group_id text,
user_id text,
last_read_message_time timestamp,
PRIMARY KEY (group_id, user_id)
);
By storing a single message per group and tracking user offsets, we achieve near-instantaneous writes while keeping database read queries optimized under our < 100ms SLA."