Mental Model
Designing a real-time chat system (comparable to WhatsApp, Slack, or Telegram) at a scale of 500 Million daily active users requires shifting from traditional stateless request-response models to a highly coordinated, stateful, event-driven architecture. Because message delivery must execute in less than 100ms globally, the system maintains millions of open, bidirectional WebSocket connections on stateful Gateway nodes. The primary architectural challenge is routing: when User A sends a message to User B, the system must dynamically locate which physical Gateway node holds User B's active TCP socket, route the packet across a pub/sub backplane, log the transaction durably in wide-column storage, and update real-time presence heartbeats.
Requirements and System Goals
To engineer an instant messaging platform of global scale, we must define quantitative performance and capacity boundaries.
1. Functional Requirements
- Bidirectional 1:1 and Group Chats: Deliver real-time text messages, reactions, and attachments. Group chats must support up to 1,000 members.
- Message Delivery States: Expose instant delivery receipt handshakes (Sent $\rightarrow$ Delivered $\rightarrow$ Read).
- Real-time User Presence Status: Display online/offline states and "last seen" timestamps.
- Durable Message History: Provide paginated, search-optimized message history queries spanning up to 5 years.
2. Non-Functional Requirements & Scale Budgets
- Ultra-Low Latency Delivery SLA: Deliver messages end-to-end to online clients in less than 100ms.
- Massive Concurrent Scale: Support 500 Million Daily Active Users (DAU). At an average of 1 Billion messages/day, the system handles an average of 11,600 messages/sec, with peak burst throughput targeting 35,000 messages/sec.
- At-Least-Once Delivery Guarantee: Ensure that no message is silently dropped. Client-side deduplication keys are utilized to handle network reconnect retries.
- Time-Ordered Consistency: Guarantee strict sequence ordering for all messages sent within an individual conversation.
API Interfaces and Service Contracts
Real-time chat platforms rely on binary or JSON WebSocket frame protocols for chat events, alongside standard REST APIs for session setup.
1. Send Message WebSocket Frame Contract
When a client sends a message over an active WebSocket socket, it packs the payload with a client-generated idempotency key.
WebSocket Frame Payload:
{
"event": "send_message",
"client_msg_id": "msg_8a2b3c4d-9988-7766-5544-33221100aabb",
"conversation_id": "conv_9918237462",
"sender_id": "usr_alpha_123",
"content": "Let us finalize the system design plan.",
"content_type": "text",
"sent_timestamp_ms": 1780416350000
}
2. Delivery Receipt Handshake Frame
When the recipient's device receives the message, it returns an asynchronous acknowledgment frame back to the server, which is routed back to the original sender to trigger the "double checkmark" icon.
WebSocket Frame Payload:
{
"event": "message_receipt",
"message_id": "srv_msg_982736451",
"conversation_id": "conv_9918237462",
"recipient_id": "usr_beta_456",
"sender_id": "usr_alpha_123",
"status": "DELIVERED", // SENT, DELIVERED, READ
"receipt_timestamp_ms": 1780416350050
}
High-Level Design and Visualizations
Decoupling stateful TCP connection gateways from stateless routing logic, event queues, and analytical stores is critical to scaling chat platforms without connection drops.
1. High-Level Stateful Gateway Architecture
This diagram outlines how stateful WebSocket servers maintain open sockets, using a global Redis presence registry to route messages across different nodes.
graph TD
subgraph Client Tier
ClientA[User A Client] -->|1. Active WebSocket| GW1[WebSocket Gateway Node 1]
ClientB[User B Client] -->|1. Active WebSocket| GW2[WebSocket Gateway Node 2]
end
subgraph Presence Directory
GW1 -->|2. Register Connection| RedisPresence[(Redis Presence Cluster)]
GW2 -->|2. Register Connection| RedisPresence
end
subgraph Event Pipeline
GW1 -->|3. Publish Outgoing Event| Kafka[Kafka Chat Event Queue]
end
subgraph Service and Storage Tier
Kafka -->|4. Consume & Route| DeliverySvc[Message Delivery Service]
Kafka -->|4. Persist Message| StorageSvc[Cassandra Storage Service]
StorageSvc --> MessageStore[(Cassandra Cluster)]
DeliverySvc -->|5. Query Host Gateway| RedisPresence
DeliverySvc -->|6. Inter-Node gRPC Forward| GW2
end
2. End-to-End Chat Ingestion and Receipt Flow
The sequence diagram below displays how a message is routed between users on different gateways, including fallback channels when users are offline.
sequenceDiagram
autonumber
participant A as User A Client
participant GW1 as WebSocket Gateway 1
participant Kafka as Kafka Event Bus
participant Deliv as Message Delivery Service
participant Redis as Redis Registry
participant GW2 as WebSocket Gateway 2
participant B as User B Client
A->>GW1: Stream message frame (client_msg_id)
GW1->>GW1: Verify authentication & check duplicates
GW1->>Kafka: Publish event (conversationId partition key)
GW1-->>A: Send server ACK (message received at edge)
Deliv->>Kafka: Pull message event batch
Deliv->>Redis: Query: Where is User B connected?
rect rgb(240, 255, 240)
Note over Deliv, B: Case A: Recipient Online (On Gateway 2)
Redis-->>Deliv: Return Gateway_2 IP Address
Deliv->>GW2: Forward message via gRPC
GW2->>B: Stream WebSocket frame
B-->>GW2: Return ACK (DELIVERED)
GW2->>Kafka: Publish RECEIPT event
Deliv->>Kafka: Pull RECEIPT event
Deliv->>Redis: Query: Where is User A connected?
Redis-->>Deliv: Return Gateway_1 IP Address
Deliv->>GW1: Forward RECEIPT via gRPC
GW1->>A: Stream WebSocket receipt frame (Double Check!)
end
rect rgb(255, 240, 240)
Note over Deliv, B: Case B: Recipient Offline
Redis-->>Deliv: Return null (No active socket)
Deliv->>Deliv: Route to Offline Push Queue (FCM / APNs)
end
Low-Level Design and Schema Strategies
To support years of chat history with high write volumes, the system uses wide-column persistence, and fast heartbeats are stored in memory.
1. Message History Storage Schema (Cassandra)
We utilize Apache Cassandra for storing message history. Cassandra's wide-column, log-structured merge-tree architecture is optimized for high write volumes and time-ordered range queries.
- Partitioning Strategy: We partition by
(conversation_id, time_bucket). Thetime_bucketrepresents a monthly slice (e.g.2026-06) to ensure that a highly active conversation partition does not grow greater than 100MB, preventing read latency degradation. - Clustering Strategy: We cluster by
message_id(aTIMEUUID) in descending order to fetch the latest messages instantly.
-- Core Message Storage Table
CREATE TABLE conversation_messages (
conversation_id UUID,
time_bucket INT, -- e.g. 202606 for June 2026
message_id TIMEUUID, -- Unique, embeds timestamp, sortable
sender_id UUID,
content TEXT,
content_type VARCHAR(16), -- 'text', 'image', 'reaction'
media_url TEXT,
is_deleted BOOLEAN,
PRIMARY KEY ((conversation_id, time_bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 7
};
-- Indices for user conversation listings
CREATE TABLE user_conversations (
user_id UUID,
conversation_id UUID,
last_message_id TIMEUUID,
last_message_preview TEXT,
unread_count INT,
PRIMARY KEY (user_id, conversation_id)
);
2. Redis Presence and Session Mapping Structures
The stateless delivery engines lookup active sockets using Redis hashes and sorted sets for heartbeats.
-- Redis Key: presence:user_session:<user_id>
-- Redis Hash DataType
HSET presence:user_session:usr_beta_456 active_gateway "gw_node_2.corp.internal"
HSET presence:user_session:usr_beta_456 client_ip "192.168.1.50"
HSET presence:user_session:usr_beta_456 last_active 1780416350000
-- Redis TTL Key for Active Presence Status
-- Expired automatically if heartbeat is missed
SET presence:status:usr_beta_456 "online" EX 30
Scaling and Operational Challenges
1. Group Message Fan-Out Complexity Math
When a message is sent in a group chat, the system must deliver it to all members. This is the Fan-Out pipeline.
- The Mathematics of Fan-Out:
- Suppose User A sends a message to a group containing $N = 1,000$ members.
- The Delivery Service must fetch the members list, query Redis $N$ times to find each member's gateway host, and execute $N$ network forwards.
- Peak Burst Scale: If 1,000 groups each send 1 message per second: $$\text{Fan-Out writes} = 1,000 \times 1,000 = 1,000,000 \text{ operations/sec}$$
- A single un-isolated delivery queue will instantly saturate, delaying 1:1 messages by minutes.
- The Bulkhead Solution:
- We isolate delivery queues using Bulkheads. 1:1 messages are routed through high-priority delivery queues.
- Group messages are routed through low-priority, partitioned queues.
- Inter-Gateway Coalescing: Instead of making $N$ separate network calls to Gateway Node 2 for 50 members who happen to be connected to Gateway 2, the delivery service groups the recipients by gateway host. It executes exactly one gRPC payload containing the 50 recipient IDs to Gateway Node 2, reducing global network overhead from $O(N)$ to $O(G)$ where $G$ is the number of active gateway servers.
2. Redis Idempotency Deduplication
At-least-once message delivery means clients will retry message uploads under poor network conditions, causing duplicates.
- The Deduplication Strategy:
- Before saving to Cassandra, the API Gateway parses the client-generated
client_msg_id. - We store a deduplication key in Redis:
dedup:conv_id:client_msg_idwith a value ofsrv_msg_id. - The Lifecycle: We set a 24-hour TTL on this key.
- If a retried
client_msg_idarrives within 24 hours, the gateway blocks duplicate database writes and instantly returns the existingsrv_msg_idACK, ensuring a clean user interface.
- Before saving to Cassandra, the API Gateway parses the client-generated
Real-Time Messaging Systems Trade-offs
Choosing a messaging architecture requires selecting between delivery latency, database durability, and ordering guarantees.
| Architectural Dimension | Log-Based Event Brokers (Kafka) | Memory-Based Pub/Sub (Redis) | Push-Based Queues (RabbitMQ) |
|---|---|---|---|
| Message Ordering | Strict (Guaranteed within partition via key) | None (No native partition ordering) | Fair (Depends on consumer pool scale) |
| Durability Tier | Extreme (Written to persistent disk blocks) | Low (Primarily stored in volatile RAM) | Medium (Buffered on local host disk) |
| Throughput Capacity | Ultra-High (Peak throughput > 1M writes/sec) | High (RAM bound scale) | Medium (Slower due to routing logic) |
| System Complexity | High (Requires managing partitions and offset state) | Very Low (Simple HSET and publish/subscribe) | Medium (Requires AMQP topology setups) |
| Best Use Case | Core persistent message stream pipelines. | Volatile, high-frequency presence heartbeat broadcasts. | Enterprise business task routing, single-consumer queues. |
Failure Modes and Fault Tolerance Strategies
1. WebSocket Gateway Server Crashes
If a stateful Gateway Node holding 50,000 active connections crashes due to hardware failure, 50,000 TCP sockets are dropped instantly.
- The Fault Tolerance Blueprint:
- The clients detect connection loss (via TCP keepalives or missed application-level pings).
- Clients enter an Exponential Backoff Reconnect Loop with random mathematical jitter to prevent a thundering herd stampede from crashing the remaining healthy gateway nodes.
- Upon reconnecting to a new gateway node (e.g. Gateway Node 5), the client sends its last received
message_id. - Gateway 5 registers the new connection mapping in Redis, fetches missed offline messages from Cassandra, and streams them to the client, guaranteeing zero lost messages.
2. Network Partition Client-Reconnect Stampedes
When a localized cellular carrier experiences an outage and recovers, millions of clients attempt to reconnect and authenticate simultaneously, saturating the auth microservice's CPU.
- The Mitigation Strategy:
- The Gateway Nodes execute Pre-Auth Handshake Tokens.
- When a client first logs in, it receives a short-lived token.
- During reconnects, the client passes this token directly to the WebSocket gateway handshake.
- The gateway validates the token locally in memory using cryptographic signatures (JWT verification) without hitting the database or authorization service, shielding the core system from CPU starvation.
Staff Engineer Perspective
Production Readiness Checklist
Before launching your real-time chat application to millions of active users, verify:
- Kernel Tunings Applied: Confirm file descriptor limits (
ulimit -n) are set to at least 1,000,000 on all gateway nodes. - TimeWindowCompaction Active: Verify Cassandra tables use
TimeWindowCompactionStrategyto optimize time-based compactions. - Presence Heartbeat TTLs: Configure client heartbeats to send every 20 seconds, with Redis status TTL set to 30 seconds.
- Bulkhead Ingest Partitioning: Ensure group message fan-out pipelines utilize isolated thread pools from 1:1 message routes.
Read Next
- Saga Orchestration: Managing Distributed Transactions
- System Design: Designing a Database Proxy for Sharding (Vitess Style)
- High Availability: Building a Five Nines Infrastructure
Verbal Script
Interviewer: "How would you design a highly scalable real-time chat application like WhatsApp or Slack handling 1 Billion messages per day, and how do you handle message routing and delivery guarantees?"
Candidate: "To design a real-time chat system capable of handling 500 Million daily active users and 1 Billion messages per day, I would build an architecture optimized for stateful connections and event-driven decoupled pipelines. The core constraint is persistent connectivity: we deploy a stateful WebSocket Gateway cluster where each node maintains up to 50,000 open TCP sockets.
When User A sends a message to User B over their WebSocket socket, the gateway intercepts the frame. The gateway immediately publishes the message event to a Kafka cluster, partitioned by conversation_id. Partitioning by conversation_id is crucial because it guarantees that all messages sent within the same conversation are ordered sequentially on the same broker.
The message is consumed by two independent services in parallel. First, a Storage Service writes the message durably to Apache Cassandra. We partition the Cassandra table by (conversation_id, time_bucket) where the time_bucket represents a monthly slice. This capping strategy prevents any single partition from growing greater than 100MB, maintaining fast read performance.
Second, the Message Delivery Service handles routing. It queries a global Redis Presence Cluster to locate where User B is connected.
If User B is online on a different server, the delivery service forwards the message to that specific gateway node via gRPC. The gateway streams the frame over User B's open socket. When User B's client receives the message, it returns an asynchronous acknowledgment frame, which is routed back to User A, updating their UI with a 'delivered' checkmark.
If User B is offline, the service routes the message to an offline push queue, triggering an FCM or APNs push notification.
To scale group chats up to 1,000 members without bottlenecking 1:1 messages, I would implement two strategies. First, we use Bulkhead isolation, routing group pipelines through dedicated lower-priority queues. Second, we implement Inter-Gateway Coalescing: when delivering a group message, we group the recipients by their active gateway hosts. Instead of making $N$ individual network calls, the delivery service executes exactly one bulk gRPC payload to each target gateway node, reducing inter-node network traffic from $O(N)$ down to $O(G)$ where $G$ is the number of active gateway servers.
Finally, we guarantee at-least-once delivery and handle client reconnect retries by implementing a Redis-based idempotency layer. The gateway caches client-generated message IDs with a 24-hour TTL, blocking duplicate writes at the edge and ensuring absolute data consistency."