When a software application transitions from hundreds of local beta users to millions of global production users, the underlying infrastructure must adapt to handle the growing transaction load. This ability to handle growth by adding computing resources is known as Scalability.
In system design, engineers scale applications using two primary methodologies: Vertical Scaling (Scaling Up) or Horizontal Scaling (Scaling Out). Deciding between these models requires evaluating resource limitations, state persistence, load balancing strategies, and operational budgets.
System Requirements
To design a scalable web architecture, we define the parameters that govern the transition from vertical to horizontal configurations.
Functional Requirements
- Dynamic Session Persistence: The system must persist user sessions and application state as the server pool scales, guaranteeing that successive requests are processed correctly regardless of which server handles them.
- Load Distribution: Incoming connection traffic must be distributed evenly across available compute instances.
- Elastic Instance Management: The compute pool must support dynamically registering and deregistering server instances as traffic scales.
Non-Functional Requirements
- High Availability (No SPOF): The system must eliminate Single Points of Failure (SPOF) at the application, database, and load balancer levels.
- Sub-10ms Session Retrieval: Session lookup overhead from the shared state cache must be less than 10 milliseconds to prevent latency degradation.
- Connection Capacity Sizing: The web tier must support up to 10,000,000 concurrent active connections.
API Design and Interface Contracts
Enforcing stateless horizontal communication requires clear API contracts at the load balancer, application, and session layers.
1. User Session Verification Endpoint (HTTP POST /v1/sessions/verify)
Used by internal microservices to validate user session tokens against the shared Redis session cache.
{
  "sessionToken": "sess_tok_908124_xyz",
  "clientIp": "192.168.1.50",
  "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
  "validationParameters": {
    "requireMfa": false,
    "allowRegionShift": true
  }
}
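As a rough illustration of what a handler behind this endpoint might do, the sketch below validates a request body against a session store. It is a minimal sketch under stated assumptions: a plain dictionary stands in for the Redis cache, and the `SessionRecord` fields, the `client_region` argument (which a real service would presumably derive from `clientIp`), and the response shape are all hypothetical, since the article does not define them.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRecord:
    # Hypothetical stored session shape; the article does not define it.
    user_id: str
    expires_at: float      # Unix timestamp in seconds
    mfa_completed: bool
    region: str


def verify_session(request: dict, store: dict, client_region: str,
                   now: Optional[float] = None) -> dict:
    """Validate a /v1/sessions/verify request body against a session store."""
    now = time.time() if now is None else now
    record = store.get(request["sessionToken"])
    if record is None:
        return {"valid": False, "reason": "UNKNOWN_TOKEN"}
    if record.expires_at <= now:
        return {"valid": False, "reason": "EXPIRED"}

    params = request.get("validationParameters", {})
    if params.get("requireMfa", False) and not record.mfa_completed:
        return {"valid": False, "reason": "MFA_REQUIRED"}
    if not params.get("allowRegionShift", True) and record.region != client_region:
        return {"valid": False, "reason": "REGION_SHIFT"}
    return {"valid": True, "userId": record.user_id}
```

Because the function reads everything it needs from the shared store, any node in the pool can answer the request, which is the property the stateless design depends on.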
2. Load Balancer Health Check Endpoint (HTTP GET /healthz)
Used by Layer-7 load balancers and auto-scaling groups to verify if a stateless server instance is healthy and ready to accept traffic.
{
  "instanceId": "web_node_us_east_002",
  "status": "UP",
  "systemMetrics": {
    "cpuUtilizationPercent": 42.50,
    "freeMemoryBytes": 8589934592,
    "activeConnections": 1240
  },
  "dependencyStatus": {
    "databaseConnection": "CONNECTED",
    "redisSessionCache": "CONNECTED"
  }
}
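One way to assemble this payload is sketched below. The field names come from the example above; the `"DOWN"` status value and the use of HTTP 503 for an unhealthy node are assumptions, since the article only shows the healthy case.

```python
def build_health_response(instance_id: str, cpu_percent: float,
                          free_memory_bytes: int, active_connections: int,
                          dependencies: dict) -> tuple:
    """Return (http_status, body) for GET /healthz.

    The node reports UP only when every dependency is CONNECTED.
    "DOWN" and the 503 status code are assumed conventions.
    """
    all_connected = all(state == "CONNECTED" for state in dependencies.values())
    body = {
        "instanceId": instance_id,
        "status": "UP" if all_connected else "DOWN",
        "systemMetrics": {
            "cpuUtilizationPercent": cpu_percent,
            "freeMemoryBytes": free_memory_bytes,
            "activeConnections": active_connections,
        },
        "dependencyStatus": dict(dependencies),
    }
    return (200 if all_connected else 503), body
```

Returning a non-2xx status for an unhealthy node lets a load balancer act on the status code alone, without parsing the JSON body.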
3. Server Registry Heartbeat Schema (gRPC Protocol)
Used by application instances to report their operational state to the server registry service during autoscaling.
syntax = "proto3";

package codesprintpro.scalability.registry.v1;

service ServerRegistryService {
  rpc ReportActive (ActiveRequest) returns (ActiveResponse);
  rpc ReportDecommission (DecommissionRequest) returns (DecommissionResponse);
}

message ActiveRequest {
  string server_id = 1;
  string ip_address = 2;
  int32 port = 3;
  int64 bootstrap_timestamp_ms = 4;
}

message ActiveResponse {
  bool is_registered = 1;
  int64 heartbeat_interval_seconds = 2;
}

message DecommissionRequest {
  string server_id = 1;
  // Numeric duration in seconds (a string type would not match the field's meaning).
  int32 graceful_drain_timeout_seconds = 2;
}

message DecommissionResponse {
  bool is_decommission_started = 1;
}
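The behavior behind these two RPCs could look roughly like the in-memory sketch below. It is not a gRPC server: the class, the `ACTIVE`/`DRAINING` status values, and the 5-second heartbeat interval are assumptions chosen for illustration, and only the request and response field names are taken from the schema.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_SECONDS = 5  # assumed value, not specified by the schema


@dataclass
class ServerRegistry:
    """In-memory stand-in for ServerRegistryService."""
    nodes: dict = field(default_factory=dict)

    def report_active(self, server_id: str, ip_address: str, port: int,
                      bootstrap_timestamp_ms: int) -> dict:
        # Register (or refresh) the node and tell it how often to report.
        self.nodes[server_id] = {
            "ip_address": ip_address,
            "port": port,
            "bootstrap_timestamp_ms": bootstrap_timestamp_ms,
            "status": "ACTIVE",
        }
        return {"is_registered": True,
                "heartbeat_interval_seconds": HEARTBEAT_INTERVAL_SECONDS}

    def report_decommission(self, server_id: str,
                            graceful_drain_timeout_seconds: int) -> dict:
        node = self.nodes.get(server_id)
        if node is None:
            return {"is_decommission_started": False}
        # Draining nodes finish in-flight requests but receive no new ones.
        node["status"] = "DRAINING"
        node["drain_timeout_seconds"] = graceful_drain_timeout_seconds
        return {"is_decommission_started": True}
```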
High-Level Architecture
This section contrasts two designs: a stateful monolithic structure that bottlenecks performance, and a stateless horizontal structure that resolves the limitation.
1. Stateful Monolithic Server Bottlenecks
In a stateful monolith, client session states are stored locally in the server's JVM memory. This binds users to a single physical server, creating a Single Point of Failure (SPOF) and preventing horizontal scaling.
graph TD
Client1[User 1] -->|Session Pinned| Monolith[Stateful Monolith Server]
Client2[User 2] -->|Session Pinned| Monolith
subgraph Monolith Memory
Monolith --> LocalSession[JVM Memory: Session 1 & Session 2]
end
Monolith --> DB[(Single RDBMS)]
2. Stateless Horizontally Scaled Architecture with Centralized Session Cache
In a stateless design, the application servers do not store session data locally. The Load Balancer routes traffic dynamically across a pool of stateless web nodes. User session states are offloaded to a high-performance, centralized Redis Session Cluster.
graph TD
ClientD1[User 1] --> GeoDNS[DNS Geo-Router]
ClientD2[User 2] --> GeoDNS
GeoDNS --> LB[Layer-7 Load Balancer]
LB -->|Route Connection| Web1[Stateless Web Node 1]
LB -->|Route Connection| Web2[Stateless Web Node 2]
LB -->|Route Connection| Web3[Stateless Web Node 3]
Web1 & Web2 & Web3 -->|Fetch Session Token| Redis[(Central Redis Session Cluster)]
Web1 & Web2 & Web3 -->|Write Database| DBShard[(Sharded DB Cluster)]
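The key property of this diagram, that any web node can serve any user because session state lives outside the node, can be sketched in a few lines. This is a toy model under assumptions: a dictionary stands in for the Redis cluster, round-robin is one possible routing policy (the article does not name one), and the class names are hypothetical.

```python
import itertools


class StatelessWebNode:
    """A web node that keeps no session data of its own."""

    def __init__(self, name: str, session_store: dict):
        self.name = name
        self.session_store = session_store  # shared store, stand-in for Redis

    def handle(self, session_token: str) -> str:
        session = self.session_store.get(session_token)
        if session is None:
            return f"{self.name}: 401 unknown session"
        # State changes are written back to the shared store, not the node.
        session["request_count"] = session.get("request_count", 0) + 1
        return (f"{self.name}: 200 user={session['user_id']} "
                f"requests={session['request_count']}")


class RoundRobinBalancer:
    """Distributes requests across nodes in a fixed rotation."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def route(self, session_token: str) -> str:
        return next(self._cycle).handle(session_token)


shared_store = {"sess_tok_908124_xyz": {"user_id": "u1"}}
balancer = RoundRobinBalancer(
    [StatelessWebNode(f"web-{i}", shared_store) for i in (1, 2, 3)]
)
for _ in range(3):
    print(balancer.route("sess_tok_908124_xyz"))
```

Each of the three requests lands on a different node, yet the request counter keeps increasing because every node reads and writes the same shared session.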
Low-Level Design and Schema
To catalogue active server nodes and maintain stateless user sessions across the cluster, we declare tables in PostgreSQL.
-- Registry of all active web nodes in the scaling group
CREATE TABLE server_nodes (
    server_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hostname VARCHAR(128) NOT NULL UNIQUE,
    ip_address VARCHAR(45) NOT NULL,
    active_connections_count INT NOT NULL DEFAULT 0,
    node_status VARCHAR(32) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE, DRAINING, DECOMMISSIONED
    last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_active_nodes ON server_nodes(node_status, last_heartbeat_at);

-- Distributed user sessions lookup catalog
CREATE TABLE user_sessions (
    session_id VARCHAR(256) PRIMARY KEY,
    user_id UUID NOT NULL,
    session_data JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_sessions_expiry ON user_sessions(expires_at);
Schema Rationale & Index Optimization:
- Composite Index (idx_active_nodes): Allows load balancer registries to run quick index scans that identify healthy servers (e.g. WHERE node_status = 'ACTIVE' AND last_heartbeat_at > NOW() - INTERVAL '10 seconds').
- JSONB Serialization (session_data JSONB type): Allows storing dynamic session variables (such as items in a shopping cart or MFA tokens) in a binary format, so new session fields can be added without migration scripts.
Scaling Challenges and Capacity Estimation
Moving from vertical scaling limits to horizontal architectures requires calculating connection limits and memory capacities.
1. Vertical Scaling Hardware Ceilings (The Vertical Wall)
Upgrading a single database server with more resources eventually hits the physical limits of one motherboard.
- Assumptions:
  - High-end server motherboard RAM capacity = $2$ TB
  - Memory consumed per database connection = $10$ MB
  - Max network bandwidth supported by standard NIC = $10$ Gbps
- Calculations:
$$\text{Maximum Theoretical Connections} = \frac{2\text{ TB}}{10\text{ MB}} = \frac{2{,}000{,}000\text{ MB}}{10\text{ MB}} = 200{,}000\text{ connections}$$
In practice, CPU context-switching overhead tends to degrade performance long before this memory limit, typically once active threads exceed roughly 10,000. Additionally, if the database server has a 10 Gbps network card and the average query payload is 50 KB, the card saturates at:
$$\text{Bandwidth Saturation} = \frac{10\text{ Gbps}}{8} = 1.25\text{ GB/second} = 1{,}250{,}000\text{ KB/second}$$
$$\text{Max Queries/second} = \frac{1{,}250{,}000\text{ KB/s}}{50\text{ KB}} = 25{,}000\text{ queries/second}$$
Once traffic exceeds 25,000 queries/second, the network card of a vertically scaled database is saturated: packets are dropped and the application sees widespread timeouts. To scale beyond this point, the system must distribute load across multiple database servers through horizontal sharding.
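The arithmetic for the vertical ceiling can be reproduced with a short script. It uses only the assumptions listed in this section and decimal units (1 TB = 1,000,000 MB, 1 Gbps = 1,000,000 kilobits per second); the variable names are illustrative.

```python
# Assumptions taken from the section above (decimal units).
RAM_TOTAL_MB = 2_000_000          # 2 TB of server RAM
RAM_PER_CONNECTION_MB = 10        # memory held by each database connection
NIC_GBPS = 10                     # network card bandwidth
AVG_QUERY_PAYLOAD_KB = 50         # average payload per query

# Memory-bound ceiling on concurrent connections.
max_connections = RAM_TOTAL_MB // RAM_PER_CONNECTION_MB

# 10 Gbps = 10,000,000 kilobits/s; divide by 8 to get kilobytes/s.
nic_kb_per_second = NIC_GBPS * 1_000_000 // 8

# Bandwidth-bound ceiling on query throughput.
max_queries_per_second = nic_kb_per_second // AVG_QUERY_PAYLOAD_KB

print(f"Max theoretical connections: {max_connections:,}")
print(f"NIC throughput: {nic_kb_per_second:,} KB/s")
print(f"Max queries per second: {max_queries_per_second:,}")
```

Changing `AVG_QUERY_PAYLOAD_KB` shows how sensitive the bandwidth ceiling is to payload size: halving the payload doubles the queries per second the same card can carry.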
2. Redis Session Cache Memory Capacity
When offloading session states from JVM memory to a centralized Redis cluster, we must calculate the required memory capacity.
- Assumptions:
  - Total concurrent active user sessions = $10{,}000{,}000$ sessions
  - Average memory footprint of a single user session string = $1.5$ KB
  - Redis clustering memory overhead factor = $1.2$ (20% overhead for routing tables and indices)
- Calculations:
$$\text{Raw Session RAM} = 10{,}000{,}000 \times 1.5\text{ KB} = 15{,}000{,}000\text{ KB} \approx 14.3\text{ GB}$$
$$\text{Total Memory Required} = 14.3\text{ GB} \times 1.2 \approx 17.16\text{ GB of RAM}$$
A single Redis instance can easily store 17.16 GB of data, so capacity alone does not force a cluster. To ensure high availability and prevent a single point of failure (SPOF), we still split this dataset across a Redis cluster with 3 shards, each containing 1 master and 1 replica node, so each node holds approximately 5.7 GB of session data.
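The same sizing can be expressed as a small function so the inputs are easy to vary. This is a sketch using the assumptions above; the function name is hypothetical, and it converts KB to GB with binary units (1 GB = 1024 × 1024 KB), which is what yields the 14.3 GB figure.

```python
def redis_sizing(sessions: int, session_kb: float,
                 overhead_factor: float, shards: int) -> tuple:
    """Return (raw_gb, total_gb, per_shard_gb) for a session cache."""
    raw_gb = sessions * session_kb / (1024 * 1024)   # KB -> GB
    total_gb = raw_gb * overhead_factor              # add cluster overhead
    per_shard_gb = total_gb / shards                 # each master's share
    return raw_gb, total_gb, per_shard_gb


raw_gb, total_gb, per_shard_gb = redis_sizing(
    sessions=10_000_000, session_kb=1.5, overhead_factor=1.2, shards=3
)
print(f"Raw session RAM:  {raw_gb:.1f} GB")
print(f"With overhead:    {total_gb:.1f} GB")
print(f"Per shard master: {per_shard_gb:.1f} GB")
```

Without intermediate rounding the total comes out near 17.2 GB rather than 17.16 GB; the small difference arises because the hand calculation rounds the raw figure to 14.3 GB before applying the overhead factor. Each replica holds a copy of its master's share, so the cluster as a whole provisions roughly twice the total.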
Failure Scenarios and Resilience
Stateless horizontally scaled systems introduce unique failure modes at the network and routing layers.
1. Load Balancer Health Check Flapping
- The Threat: An application server experiences transient Garbage Collection (GC) pauses. The load balancer health check times out, marks the node as dead, and routes traffic away. As soon as the GC pause ends, the server appears healthy again, causing connection flapping that disrupts active user requests.
- Resilience Design:
- Configure Conservative Health Checks on the load balancer.
- Require a node to fail 3 consecutive checks (e.g. 5 seconds apart) before marking it offline, and require 3 consecutive successful checks before routing traffic back to the node.
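The consecutive-check rule above amounts to a small state machine with hysteresis. The sketch below is one possible way to express it; the class name and parameter names are hypothetical and do not correspond to any particular load balancer's configuration.

```python
class HealthTracker:
    """Tracks one node's health using consecutive-result thresholds."""

    def __init__(self, fail_threshold: int = 3, rise_threshold: int = 3):
        self.fail_threshold = fail_threshold  # failures needed to go offline
        self.rise_threshold = rise_threshold  # successes needed to come back
        self.healthy = True
        self._streak = 0  # consecutive results contradicting current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the node's state."""
        if check_passed == self.healthy:
            # Result agrees with the current state: any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.rise_threshold if check_passed else self.fail_threshold
        if self._streak >= threshold:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
# A single failed check (for example, one GC pause) does not change the state.
print(tracker.record(False))  # True: still considered healthy
print(tracker.record(True))   # True: streak reset
```

With checks 5 seconds apart and a threshold of 3, a node is removed only after roughly 15 seconds of sustained failure, which trades slower failure detection for stability.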
2. Autoscaling Thundering Herd Storm
- The Threat: Traffic spikes rapidly (e.g. during a flash sale). The autoscaler detects CPU load and spins up 20 new application server instances simultaneously. When they boot, all 20 nodes establish connection pools to the database at the same time, exhausting database connection descriptors and causing a database crash.
- Resilience Design:
- Configure Autoscaling Cooldown Periods and limit maximum step scaling boundaries (e.g., add at most 5 nodes at a time).
- Implement Lazy Connection Initialization in the application connection pool (e.g. HikariCP), preventing servers from opening all connections during the bootstrap phase.
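Both mitigations can be sketched briefly. The code below is illustrative only: `next_scale_step` and `LazyConnectionPool` are hypothetical names, the pool is not thread-safe, and it does not reflect HikariCP's actual API or configuration properties.

```python
def next_scale_step(current_nodes: int, desired_nodes: int,
                    max_step: int = 5) -> int:
    """Cap how many nodes a single scaling action may add."""
    return min(desired_nodes, current_nodes + max_step)


class LazyConnectionPool:
    """Opens connections on first use instead of at startup."""

    def __init__(self, connect, max_size: int):
        self._connect = connect    # factory that opens one connection
        self._max_size = max_size
        self._idle = []            # released connections ready for reuse
        self._opened = 0

    @property
    def opened(self) -> int:
        return self._opened

    def acquire(self):
        if self._idle:
            return self._idle.pop()
        if self._opened >= self._max_size:
            raise RuntimeError("connection pool exhausted")
        self._opened += 1
        return self._connect()

    def release(self, connection) -> None:
        self._idle.append(connection)
```

A freshly booted node using this pool holds zero database connections until it serves its first request, so 20 nodes starting together do not open 20 full pools at once. Combined with a step cap of 5, a jump from 10 to 30 desired nodes is reached in four scaling actions rather than one.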
Architectural Trade-offs
Selecting between vertical scaling and horizontal scaling requires evaluating engineering complexity against operational costs.
Trade-off 1: Vertical Scaling vs. Horizontal Scaling
| Feature / Metric | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
|---|---|---|
| Complexity | Low. Requires zero modifications to application code or database topology. | High. Requires load balancing, stateless session caching, and sharded databases. |
| High Availability | Low. The server remains a Single Point of Failure (SPOF). | High. Automatically routes traffic around failed nodes. |
| Upgrade Downtime | High. Upgrading RAM or CPU typically forces a server reboot. | Near zero. Supports rolling upgrades that replace nodes a few at a time. |
| Resource Ceiling | Low. Bound by hardware limits of a single motherboard. | High. Capacity grows by adding commodity nodes, limited in practice by shared dependencies such as the database. |
Trade-off 2: Layer-4 TCP Load Balancing vs. Layer-7 HTTP Load Balancing
| Metric | Layer-4 TCP Load Balancing | Layer-7 HTTP Load Balancing |
|---|---|---|
| Routing Intelligence | Low. Routes connections based on the 4-tuple of source and destination IP address and port. | High. Inspects HTTP headers, cookies, query parameters, and URL paths. |
| CPU Overhead | Low. Does not decrypt TLS or inspect application data. | High. Must decrypt TLS and parse HTTP request content. |
| Sticky Session Support | Limited. Cannot parse session cookies; affinity is restricted to connection attributes such as source IP. | High. Can parse cookie headers to enforce sticky routing. |
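
The difference in routing inputs can be made concrete with a small sketch. It assumes hash-based backend selection for both layers, a hypothetical cookie named `session`, and a fixed backend list; real load balancers offer many other policies.

```python
import hashlib
from http.cookies import SimpleCookie

BACKENDS = ["web-1", "web-2", "web-3"]  # hypothetical backend pool


def _bucket(key: str, size: int) -> int:
    """Map a string key to a stable bucket index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def route_layer4(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    # Layer 4 sees only addresses and ports, never the HTTP payload.
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}"
    return BACKENDS[_bucket(key, len(BACKENDS))]


def route_layer7(headers: dict) -> str:
    # Layer 7 can read the parsed HTTP request, including cookies.
    cookie = SimpleCookie(headers.get("Cookie", ""))
    if "session" in cookie:
        return BACKENDS[_bucket(cookie["session"].value, len(BACKENDS))]
    return BACKENDS[_bucket(headers.get("Host", ""), len(BACKENDS))]
```

A client that reconnects from a new source port may be hashed to a different backend at Layer 4, whereas the Layer-7 router keeps sending the same session cookie to the same backend regardless of the connection it arrives on.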
Staff Engineer Perspective
Operating horizontally scaled systems requires understanding caching limits and database boundaries.
Verbal Script
Interviewer: "What is the difference between vertical and horizontal scaling, and when should you choose one over the other?"
Candidate: "Vertical scaling, or scaling up, involves adding more resources—like CPU, memory, or disk space—to a single server.
Horizontal scaling, or scaling out, involves adding more independent commodity servers to the pool behind a load balancer.
Vertical scaling is simpler because it requires no changes to the application code; you are running the same software on larger hardware.
However, vertical scaling is limited by physical hardware limits and forms a single point of failure (SPOF).
Horizontal scaling has a far higher ceiling and provides high availability through redundancy, but it introduces architectural complexity, requiring stateless application servers and load balancing layers.
We choose vertical scaling for lightweight services, development environments, or when development velocity is critical and traffic is highly predictable.
We transition to horizontal scaling when transaction volumes exceed the capacity of a single physical server, or when the business demands high availability and zero downtime during upgrades."
Interviewer: "Why must application servers be stateless to scale horizontally, and how do you handle user sessions in a stateless architecture?"
Candidate: "Application servers must be stateless to ensure that any server in the horizontal pool can handle any request from any user.
If a server stores session data locally in its JVM memory, the load balancer is forced to use sticky sessions, routing that specific user to the same server for every request.
This creates hotspots, complicates auto-scaling (we cannot easily terminate an instance if it holds active sessions), and results in session loss if a server crashes.
To handle user sessions in a stateless architecture, we offload session state to a centralized, high-performance data store, typically a Redis cluster.
When a user logs in, the authentication service generates a unique session token, stores the session data in Redis keyed by that token, and returns the token to the client.
For subsequent requests, the client passes this token in the HTTP Authorization header.
The web server that receives the request retrieves the session data from Redis using the token.
This lookup is fast, typically taking less than 2 milliseconds when the cache sits in the same data center, and allows the application tier to remain stateless and scale horizontally."
Interviewer: "How do you prevent a database from becoming a bottleneck when you scale the application tier horizontally?"
Candidate: "Scaling the application tier horizontally does not scale the database tier; instead, it increases connection pressure and query loads on the database.
To prevent the database from becoming a bottleneck, I implement three strategies: Caching, Read-Replicas, and Database Sharding.
First, I introduce a caching layer (like Redis) in front of the database to intercept read queries.
Frequently accessed, static, or slow-changing data is served from the cache, reducing database read load.
Second, I deploy read-replicas.
I configure the application context to route write operations (such as insert or update) to the primary database instance and route read operations to the replicas, scaling read throughput horizontally.
Third, for write-heavy workloads that exceed a single primary instance's disk I/O, I implement database sharding.
I partition the dataset horizontally across multiple database servers using consistent hashing.
For example, partitioning users based on user ID hashes spreads writes roughly evenly across the shards, preventing any single database server from becoming a bottleneck."
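The consistent-hashing partitioning mentioned in the last answer can be sketched as a hash ring with virtual nodes. This is a minimal illustration, not a production router: the class name, the shard names, and the choice of 100 virtual nodes per shard are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to shards so that adding a shard moves few keys."""

    def __init__(self, shards, virtual_nodes: int = 100):
        # Each shard is placed on the ring at many pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(virtual_nodes)
        )
        self._hashes = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, user_id: str) -> str:
        # Walk clockwise to the first ring point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_right(self._hashes, self._hash(user_id))
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("user-42"))
```

When a fourth shard joins the ring, the only keys that change owner are those that now fall on the new shard's points; with plain modulo hashing, most keys would move.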