
System Design: Building a Session Management Platform

Design a production session management platform with login sessions, refresh tokens, revocation, multi-device control, risk signals, expiry, and safe cache-backed validation.

Recommended Prerequisites
System Design Interview Framework

Sessions feel simple until they become a security boundary.

A user logs in. The system gives them a token or cookie. Requests work. End of story.

Then reality arrives:

  • One user is logged in on five devices.
  • An access token leaks.
  • Refresh tokens need rotation.
  • Logout should invalidate active sessions quickly.
  • Suspicious logins should trigger re-authentication.
  • Customer support needs to see which sessions are active.
  • A cache outage should not turn into a global authentication outage.

That is when "just use JWT" stops being an architecture and starts being a slogan.

This guide designs a production session management platform.


Requirements and System Goals

A session management platform tracks, validates, and terminates user authentication state across client endpoints.

Functional Requirements

  • Session Creation & Issuance: Create a secure session record upon login, issuing a short-lived access token and a long-lived refresh token.
  • Refresh Token Rotation: Rotate the refresh token on every use, and detect reuse of an already-rotated token to stop replay attacks.
  • Granular Revocation: Enable terminating a single session, all sessions for a user, or all sessions except the current active connection.
  • Multi-Device Tracking: Expose active sessions with browser, OS, IP, and location metadata for user security dashboards.
  • Idle & Absolute Timeouts: Enforce idle timeouts (e.g., expire after 30 minutes of inactivity) and absolute session lifetimes (e.g., force re-login after 14 days).
  • Risk-Based Step-Up: Detect anomalies (e.g., impossible travel or IP changes) and trigger step-up MFA validation.
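
The impossible-travel signal from the Risk-Based Step-Up requirement can be sketched as a speed check between two consecutive logins. This is a minimal illustration, not a production risk engine: the `LoginEvent` type, the 900 km/h speed limit, and the 50 km noise floor are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed thresholds for illustration; real values would come from risk tuning.
MAX_PLAUSIBLE_SPEED_KMH = 900.0   # roughly airliner cruising speed
MIN_DISTANCE_KM = 50.0            # ignore small jumps caused by geo-IP noise


@dataclass(frozen=True)
class LoginEvent:
    latitude: float
    longitude: float
    at: datetime


def haversine_km(a: LoginEvent, b: LoginEvent) -> float:
    """Great-circle distance between two login locations, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(a.latitude), math.radians(b.latitude)
    d_phi = phi2 - phi1
    d_lambda = math.radians(b.longitude - a.longitude)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def requires_step_up(previous: LoginEvent, current: LoginEvent) -> bool:
    """Return True when the implied travel speed is implausible."""
    distance_km = haversine_km(previous, current)
    if distance_km < MIN_DISTANCE_KM:
        return False
    elapsed_hours = (current.at - previous.at).total_seconds() / 3600
    if elapsed_hours <= 0:
        return True  # far apart at the same instant
    return distance_km / elapsed_hours > MAX_PLAUSIBLE_SPEED_KMH


# Usage: a login in London, then one in New York an hour later.
t0 = datetime(2026, 6, 6, 12, 0, tzinfo=timezone.utc)
london = LoginEvent(51.5074, -0.1278, t0)
new_york = LoginEvent(40.7128, -74.0060, t0 + timedelta(hours=1))
print(requires_step_up(london, new_york))  # True
```

A check like this would normally be one input among several; on its own it produces false positives for VPN and mobile-carrier traffic.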

Non-Functional Requirements

  • Sub-5ms Validation Latency: Validate access tokens in-memory at the API Gateway or local caches to minimize database lookups.
  • Strong Revocation Bounds: Guarantee that a revoked session is rejected globally within 10 seconds.
  • High Ingestion Scaling: Scale session write endpoints to handle millions of active users and high-concurrency request spikes.
  • Fault-Tolerant Cache Degradation: If the session cache cluster goes offline, the validation path must gracefully fall back to metadata stores without crashing services.

API Interfaces and Service Contracts

We expose REST endpoints for authentication and session management, while internal gateways query session states using gRPC.

Create Session (User Login)

  • Endpoint: POST /v1/sessions/login
  • Request Payload:
{
  "email": "user@example.com",
  "password": "hashed_password_payload",
  "deviceId": "dev_mac_9918",
  "deviceName": "Chrome on macOS",
  "clientType": "web"
}
  • Response Payload (HTTP 201 Created):
{
  "sessionId": "sess_09ab-2281-4c12",
  "accessToken": "eyJhbGciOi...",
  "refreshToken": "rf_72bd-1102-998a",
  "expiresAt": "2026-06-20T14:30:00Z"
}

Refresh Session Tokens

  • Endpoint: POST /v1/sessions/refresh
  • Request Payload:
{
  "refreshToken": "rf_72bd-1102-998a"
}
  • Response Payload (HTTP 200 OK):
{
  "accessToken": "eyJhbGciOi...",
  "refreshToken": "rf_bc89-1122-887e",
  "expiresAt": "2026-06-20T14:35:00Z"
}

Revoke Single Session

  • Endpoint: POST /v1/sessions/{sessionId}/revoke (for example, POST /v1/sessions/sess_09ab-2281-4c12/revoke)
  • Response Payload (HTTP 200 OK):
{
  "sessionId": "sess_09ab-2281-4c12",
  "status": "REVOKED",
  "revokedAt": "2026-06-06T14:32:00Z"
}
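
The three revocation scopes from the functional requirements (one session, all sessions, all except the current one) could share a single code path. The sketch below assumes a hypothetical in-memory `SessionStore` standing in for the session table; the method names are illustrative and not part of the API above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Session:
    session_id: str
    user_id: str
    status: str = "ACTIVE"
    revoked_at: Optional[datetime] = None


class SessionStore:
    """In-memory stand-in for the session table (hypothetical helper)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def add(self, session: Session) -> None:
        self._sessions[session.session_id] = session

    def revoke_one(self, session_id: str) -> bool:
        """Revoke a single session; returns False if it was not active."""
        session = self._sessions.get(session_id)
        if session is None or session.status != "ACTIVE":
            return False
        session.status = "REVOKED"
        session.revoked_at = datetime.now(timezone.utc)
        return True

    def revoke_for_user(self, user_id: str,
                        keep_session_id: Optional[str] = None) -> List[str]:
        """Revoke all of a user's active sessions, optionally sparing one."""
        revoked = []
        for session in self._sessions.values():
            if session.user_id != user_id or session.session_id == keep_session_id:
                continue
            if self.revoke_one(session.session_id):
                revoked.append(session.session_id)
        return revoked
```

Passing the caller's own session as `keep_session_id` gives the "sign out everywhere else" behavior; omitting it revokes everything.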

Internal Session Validation gRPC Service Contract

API Gateways query the validation state of incoming requests via gRPC:

syntax = "proto3";

package codesprintpro.session.v1;

service SessionValidationService {
  rpc ValidateSession (ValidationRequest) returns (ValidationResponse);
  rpc InvalidateCache (InvalidateRequest) returns (InvalidateResponse);
}

message ValidationRequest {
  string session_id = 1;
  string tenant_id = 2;
  string user_id = 3;
}

message ValidationResponse {
  bool is_active = 1;
  string user_role = 2;
  int64 idle_expiration_time = 3;
}

message InvalidateRequest {
  string session_id = 1;
  string user_id = 2;
  string reason = 3;
}

message InvalidateResponse {
  bool success = 1;
}

High-Level Design and Visualizations

Our session management platform uses a fast-path cache validation strategy. The diagram below shows the request validation path, the cache check with its database fallback, and the revocation flow.

Active Validation Topology

flowchart TD
    Client[Client App] -->|1. Request with Bearer JWT| Gateway[API Gateway]
    Gateway -->|2. Validate JWT signature locally| JWTVerifier[In-Process JWT Verifier]
    
    subgraph FastPath [Low-Latency Cache Check]
        Gateway -->|3. Check Session Status| RedisCache[(Redis Session Cache Cluster)]
        RedisCache -->|4. Return Active status| Gateway
    end
    
    Gateway -->|5. Forward request| ProductAPI[Product Core Service]
    
    subgraph SlowPath [Metadata Database Fallback]
        RedisCache -.->|6. Cache Miss| SessionDB[(Metadata PostgreSQL)]
        SessionDB -.->|7. Populate Cache & Return| RedisCache
    end
    
    subgraph Revocation [Revocation Path]
        Admin[Support / User Settings] -->|8. Revoke Session request| AdminService[Session Admin Service]
        AdminService -->|9. Update Status to REVOKED| SessionDB
        AdminService -->|10. Purge session key| RedisCache
        AdminService -->|11. Broadcast revoke event| Gateway
    end
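
The fast path, the slow path, and the cache purge on revocation can be sketched as a read-through lookup. This is a simplified model under stated assumptions: plain dictionaries stand in for Redis and PostgreSQL, and the `SessionValidator` class is a hypothetical name used only here.

```python
from typing import Dict, Optional


class SessionValidator:
    """Read-through validation: cache first, database on a miss."""

    def __init__(self, cache: Dict[str, str], database: Dict[str, str]) -> None:
        self.cache = cache          # stand-in for the Redis session cache
        self.database = database    # stand-in for the PostgreSQL session table
        self.db_reads = 0           # counts slow-path lookups

    def is_active(self, session_id: str) -> bool:
        status: Optional[str] = self.cache.get(session_id)
        if status is None:
            # Steps 6-7: cache miss, read the source of truth and repopulate.
            self.db_reads += 1
            status = self.database.get(session_id)
            if status is None:
                return False  # unknown session
            self.cache[session_id] = status
        return status == "ACTIVE"

    def revoke(self, session_id: str) -> None:
        # Steps 9-10: update the database first, then purge the cache key.
        if session_id in self.database:
            self.database[session_id] = "REVOKED"
        self.cache.pop(session_id, None)
```

Updating the database before purging the cache matters: a reader that misses the cache after the purge then repopulates it with the revoked status rather than a stale active one.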

Refresh Token Family Rotation and Hijack Detection

We enforce refresh token rotation. If an attacker steals and reuses an old refresh token, the system detects the replay attempt, invalidates the entire token family, and terminates the user's session.

sequenceDiagram
    autonumber
    actor Client as Legitimate Client
    participant Vault as Session Service
    actor Attacker as Attacker (Stole RT_1)
    
    Note over Client, Vault: Normal token refresh operation
    Client->>Vault: 1. Send RT_1 to refresh
    Vault->>Vault: 2. Verify RT_1 is active current token hash. Rotate!
    Vault->>Vault: 3. Move RT_1 hash to 'previous', generate RT_2
    Vault-->>Client: 4. Return new Access Token + RT_2
    
    Note over Attacker, Vault: Attacker tries to use stolen RT_1
    Attacker->>Vault: 5. Send stolen RT_1 to refresh
    Vault->>Vault: 6. Hash token. Match found in 'previous' hashes!
    Vault->>Vault: 7. Replay Attack Detected! Mark family compromised.
    Vault->>Vault: 8. Revoke session and delete Redis cache entries
    Vault-->>Attacker: 9. Return HTTP 401 Unauthorized
    
    Note over Client, Vault: Legitimate client tries to use RT_2
    Client->>Vault: 10. Send RT_2 to refresh
    Vault-->>Client: 11. Return HTTP 401 Unauthorized (Session terminated, must re-login)
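
The rotation and replay-detection steps in the sequence above can be sketched as follows. This is a minimal in-memory model, assuming a single `TokenFamily` object per session with one current and one previous hash; the function names and the `rf_` token format are illustrative.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Optional, Tuple


class InvalidToken(Exception):
    """The token is unknown or its family has been shut down."""


class ReplayDetected(Exception):
    """An already-rotated token was presented again."""


def sha256_hex(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


@dataclass
class TokenFamily:
    current_token_hash: str
    previous_token_hash: Optional[str] = None
    rotation_counter: int = 0
    compromised: bool = False


def issue_family() -> Tuple[str, TokenFamily]:
    """Create the first refresh token; only its hash is stored."""
    token = "rf_" + secrets.token_urlsafe(32)
    return token, TokenFamily(current_token_hash=sha256_hex(token))


def rotate(family: TokenFamily, presented_token: str) -> str:
    """Exchange the current refresh token for a new one, or detect replay."""
    if family.compromised:
        raise InvalidToken("family compromised; re-login required")
    presented = sha256_hex(presented_token)
    if hmac.compare_digest(presented, family.current_token_hash):
        new_token = "rf_" + secrets.token_urlsafe(32)
        family.previous_token_hash = family.current_token_hash
        family.current_token_hash = sha256_hex(new_token)
        family.rotation_counter += 1
        return new_token
    if family.previous_token_hash and hmac.compare_digest(
            presented, family.previous_token_hash):
        family.compromised = True  # steps 7-8: shut the whole family down
        raise ReplayDetected("rotated token was reused")
    raise InvalidToken("unknown refresh token")
```

With a single previous hash, only the most recently rotated token is recognized as a replay; older tokens fall through to `InvalidToken`, which still rejects them but does not flag the family.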

Low-Level Design and Schema Strategies

We use PostgreSQL as the durable source of truth for session state. We maintain separate tables for sessions, refresh token families, and audit events.

PostgreSQL Table DDLs

-- Track core session state and client metadata
CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(128) NOT NULL,
    session_status VARCHAR(32) NOT NULL,         -- 'ACTIVE', 'REVOKED', 'EXPIRED'
    device_id VARCHAR(256),
    device_name VARCHAR(256),
    client_type VARCHAR(64) NOT NULL,            -- 'web', 'ios', 'android'
    ip_address INET,
    country VARCHAR(128),
    user_agent TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_seen_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMPTZ NOT NULL,             -- Absolute expiry timestamp
    idle_expires_at TIMESTAMPTZ NOT NULL,        -- Sliding window expiry timestamp
    revoked_at TIMESTAMPTZ,
    revoked_reason TEXT
);

-- Index for scanning active sessions of a user
CREATE INDEX idx_user_sessions_lookup 
ON user_sessions (tenant_id, user_id, session_status);

-- Manage refresh token rotation and compromise state
CREATE TABLE refresh_token_families (
    family_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    session_id UUID NOT NULL REFERENCES user_sessions(session_id) ON DELETE CASCADE,
    tenant_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(128) NOT NULL,
    current_token_hash CHAR(64) NOT NULL,        -- SHA-256 hash of active refresh token
    previous_token_hash CHAR(64),                 -- SHA-256 hash of previous token to detect replay
    rotation_counter BIGINT NOT NULL DEFAULT 0,
    compromised BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

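CREATE INDEX idx_token_families_lookup 
ON refresh_token_families (current_token_hash);

-- Replay detection looks up rotated tokens by their previous hash
CREATE INDEX idx_token_families_previous 
ON refresh_token_families (previous_token_hash);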

-- Record session-related audit events
CREATE TABLE session_audit_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(128) NOT NULL,
    session_id UUID,
    event_type VARCHAR(64) NOT NULL,              -- 'LOGIN', 'REFRESH', 'REVOKE', 'REPLAY_DETECTION'
    actor_type VARCHAR(64) NOT NULL,              -- 'USER', 'SYSTEM', 'SUPPORT'
    client_ip INET,
    metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_session_audits_user 
ON session_audit_events (tenant_id, user_id, created_at DESC);
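
The two expiry columns on user_sessions can be evaluated together at validation time. The sketch below mirrors only the relevant columns and assumes the 30-minute idle window and 14-day absolute lifetime used as examples in the requirements; `SessionRow`, `is_valid`, and `touch` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(minutes=30)      # example value from the requirements
ABSOLUTE_LIFETIME = timedelta(days=14)    # example value from the requirements


@dataclass
class SessionRow:
    session_status: str
    expires_at: datetime        # absolute expiry
    idle_expires_at: datetime   # sliding-window expiry


def new_session(now: datetime) -> SessionRow:
    return SessionRow("ACTIVE", now + ABSOLUTE_LIFETIME, now + IDLE_TIMEOUT)


def is_valid(row: SessionRow, now: datetime) -> bool:
    """A session must be active and inside both expiry windows."""
    return (row.session_status == "ACTIVE"
            and now < row.expires_at
            and now < row.idle_expires_at)


def touch(row: SessionRow, now: datetime) -> None:
    """Slide the idle window forward, but never past the absolute expiry."""
    row.idle_expires_at = min(now + IDLE_TIMEOUT, row.expires_at)


# Usage: activity at minute 20 keeps the session alive at minute 45.
start = datetime(2026, 6, 6, 12, 0, tzinfo=timezone.utc)
row = new_session(start)
touch(row, start + timedelta(minutes=20))
print(is_valid(row, start + timedelta(minutes=45)))  # True
```

Capping the idle expiry at `expires_at` means activity alone can never extend a session beyond its absolute lifetime.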

Scaling and Operational Challenges

To design a session management platform that scales to millions of users, we must evaluate write bandwidth and memory sizing.

Back-of-the-Envelope Capacity Estimations

Let us estimate capacity requirements for a platform with 50,000,000 active concurrent sessions.

  • Active Sessions count: 50,000,000 sessions.
  • Active Session Payload Size: Assume each session record in Redis is approximately 250 bytes.
  • Total Redis Memory Footprint: $$\text{Memory required} = 50{,}000{,}000 \times 250\text{ bytes} = 12{,}500{,}000{,}000\text{ bytes} = 12.5\text{ GB}$$ Assuming one replica per primary (about 25 GB in total) plus cluster overhead, a 32 GB Redis Cluster is sufficient and easily managed.
  • Idle Timeout Update Throughput: To update idle_expires_at sliding windows, the gateway writes updates to Redis.
    • Assume a peak traffic rate of 50,000 active requests/second.
    • Writing to Redis on every single request would saturate the cluster with write IOPS.
    • To reduce write IOPS, we apply a lazy write threshold: we only update idle_expires_at in Redis if more than 5 minutes (roughly one-sixth of the 30-minute idle window) has elapsed since the last update.
    • If each active session sends about 100 requests within 5 minutes, only 1 of those 100 requests triggers a write, so the Redis write rate drops from 50,000 writes/sec to: $$\text{Optimized write rate} = \frac{50{,}000}{100} = 500 \text{ writes/second}$$ That is a 99% reduction under this assumption. The saving shrinks for quieter sessions: a session that sends a request less often than once every 5 minutes still writes on every request.
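
The memory estimate and the lazy write threshold above can be checked with a few lines of arithmetic and a per-session simulation. This is a sketch under the same assumptions as the estimate (50,000,000 sessions, 250 bytes each, a 5-minute threshold); the function names are illustrative.

```python
from typing import Iterable, Optional

SESSIONS = 50_000_000
BYTES_PER_SESSION = 250
LAZY_WRITE_THRESHOLD_S = 300  # 5 minutes


def redis_payload_gb(sessions: int = SESSIONS,
                     bytes_per_session: int = BYTES_PER_SESSION) -> float:
    """Raw session payload in decimal gigabytes, before replicas and overhead."""
    return sessions * bytes_per_session / 1e9


def should_write(last_write_at: Optional[float], now: float) -> bool:
    """Write only if the last idle-expiry update is older than the threshold."""
    return last_write_at is None or now - last_write_at > LAZY_WRITE_THRESHOLD_S


def count_writes(request_times: Iterable[float]) -> int:
    """Count Redis writes for one session, given ascending timestamps in seconds."""
    writes = 0
    last_write_at: Optional[float] = None
    for now in request_times:
        if should_write(last_write_at, now):
            writes += 1
            last_write_at = now
    return writes


print(redis_payload_gb())                         # 12.5
print(count_writes(i * 3 for i in range(100)))    # 1
print(count_writes(i * 1000 for i in range(10)))  # 10
```

The last line shows the limit of the optimization: a session whose requests are more than 5 minutes apart writes on every request, so the saving depends on how bursty per-session traffic is.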

Trade-offs and Architectural Alternatives

Session Validation Models: Stateless JWT vs. Stateful Opaque Tokens

| Dimension | Stateless Signed JWTs | Stateful Opaque Tokens |
|---|---|---|
| Lookup Latency | Low; validated locally at the API Gateway using public keys. | High; requires a cache lookup on every request. |
| Revocation Speed | Subject to JWT expiration delay (e.g. valid for 15m after revocation). | Immediate (revoked in cache). |
| Token Size | Large; contains claims and signatures (bloats HTTP headers). | Small; contains only a random session ID string. |

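We choose a hybrid approach: we issue short-lived signed JWT access tokens (valid for 15 minutes) so the gateway can verify them locally without a database lookup, and the API Gateway also checks the session ID against a Redis blacklist of revoked sessions. This keeps the common path fast while bounding revocation delay by cache propagation time rather than by the 15-minute token lifetime.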
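
The hybrid check can be sketched as signature verification followed by a blacklist lookup. To stay within the standard library, this example signs with HS256 (a shared secret) rather than the public-key scheme mentioned in the table, and a Python set stands in for the Redis blacklist; the `sid` claim name is an assumption. A real gateway would use a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional, Set


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input: str, key: bytes) -> bytes:
    return hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, key: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_b64url(_signature(signing_input, key))}"


def authorize(token: str, key: bytes, revoked: Set[str],
              now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the token verifies and its session is not revoked."""
    now = time.time() if now is None else now
    try:
        header, payload, signature = token.split(".")
        if json.loads(_b64url_decode(header)).get("alg") != "HS256":
            return None
        expected = _signature(f"{header}.{payload}", key)
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            return None
        claims = json.loads(_b64url_decode(payload))
    except (ValueError, AttributeError):
        return None  # malformed token
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return None  # missing or expired
    if claims.get("sid") in revoked:
        return None  # session is on the blacklist
    return claims
```

The blacklist only needs to hold a session ID until the last access token issued for that session has expired, which keeps it small.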

Session Cache Topology: Database vs. In-Memory Shared Cache

  • Database-Only Storage (PostgreSQL):
    • Pros: Strong consistency, durable audit trails, simple architecture.
    • Cons: High read latency; scaling to 50k requests/sec requires expensive database scaling.
  • In-Memory Cache (Redis Cluster):
    • Pros: Sub-millisecond reads; handles high write throughput easily.
    • Cons: Data loss risk on node crashes; requires synchronization logic to keep the database and cache in sync.

Failure Modes and Fault Tolerance Strategies

Redis Cluster Outage

If the Redis cache cluster goes offline, the API Gateway cannot check session states, which can cause request failures.

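  • Mitigation: We enforce a graceful degradation path. If Redis is unreachable, the gateway falls back to local JWT signature verification and queries PostgreSQL replicas only for sensitive paths (e.g., payments). This preserves availability during cache outages, at the cost of weaker revocation on non-sensitive paths: a revoked session can keep working there until its access token expires (up to 15 minutes).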
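
The degradation rule can be sketched as a small decision function that runs after the JWT has already passed signature and expiry checks. The `CacheUnavailable` exception, the `/v1/payments` prefix, and the two lookup callables are assumptions made for this example.

```python
from typing import Callable

# Hypothetical list of path prefixes that must never skip the revocation check.
SENSITIVE_PREFIXES = ("/v1/payments",)


class CacheUnavailable(Exception):
    """Raised by the cache lookup when Redis cannot be reached."""


def session_allowed(session_id: str,
                    path: str,
                    cache_is_active: Callable[[str], bool],
                    replica_is_active: Callable[[str], bool]) -> bool:
    """Decide whether to admit a request whose JWT has already been verified."""
    try:
        return cache_is_active(session_id)
    except CacheUnavailable:
        if path.startswith(SENSITIVE_PREFIXES):
            # Sensitive paths pay for a database read instead of trusting the JWT.
            return replica_is_active(session_id)
        # Non-sensitive paths accept the verified JWT until it expires.
        return True
```

Keeping the sensitive-path list short is what protects the replicas: if every path fell back to PostgreSQL, a cache outage would turn into a database overload.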

Split-Brain Synchronization Drift

During network partitions, a session marked revoked in the database may fail to sync to a Redis replica node, leaving the session active in some regions.

  • Mitigation: We use a transactional outbox. When a session is revoked in PostgreSQL, we write a revocation event to an outbox table in the same transaction, so the event exists if and only if the revocation committed. An event processor reads the outbox and broadcasts the revocation via Kafka to all regional Redis clusters, retrying until delivery succeeds, which ensures eventual consistency.
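
The outbox pattern can be sketched with a single transaction that updates the session and records the event together. This example uses SQLite from the standard library as a stand-in for PostgreSQL and a plain callback in place of a Kafka producer; the `revocation_outbox` table and function names are assumptions.

```python
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE user_sessions (
            session_id TEXT PRIMARY KEY,
            session_status TEXT NOT NULL
        );
        CREATE TABLE revocation_outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def revoke_session(conn: sqlite3.Connection, session_id: str, reason: str) -> bool:
    """Revoke the session and enqueue its event in one transaction."""
    with conn:  # commits both statements together, or rolls both back
        cur = conn.execute(
            "UPDATE user_sessions SET session_status = 'REVOKED' "
            "WHERE session_id = ? AND session_status = 'ACTIVE'",
            (session_id,),
        )
        if cur.rowcount == 0:
            return False  # nothing to revoke, so no event is written
        conn.execute(
            "INSERT INTO revocation_outbox (payload) VALUES (?)",
            (json.dumps({"session_id": session_id, "reason": reason}),),
        )
    return True


def drain_outbox(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish pending events in order; a row is marked only after publish returns."""
    rows = conn.execute(
        "SELECT id, payload FROM revocation_outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE revocation_outbox SET published = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)
```

Because a row is marked published only after `publish` returns, a crash between the two steps causes a redelivery rather than a lost event, so consumers should treat revocation events as idempotent.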

Session Replay Hijack Attempt

An attacker steals a user's refresh token and attempts to refresh the session.

  • Mitigation: We implement Refresh Token Family Rotation. When the attacker presents the stolen token, the platform detects that the token has already been rotated (its hash matches a 'previous' hash). The platform immediately invalidates the entire token family, revokes the user's session, and requires re-authentication.

Staff Engineer Perspective


Verbal Script

Interviewer: "How would you design a session management platform to ensure that when a user logs out, they are instantly logged out from all devices globally?"

Candidate: "I would use a hybrid architecture combining short-lived JWT access tokens with a centralized Redis cache blacklist.

When a user logs out of all devices, the admin service marks every active session for that user as REVOKED in the database.

Next, it purges those session IDs from the Redis cluster and publishes a revocation event via Redis Pub/Sub to all regional API Gateways.

When a gateway receives the event, it adds the session IDs to its local in-memory blacklist.

When a request arrives, the gateway validates the JWT signature and checks its local blacklist. If the session ID is blacklisted, the request is rejected immediately. Revocation typically propagates globally in about a second, well inside our 10-second bound, without querying the database on every request."

Interviewer: "What happens if an attacker steals a user's refresh token and attempts to refresh the session?"

Candidate: "We use Refresh Token Family Rotation to detect and mitigate token theft.

Each session is mapped to a refresh token family. When a refresh token is used, we issue a new refresh token and keep the old one's hash as the family's 'previous' hash.

If the attacker presents the stolen refresh token, we look up its hash. If the hash matches the 'previous' hash, it indicates that the token has already been rotated.

This triggers a replay detection alarm. We immediately mark the token family as compromised, revoke the session, and delete the active session cache in Redis. When the legitimate user attempts to use the new token, they are rejected and forced to re-login, terminating the attacker's access."

