
System Design: Designing Stateless Authentication

A comprehensive guide on stateless authentication using JWT in microservices.


Key Takeaways

  • Signature: Always use asymmetric signing (RS256 or EdDSA). The Auth Server keeps the Private Key (to sign); Microservices keep the Public Key (to verify).
  • Short-lived tokens: Tokens should expire in 15-60 minutes to limit the blast radius if stolen.
  • Refresh Tokens: Use a longer-lived refresh token stored in an HTTP-only cookie to issue new access tokens.
Recommended Prerequisites
API Design: REST vs. GraphQL vs. gRPC


In high-throughput microservices architectures, traditional server-side session management (storing user sessions in memory or a shared relational database) becomes a massive bottleneck. Scaling to billions of requests requires decoupling services from a central state check. Stateless Authentication using JWT (JSON Web Tokens) solves this by moving session information directly to the client's token, allowing services to verify identity cryptographically and in isolation.

This case study designs a production-grade, stateless authentication platform at the scale of 1 Billion user accounts, detailing signature verification topologies, revocation architectures via Redis Bloom Filters, and key rotation strategies.


1. Requirements & Core Constraints

Functional Requirements

  • Issue Token Pair: Generate short-lived access tokens and long-lived refresh tokens upon successful credential validation.
  • Stateless Verification: Allow downstream microservices to verify access token signatures locally without querying a database or auth service.
  • Session Revocation: Enable instant session logout or revocation (e.g., if a device is stolen, or a user changes their password).
  • Graceful Key Rotation: Support automatic, zero-downtime rotation of cryptographic signing keys.
  • Multitenant Claims Integration: Support custom tenant metadata, roles, and authorization scopes inside the JWT payload.

Non-Functional Requirements

  • Scale: Design for 1 Billion registered user accounts, with up to 100 Million daily active users (DAUs).
  • Authentication QPS: Downstream services collectively evaluate up to 1,000,000 requests per second.
  • Latency: Signature checks must occur in process memory, introducing less than 1 millisecond of latency.
  • Revocation Check: The revocation check must execute in less than 2 milliseconds, maintaining a tiny memory footprint.
  • Security Strength: Use asymmetric signatures (e.g., RS256 or EdDSA with Ed25519) so that microservices, which hold only public keys, cannot mint fraudulent tokens.

Back-of-the-Envelope Capacity Estimation

1. Token Signature Verification Throughput

  • Verification QPS: 1,000,000 requests/sec.
  • Asymmetric validation (RS256, i.e. RSA with SHA-256) is CPU-heavy. As a conservative planning figure, assume a single CPU core verifies ~5,000 signatures/second; measure the real rate for your key size and library.
  • Downstream CPU Sizing:
    • 1,000,000 QPS / 5,000 validations/core/sec = 200 dedicated CPU cores globally.
    • Mitigated by using a faster asymmetric scheme at the gateway (e.g. Ed25519, assumed here to be up to 5x faster) or by caching verified tokens for short periods.
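
The core-count arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the `cache_hit_ratio` parameter are illustrative assumptions, and the 50% hit ratio is a made-up figure used only to show how caching verified tokens changes the estimate.

```python
import math

def cores_needed(qps: float, verifications_per_core: float,
                 cache_hit_ratio: float = 0.0) -> int:
    """Estimate CPU cores when only cache misses need a full signature check."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    effective_qps = qps * (1.0 - cache_hit_ratio)
    return math.ceil(effective_qps / verifications_per_core)

# Figures from the estimate above: 1M QPS, ~5,000 verifications per core.
print(cores_needed(1_000_000, 5_000))        # 200
# Hypothetical: half of all requests hit a short-lived verified-token cache.
print(cores_needed(1_000_000, 5_000, 0.5))   # 100
```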

2. Revocation Bloom Filter Sizing (Redis)

  • Active Revocations (Blacklist): Assume up to 10,000,000 revoked access tokens are tracked in the filter at once (a day's worth of logouts and password changes).
  • False Positive Probability: Set to 0.001 (0.1% chance).
  • Bloom Filter Memory Sizing:
    • The standard sizing formula m = -n·ln(p) / (ln 2)² gives about 14.4 bits per item, so 10 Million items at a 0.1% false-positive rate require ~144 Megabits, or roughly 18 Megabytes of RAM.
    • This is tiny and fits comfortably inside a single inexpensive Redis instance.
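
The sizing can be checked with the textbook Bloom filter formulas. This is a small sketch, not a statement about how a particular Redis module allocates memory; `bloom_size` is an illustrative helper name, and a real deployment may round sizes up.

```python
import math
from typing import Tuple

def bloom_size(n_items: int, fp_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter.

    bits       m = -n * ln(p) / (ln 2)^2
    hash_count k = (m / n) * ln 2
    """
    m_bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k

bits, hashes = bloom_size(10_000_000, 0.001)
# Roughly 14.4 bits per item: about 18 MB and 10 hash functions.
print(f"{bits / n:.1f} bits/item" if (n := 10_000_000) else "")
print(f"{bits / 8 / 1e6:.1f} MB, {hashes} hash functions")
```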

2. API Design & Core Contracts

Downstream gateways, Auth servers, and clients interact through these standardized HTTP/gRPC contracts.

API 1: User Log-In & Token Allocation

Validates credentials and generates access/refresh token pairs.

  • HTTP Method: POST
  • Path: /api/v1/auth/login
  • Headers:
    • Content-Type: application/json

Request Payload

{
  "email": "user@company.com",
  "password": "strong_hashed_password",
  "device_id": "dev_9812ab7"
}

Response Payload

{
  "access_token": "eyJhbGciOiJSUzI1NiIsImtpZCI6ImtleV8wMSJ9.eyJncm91cCI6IkFkbWluIiwic3ViIjoidXNyXzkwMTIiLCJleHAiOjE3Nzk0MjE2MDB9.sig",
  "token_type": "Bearer",
  "expires_in_seconds": 900,
  "refresh_token": "ref_8fa12bc912389a9b"
}

API 2: Revoke Session (Logout / Security Alert)

Revokes a session: invalidates the supplied refresh token and registers the unique identifier (jti) of the caller's access token in the active revocation blacklist.

  • HTTP Method: POST
  • Path: /api/v1/auth/revoke
  • Headers:
    • Content-Type: application/json
    • Authorization: Bearer eyJhbGciOi...

Request Payload

{
  "refresh_token": "ref_8fa12bc912389a9b"
}

Response Payload

{
  "status": "SUCCESS",
  "message": "Session and all child access tokens revoked successfully."
}

3. High-Level Design (HLD)

Our stateless authentication architecture splits the workload: the Auth Server issues tokens and handles key rotations, the API Gateway provides local signature checks, and Redis Cluster Bloom Filters manage active revocations.

Stateless Verification Topology

graph TD
    %% Ingress and Routing
    Client[Browser / Mobile Client] -->|HTTPS Request with JWT| Gateway[API Gateway Layer]

    %% Gateway Checks
    Gateway -->|Local Public Key Check| JWKS[In-Memory JWKS Key Cache]
    Gateway -->|Local Bloom Filter Query| Bloom[(Redis Cluster Bloom Filter)]
    
    %% Downstream routing
    Gateway -->|Forward Decoded Context Headers| MicroserviceA[User Microservice]
    Gateway -->|Forward Decoded Context Headers| MicroserviceB[Billing Microservice]

    %% Authenticate Path
    Client -->|Login Request| AuthServer[Central Auth Server]
    AuthServer -->|Issue JWT signed with Private Key| Client
    AuthServer -->|Publish Revocations| Bloom
    AuthServer -->|Expose Public Keys kid| JWKS

Authentication Lifecycle Sequence

sequenceDiagram
    autonumber
    actor Client as User Device
    participant Gateway as API Gateway
    participant Redis as Redis Bloom Filter
    participant JWKS as JWKS Public Keys
    participant Service as Billing Microservice

    Client->>Gateway: API Request (Authorization: Bearer JWT)
    Gateway->>JWKS: Fetch Cached Public Key (matched by kid)
    JWKS-->>Gateway: Return Public Key PEM
    Gateway->>Gateway: Verify cryptographically & check expiration
    Gateway->>Redis: Check if Token 'jti' in Bloom Filter
    alt Token Revoked / In Bloom Filter
        Redis-->>Gateway: True (Revoked)
        Gateway-->>Client: HTTP 401 Unauthorized
    else Token Active / Miss
        Redis-->>Gateway: False (Active)
        Gateway->>Gateway: Hydrate Header: X-User-Id: usr_9012
        Gateway->>Service: Forward request with headers
        Service-->>Client: HTTP 200 OK (Billing Data)
    end
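
The gateway steps in the sequence above can be sketched as a single function. This is an illustrative outline only: `gateway_check` is a hypothetical name, and the `verify` and `is_revoked` callables are placeholders for the signature validator and the Bloom filter lookup.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Claims = Dict[str, Any]

def gateway_check(
    token: str,
    verify: Callable[[str], Optional[Claims]],
    is_revoked: Callable[[str], bool],
) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, headers to forward downstream)."""
    # Signature and expiration check; None means the token is invalid.
    claims = verify(token)
    if claims is None:
        return 401, {}

    # Tokens without a subject or jti cannot be attributed or revoked.
    sub, jti = claims.get("sub"), claims.get("jti")
    if sub is None or jti is None:
        return 401, {}

    # Revocation lookup keyed by jti.
    if is_revoked(jti):
        return 401, {}

    # Hydrate identity headers for the downstream service.
    return 200, {"X-User-Id": str(sub)}
```

Passing the two checks in as callables keeps the sketch independent of any particular JWT library or Redis client.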

4. Low-Level Design (LLD) & Data Models

JWT Internal Standard Structure

A cryptographically secure JWT contains three segments separated by dots: Header, Payload, and Signature.

1. Header (Specifies Algorithm & Public Key ID)

{
  "alg": "RS256",
  "typ": "JWT",
  "kid": "pub_key_2026_v1"
}

2. Payload (Claims & Identity Context)

{
  "iss": "https://auth.codesprintpro.com",
  "sub": "usr_90128374",
  "jti": "jwt_b827ac8192a",
  "tenant_id": "t_912",
  "roles": ["BillingAdmin", "PremiumUser"],
  "iat": 1779421200,
  "exp": 1779422100
}
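
To make the segment layout concrete, the sketch below base64url-encodes a header and payload like the ones above and splits the result back apart. The helper names are illustrative, the signature is a literal placeholder, and nothing here verifies the token, so decoded claims must not be trusted on their own.

```python
import base64
import json
from typing import Any, Dict, Tuple

def b64url_encode(data: bytes) -> str:
    """Base64url without '=' padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(segment: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def split_unverified(token: str) -> Tuple[Dict[str, Any], Dict[str, Any], str]:
    """Decode header and payload WITHOUT checking the signature."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload, signature_b64

header = {"alg": "RS256", "typ": "JWT", "kid": "pub_key_2026_v1"}
payload = {"sub": "usr_90128374", "jti": "jwt_b827ac8192a", "exp": 1779422100}
token = ".".join([
    b64url_encode(json.dumps(header).encode("utf-8")),
    b64url_encode(json.dumps(payload).encode("utf-8")),
    "sig",  # placeholder; a real token carries the base64url signature here
])
print(split_unverified(token))
```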

Database Schema: DB Refresh Token Store (PostgreSQL)

While access tokens are stateless, refresh tokens must be recorded in database partitions to enforce rotation rules and block replay threats.

-- PostgreSQL Schema for Managing Refresh Sessions
CREATE TABLE refresh_tokens (
    token_hash VARCHAR(64) PRIMARY KEY,
    user_id VARCHAR(64) NOT NULL,
    device_id VARCHAR(64) NOT NULL,
    parent_token_hash VARCHAR(64),
    is_revoked BOOLEAN NOT NULL DEFAULT FALSE,
    expires_at TIMESTAMP WITH TIME ZONE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

-- Indexing for rapid session evaluations
CREATE INDEX idx_refresh_user_device ON refresh_tokens (user_id, device_id);
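
One way the rotation and replay rules could work against this table is sketched below. It is a simplification under stated assumptions: an in-memory dict stands in for PostgreSQL, expiry handling is omitted, a used token is marked with `is_revoked`, and the "session lineage" is taken to be every token for the same (user_id, device_id) pair.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RefreshRow:
    user_id: str
    device_id: str
    parent_token_hash: Optional[str]
    is_revoked: bool = False

class RefreshStore:
    """In-memory stand-in for the refresh_tokens table (illustrative only)."""

    def __init__(self) -> None:
        self.rows: Dict[str, RefreshRow] = {}

    @staticmethod
    def _hash(token: str) -> str:
        # Only the SHA-256 hex digest (64 chars) is stored, never the raw token.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id: str, device_id: str,
              parent_hash: Optional[str] = None) -> str:
        token = "ref_" + secrets.token_hex(16)
        self.rows[self._hash(token)] = RefreshRow(user_id, device_id, parent_hash)
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a refresh token for a new one; None means rejected."""
        token_hash = self._hash(token)
        row = self.rows.get(token_hash)
        if row is None:
            return None
        if row.is_revoked:
            # Replay of an already-used token: revoke the whole session.
            self._revoke_session(row.user_id, row.device_id)
            return None
        row.is_revoked = True  # single use
        return self.issue(row.user_id, row.device_id, parent_hash=token_hash)

    def _revoke_session(self, user_id: str, device_id: str) -> None:
        for row in self.rows.values():
            if row.user_id == user_id and row.device_id == device_id:
                row.is_revoked = True
```

In a real database the check-and-mark step in `rotate` would need to run inside one transaction so that two concurrent uses of the same token cannot both succeed.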

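Runnable Python Implementation: JWT Signature & Expiration Validator

The following program decodes and validates incoming stateless tokens in process memory. To stay within the standard library it signs with HMAC-SHA256 (HS256) as a stand-in; a production validator for this design would verify RS256 or EdDSA signatures against the cached public keys using a vetted JWT library.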

import hmac
import hashlib
import base64
import json
import time
from typing import Dict, Any, Optional

class ProcessMemoryJWTValidator:
    def __init__(self, signing_secret: str):
        self.signing_secret = signing_secret.encode('utf-8')

    def _base64_urldecode(self, val: str) -> bytes:
        """
        Standard base64url decoding with padding correction
        """
        rem = len(val) % 4
        if rem > 0:
            val += "=" * (4 - rem)
        return base64.urlsafe_b64decode(val.encode('utf-8'))

    def validate_and_decode(self, jwt_string: str) -> Optional[Dict[str, Any]]:
        """
        Validates token structure, signatures, and expiration in-memory.
        """
        parts = jwt_string.split('.')
        if len(parts) != 3:
            print("Validation Failed: Invalid JWT segment structure")
            return None

        header_b64, payload_b64, signature_b64 = parts[0], parts[1], parts[2]

        try:
            # 1. Decode Header and Payload
            header_bytes = self._base64_urldecode(header_b64)
            payload_bytes = self._base64_urldecode(payload_b64)

            header = json.loads(header_bytes.decode('utf-8'))
            payload = json.loads(payload_bytes.decode('utf-8'))

            # Reject any algorithm other than the one this validator expects
            # (blocks "alg": "none" and algorithm-confusion tokens).
            if header.get("alg") != "HS256":
                print("Validation Failed: Unsupported algorithm")
                return None

            # 2. Cryptographic Signature Validation (HMAC-SHA256 as a stand-in for RS256)
            signature_input = f"{header_b64}.{payload_b64}".encode('utf-8')
            expected_signature_bytes = hmac.new(
                self.signing_secret,
                signature_input,
                hashlib.sha256
            ).digest()
            expected_signature_b64 = base64.urlsafe_b64encode(expected_signature_bytes).decode('utf-8').rstrip('=')

            # Constant-time comparison avoids leaking how many characters matched
            if not hmac.compare_digest(expected_signature_b64, signature_b64):
                print("Validation Failed: Signature mismatch")
                return None

            # 3. Expiration Check: a missing or non-numeric "exp" is treated as invalid
            exp = payload.get("exp")
            if not isinstance(exp, (int, float)) or time.time() >= exp:
                print("Validation Failed: Token is expired or has no valid expiration")
                return None

            return payload

        except Exception as e:
            print(f"Validation Exception: {e}")
            return None

# Execution Simulation
if __name__ == "__main__":
    secret = "super_secure_vault_secret_key_2026"
    validator = ProcessMemoryJWTValidator(secret)

    # 1. Create a valid mock token expiring in 10 minutes
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "sub": "usr_9012",
        "roles": ["Admin"],
        "exp": int(time.time()) + 600
    }

    h_b64 = base64.urlsafe_b64encode(json.dumps(header).encode('utf-8')).decode('utf-8').rstrip('=')
    p_b64 = base64.urlsafe_b64encode(json.dumps(payload).encode('utf-8')).decode('utf-8').rstrip('=')
    
    sig_input = f"{h_b64}.{p_b64}".encode('utf-8')
    sig_bytes = hmac.new(secret.encode('utf-8'), sig_input, hashlib.sha256).digest()
    sig_b64 = base64.urlsafe_b64encode(sig_bytes).decode('utf-8').rstrip('=')

    valid_jwt = f"{h_b64}.{p_b64}.{sig_b64}"
    
    print("Decoding Valid JWT:")
    decoded = validator.validate_and_decode(valid_jwt)
    print("Decoded Claims:", decoded)

    # 2. Validate expired token simulation
    payload_expired = {
        "sub": "usr_9012",
        "roles": ["Admin"],
        "exp": int(time.time()) - 100 # expired
    }
    pe_b64 = base64.urlsafe_b64encode(json.dumps(payload_expired).encode('utf-8')).decode('utf-8').rstrip('=')
    sig_expired_bytes = hmac.new(secret.encode('utf-8'), f"{h_b64}.{pe_b64}".encode('utf-8'), hashlib.sha256).digest()
    sig_expired_b64 = base64.urlsafe_b64encode(sig_expired_bytes).decode('utf-8').rstrip('=')
    expired_jwt = f"{h_b64}.{pe_b64}.{sig_expired_b64}"

    print("\nDecoding Expired JWT:")
    _ = validator.validate_and_decode(expired_jwt)

5. Scaling Challenges & Bottlenecks

1. Redis Bloom Filter False Positives (Collisions)

  • Problem: When using Bloom Filters for token revocations, hash collisions can cause false positives. A false positive means a valid access token is mistakenly flagged as revoked, resulting in active users being logged out.
  • Mitigation: Scale out the filters. When the false-positive rate reaches 0.1% or memory saturates, initialize a Dual Bloom Filter structure. One filter remains read-only (old records) and a second new filter serves as the write-target. If a Bloom filter query yields a positive check, run a fallback query against a fast Redis Sorted Set (zset) containing exact revoked jti strings to verify if the match was a collision.
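
The "Bloom filter first, exact set on a hit" lookup can be sketched in plain Python. This is a toy model under stated assumptions: a small in-process Bloom filter using double hashing stands in for the Redis filter, a dict of jti to expiry stands in for the Redis Sorted Set, and the class names are illustrative.

```python
import hashlib
from typing import Dict, List

class BloomFilter:
    def __init__(self, m_bits: int, k: int) -> None:
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        # Double hashing: derive k bit positions from two 64-bit hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

class RevocationList:
    def __init__(self, m_bits: int = 1 << 16, k: int = 10) -> None:
        self.bloom = BloomFilter(m_bits, k)
        self.exact: Dict[str, float] = {}  # jti -> token exp (zset stand-in)

    def revoke(self, jti: str, exp: float) -> None:
        self.bloom.add(jti)
        self.exact[jti] = exp

    def is_revoked(self, jti: str) -> bool:
        # A Bloom miss is definitive: the jti was never added.
        if not self.bloom.might_contain(jti):
            return False
        # A Bloom hit may be a collision, so confirm against the exact set.
        return jti in self.exact
```

Because most tokens are not revoked, the exact lookup runs only on the small fraction of requests that hit the filter, so a false positive costs one extra lookup instead of a wrongly logged-out user.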

2. High-Frequency Signature Verifications CPU Overhead

  • Problem: In extremely high QPS downstream clusters (e.g. over 1M RPS), CPU cores spend up to 40% of their cycles running RSA signature evaluations.
  • Mitigation:
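    • Symmetric/Asymmetric Split: Gateways verify tokens using asymmetric keys (RS256) at the public API perimeter. Once authenticated, the gateway converts the JWT into a short-lived, symmetrically-signed downstream token (e.g. HS256 with a key shared only inside the internal service mesh), allowing internal services to verify signatures at roughly 10x lower CPU cost. The trade-off is that any internal service holding the shared key could also mint internal tokens, so the key must stay confined to the mesh and be rotated regularly.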

6. Technical Trade-offs & Compromises

  • HTML5 Local Storage / Headers: Simple to implement in client-side JS applications. However, local storage is accessible to any script running on the page, leaving tokens highly vulnerable to theft via Cross-Site Scripting (XSS) attacks.
  • Secure HTTP-Only Cookies: Inaccessible to client-side JS, so an XSS payload cannot read or exfiltrate the token (it can still issue requests from the victim's page). However, unless restricted with SameSite, cookies are attached automatically to cross-site requests, exposing the session to Cross-Site Request Forgery (CSRF) attacks.
  • Decision: We compromise with the Split-Token approach. The signature segment of the JWT is stored in an HTTP-Only, Secure, SameSite=Strict cookie, while the header and payload segments are held by client-side JS and sent in a request header. A script that steals the header and payload cannot exfiltrate a usable token without the signature, and a forged cross-site request carries neither the SameSite=Strict cookie nor the JS-supplied header.
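
On the server side, the split-token approach means the gateway has to stitch the two parts back together before verifying. A minimal sketch follows; the cookie name `jwt_sig` and the function name are assumptions made for illustration, and the header and payload are assumed to arrive as `Authorization: Bearer <header>.<payload>`.

```python
from typing import Mapping, Optional

SIGNATURE_COOKIE = "jwt_sig"  # hypothetical cookie name

def reassemble_split_token(headers: Mapping[str, str],
                           cookies: Mapping[str, str]) -> Optional[str]:
    """Join 'header.payload' from the request header with the cookie signature."""
    scheme, _, unsigned = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer":
        return None

    # The JS-held part must be exactly two non-empty segments.
    parts = unsigned.split(".")
    if len(parts) != 2 or not all(parts):
        return None

    # The HTTP-only cookie must carry a single signature segment.
    signature = cookies.get(SIGNATURE_COOKIE, "")
    if not signature or "." in signature:
        return None

    return f"{unsigned}.{signature}"
```

The reassembled string is then passed to the normal signature validator; a request missing either half never reaches verification.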

7. Failure Scenarios & Operational Resiliency

1. JWKS Identity Service Outage

  • Scenario: The Auth Server goes offline completely, meaning services cannot refresh public keys to verify newly issued tokens.
  • Resiliency Plan: Downstream gateways maintain a stale-cache grace period. The JWKS key retriever caches the keys with a 7-day TTL and a 12-hour refresh interval. If the identity server goes down, services continue verifying tokens using cached keys until the cluster recovers.
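
The stale-cache behaviour described above could look roughly like this. It is a sketch under assumptions: `JWKSCache` and its `fetch` callable are hypothetical, keys are modelled as a plain kid-to-PEM dict, and retry backoff after a failed fetch is left out.

```python
import time
from typing import Callable, Dict, Optional

class JWKSCache:
    def __init__(self,
                 fetch: Callable[[], Dict[str, str]],
                 refresh_interval: float = 12 * 3600,   # 12-hour refresh
                 stale_ttl: float = 7 * 24 * 3600,      # 7-day hard TTL
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._refresh_interval = refresh_interval
        self._stale_ttl = stale_ttl
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def get_key(self, kid: str) -> Optional[str]:
        now = self._clock()
        due = (self._fetched_at is None
               or now - self._fetched_at >= self._refresh_interval)
        if due:
            try:
                self._keys = dict(self._fetch())
                self._fetched_at = now
            except Exception:
                # Auth server unreachable: keep serving the cached keys.
                pass
        # Never fetched, or cache older than the hard TTL: stop trusting it.
        if self._fetched_at is None or now - self._fetched_at > self._stale_ttl:
            return None
        return self._keys.get(kid)
```

Injecting `clock` makes the grace period easy to test without waiting; production code would also want to limit how often a failing fetch is retried.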

2. Redis Bloom Filter Outage

  • Scenario: The Redis Cluster managing revoked tokens crashes or is partitioned.
  • Resiliency Plan: Gateways fail open on revocation validation checks for standard routes, but require full active DB token verification for high-risk scopes (such as modifying billing details or withdrawing credits).
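
The fail-open versus fail-closed split can be expressed as a small policy function. This is a sketch only: the scope names are hypothetical, `ConnectionError` stands in for whatever exception the Redis client raises, and `db_check` represents the authoritative database lookup.

```python
from typing import Callable, Iterable

# Hypothetical scope names for routes that must never skip revocation checks.
HIGH_RISK_SCOPES = frozenset({"billing:write", "credits:withdraw"})

def is_token_revoked(jti: str,
                     scopes: Iterable[str],
                     bloom_check: Callable[[str], bool],
                     db_check: Callable[[str], bool]) -> bool:
    """Revocation check that degrades by risk level when Redis is unavailable."""
    try:
        return bloom_check(jti)
    except ConnectionError:
        if HIGH_RISK_SCOPES & set(scopes):
            # High-risk request: fall back to the authoritative store.
            return db_check(jti)
        # Standard route: fail open and rely on the short token lifetime.
        return False
```

Failing open is tolerable here only because access tokens expire within minutes; with long-lived tokens the same policy would leave revoked sessions usable for the whole outage.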

3. Clock Drift Across Distributed Clusters

  • Scenario: A microservice node's system clock drifts 15 seconds ahead of the Auth server's clock, so the node rejects tokens as expired up to 15 seconds early. (A node whose clock runs behind has the opposite problem: freshly issued tokens appear to be issued in the future.)
  • Resiliency Plan: Mandate a 30-second clock skew tolerance in our token validators. A token whose exp has passed is still accepted as long as the local time is no more than 30 seconds past exp.
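
The skew tolerance amounts to adding a leeway to the expiration comparison. A minimal sketch, with `is_expired` as an illustrative helper name and the `now` parameter added only so the check can be tested deterministically:

```python
import time
from typing import Optional

CLOCK_SKEW_LEEWAY_SECONDS = 30

def is_expired(exp: float,
               now: Optional[float] = None,
               leeway: float = CLOCK_SKEW_LEEWAY_SECONDS) -> bool:
    """Treat a token as expired only once local time exceeds exp + leeway."""
    current = time.time() if now is None else now
    return current > exp + leeway

# A node running 15 seconds fast still accepts a token at its true expiry.
print(is_expired(exp=1779422100, now=1779422100 + 15))  # False
print(is_expired(exp=1779422100, now=1779422100 + 31))  # True
```

The leeway effectively extends every token's lifetime by 30 seconds, which is a small cost next to a 15-minute expiry.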

8. Candidate Verbal Script

Below is a mock interview walkthrough demonstrating how a candidate should execute this system design scenario.

Interviewer: "Design a stateless authentication platform that can handle 1 Billion user accounts and 1 Million QPS."

Candidate: "To support a massive QPS of 1 Million without database bottlenecks, I will design a Stateless Authentication Architecture centered around JSON Web Tokens (JWT). Downstream microservices will verify identity cryptographically and locally in-process.

To prevent downstream services from signing fraudulent tokens, I will implement Asymmetric Cryptography (RS256). The Auth Server holds the private key to sign tokens, while the API Gateway and microservices pull public keys asynchronously via a cached JWKS endpoint.

For session revocations, we face a major challenge because JWTs are stateless. To keep memory usage and lookup latency down, I will use a Redis Cluster with Bloom Filters. When users log out or change passwords, the auth server writes their token's unique identifier (jti) to the Bloom Filter.

Downstream gateways check this Bloom Filter. With a false-positive target rate of 0.1%, we can manage 10 Million active revocations using only about 18 Megabytes of Redis RAM (roughly 14.4 bits per entry), keeping latency under 2 milliseconds.

Finally, to address high QPS CPU bottlenecks, I will run a split validation model: the API gateway handles RS256 validation at the perimeter, then hydrates requests with internal headers to allow internal microservices to process requests without CPU-heavy validation overhead."

Interviewer: "What happens if a user's access token is stolen? How do we mitigate the attack window?"

Candidate: "We use three mitigation vectors. First, access tokens have a very short lifetime—exactly 15 minutes. Second, we split the token: we store the signature in an HTTP-only secure cookie, protecting it from JS-based XSS extraction.

Third, our refresh tokens use strict refresh token rotation. Every time a refresh token is used, it is invalidated, and a new one is issued. If a malicious actor steals a refresh token and tries to replay it, our database detects that the token was already used, instantly revoking the entire session lineage to protect the account from further access."


