Lesson 14 of 23 6 min

System Design: Designing Stateless Authentication

A comprehensive guide on stateless authentication using JWT in microservices.

In a microservices architecture, in-memory server-side sessions break down because each request might hit a different service instance, and a shared session database adds a lookup and a central dependency to every request. Stateless authentication using JWTs (JSON Web Tokens) is a widely used alternative.

1. How it works

graph LR
    Client[Client] -->|1. Login with username/password| Auth[Auth Server]
    Auth -->|2. Issue signed JWT| Client
    Client -->|3. JWT in request header| Service[Microservice]
    Auth -.->|Public key for step 4 verification| Service
  1. Login: The user authenticates with a username and password.
  2. Issue: The Auth Server generates a JWT, signs it with its private key, and sends it to the client.
  3. Usage: The client sends the JWT in the Authorization header (typically as a Bearer token) on every subsequent request.
  4. Verification: Services verify the signature using the matching public key. If the signature is valid and the token has not expired, the user is authenticated. No database lookup is needed.
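
The issue and verify steps above can be sketched with the Python standard library. One loud assumption: the standard library has no RSA or EdDSA signing, so this sketch swaps in HS256 (a shared HMAC key) purely so it runs without dependencies. With the asymmetric algorithms this lesson recommends, only the signing and verifying lines change, and a maintained JWT library would normally handle all of this. The names `DEMO_KEY`, `issue_token`, and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Assumption: a demo shared key. With RS256/EdDSA the Auth Server would sign
# with a private key and services would verify with the public key instead.
DEMO_KEY = b"demo-only-key"


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    digest = hmac.new(DEMO_KEY, signing_input.encode("ascii"), hashlib.sha256).digest()
    return b64url(digest)


def issue_token(subject: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Step 2: build header.payload and append the signature."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "iat": issued_at, "exp": issued_at + ttl_seconds}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return signing_input + "." + _sign(signing_input)


def verify_token(token: str, now: Optional[float] = None) -> dict:
    """Step 4: check algorithm, signature, and expiry without any lookup."""
    current = time.time() if now is None else now
    header_b64, payload_b64, signature = token.split(".")  # ValueError if malformed
    header = json.loads(b64url_decode(header_b64))
    # Pin the expected algorithm instead of trusting whatever the token claims.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _sign(header_b64 + "." + payload_b64)
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims["exp"] <= current:
        raise ValueError("token expired")
    return claims
```

The `now` parameter exists only to make the sketch testable with fixed timestamps. The signature is checked before the payload is parsed, so forged claims are never trusted.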

2. Security: Signing & Expiry

  • Signature: Always use asymmetric signing (RS256 or EdDSA). The Auth Server keeps the Private Key (to sign); Microservices keep the Public Key (to verify).
  • Short-lived tokens: Tokens should expire in 15-60 minutes to limit the blast radius if stolen.
  • Refresh Tokens: Use a longer-lived refresh token stored in an HTTP-only cookie to issue new access tokens.

3. The Revocation Challenge

JWTs are "stateless," meaning you can't easily log a user out before their token expires.

  • Solution: Keep a revocation list (a denylist) in a fast distributed store like Redis. On each request, check whether the token ID (jti) is on the list. An entry only needs to live until the token's own expiry, so the list stays small when access tokens are short-lived.
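
As a rough illustration of the revocation check, the sketch below uses an in-memory dictionary as a stand-in for Redis (an assumption made so the example runs on its own; a real deployment would use the shared store). The class name `RevocationList` and its methods are hypothetical. Each entry is kept only until the revoked token would have expired anyway.

```python
import time
from typing import Dict, Optional


class RevocationList:
    """In-memory stand-in for a Redis denylist keyed by jti (illustrative only)."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def revoke(self, jti: str, token_exp: float) -> None:
        # Remember the token's own expiry so the entry can be dropped afterwards.
        self._expires_at[jti] = token_exp

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        expires_at = self._expires_at.get(jti)
        if expires_at is None:
            return False
        if expires_at <= current:
            # The token is past its exp claim and fails normal validation anyway,
            # so the denylist entry is no longer needed.
            del self._expires_at[jti]
            return False
        return True
```

With Redis, the same effect is usually achieved by setting a key with a time-to-live equal to the token's remaining lifetime.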

4. Access token vs refresh token boundary

A robust auth system separates responsibilities:

  • short-lived access token for API authorization
  • long-lived refresh token for session continuity

Refresh tokens should be rotated on every use and tied to device/session identifiers to detect theft or replay.
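
A minimal sketch of rotation with replay detection, assuming an in-memory store and opaque random refresh tokens. The class `RefreshTokenStore` and the policy of revoking the whole session on reuse are illustrative assumptions, not a prescribed design.

```python
import secrets
from typing import Dict, Set


class RefreshTokenStore:
    """Tracks one active refresh token per issue/rotate step, tied to a session."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}   # current token -> session id
        self._retired: Dict[str, str] = {}  # already-used token -> session id
        self._revoked_sessions: Set[str] = set()

    def issue(self, session_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = session_id
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; the old one is retired."""
        if token in self._retired:
            # A retired token showing up again suggests theft or replay:
            # revoke every token in that session.
            self._revoke_session(self._retired[token])
            raise PermissionError("refresh token reuse detected; session revoked")
        session_id = self._active.pop(token, None)
        if session_id is None or session_id in self._revoked_sessions:
            raise PermissionError("unknown or revoked refresh token")
        self._retired[token] = session_id
        return self.issue(session_id)

    def _revoke_session(self, session_id: str) -> None:
        self._revoked_sessions.add(session_id)
        for active_token in [t for t, s in self._active.items() if s == session_id]:
            del self._active[active_token]
```

If an attacker replays a stolen token after the legitimate client has rotated it, both parties lose access and the user must log in again, which surfaces the theft.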

5. Key rotation and JWKS strategy

Signing keys must rotate periodically without downtime.

Best practice:

  • expose public keys through a JWKS endpoint
  • include a kid (key ID) in the JWT header
  • allow old and new keys to overlap during the rotation window

Services should cache keys with TTL and re-fetch on unknown kid.
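
The caching rule can be sketched as follows. The `fetch_keys` callable is a hypothetical stand-in for an HTTP call to the JWKS endpoint, and it returns a plain mapping of kid to key material to keep the example self-contained.

```python
import time
from typing import Callable, Dict, Optional


class KeyCache:
    """Caches verification keys by kid, refreshing on TTL expiry or unknown kid."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]], ttl_seconds: float) -> None:
        self._fetch_keys = fetch_keys  # assumption: wraps the JWKS endpoint call
        self._ttl = ttl_seconds
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def get(self, kid: str, now: Optional[float] = None) -> str:
        current = time.time() if now is None else now
        stale = self._fetched_at is None or current - self._fetched_at >= self._ttl
        if stale or kid not in self._keys:
            # Re-fetch when the cache is old, or when a token names a key we
            # have not seen yet (for example, right after a rotation).
            self._keys = dict(self._fetch_keys())
            self._fetched_at = current
        if kid not in self._keys:
            raise KeyError("no published key for kid " + repr(kid))
        return self._keys[kid]
```

A production version would likely also rate-limit re-fetches triggered by unknown kid values, so that tokens with random kids cannot be used to hammer the JWKS endpoint.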

6. Claims design and least privilege

JWT claims should be minimal and purpose-specific:

  • subject (sub) and tenant context
  • coarse role/scopes for authorization
  • expiry and issued-at timestamps

Avoid overstuffing user profile data; large tokens increase bandwidth overhead and stale-claim risk.

7. Multi-service authorization pattern

Authentication and authorization are related but different:

  • gateway verifies token integrity and baseline policy
  • downstream services enforce fine-grained domain authorization

Do not centralize all authorization logic in one edge layer for complex domains.
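
To make the split concrete, here is a small sketch with a coarse gateway check and a finer check inside a hypothetical orders service. The claim names (`scope`, `tenant`), the scope strings, and the ownership rule are all assumptions chosen for illustration.

```python
from typing import Dict, Set


def scopes_of(claims: Dict[str, object]) -> Set[str]:
    """Parse a space-separated scope claim (assumed claim name: "scope")."""
    return set(str(claims.get("scope", "")).split())


def gateway_allows(claims: Dict[str, object], required_scope: str) -> bool:
    """Edge check: coarse, route-level policy only."""
    return required_scope in scopes_of(claims)


def order_service_allows_read(claims: Dict[str, object], order: Dict[str, str]) -> bool:
    """Domain check: only the service knows about tenants and ownership."""
    if claims.get("tenant") != order["tenant"]:
        return False
    return claims.get("sub") == order["owner"] or "orders:admin" in scopes_of(claims)
```

A caller with the `orders:read` scope passes the gateway for any order route, but the service still refuses to return an order that belongs to another user or tenant.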

8. Threat model considerations

Key risks:

  • token theft from XSS/local storage leaks
  • refresh token replay
  • algorithm confusion or weak signature validation
  • accepting tokens from wrong issuer/audience

Mitigations include strict iss/aud checks, HTTP-only secure cookies, CSP hardening, and anomaly detection.
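
The strict checks can be sketched as a function that runs alongside signature verification. The allow-list contents and the function name `find_token_problems` are assumptions. The `aud` claim is handled as either a single string or a list, since the JWT specification permits both.

```python
from typing import Dict, List

# Assumption: this service only ever expects asymmetric algorithms.
ALLOWED_ALGS = {"RS256", "EdDSA"}


def find_token_problems(
    header: Dict[str, object],
    claims: Dict[str, object],
    expected_issuer: str,
    expected_audience: str,
) -> List[str]:
    """Return the names of failed checks; an empty list means all passed."""
    problems: List[str] = []
    # Reject "none" and any algorithm the service did not explicitly allow,
    # which guards against algorithm-confusion attacks.
    if header.get("alg") not in ALLOWED_ALGS:
        problems.append("alg")
    if claims.get("iss") != expected_issuer:
        problems.append("iss")
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_audience not in audiences:
        problems.append("aud")
    return problems
```

Returning the failed check names, rather than a bare boolean, also makes it easy to count verification failures by reason.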

9. Performance and reliability trade-offs

Stateless verification is fast, but revocation and introspection can reintroduce stateful dependencies.

Practical approach:

  • local verification for most requests
  • selective Redis revocation checks for sensitive scopes
  • failover policy for revocation backend outages based on risk tier

Auth design is always a balance between security response speed and availability.
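
One way to express such a policy is sketched below. The three tiers and their behavior (skip the check, fail open, fail closed) are assumptions chosen to illustrate the trade-off, and `is_revoked` stands in for a call to the revocation backend that raises `ConnectionError` when the backend is unreachable.

```python
from enum import Enum
from typing import Callable


class RiskTier(Enum):
    LOW = "low"        # local verification only, no revocation lookup
    MEDIUM = "medium"  # lookup, but fail open if the backend is down
    HIGH = "high"      # lookup, and fail closed if the backend is down


def passes_revocation_check(
    jti: str, tier: RiskTier, is_revoked: Callable[[str], bool]
) -> bool:
    """Decide whether a request may proceed, given its risk tier."""
    if tier is RiskTier.LOW:
        return True
    try:
        return not is_revoked(jti)
    except ConnectionError:
        # Backend outage: favor availability for MEDIUM, security for HIGH.
        return tier is RiskTier.MEDIUM
```

Under this policy a Redis outage degrades only the sensitive operations, while ordinary traffic keeps flowing on local verification alone.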

10. Observability checklist

Track:

  • token verification failures by reason
  • refresh success/failure rates
  • revocation hit count
  • suspicious geo/device token reuse

Security systems without telemetry turn incidents into blind forensics exercises.
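
A minimal in-process sketch of the first three counters follows. The class `AuthMetrics` is hypothetical, and a real service would export these values to its metrics system rather than hold them in memory. Detecting suspicious geo/device reuse needs request context and is left out.

```python
from collections import Counter


class AuthMetrics:
    """Holds simple auth counters (illustrative, not a metrics client)."""

    def __init__(self) -> None:
        self.verification_failures: Counter = Counter()  # keyed by reason
        self.refresh_outcomes: Counter = Counter()       # "success" / "failure"
        self.revocation_hits = 0

    def record_verification_failure(self, reason: str) -> None:
        self.verification_failures[reason] += 1

    def record_refresh(self, success: bool) -> None:
        self.refresh_outcomes["success" if success else "failure"] += 1

    def record_revocation_hit(self) -> None:
        self.revocation_hits += 1

    def refresh_failure_rate(self) -> float:
        total = sum(self.refresh_outcomes.values())
        return self.refresh_outcomes["failure"] / total if total else 0.0
```

Keying failures by reason (for example "expired", "bad signature", "aud") separates routine expiry noise from signals that may indicate an attack.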

Summary

Stateless auth removes a common scaling bottleneck in microservices. By moving authentication state from the server into the client's signed token, services can verify requests locally instead of querying a shared session store, which lets them scale independently.

Engineering Standard: The "Staff" Perspective

In high-throughput distributed systems, the code we write is often the easiest part. The difficulty lies in how that code interacts with other components in the stack.

1. Data Integrity and The "P" in CAP

Whenever you are dealing with state (Databases, Caches, or In-memory stores), you must account for Network Partitions. In a standard Java microservice, we often choose Availability (AP) by using Eventual Consistency patterns. However, for financial ledgers, we must enforce Strong Consistency (CP), which usually involves distributed locks (Redis Redlock or Zookeeper) or a strictly linearizable sequence.

2. The Observability Pillar

Writing logic without observability is like flying a plane without a dashboard. Every production service must implement:

  • Tracing (OpenTelemetry): Track a single request across 50 microservices.
  • Metrics (Prometheus): Monitor Heap usage, Thread saturation, and P99 latencies.
  • Structured Logging (ELK/Splunk): Never log raw strings; use JSON so you can query logs like a database.

3. Production Incident Prevention

To survive a 3:00 AM incident, we use:

  • Circuit Breakers: Stop the bleeding if a downstream service is down.
  • Bulkheads: Isolate thread pools so one failing endpoint doesn't crash the entire app.
  • Retries with Exponential Backoff and Jitter: Spread retries out so clients don't all hit a recovering service at once (the "Thundering Herd" problem).
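
As an illustration of the retry item, here is a small sketch of exponential backoff with random jitter. The function names, default delays, and the choice of `ConnectionError` as the retryable failure are assumptions. The `sleep` and `rng` parameters are injectable so the behavior can be exercised without real waiting.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(
    retries: int, base: float = 0.1, cap: float = 5.0, rng: Optional[random.Random] = None
) -> List[float]:
    """Random delay in [0, min(cap, base * 2**n)] for each retry n."""
    source = rng or random.Random()
    return [source.uniform(0, min(cap, base * 2 ** n)) for n in range(retries)]


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Run operation, retrying on ConnectionError with jittered backoff."""
    delays = backoff_delays(max_attempts - 1, rng=rng)
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            sleep(delays[attempt])
    raise ValueError("max_attempts must be at least 1")
```

The randomness is what prevents many clients from retrying in lockstep. Without it, exponential backoff alone still produces synchronized bursts.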

Critical Interview Nuance

When an interviewer asks you about this topic, don't just explain the code. Explain the Trade-offs. A Staff Engineer is someone who knows that every architectural decision is a choice between two "bad" outcomes. You are picking the one that aligns with the business goal.

Performance Checklist for High-Load Systems:

  1. Minimize Object Creation: Use primitive arrays and reusable buffers.
  2. Batching: Group 1,000 small writes into 1 large batch to save I/O cycles.
  3. Async Processing: If the user doesn't need the result immediately, move it to a Message Queue (Kafka/SQS).

Technical Trade-offs: Messaging Systems

| Pattern | Ordering | Durability | Throughput | Complexity |
|---|---|---|---|---|
| Log-based (Kafka) | Strict (per partition) | High | Very High | High |
| Memory-based (Redis Pub/Sub) | None | Low | High | Very Low |
| Push-based (RabbitMQ) | Fair | Medium | Medium | Medium |

Key Takeaways

  • Signature: Always use asymmetric signing (RS256 or EdDSA). The Auth Server keeps the Private Key (to sign); Microservices keep the Public Key (to verify).
  • Short-lived tokens: Tokens should expire in 15-60 minutes to limit the blast radius if stolen.
  • Refresh Tokens: Use a longer-lived refresh token stored in an HTTP-only cookie to issue new access tokens.

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

Verbal Interview Script

Interviewer: "How would you ensure high availability and fault tolerance for this specific architecture?"

Candidate: "To achieve 'Five Nines' (99.999%) availability, we must eliminate all Single Points of Failure (SPOF). I would deploy the API Gateway and stateless microservices across multiple Availability Zones (AZs) behind an active-active load balancer. For the data layer, I would use asynchronous replication to a read-replica in a different region for disaster recovery. Furthermore, it's not enough to just deploy redundantly; we must protect the system from cascading failures. I would implement strict timeouts, retry mechanisms with exponential backoff and jitter, and Circuit Breakers (using a library like Resilience4j) on all synchronous network calls between microservices."
