Lesson 33 of 105 10 minFlagship

API Rate Limiting at Scale: Redis-Based Strategies

Master the architecture of a high-throughput API rate limiter. Learn about Token Bucket, Leaky Bucket, Sliding Window Log algorithms, and atomic Redis Lua scripting.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **Algorithmic Sizing:** Choosing between Token Bucket, Leaky Bucket, and Sliding Window.
  • **Atomic Operations:** Running rate limiting logic atomically in Redis using Lua scripts.
  • **High Performance:** Limiting check latency to less than 2ms overhead.
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Mental Model

Enforcing access limits at scale requires shielding database partitions from traffic surges without introducing a processing bottleneck. A high-performance API rate limiter isolates traffic checking to a sharded Redis cluster, running atomic calculations via Lua scripts to avoid race conditions, and gracefully degrades to local event-loop limiters if the cache layer becomes unreachable.


Requirements and System Goals

When building an enterprise-grade API Rate Limiting service (similar to Cloudflare edge throttlers or GitHub's API rate limits), we must align our system parameters with strict latency budgets.

1. Functional Requirements

  • Multi-Tenant Throttling: Support rate limiting based on client IP address, authenticated user JWT tokens, or API keys.
  • Standard Throttling Telemetry: Automatically return standardized HTTP headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) to inform clients of their remaining quotas.
  • Granular Time Boundaries: Enforce rate limits over varying sliding time spans (e.g., $100$ requests/minute, $50,000$ requests/day).

2. Non-Functional Requirements & Performance Budgets

  • Ultra-Low Latency Checking: The rate-limiting check must introduce less than 2ms of overhead to the request lifecycle.
  • High Availability Target: 99.999% availability for edge limiters. If the cache layer suffers transient network drops, the rate limiter must fail-open gracefully rather than blocking legitimate customer requests.
  • Atomic Precision: Prevent race conditions under high concurrency where a client bypasses quota limits by issuing thousands of parallel threads.

API Interfaces and Service Contracts

API Gateways intercept incoming client requests and evaluate throttling boundaries before routing calls.

1. Intercepted Client Gateway Request

GET /api/v1/resources

Request Headers:

  • Authorization: Bearer jwt_usr_99ab-88cd
  • X-Forwarded-For: 102.15.22.8

2. Throttled Response (HTTP 429 Too Many Requests)

If the client exceeds their configured quota, the gateway rejects the request immediately.

Response Headers:

  • Retry-After: 36 (Seconds remaining until quota reset)
  • X-RateLimit-Limit: 100 (Max allowed requests per window)
  • X-RateLimit-Remaining: 0 (Remaining quota)
  • X-RateLimit-Reset: 1780000036 (Unix epoch timestamp of quota reset)

Response Payload (429 Too Many Requests):

{
  "errorCode": "API_RATE_LIMIT_EXCEEDED",
  "message": "You have exceeded your request quota of 100 requests per minute. Please try again after 36 seconds.",
  "retryAfterSeconds": 36,
  "resetTimestamp": 1780000036000
}

High-Level Design and Visualizations

Each rate limiting algorithm manages request traffic flow and accommodates traffic bursts differently.

1. Algorithmic Traffic Routing

  • Token Bucket: Tokens are added to a bucket at a fixed rate. Requests consume a token. Allows bursts.
  • Leaky Bucket: Requests fill a buffer queue and leak out at a constant rate. Smooths out traffic spikes.
  • Sliding Window Log: Logs timestamps of all requests, checking the exact window range dynamically. Absolute precision.
graph TD
    subgraph Token Bucket [Token Bucket: Bursty Traffic]
        Refill[Refill: +10 tokens/sec] -->|Cap: 100 tokens| Bucket((Token Pool))
        RequestA[Request] -->|Consume 1 Token| Bucket
        Bucket -->|Success| AllowA[Allow Request]
        Bucket -->|Empty| BlockA[HTTP 429]
    end

    subgraph Leaky Bucket [Leaky Bucket: Smooth Traffic]
        RequestB[Request Burst] -->|Fill Queue| Queue{FIFO Buffer Queue}
        Queue -->|Overflow| BlockB[HTTP 429]
        Queue -->|Leak at constant rate| Leak[Process: 10 requests/sec]
    end

    subgraph Sliding Window Log [Sliding Window Log: Precise Quotas]
        RequestC[Request at timestamp T] -->|1. Add T to Sorted Set| Redis[(Redis ZSET)]
        Redis -->|2. Remove items older than T - 60s| Redis
        Redis -->|3. Count remaining items| Check{Is Count <= Quota?}
        Check -->|Yes| AllowC[Allow Request]
        Check -->|No| BlockC[HTTP 429]
    end

Low-Level Design and Schema Strategies

To support atomic multi-threaded checking, the system delegates rate-limiting execution to a highly optimized Redis Lua script.

1. Redis Sliding Window Rate Limiter Lua Script

The sliding window log is implemented using Redis Sorted Sets (ZSET). We run the calculation atomically in Lua to prevent race conditions during concurrent check loops.

-- KEYS[1]: Redis ZSET Key (e.g., 'rate_limit:usr_99ab')
-- ARGV[1]: Current Unix Timestamp (milliseconds)
-- ARGV[2]: Window Size (milliseconds, e.g., 60000 for 1 minute)
-- ARGV[3]: Maximum Quota Limit (e.g., 100)
-- ARGV[4]: Unique Request UUID

local rate_limit_key = KEYS[1]
local now = tonumber(ARGV[1])
local window_size = tonumber(ARGV[2])
local max_limit = tonumber(ARGV[3])
local request_id = ARGV[4]

local clear_before = now - window_size

-- 1. Remove expired timestamps outside the current sliding window
redis.call('ZREMRANGEBYSCORE', rate_limit_key, 0, clear_before)

-- 2. Count remaining request timestamps inside the active window
local current_requests_count = redis.call('ZCARD', rate_limit_key)

-- 3. Check if count exceeds limit
if current_requests_count < max_limit then
    -- Add the current request timestamp to the ZSET
    redis.call('ZADD', rate_limit_key, now, request_id)
    -- Set a TTL to auto-expire the ZSET after the window closes to prevent memory leaks
    redis.call('EXPIRE', rate_limit_key, math.ceil(window_size / 1000))
    return 1 -- Request allowed
else
    return 0 -- Request throttled
end

2. Audit Telemetry Schema

To generate reports on tenant utilization and audit security events, we log throttled events to a relational PostgreSQL database.

-- PostgreSQL Rate Limit Throttling Audit logs
CREATE TABLE ratelimit_audit_logs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(50) NOT NULL, -- JWT user ID or IP Address
    endpoint_path VARCHAR(255) NOT NULL,
    request_timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    quota_limit INT NOT NULL,
    current_count INT NOT NULL,
    client_ip VARCHAR(45) NOT NULL
);

CREATE INDEX idx_ratelimit_audit ON ratelimit_audit_logs(tenant_id, request_timestamp DESC);

Scaling and Operational Challenges

1. Sliding Window Memory Capacity Calculations

A primary operational challenge for the Sliding Window Log is its high memory footprint, as it stores a unique timestamp for every request.

  • Volume: Let's size a system processing $1,000,000$ active tenants with a standard quota of $100$ requests/minute.
  • Storage per timestamp:
    • Each element inside the Redis ZSET stores a Unix timestamp (double float: $8$ bytes) and a unique request UUID ($16$ bytes) = $24$ bytes of raw payload.
    • Accounting for Redis ZSET internal node memory overhead (pointers, hashes), each entry requires approximately $64$ bytes of RAM.
  • Memory Sizing:
    • Peak memory per active tenant under maximum quota: $100 \text{ requests} \times 64 \text{ bytes} \approx 6.4 \text{ KB}$.
    • Total RAM required for 1,000,000 active tenants: $1,000,000 \times 6.4 \text{ KB} \approx 6,400,000 \text{ KB} \approx 6.1 \text{ GB}$ of dedicated Redis RAM.
    • Staff Optimization: If memory limits are constrained, implement the Sliding Window Counter algorithm. Instead of logging timestamps, we divide the window into fixed sub-intervals, keeping two simple integer counters representing the previous and current intervals, reducing memory overhead by greater than 90% down to less than $1$ KB per tenant.

2. Redis Cluster Multi-Node Sharding & Hotkeys

When rate-limiting client IP addresses globally, a highly active API key can generate millions of requests/second, creating a massive Redis Hotkey Partition.

  • The Bottle Neck: All requests route to the single Redis node hosting that specific key, saturating its CPU core.
  • Staff Solution: Implement Local In-Memory Cache Deduplication at the API Gateway layer. Prior to executing a remote Redis Lua script, the gateway executes a local, ultra-fast in-memory check (using an LRU cache or Token Bucket instance). If local checks verify the client is safely under $50%$ of their quota, the remote Redis call is executed. If a sudden burst occurs, the gateway throttles the user locally, reducing edge Redis cluster networking calls by greater than 75%.

Architectural Trade-offs and Algorithmic Decisions

Selecting the right rate limiting algorithm dictates the performance latency, storage cost, and traffic burst handling of the gateway edge.

Algorithm P99 processing Overhead Memory Footprint Handles Bursts? Absolute Precision
Token Bucket Sub-1ms (O(1) Redis hash query) Very Low ($\approx 100$ bytes per tenant) Yes (Bursty traffic allowed) No (Allows bursts to pass limits)
Leaky Bucket Low (O(1) FIFO queue check) Low (Queue buffers items) No (Forces constant smooth flow) Yes (Smooths traffic strictly)
Sliding Window Log Medium (O(log N) ZSET operations) High ($6.4$ KB per tenant) No (Enforces rigid limit bounds) Yes (Absolute time precision)

Failure Modes and Fault Tolerance Strategies

1. Redis Cluster Outage & Fail-Open Fallbacks

If the edge Redis cluster suffers a total network partition or crashes, a naive API gateway will block all client requests (failing closed) or let all traffic pass unchecked (failing open, vulnerable to backend database exhaustion).

  • Staff Mitigation: Implement Graceful Local Degradation (Resilient Fail-Open). Configure the API Gateway with a local, in-memory rate-limiter fallback (using a lightweight library like Token Bucket in Java/Go). If the Redis cluster connection times out (budgeted at less than 50ms), the gateway logs an operational warning, switches to the local in-memory throttler, and continues routing requests, protecting both customer integration and backend system availability.

2. NTP Clock Drift Race Conditions

Sorted Set calculations (now - window_size) depend strictly on the physical time reported by servers.

  • The Failure: If API Gateway Node A has a system clock drifted 5 seconds fast, and Node B has a correct clock, clients whose requests are load-balanced to Node A will suffer immediate, false throttling alerts.
  • Mitigation: Synchronize all edge gateway nodes and Redis instances using Chrony or NTP with strict alarms if local system clock drifts exceed 5ms.

Staff Engineer Perspective


Production Readiness Checklist

Before moving a Redis-based rate limiter to production:

  • Lua scripts pre-loaded: Gateway nodes pre-load Lua scripts using SCRIPT LOAD to execute calls via efficient SHA hashes.
  • NTP Clock synchronization verified: Chrony is active across all edge servers with drift thresholds set to less than 10ms.
  • Fail-Open fallback active: In-memory rate limiting libraries degrade gracefully if Redis connection latency spikes beyond 50ms.
  • TTL Keys set: All Redis keys are configured with dynamic time-to-live settings to prevent RAM capacity exhaustion.


Verbal Script

Interviewer: "How would you design a highly available, low-overhead API rate-limiting system for a high-traffic microservices cluster?"

Candidate: "To design a highly available, low-overhead rate limiting system, I would deploy a centralized, high-performance Redis-based Rate Limiter integrated directly into our Kong API Gateway layer.

To enforce rate limiting with absolute time precision and prevent race conditions, I would implement the Sliding Window Log algorithm.

To execute the rate-limiting check without introducing a network round-trip bottleneck or event-loop blocking, the gateway delegates the rate limit checking to a Redis Lua Script. When a request arrives, the gateway calls this pre-loaded Lua script atomically.

The script cleans up expired request timestamps older than our current sliding window using ZREMRANGEBYSCORE, counts the remaining entries inside the ZSET using ZCARD, and checks if the quota is exceeded. If the user is within limits, it logs the current timestamp using ZADD and returns success. This entire atomic check executes in less than 2ms P99 latency overhead.

To scale this to millions of active tenants and handle hotkeys where highly active clients overload a single Redis node, I would implement Local In-Memory Cache Deduplication on the gateway nodes. The gateway maintains a local Token Bucket filter. If a user is safely under 50% of their quota, the remote Redis call is bypassed, conserving Redis network resources.

Finally, to guarantee fault tolerance and prevent an edge outage if the Redis cluster becomes unreachable, I would configure our rate limiter to fail-open gracefully. The gateway dynamically degrades to a local, in-memory rate-limiter fallback, logging warning metrics to Prometheus but continuing to route requests, ensuring zero disruption to our customer integrations."

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.