Case Study: Designing a Global Distributed Rate Limiter

Rate limiting is a critical operational safeguard for protecting distributed systems from abuse, intentional Denial of Service (DoS) attacks, brute-force exploits, and the "noisy neighbor" problem in multi-tenant environments.

This case study designs a globally scalable, high-throughput, low-latency distributed rate limiter capable of evaluating 10 billion requests daily under a strict P99 latency SLA of < 2ms for rate-limiting decisions.

1. Requirements & Core Constraints

Functional Requirements

Multi-level Throttling: The rate limiter must support throttling based on multiple identities:
- IP Address: Prevent anonymous brute-forcing and scraping (e.g., max 60 requests per minute).
- User ID: Enforce user tier tiers (e.g., Free users: 100 req/hr; Pro users: 50,000 req/hr).
- API Key/Token: Enforce server-to-server enterprise billing agreements.
Dynamic Rule Updates: Administrators must be able to update rate-limit rules dynamically in real time without restarting gateway instances.
Actionable Client Signals: Blocked requests must return an explicit HTTP 429 Too Many Requests code alongside headers detailing the limit status and cooldown duration.

Non-Functional Requirements (SLAs)

Low Latency Overhead: The rate-limiting check is on the critical request path. The decision must add < 2ms (P99) to total request latency.
High Availability: The system must achieve 99.999% availability ("Five Nines"). A failure in the rate limiter layer must not take down the entire application stack.
Accuracy & Atomicity: The system must handle high concurrent traffic from a single user across multiple client nodes without suffering from race conditions (over-limit leakage).

Back-of-the-Envelope Estimation (100M DAU)

Let's size our cluster and memory footprint:

Total Daily Requests: $10,000,000,000$ requests/day
Average QPS: $$\text{Average QPS} = \frac{10^{10} \text{ requests}}{86400 \text{ seconds}} \approx 115,740 \text{ QPS}$$
Peak QPS (3x Factor): $$\text{Peak QPS} = 115,740 \times 3 \approx 347,220 \text{ QPS}$$
Memory Estimation (Sliding Window Counter):
- Let's assume each user key-value pair in Redis requires approximately 64 bytes of overhead (including key naming, hash values, TTL, and internal pointers).
- Actively tracked keys (daily active users with active sliding windows): $10,000,000$ active users.
- Memory Space: $$\text{Memory Required} = 10,000,000 \times 64 \text{ bytes} = 640,000,000 \text{ bytes} \approx 640 \text{ MB}$$
- Even with a 10x headroom to store historical statistics and multi-rule windows, a tiny 6.4 GB Redis instance can easily fit the entire working set in RAM.
- Network Bandwidth:
  - With a peak of 350,000 QPS hitting our rate-limiting cache, if each Redis network payload is roughly 250 bytes: $$\text{Bandwidth} = 350,000 \text{ req/sec} \times 250 \text{ bytes} \approx 87.5 \text{ MB/sec} = 700 \text{ Mbps}$$
  - This requires a structured Redis cluster backed by dedicated high-performance networks.

2. Deep Dive: Rate Limiting Algorithms

Choosing the right algorithm depends on our memory constraints, tolerance for traffic bursts, and precision requirements.

+------------------------+------------------------------------------+-----------------------+-----------------------------+
| Algorithm              | Pros                                     | Cons                  | Time & Memory Complexity    |
+------------------------+------------------------------------------+-----------------------+-----------------------------+
| Token Bucket           | Allows bursts; highly memory efficient   | Hard to debug spikes  | Time: O(1), Memory: O(1)    |
| Leaky Bucket           | Smooths output traffic; strictly constant| Delays user requests  | Time: O(1), Memory: O(1)    |
| Fixed Window Counter   | Extremely simple; very low memory        | Double limit at edges | Time: O(1), Memory: O(1)    |
| Sliding Window Log     | 100% precise; no edge burst anomalies     | High memory footprint | Time: O(log N), Memory: O(N)|
| Sliding Window Counter | High accuracy; low, constant memory      | Small approximation   | Time: O(1), Memory: O(1)    |
+------------------------+------------------------------------------+-----------------------+-----------------------------+

1. Token Bucket

Mechanism: A bucket of capacity $C$ is refilled with tokens at a constant rate $R$ per second. Each incoming request consumes one token. If no tokens are present, the request is rejected.
State: Tracks two values: last_refill_timestamp and current_token_count. Instead of a background cron job refilling the bucket (which wastes CPU cycles), tokens are recalculated lazily on each read: $$\text{Tokens to Add} = (\text{current_time} - \text{last_refill_time}) \times R$$

2. Leaky Bucket

Mechanism: Requests enter a queue (the bucket) of capacity $C$. The system processes requests at a constant, fixed rate (leaking). If the queue is full, incoming requests are dropped immediately.
State: Implemented using a FIFO buffer. Excellent for smoothing burst traffic to downstream data processing pipelines.

3. Fixed Window Counter

Mechanism: Divides time into windows (e.g., 1 minute). Each window tracks a counter. If the counter exceeds the threshold, requests are blocked until the next window boundary.
The Edge Burst Defect: If a user has a limit of 100 req/min and sends 100 requests at 11:59:59 and another 100 requests at 12:00:01, they successfully executed 200 requests within a 2-second span, bypassing the limit.

4. Sliding Window Log

Mechanism: To prevent edge bursts, the rate limiter stores a sorted set of timestamps for each request in memory. When a request arrives, it removes all timestamps older than $(\text{current_time} - 1\text{m})$, counts the remaining timestamps, and accepts/rejects.
Deficiency: High memory usage. Storing timestamps for 1,000 requests per user occupies significant space in memory.

5. Sliding Window Counter

Mechanism: Blends the simplicity of Fixed Window with the precision of Sliding Log. It maintains counters for the current window and the previous window. When a request arrives, the counter calculates the dynamic overlap: $$\text{Estimated Requests} = \text{Count}{\text{prev}} \times \left(1 - \frac{\text{Time}{\text{elapsed_in_current_window}}}{\text{Window_Duration}}\right) + \text{Count}_{\text{curr}}$$
This achieves near-perfect accuracy while keeping memory to a tiny, constant $O(1)$ footprint per user.

3. Core API Contract

Clients negotiate limits using standard headers. We expose these clearly inside our API Gateway:

Sample Request

GET /api/v1/resources/analytics HTTP/1.1
Host: api.codesprintpro.com
Authorization: Bearer c7289b52a510
X-Forwarded-For: 203.0.113.195

Response (Allowed Request)

HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 994
X-RateLimit-Reset: 1774312860

Response (Throttled Request)

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1774312860
Retry-After: 45

{
  "status": "error",
  "code": 429,
  "message": "API rate limit exceeded. Please retry after 45 seconds.",
  "retry_after_seconds": 45
}

4. High-Level Design (HLD)

To process hundreds of thousands of requests per second, the rate limiter cannot sit on a single server or require heavy database lock routines. We separate the concerns using a high-performance Gateway Middleware Architecture:

graph TD
    Client[Client Browser / App] -->|HTTPS Requests| LB[Global Anycast Load Balancer]
    LB -->|Route| AG1[API Gateway Node 1]
    LB -->|Route| AG2[API Gateway Node 2]
    
    subgraph Rate Limiting Layer
        AG1 -->|1. Check Rate Limit| R_Clust{Redis Cluster <br/> Primary/Replica}
        AG2 -->|1. Check Rate Limit| R_Clust
        
        AG1 -.->|2. Fallback (If Redis Down)| LocalMem1[Local In-Memory Cache]
        AG2 -.->|2. Fallback (If Redis Down)| LocalMem2[Local In-Memory Cache]
    end
    
    R_Clust -->|3. Limit Allowed| App[Application Microservices]
    LocalMem1 -->|3. Fail-Open / Limit Allowed| App
    
    subgraph Rule Configuration Synchronization
        Admin[Admin Dashboard] -->|Update Rules| DB[(PostgreSQL Master)]
        DB -->|Broadcast Rules change| SyncService[Config Sync Daemon]
        SyncService -->|Publish Update| RedisPubSub((Redis Pub/Sub))
        RedisPubSub -.->|Push to Local Memory| AG1
        RedisPubSub -.->|Push to Local Memory| AG2
    end

Traffic Flow Description:

Client Request: A client sends an API request, reaching the Global Load Balancer, which routes it to the closest API Gateway instance (e.g., Envoy or a custom Go gateway proxy).
Active Check: The Gateway intercepts the request, builds the rate-limiting key (e.g., rate:uid:9902:1m), and sends a batch request containing a Redis Lua script to a local Redis Cluster node.
Decisive Action:
- Allowed: If Redis approves, the gateway forwards the request to the application microservices.
- Throttled: If denied, the request returns immediately with a 429 Too Many Requests code, never reaching downstream applications.
Resilience Fallback: If the Redis Cluster becomes unreachable or drops connection pools, the rate limiter fails open (to preserve user experience) and degrades gracefully to evaluating local limits inside its own local memory cache.

5. Low-Level Design (LLD) & Data Models

1. Relational Schema: Rate-Limit Configuration & Rules

To manage dynamically configurable tenant limits, we maintain a PostgreSQL schema. Gateway nodes pull and cache these rules in-memory.

-- Represents specific system tenants or API plan tiers
CREATE TABLE tenant_plans (
    plan_id VARCHAR(50) PRIMARY KEY, -- e.g., 'free_tier', 'pro_tier', 'enterprise'
    max_requests_per_minute INT NOT NULL,
    max_requests_per_hour INT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Specific override rule definitions per client or path
CREATE TABLE rate_limit_rules (
    rule_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    identifier_type VARCHAR(20) NOT NULL, -- 'IP', 'USER_ID', 'API_KEY'
    identifier_value VARCHAR(255) NOT NULL, -- e.g., 'plan_free' or custom API key
    path_pattern VARCHAR(255) DEFAULT '*', -- e.g., '/api/v1/payment/*'
    max_burst_capacity INT NOT NULL, -- Token bucket capacity C
    refill_rate_per_sec DOUBLE PRECISION NOT NULL, -- Token bucket refill rate R
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE UNIQUE INDEX idx_rules_lookup ON rate_limit_rules(identifier_type, identifier_value, path_pattern);

2. Redis Key Structure: Sliding Window Counter

To keep memory low, we utilize a hash structure representing active window counts:

Key: rate:{identifier}:{window_timestamp} (e.g., "rate:usr_9901:1774312800")
Type: String (Integer Counter)
TTL: 120 seconds (Automatic eviction of expired windows)

6. Compilable Redis Lua Script

Distributed rate limiting suffers from Race Conditions when multiple concurrent requests read a counter, increment it, and write it back (the classic Read-Modify-Write anomaly).

We solve this cleanly using a Redis Lua Script. Redis executes Lua scripts atomically in a single thread, guaranteeing thread safety without requiring distributed locks.

Below is the production-grade Lua script for the Sliding Window Counter algorithm:

-- KEYS[1]: Current window key (e.g., "rate:usr_9901:1774312800")
-- KEYS[2]: Previous window key (e.g., "rate:usr_9901:1774312740")
-- ARGV[1]: Current window timestamp (seconds)
-- ARGV[2]: Window unit duration (e.g., 60 seconds)
-- ARGV[3]: Max limit threshold (e.g., 100)
-- ARGV[4]: Current request weight (typically 1)

local current_counter = redis.call('GET', KEYS[1])
local previous_counter = redis.call('GET', KEYS[2])

local current_count = current_counter and tonumber(current_counter) or 0
local previous_count = previous_counter and tonumber(previous_counter) or 0

-- Compute current position within the active window (ratio between 0.0 and 1.0)
local current_time = tonumber(ARGV[1])
local window_duration = tonumber(ARGV[2])
local current_window_start = math.floor(current_time / window_duration) * window_duration
local elapsed_time_in_window = current_time - current_window_start
local weight = (window_duration - elapsed_time_in_window) / window_duration

-- Estimate active requests
local estimated_requests = math.floor(previous_count * weight + current_count)
local max_limit = tonumber(ARGV[3])
local request_weight = tonumber(ARGV[4])

if estimated_requests + request_weight > max_limit then
    -- Throttled! Return 0 and the current estimated load
    return {0, estimated_requests}
else
    -- Allowed! Increment the current window counter and set/update TTL
    local new_count = redis.call('INCRBY', KEYS[1], request_weight)
    if new_count == request_weight then
        -- Set TTL slightly longer than window duration to clean up memory
        redis.call('EXPIRE', KEYS[1], window_duration * 2)
    end
    return {1, estimated_requests + request_weight}
end

7. Staff Engineer Perspective (Operational Deep Dive)

1. Mitigating Replication Lag in Global Redis Clusters

In multi-region setups, synchronizing rate limits globally across WANs adds hundreds of milliseconds of latency, violating our < 2ms SLA.

The Strategy: Rate limit locally in each region using local Redis clusters, and sync async stats to a global aggregator.
Alternatively, assign fractional quotas to each region. For example, if a user has a global limit of 100 req/sec, allocate 50 req/sec to US-East and 50 req/sec to EU-West based on historical traffic patterns.

2. Clock Drift & NTP Synchronization Anomalies

Sliding window algorithms depend strictly on Unix timestamps. If clock drift occurs across API gateway servers or Redis hosts, calculations fail, causing unexpected throttling or massive leaks.

Mitigation: Standardize on Redis server time instead of local client time. Use redis.call('TIME') inside Lua scripts to retrieve a unified time reference directly from the database server, bypassing client NTP synchronization errors.

3. Resilience: Fail-Open vs. Fail-Closed & Technical Trade-offs

When Redis clusters experience partitions or crash, how should the rate limiter behave?

The Trade-off:
- Fail-Closed: Blocks all traffic. Guarantees safety for downstream databases, but crashes customer experience completely.
- Fail-Open (Best Practice): Allows requests to proceed. If Redis fails, log the alert, fall back to assessing local gateway memory limits, and let the user access the system. It is always better to allow a noisy neighbor to overload a service briefly than to take down 100% of functional traffic for all customers.

8. Candidate Verbal Mock Interview Script

Interviewer: "Your design depends heavily on Redis. What happens if the connection between your API Gateway and your Redis cluster fails under peak load? How do you prevent your backend databases from collapsing?"

Candidate: "This is a classic operational failure scenario. My rate limiter architecture implements a strict Fail-Open with Local Fallback design pattern.

If our API Gateway detects a Redis connection timeout (configured with a tight 50ms fallback threshold), we do not block the user. Instead, we drop into a degraded local-memory rate-limiting mode. Every gateway node has an in-memory token bucket cache. We evaluate the client's request against this local cache.

While this local fallback means a client could technically bypass their global limit briefly (since traffic isn't shared across all gateway nodes), it preserves the user experience during a cache partition.

To ensure the downstream databases do not collapse under this load, we pair our rate limiter with two other structural safeguards:

Downstream Circuit Breakers: API gateways wrap backend microservice routes in circuit breakers. If backend database response times spike past 500ms, the gateway immediately trips the breaker, returning an HTTP 503 Service Unavailable directly to the client without querying the DB.
Internal Priority Queues: If traffic surges past our scaling ceiling during a rate-limiter failure, we prioritize critical transaction endpoints (like checkout and payment) over non-critical paths (like product searches and recommendations) using gateway route shedding."

System Design: Designing a Global Distributed Rate Limiter

From vague architecture answers to staff-level trade-off thinking.