System Design Module 10: API Design & Rate Limiting

Introduction to API Design and Gateways

In modern distributed architectures, APIs serve as the core service contract between internal microservices and external clients. Designing a clean, scalable, and versioned API contract is a critical software engineering responsibility. A poorly structured API leads to tight coupling between services, operational complexity during upgrades, and security vulnerabilities.

Furthermore, public-facing services are susceptible to denial-of-service (DoS) attacks, brute-force requests, scraping, and downstream resource exhaustion. To protect our backend infrastructure, we must integrate a robust Rate Limiting layer. This layer acts as a gateway defense, enforcing quotas and throttling traffic before it can saturate our databases, thread pools, and application servers.

Requirements and System Goals

Functional Requirements

Flexible API Protocols: Support external RESTful JSON interfaces for web and mobile clients, and high-performance gRPC protobuf interfaces for internal service-to-service communication.
Deterministic Route Versioning: Support multiple versioning strategies (path-based and header-based) to allow parallel API evolution without breaking legacy client integrations.
Idempotency Assurance: Enforce idempotency on non-safe HTTP mutations (POST, PUT) using client-supplied idempotency keys to ensure retry safety.
Granular Rate-Limit Rule Configuration: Support rate limiting based on multiple identification keys (Client IP, User ID, API Token) and allow custom limits per route (e.g., login route has lower limits than search route).
Configurable Throttle Actions: Support rate limit actions such as dropping requests (HTTP 429 Too Many Requests), queuing requests for delayed execution, or routing to low-priority queues.

Non-Functional Requirements

Sub-Millisecond Throttle Overhead: The rate-limiting lookup and evaluation must add less than 1.5 milliseconds to the request lifecycle.
High Availability and Fault Tolerance: The rate limiter must be highly available. If the centralized rate-limiting store (such as Redis) experiences an outage, the API Gateway must fail-open gracefully rather than denying service to legitimate users.
Accuracy and Scale: The rate limiter must maintain accurate request counters across a distributed gateway cluster without suffering from consistency anomalies or split-brain conditions.
Defense Against Thundering Herds: Prevent rate limit quota resets from triggering synchronized client retry storms.

API Interfaces and Service Contracts

We define the interface contracts for rate limit validation and checking.

1. REST Rate Limit Status Contract

Clients can query their current rate-limit consumption status via a lightweight endpoint:

Endpoint: GET /api/v1/users/me/rate-limit-status
Response Payload (JSON):

{
  "user_id": "usr_902810",
  "client_ip": "198.51.100.42",
  "limits": [
    {
      "route": "/api/v1/checkout",
      "limit_quota": 60,
      "remaining_quota": 42,
      "window_seconds": 60,
      "reset_epoch_seconds": 1774834860
    },
    {
      "route": "/api/v1/search",
      "limit_quota": 1000,
      "remaining_quota": 980,
      "window_seconds": 3600,
      "reset_epoch_seconds": 1774838400
    }
  ]
}

2. gRPC Rate Limit Check Service (Internal Gateway Check)

API Gateways query this service synchronously for every incoming request:

syntax = "proto3";

package database.limiter.v1;

service RateLimiterService {
  // Check if a request satisfies configured rate-limit quotas
  rpc CheckRateLimit(RateLimitRequest) returns (RateLimitResponse);
}

message RateLimitRequest {
  string client_key = 1;      // Can be User ID, API Key, or Client IP
  string api_route = 2;        // Target endpoint (e.g., "/checkout")
  int32 weight = 3;           // Cost of the operation (defaults to 1)
  int64 request_timestamp = 4; // Epoch millisecond of arrival
}

message RateLimitResponse {
  bool allowed = 1;           // True if request is under quota
  int32 limit_quota = 2;      // Total allowed requests in current window
  int32 remaining_quota = 3;  // Remaining quota available
  int64 reset_time_ms = 4;    // Epoch ms when quota resets
  string error_message = 5;
}

High-Level Design and Visualizations

A distributed rate limiter operates at the edge of our infrastructure, integrated into the API Gateway layer.

Distributed Rate Limiting System Architecture

graph TD
    Client[Client Applications] -->|1. Request with API Key / Token| LB[Edge Load Balancer]
    LB -->|2. Route to Gateway| GW[API Gateway Cluster Node]
    
    subgraph Rate Limiting Coordinator
        GW -->|3. Check Rate Limit RPC| RLS[Distributed Rate Limiter Service]
        RLS -->|4. Query counter / Apply sliding window| Redis[(Shared Redis Cache Cluster)]
    end

    subgraph Service Backend Mesh
        GW -->|5a. Allowed: Forward Request| AuthSvc[Auth Service]
        GW -->|5a. Allowed: Forward Request| OrderSvc[Order Service]
        GW -->|5b. Blocked: Return HTTP 429| Client
    end

    style Client fill:#f8f9fa,stroke:#343a40
    style Redis fill:#f8d7da,stroke:#dc3545
    style GW fill:#fff3cd,stroke:#ffc107

Rate Limiting Algorithms: Token Bucket vs. Leaky Bucket

graph TD
    subgraph Token Bucket (Burst Friendly)
        Inflow[Token Refiller: Refills at Rate R] -->|Add tokens| Bucket[Bucket: Capacity C]
        Request[Incoming Request] -->|Acquire Token| Bucket
        Bucket -->|Success| Forward[Forward Request]
        Bucket -->|No Tokens| Drop[Drop Request / HTTP 429]
    end

    subgraph Leaky Bucket (Constant Rate Smoothing)
        Request2[Incoming Request] -->|Add to queue| Queue[FIFO Queue: Size N]
        Queue -->|Queue full| Drop2[Drop Request / HTTP 429]
        Queue -->|Drip at constant rate R| Process[Process Request]
    end

    style Token Bucket text-align:left
    style Leaky Bucket text-align:left

Low-Level Design and Schema Strategies

To record sliding-window request histories without race conditions, we utilize Redis sorted sets (ZSET) and coordinate updates using atomic Lua scripts.

Sliding Window Log Algorithm (Redis Sorted Set)

The sliding window log tracks the exact arrival timestamp of every request made by a client within the window.

Key Format: ratelimit:{client_key}:{route}
ZSET Member: Timestamp of the request (epoch milliseconds)
ZSET Score: Timestamp of the request (epoch milliseconds)

When a request arrives at time T:

Remove all elements in the ZSET with a score less than T - window_size (clearing old request logs).
Query the cardinality of the ZSET using ZCARD.
If the count is less than the limit, append the new request timestamp to the ZSET using ZADD and allow the request.
If the count is equal to or greater than the limit, deny the request.

Redis Lua Script (Atomic Execution)

To prevent race conditions where two gateway nodes read the same ZSET size before writing, we package these steps into a single atomic Lua script:

local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local clear_before = now - window

-- Remove expired logs
redis.call('ZREMRANGEBYSCORE', key, '-inf', clear_before)

-- Count remaining requests in window
local current_requests = redis.call('ZCARD', key)

if current_requests < limit then
    -- Add current request
    redis.call('ZADD', key, now, now)
    -- Set TTL to match window to save memory
    redis.call('EXPIRE', key, math.ceil(window / 1000))
    return {1, limit - current_requests - 1} -- Allowed, remaining quota
else
    return {0, 0} -- Blocked, 0 remaining quota
end

Relational Schema: Rate Limit Policies (PostgreSQL)

Gateway configurations and API key quota policies are persisted in a relational database for administrative changes:

CREATE TABLE client_rate_limit_policies (
    policy_id VARCHAR(64) PRIMARY KEY,
    client_tier VARCHAR(32) NOT NULL CHECK (client_tier IN ('FREE', 'BASIC', 'PREMIUM', 'ENTERPRISE')),
    route_pattern VARCHAR(256) NOT NULL, -- e.g., "/api/v1/checkout"
    limit_quota INT NOT NULL,            -- e.g., 100
    window_seconds INT NOT NULL,         -- e.g., 60
    burst_capacity INT NOT NULL DEFAULT 0,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(client_tier, route_pattern)
);

Scaling and Operational Challenges: Calculations & Formulations

Distributed rate limiters are subject to memory capacity bottlenecks. Let us calculate the memory required to track sliding-window request histories in Redis.

Back-of-the-Envelope Memory Sizing for Redis Sliding Window Log

Let us define:

$N_{\text{active_users}}$: Active users using the system concurrently (e.g., 10,000,000 active users).
$R_{\text{avg}}$: Average requests per user per hour (e.g., 60 requests/hour, which is 1 request/minute).
$S_{\text{zset_overhead}}$: Average memory overhead of a sorted set node in Redis. In Redis, each member in a ZSET uses a skip list and a hash table entry, costing approximately 128 bytes of metadata.
$S_{\text{timestamp}}$: Size of the score and member string (64-bit float score + 64-bit int string value $\approx$ 16 bytes).
$S_{\text{key}}$: Size of key string ratelimit:{user_id}:{route} (approx. 50 bytes).

Step 1: Calculate Total Request Logs stored in memory

For 10,000,000 users making an average of 60 requests per hour, with a sliding window duration of 1 hour (3600 seconds):

$$\text{Total Requests In Window} = 10,000,000 \text{ users} \times 60 \text{ requests} = 600,000,000 \text{ entries}$$

Step 2: Calculate Key Storage Overhead in Redis

Each active user requires a distinct key in the Redis database:

$$\text{Key Memory} = 10,000,000 \text{ keys} \times (S_{\text{key}} + \text{Redis Hash Overhead})$$

$$\text{Redis Hash Key Overhead} \approx 64 \text{ bytes}$$

$$\text{Key Memory} = 10,000,000 \times (50 \text{ bytes} + 64 \text{ bytes}) = 1.14 \text{ GB}$$

Step 3: Calculate ZSET Element Storage Overhead

Each request entry in the sliding window log requires a score and member node:

$$\text{Element Memory} = 600,000,000 \text{ entries} \times (S_{\text{zset_overhead}} + S_{\text{timestamp}})$$

$$\text{Element Memory} = 600,000,000 \times (128 \text{ bytes} + 16 \text{ bytes}) = 86.40 \text{ GB}$$

Step 4: Total Memory Required

The total memory required in Redis is:

$$\text{Total Redis RAM} = \text{Key Memory} + \text{Element Memory} \approx 1.14 \text{ GB} + 86.40 \text{ GB} = 87.54 \text{ GB}$$

Memory Optimization Strategy

An $87.54 \text{ GB}$ footprint for rate-limiting data is expensive to scale in RAM.

The Optimization: To reduce memory, we can swap the Sliding Window Log algorithm for the Sliding Window Counter algorithm. This algorithm only stores two integer counters per user key (the request count of the previous window and the current window) instead of logging individual timestamps:

$$\text{Estimated Count} = \text{Count}{\text{prev}} \times \left(1 - \frac{\text{Time}{\text{current}}}{\text{Window Size}}\right) + \text{Count}_{\text{current}}$$

This optimization reduces element memory footprint from 600,000,000 timestamp logs to simple key-value integer pairs:

$$\text{Total Memory (Optimized)} \approx 10,000,000 \times (50 \text{ bytes} + 2 \text{ integers} \times 8 \text{ bytes}) \approx 0.66 \text{ GB}$$

Using Sliding Window Counters instead of raw log lists drops memory consumption by over 99 percent, saving cost while preserving approximate rate-limit accuracy.

Trade-offs and Architectural Alternatives

Choosing the correct rate limiting algorithm requires balancing memory consumption, execution latency, and burst tolerance.

Rate Limiting Algorithm Comparison

Dimension / Choice	Token Bucket	Leaky Bucket	Fixed Window	Sliding Window Log	Sliding Window Counter
Memory Consumption	Low (1 key per client)	Low (1 queue per client)	Very Low (1 integer per client)	High (Log of timestamps)	Medium (2 integers per client)
Burst Tolerance	High (Supports bursts up to bucket capacity)	Low (Smooths traffic to a constant rate)	High (But allows double-quota bursts at window edges)	None	Medium
Write Lock Contention	High	Medium	Low	High	Medium
Algorithm Accuracy	High	High	Low (Suffers from edge bursts)	Absolute (100% accurate)	Good approximation
Implementation Complexity	Low	Medium	Very Low	High	Medium

Core Trade-offs

Token Bucket vs. Leaky Bucket:
- Token Bucket: Perfect for user-facing APIs where bursty behavior is expected (e.g., loading a dashboard page that triggers multiple asset queries).
- Leaky Bucket: Best for background processing engines or outbound payment queues where we must protect a downstream system from sudden request spikes by enforcing a smooth, constant rate of consumption.

Failure Modes and Fault Tolerance Strategies

Operating a distributed rate limiting cluster at scale exposes the system to partition risks. We design for resilience under failure.

1. Centralized Rate Limit Store Outage (Fail-Open Policy)

If the shared Redis cache cluster crashes or suffers a network partition, the API Gateway nodes cannot verify rate limits. If the gateway fails-closed, it blocks all legitimate traffic, creating an outage.

Mitigation: Enforce a Fail-Open Policy. If the gRPC call to the RateLimiterService returns a timeout or 5xx error, the API Gateway logs the error to an observability channel, increments a fallback counter, and allows the request through. Protecting downstream resources is deferred to circuit breakers and database queue thread-pools.

2. Multi-Region Split-Brain and Local Token Pooling

In globally distributed active-active deployments (e.g., API Gateways in US-East, EU-West, and AP-East), sync latency to a centralized Redis cluster increases network check overhead.

Mitigation: Implement Token Batching/Pooling. Each local gateway instance requests a batch of tokens (e.g., 100 tokens at a time) from the centralized database asynchronously and keeps them in a local memory bucket. Client requests deduct from the local bucket. When the local bucket falls below a threshold, the gateway requests another batch. This limits cross-region RPC check latencies to asynchronous background operations.

3. Redis Lua Script Timeouts and Throttling

Under extreme load, the CPU utilization on Redis nodes can hit 100 percent, causing Lua scripts to block.

Mitigation: Implement local Rate Limit Bypass rules for static assets and deploy edge Web Application Firewalls (WAFs) to handle coarse-grained IP blocking before requests hit the API Gateway cluster.

Verbal Script

Interviewer: "How would you design a distributed rate limiter that operates across a cluster of API Gateway nodes, and what are the trade-offs of the algorithms you could use?"

Candidate: "To design a distributed rate limiter, I would integrate it directly into the API Gateway cluster, utilizing a shared high-performance cache like Redis to store rate-limit counters. This ensures that a client's request quota is tracked accurately regardless of which gateway node receives their traffic.

For the lookup mechanism, I would use the Sliding Window Counter algorithm. When a request arrives, the gateway executes an atomic Redis Lua script that reads the request count of both the current and previous window intervals. It then calculates a weighted average of the client's consumption.

If the client is under their quota, the script increments the counter and returns success. Using a Lua script is critical because it runs atomically within Redis, preventing race conditions such as double-spending of tokens during concurrent requests.

In terms of algorithmic trade-offs, we must choose between the memory usage and the accuracy of the system.

The Sliding Window Log algorithm is highly accurate because it records the exact timestamp of every request. However, it requires significant memory to store thousands of timestamp elements per user key in a sorted set. Under high traffic, this can exhaust our Redis memory capacity.

The Token Bucket algorithm is excellent for user-facing applications because it handles bursty traffic while using minimal memory, needing only two fields to track current token count and last update timestamp.

For background tasks, I would choose a Leaky Bucket to smooth out traffic peaks and protect downstream databases from write saturation."

Interviewer: "What happens if the centralized Redis rate-limit store goes offline? How do you prevent the API Gateways from crashing?"

Candidate: "If the centralized rate-limit store crashes, we must follow the Fail-Open resilience pattern.

In my design, the API Gateway client calls the Rate Limiter Service with a strict execution timeout of less than 2 milliseconds.

If the call times out or returns a system error, the gateway node catches the exception, registers a metric alert to notify engineers of the rate limit store degradation, and allows the user request to proceed through to the downstream microservices.

It is better to risk overloading downstream services than to guarantee a complete site outage by blocking all legitimate traffic.

To protect the downstream databases while running in fail-open mode, the backend services rely on their own internal circuit breakers, bulkhead thread pools, and local request-queue backpressure mechanisms."