An e-commerce checkout and payment system is the financial engine of any retail platform. During flash sales or product drops, this system experiences massive write-concurrency spikes. Thousands of users attempt to purchase the same limited-inventory item at the exact same second, testing the limits of database locks, network connections, and third-party APIs.
Designing a checkout system requires balancing strict data consistency with high write availability. We must guarantee that inventory is never oversold (overselling leads to customer disappointment and legal issues), while ensuring that the checkout flow is fast and responsive. Furthermore, because payment gateways are external networks with high latencies, we must isolate external API calls from local database transactions to prevent thread exhaustion.
This guide lays out an architecture for a high-concurrency, resilient e-commerce checkout and payment platform capable of processing 5,000 completed orders per second.
System Requirements
An enterprise checkout system must coordinate state changes across multiple service boundaries. We divide these specifications into functional capabilities, non-functional limits, and scale assumptions.
Functional Requirements
- Checkout Session Management: Create and track a checkout session containing products, quantities, tax, discounts, and shipping details.
- Idempotency Guarantee: Prevent duplicate orders or duplicate payment captures if a client retries a request or double-clicks the purchase button.
- Inventory Reservation: Temporarily hold inventory for a user during checkout (e.g., 10 minutes) and release it back to the stock pool if they fail to pay.
- Payment Processing: Integrate with external payment gateways (such as Stripe, Adyen, or PayPal) to authorize and capture funds.
- Order State Machine: Transition order states deterministically from CREATED to PAID, FULFILLED, or CANCELLED.
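The order state machine requirement can be illustrated with a small transition table. This is a minimal sketch: the guide names the four states but not every allowed edge, so the `ALLOWED_TRANSITIONS` mapping and the `transition` helper below are assumptions made for illustration.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "CREATED"
    PAID = "PAID"
    FULFILLED = "FULFILLED"
    CANCELLED = "CANCELLED"


# Assumed edges: the guide lists the states only, so this table is illustrative.
ALLOWED_TRANSITIONS = {
    OrderState.CREATED: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.FULFILLED, OrderState.CANCELLED},
    OrderState.FULFILLED: set(),   # terminal
    OrderState.CANCELLED: set(),   # terminal
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise ValueError if the edge is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting unknown edges in one place keeps retries and out-of-order events from moving an order backwards, for example from FULFILLED to CREATED.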
Non-Functional Requirements
- High Concurrency Write Performance: Support thousands of checkouts per second without blocking database connections.
- High Availability: The checkout API must remain online even during backend processing backlogs, queueing transactions where possible.
- Strong Consistency for Inventory: Guarantee that inventory stock records are decremented accurately, avoiding double-selling under concurrent threads.
- Isolation of External Latency: Ensure that slow third-party payment gateways (which can take several seconds to respond) do not consume internal database connection pools.
Scale Assumptions
- Peak Order Throughput: 5,000 completed orders per second.
- Checkout Ingestion Requests: 20,000 checkout attempts/second during flash sales.
- Average Items Per Order: 3 unique products.
- Active Inventory Pool: 1,000,000 unique SKU items.
API Design and Interface Contracts
The checkout interface uses transactional REST endpoints and high-speed gRPC service definitions to communicate with the inventory and payment subsystems.
1. Ingest Checkout Intent (HTTP POST /v1/checkouts)
Invoked by the client frontend to lock in cart items and create a checkout session.
Request Headers:
Idempotency-Key: chk_idemp_99812_881a2
Authorization: Bearer jwt_customer_token_payload...
Request Payload:
{
"cartId": "cart_uuid_00182ab",
"shippingAddressId": "addr_uuid_55162",
"paymentMethodId": "pm_stripe_token_9918",
"currency": "USD"
}
Response Payload (201 Created):
{
"checkoutSessionId": "cs_uuid_3381920ac",
"status": "INVENTORY_RESERVED",
"totalAmountCents": 12900,
"currency": "USD",
"expiresAt": "2026-06-07T12:22:13Z"
}
2. Payment Gateway Webhook Contract (HTTP POST /v1/payments/webhooks)
Receives asynchronous payment status updates from the external payment gateway.
{
"eventId": "evt_stripe_99812736",
"eventType": "payment_intent.succeeded",
"created": 1770289933,
"data": {
"paymentIntentId": "pi_stripe_3381920ac",
"metadata": {
"checkoutSessionId": "cs_uuid_3381920ac"
},
"amountReceivedCents": 12900,
"currency": "usd"
}
}
3. Inventory Reservation gRPC Contract
The Checkout Service communicates with the Inventory Service using gRPC to execute high-speed stock leases.
syntax = "proto3";
package codesprintpro.checkout.inventory.v1;
service InventoryService {
rpc ReserveStock (ReserveStockRequest) returns (ReserveStockResponse);
rpc ReleaseStock (ReleaseStockRequest) returns (ReleaseStockResponse);
rpc CommitStock (CommitStockRequest) returns (CommitStockResponse);
}
message StockItem {
string sku = 1;
int32 quantity = 2;
}
message ReserveStockRequest {
string checkout_session_id = 1;
repeated StockItem items = 2;
int64 lease_duration_seconds = 3;
}
message ReserveStockResponse {
enum Status {
SUCCESS = 0;
OUT_OF_STOCK = 1;
INSUFFICIENT_STOCK = 2;
}
Status status = 1;
string reservation_token = 2;
repeated string unavailable_skus = 3;
}
message ReleaseStockRequest {
string checkout_session_id = 1;
string reservation_token = 2;
}
message ReleaseStockResponse {
bool success = 1;
}
message CommitStockRequest {
string checkout_session_id = 1;
string reservation_token = 2;
}
message CommitStockResponse {
bool success = 1;
}
High-Level Architecture
The system decouples synchronous API actions from background payment validations and inventory commitments using a Transactional Outbox pattern.
End-to-End Checkout Processing Pipeline
This pipeline processes client submissions, locks inventory in Redis, and asynchronously updates databases and payment gateways.
sequenceDiagram
autonumber
participant Client as Web/Mobile Client
participant Gate as API Gateway
participant Check as Checkout Service
participant Redis as Redis Cache Cluster
participant PayGate as Payment Gateway (Stripe)
participant Outbox as Outbox Processor
participant DB as Postgres Orders DB
Client->>Gate: POST /v1/checkouts (Idempotency Key)
Gate->>Check: Forward request
Check->>Redis: Check Idempotency Key (Set NX PX)
Redis-->>Check: OK (First request)
note over Check, Redis: Step 1: Reserve Stock with 10-minute Lease
Check->>Redis: Eval SHA (Acquire multi-SKU decrement locks)
Redis-->>Check: Success (Tokens returned)
Check->>DB: Write Checkout Session & Outbox event (PENDING)
DB-->>Check: Commit Transaction
Check->>PayGate: POST /v1/charges (Stripe Token) (WAN Call)
PayGate-->>Check: Returns 200 OK (Charged)
Check->>DB: Update Checkout Session to COMPLETED & Write Outbox
DB-->>Check: Commit Transaction
Check-->>Client: Return HTTP 201 (Created)
note over Outbox, DB: Step 2: Background Async Commit
Outbox->>DB: Read Outbox pending orders
Outbox->>DB: Mark stock as permanently decremented in SQL
Outbox->>Redis: Delete temporary stock leases
Redis Distributed Inventory Reservation Sequence
To avoid row-lock contention in the database under high concurrency, inventory stock counts are mirrored in Redis. The Checkout Service leases stock from Redis before writing to the database.
graph TD
ClientReq[Checkout Request] -->|SKUs + Quantities| CheckSvc[Checkout Service]
CheckSvc -->|1. Run Lua Script| Redis[Redis Cluster]
subgraph Redis Inventory Check
Redis -->|2. Check SKU A Stock| StockA{Stock >= Req?}
Redis -->|3. Check SKU B Stock| StockB{Stock >= Req?}
StockA -->|No| Rollback[Abort: Cancel changes]
StockB -->|No| Rollback
StockA -->|Yes| Decr[Decrement Stock Counts]
StockB -->|Yes| Decr
Decr -->|4. Create Lease keys with TTL=10m| Lease[Write: lease:sku:session_id]
end
Rollback -->|Return Status| CheckSvc
Lease -->|5. Return Success + Lease Tokens| CheckSvc
CheckSvc -->|If Success: Proceed to Payment| Pay[Payment API]
CheckSvc -->|If Failure: Return Error| Err[HTTP 409 Out of Stock]
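The all-or-nothing reservation flow in the diagram can be modeled in plain Python. This is a single-process stand-in for the Redis Lua script, not the script itself; the `InMemoryInventory` class and its method names are hypothetical. One detail worth noting: a Redis TTL only deletes the lease key and does not restore the stock counter, so the sketch returns stock explicitly when a lease expires.

```python
import time
import uuid


class InMemoryInventory:
    """Single-process stand-in for the Redis stock mirror (illustrative only)."""

    def __init__(self, stock, clock=time.monotonic):
        self.stock = dict(stock)   # sku -> available units
        self.leases = {}           # token -> (items, expires_at)
        self.clock = clock

    def reserve(self, items, lease_seconds=600):
        """All-or-nothing: returns (token, []) or (None, unavailable_skus)."""
        self.expire_leases()
        unavailable = [sku for sku, qty in items.items()
                       if self.stock.get(sku, 0) < qty]
        if unavailable:
            return None, unavailable       # nothing was decremented
        for sku, qty in items.items():
            self.stock[sku] -= qty
        token = uuid.uuid4().hex
        self.leases[token] = (dict(items), self.clock() + lease_seconds)
        return token, []

    def release(self, token):
        """Return leased units to the pool; False if the lease is unknown."""
        lease = self.leases.pop(token, None)
        if lease is None:
            return False
        for sku, qty in lease[0].items():
            self.stock[sku] += qty
        return True

    def commit(self, token):
        """Make the decrement permanent by dropping the lease."""
        return self.leases.pop(token, None) is not None

    def expire_leases(self):
        """Release every lease whose deadline has passed."""
        now = self.clock()
        expired = [t for t, (_, deadline) in self.leases.items() if deadline <= now]
        for token in expired:
            self.release(token)
```

In the Redis version, the check and decrement for all SKUs would run inside one script so that no other command interleaves between them.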
Low-Level Design and Schema
All checkout session data, inventory allocation states, transactions, and client idempotency records are stored in a PostgreSQL relational database.
-- Represents an active or historic checkout session
CREATE TABLE checkout_sessions (
checkout_session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
cart_id UUID NOT NULL,
total_amount_cents INT NOT NULL,
currency VARCHAR(3) NOT NULL DEFAULT 'USD',
session_status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, COMPLETED, EXPIRED, FAILED
idempotency_key VARCHAR(256) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL,
CONSTRAINT uk_idempotency_key UNIQUE (user_id, idempotency_key)
);
CREATE INDEX idx_checkout_expiry ON checkout_sessions (expires_at) WHERE session_status = 'PENDING';
-- Line items linked to a checkout session
CREATE TABLE checkout_order_items (
order_item_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
checkout_session_id UUID NOT NULL REFERENCES checkout_sessions(checkout_session_id) ON DELETE CASCADE,
product_id UUID NOT NULL,
quantity INT NOT NULL,
unit_price_cents INT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_items_checkout ON checkout_order_items (checkout_session_id);
-- Logs inventory allocations (leases) to prevent overselling
CREATE TABLE inventory_allocations (
allocation_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
product_id UUID NOT NULL,
checkout_session_id UUID NOT NULL REFERENCES checkout_sessions(checkout_session_id) ON DELETE CASCADE,
quantity_allocated INT NOT NULL,
allocation_status VARCHAR(32) NOT NULL DEFAULT 'RESERVED', -- RESERVED, COMMITTED, RELEASED
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_allocations_product ON inventory_allocations (product_id, allocation_status);
CREATE INDEX idx_allocations_expiry ON inventory_allocations (expires_at) WHERE allocation_status = 'RESERVED';
-- Tracks financial gateway transaction status
CREATE TABLE payment_transactions (
transaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
checkout_session_id UUID NOT NULL REFERENCES checkout_sessions(checkout_session_id),
payment_gateway VARCHAR(64) NOT NULL, -- STRIPE, ADYEN, PAYPAL
gateway_reference_id VARCHAR(256) NOT NULL UNIQUE,
amount_cents INT NOT NULL,
transaction_status VARCHAR(32) NOT NULL, -- AUTHORIZED, CAPTURED, REFUNDED, FAILED
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_payment_checkout ON payment_transactions (checkout_session_id);
-- Enforces idempotency for API requests and stores previous responses
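CREATE TABLE idempotent_requests (
request_key VARCHAR(256) PRIMARY KEY, -- Combination of user_id + client_idempotency_key
request_status VARCHAR(32) NOT NULL DEFAULT 'PROCESSING', -- PROCESSING, COMPLETED
response_status_code INT, -- NULL while the request is still PROCESSING
response_body JSONB, -- NULL while the request is still PROCESSING
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);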
Schema Rationale & Index Optimization
- idx_checkout_expiry: A partial index restricted to session_status = 'PENDING'. A background worker periodically scans this index to find and expire checkout sessions that have passed their expiration timestamp. Because only pending rows are indexed, the index stays small and the worker avoids full table scans.
- uk_idempotency_key: A unique constraint on (user_id, idempotency_key) that acts as a database-level safety guard. Even if concurrent requests slip past the cache layer, the database rejects the duplicate row.
- idx_allocations_product: Speeds up inventory queries by mapping active allocations directly to product identifiers.
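The effect of the unique constraint can be demonstrated with a small runnable sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, with a trimmed-down table; the `create_session` helper is hypothetical and only shows the insert-then-fall-back pattern.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; only the columns needed for the
# demonstration are kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE checkout_sessions (
        checkout_session_id TEXT PRIMARY KEY,
        user_id TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        session_status TEXT NOT NULL DEFAULT 'PENDING',
        UNIQUE (user_id, idempotency_key)
    )
    """
)


def create_session(session_id, user_id, idempotency_key):
    """Insert a session; on a duplicate key, return the existing session id.

    Returns (session_id, created) where created is False for a replay.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO checkout_sessions "
                "(checkout_session_id, user_id, idempotency_key) VALUES (?, ?, ?)",
                (session_id, user_id, idempotency_key),
            )
        return session_id, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT checkout_session_id FROM checkout_sessions "
            "WHERE user_id = ? AND idempotency_key = ?",
            (user_id, idempotency_key),
        ).fetchone()
        return row[0], False
```

The same key used by a different user creates a new session, because the constraint covers the pair (user_id, idempotency_key) rather than the key alone.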
Scaling Challenges and Capacity Estimation
Flash sales generate high write volumes. We analyze the system requirements for write IOPS, network ingress, and Redis memory allocations.
1. Flash Sale Write IOPS Calculations
- Assumptions:
  - Ingestion rate during flash sale = $20,000$ checkout requests/second
  - Success conversion rate = $5,000$ completed checkouts/second
  - Each checkout requires 3 database writes:
    - 1 to write the initial checkout_sessions status (PENDING)
    - 1 to write the payment_transactions record
    - 1 to update the checkout_sessions status (COMPLETED)
  - Write amplification (indexes, transaction logs) = 2.0x
- Calculations:
  $$\text{Database Writes Per Second} = 5,000 \text{ checkouts/s} \times 3 \text{ database writes} = 15,000 \text{ database writes/second}$$
  $$\text{Total Write IOPS Capacity Required} = 15,000 \times 2.0 = 30,000 \text{ IOPS}$$
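A single relational database primary is unlikely to sustain 30,000 write IOPS of contended transactional traffic without write queues or sharding. The estimate also leaves out the checkout_order_items and inventory_allocations rows, so it should be read as a lower bound. To scale this, we partition the checkout_sessions and checkout_order_items tables across 8 Postgres shards using a hash of the user_id.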
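Hash-based routing to the 8 shards might look like the sketch below. The choice of SHA-256 and the `shard_for_user` helper are assumptions; the guide only specifies "a hash of the user_id". A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib

NUM_SHARDS = 8  # shard count taken from the text above


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shards.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo routing keeps all of a user's sessions on one shard, but changing the shard count later remaps most keys, so a lookup table or consistent hashing may be worth considering if resharding is expected.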
2. Redis Memory Footprint for Inventory Leases
- Assumptions:
  - Lease creation rate = $5,000$ checkouts/second
  - Inventory lease duration = $600$ seconds (10 minutes)
  - Active reservation lease count = $5,000 \times 600 = 3,000,000$ active leases (an upper bound that assumes every lease is held for its full duration)
  - Each lease key-value entry in Redis uses approximately $250$ bytes of memory.
- Calculations:
  $$\text{Raw Lease Data Memory} = 3,000,000 \text{ leases} \times 250 \text{ bytes} = 750,000,000 \text{ bytes} \approx 750 \text{ MB}$$
  Adding Redis internal overhead (pointers, hash dictionaries, replication buffers), we apply a 3.0x multiplier:
  $$\text{Total Redis RAM Footprint} = 750 \text{ MB} \times 3.0 \approx 2.25 \text{ GB}$$
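This memory footprint fits comfortably on a single Redis node. We configure a replicated Redis deployment (1 primary, 2 read replicas) to provide failover if the primary node crashes. Because Redis replication is asynchronous, a failover can still lose the most recent writes, so the SQL database remains the source of truth for committed stock.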
3. Payment Ingress Bandwidth
- Assumptions:
  - Checkout ingestion rate = $20,000$ requests/second
  - Average JSON payload size = $2$ KB
- Calculations:
  $$\text{Network Ingress Rate} = 20,000 \text{ requests/s} \times 2 \text{ KB} = 40,000 \text{ KB/second} = 40 \text{ MB/second} \approx 320 \text{ Mbps}$$
This volume of network ingress requires distributing the load across multiple API Gateway instances behind a Layer 4 load balancer (like AWS NLB) to handle the packet rate without introducing routing delay.
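The three estimates in this section reduce to simple arithmetic, which can be captured in a few helper functions so the inputs are easy to vary. The function names are illustrative; the numbers in the usage lines are the assumptions stated above, using decimal units (1 MB = 1,000 KB).

```python
def write_iops(orders_per_second, writes_per_order, amplification):
    """Database write IOPS after index and log amplification."""
    return orders_per_second * writes_per_order * amplification


def lease_memory_bytes(checkouts_per_second, lease_seconds, bytes_per_lease, overhead):
    """Worst-case Redis memory if every lease lives for its full duration."""
    active_leases = checkouts_per_second * lease_seconds
    return active_leases * bytes_per_lease * overhead


def ingress_mbps(requests_per_second, payload_kb):
    """Network ingress in megabits per second (decimal units)."""
    megabytes_per_second = requests_per_second * payload_kb / 1000
    return megabytes_per_second * 8


iops = write_iops(5_000, 3, 2.0)                     # 30000.0
redis_bytes = lease_memory_bytes(5_000, 600, 250, 3.0)  # 2250000000.0
mbps = ingress_mbps(20_000, 2)                       # 320.0
```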
Failure Scenarios and Resilience
E-commerce checkout platforms must operate reliably across external networks and database failures.
1. Payment Gateway Timeout (Slow External Network Call)
The Checkout Service calls Stripe to capture a payment, but Stripe takes 12 seconds to respond due to external network congestion.
- The Threat: The Checkout Service thread remains blocked waiting for Stripe, exhausting the application server thread pool. New checkout requests are rejected, causing a system-wide outage.
- Resilience Design:
- We isolate external calls using Async Workers and Thread Pool Bulkheads.
- The Checkout Service writes a pending transaction record to the database, releases the main request thread, and delegates the payment call to a dedicated background execution queue (e.g., Celery or a Go worker channel).
- We apply a strict gateway timeout (gateway_timeout: 5s). If the gateway fails to respond within 5 seconds, the task is marked as timed out. The system checks the transaction state asynchronously before retrying to prevent duplicate charges.
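The bulkhead and timeout described above can be sketched with a dedicated thread pool from the Python standard library. The pool size, the `call_gateway_with_timeout` helper, and the `simulate_slow_gateway` stub are assumptions for illustration; a production worker would also set a socket-level timeout on the HTTP client so the worker thread itself is freed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Dedicated pool for gateway calls (the bulkhead). Slow gateway responses can
# exhaust only these workers, never the request-handling threads.
GATEWAY_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="gateway")


def call_gateway_with_timeout(charge_fn, timeout_seconds=5.0):
    """Run charge_fn on the bulkhead pool and stop waiting after the timeout.

    A TIMED_OUT result does not mean the charge failed: the gateway may still
    complete it, so the caller must check the transaction state before retrying.
    """
    future = GATEWAY_POOL.submit(charge_fn)
    try:
        return "CHARGED", future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return "TIMED_OUT", None


def simulate_slow_gateway(delay_seconds, reference="pi_demo"):
    """Hypothetical stand-in for a gateway call that takes delay_seconds."""
    time.sleep(delay_seconds)
    return reference
```

For example, `call_gateway_with_timeout(lambda: simulate_slow_gateway(12), timeout_seconds=5.0)` stops waiting after five seconds and reports TIMED_OUT instead of holding the caller for the full twelve.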
2. Inventory Locking Starvation (Flash Sale Hotspot SKUs)
100,000 users attempt to purchase a single highly sought-after item (e.g., concert tickets or a limited-edition sneaker) at the exact same moment.
- The Threat: If we use database row locks (SELECT FOR UPDATE), the first transaction blocks the other 99,999 threads. This database lock queue spikes CPU usage to 100%, causing database connection timeouts.
- Resilience Design:
- We use Redis-Based Token Buckets. Stock quantities for hot items are mirrored in Redis.
- When a user checkout is processed, the system executes a Lua script on the Redis cluster to check and decrement the stock count atomically:

local current = redis.call('get', KEYS[1])
if not current or tonumber(current) < tonumber(ARGV[1]) then
  return 0
else
  redis.call('decrby', KEYS[1], ARGV[1])
  return 1
end

- This execution runs in memory in less than a millisecond, handling high request volume without touching the primary SQL database. The database stock records are updated asynchronously in batches using worker queues.
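The asynchronous batch update can be sketched as a coalescing step: a worker drains decrement events from the queue, sums them per SKU, and issues one statement per SKU instead of one per checkout. The event shape, the helper names, and the inventory table and column names are assumptions for illustration.

```python
from collections import Counter


def coalesce_decrements(events):
    """Sum per-SKU quantities so each SKU needs one UPDATE per batch.

    events is an iterable of (sku, quantity) pairs drained from the queue.
    """
    totals = Counter()
    for sku, quantity in events:
        totals[sku] += quantity
    return dict(totals)


def build_batch_updates(events):
    """Return (sql, params) pairs, one per SKU, in a stable SKU order.

    The stock >= quantity guard keeps the SQL count from going negative even
    if Redis and the database have drifted apart.
    """
    sql = ("UPDATE inventory SET stock = stock - %s "
           "WHERE product_id = %s AND stock >= %s")
    totals = coalesce_decrements(events)
    return [(sql, (qty, sku, qty)) for sku, qty in sorted(totals.items())]
```

Sorting by SKU gives every worker the same lock acquisition order, which reduces the chance of deadlocks when several batches touch overlapping rows.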
3. Client Double-Click / Retry Double-Charge
A user clicks "Pay Now", experiences a brief network lag, and clicks the button a second time, or their client app retries the HTTP request automatically.
- The Threat: The server receives two identical request payloads, processes both, and charges the user's card twice.
- Resilience Design:
- We enforce client-generated Idempotency Keys.
- When the client initiates checkout, it generates a unique key (e.g., a UUID, or a hash of the user ID, cart ID, and client timestamp) and sends it in the Idempotency-Key header.
- The server checks the idempotent_requests table before processing the write:
  - If the key exists and the request is complete, it returns the cached response.
  - If the key exists and is currently processing, it returns an HTTP 409 Conflict.
  - If the key does not exist, it inserts the key with a PROCESSING status, processes the checkout, and updates the key with the final response.
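The three branches above can be sketched with an in-memory store. This is a single-process illustration only: the `IdempotencyStore` class and `handle_checkout` function are hypothetical, and a real deployment would back the store with the idempotent_requests table so that the check-and-insert is atomic across servers.

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for the idempotent_requests table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # key -> ("PROCESSING", None) or ("COMPLETE", response)

    def begin(self, key):
        """Return ("NEW", None), ("IN_PROGRESS", None) or ("REPLAY", response)."""
        with self._lock:
            record = self._records.get(key)
            if record is None:
                self._records[key] = ("PROCESSING", None)
                return "NEW", None
            status, response = record
            if status == "COMPLETE":
                return "REPLAY", response
            return "IN_PROGRESS", None

    def complete(self, key, response):
        with self._lock:
            self._records[key] = ("COMPLETE", response)

    def abandon(self, key):
        """Drop a PROCESSING marker so the client can retry after a failure."""
        with self._lock:
            self._records.pop(key, None)


def handle_checkout(store, key, process):
    """Run process() at most once per key; replay or reject duplicates."""
    outcome, cached = store.begin(key)
    if outcome == "REPLAY":
        return cached
    if outcome == "IN_PROGRESS":
        return (409, {"error": "request in progress"})
    try:
        response = process()
    except Exception:
        store.abandon(key)
        raise
    store.complete(key, response)
    return response
```

Whether a failed attempt should clear the key, as `abandon` does here, is a design choice: clearing it allows a retry, but it is only safe if the failure happened before any charge could have been made.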
4. Database Master Crash in Middle of Checkout
The database master crashes immediately after the payment gateway succeeds but before the database transaction commits the order as PAID.
- The Threat: The user is charged, but the system has no record of the completed order, resulting in an orphaned payment.
- Resilience Design:
- We implement a Reconciliation Loop.
- Before calling the payment gateway, the system commits the checkout session state to the database as PAYMENT_PENDING.
- When the payment gateway processes the charge, it emits an asynchronous webhook (e.g., payment_intent.succeeded).
- Our webhook listener receives the event, verifies its signature, checks the database state, and updates the checkout session to COMPLETED.
- If the database is still offline when the webhook arrives, the listener returns an error response and the gateway redelivers the event with backoff, so the database state eventually converges to the correct status.
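Because the gateway may deliver the same event more than once, the listener's state update needs to be idempotent. The sketch below applies an event to an in-memory session map; the `apply_payment_webhook` function, the dictionary-based storage, and the return labels are assumptions for illustration, while the event fields follow the webhook payload shown earlier in this guide.

```python
def apply_payment_webhook(sessions, processed_event_ids, event):
    """Apply a payment_intent.succeeded event at most once.

    sessions maps checkoutSessionId -> {"status": ...}; processed_event_ids is
    a set of event ids that have already been handled.
    """
    event_id = event["eventId"]
    if event_id in processed_event_ids:
        return "DUPLICATE_EVENT"          # redelivery: acknowledge, do nothing
    if event["eventType"] != "payment_intent.succeeded":
        return "IGNORED"
    session_id = event["data"]["metadata"]["checkoutSessionId"]
    session = sessions.get(session_id)
    if session is None:
        return "UNKNOWN_SESSION"          # left unprocessed so a retry can succeed
    processed_event_ids.add(event_id)
    if session["status"] == "PAYMENT_PENDING":
        session["status"] = "COMPLETED"
        return "UPDATED"
    return "NO_CHANGE"                    # request thread already completed it
```

In a database-backed version, recording the event id and updating the session would happen in one transaction, so a crash between the two steps cannot leave them out of sync.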
Architectural Trade-offs
Choosing the locking model and payment execution flow requires balancing consistency against system complexity.
Trade-off 1: Pessimistic Locking vs. Redis-Based Leases
Pessimistic locking uses SQL row locks during the checkout transaction; Redis-based leases reserve stock in memory with a Time-To-Live (TTL).
| Feature / Metric | Pessimistic Locking (SQL) | Redis-Based Leases |
|---|---|---|
| Write Performance | Low. Blocks concurrent database requests, limiting throughput. | High. Runs in-memory in less than 1 millisecond. |
| Consistency Guarantee | Maximum. Enforced by database ACID transactions. | Eventual. Requires synchronization between Redis and SQL. |
| Starvation Risk | High. Many threads waiting for a single lock can exhaust connections. | Low. Redis processes operations sequentially without blocking. |
| Complexity | Low. Relies on standard database SQL constraints. | High. Requires Lua scripts, TTL management, and synchronization. |
Trade-off 2: Synchronous vs. Asynchronous Payment Capture
Synchronous capture waits for the payment gateway response in the request thread; asynchronous capture returns a pending status and processes the payment in the background.
| Feature / Metric | Synchronous Payment Capture | Asynchronous Payment Capture |
|---|---|---|
| User Experience | Immediate. Client receives final confirmation in the response. | Delayed. Client must poll a status endpoint or wait for a push notification. |
| Thread Utilization | High. Requests block waiting for slow external APIs. | Low. Request threads are released immediately. |
| Resilience to Gateway Outage | Low. Outages block user requests directly. | High. Requests are queued and retried when the gateway recovers. |
Staff Engineer Perspective
Operating a high-volume checkout system requires designing for database safety and network isolation.
Verbal Script
Interviewer: "How do you prevent double-charging a customer if they click the checkout button multiple times during network lag?"
Candidate: "We prevent double-charging by implementing client-generated idempotency keys combined with a distributed lock at the API Gateway and a deduplication database constraint.
First, when the user opens the checkout page, the client application generates a unique idempotency key (e.g., combining user ID, cart ID, and a local client timestamp). This key is passed in the custom Idempotency-Key header of the HTTP POST checkout request.
Second, when the request hits the API Gateway, the service attempts to acquire a distributed lock in Redis using the key: idemp:lock:<user_id>:<key> with a short TTL (e.g., 5 seconds). If a duplicate request arrives while the first is processing, the gateway fails to acquire the lock and returns an HTTP 409 Conflict. This blocks concurrent duplicate requests at the edge before they reach the backend services.
Third, if the request passes the gateway, the Checkout Service attempts to insert the idempotency key and response state into our idempotent_requests table. If the database returns a unique key violation, it indicates that the request was already processed. The database blocks the duplicate write, and the service returns the cached response from the previous transaction, ensuring the payment API is not called a second time."
Interviewer: "How would you handle a database crash that occurs after a payment is successfully charged but before the order status is committed as paid?"
Candidate: "This is a classic distributed transaction failure. We resolve it by using asynchronous gateway webhooks and a reconciliation background worker.
First, we design our checkout flow so that we never call the payment gateway without first persisting a record of the attempt. Before the Checkout Service calls the payment gateway, it writes the checkout session to the database with a status of PAYMENT_PENDING and saves the generated transaction token.
Second, if the server crashes after the payment completes but before the database is updated to COMPLETED, the order remains flagged as PAYMENT_PENDING. We do not rely solely on the active request thread to update this status.
Instead, the payment provider (e.g., Stripe) dispatches an asynchronous webhook event (like payment_intent.succeeded) to our webhook listener.
When the listener receives the webhook, it checks the database. Finding the order status as PAYMENT_PENDING, it updates the status to COMPLETED and initiates the fulfillment workflow.
Third, we run a background reconciliation cron task. Every hour, it scans the database for checkout sessions that have been stuck in PAYMENT_PENDING for more than 15 minutes. It queries the payment provider's API using the transaction token to verify the status. If the payment was captured, the worker updates the order status, ensuring the system converges to a consistent state."
Interviewer: "Why would you choose a Redis-based inventory lease over a SQL optimistic locking mechanism during a high-concurrency flash sale?"
Candidate: "During a flash sale where 50,000 users are attempting to purchase an item with limited stock (e.g., 100 units), SQL optimistic locking performs poorly due to high conflict rates.
Under optimistic locking, the database check utilizes a version or stock check: UPDATE inventory SET stock = stock - 1 WHERE product_id = :id AND stock >= 1.
While this prevents overselling, all 50,000 statements target the same row, so they serialize on its row lock, and 49,900 of them ultimately fail the check (or, with a version column, conflict and retry).
This generates high write contention, holds database connections while statements wait, and wastes CPU cycles on rejected or retried updates, degrading performance for unrelated tables.
A Redis-based lease solves this by shifting the stock check to memory. Redis processes commands sequentially using a single-threaded execution loop.
We write a Lua script that runs atomically in Redis: it checks the stock count, decrements it if stock is available, and writes a temporary lease key with a 10-minute TTL.
This memory check runs in less than a millisecond, handling the high request rate easily.
The primary SQL database is updated asynchronously in batches by worker threads reading from a queue, protecting the database from write contention and connection exhaustion."