System Design: Designing Ticketmaster (High-Concurrency Booking)

Case Study: Designing Ticketmaster (High-Concurrency Seat Reservation)

Designing a ticket reservation platform like Ticketmaster or ticketing systems for mega-events is one of the most intense tests of distributed systems engineering. The central challenge is not storage or search volume, but Concurrency and Extreme Scarcity. When 1 Million concurrent fans try to buy 50,000 tickets the instant a superstar tour goes live, a naive system will either double-sell seats, crash under thundering database locks, or experience thread starvation.

1. Requirements & Core Constraints

Functional Requirements

Interactive Event Map: Users can view the real-time availability of seats on an interactive stadium map.
Temporary Reservation (Soft Hold): Users can hold up to 4 seats for 10 minutes. During this window, no other user can hold or purchase these seats while the current user completes payment.
Seat Purchase: Confirm booking and charge payments, marking seats as permanently sold.
Search & Catalog: Browse upcoming concerts, events, and venues by category or location.

Non-Functional Requirements (SLAs)

Zero Double-Bookings: Under no circumstances can the same physical seat be sold to two distinct customers.
High Ingestion Resilience: Gracefully survive a flash sale surge of 1,000,000 concurrent active users hitting the checkout engine simultaneously.
Highly Consistent Transactions: Strict adherence to ACID transactions (Specifically Isolation and Consistency) for reservation and purchase states.
Sub-Second Search Latency: Catalog search results must load in under 200ms globally.

Back-of-the-Envelope Capacity Estimates

Let's size the system around a peak flash sale release:

Available Tickets (Arena capacity): 50,000 seats.
Active Waiting Users: 1,000,000 concurrent users.
Average Seat Search QPS (Peak): 150,000 read queries/sec.
Average Reservation Attempts (Peak): 50,000 writes/sec.
Checkout Payment Processing Window: 10 minutes per reservation.

DB Ingestion Bottle Neck:

A standard sharded PostgreSQL database cluster can comfortably handle up to 10,000 raw transactional write QPS before lock-contention degrades performance. If all 50,000 users attempt to hold seats simultaneously directly against our relational database, database connection pools will starve, and locks will cascade, leading to a complete system outage. This mathematical bottleneck dictates that we must decouple user booking requests using a high-throughput traffic-gating queue.

2. API Design & Core Contracts

Stateless API gateways distribute requests to dedicated microservices. Below are the critical checkout contracts.

A. Seat Lock & Temporary Reservation API

When a user selects seats on the interactive stadium map, the client attempts to acquire a soft hold.

POST /v1/bookings/reserve
Authorization: Bearer <JWT_TOKEN>
Content-Type: application/json

{
  "event_id": "evt_98234",
  "seat_ids": ["seat_sec102_rowA_3", "seat_sec102_rowA_4"],
  "idempotency_key": "idem_83472981-8374-4b5f-a50a-da54eb24b45f"
}

Response:

{
  "booking_id": "book_73491823",
  "event_id": "evt_98234",
  "reserved_seats": ["seat_sec102_rowA_3", "seat_sec102_rowA_4"],
  "hold_expiration_epoch_seconds": 1779435600,
  "payment_redirect_url": "https://checkout.csp.io/payments/book_73491823"
}

B. Purchase Confirmation API

Completes the transaction once the payment processor webhook issues a success signature.

POST /v1/bookings/book_73491823/confirm
Authorization: Bearer <JWT_TOKEN>
Content-Type: application/json

{
  "payment_token": "pay_tok_83472918349",
  "idempotency_key": "idem_83472981-8374-4b5f-a50a-da54eb24b45f"
}

Response:

{
  "booking_id": "book_73491823",
  "status": "CONFIRMED",
  "tickets": [
    {
      "ticket_id": "tkt_0092348",
      "seat_id": "seat_sec102_rowA_3",
      "barcode_hash": "sha256_d8d3b88b22a0"
    }
  ]
}

3. High-Level Design (HLD)

Our design shields the relational transaction database by introducing a Virtual Waiting Room queue at the edge, and using a high-throughput Distributed Cache layer to handle temporary reservations.

graph TD
    %% User entry path
    User((Fan Client)) -->|1. Request Booking| LB[Load Balancer]
    LB -->|2. Check Capacity Limits| GatingQueue[Virtual Waiting Room Engine]
    
    %% Waiting room redirection
    GatingQueue -->|Redirection Queue Token| User
    GatingQueue -->|3. Allow Batch of Users (5k/min)| IngressGW[API Gateway]
    
    %% Core stateless services
    IngressGW -->|4. Browse Seats| CatalogSrv[Catalog & Search Service]
    IngressGW -->|5. Reserve Seats| BookingSrv[Booking Transaction Service]
    
    %% High-Performance state stores
    CatalogSrv -->|Cache Read| CDN[Edge CDN / Redis Read Replicas]
    CatalogSrv -->|Cache Miss| Elasticsearch[(Elasticsearch Cluster)]
    
    BookingSrv -->|6. Set Expirable Distributed Lock| LockStore[(Redis Lock Cluster)]
    BookingSrv -->|7. Final Seat Settlement| RelationalDB[(PostgreSQL Primary)]
    
    %% Third Party Integrations
    BookingSrv -->|8. Process Payment| PaymentAPI[External Stripe / PayPal API]

The Ticket Reservation State Machine

A ticket transitions through states dynamically to prevent deadlocks and guarantee that abandoned carts automatically release inventory.

stateDiagram-v2
    [*] --> AVAILABLE : Seat initialized
    
    AVAILABLE --> HELD : User selects seat (Redis TTL set: 10 mins)
    HELD --> AVAILABLE : Hold Expires (Redis TTL ends) OR Client Cancels
    
    HELD --> PENDING_PAYMENT : User clicks 'Pay Now'
    PENDING_PAYMENT --> SOLD : Payment Gateway Webhook confirms Success
    PENDING_PAYMENT --> AVAILABLE : Payment Fails / Timeout
    
    SOLD --> [*] : Permanent Ticket Issued (PostgreSQL Settled)

4. Low-Level Design (LLD) & Data Models

Database Selection Rationale

Inventory & Transaction Store (PostgreSQL sharded by event_id): Strict ACID transactions are critical. We shard our tables by event_id to guarantee that all seating rows for a specific stadium concert exist on a single physical database node. This prevents costly multi-node distributed transactions and guarantees row-level lock locality.
Distributed Locks (Redis Cluster): Reading and writing to PostgreSQL to check seat lock states under millions of requests causes disk-bound I/O bottlenecks. We use Redis to store temporary locks. Since Redis is single-threaded and handles memory operations, it can execute atomic evaluations using Lua scripting at 100,000+ operations per second.

Low-Level SQL Schema (Metadata Store)

-- Represents Venues / Arenas
CREATE TABLE venues (
    venue_id VARCHAR(64) PRIMARY KEY,
    name VARCHAR(128) NOT NULL,
    capacity INT NOT NULL
);

-- Represents specific physical seats
CREATE TABLE seats (
    seat_id VARCHAR(64) PRIMARY KEY,
    venue_id VARCHAR(64) REFERENCES venues(venue_id),
    section VARCHAR(32) NOT NULL,
    row_num VARCHAR(16) NOT NULL,
    seat_num INT NOT NULL
);

-- Represents concert details
CREATE TABLE events (
    event_id VARCHAR(64) PRIMARY KEY,
    venue_id VARCHAR(64) REFERENCES venues(venue_id),
    title VARCHAR(255) NOT NULL,
    start_time TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Active Seats Inventory sharded by event_id
CREATE TABLE event_seats (
    event_id VARCHAR(64) NOT NULL,
    seat_id VARCHAR(64) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    status VARCHAR(32) DEFAULT 'AVAILABLE', -- AVAILABLE, HELD, SOLD
    version INT DEFAULT 0,                 -- Optimistic Lock Counter
    PRIMARY KEY (event_id, seat_id)
);

-- Bookings Register sharded by event_id
CREATE TABLE bookings (
    booking_id VARCHAR(64) NOT NULL,
    event_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(64) NOT NULL,
    total_amount DECIMAL(10, 2) NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'RESERVED', -- RESERVED, PAID, EXPIRED
    idempotency_key VARCHAR(64) UNIQUE NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (event_id, booking_id)
);

LLD Implementation: Executing Seat Reservations Safely

To book seats without double-booking, we execute a two-phase check:

Phase 1 (Redis): Try to acquire an atomic distributed lock with a TTL of 10 minutes.
Phase 2 (PostgreSQL): Once the Redis hold is acquired, we verify or lock the relational seat status using an optimistic lock version match to handle eventual consistency synchronization.

Compilable Spring/Java Seat Reservation Logic:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import java.time.Duration;
import java.util.List;

@Service
public class SeatingReservationService {

    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private EventSeatRepository seatRepository;

    /**
     * Attempts to place a temporary hold on a seat.
     * Executes atomic Redis Distributed lock evaluation before database write.
     */
    @Transactional
    public boolean reserveSeat(String eventId, String seatId, String userId) {
        String lockKey = "lock:evt:" + eventId + ":seat:" + seatId;
        
        // Phase 1: Try atomic Redis Lock (SET NX EX)
        Boolean lockAcquired = redisTemplate.opsForValue().setIfAbsent(
                lockKey, 
                userId, 
                Duration.ofMinutes(10)
        );

        if (lockAcquired == null || !lockAcquired) {
            // Lock occupied. Seat is held by another user.
            return false;
        }

        // Phase 2: Optimistic Lock verification against PostgreSQL sharded node
        EventSeat seat = seatRepository.findByEventIdAndSeatId(eventId, seatId);
        
        if (seat == null || !seat.getStatus().equals("AVAILABLE")) {
            // Relational catalog shows seat is not available. Clean up Redis.
            redisTemplate.delete(lockKey);
            return false;
        }

        // Apply state transition using optimistic version increments
        seat.setStatus("HELD");
        seat.setVersion(seat.getVersion() + 1);
        
        int rowsUpdated = seatRepository.updateSeatWithOptimisticLock(
                eventId, 
                seatId, 
                "HELD", 
                seat.getVersion() - 1
        );

        if (rowsUpdated == 0) {
            // Conflicting transaction updated the record concurrently. Release Redis.
            redisTemplate.delete(lockKey);
            return false;
        }

        return true;
    }
}

5. Scaling Challenges & Bottlenecks

A. The Waiting Room Gating Mechanism (Traffic Smoothing)

To protect PostgreSQL database clusters from crashing during general ticket releases, we implement a Virtual Waiting Room.

The Queue Design: When a user clicks "Buy Tickets," the GTM redirects them to a waiting room web service. This stateless proxy assigns each user a position token stored inside a high-speed Redis Sorted Set (ZSET), using the entry timestamp as the sorting score.
FIFO Batch Releases: A background orchestrator pulls batch sizes (e.g., 5,000 tokens) from the Sorted Set every minute, granting them an encrypted, short-lived gateway entry token. The API Gateway verifies the JWT token before routing requests to the Booking Microservice. This smooths out a massive traffic spike, translating a 1,000,000 QPS surge into a steady 83 QPS transactional flow that our sharded relational databases handle effortlessly.

Flash Surge: 1M concurrent clicks/sec
      |
      v
[Virtual Waiting Room Engine (Redis Sorted Set ZSET)]
      |
      |---> (Redirection Web UI: "You are #48,239 in line...")
      |
      v [Batch releases of 5,000 users per minute]
[API Gateway JWT Decryption Gate]
      |
      v [83 QPS Database Write Load]
[PostgreSQL Relational DB Node] (100% Stability, Zero Outages!)

B. High Contention Row Locks (Front-Row Seat War)

Front-row seats represent ultra-hot keys. If 10,000 users try to book the exact same 2 VIP seats, a typical SELECT ... FOR UPDATE pessimistic query locks the entire index row group, starving the DB connection pool.

Solution: We offload all selection checks to Redis. If the key exists in Redis, the request is immediately rejected at the caching layer. PostgreSQL is only queried once the Redis distributed lock succeeds. This intercepts 99.9% of lock-contention operations, maintaining high database efficiency.

6. Real-World Trade-offs

A. Pessimistic Locking vs. Optimistic Locking under Extreme Contention

Optimistic Locking: Uses a version field. Under low contention, it is extremely performant because it avoids database-level physical write locks during read phases.
- Trade-off: Under high contention (e.g., flash sales), optimistic locking leads to up to 99% database transaction rollbacks, forcing client retries and causing excessive CPU thrashing.
Pessimistic Locking: Locks the rows immediately using FOR UPDATE.
- Trade-off: Prevents rollback retry loops, but blocks concurrent reader threads, leading to database connection pool starvation.
Our Hybrid Strategy: We utilize Redis-level lock-gating to intercept parallel requests, combined with a Fast-fail optimistic lock verification at the PostgreSQL database layer. This achieves the low-latency advantages of caching while preserving the data integrity guarantees of SQL.

7. Failure Scenarios & Fault Tolerance

A. Abandoned Carts and Orphaned Holds

If a user reserves seats but closes their browser, the seats remain held, preventing others from purchasing.

Solution: The Redis lock automatically expires after 10 minutes via standard TTL mechanics. To sync this state back to the database, a background daemon regularly sweeps the database tables looking for records with status = 'HELD' where the corresponding Redis key has expired. The daemon executes a batch update, resetting the seat statuses to AVAILABLE.

B. Payment Success Webhook Delay

If the user pays successfully, but the Payment Webhook experiences network partition delay:

The Pitfall: The 10-minute hold might expire before the confirmation webhook completes, returning the seat to the pool. A second user could then immediately hold and purchase it, causing a double-booking.
Mitigation: When the user enters the Stripe payment page, the client triggers a state change to PENDING_PAYMENT which extends the Redis lock TTL by an additional 15 minutes. If the payment finally times out, the seat status resets to AVAILABLE.

8. Staff Engineer Perspective (Operational Deep Dive)

Trade-off

Consensus-Based Global Locking (ZooKeeper) vs. High-Performance Expirable Locking (Redis Redlock)

A common interview debate is choosing between ZooKeeper/Consensus locks (CP-focused) and Redis Redlock configurations (AP-focused).

A ZooKeeper distributed lock uses ephemeral nodes, guaranteeing absolute linearizability and strong consistency even during network partitions. If the master node goes offline, the Raft/Paxos consensus ensures that we never elect a leader that double-allocates locks.

However, ZooKeeper requires disk-write consensus checks for every lock acquisition, limiting throughput to a few thousand lock operations per second.

Conversely, a Redis Cluster operating over single-threaded memory operations easily scales to 150,000 lock operations per second. In ticketing architectures, we choose the Redis Redlock approach backed by optimistic database constraints.

A transient Redlock anomaly during a network partition represents an acceptable boundary condition if our sharded PostgreSQL database's unique constraints act as the absolute source of truth, catching and rolling back any double-allocation attempts before a transaction commits.

9. Candidate Verbal Script (Mock Interview Guide)

Below is an elite verbal answer walk-through demonstrating high technical mastery:

Candidate: "To design Ticketmaster under a 1,000,000 user flash sale surge, I must solve lock-contention bottlenecks at the database layer. If we let all concurrent requests hit the database directly, connection pools will starve, and the system will crash. To prevent this, my primary architectural decision is implementing a Virtual Waiting Room rate-limiter at the edge.

When a ticketing event goes live, client requests are redirected to a stateless queue service that assigns them a queue position token managed inside a high-speed Redis Sorted Set (ZSET). We then release users in batches matching the capacity of our relational database, converting a chaotic 1,000,000 QPS surge into a stable, controlled transactional write flow.

For data modeling, I will use PostgreSQL sharded by event_id to ensure absolute ACID consistency for final seat bookings. This guarantees that all seating inventory for a specific event lives on the same physical host, preventing multi-node distributed transactions. To manage temporary seat holds safely without database locking bottlenecks, I'll leverage an expirable Redis Distributed Lock (SET NX EX) with a 10-minute TTL. The Booking service will only write to PostgreSQL once the Redis lock is successfully acquired. This hybrid caching-relational strategy guarantees absolute protection against double-booking while maintaining sub-millisecond latencies under massive concurrency."