Case Study: Design an Elevator System

1. Requirement Clarification

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

graph LR
    Producer[Producer Service] -->|Publish Event| Kafka[Kafka / Event Bus]
    Kafka -->|Consume| Consumer1[Consumer Group A]
    Kafka -->|Consume| Consumer2[Consumer Group B]
    Consumer1 --> DB1[(Primary DB)]
    Consumer2 --> Cache[(Redis)]

Designing an elevator system is a classic Machine Coding round question. It tests your ability to manage state, concurrency, and scheduling algorithms.

Functional Requirements

The system should control M elevators across N floors.
Users can press a button outside the elevator (Up/Down) to request a ride.
Users can press a button inside the elevator to select a destination floor.
The elevator should intelligently pick up passengers on its way.

Non-Functional Requirements

Minimize Wait Time: The algorithm should optimize the overall wait time for all passengers.
Fairness: No passenger should wait indefinitely (Starvation).
Concurrency: The system must handle simultaneous button presses.

2. Core Entities (Class Identification)

Using the noun-extraction method, we can identify our primary classes:

ElevatorSystem (The orchestrator/singleton)
Elevator (The physical car)
Floor
Button (Inside and Outside)
Request (A specific call to move)

3. The Scheduling Algorithm: SCAN (Elevator Algorithm)

A naive approach is FCFS (First Come First Serve), but this forces the elevator to bounce wildly between floors. The industry standard is the SCAN Algorithm (also used in hard drive disk scheduling).

How it works:

The elevator moves continuously in one direction (e.g., UP).
It stops at every requested floor in that direction.
Once it reaches the highest requested floor, it reverses direction (e.g., DOWN) and serves requests on the way down.

To implement this, an elevator needs two data structures:

A TreeSet or PriorityQueue for UP requests (min-heap).
A TreeSet or PriorityQueue for DOWN requests (max-heap).

4. Class Design (Java)

Enum & State

public enum Direction { UP, DOWN, IDLE }
public enum ElevatorState { MOVING, STOPPED, MAINTENANCE }

public class Request {
    int currentFloor;
    int destinationFloor;
    Direction direction;
    // Constructor and getters...
}

The Elevator Class

import java.util.*;

public class Elevator {
    private int id;
    private int currentFloor;
    private Direction currentDirection;
    private ElevatorState state;
    
    // Using TreeSet to automatically sort requests
    private TreeSet<Integer> upRequests;
    private TreeSet<Integer> downRequests;

    public Elevator(int id) {
        this.id = id;
        this.currentFloor = 0; // Ground floor
        this.currentDirection = Direction.IDLE;
        this.state = ElevatorState.STOPPED;
        this.upRequests = new TreeSet<>();
        this.downRequests = new TreeSet<>(Collections.reverseOrder());
    }

    public synchronized void addRequest(Request request) {
        if (request.direction == Direction.UP) {
            upRequests.add(request.destinationFloor);
        } else {
            downRequests.add(request.destinationFloor);
        }
        // Logic to awaken elevator if idle...
    }

    // Simplified run loop representing the SCAN algorithm
    public void run() {
        while (true) {
            if (currentDirection == Direction.UP || currentDirection == Direction.IDLE) {
                processRequests(upRequests, Direction.UP);
            }
            if (currentDirection == Direction.DOWN || currentDirection == Direction.IDLE) {
                processRequests(downRequests, Direction.DOWN);
            }
        }
    }
    
    private void processRequests(TreeSet<Integer> requests, Direction dir) {
        while (!requests.isEmpty()) {
            int targetFloor = requests.pollFirst(); // Get next closest floor
            moveToFloor(targetFloor);
        }
        currentDirection = Direction.IDLE; // Switch to idle when done
    }

    private void moveToFloor(int targetFloor) {
        System.out.println("Elevator " + id + " moving to " + targetFloor);
        this.currentFloor = targetFloor;
    }
}

The Orchestrator (Strategy Pattern)

public class ElevatorSystem {
    private List<Elevator> elevators;
    private ElevatorDispatchStrategy dispatchStrategy; // Strategy Pattern!

    public void requestElevator(int floor, Direction direction) {
        // The strategy decides WHICH elevator is best suited for this request
        Elevator optimalElevator = dispatchStrategy.selectElevator(elevators, floor, direction);
        optimalElevator.addRequest(new Request(floor, floor, direction));
    }
}

5. Verbal Interview Script (Staff Tier)

Interviewer: "How do you decide which elevator to dispatch in a multi-elevator building?"

You: "Dispatching is a complex optimization problem. I would use the Strategy Pattern to abstract the ElevatorDispatchStrategy. A naive strategy is Round-Robin or 'Random', but that increases wait times. A better strategy is Odd/Even zoning (used in skyscrapers). The most optimal dynamic strategy calculates a 'Suitability Score' for each elevator based on its distance from the requester and its current direction. If an elevator is on Floor 2, moving UP, and the request is from Floor 5 moving UP, it receives a high score. If it's moving DOWN, it gets a massive penalty. By injecting this strategy at runtime, we can easily A/B test different algorithms (like LOOK or C-SCAN) without touching the core ElevatorSystem code."

Advanced Architectural Blueprint: The Staff Perspective

In modern high-scale engineering, the primary differentiator between a Senior and a Staff Engineer is the ability to see beyond the local code and understand the Global System Impact. This section provides the exhaustive architectural context required to operate this component at a "MANG" (Meta, Amazon, Netflix, Google) scale.

1. High-Availability and Disaster Recovery (DR)

Every component in a production system must be designed for failure. If this component resides in a single availability zone, it is a liability.

Multi-Region Active-Active: To achieve "Five Nines" (99.999%) availability, we replicate state across geographical regions using asynchronous replication or global consensus (Paxos/Raft).
Chaos Engineering: We regularly inject "latency spikes" and "node kills" using tools like Chaos Mesh to ensure the system gracefully degrades without a total outage.

2. The Data Integrity Pillar (Consistency Models)

When managing state, we must choose our position on the CAP theorem spectrum.

Model	latency	Complexity	Use Case
Strong Consistency	High	High	Financial Ledgers, Inventory Management
Eventual Consistency	Low	Medium	Social Media Feeds, Like Counts
Monotonic Reads	Medium	Medium	User Profile Updates

3. Observability and "Day 2" Operations

Writing the code is only 10% of the lifecycle. The remaining 90% is spent monitoring and maintaining it.

Tracing (OpenTelemetry): We use distributed tracing to map the request flow. This is critical when a P99 latency spike occurs in a mesh of 100+ microservices.
Structured Logging: We avoid unstructured text. Every log line is a JSON object containing correlationId, tenantId, and latencyMs.
Custom Metrics: We export business-level metrics (e.g., "Orders processed per second") to Prometheus to set up intelligent alerting with PagerDuty.

4. Production Readiness Checklist for Staff Engineers

Capacity Planning: Have we performed load testing to find the "Breaking Point" of the service?
Security Hardening: Is all communication encrypted using mTLS (Mutual TLS)?
Backpressure Propagation: Does the service correctly return HTTP 429 or 503 when its internal thread pools are saturated?
Idempotency: Can the same request be retried 10 times without side effects? (Critical for Payment systems).

Critical Interview Reflection

When an interviewer asks "How would you improve this?", they are looking for your ability to identify Bottlenecks. Focus on the network I/O, the database locking strategy, or the memory allocation patterns of the JVM. Explain the trade-offs between "Throughput" and "Latency." A Staff Engineer knows that you can never have both at their theoretical maximums.

Optimization Summary:

Reduce Context Switching: Use non-blocking I/O (Netty/Project Loom).
Minimize GC Pressure: Prefer primitive specialized collections over standard Generics.
Data Sharding: Use Consistent Hashing to avoid "Hot Shards."

Technical Trade-offs: Messaging Systems

Pattern	Ordering	Durability	Throughput	Complexity
Log-based (Kafka)	Strict (per partition)	High	Very High	High
Memory-based (Redis Pub/Sub)	None	Low	High	Very Low
Push-based (RabbitMQ)	Fair	Medium	Medium	Medium

Key Takeaways

The system should control M elevators across N floors.
Users can press a button outside the elevator (Up/Down) to request a ride.
Users can press a button inside the elevator to select a destination floor.