System Design: Designing a Food Delivery App (Uber Eats / DoorDash)

1. Core Requirements & Scale Constraints

A food delivery platform (like Uber Eats, DoorDash, or Grab) is a complex, high-throughput Three-Sided Marketplace that must coordinate three distinct actors in real-time: Customers, Restaurants, and Couriers (Drivers).

Unlike standard e-commerce systems where delivery latency is measured in days, food delivery systems must operate within a strict 30-minute logistics window—before the food gets cold. This requires high-performance geospatial searching, real-time routing engines, transactional order state management, and highly optimized driver dispatch algorithms.

Functional Requirements

Customer Features:
- Browse restaurants by location, cuisine, or rating.
- View real-time menus and place orders.
- Make secure payments and track order preparation and courier location in real-time.
Restaurant Features:
- Update menus, prices, and item availability in real-time.
- Receive, accept, and mark orders as "preparing" or "ready for pickup".
Courier Features:
- Toggle online status and continuously stream GPS location.
- Receive delivery offers (Gig Requests), accept or reject offers, and update order status (arrived, picked up, out for delivery, completed).

Non-Functional SLAs

Geospatial Latency: Courier location updates must be ingested and visible to the tracking customer in $\le 2\text{ seconds}$.
Matching Efficiency: The dispatch system should match orders to couriers with a target pick-up delay of $\le 5\text{ minutes}$ from the moment the food is ready.
High Availability: Core checkout and courier dispatch systems must maintain $99.999%$ uptime SLA to prevent massive business loss during meal-time spikes.
Strict Data Consistency: Financial transactions, ledgers, and order state transitions must maintain strong transactional consistency (ACID) to avoid double-claiming or lost payments.

Back-of-the-Envelope Estimates

System Scale:
- Daily Active Users (DAU): 10 Million.
- Active Restaurants: 500,000.
- Active Couriers: 500,000.
- Daily Completed Orders: 5 Million.
Order Throughput:
- Average Orders per Second: $5\text{ Million} \div 86,400\text{ seconds} \approx 58\text{ orders/sec}$.
- Peak Orders per Second (Lunch/Dinner rush, 5x multiplier): $58 \times 5 = 290\text{ orders/sec}$.
Location Streaming Traffic:
- 500,000 active couriers streaming location every 4 seconds.
- Ingestion Write Throughput: $500,000 \div 4 = 125,000\text{ write pings/sec}$.
- Average size of GPS ping payload (CourierID, Lat, Long, Heading, Status, Timestamp): $\approx 100\text{ bytes}$.
- Inbound Location Bandwidth: $125,000 \times 100\text{ bytes} \approx 12.5\text{ MB/sec}$ of continuous write traffic.
Storage Calculation (Location History for Fraud/Audit):
- Storing GPS history for 90 days.
- Daily coordinates generated: $500,000 \times 21,600\text{ pings/hour} \times 10\text{ hours active/day} = 108\text{ Billion coordinates/day}$.
- Raw Daily Storage Volume: $108\text{ Billion} \times 100\text{ bytes} \approx 10.8\text{ Terabytes per day}$.
- 90-day Storage Volume: $10.8\text{ TB} \times 90 \approx 972\text{ TB}$ of compressed time-series data.

2. API Design & Core Contracts

The system utilizes REST APIs for standard CRUD operations (menu management, user registration) and WebSockets/gRPC for bi-directional real-time communication (courier location streaming and job dispatching).

Order Ingestion API (REST)

Used by the customer app to create a new transaction.

POST /v1/orders
Content-Type: application/json
Authorization: Bearer <jwt_token>

{
  "customer_id": "cust_897123",
  "restaurant_id": "rest_00129",
  "delivery_address": {
    "latitude": 37.774929,
    "longitude": -122.419416,
    "formatted_address": "123 Market St, San Francisco, CA"
  },
  "items": [
    {
      "menu_item_id": "menu_item_9901",
      "quantity": 2,
      "customizations": ["no_onions"]
    },
    {
      "menu_item_id": "menu_item_4402",
      "quantity": 1,
      "customizations": ["extra_cheese"]
    }
  ],
  "payment_method_id": "pm_1a2b3c4d5e"
}

Response Payload:

{
  "order_id": "ord_9082348",
  "status": "PLACED",
  "estimated_delivery_time": "2026-05-22T17:15:00Z",
  "subtotal": 34.50,
  "delivery_fee": 3.99,
  "service_fee": 1.50,
  "total": 39.99
}

Courier Location Streaming & Dispatch (gRPC Protocol)

Couriers maintain a persistent, bidirectional gRPC stream to send GPS telemetry and receive immediate delivery assignments.

syntax = "proto3";

package codesprintpro.fooddelivery;

service CourierDispatchService {
  // Bidirectional stream for continuous telemetry and dispatch coordination
  rpc EstablishSession(stream CourierTelemetry) returns (stream DispatchOffer);
}

message CourierTelemetry {
  string courier_id = 1;
  double latitude = 2;
  double longitude = 3;
  double speed_mps = 4;
  double bearing_degrees = 5;
  int64 timestamp_ms = 6;
  CourierActivityState state = 7;
}

enum CourierActivityState {
  STATE_UNKNOWN = 0;
  STATE_OFFLINE = 1;
  STATE_ONLINE_IDLE = 2;
  STATE_ACCEPTED_ORDER = 3;
  STATE_ARRIVED_RESTAURANT = 4;
  STATE_EN_ROUTE_DELIVERY = 5;
}

message DispatchOffer {
  string offer_id = 1;
  string order_id = 2;
  string restaurant_name = 3;
  double restaurant_latitude = 4;
  double restaurant_longitude = 5;
  double delivery_latitude = 6;
  double delivery_longitude = 7;
  double guaranteed_payout = 8;
  int64 accept_timeout_seconds = 9;
}

3. High-Level Design (HLD)

The architecture is designed as a set of decoupled microservices interacting through synchronous gRPC/REST protocols and an asynchronous Event Bus (Apache Kafka).

graph TD
    %% Clients
    Customer[Customer App]
    Courier[Courier App]
    Restaurant[Restaurant Dashboard]

    %% Gateway
    Gateway[API Gateway & Load Balancer]
    WS_Gateway[WebSocket/gRPC Gateway]

    Customer -->|Browse/Order| Gateway
    Restaurant -->|Update Menu/Accept| Gateway
    Courier -->|Telemetry Session| WS_Gateway

    %% Microservices Layer
    subgraph Microservices ["Microservices Layer"]
        OrderService[Order Ingestion Service]
        LocationService[Geospatial Location Service]
        MatchingEngine[Dynamic Dispatch Matching Engine]
        RestaurantService[Restaurant & Menu Service]
        TrackingService[Real-Time Tracking Service]
        LedgerService[Double-Entry Ledger Service]
    end

    Gateway --> OrderService
    Gateway --> RestaurantService
    WS_Gateway --> LocationService
    WS_Gateway --> TrackingService

    %% Message Bus
    Kafka{Apache Kafka Event Bus}
    OrderService -->|Order Placed/Updated| Kafka
    LocationService -->|Telemetry Event| Kafka
    LedgerService -->|Financial Events| Kafka

    %% Storage & Caches
    subgraph Storage ["Databases & Cache Layer"]
        Postgres[(Order PostgreSQL DB)]
        GeoCache[(Active Courier Redis GEO)]
        Cassandra[(Time-Series Telemetry Cassandra)]
        ES[(Restaurant Search Elasticsearch)]
    end

    OrderService --> Postgres
    RestaurantService --> ES
    LocationService --> GeoCache
    LocationService --> Cassandra
    MatchingEngine -->|Read Telemetry| GeoCache
    MatchingEngine -->|Write Matches| Postgres
    LedgerService --> Postgres

    %% Dispatch Loop
    MatchingEngine -->|Fetch Pending Orders| Postgres
    MatchingEngine -->|Publish Offers| WS_Gateway

Flow and Component Breakdown

API Gateway & WebSocket/gRPC Gateway: Handles user authentication, rate limiting, and routes connections. The WebSocket/gRPC gateway terminates the long-lived TCP connections from couriers to optimize connection overhead.
Geospatial Location Service: Receives 125,000 GPS coordinates per second. It stores the latest coordinate in Redis using spatial sharding (GEOADD commands) for sub-millisecond querying, and pushes historical coordinates into Apache Cassandra for long-term time-series logging.
Dynamic Dispatch Matching Engine: A specialized logistics engine that runs periodic matching sweeps (every 5-10 seconds) within local regions. It retrieves pending orders from PostgreSQL and matches them with available couriers queryable from Redis.
Real-time Tracking Service: Subscribes to telemetry events on Kafka and pushes the updated location of the assigned courier back to the tracking customer via Server-Sent Events (SSE) or WebSockets.

4. Low-Level Design (LLD) & Data Models

Database Technology Choices

PostgreSQL (ACID Compliant Relational DB): Used for Orders, Ledgers, and User Accounts. Relational systems are non-negotiable here to handle strict transactions (e.g., ensuring an order state does not transition out of sequence, and that payments balance perfectly).
Redis (In-Memory Data Structure Store): Tracks ephemeral states: active sessions, distributed locks, and active courier locations. Redis's highly optimized geospatial indexes (based on Geohashes) enable low-latency range queries.
Apache Cassandra (Wide-Column NoSQL DB): Best suited for high-throughput write-heavy time-series workloads. It stores the historical GPS trajectory logs of all completed deliveries.

Core SQL DDL Definitions (PostgreSQL)

-- Core Table mapping restaurants
CREATE TABLE restaurants (
    restaurant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    geo_hash VARCHAR(12) NOT NULL,
    latitude DECIMAL(9, 6) NOT NULL,
    longitude DECIMAL(9, 6) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'ACTIVE',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_restaurants_geohash ON restaurants(geo_hash);

-- Core Table mapping couriers metadata
CREATE TABLE couriers (
    courier_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    vehicle_type VARCHAR(50) NOT NULL, -- BICYCLE, MOTORCYCLE, CAR
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Strict state machine mapping of order transactions
CREATE TABLE orders (
    order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    restaurant_id UUID NOT NULL REFERENCES restaurants(restaurant_id),
    courier_id UUID REFERENCES couriers(courier_id),
    status VARCHAR(50) NOT NULL CHECK (
        status IN ('PLACED', 'ACCEPTED_BY_RESTAURANT', 'PREPARING', 'READY_FOR_PICKUP', 'OUT_FOR_DELIVERY', 'DELIVERED', 'CANCELLED')
    ),
    delivery_latitude DECIMAL(9, 6) NOT NULL,
    delivery_longitude DECIMAL(9, 6) NOT NULL,
    subtotal_cents INT NOT NULL,
    delivery_fee_cents INT NOT NULL,
    service_fee_cents INT NOT NULL,
    total_cents INT NOT NULL,
    version INT NOT NULL DEFAULT 1, -- Optimistic locking field
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_restaurant_id ON orders(restaurant_id);
CREATE INDEX idx_orders_status ON orders(status) WHERE status != 'DELIVERED';

-- Double-Entry Financial Ledger
-- Critical to maintain financial integrity, ensuring no balances drift due to system retries
CREATE TABLE financial_ledger (
    ledger_entry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID NOT NULL REFERENCES orders(order_id),
    account_id UUID NOT NULL, -- ID of Customer, Restaurant, Courier, or Platform Account
    account_type VARCHAR(50) NOT NULL CHECK (account_type IN ('CUSTOMER', 'RESTAURANT', 'COURIER', 'PLATFORM')),
    entry_type VARCHAR(10) NOT NULL CHECK (entry_type IN ('DEBIT', 'CREDIT')),
    amount_cents INT NOT NULL,
    balance_snapshot_cents INT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_ledger_account ON financial_ledger(account_id, created_at);

Ephemeral Driver Tracking Model (Redis)

Active couriers report their locations into Redis Geospatial indexes.

Key: active_couriers:geo (Redis Sorted Set under the hood)
Command to update location: GEOADD active_couriers:geo -122.419416 37.774929 "courier_uuid_8829"
Command to query nearby idle couriers: GEORADIUS active_couriers:geo -122.419416 37.774929 5 km WITHDIST WITHCOORD ASC

5. Scaling Challenges & System Bottlenecks

Dynamic Bipartite Driver-Order Matching

A naive dispatch system pairs an order with the closest driver immediately (a Greedy Match). This strategy creates a massive global bottleneck: if Driver $A$ is 200 meters away from Restaurant $X$ but is already matching with a customer 3km away, we force a sub-optimal matching chain that increases overall delivery times.

Greedy Strategy (Sub-optimal):
Order 1 (Prep done) ----------> Matches Courier A (2km away)
Order 2 (Prep done in 1m) ----> Matches Courier B (100m away, but locked)

Batch Matching Strategy (Global Optimum):
Dynamic Dispatcher pools all orders and idle couriers in a local grid, 
solving a Minimum Cost Bipartite Matching problem to minimize overall pickup latency.

To solve this, the platform splits cities into discrete hexagonal cells using H3 spatial indexes (developed by Uber).

Every 5 to 10 seconds, a Dispatch Worker is triggered for a specific H3 cell.
It pools all pending orders in READY_FOR_PICKUP or PREPARING (within a short time window) and all online idle couriers in the cell (and adjacent cells).
It constructs a Bipartite Cost Graph where edges represent the estimated travel time (queried from our routing engine, OSRM).
The system runs the Kuhn-Munkres (Hungarian) Algorithm or a linear programming solver to find the optimal pairings that minimize the overall pickup time for all orders in that batch.

graph TD
    subgraph Grid ["H3 Spatial Cell Matching Loop"]
        Cell[H3 Hexagonal Cell 8828308281fffff]
        PendingOrders[List of Pending Orders: O_1, O_2, O_3]
        IdleCouriers[List of Idle Couriers: C_1, C_2, C_3]
        
        Cell --> PendingOrders
        Cell --> IdleCouriers
        
        BipartiteGraph["Construct Bipartite Graph
        Weights = Estimated Arrival Time (OSRM)"]
        
        PendingOrders --> BipartiteGraph
        IdleCouriers --> BipartiteGraph
        
        Solver[Hungarian Algorithm / Min-Cost Flow Solver]
        BipartiteGraph --> Solver
        
        Matches[Optimal Dispatch Pairings:
        O_1 to C_3 (3m travel)
        O_2 to C_1 (5m travel)]
        
        Solver --> Matches
    end

Concurrency and Avoiding Double-Booking

When a courier is selected for a job, they are sent a Dispatch Offer. We must prevent the "Double Booking" race condition where two separate matching worker loops attempt to assign the same courier to two different orders simultaneously.

We mitigate this at the database layer using Optimistic Concurrency Control (OCC) on the orders table and Redis-based Distributed Locks on courier availability.

Before offering a job, the Matching Engine attempts to acquire an exclusive lock on Redis: SETNX lock:courier_uuid_8829 "true" EX 15 (15-second TTL).
If the lock is successfully acquired, the gRPC gateway pushes the job offer to the courier's device.
If the courier accepts the job, we update the order record using the version number to confirm: UPDATE orders SET courier_id = 'courier_uuid_8829', status = 'ACCEPTED_BY_RESTAURANT', version = version + 1 WHERE order_id = 'order_uuid_1012' AND version = 2;
If another thread modified the order status in the meantime, the row count returned is 0, indicating a race condition occurred, prompting the engine to safely roll back and re-evaluate.

6. Staff Engineer Perspective (Operational Trade-offs)

CAP Theorem Realities: Strong Consistency vs. Eventual Consistency

In a food delivery architecture, we must apply a hybrid approach to the CAP Theorem rather than selecting a single consistency model globally.

The Double-Entry Bookkeeping Pattern

In large-scale marketplaces, storing account balances as simple scalar integers (e.g. balance = balance + amount) is a major production pitfall. If a network partition occurs mid-write, or a database transaction retries, balances will quickly drift, resulting in massive financial audits.

To prevent this, a senior architect enforces Double-Entry Bookkeeping:

Balance calculations are never written as UPDATE balance = balance + 10. Instead, balance is a read-time sum of append-only Ledger Entries.
Every financial event must consist of at least two balanced entries: a Debit from one account and a Credit to another.
For example, when a courier completes a delivery worth $10.00:
1. Credit Courier Account: +$10.00
2. Debit Platform Outflow Account: -$10.00
The net balance across all system accounts always sums to exactly 0. This guarantees auditability, structural immutability, and safety against double-spend retries.

7. Candidate Verbal Script (Mock Interview Guide)

This verbatim dialogue shows how a top-performing candidate negotiates requirements and structures their design during a live architectural session.

Interviewer: "We want to design the backend systems for a scale like DoorDash. Let's focus specifically on the dispatcher engine. How would you pair drivers with orders when the food is ready at a restaurant?"

Candidate: "To design a matching engine at this scale, I want to clarify a key constraint. A naive approach would be a 'greedy' matching system where, the moment a restaurant marks food as ready, we query the closest idle driver and offer them the job. However, this is globally sub-optimal. If several orders are ready in a small radius, greedily matching the first one can result in high wait times for the subsequent ones.

To optimize this, I will design a Batch-based Geospatial Matching Loop that pools orders and idle couriers into spatial buckets using Uber’s H3 hexagonal grid indexing at resolution level 8, which has an average edge length of about 460 meters. Every 5 to 10 seconds, a distributed runner will pool all pending orders in that hexagonal cell and query Redis for active, idle drivers in the same cell and adjacent cells using Redis GEO commands."

Interviewer: "That makes sense. But computing the exact driving distance for every possible courier to every restaurant in that grid can be extremely CPU-intensive. How do you scale that computation under high load?"

Candidate: "That is a critical bottleneck. Calling a full routing engine like OSRM or Google Distance Matrix for $N \times M$ pairs is too slow.

To scale this, I would implement a multi-stage filtering pipeline:

Stage 1 (Geospatial Grid Filtering): We only consider couriers whose current H3 index falls inside the target hex cell or its immediate 6 neighboring cells. This strips away $99%$ of irrelevant couriers.
Stage 2 (Euclidean Distance Approximation): We calculate the simple straight-line distance using coordinate geometry. This is cheap and can filter out couriers separated by natural barriers (like a river or highway) if the straight-line distance exceeds a threshold.
Stage 3 (Routing API Batching): For the remaining candidates, we run a batch query against our local OSRM server to calculate the precise driving time. We then pass this cost matrix to our bipartite matching algorithm to solve for the global optimum in that cell."

Interviewer: "Excellent. Now, how do you handle the scenario where a driver has spotty cellular reception, loses connection, and then re-establishes it? How do you ensure their location is tracked without causing database contention?"

Candidate: "Under spotty network conditions, a courier's app might queue up offline location pings and suddenly send a batch of 10 pings at once, potentially out-of-order.

If we process these out-of-order, the tracking customer will see the driver 'teleport' backwards. To resolve this:

Every GPS ping payload must contain an monotonically increasing client-side timestamp and a sequence number.
The location ingestion service in Redis will use a Compare-And-Swap (CAS) logic using Lua scripting. It checks if the incoming timestamp is strictly greater than the stored timestamp for that courier_id. If it is older, we discard it from the real-time tracking cache.
The raw batch of pings is still written to Cassandra, as it is valuable for raw post-event route audit, but we prevent it from polluting the live real-time tracking interface."