Lesson 6 of 25 13 minDeep Systems

System Design: Designing a Distributed Task Scheduler

How do you reliably schedule and execute millions of jobs at scale? Learn about Timing Wheels, distributed locking, SKIP LOCKED, and exactly-once execution.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **Decoupled Scheduler & Executor:** Decoupling scheduling (calculating execution time and fetching tasks) from execution (workers performing the job) isolates failures and enables independent scaling.
  • **Optimistic Concurrency Control:** Resolving duplicate execution at the database layer is achieved by atomic state-transition updates, preventing multiple polling nodes from locking the same execution batch.
  • **Hashed Timing Wheels:** In-memory circular buffers allow scheduling and triggering sub-second events without continuous, expensive database scanning.
Recommended Prerequisites
System Design Interview Framework

Premium outcome

Distributed systems mechanics for engineers building serious backend platforms.

Engineers who want stronger distributed-systems fundamentals for platform work.

You leave with

  • More confidence with consistency, causality, locking, and time in distributed systems
  • A stronger sense of which backend guarantees are expensive and why
  • The systems-level foundation needed for difficult architecture trade-offs

1. Core Requirements & Scale Constraints

A Distributed Task Scheduler is a foundational infrastructure component responsible for triggering operations at specific points in the future. Example use cases include scheduling delayed payment audits, triggering automated email campaigns, handling retry policies for failed webhooks, or running daily transactional reconciliations.

Designing a scheduler that handles millions of events requires solving complex challenges around Distributed Locking, High-precision timing, Thundering herds, and Execution guarantees (At-Least-Once vs. Exactly-Once delivery).

System Decoupling:
[Client] ---> (Submit Task) ---> [API Ingestion] ---> [Task Metadata DB]
                                                             |
                                                     [Scheduler Poller] (Hierarchical Timing Wheels)
                                                             |
                                                     [Message Broker] (Kafka / RabbitMQ)
                                                             |
                                                     [Worker Fleet] ---> (Webhook Call) ---> [Downstream API]

Functional Requirements

  • Task Submission: Clients can submit a task with a payload, a specific target execution time (absolute timestamp or relative delay), and a callback URL.
  • Task Cancellation: Clients can cancel or reschedule a pending task using its unique task_id.
  • Execution Status: Clients can query the state of any scheduled job (PENDING, QUEUED, RUNNING, COMPLETED, FAILED).
  • Support for Recurring Jobs: Support both one-off tasks and recurring jobs (cron-like patterns).

Non-Functional SLAs

  • High Precision (Low Latency Deviation): Tasks must be executed within $\le 1\text{ second}$ of their scheduled execution time.
  • High Durability: Once a task is submitted and acknowledged, it must never be lost, even under database crashes or broker outages.
  • At-Least-Once Delivery Guarantees: Tasks must be triggered at least once. Duplicate executions must be minimized and managed via strict worker idempotency.
  • High Scale: Support 100 Million daily scheduled tasks.

Back-of-the-Envelope Estimates

  • Scale Calculation:
    • Daily Scheduled Tasks: 100 Million.
    • Average Submission Rate: $100\text{ Million} \div 86,400\text{ seconds} \approx 1,157\text{ submissions/sec}$.
    • Peak Submission Rate (5x multiplier): $\approx 5,800\text{ submissions/sec}$.
  • Thundering Herd Execution Peak:
    • Standard execution rates mirror submission rates. However, many systems schedule tasks precisely on the hour, half-hour, or midnight.
    • Peak Trigger Rate: If 15% of daily tasks are scheduled precisely at midnight:
    • Peak Trigger Volume: $15\text{ Million tasks}$ to be triggered immediately.
    • Target execution window: 5 minutes ($300\text{ seconds}$).
    • Peak Execution Throughput Required: $15\text{ Million} \div 300\text{ seconds} = 50,000\text{ tasks/sec}$.
  • Storage Calculations:
    • Metadata payload per task (JSON, headers, configs): 1 KB average.
    • Daily Metadata Storage: $100\text{ Million} \times 1\text{ KB} = 100\text{ Gigabytes per day}$.
    • 30-day History Retained (Audits & Logs): $100\text{ GB} \times 30 \approx 3\text{ Terabytes}$.

2. API Design & Core Contracts

Submit Task API (REST)

Used by clients to queue up delayed executions.

POST /v1/tasks
Content-Type: application/json
X-Client-Signature: <hmac_sig>

{
  "client_id": "service_billing_processor",
  "callback_url": "https://api.internal/billing/charge-subscription",
  "execute_at": "2026-05-22T18:00:00Z",
  "idempotency_key": "billing_invoice_9082348",
  "retry_policy": {
    "max_retries": 3,
    "backoff_seconds": 30
  },
  "payload": {
    "invoice_id": "inv_8829031",
    "customer_id": "cust_44921"
  }
}

Response Payload:

{
  "task_id": "task_e4c5f902-892b-4231-a589-9831a28a3812",
  "status": "PENDING",
  "execute_at": "2026-05-22T18:00:00Z",
  "created_at": "2026-05-22T16:55:00Z"
}

Cancel Task API

DELETE /v1/tasks/task_e4c5f902-892b-4231-a589-9831a28a3812
Authorization: Bearer <jwt_token>

Response Payload:

{
  "task_id": "task_e4c5f902-892b-4231-a589-9831a28a3812",
  "status": "CANCELLED"
}

3. High-Level Design (HLD)

The core architecture decouples the Task Storage & Scheduling phase from the Worker Execution phase using an intermediate High-Throughput Message Queue.

graph TD
    %% Clients submitting tasks
    Client[Microservice / Client App]
    Gateway[API Gateway]

    Client -->|Submit Task| Gateway

    %% Submission ingestion
    subgraph Ingestion ["Ingestion Tier"]
        SubmissionService[Task Ingestion Service]
        Gateway --> SubmissionService
    end

    %% Storage Layer
    subgraph Storage ["Task Storage Tier"]
        Postgres[(PostgreSQL Task Master DB)]
        SubmissionService -->|Insert PENDING Task| Postgres
    end

    %% Scheduler Poller Cluster
    subgraph Scheduling ["Scheduling & Poller Cluster"]
        PollerNodes[Scheduler Poller Daemon Cluster]
        TimingWheel[Hierarchical In-Memory Timing Wheel]
        PollerNodes -.->|Load 1m window tasks| TimingWheel
    end

    PollerNodes -->|1. Fetch Due Tasks| Postgres

    %% Queue and Execution
    subgraph QueueLayer ["Decoupled Message Queue"]
        Kafka{Apache Kafka Dispatch Queue}
    end

    PollerNodes -->|2. Push Triggered Tasks| Kafka

    subgraph WorkerFleet ["Executor Worker Fleet"]
        Worker1[Worker Node A]
        Worker2[Worker Node B]
        Kafka -->|3. Consume Jobs| Worker1
        Kafka -->|3. Consume Jobs| Worker2
    end

    %% Callback Trigger
    DownstreamAPI[Client Webhook Endpoint]
    Worker1 -->|4. Invoke Callback POST| DownstreamAPI
    Worker1 -->|5. Update Status SUCCESS/FAIL| Postgres

Component Descriptions

  1. Task Ingestion Service: Receives requests, validates inputs, generates unique task_id hashes using the client-provided idempotency keys, and writes new tasks directly to PostgreSQL with status = PENDING.
  2. Scheduler Poller Cluster: A fleet of coordinated daemons. Every 30 to 60 seconds, they pull tasks scheduled to execute in the near future, loading them into an in-memory Timing Wheel to execute with sub-second precision.
  3. Apache Kafka (Dispatch Queue): Decouples the Poller from the Workers. Once a task reaches its trigger time, the Poller writes it to Kafka as a job event. This buffers downstream workers from sudden spikes.
  4. Executor Worker Fleet: Stateless consumer nodes. They fetch jobs from Kafka partitions, invoke client webhook callbacks, handle retries, and write final statuses back to the database.

4. Low-Level Design (LLD) & Data Models

Database Technology Choice

  • PostgreSQL (with CockroachDB / Cloud Spanner for global scaling): To guarantee that once a task is scheduled it is never lost, we need a relational DB supporting ACID transactions. Relational systems also provide native B-Tree indexing on temporal columns (execute_at) which is mandatory for range-based polling queries.
  • NoSQL engines (like DynamoDB or Cassandra) are poorly suited here because they lack efficient sorted range indexing across global tables without massive indexing overhead, and do not natively support strong ACID locking primitives during batch status transitions.

Master Database DDL (PostgreSQL)

-- Core Tasks Table
CREATE TABLE tasks (
    task_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    client_id VARCHAR(100) NOT NULL,
    idempotency_key VARCHAR(255) NOT NULL,
    callback_url TEXT NOT NULL,
    payload JSONB,
    execute_at TIMESTAMP WITH TIME ZONE NOT NULL,
    status VARCHAR(50) NOT NULL CHECK (
        status IN ('PENDING', 'QUEUED', 'RUNNING', 'COMPLETED', 'FAILED', 'CANCELLED')
    ),
    retry_count INT NOT NULL DEFAULT 0,
    max_retries INT NOT NULL DEFAULT 3,
    backoff_seconds INT NOT NULL DEFAULT 30,
    locked_by VARCHAR(100),
    lock_expiration TIMESTAMP WITH TIME ZONE,
    version INT NOT NULL DEFAULT 1, -- Optimistic locking version
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uq_client_idempotency UNIQUE (client_id, idempotency_key)
);

-- Indexing for Range Queries (The most executed query by Pollers)
CREATE INDEX idx_tasks_execution ON tasks(execute_at, status) 
WHERE status = 'PENDING';

-- Execution Runs Log (For Auditing & Troubleshooting)
CREATE TABLE task_runs (
    run_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    task_id UUID NOT NULL REFERENCES tasks(task_id) ON DELETE CASCADE,
    started_at TIMESTAMP WITH TIME ZONE NOT NULL,
    completed_at TIMESTAMP WITH TIME ZONE,
    status VARCHAR(50) NOT NULL,
    error_message TEXT,
    retry_attempt INT NOT NULL
);

CREATE INDEX idx_task_runs_id ON task_runs(task_id);

5. Scaling Challenges & System Bottlenecks

The Double-Polling Race Condition: Resolving with SKIP LOCKED

If we run a cluster of 10 Scheduler Poller nodes, we must prevent multiple pollers from picking up the exact same tasks simultaneously. If Poller A and Poller B both fetch tasks due for execution at the exact same millisecond:

SELECT * FROM tasks WHERE execute_at <= NOW() AND status = 'PENDING' LIMIT 500;

Both pollers will load the same 500 tasks, drop them into Kafka, and run them twice.

To solve this, we leverage PostgreSQL's highly optimized FOR UPDATE SKIP LOCKED concurrency model:

-- Transaction block executed by Poller nodes
BEGIN;
UPDATE tasks
SET status = 'QUEUED', locked_by = 'poller_node_02', lock_expiration = NOW() + INTERVAL '5 minutes'
WHERE task_id IN (
    SELECT task_id 
    FROM tasks 
    WHERE execute_at <= NOW() AND status = 'PENDING'
    ORDER BY execute_at ASC
    LIMIT 500
    FOR UPDATE SKIP LOCKED
)
RETURNING task_id, callback_url, payload;
COMMIT;

How this works:

  • FOR UPDATE instructs the engine to acquire an exclusive write-lock on the selected 500 rows.
  • SKIP LOCKED instructs the engine to bypass any rows that are already locked by a different concurrent poller transaction.
  • If Poller A locks rows 1-500, Poller B running the exact same query immediately skips rows 1-500 and locks rows 501-1000. This provides $100%$ collision-free parallel polling with zero database lock contention.

Managing the Thundering Herd with Hashed Timing Wheels

During peak events (e.g. midnight daily syncs), executing a database check every second becomes a massive bottleneck. If $50,000\text{ tasks}$ are due at exactly 00:00:00, a sequential poller running LIMIT 500 will require 100 database transactions, introducing significant execution delays.

We solve this using Hashed Timing Wheels at the Poller tier:

  1. Pre-fetching: Every 1 minute, the Poller runs a bulk pre-fetch query to load all tasks due in the next 60 seconds into local memory.
  2. Timing Wheel Structure: The memory cache is structured as a circular array containing 60 buckets (representing the 60 seconds of the minute).
  3. Hashing into Slots: A task due in $K$ seconds from the start of the minute is hashed directly into slot K.
  4. Pointer Rotation: An internal timer thread increments a pointer current_slot = (current_slot + 1) % 60 every second. The Poller immediately reads all tasks in that slot and fires them to Kafka.
Hashed Timing Wheel: Circular Buffer of 60 Slots (1 Slot = 1 Second)

           [59]  [00]  [01]
       [58]                 [02]
     [57]                     [03] <--- Pointer ticks once per second.
    [56]     (Current Slot)     [04]      Fires all tasks inside slot
    [55]           |            [05]      directly to Kafka Partition.
     [54]          v          [06]
       [53]                 [07]
           [52]  ...   [08]

This reduces database access to a single bulk transaction per minute, turning slow random disk seeks into highly predictable sequential batch fetches.


6. Staff Engineer Perspective (Operational Trade-offs)

Webhook Push Invocation vs. Worker Queue Pull

Surviving Broker Outages and Dead-Locks


7. Candidate Verbal Script (Mock Interview Guide)

This verbatim dialogue shows how a top-performing candidate navigates requirements and structures their design during a live architectural session.


Interviewer: "Let's focus on the scheduling mechanics. How do you scale the Poller to handle 50,000 tasks due at the exact same second without crushing your database?"

Candidate: "If we hit the database every second querying for individual tasks, the database will fail under lock contention and connection exhaustion.

To scale this, I would implement an in-memory Hashed Timing Wheel on the Poller nodes. Instead of querying the database every second, the Poller will query the database once every minute to pre-fetch all tasks scheduled to execute within the next 60 seconds.

We load these pre-fetched tasks into an in-memory circular array with 60 slots, representing the 60 seconds of that minute. Each slot holds a linked list of tasks. An internal thread ticks a pointer forward once every second, reading the tasks in the current slot and instantly dropping them into Kafka. This shifts the real-time execution cost from expensive database joins to a simple, sub-millisecond memory lookup."

Interviewer: "That works if all tasks fit in memory. But what if we have millions of tasks due in that minute? The Poller will run out of memory."

Candidate: "That is a valid boundary case. If the volume of tasks due in a single minute exceeds our memory limits, we can transition to a Hierarchical Timing Wheel design, similar to how Linux handles timers.

Instead of a single wheel, we maintain multiple wheels representing different time scales: a Second Wheel (60 slots), a Minute Wheel (60 slots), and an Hour Wheel (24 slots).

  • Tasks scheduled hours away are stored on disk in PostgreSQL.
  • Only tasks due in the current hour are loaded into the Poller's Hour Wheel.
  • As the hour pointer ticks, tasks are cascaded down into the Minute Wheel, and eventually down into the Active Second Wheel.
  • Furthermore, we can scale the Poller horizontally by partition-key hashing. We shard the task database by hashing the task_id and assign each Poller node a subset of shards. Poller Node A only pre-fetches and holds the timing wheel for Shards 1-4, while Poller Node B manages Shards 5-8, allowing us to scale memory capacity linearly."

Interviewer: "Excellent. How do you guarantee that a job is executed 'exactly once'? What if a worker executes a payment task, but crashes before it can write the 'SUCCESS' status back to the database?"

Candidate: "In a distributed system, achieving mathematical 'exactly-once' execution over a network is impossible because network delivery acknowledgement can always fail. We must build our system around At-Least-Once Delivery combined with strict Downstream Idempotency.

When a worker pulls a task, it executes it and writes a record to a deduplication_ledger table inside a transaction on the target database, using the unique task_id (or a client-provided idempotency_key) as a primary key constraint.

If a worker crashes mid-execution and the scheduler retries the task on a different worker, the new worker will attempt to insert the same task_id into the deduplication ledger. The database will throw a unique key constraint violation, letting the worker know the task has already been completed. The worker can then safely skip the callback execution and mark the task as complete in the metadata DB."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.