Lesson 75 of 105 21 minFlagship

System Design: Building a Workflow Orchestration Platform

Design a production workflow orchestration platform with durable workflow state, retries, compensation, timeouts, signals, timers, idempotent activities, and operational recovery.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • wait for payment confirmation
  • call three downstream services
  • retry one step but not another
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Many backend processes start life as a simple queue consumer with a database transaction and a couple of retries. This basic pattern works reasonably well until the business processes grow longer, more complex, and span multiple external service boundaries. In a simple architecture, you can easily wrap database writes in a local transaction. However, as soon as you have to coordinate actions across multiple microservices or third-party APIs (such as payment processors, shipping carriers, and customer notification systems), local database transactions can no longer guarantee consistency.

When business requirements demand complex coordination, custom microservice solutions quickly become unmanageable. A typical production flow might require:

  1. Reserving an item in the inventory service.
  2. Charging the user's credit card via an external gateway.
  3. Pausing the execution path to wait for a payment confirmation webhook.
  4. Calling three downstream warehouses to find the closest shipping point.
  5. Retrying a specific API call with exponential backoff while allowing another step to fail without stopping the flow.
  6. Executing compensation steps (such as refunding the card and releasing the inventory) if the shipment creation fails terminally.
  7. Resuming the flow only after receiving manual approval from a compliance manager.
  8. Surviving server crashes and cluster restarts without forgetting the state of in-progress operations.

At this point, you are no longer dealing with a standard background job. You are running a stateful, distributed workflow. Trying to manage this coordination using ad-hoc database updates, custom retry loops, and scattered message queues inside application microservices generates race conditions, double-payouts, and untraceable failures.

This design case study details the architecture of a production-grade workflow orchestration platform (similar to Temporal or AWS Step Functions) capable of executing millions of durably tracked business processes.


System Requirements

To construct a reliable workflow orchestration platform, we must define clear boundaries, non-functional operational guarantees, and specific scale assumptions.

Functional Requirements

  • Durable Workflow Execution: Execute sequential, parallel, and branching activities defined by code or DSL configuration templates.
  • Lease-Based Activity Execution: Orchestrate side-effecting tasks (activities) by dispatching them to external workers via lease-based lock queues.
  • Durable Timers: Support pausing workflow execution for arbitrary durations (seconds, hours, or months) without consuming active runtime CPU or thread resources on the engine hosts.
  • External Signals: Allow active workflows to halt execution, wait, and resume when an external event webhook or callback signal is received.
  • Saga-Style Compensation: Support defining explicit compensation steps to revert side effects when a workflow encounters a terminal failure.
  • Audit Trails and History Logging: Record an append-only, chronologically sorted list of all state changes, inputs, and outputs for every execution.
  • Dynamic Task Queues: Allow service teams to register and listen on custom queues, separating workloads by namespace and traffic priorities.
  • Execution Telemetry: Expose real-time execution states (such as active task count, failed activity distribution, timer delays) through custom Prometheus metrics or structured log output. This allows operations teams to trace stuck pipelines immediately, identify performance degradations, and establish SLA alarms before outages occur.
  • Workflow Versioning and Migration: When business logic changes, long-running workflow executions (which might live for weeks or months) cannot simply be terminated or updated in-flight. The orchestrator must support concurrent version execution, where older executions continue running on the schema version they started with, while new executions run on the latest deployed schema. This prevents system crashes and data loss during software deployments.

Non-Functional Requirements

  • Crash Resilience (Data Plane Survivability): If a worker process executing a workflow crashes, another worker must be able to claim the lease and resume execution exactly where the crashed worker stopped.
  • State-Process Isolation: The progress and state transitions of a workflow must not depend on the local memory of any single worker node.
  • High Ingest Throughput: The core engine database must handle millions of state updates daily without deadlocking or locking hot rows.
  • Strict Idempotency: Provide execution guarantees to ensure that activities (such as debiting a bank card) are never executed duplicate times due to connection timeouts or worker retries.
  • Queryable Visibility: The state and execution history of every active or completed workflow must be inspectable via an admin dashboard or API query.

Scale Assumptions

  • Volume: 10,000,000 active workflow executions per day.
  • Activity Fan-out: Each workflow averages 10 distinct activities.
  • Timer Capacity: Up to 100,000 concurrent timers waiting to fire at any given moment.

API Design and Service Contracts

The orchestration engine exposes gRPC services for high-speed worker polling and REST endpoints for client integrations.

1. Start Workflow Execution (POST /v1/workflows)

Invoked by clients to trigger a new business flow.

Request Payload:

{
  "workflowType": "order_fulfillment",
  "tenantId": "tenant_1120a",
  "businessKey": "order_uuid_9921c",
  "input": {
    "orderId": "order_uuid_9921c",
    "customerId": "cust_88291",
    "amountCents": 15000
  }
}

Response Payload (201 Created):

{
  "workflowId": "wf_exec_001a2b3c",
  "status": "RUNNING",
  "currentState": "STARTED",
  "createdAt": "2026-06-07T11:26:00Z"
}

2. Signal Workflow (POST /v1/workflows/{workflowId}/signals)

Sends an external event or user callback payload to an active workflow execution.

Request Payload:

{
  "signalName": "payment_webhook_received",
  "signalId": "sig_uniq_7761a",
  "payload": {
    "paymentStatus": "SUCCESS",
    "transactionId": "txn_88192a"
  }
}

Response Payload (200 OK):

{
  "workflowId": "wf_exec_001a2b3c",
  "signalProcessed": true
}

3. Poll Activity Task (POST /v1/activities/poll)

Invoked continuously by worker processes to fetch tasks scheduled by the orchestrator.

Request Payload:

{
  "workerId": "worker_host_08a",
  "taskQueue": "order_processing_queue"
}

Response Payload (200 OK):

{
  "taskId": "task_99182a",
  "workflowId": "wf_exec_001a2b3c",
  "activityType": "charge_credit_card",
  "input": {
    "customerId": "cust_88291",
    "amountCents": 15000
  },
  "attemptLimit": 5,
  "leaseDurationSeconds": 30
}

High-Level Architecture

The platform separates the orchestration engine, which manages the state machine, timers, and databases, from the worker pool, which executes the actual side-effecting code.

The Client Application initiates a workflow by posting to the Orchestrator Service. The Orchestrator writes the workflow start event to the Durable State Database and puts the first activity task into the Activity Queue.

Distributed Workers poll the activity queue, lease tasks using pessimistic database locking, execute the activity code, and report completions back to the engine. A dedicated Timer Engine runs in the background to handle delays, retries, and timeouts, while the Signal Gateway routes incoming webhooks to active workflows.

This decoupling ensures that the orchestration engine remains lightweight and stateless. It does not execute the actual business logic; rather, it coordinates the ordering of tasks, tracks history, and manages timers. The workers can scale independently based on the resources required for their specific activities (e.g., memory-intensive document generation vs network-intensive API calling).

By splitting ingestion, queuing, and execution into decoupled services, the platform ensures high scalability. It prevents worker outages from impacting registration pipelines and allows service teams to deploy customized workers to process complex activities without code deployments to the core orchestration engine.

The Durable State Database acts as the single source of truth for the platform. Every state transition, activity scheduling, and signal receipt must be committed to this database before the orchestrator responds to clients or workers. This strict transaction boundary prevents write loss. If the orchestrator service crashed mid-transition, the new leader node simply reads the history from the database to reconstruct the execution state, avoiding duplicate steps or lost context.

End-to-End Workflow Execution Pipeline

This sequence diagram tracks how a workflow is scheduled, how tasks are pushed to queues, claimed by workers, and how the state history is logged.

sequenceDiagram
    autonumber
    participant Client as Client App
    participant Engine as Orchestrator Engine
    participant DB as State DB
    participant Queue as Activity Task Queue
    participant Worker as Activity Worker
    
    Client->>Engine: POST /v1/workflows (Start order_fulfillment)
    Engine->>DB: Write Exec Record & Log WORKFLOW_STARTED event
    Engine->>Queue: Push Task (charge_credit_card)
    Engine-->>Client: Return workflowId
    
    Worker->>Queue: Poll for Tasks (locks task via SKIP LOCKED)
    Queue-->>Worker: Return Task (task_99182a)
    Worker->>Worker: Execute Payment API call
    Worker->>Engine: POST /v1/activities/complete (result)
    Engine->>DB: Log ACTIVITY_COMPLETED event & Update status
    Engine->>Queue: Push next activity (create_shipment)

Distributed Activity Lease Claim Flow

This flowchart details how workers claim activities safely using pessimistic locking to prevent duplicate claims.

graph TD
    A[Worker Request Task] --> B{Pending Task in DB?}
    B -- No --> C[Wait/Poll Backoff]
    B -- Yes --> D[Execute SELECT FOR UPDATE SKIP LOCKED]
    D --> E{Acquire DB Row Lock?}
    E -- No --> C
    E -- Yes --> F[Set Status=RUNNING, locked_by=workerId]
    F --> G[Extend lease_expires_at by 30s]
    G --> H[Return Task Payload to Worker Process]
    H --> I[Execute Business Side Effect]
    I --> J{Execution Success?}
    J -- Yes --> K[Delete Task or Set Status=SUCCEEDED]
    J -- No --> L[Increment attempt count & Set next_attempt_at]
    L --> M[Unlock Row & Release Lease]

Low-Level Design and Schema

At the storage layer, we model the system using a PostgreSQL relational database. This schema represents the logical tables, constraints, and composite indexes required to run durable execution engines.

-- Tracks active and completed workflow executions
CREATE TABLE workflow_executions (
    workflow_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id VARCHAR(64) NOT NULL,
    workflow_type VARCHAR(128) NOT NULL,
    business_key VARCHAR(128) NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'RUNNING', -- RUNNING, WAITING, SUCCEEDED, FAILED, COMPENSATING
    current_state VARCHAR(64) NOT NULL,
    input JSONB NOT NULL,
    context JSONB NOT NULL DEFAULT '{}'::jsonb,
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ,
    CONSTRAINT uk_tenant_type_key UNIQUE (tenant_id, workflow_type, business_key)
);

-- Chronological append-only history of workflow events
CREATE TABLE workflow_history (
    event_id BIGSERIAL PRIMARY KEY,
    workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id) ON DELETE CASCADE,
    event_type VARCHAR(64) NOT NULL, -- WORKFLOW_STARTED, TASK_SCHEDULED, TASK_COMPLETED, COMPENSATE_RUN
    payload JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_workflow_history_sort 
ON workflow_history (workflow_id, event_id ASC);

-- Queue of activity tasks for workers to poll
CREATE TABLE workflow_activity_tasks (
    task_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id) ON DELETE CASCADE,
    activity_name VARCHAR(128) NOT NULL,
    task_queue VARCHAR(128) NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, RUNNING, SUCCEEDED, FAILED
    attempt_count INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    locked_by VARCHAR(128),
    locked_until TIMESTAMPTZ,
    input JSONB NOT NULL,
    result JSONB,
    last_error TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Index for workers polling active queue
CREATE INDEX idx_activity_tasks_poll 
ON workflow_activity_tasks (task_queue, status, next_attempt_at)
WHERE status IN ('PENDING', 'FAILED');

-- Durable timers to resume waiting workflows
CREATE TABLE workflow_timers (
    timer_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id) ON DELETE CASCADE,
    timer_type VARCHAR(64) NOT NULL,
    fire_at TIMESTAMPTZ NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, FIRED, CANCELLED
    payload JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_workflow_timers_due 
ON workflow_timers (status, fire_at) 
WHERE status = 'PENDING';

-- Inbox for incoming external signals
CREATE TABLE workflow_signals (
    signal_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id) ON DELETE CASCADE,
    signal_name VARCHAR(128) NOT NULL,
    payload JSONB NOT NULL,
    processed BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_workflow_signals_unprocessed 
ON workflow_signals (workflow_id, processed) 
WHERE processed = FALSE;

-- PAUSE points requiring manual human-in-the-loop intervention
CREATE TABLE workflow_manual_tasks (
    manual_task_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_id UUID NOT NULL REFERENCES workflow_executions(workflow_id) ON DELETE CASCADE,
    task_type VARCHAR(128) NOT NULL,
    status VARCHAR(32) NOT NULL DEFAULT 'OPEN', -- OPEN, COMPLETED, CANCELLED
    assigned_to VARCHAR(128),
    input JSONB NOT NULL,
    result JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMPTZ
);

Schema Rationale & Index Optimization

  1. idx_activity_tasks_poll: The query used by workers fetching tasks continuously runs a filter on task_queue, status = 'PENDING', and next_attempt_at <= NOW(). By structuring a partial index that excludes completed tasks, the database search space is kept small, reducing index scan times to sub-millisecond durations.
  2. uk_tenant_type_key: A composite unique constraint on workflow_executions. This serves as an immediate idempotency gate. If a client retries a "Start Workflow" request due to network dropouts, the constraint blocks duplicate insertions, returning the existing execution reference.
  3. idx_workflow_history_sort: Workflow engines reconstruct state by replaying history events in strict sequence. This index ensures that events are retrieved in the exact order they occurred without sorting them in RAM.
  4. Foreign Key Deletions: We utilize ON DELETE CASCADE on execution child tables (history, tasks, signals) to simplify lifecycle maintenance. When a completed execution's metadata is archived and removed from PostgreSQL, all associated execution details are purged automatically in a single query.
  5. Manual Task Management Schema: The manual tasks table tracks human-in-the-loop actions. When a workflow pauses for compliance review, it logs a manual task record with an OPEN status. An external administration UI queries these tasks, allowing administrators to claim, review, and complete them. Upon completion, a signal is sent to the workflow engine, updating the execution state and scheduling the next automated activity.

Scaling Challenges and Capacity Estimation

To verify that a relational backend can scale to our targets, we evaluate write throughput, history log sizes, and timer polling limits.

1. Database Write IOPS Capacity

  • Assumptions:

    • Daily Workflow Starts ($N$) = $10,000,000$
    • Activities per Workflow ($A$) = $10$
    • Each activity execution generates $3$ database write operations (Task scheduled, Task claimed/locked, Task completed). Starting and completing the workflow adds $2$ writes.
  • Calculations: $$\text{Total Daily Writes} = N \times (2 + 3 \times A) = 10,000,000 \times (2 + 3 \times 10) = 320,000,000\text{ writes/day}$$ $$\text{Average Write Throughput} = \frac{320,000,000\text{ writes}}{86,400\text{ seconds}} \approx 3,703\text{ writes/sec}$$ $$\text{Peak Write Throughput (3x multiplier)} \approx 11,100\text{ writes/sec}$$

A single PostgreSQL database instance will struggle to sustain 11,000 write operations per second on spinning disk setups. To scale this write throughput:

  • We partition the workflow_activity_tasks and workflow_history tables by hashing the workflow_id. This allows us to scale database writes horizontally across $16$ database shards, isolating lock contention.
  • We batch database writes by committing history logs in groups of 10–50 events using thread-local flush buffers.

2. History Log Storage Capacity

  • Assumptions:

    • Each workflow execution logs $22$ history events (1 start, 10 scheduled, 10 completed, 1 finished).
    • Average JSON size of a single history event log ($E$) = $250$ bytes.
  • Calculations: $$\text{Daily Event Count} = 10,000,000 \times 22 = 220,000,000\text{ events/day}$$ $$\text{Daily Storage Growth} = 220,000,000 \times 250\text{ bytes} = 55,000,000,000\text{ bytes} = 55\text{ GB/day}$$

Over a year, this accumulates 20 TB of raw events. To prevent database degradation:

  • Completed workflows are moved out of PostgreSQL to cold storage (e.g., compressed JSON files in Amazon S3) after $14$ days.
  • The history logs table is partitioned by month. Once a month partition has no active workflows, it is archived to Parquet format on S3 and dropped from PostgreSQL.

3. Timer Engine Polling Frequency

  • Assumptions:

    • Total pending timers ($T_{active}$) = $100,000$
    • Timer Check Interval ($I$) = $1$ second
  • Calculations:

    • A cron sweeper polls the workflow_timers table every second.
    • To prevent polling bottlenecks, the query must target a partial index filtered on status = 'PENDING'.
    • If 5% of active timers fire within the same minute, the database must transition 5,000 states from PENDING to FIRED. This operation is fast because it involves single-column updates via primary keys.

Failure Scenarios and Resilience

Workflow orchestrators are designed to survive the failure of downstream networks and worker processes.

1. Worker Node Crash Mid-Activity

A worker claims a task, begins executing a payment API call, and then immediately crashes or loses power.

  • The Threat: The task remains in a RUNNING status indefinitely, halting workflow execution.
  • Resilience Design:
    • We lock activity tasks using a lease duration (e.g., locked_until = NOW() + INTERVAL '30 seconds').
    • A background sweeper thread constantly scans workflow_activity_tasks for items where status = 'RUNNING' and locked_until < NOW().
    • When a lease expires, the sweeper marks the task back to PENDING and increments the attempt count. Another worker can then claim and execute the activity.

2. Timer Engine Scanner Failures

If the background process responsible for scanning due timers crashes, workflows won't resume on schedule.

  • The Threat: Timeouts, delayed retries, and scheduled executions halt completely, causing delayed orders.
  • Resilience Design:
    • We run a cluster of timer workers coordinated via distributed locks (e.g., Consul or Redis).
    • The database is split into time ranges, and each worker is assigned a specific partition bucket.
    • If a timer worker node dies, its lease on the partition expires, and another worker claims the partition, ensuring that due timers are swept and fired.

3. Duplicate External Signals

A payment gateway web-hook fires twice due to a network retry, sending identical signals to a waiting workflow.

  • The Threat: The workflow might process the payment confirmation twice, progressing the state machine twice or triggering duplicate ship requests.
  • Resilience Design:
    • We enforce deduplication keys on incoming signals. Each signal payload must include a unique signalId.
    • The orchestrator checks if the signalId has already been processed for that specific workflow_id inside the workflow_signals table. Duplicate matches are discarded.

4. Compensation Activity Failures

During a compensation run, the worker executing the rollback step (e.g., calling refund_payment) crashes or the payment gateway returns a 500 error.

  • The Threat: The system is left in a partially rolled back state, causing data inconsistency and financial imbalance.
  • Resilience Design:
    • Compensations are treated as standard workflow activities, meaning they are bound to the same retry policies and backoff controls.
    • If a compensation activity fails, the orchestrator retries it up to the maximum attempt limit.
    • If it fails terminally, the orchestrator creates a record in workflow_manual_tasks, changes the workflow status to WAITING_FOR_MANUAL_ACTION, and alerts operations team members via PagerDuty.

5. Thundering Herd of Reconnecting Workers

If the connection between the activity worker pool and the orchestrator breaks, workers will repeatedly attempt to reconnect. When the database connection is restored, thousands of workers will poll the database at the same time, saturating database pools and causing thread exhaustion.

  • Mitigation: The polling API implements a token-bucket rate limiter. Furthermore, workers use randomized jitter on their polling intervals so that request arrivals are distributed evenly over time, preventing database CPU spikes.

Architectural Trade-offs

Designing a workflow engine requires selecting the execution state style and deciding between a centralized or decentralized topology.

Trade-off 1: Command/State-Machine Style vs. Event-History Replay (Temporal Model)

In the Command style, the orchestrator stores the current state in a database field. In the Event-History model, the orchestrator records every history event and reconstructs the workflow state on-the-fly by executing the workflow code and replaying those history events.

Aspect Command/State-Machine Style Event-History Replay (Temporal Model)
Complexity Low (Basic database updates and state flags) High (Requires sandboxed determinism in worker SDKs)
Branching Flexibility Complex (Workflow diagrams must be drawn in a UI) Unlimited (Workflows are written in code with standard loops/ifs)
Schema Evolutions Hard (Mutating state models breaks active runs) Very Hard (Changes to code must use explicit versioning calls)
Operational Inspectability High (Simple SQL query reveals the current state) Medium (Must parse history logs to reconstruct the current view)
Worker Resource Needs Low (Workers just execute single activities) High (Workers must download and replay history from start)

Trade-off 2: Orchestration (Centralized) vs. Choreography (Decentralized)

Orchestration uses a central coordinator to guide execution. Choreography relies on event-driven microservices subscribing to event buses and reacting independently.

Aspect Centralized Orchestration Decentralized Choreography
Process Visibility Single source of truth; status is easily queryable. Stale; must query multiple event logs and microservices.
Coupling High (Coordinator knows about all services) Low (Services only know about the event broker)
Error Handling & Sagas Easy (Coordinator executes explicit compensations) Hard (Requires cascading event chains to roll back state)
Latency Higher (Centralized hops and state writes) Lower (Immediate peer-to-peer event routing)

Staff Engineer Perspective

Operating a workflow engine at high-throughput levels exposes operational pitfalls that simple designs ignore.


Verbal Script

Interviewer: "How would you design a workflow orchestration platform that ensures exactly-once execution of business processes during worker failures?"

Candidate: "In a distributed system, achieving absolute 'exactly-once' execution on the network is mathematically impossible due to the Two Generals' Problem. If a worker process executes a payment and crashes before reporting success to the orchestrator, the orchestrator cannot know if the payment went through. Therefore, we design for at-least-once orchestration paired with idempotent activities.

To implement this, the orchestrator generates a unique execution trace for every activity task, incorporating the workflowId and a monotonically increasing taskId. When the worker executes the activity, it must forward this trace ID as the idempotency key to the target downstream service (like Stripe). If a worker crashes and the orchestrator schedules the task again, the next worker uses the same idempotency key, allowing the downstream system to identify the duplicate call and return the cached result safely without executing the transaction again."

Interviewer: "What is your data archiving strategy when the workflow history table reaches billions of rows?"

Candidate: "We use a multi-tiered archiving approach. Since active workflows only need their history during execution and replay, we keep only active histories in our hot PostgreSQL instance.

First, we partition the history table by month. Once a month partition has no active workflows (e.g., 90 days after the month closes), we run an ETL job that extracts those histories, converts them into compressed Parquet files, and writes them to Amazon S3.

Second, we delete the partition from the database. This keeps our PostgreSQL index sizes small enough to fit in memory, ensuring search lookups for active workflows remain fast. If an operator needs to view the audit logs of an archived workflow, the admin dashboard queries S3 via Amazon Athena, avoiding database performance penalties."

Interviewer: "How would you handle a scenario where a workflow contains a loop that generates millions of history events, potentially exceeding database limits?"

Candidate: "This is known as the History Size Bloat problem. If a workflow runs for months and generates millions of events, replaying the entire history to reconstruct the state becomes a major bottleneck.

To prevent this, we implement workflow checkpointing or resetting (Continue-As-New). When the history event count for a single workflow execution exceeds a defined limit (e.g., 10,000 events), the orchestrator triggers a ContinueAsNew instruction. It takes the current accumulated state of the workflow, writes a new workflow execution record with a fresh history log, feeds the state as the starting input, and terminates the old execution. This resets the history event count back to zero, protecting both our database storage and the worker replay latency."

Interviewer: "How would you ensure resource isolation and fairness so that a single misconfigured workflow type from one tenant doesn't starve tasks from other tenants?"

Candidate: "We implement Multi-Queue Scheduling with Token Bucket rate limits at the Orchestrator layer.

Instead of pushing all tasks into a single database table shared by all clients, we assign activities to queues grouped by tenantId and workflowType. The polling gateway uses a weighted fair-queuing scheduler.

When workers request tasks, the scheduler retrieves tasks from queues proportionally based on allocated tenant quotas. Furthermore, if a specific tenant's task rate spikes beyond their contractual SLA, our token-bucket filter rejects new workflow starts with an HTTP 429 status code. This preserves the database CPU capacity and ensures that other tenant workflows execute within their expected P99 latency limits."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.