In a traditional monolithic architecture, maintaining data consistency is a solved problem. You wrap your database operations in a single local database transaction, and the database engine guarantees ACID (Atomicity, Consistency, Isolation, Durability) compliance. If any operation fails, the database rollback mechanism instantly reverts all changes, leaving the system in a clean, consistent state.
In a microservices architecture, this single transaction boundary disappears. A single business process, such as placing an order, often spans multiple services: an Order Service to write order records, an Inventory Service to reserve items, a Payment Service to charge credit cards, and a Delivery Service to dispatch courier drivers. Each service has its own private, isolated database.
Attempting to enforce consistency using distributed database transactions (such as Two-Phase Commit, or 2PC) at scale introduces major performance bottlenecks. 2PC holds row locks on every participating database until all participants have voted and the coordinator has committed, which reduces write throughput, increases latency, and makes the coordinator a single point of failure. The Saga pattern is a widely used alternative for maintaining eventual consistency across microservices without distributed locks.
This system design guide details the architectural blueprint for designing a resilient, high-throughput Saga coordination platform capable of managing 5,000 transactions per second.
System Requirements
To design an enterprise-grade Saga coordination system, we divide our requirements into functional capabilities, non-functional operational limits, and scale assumptions.
Functional Requirements
- Sequence Coordination: Execute multi-step business transactions across isolated databases in a defined order.
- Compensating Transactions: Automatically trigger compensating (rollback) tasks in reverse chronological order when any step in the Saga encounters a terminal failure.
- Durable State Storage: Maintain the lifecycle state and execution history of every active Saga instance.
- Asynchronous Execution: Decouple participant invocations using messaging queues to prevent blocking connection pools.
- Idempotent Retries: Provide execution frameworks to retry failed participants and compensations safely without side effects.
- Queryable Audits: Expose the execution timeline and state of active and completed Sagas for operational troubleshooting.
Non-Functional Requirements
- Eventual Consistency: Accept that intermediate states will be temporarily inconsistent across databases, but guarantee that the system resolves to a final consistent state.
- High Write Throughput: The coordination store must handle thousands of state transitions per second without locking tables.
- Fail-Safe Recovery: If the coordinator process crashes, it must recover the state of in-progress Sagas from the database and resume executions without missing steps.
- No Single Point of Failure (SPOF): Participant microservice failures must not stall unrelated Sagas or degrade overall system availability.
Scale Assumptions
- Throughput: 5,000 Saga executions started per second at peak.
- Average Steps: Each Saga involves 4 distinct microservice participants (e.g., Order, Inventory, Payment, Delivery).
API Design and Service Contracts
A Saga orchestrator coordinates execution using command/reply queues, while choreography systems utilize shared event channels.
1. Inbound Saga Trigger (POST /v1/sagas/orders)
Invoked by client systems to start an order placement Saga.
Request Payload:
{
"sagaType": "order_placement",
"customerId": "cust_uuid_99812",
"items": [
{ "itemId": "item_8819", "quantity": 2 }
],
"paymentMethodId": "pm_5521",
"amountCents": 9900
}
Response Payload (202 Accepted):
{
"sagaId": "saga_uuid_77182ab",
"status": "PROCESSING",
"startedAt": "2026-06-07T11:48:00Z"
}
2. Participant Command Contract (gRPC Protocol)
The orchestrator dispatches execution commands to participant microservices.
syntax = "proto3";
package codesprintpro.saga.v1;
service SagaParticipantService {
rpc ExecuteStep (StepCommandRequest) returns (StepReplyResponse);
rpc CompensateStep (CompensateCommandRequest) returns (CompensateReplyResponse);
}
message StepCommandRequest {
string saga_id = 1;
string step_name = 2;
string payload_json = 3;
}
message StepReplyResponse {
enum StepStatus {
SUCCESS = 0;
FAILURE = 1;
}
StepStatus status = 1;
string output_json = 2;
string error_message = 3;
}
message CompensateCommandRequest {
string saga_id = 1;
string step_name = 2;
string payload_json = 3;
}
message CompensateReplyResponse {
bool success = 1;
}
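A participant honoring this contract could be sketched in plain Python as follows. The dataclasses mirror the protobuf messages above; the `InventoryParticipant` class, its in-memory dictionaries, and the JSON payload shape are assumptions made for illustration, and no gRPC runtime is involved.

```python
import json
from dataclasses import dataclass


@dataclass
class StepCommandRequest:
    saga_id: str
    step_name: str
    payload_json: str


@dataclass
class StepReplyResponse:
    status: str  # "SUCCESS" or "FAILURE", mirroring the StepStatus enum
    output_json: str = ""
    error_message: str = ""


class InventoryParticipant:
    """Hypothetical participant; state lives in memory for illustration only."""

    def __init__(self, stock):
        self.stock = dict(stock)  # item_id -> available units
        self.reserved = {}        # saga_id -> {item_id: quantity}

    def execute_step(self, req: StepCommandRequest) -> StepReplyResponse:
        if req.saga_id in self.reserved:
            # Duplicate delivery: replay the earlier outcome, change nothing.
            return StepReplyResponse("SUCCESS", json.dumps(self.reserved[req.saga_id]))
        payload = json.loads(req.payload_json)
        items = {i["itemId"]: i["quantity"] for i in payload["items"]}
        for item_id, qty in items.items():
            if self.stock.get(item_id, 0) < qty:
                return StepReplyResponse(
                    "FAILURE", error_message=f"insufficient stock for {item_id}"
                )
        for item_id, qty in items.items():
            self.stock[item_id] -= qty
        self.reserved[req.saga_id] = items
        return StepReplyResponse("SUCCESS", json.dumps(items))

    def compensate_step(self, saga_id: str) -> bool:
        # Releasing is also idempotent: a second call finds nothing to release.
        for item_id, qty in self.reserved.pop(saga_id, {}).items():
            self.stock[item_id] += qty
        return True
```

Keying both operations on `saga_id` is what lets the orchestrator resend a command after a timeout without reserving the same stock twice.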
High-Level Architecture
There are two primary architectures for coordinating a Saga: Choreography and Orchestration.
In Choreography, services listen to Kafka event topics and trigger their local transactions independently. For example, when the Order Service publishes an ORDER_CREATED event, the Inventory Service consumes it, reserves stock, and publishes an INVENTORY_RESERVED event, which the Payment Service consumes.
In Orchestration, a centralized Saga Orchestrator manages the entire lifecycle. It reads the Saga definition, writes the state transitions to the Saga State DB, and pushes step commands into Kafka. Participant services process these commands and return results. If any step fails, the Orchestrator initiates compensating commands in reverse order.
Choreography (Event-Driven Decentralized Flow)
This sequence diagram shows decentralized participant cooperation via an event broker.
sequenceDiagram
autonumber
participant Order as Order Service
participant Inventory as Inventory Service
participant Payment as Payment Service
participant Kafka as Kafka Event Broker
Order->>Order: Write Pending Order
Order->>Kafka: Publish ORDER_CREATED Event
Kafka->>Inventory: Consume ORDER_CREATED Event
Inventory->>Inventory: Reserve Items in DB
Inventory->>Kafka: Publish INVENTORY_RESERVED Event
Kafka->>Payment: Consume INVENTORY_RESERVED Event
Payment->>Payment: Charge Card
Payment->>Kafka: Publish PAYMENT_COMPLETED Event
Kafka->>Order: Consume PAYMENT_COMPLETED Event
Order->>Order: Mark Order as COMPLETED
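The happy path in this diagram can be imitated with a small in-memory event bus. This is only a sketch: `EventBus` is a hypothetical single-threaded stand-in for Kafka, and the handler functions stand in for the three services.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory stand-in for the event broker (hypothetical, single-threaded)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run(self):
        # Deliver events until no service has anything left to react to.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.log.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


bus = EventBus()
orders = {}


def create_order(order_id):              # Order Service: local write, then event
    orders[order_id] = "PENDING"
    bus.publish("ORDER_CREATED", {"orderId": order_id})


def on_order_created(event):             # Inventory Service
    bus.publish("INVENTORY_RESERVED", event)


def on_inventory_reserved(event):        # Payment Service
    bus.publish("PAYMENT_COMPLETED", event)


def on_payment_completed(event):         # Order Service
    orders[event["orderId"]] = "COMPLETED"


bus.subscribe("ORDER_CREATED", on_order_created)
bus.subscribe("INVENTORY_RESERVED", on_inventory_reserved)
bus.subscribe("PAYMENT_COMPLETED", on_payment_completed)

create_order("order_1")
bus.run()
```

No component holds the overall flow; it exists only as the chain of subscriptions, which is why tracing a failure means following events across services.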
Orchestration (Centralized Coordinator Flow)
This diagram illustrates the centralized command coordination and compensating rollback path during a payment failure.
sequenceDiagram
autonumber
participant Client as Checkout Client
participant Orch as Saga Orchestrator
participant State as Saga State DB
participant Inv as Inventory Service
participant Pay as Payment Service (Fails)
Client->>Orch: POST /v1/sagas/orders
Orch->>State: Create Saga Instance & Set status=STARTED
Orch->>Inv: Send Command: RESERVE_STOCK
Inv-->>Orch: Reply: STOCK_RESERVED (Success)
Orch->>State: Log Step 1 Success & Set status=STOCK_RESERVED
Orch->>Pay: Send Command: CHARGE_CARD
Pay-->>Orch: Reply: INSUFFICIENT_FUNDS (Failure)
Orch->>State: Log Step 2 Failure & Set status=COMPENSATING
Note over Orch, Inv: Trigger Rollback Compensation
Orch->>Inv: Send Command: RELEASE_STOCK (Compensate Step 1)
Inv-->>Orch: Reply: STOCK_RELEASED (Success)
Orch->>State: Set status=REJECTED
Orch-->>Client: Return Order Rejected Status
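The control flow in this diagram reduces to a loop over steps with a reverse walk on failure. The sketch below assumes synchronous, in-process callables in place of queue-based commands; `Step`, `SagaRun`, and `run_saga` are illustrative names, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], bool]      # returns True on success
    compensate: Callable[[dict], None]  # undoes a successful action


@dataclass
class SagaRun:
    status: str = "STARTED"
    history: List[str] = field(default_factory=list)


def run_saga(steps: List[Step], ctx: dict) -> SagaRun:
    run = SagaRun()
    completed = []
    for step in steps:
        if step.action(ctx):
            completed.append(step)
            run.history.append(f"{step.name}:SUCCESS")
            continue
        run.history.append(f"{step.name}:FAILED")
        run.status = "COMPENSATING"
        # Undo only the steps that succeeded, newest first.
        for done in reversed(completed):
            done.compensate(ctx)
            run.history.append(f"{done.name}:COMPENSATED")
        run.status = "REJECTED"
        return run
    run.status = "COMPLETED"
    return run


def reserve_stock(c):
    c["stock"] -= 2
    return True


def release_stock(c):
    c["stock"] += 2


def charge_card(c):
    if c["balance_cents"] < c["amount_cents"]:
        return False  # corresponds to INSUFFICIENT_FUNDS in the diagram
    c["balance_cents"] -= c["amount_cents"]
    return True


def refund_card(c):
    c["balance_cents"] += c["amount_cents"]


ctx = {"stock": 5, "balance_cents": 1000, "amount_cents": 9900}
result = run_saga(
    [
        Step("RESERVE_STOCK", reserve_stock, release_stock),
        Step("CHARGE_CARD", charge_card, refund_card),
    ],
    ctx,
)
```

The failed step itself is not compensated, since its local transaction never committed; only `RESERVE_STOCK` is undone.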
Low-Level Design and Schema
For centralized orchestration, the database must track Saga state, participant execution, and history logs. We model this using a PostgreSQL relational database.
-- Tracks overall Saga instance executions
CREATE TABLE saga_instances (
saga_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_type VARCHAR(128) NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'STARTED', -- STARTED, COMPLETED, COMPENSATING, REJECTED, FAILED, FAILED_COMPENSATION
input_payload JSONB NOT NULL,
current_step_index INT NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ
);
CREATE INDEX idx_saga_instances_status
ON saga_instances (status, created_at);
-- Tracks participant steps within a Saga instance
CREATE TABLE saga_steps (
step_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_id UUID NOT NULL REFERENCES saga_instances(saga_id) ON DELETE CASCADE,
step_name VARCHAR(128) NOT NULL,
step_index INT NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, SUCCESS, FAILED, COMPENSATED
input_payload JSONB,
output_payload JSONB,
error_message TEXT,
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT uk_saga_step UNIQUE (saga_id, step_index)
);
CREATE INDEX idx_saga_steps_lookup
ON saga_steps (saga_id, step_index ASC);
-- Outbox logs for publishing events atomically
CREATE TABLE transactional_outbox (
event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(128) NOT NULL,
aggregate_id VARCHAR(128) NOT NULL,
event_type VARCHAR(128) NOT NULL,
payload JSONB NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'PENDING', -- PENDING, PUBLISHED, FAILED
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
processed_at TIMESTAMPTZ
);
CREATE INDEX idx_outbox_pending
ON transactional_outbox (status, created_at)
WHERE status = 'PENDING';
Schema Rationale & Index Optimization
- uk_saga_step: Enforces uniqueness on (saga_id, step_index). This rejects a second step record for the same index when participant timeouts cause retried commands to log the step again.
- idx_outbox_pending: A partial index restricted to status = 'PENDING'. A polling outbox publisher (such as a custom tailing daemon) continuously queries this table for unpublished rows; log-based tools such as Debezium read the write-ahead log instead and do not depend on this index. Because published rows drop out of the partial index, it stays small and the polling query stays cheap.
- ON DELETE CASCADE: Deleting a finished Saga instance also deletes its child step rows in the same statement, which simplifies purging old records.
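The outbox flow can be tried end to end with a polling publisher. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL, with a trimmed-down version of the schema; `start_saga`, `publish_pending`, and the `send` callback are hypothetical names.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE saga_instances (saga_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE transactional_outbox (
    event_id   TEXT PRIMARY KEY,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'PENDING',
    created_at INTEGER NOT NULL
);
CREATE INDEX idx_outbox_pending ON transactional_outbox (status, created_at)
    WHERE status = 'PENDING';
""")


def start_saga(saga_id, seq):
    # One local transaction: the state row and the outbox row commit together.
    with conn:
        conn.execute("INSERT INTO saga_instances VALUES (?, 'STARTED')", (saga_id,))
        conn.execute(
            "INSERT INTO transactional_outbox (event_id, event_type, payload, created_at) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), "SAGA_STARTED", json.dumps({"sagaId": saga_id}), seq),
        )


def publish_pending(send, batch_size=100):
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM transactional_outbox "
        "WHERE status = 'PENDING' ORDER BY created_at LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        # At-least-once: a crash after send but before the UPDATE repeats the event.
        send(event_type, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE transactional_outbox SET status = 'PUBLISHED' WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)
```

Because delivery is at-least-once, consumers of these events still need to deduplicate, for example on `event_id`.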
Scaling Challenges and Capacity Estimation
Coordinating distributed transactions at 5,000 TPS requires evaluating write IOPS, event size growth, and database sharding.
1. Database Write IOPS Capacity
- Assumptions:
  - Peak Saga Rate ($S$) = $5,000$ Sagas/sec
  - Average Steps per Saga = $4$ steps
  - Each step writes $2$ database rows (1 to insert the step log, 1 to update the Saga instance status). Starting and completing the Saga adds $2$ writes.
- Calculations:
  $$\text{Total Write Operations Per Second} = S \times (2 + 2 \times 4) = 5,000 \times 10 = 50,000\text{ writes/sec}$$
A single relational database primary is unlikely to sustain 50,000 durable write operations per second on typical hardware. To scale this write throughput:
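- We shard the saga_instances and saga_steps tables by hashing the saga_id across $16$ database instances.
- We utilize an in-memory key-value store (e.g., Redis or keyed state inside Flink) to manage active, hot Saga states, flushing logs to PostgreSQL asynchronously using a Kafka outbox queue. Asynchronous flushing trades durability for throughput: state that has not yet reached PostgreSQL can be lost in a crash, so the hot store must itself be persisted or replicated to meet the fail-safe recovery requirement.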
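The arithmetic and the shard routing can be checked with a few lines of Python. The constants come from the assumptions above; the choice of SHA-256 and the `shard_for` helper are illustrative, and any stable hash would serve.

```python
import hashlib

PEAK_SAGAS_PER_SEC = 5_000
STEPS_PER_SAGA = 4
WRITES_PER_STEP = 2    # insert step log + update Saga instance status
LIFECYCLE_WRITES = 2   # start + completion
SHARD_COUNT = 16

writes_per_saga = LIFECYCLE_WRITES + WRITES_PER_STEP * STEPS_PER_SAGA
total_writes_per_sec = PEAK_SAGAS_PER_SEC * writes_per_saga
writes_per_shard = total_writes_per_sec / SHARD_COUNT


def shard_for(saga_id: str) -> int:
    """Map a saga_id to a shard with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.sha256(saga_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT
```

With an even key distribution, each of the 16 shards would see roughly 3,125 writes per second. Because both tables are routed by `saga_id`, a Saga instance and its step rows land on the same shard and can still be updated in one local transaction.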
2. Event Broker Network Bandwidth
- Assumptions:
  - Peak Saga Rate = $5,000$ Sagas/sec
  - Average message payload size ($M$) = $1$ KB (contains step payload + trace contexts)
- Calculations:
  - Each Saga produces 1 start message, 4 command messages, 4 reply messages, and 1 completion message. Total events per Saga = $10$ events.
  $$\text{Total Broker Messages Per Second} = 5,000 \times 10 = 50,000\text{ messages/sec}$$
  $$\text{Network Bandwidth} = 50,000 \times 1\text{ KB} = 50,000\text{ KB/sec} = 50\text{ MB/sec} \approx 400\text{ Mbps}$$
This volume of network throughput requires configuring Kafka partitions. We run a Kafka cluster with at least $3$ broker nodes, dividing the order topics into $24$ partitions to spread the network load evenly.
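As a rough check on that partition count, dividing the totals above evenly across the $24$ partitions gives the per-partition load. This assumes messages are keyed so that they spread uniformly, and it ignores replication traffic between brokers:
$$\frac{50,000\text{ messages/sec}}{24} \approx 2,083\text{ messages/sec per partition}$$
$$\frac{50\text{ MB/sec}}{24} \approx 2.08\text{ MB/sec per partition}$$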
Failure Scenarios and Resilience
Saga platforms must recover safely from process crashes and network partitions without generating transaction discrepancies.
1. Saga Orchestrator Crash Recovery
The orchestrator server crashes after sending a CHARGE_CARD command but before receiving the gateway reply.
- The Threat: When the orchestrator recovers, it does not know if the card was charged, potentially leading to stuck orders or duplicate billing.
- Resilience Design:
  - When the orchestrator restarts, it scans the saga_instances table for Sagas in a non-terminal state (status = 'STARTED' or status = 'COMPENSATING').
  - It reads the saga_steps history logs to reconstruct the execution state.
  - To verify the state of the in-progress step, it queries the participant service using the saga_id to get the outcome, resuming execution from that point forward.
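The recovery pass can be sketched as a pure function over the recovered state. Everything here is illustrative: plain dictionaries stand in for the saga_instances and saga_steps tables, `query_participant` is a hypothetical status lookup, and the action labels are invented for the example. Only Sagas found in STARTED are handled; resuming an interrupted compensation would follow the same shape.

```python
def recover(sagas, steps, step_order, query_participant):
    """Decide the next action for each in-progress Saga after a restart.

    sagas: {saga_id: status}
    steps: {saga_id: {step_name: "PENDING" | "SUCCESS" | "FAILED"}}
    query_participant(saga_id, step_name) -> "SUCCESS" | "FAILED" | "UNKNOWN"
    """
    plan = {}
    for saga_id, status in sagas.items():
        if status != "STARTED":
            continue
        log = steps.setdefault(saga_id, {})
        plan[saga_id] = ("COMPLETE", None)  # overwritten unless every step succeeded
        for name in step_order:
            state = log.get(name)
            if state == "PENDING":
                # Sent but never acknowledged: ask the participant for the outcome.
                state = query_participant(saga_id, name)
                if state != "UNKNOWN":
                    log[name] = state
            if state == "SUCCESS":
                continue
            if state == "FAILED":
                plan[saga_id] = ("COMPENSATE", name)
            elif state == "UNKNOWN":
                plan[saga_id] = ("RETRY_QUERY", name)
            else:
                plan[saga_id] = ("DISPATCH", name)  # step was never sent
            break
    return plan
```

The key property is that a PENDING step is never re-dispatched blindly; its outcome is confirmed first.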
2. Participant Microservice Timeouts
The orchestrator sends a RESERVE_STOCK command to the Inventory Service, but the network connection drops and the request times out.
- The Threat: We do not know if the inventory was reserved, potentially leading to orphan stock locks.
- Resilience Design:
  - The orchestrator must never assume failure and trigger compensations immediately.
  - It queries the Inventory Service using the saga_id to check the lock status.
  - If the reservation was written, the orchestrator updates its state to success. If it was not, the orchestrator retries the request. If the inventory service returns a terminal error, the orchestrator initiates compensations.
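A minimal sketch of that decision loop follows. The `check_status` and `send_command` callbacks, the status strings, and the `ESCALATE` outcome for exhausted retries are assumptions for illustration; the text above does not specify what happens when retries run out.

```python
def resolve_timeout(saga_id, check_status, send_command, max_retries=3):
    """Resolve an ambiguous timeout by asking first, retrying second.

    check_status(saga_id) -> "RESERVED" | "NOT_FOUND" | "REJECTED"
    send_command(saga_id) resends the command; it may raise TimeoutError.
    """
    for _ in range(max_retries):
        state = check_status(saga_id)
        if state == "RESERVED":
            return "SUCCESS"
        if state == "REJECTED":
            return "COMPENSATE"
        # NOT_FOUND: the command never landed. Resending is safe only because
        # the participant deduplicates on saga_id.
        try:
            send_command(saga_id)
        except TimeoutError:
            continue
    final = check_status(saga_id)
    return {"RESERVED": "SUCCESS", "REJECTED": "COMPENSATE"}.get(final, "ESCALATE")
```

Compensation is reached only through an explicit terminal answer from the participant, never through the timeout itself.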
3. Out-of-Order Event Deliveries
Due to network routing variances, a PAYMENT_COMPLETED reply arrives at the orchestrator before the INVENTORY_RESERVED confirmation event.
- The Threat: The orchestrator state machine updates the Saga status out of sequence, violating execution constraints.
- Resilience Design:
  - We attach monotonically increasing sequence numbers to Saga events.
  - The orchestrator checks whether the incoming event's sequence number matches the next expected step index.
  - If the event is ahead of the expected sequence, the orchestrator buffers it in a Redis cache and processes it only after the preceding steps complete. An event behind the expected sequence is a duplicate and is dropped.
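The buffering rule can be sketched with a small reorder buffer. In-memory dictionaries stand in for the Redis cache here, sequence numbers are assumed to start at 0 per Saga, and `InOrderBuffer` is a hypothetical name.

```python
class InOrderBuffer:
    """Releases events per Saga strictly in sequence order."""

    def __init__(self):
        self.expected = {}  # saga_id -> next sequence number to apply
        self.pending = {}   # saga_id -> {sequence number: event}

    def accept(self, saga_id, seq, event):
        """Return the events that can now be applied, in order."""
        next_seq = self.expected.get(saga_id, 0)
        if seq < next_seq:
            return []  # already applied: duplicate delivery
        held = self.pending.setdefault(saga_id, {})
        held[seq] = event
        ready = []
        # Drain every consecutive event starting at the expected number.
        while next_seq in held:
            ready.append(held.pop(next_seq))
            next_seq += 1
        self.expected[saga_id] = next_seq
        return ready


buf = InOrderBuffer()
early = buf.accept("saga_1", 1, "PAYMENT_COMPLETED")    # held back
caught_up = buf.accept("saga_1", 0, "INVENTORY_RESERVED")
```

In this usage, `early` is empty because sequence 0 has not arrived, and `caught_up` releases both events in order.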
4. Semantic Lock Conflicts (Lack of Isolation)
Because Saga steps commit local transactions immediately, a second Saga might read intermediate data (e.g., reserving stock that has already been provisionally reserved by a pending Saga).
- The Threat: Over-selling inventory if the first Saga fails and releases the stock.
- Resilience Design:
  - We implement Semantic Locks. When a Saga reserves stock, the Inventory Service updates the row status to PENDING_RESERVED and saves the saga_id.
  - If a second Saga attempts to purchase the same item, the inventory engine blocks it or prompts it to wait.
  - Once the first Saga completes or compensates, the status transitions to COMMITTED or AVAILABLE respectively, preventing double-selling.
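The state transitions of a semantic lock on a single stock unit might look like the sketch below. `InventoryRow` and the `WAIT` and `SOLD` return values are illustrative; in a real service the check and the update would run inside one local database transaction.

```python
class InventoryRow:
    """One sellable unit of stock, guarded by a semantic lock."""

    def __init__(self):
        self.status = "AVAILABLE"
        self.saga_id = None


def reserve(row, saga_id):
    if row.status == "AVAILABLE":
        row.status, row.saga_id = "PENDING_RESERVED", saga_id
        return "RESERVED"
    if row.saga_id == saga_id:
        return "RESERVED"  # retry by the lock holder is a no-op
    # Another Saga holds the row: wait while its outcome is open, refuse once sold.
    return "WAIT" if row.status == "PENDING_RESERVED" else "SOLD"


def commit(row, saga_id):
    if row.saga_id == saga_id and row.status == "PENDING_RESERVED":
        row.status = "COMMITTED"


def release(row, saga_id):
    # Compensation path: only the lock holder may free the row.
    if row.saga_id == saga_id and row.status == "PENDING_RESERVED":
        row.status, row.saga_id = "AVAILABLE", None
```

Storing the `saga_id` on the row is what lets the service distinguish a retry from a competing Saga.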
Architectural Trade-offs
Choosing the coordination model and consistency mechanism requires balancing operational overhead against system complexity.
Trade-off 1: Choreography (Decentralized) vs. Orchestration (Centralized)
Decentralized choreography relies on event reaction; centralized orchestration utilizes a dedicated workflow engine.
| Aspect | Choreography (Decentralized) | Orchestration (Centralized) |
|---|---|---|
| System Coupling | Low (Services only know about the event broker) | High (Orchestrator coordinates all participants) |
| Operational Visibility | Low (Must trace events across multiple logs) | High (Single database tracks execution state) |
| Debugging Complexity | High (Hard to recreate failure event chains) | Low (Saga state logs show the exact failure point) |
| Single Point of Failure | Low (No central coordinator node) | High (If orchestrator DB fails, all Sagas halt) |
| Circular Dependencies | High risk (Services can easily create event loops) | Low (Orchestrator enforces a linear execution graph) |
Trade-off 2: Two-Phase Commit (2PC) vs. Saga Pattern
2PC enforces strong consistency through a coordinator that holds distributed locks until every participant has voted; Saga guarantees eventual consistency using local transactions and compensating actions.
| Aspect | Two-Phase Commit (2PC) | Saga Pattern |
|---|---|---|
| Consistency Model | Strong Consistency (ACID) | Eventual Consistency (BASE) |
| Write Throughput | Low (Blocked by network locks and disk writes) | High (Local database transactions execute immediately) |
| Resource Locking | High (Locks rows until all nodes confirm success) | Low (No global database locks are held) |
| Failure Behavior | Aborts atomically; participants that have voted can block if the coordinator fails | Compensates asynchronously; intermediate states are visible |
Staff Engineer Perspective
Designing distributed state machines requires implementing strict safety barriers at the database layer.
Verbal Script
Interviewer: "How would you handle a scenario in an orchestration-based Saga where the orchestrator crashes immediately after sending a charge request to the Payment Service, and the service succeeds in charging the user?"
Candidate: "We solve this by separating command routing from execution confirmation, using idempotency keys and asynchronous orchestrator state recovery.
First, when the orchestrator initiates the payment step, it generates a unique transaction token (e.g., combining saga_id and the step name) and saves it to the saga_steps log as PENDING. It then passes this token as the idempotency key in the header to the Payment Service.
Second, when the orchestrator restarts, it scans the state database for Sagas in the STARTED status. It reads the history logs and finds that the payment step was sent but never acknowledged. Instead of retrying the charge or executing compensation immediately, the orchestrator queries the Payment Service using the transaction token to verify the status.
Because the Payment Service is idempotent, it returns a status of CAPTURED and the transaction details. The orchestrator updates its state database, logs the step as success, and proceeds to the next step, ensuring that the customer is not charged twice and the transaction completes safely."
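The idempotency-key behavior in this answer can be sketched as follows. `PaymentService` here is a hypothetical in-memory stand-in, and the key format is one possible way of combining saga_id and step name.

```python
class PaymentService:
    """Hypothetical payment participant that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> recorded outcome
        self.total_captured_cents = 0

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self.charges:
            # Replay the stored outcome instead of charging again.
            return self.charges[idempotency_key]
        self.total_captured_cents += amount_cents
        outcome = {"status": "CAPTURED", "amountCents": amount_cents}
        self.charges[idempotency_key] = outcome
        return outcome

    def get_status(self, idempotency_key):
        # Used by the orchestrator during recovery; has no side effects.
        return self.charges.get(idempotency_key, {"status": "NOT_FOUND"})


def make_idempotency_key(saga_id, step_name):
    return f"{saga_id}:{step_name}"


payments = PaymentService()
key = make_idempotency_key("saga_uuid_77182ab", "CHARGE_CARD")
payments.charge(key, 9900)           # original request; the orchestrator then crashes
recovered = payments.get_status(key)  # recovery query after restart
payments.charge(key, 9900)           # even a blind retry would not double-charge
```

After both calls the captured total is still 9,900 cents, which is the guarantee the recovery path relies on.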
Interviewer: "What is your strategy for handling out-of-order execution if a compensation message is received by a participant before the original command message?"
Candidate: "This is known as the Out-of-Order Message Arrival problem, and it occurs when networks partition or route messages via divergent paths. If the participant receives a compensation command (e.g., RELEASE_STOCK) before the original command (RESERVE_STOCK), executing the compensation first would result in a no-op because no stock is currently reserved. When the original reservation command arrives later, it would reserve the stock, leaving it locked indefinitely.
To prevent this, the participant database must maintain a Saga Execution Ledger. When the compensation command arrives first, the participant writes a row to its local table: (saga_id, status = COMPENSATED).
When the original reservation command arrives later, the participant checks the ledger first. Finding that the Saga has already been compensated, the participant discards the reservation write immediately, returning a success status to the orchestrator. This prevents the lock from being written and ensures consistency."
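A minimal sketch of such a ledger, assuming an in-memory participant: `StockParticipant` and its fields are illustrative names, and the ledger would be a table in the participant's own database in practice.

```python
class StockParticipant:
    """Hypothetical participant with a per-Saga execution ledger."""

    def __init__(self, available):
        self.available = available
        self.ledger = {}  # saga_id -> "RESERVED" | "COMPENSATED"
        self.held = {}    # saga_id -> reserved quantity

    def reserve(self, saga_id, qty):
        state = self.ledger.get(saga_id)
        if state == "COMPENSATED":
            # The compensation arrived first: discard the late reservation.
            return "SUCCESS"
        if state == "RESERVED":
            return "SUCCESS"  # duplicate command
        if self.available < qty:
            return "FAILURE"
        self.available -= qty
        self.held[saga_id] = qty
        self.ledger[saga_id] = "RESERVED"
        return "SUCCESS"

    def release(self, saga_id):
        # Record the compensation even when there is nothing to undo yet.
        self.available += self.held.pop(saga_id, 0)
        self.ledger[saga_id] = "COMPENSATED"
        return True


inventory = StockParticipant(available=5)
inventory.release("saga_1")      # compensation overtakes the command
inventory.reserve("saga_1", 2)   # late command is acknowledged but ignored
```

The ledger row written by `release` acts as a tombstone, so the stock count stays at 5 after the late command.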
Interviewer: "How would you prevent data inconsistency if a compensating transaction fails permanently due to a terminal error in a third-party API?"
Candidate: "If a compensating transaction fails terminally (e.g., a payment refund is rejected by the acquiring bank due to account closure), the automated system cannot reconcile the transaction.
First, we do not let the orchestrator retry indefinitely, as this ties up worker threads. The orchestrator halts automated retries after a configured limit (e.g., 5 attempts).
Second, the orchestrator writes a record to the transactional_outbox indicating a terminal reconciliation failure, updating the Saga instance status to FAILED_COMPENSATION.
Finally, a background monitor consumes this outbox message, triggers an immediate high-priority alert to our operational team via PagerDuty, and logs the case details in our manual queue dashboard. The operations team can then manually review the case and execute a manual wire transfer or write-off, ensuring audit compliance."
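The retry limit and hand-off might be sketched like this. A plain list stands in for the transactional_outbox table, the `COMPENSATION_FAILED` event type is a hypothetical name, and backoff between attempts is omitted for brevity.

```python
def compensate_with_limit(saga, compensate, outbox, max_attempts=5):
    """Retry a compensation a bounded number of times, then escalate.

    saga: dict with "saga_id" and "status"; compensate(saga_id) raises on failure.
    Returns the number of attempts made.
    """
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate(saga["saga_id"])
        except Exception as exc:
            last_error = str(exc)
            continue
        saga["status"] = "REJECTED"  # compensation finished
        return attempt
    # Retries exhausted: record the terminal failure for manual reconciliation.
    saga["status"] = "FAILED_COMPENSATION"
    outbox.append(
        {
            "event_type": "COMPENSATION_FAILED",
            "aggregate_id": saga["saga_id"],
            "payload": {"attempts": max_attempts, "error": last_error},
        }
    )
    return max_attempts
```

In a real orchestrator the status update and the outbox insert would share one database transaction, so the alert cannot be lost if the process dies between them.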