Distributed Transactions: 2PC and 3PC
In a single-instance relational database, achieving ACID (Atomicity, Consistency, Isolation, Durability) guarantees is a solved problem. The database engine simply coordinates locks and writes transactions to its local Write-Ahead Log (WAL).
However, in modern microservices or sharded database architectures, a single logical transaction may span multiple physical database servers. Ensuring that either all of these databases commit their changes or none of them do is the distributed transactions challenge.
Two-Phase Commit (2PC) and Three-Phase Commit (3PC) are the fundamental atomic commit protocols designed to solve this consistency problem. This guide details their protocol phases, mathematical bottlenecks, failure recovery logs, and modern architectural alternatives.
Requirements and System Goals
Implementing distributed atomic commit protocols requires precise trade-offs between consistency, latency, and availability.
1. Functional Requirements
- Atomicity: Guarantee that a transaction involving multiple databases commits completely or aborts completely.
- Consensus Execution: Coordinate participant nodes (databases) to agree on a single outcome (Commit or Abort) even under partial network drops.
- Auto-Recovery: Allow participant nodes to recover to a consistent state after a crash or network partition.
- Isolation Guarantees: Prevent concurrent transactions from seeing intermediate states of in-flight distributed transactions.
2. Non-Functional Requirements
- Low Latency Overhead: Minimize the network message complexity and round-trips required to reach agreement.
- High Transactional Throughput: Minimize the duration for which rows are locked on individual database nodes.
- Partition Resilience: Avoid blocking participant nodes indefinitely if a network partition isolates the transaction coordinator.
- Safety Under Crash Failure: Prevent split-brain consistency bugs where one database commits and another aborts the same transaction.
API Interfaces and Service Contracts
Distributed commit protocols rely on structured message structures exchanged between a central Coordinator and Participants (the database nodes).
1. Coordinator-to-Participant Interface
Every participant node exposes private RPC endpoints (e.g., gRPC or internal tcp sockets) that the coordinator invokes to manage the transaction lifecycle.
syntax = "proto3";
package transaction.v1;
service ParticipantService {
rpc Prepare (PrepareRequest) returns (PrepareResponse);
rpc Commit (CommitRequest) returns (CommitResponse);
rpc Abort (AbortRequest) returns (AbortResponse);
// 3PC Specific intermediate phases
rpc PreCommit (PreCommitRequest) returns (PreCommitResponse);
}
message PrepareRequest {
string transaction_id = 1;
string payload = 2;
int64 timeout_ms = 3;
}
message PrepareResponse {
enum Vote {
VOTE_UNSPECIFIED = 0;
VOTE_COMMIT = 1; // Yes, I can commit
VOTE_ABORT = 2; // No, I must abort
}
Vote vote = 1;
string participant_id = 2;
string error_message = 3;
}
message CommitRequest {
string transaction_id = 1;
}
message CommitResponse {
bool success = 1;
}
message AbortRequest {
string transaction_id = 1;
}
message AbortResponse {
bool success = 1;
}
message PreCommitRequest {
string transaction_id = 1;
}
message PreCommitResponse {
bool success = 1;
}
High-Level Design and Visualizations
Visualizing the state transitions and message pathways is essential to understanding the design limitations of 2PC and 3PC.
1. Two-Phase Commit (2PC): Successful Commit
In a successful 2PC run, the coordinator collects "Yes" votes from all nodes in Phase 1 (Prepare) and issues the "Commit" command in Phase 2.
sequenceDiagram
autonumber
actor Client
participant Coord as Coordinator
participant P1 as Participant Node A
participant P2 as Participant Node B
Client->>Coord: Begin Transaction (TX_101)
Note over Coord, P2: PHASE 1: PREPARE
Coord->>P1: RPC: Prepare(TX_101)
Coord->>P2: RPC: Prepare(TX_101)
Note over P1: Log WAL / Acquire Lock
Note over P2: Log WAL / Acquire Lock
P1-->>Coord: VOTE_COMMIT (Yes)
P2-->>Coord: VOTE_COMMIT (Yes)
Note over Coord, P2: PHASE 2: COMMIT
Coord->>P1: RPC: Commit(TX_101)
Coord->>P2: RPC: Commit(TX_101)
Note over P1: Release Lock / Permanent Commit
Note over P2: Release Lock / Permanent Commit
P1-->>Coord: Success
P2-->>Coord: Success
Coord->>Client: Transaction Committed (200 OK)
2. The 2PC Coordinator Crash Bottleneck (Blocking State)
If the coordinator prepares both nodes successfully but crashes before sending the commit command, the participant nodes are stranded in a blocking state, holding row locks indefinitely.
sequenceDiagram
autonumber
participant Coord as Coordinator
participant P1 as Participant Node A
participant P2 as Participant Node B
Coord->>P1: Prepare(TX_101)
Coord->>P2: Prepare(TX_101)
P1-->>Coord: VOTE_COMMIT (Yes)
P2-->>Coord: VOTE_COMMIT (Yes)
Note over Coord: Coordinator CRASHES!
Note over P1, P2: Both nodes are PREPARED but cannot commit or abort!
Note over P1: Holds row locks indefinitely...
Note over P2: Holds row locks indefinitely...
3. Three-Phase Commit (3PC): The Non-Blocking Design
3PC introduces an intermediate phase, PreCommit, to ensure that if a coordinator crashes, participants can safely make a decision using a consensus timeout rule.
sequenceDiagram
autonumber
participant Coord as Coordinator
participant P1 as Participant Node A
participant P2 as Participant Node B
Note over Coord, P2: PHASE 1: CAN-COMMIT
Coord->>P1: CanCommit(TX_102)
Coord->>P2: CanCommit(TX_102)
P1-->>Coord: VOTE_COMMIT (Yes)
P2-->>Coord: VOTE_COMMIT (Yes)
Note over Coord, P2: PHASE 2: PRE-COMMIT
Coord->>P1: PreCommit(TX_102)
Coord->>P2: PreCommit(TX_102)
P1-->>Coord: Acknowledged
P2-->>Coord: Acknowledged
Note over Coord, P2: PHASE 3: DO-COMMIT
Coord->>P1: DoCommit(TX_102)
Coord->>P2: DoCommit(TX_102)
P1-->>Coord: Committed
P2-->>Coord: Committed
Low-Level Design and Schema Strategies
To ensure that both the coordinator and the participants can recover their state after a hardware reboot, they must write state transitions to a durable, non-volatile Write-Ahead Log (WAL) before responding over the network.
1. Coordinator Transaction Log Schema (WAL)
The coordinator uses this schema to track the global state of every active distributed transaction.
CREATE TABLE coordinator_transaction_log (
transaction_id VARCHAR(64) PRIMARY KEY,
status VARCHAR(32) NOT NULL, -- IN_FLIGHT, PREPARED, COMMITTED, ABORTED, PRECOMMIT
participant_count INT NOT NULL,
registered_participants TEXT[] NOT NULL, -- Array of connection strings
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_coord_tx_status ON coordinator_transaction_log (status);
2. Participant Node Transaction Lock Schema
Each participant database node maintains an active lock registry tracking which local rows are locked by active distributed transactions.
CREATE TABLE participant_lock_registry (
local_lock_id VARCHAR(64) PRIMARY KEY,
transaction_id VARCHAR(64) NOT NULL,
locked_table_name VARCHAR(128) NOT NULL,
locked_row_primary_key VARCHAR(255) NOT NULL,
lock_status VARCHAR(16) NOT NULL DEFAULT 'HELD', -- HELD, RELEASED
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_part_tx_id ON participant_lock_registry (transaction_id);
Scaling and Operational Challenges
Distributed transactions introduce a severe performance penalty that limits horizontal database scalability.
1. The Distributed Lock Latency Bottleneck (Mathematical Proof)
To understand why distributed transactions destroy throughput, let us mathematically model database lock duration.
In a standard local database transaction, the lock duration $T_{\text{local}}$ is simply the time it takes to write to the local storage engine: $$T_{\text{local}} = T_{\text{disk_write}} \approx 1\text{ ms}$$
In a Two-Phase Commit distributed transaction, the locks must be acquired during the Prepare phase and are held until the Commit phase is complete. This span requires multiple network round-trips ($R_{\text{RTT}}$) between the coordinator and all participants:
$$T_{\text{lock_2pc}} = 2 \times R_{\text{RTT}} + 2 \times T_{\text{disk_write}}$$
Let us calculate the throughput impact for two data centers located in different regions where the average network round-trip latency ($R_{\text{RTT}}$) is 50 ms:
$$T_{\text{lock_2pc}} = 2 \times (50\text{ ms}) + 2 \times (1\text{ ms}) = 102\text{ ms}$$
Compare this to the local transaction: $$\frac{T_{\text{lock_2pc}}}{T_{\text{local}}} = \frac{102\text{ ms}}{1\text{ ms}} = 102 \times \text{ increase in lock duration}$$
Because database locks are held 102 times longer per transaction, the maximum concurrent write throughput for any locked row drops significantly: $$\text{Max Throughput}{\text{local}} = \frac{1}{1\text{ ms}} = 1,000 \text{ writes/sec}$$ $$\text{Max Throughput}{\text{2pc}} = \frac{1}{102\text{ ms}} \approx 9.8 \text{ writes/sec}$$
This drop in throughput is called distributed lock starvation.
Trade-offs and Architectural Alternatives
To scale systems to millions of transactions per second, modern architects prefer asynchronous eventual consistency patterns over blocking distributed locks.
| Consensus Protocol / Pattern | Consistency Model | Availability (CAP Position) | Latency Overhead | Key Trade-off / Deficiency |
|---|---|---|---|---|
| Two-Phase Commit (2PC) | Strong Consistency | CP (Lacks availability if the coordinator crashes) | High ($2 \times R_{\text{RTT}}$ lock duration) | Strictly blocking; poor performance over networks |
| Three-Phase Commit (3PC) | Strong Consistency | CP (Non-blocking under node crashes, but fails during network partitions) | Very High ($3 \times R_{\text{RTT}}$ lock duration) | Vulnerable to split-brain consistency issues under network partitions |
| Saga Pattern (Orchestrated) | Eventual Consistency | AP (Highly available; nodes execute commits locally) | Low (Async steps, no global locks) | No isolation guarantees; requires writing complex rollbacks (compensating transactions) |
| Paxos / Raft (Spanner TrueTime) | Strong Consistency | CP (Highly resilient consensus-based replication) | Medium (Consensus latency, but utilizes atomic clocks for isolation) | Extremely high implementation complexity; requires hardware support (GPS/Rubidium clocks) |
Failure Modes and Fault Tolerance Strategies
Understanding how a system recovers from participant or coordinator failures is essential to operating CP architectures.
1. Coordinator Recovery Walkthrough
When a crashed coordinator reboots, it must parse its transaction log to resolve incomplete transactions:
- Recovery Agent Scan: The recovery process scans the
coordinator_transaction_logfor transactions in thePREPAREDorPRECOMMITstates. - Resolution Query: For every unresolved transaction, the coordinator queries the participants to check their local status.
- Decision Path:
- If any participant node reports that it aborted the transaction, the coordinator writes
ABORTEDto its log and broadcasts anAbortcommand to all nodes. - If all participant nodes report they are
PREPARED, and the coordinator log shows it had collected all votes but failed to write commit, it writesCOMMITTEDand broadcastsCommit.
- If any participant node reports that it aborted the transaction, the coordinator writes
2. Participant Node Recovery Timeout Rules in 3PC
3PC introduces a timeout rule to resolve the blocking coordinator crash bug:
- PreCommit Timeout: If a participant node is in the
PREPAREDstate and has received aPreCommitinstruction, but the coordinator fails to send the finalDoCommitcommand before a timeout threshold:- The Assumption: Since a
PreCommitwas issued, all other nodes must have voted "Yes" in the initialCanCommitphase. - The Action: The participants form a cooperative recovery group, communicate with one another, and safely elect to Commit the transaction.
- The Assumption: Since a
- CanCommit Timeout: If the node has not received a
PreCommitcommand, it is safe to Abort the transaction because no node could have committed yet.
Staff Engineer Perspective
Verbal Script
Interviewer: "Why would you choose the Saga Pattern over Two-Phase Commit (2PC) for a distributed checkout transaction spanning a Payment service, Inventory service, and Shipping service?"
Candidate: "While 2PC provides strong acid guarantees out of the box, it is a poor fit for microservices checkout flows due to its blocking nature and high latency overhead.
In a checkout transaction, if we use 2PC, the Payment database, Inventory database, and Shipping database must all acquire local database locks during the Prepare phase. These locks must be held synchronously across multiple network round-trips while the Coordinator orchestrates the handshake.
If the user's connection is slow, or if the Shipping service takes 500ms to calculate delivery routes, those database rows remain locked. Under high traffic, this will instantly saturate connection pools, spike database CPU to 100%, and cause a cascading outage across all three services.
Furthermore, if the Coordinator crashes after the Prepare phase, the databases are left in an unresolved, blocked state. They are legally unable to release their locks because they do not know if the other nodes committed or aborted. This requires human intervention to resolve.
To build a resilient checkout system, I would choose an Orchestrated Saga Pattern. Instead of holding global locks, each service commits its local transaction immediately:
First, the Checkout Orchestrator issues a command to the Inventory service to deduct items. The Inventory service writes to its local database and releases locks immediately.
Second, the Orchestrator calls the Payment service. If the payment succeeds, the transaction progresses to the Shipping service.
If the payment fails (for example, due to insufficient funds), we handle this eventual consistency mismatch by executing Compensating Transactions. The Orchestrator issues a rollback command to the Inventory service to increment the inventory back to its original state.
This approach replaces synchronous network blocking with asynchronous, localized commits. It improves write throughput from a few dozen transactions per second under 2PC to thousands of operations per second, ensuring the checkout flow remains resilient and available even if individual downstream services experience transient failures."