System Design: Distributed Transactions (2PC and 3PC)

Distributed Transactions: 2PC and 3PC

In a single-instance relational database, achieving ACID (Atomicity, Consistency, Isolation, Durability) guarantees is a solved problem. The database engine simply coordinates locks and writes transactions to its local Write-Ahead Log (WAL).

However, in modern microservices or sharded database architectures, a single logical transaction may span multiple physical database servers. Ensuring that either all of these databases commit their changes or none of them do is the distributed transactions challenge.

Two-Phase Commit (2PC) and Three-Phase Commit (3PC) are the fundamental atomic commit protocols designed to solve this consistency problem. This guide details their protocol phases, mathematical bottlenecks, failure recovery logs, and modern architectural alternatives.

Requirements and System Goals

Implementing distributed atomic commit protocols requires precise trade-offs between consistency, latency, and availability.

1. Functional Requirements

Atomicity: Guarantee that a transaction involving multiple databases commits completely or aborts completely.
Consensus Execution: Coordinate participant nodes (databases) to agree on a single outcome (Commit or Abort) even under partial network drops.
Auto-Recovery: Allow participant nodes to recover to a consistent state after a crash or network partition.
Isolation Guarantees: Prevent concurrent transactions from seeing intermediate states of in-flight distributed transactions.

2. Non-Functional Requirements

Low Latency Overhead: Minimize the network message complexity and round-trips required to reach agreement.
High Transactional Throughput: Minimize the duration for which rows are locked on individual database nodes.
Partition Resilience: Avoid blocking participant nodes indefinitely if a network partition isolates the transaction coordinator.
Safety Under Crash Failure: Prevent split-brain consistency bugs where one database commits and another aborts the same transaction.

API Interfaces and Service Contracts

Distributed commit protocols rely on structured message structures exchanged between a central Coordinator and Participants (the database nodes).

1. Coordinator-to-Participant Interface

Every participant node exposes private RPC endpoints (e.g., gRPC or internal tcp sockets) that the coordinator invokes to manage the transaction lifecycle.

syntax = "proto3";

package transaction.v1;

service ParticipantService {
  rpc Prepare (PrepareRequest) returns (PrepareResponse);
  rpc Commit (CommitRequest) returns (CommitResponse);
  rpc Abort (AbortRequest) returns (AbortResponse);
  
  // 3PC Specific intermediate phases
  rpc PreCommit (PreCommitRequest) returns (PreCommitResponse);
}

message PrepareRequest {
  string transaction_id = 1;
  string payload = 2;
  int64 timeout_ms = 3;
}

message PrepareResponse {
  enum Vote {
    VOTE_UNSPECIFIED = 0;
    VOTE_COMMIT = 1;  // Yes, I can commit
    VOTE_ABORT = 2;   // No, I must abort
  }
  Vote vote = 1;
  string participant_id = 2;
  string error_message = 3;
}

message CommitRequest {
  string transaction_id = 1;
}

message CommitResponse {
  bool success = 1;
}

message AbortRequest {
  string transaction_id = 1;
}

message AbortResponse {
  bool success = 1;
}

message PreCommitRequest {
  string transaction_id = 1;
}

message PreCommitResponse {
  bool success = 1;
}

High-Level Design and Visualizations

Visualizing the state transitions and message pathways is essential to understanding the design limitations of 2PC and 3PC.

1. Two-Phase Commit (2PC): Successful Commit

In a successful 2PC run, the coordinator collects "Yes" votes from all nodes in Phase 1 (Prepare) and issues the "Commit" command in Phase 2.

sequenceDiagram
    autonumber
    actor Client
    participant Coord as Coordinator
    participant P1 as Participant Node A
    participant P2 as Participant Node B
    
    Client->>Coord: Begin Transaction (TX_101)
    
    Note over Coord, P2: PHASE 1: PREPARE
    Coord->>P1: RPC: Prepare(TX_101)
    Coord->>P2: RPC: Prepare(TX_101)
    Note over P1: Log WAL / Acquire Lock
    Note over P2: Log WAL / Acquire Lock
    P1-->>Coord: VOTE_COMMIT (Yes)
    P2-->>Coord: VOTE_COMMIT (Yes)
    
    Note over Coord, P2: PHASE 2: COMMIT
    Coord->>P1: RPC: Commit(TX_101)
    Coord->>P2: RPC: Commit(TX_101)
    Note over P1: Release Lock / Permanent Commit
    Note over P2: Release Lock / Permanent Commit
    P1-->>Coord: Success
    P2-->>Coord: Success
    
    Coord->>Client: Transaction Committed (200 OK)

2. The 2PC Coordinator Crash Bottleneck (Blocking State)

If the coordinator prepares both nodes successfully but crashes before sending the commit command, the participant nodes are stranded in a blocking state, holding row locks indefinitely.

sequenceDiagram
    autonumber
    participant Coord as Coordinator
    participant P1 as Participant Node A
    participant P2 as Participant Node B
    
    Coord->>P1: Prepare(TX_101)
    Coord->>P2: Prepare(TX_101)
    P1-->>Coord: VOTE_COMMIT (Yes)
    P2-->>Coord: VOTE_COMMIT (Yes)
    
    Note over Coord: Coordinator CRASHES!
    
    Note over P1, P2: Both nodes are PREPARED but cannot commit or abort!
    Note over P1: Holds row locks indefinitely...
    Note over P2: Holds row locks indefinitely...

3. Three-Phase Commit (3PC): The Non-Blocking Design

3PC introduces an intermediate phase, PreCommit, to ensure that if a coordinator crashes, participants can safely make a decision using a consensus timeout rule.

sequenceDiagram
    autonumber
    participant Coord as Coordinator
    participant P1 as Participant Node A
    participant P2 as Participant Node B
    
    Note over Coord, P2: PHASE 1: CAN-COMMIT
    Coord->>P1: CanCommit(TX_102)
    Coord->>P2: CanCommit(TX_102)
    P1-->>Coord: VOTE_COMMIT (Yes)
    P2-->>Coord: VOTE_COMMIT (Yes)
    
    Note over Coord, P2: PHASE 2: PRE-COMMIT
    Coord->>P1: PreCommit(TX_102)
    Coord->>P2: PreCommit(TX_102)
    P1-->>Coord: Acknowledged
    P2-->>Coord: Acknowledged
    
    Note over Coord, P2: PHASE 3: DO-COMMIT
    Coord->>P1: DoCommit(TX_102)
    Coord->>P2: DoCommit(TX_102)
    P1-->>Coord: Committed
    P2-->>Coord: Committed

Low-Level Design and Schema Strategies

To ensure that both the coordinator and the participants can recover their state after a hardware reboot, they must write state transitions to a durable, non-volatile Write-Ahead Log (WAL) before responding over the network.

1. Coordinator Transaction Log Schema (WAL)

The coordinator uses this schema to track the global state of every active distributed transaction.

CREATE TABLE coordinator_transaction_log (
    transaction_id VARCHAR(64) PRIMARY KEY,
    status VARCHAR(32) NOT NULL, -- IN_FLIGHT, PREPARED, COMMITTED, ABORTED, PRECOMMIT
    participant_count INT NOT NULL,
    registered_participants TEXT[] NOT NULL, -- Array of connection strings
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_coord_tx_status ON coordinator_transaction_log (status);

2. Participant Node Transaction Lock Schema

Each participant database node maintains an active lock registry tracking which local rows are locked by active distributed transactions.

CREATE TABLE participant_lock_registry (
    local_lock_id VARCHAR(64) PRIMARY KEY,
    transaction_id VARCHAR(64) NOT NULL,
    locked_table_name VARCHAR(128) NOT NULL,
    locked_row_primary_key VARCHAR(255) NOT NULL,
    lock_status VARCHAR(16) NOT NULL DEFAULT 'HELD', -- HELD, RELEASED
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_part_tx_id ON participant_lock_registry (transaction_id);

Scaling and Operational Challenges

Distributed transactions introduce a severe performance penalty that limits horizontal database scalability.

1. The Distributed Lock Latency Bottleneck (Mathematical Proof)

To understand why distributed transactions destroy throughput, let us mathematically model database lock duration.

In a standard local database transaction, the lock duration $T_{\text{local}}$ is simply the time it takes to write to the local storage engine: $$T_{\text{local}} = T_{\text{disk_write}} \approx 1\text{ ms}$$

In a Two-Phase Commit distributed transaction, the locks must be acquired during the Prepare phase and are held until the Commit phase is complete. This span requires multiple network round-trips ($R_{\text{RTT}}$) between the coordinator and all participants:

$$T_{\text{lock_2pc}} = 2 \times R_{\text{RTT}} + 2 \times T_{\text{disk_write}}$$

Let us calculate the throughput impact for two data centers located in different regions where the average network round-trip latency ($R_{\text{RTT}}$) is 50 ms:

$$T_{\text{lock_2pc}} = 2 \times (50\text{ ms}) + 2 \times (1\text{ ms}) = 102\text{ ms}$$

Compare this to the local transaction: $$\frac{T_{\text{lock_2pc}}}{T_{\text{local}}} = \frac{102\text{ ms}}{1\text{ ms}} = 102 \times \text{ increase in lock duration}$$

Because database locks are held 102 times longer per transaction, the maximum concurrent write throughput for any locked row drops significantly: $$\text{Max Throughput}{\text{local}} = \frac{1}{1\text{ ms}} = 1,000 \text{ writes/sec}$$ $$\text{Max Throughput}{\text{2pc}} = \frac{1}{102\text{ ms}} \approx 9.8 \text{ writes/sec}$$

This drop in throughput is called distributed lock starvation.

Trade-offs and Architectural Alternatives

To scale systems to millions of transactions per second, modern architects prefer asynchronous eventual consistency patterns over blocking distributed locks.

Consensus Protocol / Pattern	Consistency Model	Availability (CAP Position)	Latency Overhead	Key Trade-off / Deficiency
Two-Phase Commit (2PC)	Strong Consistency	CP (Lacks availability if the coordinator crashes)	High ($2 \times R_{\text{RTT}}$ lock duration)	Strictly blocking; poor performance over networks
Three-Phase Commit (3PC)	Strong Consistency	CP (Non-blocking under node crashes, but fails during network partitions)	Very High ($3 \times R_{\text{RTT}}$ lock duration)	Vulnerable to split-brain consistency issues under network partitions
Saga Pattern (Orchestrated)	Eventual Consistency	AP (Highly available; nodes execute commits locally)	Low (Async steps, no global locks)	No isolation guarantees; requires writing complex rollbacks (compensating transactions)
Paxos / Raft (Spanner TrueTime)	Strong Consistency	CP (Highly resilient consensus-based replication)	Medium (Consensus latency, but utilizes atomic clocks for isolation)	Extremely high implementation complexity; requires hardware support (GPS/Rubidium clocks)

Failure Modes and Fault Tolerance Strategies

Understanding how a system recovers from participant or coordinator failures is essential to operating CP architectures.

1. Coordinator Recovery Walkthrough

When a crashed coordinator reboots, it must parse its transaction log to resolve incomplete transactions:

Recovery Agent Scan: The recovery process scans the coordinator_transaction_log for transactions in the PREPARED or PRECOMMIT states.
Resolution Query: For every unresolved transaction, the coordinator queries the participants to check their local status.
Decision Path:
- If any participant node reports that it aborted the transaction, the coordinator writes ABORTED to its log and broadcasts an Abort command to all nodes.
- If all participant nodes report they are PREPARED, and the coordinator log shows it had collected all votes but failed to write commit, it writes COMMITTED and broadcasts Commit.

2. Participant Node Recovery Timeout Rules in 3PC

3PC introduces a timeout rule to resolve the blocking coordinator crash bug:

PreCommit Timeout: If a participant node is in the PREPARED state and has received a PreCommit instruction, but the coordinator fails to send the final DoCommit command before a timeout threshold:
- The Assumption: Since a PreCommit was issued, all other nodes must have voted "Yes" in the initial CanCommit phase.
- The Action: The participants form a cooperative recovery group, communicate with one another, and safely elect to Commit the transaction.
CanCommit Timeout: If the node has not received a PreCommit command, it is safe to Abort the transaction because no node could have committed yet.

Staff Engineer Perspective

Verbal Script

Interviewer: "Why would you choose the Saga Pattern over Two-Phase Commit (2PC) for a distributed checkout transaction spanning a Payment service, Inventory service, and Shipping service?"

Candidate: "While 2PC provides strong acid guarantees out of the box, it is a poor fit for microservices checkout flows due to its blocking nature and high latency overhead.

In a checkout transaction, if we use 2PC, the Payment database, Inventory database, and Shipping database must all acquire local database locks during the Prepare phase. These locks must be held synchronously across multiple network round-trips while the Coordinator orchestrates the handshake.

If the user's connection is slow, or if the Shipping service takes 500ms to calculate delivery routes, those database rows remain locked. Under high traffic, this will instantly saturate connection pools, spike database CPU to 100%, and cause a cascading outage across all three services.

Furthermore, if the Coordinator crashes after the Prepare phase, the databases are left in an unresolved, blocked state. They are legally unable to release their locks because they do not know if the other nodes committed or aborted. This requires human intervention to resolve.

To build a resilient checkout system, I would choose an Orchestrated Saga Pattern. Instead of holding global locks, each service commits its local transaction immediately:

First, the Checkout Orchestrator issues a command to the Inventory service to deduct items. The Inventory service writes to its local database and releases locks immediately.

Second, the Orchestrator calls the Payment service. If the payment succeeds, the transaction progresses to the Shipping service.

If the payment fails (for example, due to insufficient funds), we handle this eventual consistency mismatch by executing Compensating Transactions. The Orchestrator issues a rollback command to the Inventory service to increment the inventory back to its original state.

This approach replaces synchronous network blocking with asynchronous, localized commits. It improves write throughput from a few dozen transactions per second under 2PC to thousands of operations per second, ensuring the checkout flow remains resilient and available even if individual downstream services experience transient failures."