Lesson 9 of 25 15 minDeep Systems

System Design Masterclass: Designing a Distributed Lock Service (Chubby/Zookeeper)

A deep dive into building a highly available, strongly consistent lock service. Learn about consensus protocols, liveness guarantees, fencing tokens, and the "Redlock" controversy.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **Acquire(LockName, Timeout)**: Returns a `LockID` or fails.
  • **Release(LockID)**: Frees the lock.
  • **Renew(LockID, ExtensionTime)**: Extends the lease.
Recommended Prerequisites
System Design Masterclass: Designing a Distributed Rate Limiter

Premium outcome

Distributed systems mechanics for engineers building serious backend platforms.

Engineers who want stronger distributed-systems fundamentals for platform work.

You leave with

  • More confidence with consistency, causality, locking, and time in distributed systems
  • A stronger sense of which backend guarantees are expensive and why
  • The systems-level foundation needed for difficult architecture trade-offs

In a single-threaded process, we use simple mutexes. In a multi-threaded application, we employ synchronized blocks or concurrent locks. But in a distributed system spanning hundreds of virtual machines, container fleets, and databases, how do we guarantee that only one worker executes a specific critical task at any given instant?

Whether it is protecting against double-processing in payment runs, ensuring only a single crawler processes a targeted domain, or coordinating master node allocations in clusters, a production-grade Distributed Lock Service is essential.

Designing a resilient lock service is a classic system design interview question because it tests your understanding of strong consistency, liveness guarantees, network partitions, garbage collection pauses, and physical clock drifts.


1. Requirements & Core Constraints

Building a lock manager demands absolute consistency and high operational resiliency.

Functional Requirements

  • Acquire Lock: A client must be able to request an exclusive lock on a named resource. If successful, it receives a unique LockToken (fencing token).
  • Release Lock: The client must be able to release the lock explicitly when the critical section is complete.
  • Renew Lock (Lease Extension): To prevent locks from being held indefinitely by crashed clients, locks must have a Time-to-Live (TTL). A client must be able to renew its active lease prior to expiration.
  • Check Lock Status: Clients must be able to inspect current lock ownership.

Non-Functional Requirements & SLAs

  • Strong Consistency (Safety): At most one process can hold a specific named lock at any single point in time. This is non-negotiable and requires a strict CP (Consistency / Partition-tolerance) posture.
  • Liveness (Deadlock Prevention): If a process crashes, freezes, or experiences a network partition while holding a lock, the service must automatically release the lock after a configurable lease timeout.
  • High Availability: The lock service must survive hardware crashes. It should utilize a consensus quorum (e.g. 3 or 5 replication nodes) to prevent single points of failure.
  • Low Latency: Lock lease checking and renewal operations must have sub-millisecond latencies, while write operations (Acquire/Release) must execute under 10 milliseconds.

Back-of-the-Envelope Estimates

Let's size the operational bounds of a lock service deployed across a global microservice fleet:

  • Target QPS: Suppose a large platform coordinates 100,000 distinct distributed workers, each performing a locking check or acquisition once every 5 seconds. $$\text{Average Lock QPS} = \frac{100,000}{5} = 20,000 \text{ QPS}$$ $$\text{Peak Write QPS} = 20,000 \text{ QPS} \times 1.5 = 30,000 \text{ QPS}$$

  • Consensus Replication Ingress (3-Node Raft Cluster): If each consensus lock record contains:

    • Lock Name (e.g., resource path): 64 bytes
    • Fencing Token ID (uint64): 8 bytes
    • Client ID: 36 bytes
    • Lease Expiry timestamp: 8 bytes
    • Record Payload: $\approx 116$ bytes $$\text{Write Bandwidth} = 30,000 \text{ QPS} \times 116 \text{ bytes} \approx 3.48 \text{ MB/sec}$$ This is highly manageable for high-speed network connections, but disk syncing (Write-Ahead Logging) is the primary throughput throttle, limiting consensus lock engines to a few tens of thousands of writes per second on standard hardware without batching.

2. API Design & Core Contracts

The distributed lock service exposes structured gRPC endpoints to enable strongly typed, low-overhead communication between clients and the consensus group coordinator.

syntax = "proto3";

package locking.v1;

service DistributedLockService {
  // Attempts to acquire an exclusive lock on a named resource
  rpc AcquireLock(AcquireLockRequest) returns (AcquireLockResponse);

  // Explicitly releases an active lock using the acquired token
  rpc ReleaseLock(ReleaseLockRequest) returns (ReleaseLockResponse);

  // Heartbeat endpoint used to renew a client's active lease before expiration
  rpc RenewLease(RenewLeaseRequest) returns (RenewLeaseResponse);
}

message AcquireLockRequest {
  string lock_name = 1;
  string client_id = 2;
  uint64 lease_duration_ms = 3;  // Target TTL of the lock lease (default 10,000ms)
  uint64 timeout_ms = 4;         // Maximum time to block waiting for the lock
}

message AcquireLockResponse {
  bool success = 1;
  string lock_token = 2;         // Fencing token represented as a string UUID
  uint64 fencing_token_id = 3;   // Monotonically increasing number for database fencing
  uint64 lease_expiry_epoch_ms = 4;
  string error_message = 5;
}

message ReleaseLockRequest {
  string lock_name = 1;
  string client_id = 2;
  string lock_token = 3;
}

message ReleaseLockResponse {
  bool success = 1;
  string message = 2;
}

message RenewLeaseRequest {
  string lock_name = 1;
  string client_id = 2;
  string lock_token = 3;
  uint64 extend_duration_ms = 4;
}

message RenewLeaseResponse {
  bool success = 1;
  uint64 new_expiry_epoch_ms = 2;
  string error_message = 3;
}

3. High-Level Design (HLD)

To ensure strong consistency, the lock state must be managed by a Replicated State Machine using a consensus algorithm like Paxos or Raft. The cluster consists of an odd number of nodes (typically 3 or 5). One node acts as the active Leader, and the others serve as Followers.

Quorum Replication Architecture

The diagram below maps the client acquisition lifecycle. Bypassing standard uncoordinated databases, the request must commit to a majority quorum of consensus log nodes.

graph TD
    Client[Application Client] -->|1. AcquireLock Request| Leader[Raft Leader Node]
    
    subgraph Consensus Cluster
        Leader -->|2. Replicate Log Entry| Follower1[Follower Node 1]
        Leader -->|2. Replicate Log Entry| Follower2[Follower Node 2]
        
        Follower1 -->|3. Acknowledge Write| Leader
        Follower2 -->|3. Acknowledge Write| Leader
    end
    
    Leader -->|4. Sync to Disk WAL| WAL[(Write-Ahead Log)]
    Leader -->|5. Commit & Return Fencing Token| Client

The Fencing Token Workflow

To mitigate the risk of client execution freezes (e.g., GC pauses), the lock service issues a monotonically increasing fencing token with every successful lock acquisition. Below is the operational workflow demonstrating how downstream storage layers reject outdated operations.

sequenceDiagram
    autonumber
    actor ClientA as App Client A
    actor ClientB as App Client B
    participant LockService as Lock Coordinator
    participant Database as Shared PostgreSQL Store

    ClientA->>LockService: AcquireLock(resource_1)
    LockService-->>ClientA: Lock Granted (Fencing Token = 83)
    
    note over ClientA: Client A enters Stop-the-World GC Pause (15 seconds)
    note over LockService: Lease Expires. Lock Re-allocated to Client B
    
    ClientB->>LockService: AcquireLock(resource_1)
    LockService-->>ClientB: Lock Granted (Fencing Token = 84)
    
    ClientB->>Database: Write Record (Fencing Token = 84)
    note over Database: DB checks fencing token. 84 is greater than 0. Write Committed.
    Database-->>ClientB: Write Success
    
    note over ClientA: Client A wakes up from GC Pause
    ClientA->>Database: Write Record (Fencing Token = 83)
    note over Database: DB checks fencing token. 83 is less than 84! Write Rejected!
    Database-->>ClientA: Write Error (Outdated Fencing Token)

4. Low-Level Design (LLD) & Data Models

To implement a strongly consistent database backend for the lock state, we require a schema that tracks active leases, fencing tokens, and ownership properties.

Relational Schema (PostgreSQL): Distributed Lock State

-- Tracks current active lock leases and fencing token sequence
CREATE TABLE distributed_locks (
    lock_name VARCHAR(255) PRIMARY KEY,
    client_id VARCHAR(255) NOT NULL,
    lock_token VARCHAR(64) NOT NULL UNIQUE,
    fencing_token BIGINT NOT NULL,          -- Monotonically increasing number
    lease_duration_ms BIGINT NOT NULL,
    acquired_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    expires_at TIMESTAMP WITH TIME ZONE NOT NULL
);

-- Sequence used to generate global monotonically increasing fencing tokens
CREATE SEQUENCE seq_fencing_tokens START WITH 1;

-- Create an index to quickly scan and identify expired leases in the background
CREATE INDEX idx_locks_expiry ON distributed_locks (expires_at);

Compilable Java Implementation: Lock Lease Coordinator

This compilable Java class simulates the internal logic of a distributed lock coordinator. It handles concurrent lock acquisition requests, manages client lease heartbeats, and utilizes a background thread pool to evict expired leases safely.

package com.codesprintpro.locking;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

public class LockLeaseCoordinator {
    private static final Logger logger = Logger.getLogger(LockLeaseCoordinator.class.getName());
    
    // In-memory representation of lock records (replicated state equivalent)
    public static class LockRecord {
        public final String lockName;
        public final String clientId;
        public final String lockToken;
        public final long fencingToken;
        public volatile long expiresAtEpochMs;

        public LockRecord(String lockName, String clientId, String lockToken, long fencingToken, long expiresAtEpochMs) {
            this.lockName = lockName;
            this.clientId = clientId;
            this.lockToken = lockToken;
            this.fencingToken = fencingToken;
            this.expiresAtEpochMs = expiresAtEpochMs;
        }
    }

    private final ConcurrentHashMap<String, LockRecord> activeLocks = new ConcurrentHashMap<>();
    private final AtomicLong fencingTokenGenerator = new AtomicLong(1000); // Start offset
    private final ScheduledExecutorService cleanupScheduler = Executors.newSingleThreadScheduledExecutor();

    public LockLeaseCoordinator() {
        // Start background cleaner running every 500ms to evict expired locks
        cleanupScheduler.scheduleAtFixedRate(this::evictExpiredLocks, 500, 500, TimeUnit.MILLISECONDS);
    }

    public synchronized LockRecord acquireLock(String lockName, String clientId, String lockToken, long leaseMs) {
        long now = System.currentTimeMillis();
        LockRecord currentLock = activeLocks.get(lockName);

        // If lock exists and is not expired, deny acquisition
        if (currentLock != null && currentLock.expiresAtEpochMs > now) {
            logger.info("Acquisition rejected: Lock '" + lockName + "' is currently held by " + currentLock.clientId);
            return null;
        }

        // Generate dynamic fencing token and create new lock lease record
        long nextFencingToken = fencingTokenGenerator.incrementAndGet();
        long expiryTime = now + leaseMs;
        LockRecord newLock = new LockRecord(lockName, clientId, lockToken, nextFencingToken, expiryTime);
        
        activeLocks.put(lockName, newLock);
        logger.info("Lock '" + lockName + "' granted to client '" + clientId + "' with Fencing Token: " + nextFencingToken);
        return newLock;
    }

    public synchronized boolean renewLease(String lockName, String clientId, String lockToken, long extendMs) {
        long now = System.currentTimeMillis();
        LockRecord currentLock = activeLocks.get(lockName);

        if (currentLock == null) {
            logger.warning("Renewal failed: Lock '" + lockName + "' does not exist");
            return false;
        }

        if (currentLock.expiresAtEpochMs <= now) {
            logger.warning("Renewal failed: Lock '" + lockName + "' has already expired");
            return false;
        }

        if (!currentLock.lockToken.equals(lockToken) || !currentLock.clientId.equals(clientId)) {
            logger.warning("Renewal failed: Invalid credentials for lock '" + lockName + "'");
            return false;
        }

        // Extend expiry epoch time
        currentLock.expiresAtEpochMs = now + extendMs;
        logger.info("Lease renewed for Lock '" + lockName + "' to expire in " + extendMs + "ms");
        return true;
    }

    public synchronized void releaseLock(String lockName, String clientId, String lockToken) {
        LockRecord currentLock = activeLocks.get(lockName);
        if (currentLock != null && currentLock.lockToken.equals(lockToken) && currentLock.clientId.equals(clientId)) {
            activeLocks.remove(lockName);
            logger.info("Lock '" + lockName + "' explicitly released by client '" + clientId + "'");
        }
    }

    private synchronized void evictExpiredLocks() {
        long now = System.currentTimeMillis();
        activeLocks.forEach((lockName, record) -> {
            if (record.expiresAtEpochMs <= now) {
                activeLocks.remove(lockName);
                logger.warning("Lock '" + lockName + "' held by '" + record.clientId + "' expired and was evicted.");
            }
        });
    }

    public void shutdown() {
        cleanupScheduler.shutdown();
    }

    // Direct local verification check
    public static void main(String[] args) throws InterruptedException {
        LockLeaseCoordinator coordinator = new LockLeaseCoordinator();
        
        // Test lock acquisition
        String lockName = "exclusive_db_migration";
        LockRecord lock = coordinator.acquireLock(lockName, "worker_node_1", "uuid-token-1122", 1000);
        
        if (lock != null) {
            System.out.println("Test 1 Passed: Lock acquired. Fencing Token: " + lock.fencingToken);
        }

        // Attempt concurrent acquisition (should be blocked)
        LockRecord blockedLock = coordinator.acquireLock(lockName, "worker_node_2", "uuid-token-3344", 1000);
        if (blockedLock == null) {
            System.out.println("Test 2 Passed: Lock concurrent request correctly blocked.");
        }

        // Wait for lease to expire naturally
        Thread.sleep(1200);
        
        // Re-try acquisition after eviction (should succeed)
        LockRecord postExpiryLock = coordinator.acquireLock(lockName, "worker_node_2", "uuid-token-3344", 2000);
        if (postExpiryLock != null) {
            System.out.println("Test 3 Passed: Lock acquired post expiry. Fencing Token: " + postExpiryLock.fencingToken);
        }

        coordinator.shutdown();
    }
}

5. Scaling Challenges & Bottlenecks

Distributed lock systems are high-contention endpoints. Scaling them requires eliminating standard concurrency blockages.

The GC Pause Death Trap (Process Pauses)

If a client node holding an active lease experiences a long JVM Stop-the-World Garbage Collection pause, the local client thread stops running. During this freeze, the client lease can expire on the lock server and be re-allocated to a different worker. When the original client wakes up, it assumes its lock is still valid and writes to the shared database, corrupting state.

  • The Solution: We must enforce Fencing Tokens at the storage engine level. Every lock acquisition returns an incrementing token ID. The target storage database (e.g., PostgreSQL or Cassandra) must validate that the token attached to the incoming write operation is greater than or equal to the last committed token. If a frozen client attempts to write using an outdated token, the database rejects the command.

Paxos/Raft Write Bottlenecks

Consensus algorithms require synchronous disk writes (fsync) to persist Write-Ahead Logs (WAL) across a quorum of nodes. This restricts consensus lock clusters to approximately 5,000 to 10,000 write QPS.

  • The Solution: To scale past this, we implement Hierarchical Path Isolation (similar to Google Chubby's directory trees). We shard locks across multiple independent Paxos consensus groups. A client requiring a lock at /locks/analytics/jobs/1 only contacts the specific consensus group handling the /locks/analytics path, isolating write workloads.

6. Technical Trade-offs & Compromises

Choosing the correct distributed lock architecture involves balancing performance against safety.

Strongly Consistent CP (ZooKeeper/etcd) vs. AP (Redis Redlock)

                       ┌───────────────────────────────┐
                       │   Lock Service Architecture   │
                       └───────────────┬───────────────┘
                                       │
            ┌──────────────────────────┴──────────────────────────┐
            ▼                                                     ▼
┌──────────────────────────────────────┐              ┌──────────────────────────────────────┐
│        etcd / ZooKeeper (CP)         │              │          Redis Redlock (AP)          │
├──────────────────────────────────────┤              ├──────────────────────────────────────┤
│ • Consensus replication (Raft/Paxos) │              │ • Independent node validation        │
│ • Strongly Consistent                │              │ • High In-Memory Throughput          │
│ • Ephemeral session nodes            │              │ • Relies on synchronized clocks      │
│ • Safety prioritised                 │              │ • Potential double-locking on drift  │
└──────────────────────────────────────┘              └──────────────────────────────────────┘
  • CP Model (etcd/ZooKeeper): Under network partitions, etcd prioritizes consistency. If a node is separated from the majority partition, it instantly blocks write operations. Ephemeral session nodes automatically delete if client heartbeat messages stop.
    • Trade-off: High write latency due to consensus logs, but guarantees that double locking is mathematically impossible.
  • AP Model (Redis Redlock): Redis Redlock attempts to achieve lock safety by acquiring locks from multiple independent Redis instances. It is extremely fast and highly available.
    • Trade-off: It relies heavily on synchronized physical system clocks. If a physical clock jumps due to NTP synchronization drift, Redlock can grant the same lock to multiple processes concurrently.
    • Staff Verdict: Use etcd/ZooKeeper for high-value critical systems (e.g., master coordinate elections, financial payout workers). Use Redis/Redlock for high-volume, low-criticality systems (e.g. rate-limiting, deduplicating high-speed webhook ingestion logs).

7. Failure Scenarios & Operational Resiliency

Distributed environments are defined by transient network cuts and cluster hardware failovers.

Consensus Leader Election Failover

If the active Raft Leader node crashes or loses network connection, the remaining followers must dynamically elect a new Leader.

  • Mitigation: During this election window (typically 150ms to 300ms), the lock cluster temporarily refuses new write requests. Bidding or worker clients must use exponential backoff and jitter when retrying AcquireLock commands. Existing lock leases are preserved, as lease expiration clocks are verified on the newly elected leader once the session is restored.

Split-Brain Mitigation

Under a severe network split, a cluster of 5 nodes might be split into a 2-node partition and a 3-node partition.

  • Mitigation: The 2-node partition cannot form a majority quorum (which requires at least 3 nodes). Consequently, it immediately rejects all lock acquisition requests. The 3-node partition continues to process writes normally. This strictly prevents split-brain anomalies where two different master nodes could be elected concurrently.


8. Candidate Verbal Script

Mock Interview Sequence

Interviewer: How would you design a distributed lock service that handles high availability and guarantees that two processes can never hold the same lock concurrently, even under GC pauses?

Candidate: "To guarantee absolute safety, I would choose a CP (Consistency/Partition-tolerance) architecture over an AP system. I would build the lock service using a consensus framework like Raft or Paxos, similar to Apache ZooKeeper or etcd. This ensures that lock operations are replicated across a majority quorum of nodes before being confirmed.

To prevent deadlocks if a client crashes while holding a lock, I would implement a lease-based system where locks have a Time-to-Live (TTL). The client must periodically renew its lease via a heartbeat.

Crucially, to combat the 'GC Pause Death Trap' where a client experiences a Stop-the-World garbage collection pause and loses its lock without knowing it, I would implement Fencing Tokens. Every time a lock is acquired, the service returns a monotonically increasing sequence number along with the lock.

When the client attempts to write to our primary storage database, it must include this fencing token in the write transaction. The database must check this token against the last committed write. If a new client has already acquired the lock and committed a write with a higher token, the database will instantly reject the stale client's write. This guarantees safety even when physical system clocks drift or processes pause indefinitely."

Interviewer: How would the storage database perform this fencing verification at scale without introducing its own performance bottleneck?

Candidate: "Instead of running a heavy relational transaction on every write, the storage engine can maintain a simple metadata table tracking the max_fencing_token per resource. In a database like PostgreSQL, this check can be embedded directly into a conditional update query, such as:

UPDATE resource_table SET data = :new_data, last_fencing_token = :new_token WHERE resource_id = :id AND last_fencing_token < :new_token;

If the update affects zero rows, the client immediately knows its lock lease has expired and that its transaction was safely rejected, preventing data corruption without requiring complex distributed locks at the database layer itself."


Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.