System Design: Designing a Distributed File Lock (Zookeeper/Curator)

In distributed computing, managing concurrent mutations on shared, un-sharded resources (such as legacy physical files, network storage volumes, or third-party API configurations) is a classical synchronization challenge. Writing overlapping data streams results in structural corruption.

While relational databases rely on internal transactional locks, coordinating operations across distinct stateless microservices requires a dedicated Distributed File Lock.

This case study designs a production-grade distributed lock manager built upon Apache ZooKeeper's Ephemeral Sequencer protocol and the Apache Curator framework. It is optimized to operate reliably across a fleet of 100,000 application servers under dynamic clock drifts, GC pauses, and network partitions.

1. Requirements & Core Constraints

Functional Requirements

Mutual Exclusion: Only one host process can acquire the lock for a specific resource path at any given instant.
Fail-Safe Release: If a lock holder crashes or is partitioned, the lock must automatically release within a defined timeout (preventing permanent deadlocks).
Blocking & Reentrant Lock Options: Support both non-blocking "try-lock" and blocking queue-based acquisition mechanisms, along with thread-level reentrancy.
Lock Fair-Queueing: Guarantee that lock requests are processed in strict first-in-first-out (FIFO) order to prevent lock starvation.

Non-Functional Requirements

Strong Consistency (CP System): Lock operations must adhere to linearizable consistency. Availability is gracefully compromised during split-brain events to guarantee absolute safety (CAP Theorem CP position).
Scale Capacity: Coordinate locks for up to 1,000,000 unique resource paths requested by 100,000 application servers.
Low Acquisition Latency: Under low contention, lock lease acquisition must complete in under 5 milliseconds.
Fencing Protection: Support cryptographic fencing tokens to prevent late-arriving write blocks from corrupting storage after a lock lease has expired.

Back-of-the-Envelope Capacity Estimation

1. Ingress & Connection Scales

Active host connections: 100,000 application nodes.
ZooKeeper Heartbeats (Session Keep-Alives):
- Nodes send tick heartbeats every 2 seconds.
- Heartbeat QPS: 100,000 / 2 = 50,000 requests/second just to maintain session state.
Bandwidth: Heartbeat ping size is 64 bytes.
- 50,000 QPS * 64 bytes = 3.2 Megabytes/second. Highly efficient!

2. ZooKeeper Memory Footprint

Active Lock ZNodes: 1,000,000 unique paths.
Size per ZNode: Each node in the ZooKeeper tree is stored entirely in memory for maximum speed, requiring ~500 bytes.
Total ZooKeeper Ensemble RAM footprint:
- 1,000,000 nodes * 500 bytes = 500 Megabytes.
- Extremely lightweight! Easily fits inside cheap server RAM profiles.

2. API Design & Core Contracts

ZooKeeper clients do not utilize HTTP APIs; they interact through linearizable TCP socket connections using custom wire protocols. We represent these operations using standard gRPC structures and ZooKeeper CLI commands.

API 1: CLI-Based Node Setup (ZooKeeper Protocol)

To register a new lock queue path inside the consensus tree:

# Create persistent parent node for file locks
create /locks/file_write_lock_01 "init_payload"

# Client creates sequential ephemeral node inside parent to request lock
create -e -s /locks/file_write_lock_01/lock- "usr_server_node_8a"
# Output returns created sequence node path:
# Created /locks/file_write_lock_01/lock-0000000001

API 2: High-Level gRPC Distributed Lock Lease Request

For non-Java applications communicating with a lock coordinator microservice.

Request: RecordLockRequest (ProtoPayload)
Response: RecordLockResponse (ProtoPayload)

message RecordLockRequest {
  string resource_path = 1; // e.g. "/locks/files/invoice_9821.pdf"
  string client_id = 2; // e.g. "srv_node_west_82"
  int64 session_timeout_ms = 3; // e.g. 5000 (5 seconds)
  bool wait_blocking = 4; // True for waiting in queue
}

message RecordLockResponse {
  bool acquired = 1;
  string acquired_znode_path = 2; // "/locks/files/invoice_9821.pdf/lock-0000000001"
  int64 fencing_token = 3; // Incrementing counter representing lock lease version
  int64 lease_expires_at = 4;
}

3. High-Level Design (HLD)

The HLD uses ZooKeeper's hierarchical directory structure. ZooKeeper relies on the Zab Consensus Protocol to replicate state transitions across a cluster of servers, providing a single system image where all nodes see identical write orders.

Ephemeral Sequential Queue Topography

graph TD
    %% Parent Lock Node
    subgraph ZooKeeper Ensemble Memory Tree
        ParentNode["Persistent Parent Path: /locks/file_01"]
        
        %% Child sequence
        ParentNode --> Node1["Ephemeral Sequential 1: lock-0000000001 <br> (Holds Lock)"]
        ParentNode --> Node2["Ephemeral Sequential 2: lock-0000000002 <br> (Watches Node 1)"]
        ParentNode --> Node3["Ephemeral Sequential 3: lock-0000000003 <br> (Watches Node 2)"]
    end

    %% Client App Hosts
    Client1[App Node A] -.->|Creates Node 1| Node1
    Client2[App Node B] -.->|Creates Node 2| Node2
    Client3[App Node C] -.->|Creates Node 3| Node3

    %% Watch flow
    Node2 -->|Event Watcher Trigger on Delete| Node1
    Node3 -->|Event Watcher Trigger on Delete| Node2

Lock Competition & Acquisition Sequence

sequenceDiagram
    autonumber
    participant AppA as Application Node A (Lock Owner)
    participant AppB as Application Node B (Competitor)
    participant ZK as ZooKeeper Ensemble

    AppA->>ZK: Create ephemeral sequential path (/locks/f1/lock-)
    ZK-->>AppA: Path Assigned: /locks/f1/lock-0000000001
    AppB->>ZK: Create ephemeral sequential path (/locks/f1/lock-)
    ZK-->>AppB: Path Assigned: /locks/f1/lock-0000000002
    
    AppA->>ZK: List children and get lowest sequence
    ZK-->>AppA: Children: [lock-0000000001, lock-0000000002]
    AppA->>AppA: lock-0000000001 matches lowest! (ACQUIRED LOCK)
    
    AppB->>ZK: List children and get lowest sequence
    ZK-->>AppB: Children: [lock-0000000001, lock-0000000002]
    AppB->>AppB: lock-0000000002 does NOT match lowest.
    AppB->>ZK: Register Node Watch on /locks/f1/lock-0000000001
    ZK-->>AppB: Watch Registered (WAITING FOR LOCK)
    
    Note over AppA, ZK: App A crashes or experiences Network Partition timeout
    ZK->>ZK: Session expires. Ephemeral node lock-0000000001 automatically deleted!
    ZK->>AppB: Trigger Watch Event: Node /locks/f1/lock-0000000001 deleted!
    
    AppB->>ZK: List children and verify lowest
    ZK-->>AppB: Children: [lock-0000000002]
    AppB->>AppB: lock-0000000002 matches lowest! (ACQUIRED LOCK)

4. Low-Level Design (LLD) & Data Models

Ephemeral Node Properties

Unlike persistent nodes (which remain until deleted), Ephemeral Nodes are tied directly to the client's TCP session. If the client socket disconnects or fails to send heartbeats within the session timeout window, ZooKeeper's session manager automatically drops the ephemeral node, acting as an implicit lock release.

Node Details

Path: /locks/files/invoice_98.pdf/lock-0000000001
Stat Metadata Block:
- czxid: Creation transaction ID.
- mzxid: Modification transaction ID.
- ctime: Epoch timestamp of node creation.
- ephemeralOwner: The 64-bit session ID associated with this ephemeral node.

Compilable Java Implementation: Ephemeral Session Lock Simulator

The following class simulates ZooKeeper's session tracking, sequential numbering, and watch trigger mechanisms.

package com.codesprintpro.lock;

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class EphemeralSessionLockSimulator {

    public static class EphemeralNode {
        public final String path;
        public final int sequenceNumber;
        public final String ownerSessionId;
        public final long createdAt;

        public EphemeralNode(String path, int sequenceNumber, String ownerSessionId) {
            this.path = path;
            this.sequenceNumber = sequenceNumber;
            this.ownerSessionId = ownerSessionId;
            this.createdAt = System.currentTimeMillis();
        }
    }

    private final String lockPath;
    private final AtomicInteger sequenceGenerator = new AtomicInteger(0);
    private final Map<String, EphemeralNode> activeNodes = new ConcurrentHashMap<>();
    private final Map<String, Watcher> watchListeners = new ConcurrentHashMap<>();

    public interface Watcher {
        void onNodeDeleted(String deletedNodePath);
    }

    public EphemeralSessionLockSimulator(String lockPath) {
        this.lockPath = lockPath;
    }

    public synchronized String createEphemeralSequential(String sessionId) {
        int seq = sequenceGenerator.incrementAndGet();
        String fullPath = lockPath + "/lock-" + String.format("%010d", seq);
        
        EphemeralNode node = new EphemeralNode(fullPath, seq, sessionId);
        activeNodes.put(fullPath, node);
        System.out.println("Session [" + sessionId + "] created Node: " + fullPath);
        return fullPath;
    }

    public synchronized boolean checkLockAcquired(String clientNodePath) {
        if (!activeNodes.containsKey(clientNodePath)) {
            return false;
        }

        // Find the lowest sequence number currently active
        int currentMinSeq = Integer.MAX_VALUE;
        for (EphemeralNode node : activeNodes.values()) {
            if (node.sequenceNumber < currentMinSeq) {
                currentMinSeq = node.sequenceNumber;
            }
        }

        int clientSeq = activeNodes.get(clientNodePath).sequenceNumber;
        return clientSeq == currentMinSeq;
    }

    public synchronized void registerWatchOnPredecessor(String clientNodePath, Watcher watcher) {
        EphemeralNode clientNode = activeNodes.get(clientNodePath);
        if (clientNode == null) return;

        // Find the predecessor node (largest sequence smaller than client sequence)
        int clientSeq = clientNode.sequenceNumber;
        String predecessorPath = null;
        int maxPredecessorSeq = -1;

        for (EphemeralNode node : activeNodes.values()) {
            if (node.sequenceNumber < clientSeq && node.sequenceNumber > maxPredecessorSeq) {
                maxPredecessorSeq = node.sequenceNumber;
                predecessorPath = node.path;
            }
        }

        if (predecessorPath != null) {
            System.out.println("Node [" + clientNodePath + "] registering watch on predecessor [" + predecessorPath + "]");
            watchListeners.put(predecessorPath, watcher);
        }
    }

    public synchronized void evictSession(String sessionId) {
        System.out.println("\n--- Connection lost / Session [" + sessionId + "] Expired ---");
        List<String> toEvict = new ArrayList<>();
        
        for (EphemeralNode node : activeNodes.values()) {
            if (node.ownerSessionId.equals(sessionId)) {
                toEvict.add(node.path);
            }
        }

        for (String nodePath : toEvict) {
            activeNodes.remove(nodePath);
            System.out.println("Ephemeral node deleted: " + nodePath);
            
            // Trigger Watcher if registered
            Watcher watcher = watchListeners.remove(nodePath);
            if (watcher != null) {
                watcher.onNodeDeleted(nodePath);
            }
        }
    }

    public static void main(String[] args) {
        EphemeralSessionLockSimulator simulator = new EphemeralSessionLockSimulator("/locks/file_01");

        // Server A joins
        String pathA = simulator.createEphemeralSequential("session_node_A");
        boolean hasLockA = simulator.checkLockAcquired(pathA);
        System.out.println("Node A has lock? " + hasLockA); // Expect True

        // Server B joins
        String pathB = simulator.createEphemeralSequential("session_node_B");
        boolean hasLockB = simulator.checkLockAcquired(pathB);
        System.out.println("Node B has lock? " + hasLockB); // Expect False

        // Node B registers watch on Node A
        simulator.registerWatchOnPredecessor(pathB, new Watcher() {
            @Override
            public void onNodeDeleted(String deletedNodePath) {
                System.out.println("Watcher Notification: Predecessor [" + deletedNodePath + "] dropped!");
                boolean holdsNow = simulator.checkLockAcquired(pathB);
                System.out.println("Session B re-evaluates lock: Has lock now? " + holdsNow);
            }
        });

        // Simulating Server A experiencing session timeout (e.g. JVM Stop-The-World pause)
        simulator.evictSession("session_node_A");
    }
}

5. Scaling Challenges & Bottlenecks

The Herd Effect (Thundering Herd)

Problem: In naive distributed lock designs, when a lock releases, the system broadcasts a notify event to all waiting clients. Hundreds of concurrent client nodes simultaneously fetch child nodes and query the coordinator, overwhelming network buffers and CPU cycles.
Mitigation: Implement Point-to-Point Watching. The sequential node design naturally mitigates the herd effect. As shown in the simulator, each client node registers a watch only on the node with the sequence number immediately preceding its own. When Node 1 releases, Node 2 is the only node notified, keeping network broadcast traffic at exactly O(1).

Split-Brain & Consensus Loss

Problem: A network partition isolates ZooKeeper servers. If clients can write to multiple partitioned segments, two different nodes could acquire the "same" lock in their respective partition sets.
Mitigation: ZooKeeper prevents this by enforcing a Quorum Policy. A ZooKeeper cluster of size N requires a strict majority of nodes (floor(N/2) + 1) to accept writes. If a partition cannot form a majority, it refuses to create nodes, ensuring that a split-brain double-lock is mathematically impossible.

6. Technical Trade-offs & Compromises

warning

Staff Engineer Perspective: GC Pauses & Double-Locking

The most insidious failure mode in distributed locking is the JVM Stop-the-World Garbage Collection Pause. Suppose Node A acquires a ZooKeeper lock. It starts a heavy write operation. Suddenly, Node A encounters a major JVM GC pause, freezing all execution threads. Because the JVM is paused, Node A fails to send ZooKeeper heartbeats. ZooKeeper's session times out and deletes Node A's ephemeral lock node. Node B, next in sequence, is notified and successfully acquires the lock. Node A's GC pause completes. Node A resumed its write thread, believing it still holds the lock. Node B is also writing. You now have two processes writing to the same file concurrently, destroying data consistency. Compromise Plan: Utilize Fencing Tokens. Every lock lease returned by the lock coordinator includes an incrementing fence counter (e.g., token = 42). The shared storage system (such as our database or file system) keeps track of the largest token processed. When Node A attempts to write, the storage engine validates the fence token. After Node B acquires the lock with token 43, Node A's older token 42 write is immediately rejected, preventing silent data corruption.

Redis (Redlock) vs. ZooKeeper (Curator)

Redis Redlock: Highly available and incredibly fast. It operates on AP principles where availability is prioritized. However, Redlock depends heavily on system clock sync assumptions across distinct Redis instances. If a clock drifts or a server pauses, Redlock can violate mutual exclusion guarantees.
ZooKeeper (Curator): Built strictly on CP principles. It relies on a linearizable consensus model where consistency is absolute. The trade-off is a lower transaction throughput compared to Redis due to consensus round-trip network hops.
Decision: We mandate ZooKeeper (Curator) for all critical file integrations where data corruption is unacceptable. The strict CP consistency model ensures absolute safety, and Curator handles connection retries and node reconstruction automatically.

7. Failure Scenarios & Operational Resiliency

1. Host Session Outage (JVM GC Spikes)

Scenario: An application server encounters a major GC pause, triggering premature session timeout and ephemeral node deletion.
Resiliency Plan:
- Implement a Fencing Token validation step in our storage systems.
- Tune Curator's sessionTimeout to be larger than typical P99 GC pauses (e.g. 5,000 milliseconds).
- Use Curator's ConnectionStateListener to pause all active file write operations immediately if the client enters a SUSPENDED connection state.

2. ZooKeeper Leader Crash

Scenario: The active leader node in the ZooKeeper cluster crashes.
Resiliency Plan: The remaining ZooKeeper cluster servers halt all client modifications, execute a new leader election round via the Zab protocol within 200ms, and reconstruct active session connections seamlessly without dropping ephemeral node states.

3. Connection Flapping (Session Flapping)

Scenario: Intermittent network jitter causes the client to frequently disconnect and reconnect, leading to orphaned ephemeral lock nodes.
Resiliency Plan: Curator's InterProcessMutex tracks active local locks. If a connection is lost, Curator blocks subsequent local reentrant lock attempts until the session has been validated or re-established, avoiding orphaned locks.

8. Candidate Verbal Script

Below is a mock interview walkthrough demonstrating how a candidate should execute this system design interview.

Interviewer: "Design a distributed lock manager to coordinate file edits across 100,000 application nodes safely."

Candidate: "To build a distributed lock manager where safety and correctness are absolute requirements, I will choose a CP-centric architecture utilizing Apache ZooKeeper and the Apache Curator client framework.

Unlike AP memory stores like Redis, ZooKeeper enforces linearizable consistency, guaranteeing that under network splits, two host nodes can never acquire the same lock.

To manage locks safely without polling, I will use ZooKeeper Ephemeral Sequential Nodes. When an application server requests a lock on a path (such as /locks/file_01), it creates an ephemeral sequential child node: /locks/file_01/lock-0000000001.

The client then lists all child nodes. If its node sequence is the smallest, the client acquires the lock. If not, it registers a watch event only on the node directly preceding it in sequence. This prevents the thundering herd effect by ensuring that only one waiting node is notified when a lock is released.

To handle host crashes or JVM Stop-The-World GC pauses where a lock node might be prematurely deleted, I will implement Fencing Tokens. Every lock lease has an incrementing version token. The target storage system checks this version token, instantly rejecting any write attempts from clients with older, expired lease versions.

By wrapping this logic in Apache Curator's InterProcessMutex, we ensure that connection drops, retries, and ephemeral session lifecycles are handled automatically and robustly."

Interviewer: "What happens if a ZooKeeper client experiences a temporary network partition? How long before the lock is released?"

Candidate: "The timeout is governed by the ZooKeeper sessionTimeout parameter, typically set between 2 to 10 seconds. If a client is partitioned, it fails to send heartbeat pings. Once the sessionTimeout window expires, the ZooKeeper leader declares the session dead, automatically deletes the client's ephemeral node, and triggers a watch notification to the next client in the queue.

To prevent data corruption during this window, we use Curator's connection listener to immediately suspend all local file write operations the moment the connection drops, and validate the fencing token at the storage layer."

Key Takeaways

If yes: You own the lock.
If no: You watch the node immediately preceding yours in the sequence.
InterProcessMutex: Provides a familiar API for distributed locking.