Lesson 10 of 23 · 6 min

System Design Masterclass: Designing a Distributed Lock Service (Chubby/Zookeeper)

A deep dive into building a highly available, strongly consistent lock service. Learn about consensus protocols, liveness guarantees, fencing tokens, and the "Redlock" controversy.


Introduction to Distributed Locking

Mental Model

A client acquires a lease-based lock from a consensus-backed lock service, receives a fencing token, and uses that token to guard its writes to the shared resource.

graph TD
    A[Process A] -->|"Acquire(lock, TTL)"| LS["Lock Service (Raft/Paxos quorum)"]
    LS -->|"LockID + fencing token"| A
    B[Process B] -->|"Acquire(lock, TTL): blocked"| LS
    A -->|"Write guarded by fencing token"| R[(Shared Resource)]

In a single-threaded process, we use a mutex. In a multi-threaded application, we use synchronized blocks or locks. But in a distributed system with hundreds of nodes, how do you ensure that only one process performs a specific task at any given time?

Whether it's preventing double-billing, ensuring only one worker is crawling a specific site, or managing exclusive access to a shared resource, you need a Distributed Lock Service.

Designing one is a classic FAANG interview question because it tests your understanding of Consistency, Liveness, and Clock Drift.

The Problem: Why is this hard?

A distributed lock isn't just a "key in Redis." In a distributed environment, several things can go wrong:

  1. Network Partition: The process holding the lock is separated from the lock service.
  2. Process Pauses: A JVM Garbage Collection (GC) pause or an OS context switch happens after the process acquires the lock but before it performs the action.
  3. Clock Drift: The clocks on the lock service and the client are not perfectly synchronized.

The Requirements

Functional

  • Acquire(LockName, Timeout): Returns a LockID or fails.
  • Release(LockID): Frees the lock.
  • Renew(LockID, ExtensionTime): Extends the lease.
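
To make this contract concrete, here is a minimal sketch of the client-facing interface in Python. The operations mirror the list above; the type names (LockGrant, DistributedLockService) and the fencing_token field are illustrative assumptions, not a prescribed API.

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class LockGrant:
    lock_id: str          # opaque handle used for Release/Renew
    fencing_token: int    # monotonically increasing, validated by the storage layer (see below)
    expires_at_ms: int    # lease expiry according to the lock service


class DistributedLockService(Protocol):
    def acquire(self, lock_name: str, timeout_ms: int) -> Optional[LockGrant]:
        """Try to acquire the lock; return a grant or None on timeout."""
        ...

    def release(self, lock_id: str) -> None:
        """Free the lock identified by lock_id."""
        ...

    def renew(self, lock_id: str, extension_ms: int) -> Optional[LockGrant]:
        """Extend the lease; return the refreshed grant or None if the lease was lost."""
        ...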

Non-Functional

  • Consistency (Safety): Only one process can hold the lock at a time. This is non-negotiable.
  • Liveness: If a process crashes while holding a lock, the lock must eventually be released (Lease-based).
  • High Availability: The lock service itself must not be a single point of failure.

High-Level Architecture

A production-grade lock service (like Google's Chubby or Apache Zookeeper) is usually built on a Consensus Algorithm (Paxos or Raft).

1. The Storage Layer (Replicated State Machine)

The lock state is replicated across 3 or 5 nodes. A client request is only successful if a majority (quorum) of nodes agree. This ensures that even if one node fails, the lock state is preserved and consistent.
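
As a quick illustration of the quorum rule (a sketch; the helper name and replica counts are only for demonstration):

def quorum_size(replicas: int) -> int:
    """Smallest majority: a lock-state update is committed once this many replicas acknowledge it."""
    return replicas // 2 + 1


# A 5-node ensemble needs 3 acknowledgements and tolerates 2 simultaneous node failures.
for n in (3, 5):
    q = quorum_size(n)
    print(f"{n} replicas -> quorum of {q}, tolerates {n - q} failures")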

2. The Lease Mechanism

We never grant a lock indefinitely. Every lock has a Time-To-Live (TTL).

  • If the client doesn't renew the lock before the TTL expires, the service assumes the client has crashed and releases the lock.
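
Below is a simplified, single-instance sketch of a lease in Python using the redis-py client. The key name, TTL, and helper functions are illustrative, and a single Redis node does not provide the quorum guarantees described above; the point is the acquire / renew / release pattern.

import uuid
from typing import Optional

import redis  # redis-py client, assumed to be installed

r = redis.Redis(host="localhost", port=6379)

LOCK_KEY = "locks:report-generator"   # illustrative lock name
TTL_MS = 10_000                       # lease length: 10 seconds

# Lua scripts ensure we only touch the lock if we still own it (compare-and-act).
_RENEW_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('PEXPIRE', KEYS[1], ARGV[2])
end
return 0
"""
_RELEASE_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""

def acquire(ttl_ms: int = TTL_MS) -> Optional[str]:
    """Take the lease if nobody holds it; returns a unique owner token or None."""
    owner = str(uuid.uuid4())
    # SET key value NX PX ttl: succeeds only if the key does not already exist.
    if r.set(LOCK_KEY, owner, nx=True, px=ttl_ms):
        return owner
    return None

def renew(owner: str, ttl_ms: int = TTL_MS) -> bool:
    """Extend the lease, but only if we are still the owner."""
    return bool(r.eval(_RENEW_LUA, 1, LOCK_KEY, owner, ttl_ms))

def release(owner: str) -> bool:
    """Delete the lock, but never free a lease that has since passed to someone else."""
    return bool(r.eval(_RELEASE_LUA, 1, LOCK_KEY, owner))

owner = acquire()
if owner:
    try:
        # ... do the protected work, calling renew(owner) well before the TTL elapses ...
        pass
    finally:
        release(owner)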

The "GC Pause" Death Trap

Imagine this scenario:

  1. Process A acquires a lock with a 10s TTL.
  2. Process A enters a "Stop-the-World" GC pause that lasts 15s.
  3. The lock expires at 10s. The Lock Service grants the lock to Process B.
  4. Process A wakes up at 15s. It thinks it still has the lock and writes to the database.
  5. Process B is also writing to the database.
  6. Data Corruption occurs.

The Fix: Fencing Tokens

To solve this, the lock service must return a monotonically increasing version number (a token) with every lock acquisition.

  • When the process writes to the database, it includes the token.
  • The database (or the storage layer) checks if the token is the highest it has seen. If Process B has already written with a higher token, Process A's write is rejected.
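
Here is a minimal sketch of a storage layer that enforces fencing tokens. The class and field names are hypothetical; in practice the same check is a conditional UPDATE in SQL or a compare-and-swap in a key-value store.

import threading
from typing import Optional


class FencedStore:
    """Toy storage layer that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()   # local mutex protecting the toy store itself
        self._highest_token_seen = -1
        self._value: Optional[str] = None

    def write(self, value: str, fencing_token: int) -> bool:
        with self._mutex:
            if fencing_token < self._highest_token_seen:
                # A client waking up from a GC pause with an expired lease lands here.
                return False
            self._highest_token_seen = fencing_token
            self._value = value
            return True


store = FencedStore()
assert store.write("from B", fencing_token=34) is True   # B holds the newer lock
assert store.write("from A", fencing_token=33) is False  # A's stale write is fenced off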

Redis Redlock vs. Zookeeper

In interviews, you must know the trade-offs between a CP system (Zookeeper) and an AP system (Redis).

Zookeeper (The Safe Choice)

Zookeeper's lock recipe uses ephemeral, sequential znodes. If the client loses its session, its znode is automatically deleted and the lock is released. It is Strongly Consistent.

  • Pros: Safety guarantees are built-in.
  • Cons: Higher latency for writes due to consensus.
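
With Zookeeper the lock is typically a library recipe rather than hand-rolled scripts. A sketch using the kazoo client follows; the connection string, lock path, and do_protected_work() are illustrative placeholders.

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # illustrative ensemble
zk.start()

# kazoo's Lock recipe creates an ephemeral, sequential znode under this path;
# if our session dies, the znode (and therefore the lock) disappears automatically.
lock = zk.Lock("/locks/report-generator", identifier="worker-42")

with lock:                   # blocks until our znode has the lowest sequence number
    do_protected_work()      # hypothetical stand-in for the critical section

zk.stop()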

Redis Redlock (The AP Choice)

Redlock tries to achieve safety by acquiring the same lock on a majority of several independent Redis instances.
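
A stripped-down sketch of the Redlock acquisition step is shown below (the instance addresses are illustrative, and the published algorithm additionally subtracts a clock-drift allowance and uses short per-instance timeouts, which this sketch omits):

import time
import uuid
from typing import List, Optional

import redis

# Independent Redis instances; Redlock requires that they are NOT replicas of each other.
INSTANCES: List[redis.Redis] = [
    redis.Redis(host=h, port=6379) for h in ("redis-a", "redis-b", "redis-c")
]

_RELEASE_LUA = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""

def redlock_acquire(resource: str, ttl_ms: int) -> Optional[str]:
    """Succeed only if a majority of instances grant the lock and the time spent
    acquiring it is comfortably smaller than the TTL."""
    token = str(uuid.uuid4())
    start = time.monotonic()
    granted = 0
    for client in INSTANCES:
        try:
            if client.set(resource, token, nx=True, px=ttl_ms):
                granted += 1
        except redis.RedisError:
            pass  # an unreachable instance simply does not count toward the majority
    elapsed_ms = (time.monotonic() - start) * 1000
    if granted >= len(INSTANCES) // 2 + 1 and elapsed_ms < ttl_ms:
        return token
    # Failed: best-effort release of any instances we did lock (compare-and-delete).
    for client in INSTANCES:
        try:
            client.eval(_RELEASE_LUA, 1, resource, token)
        except redis.RedisError:
            pass
    return None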

  • The Controversy: Distributed systems expert Martin Kleppmann famously argued that Redlock is not safe because it relies on synchronized system clocks, which can drift or jump.
  • Senior Verdict: Use Redis for performance/caching where occasional "double-processing" is acceptable. Use Zookeeper/Etcd for critical resources where safety is paramount.

Scalability and Performance

To handle millions of lock requests:

  1. Hierarchical Locks: Store locks in a tree structure (like Chubby).
  2. Client-Side Caching: Chubby allows clients to cache the "Lock Not Held" state to reduce load on the server.
  3. Proxy Layer: Use a proxy to batch lock requests from multiple clients on the same machine.

Interview Script

"To design a Distributed Lock Service, I would prioritize Safety and Liveness. I'd use a consensus-based approach like Raft to ensure the lock state is consistent across a quorum of nodes. I'd implement a lease-based mechanism to prevent deadlocks if a client crashes. Crucially, to handle process pauses like GC, I would implement Fencing Tokens, where every lock comes with a monotonically increasing ID that the storage layer validates before any write operation. While Redis Redlock is an option for high-throughput, low-criticality tasks, for a FAANG-scale production system, I'd lean towards a CP system like Zookeeper for its stronger safety guarantees."

Technical Trade-offs: Database Choice

Model                | Consistency | Latency   | Complexity | Best Use Case
Relational (ACID)    | Strong      | High      | Medium     | Financial Ledgers, Transactions
NoSQL (Wide-Column)  | Eventual    | Low       | High       | Large-Scale Analytics, High Write Load
In-Memory            | Variable    | Ultra-Low | Low        | Caching, Real-time Sessions

Key Takeaways

  • Safety is non-negotiable: only one process may hold the lock at a time, which is why the lock state lives in a quorum-replicated store driven by Raft or Paxos.
  • Liveness comes from leases: every lock carries a TTL and must be renewed, so a crashed client cannot hold the lock forever.
  • Fencing tokens (monotonically increasing IDs validated by the storage layer) protect against GC pauses and expired leases.
  • Prefer a CP system (Zookeeper/Etcd) for correctness-critical resources; reserve Redis/Redlock for high-throughput cases where occasional double-processing is acceptable.

Production Readiness Checklist

Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:

  • High Availability: Have we eliminated single points of failure across all layers?
  • Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
  • Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
  • Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
  • Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
