Introduction: The Distributed Memory Management Challenge
In single-process runtimes (such as the JVM or Go runtime), garbage collection (GC) is managed by local runtimes that scan memory graphs, identify unreachable objects, and reclaim space. The local engine has complete visibility of references and thread execution states, operating with strong memory-barrier guarantees.
When an application is decoupled into microservices communicating over a network, this centralized visibility is lost. If Service A allocates a temporary resource (such as a database connection, a large file on a shared block storage, or an in-memory session) inside Service B, managing the lifecycle of that resource becomes a distributed systems challenge.
If Service A crashes, experiences a network partition, or suffers from a software bug that prevents it from sending an explicit delete command, the allocated resource in Service B will leak. Over time, these orphaned resources accumulate, resulting in memory exhaustion, file handle limits, and cloud storage cost explosions.
To solve this, we must design a Distributed Garbage Collection (DGC) engine that manages references, detects dead cycles, and cleans up resources across network-isolated services.
Requirements and System Goals
Functional Requirements
- Explicit Resource Registration: Service providers (owners of resources) must be able to register active allocations and tie them to unique client references.
- Heartbeat-Based Lease Management: Support lease-based lifecycles where resources are granted to clients for a fixed duration (e.g., 60 seconds). Clients must renew their leases via periodic heartbeats.
- Reference Count Tracking: Maintain counts of active service dependencies pointing to shared resources (e.g., how many workflows are currently using a specific file segment).
- Distributed Cycle Detection: Detect circular references across network-isolated services (e.g., Service A references a resource in Service B, which references a resource in Service C, which circular-references back to Service A) where no external active clients remain.
- Idempotent Cleanup Routines: The reclamation workers must execute deletion commands idempotently to prevent system corruption during network retries.
Non-Functional Requirements
- Sub-Millisecond Lookup Latency: Validating lease status and reference counts must not introduce bottleneck overheads into the client processing path (lookup time less than 1.5ms).
- Partition Tolerance: The DGC must handle network partitions. If a client is isolated from the coordinator, the lease must eventually expire, and resources must be reclaimed safely (favoring Consistency over Availability).
- Minimal Network Bandwidth Egress: Heartbeat messages must be small and batchable to prevent flooding internal VPC network links.
- Observable Backlog & Orphan Telemetry: Expose real-time counts of orphan resources, sweep backlogs, and lease renewal failure rates.
API Interfaces and Service Contracts
We define gRPC interface contracts to manage distributed leases and references.
1. Lease Registration and Heartbeat API (gRPC)
Clients use this service to claim resources and keep them alive.
syntax = "proto3";
package database.dgc.v1;
service DistributedGCService {
// Allocate a new resource lease
rpc AcquireLease(AcquireLeaseRequest) returns (AcquireLeaseResponse);
// Renew an active resource lease
rpc RenewLease(RenewLeaseRequest) returns (RenewLeaseResponse);
// Explicitly release a resource reference
rpc ReleaseReference(ReleaseReferenceRequest) returns (ReleaseReferenceResponse);
}
message AcquireLeaseRequest {
string resource_id = 1;
string client_id = 2;
int64 lease_duration_ms = 3; // Requested TTL
}
message AcquireLeaseResponse {
string lease_id = 1;
int64 expires_at_epoch_ms = 2;
bool success = 3;
}
message RenewLeaseRequest {
string lease_id = 1;
string client_id = 2;
int64 extend_duration_ms = 3;
}
message RenewLeaseResponse {
bool success = 1;
int64 new_expires_at_epoch_ms = 2;
}
message ReleaseReferenceRequest {
string resource_id = 1;
string client_id = 2;
}
message ReleaseReferenceResponse {
bool success = 1;
int32 remaining_reference_count = 2;
}
High-Level Design and Visualizations
Distributed garbage collection requires a lease coordinator that tracks client references and sweeps expired entities.
Distributed Lease Coordinator Architecture
graph TD
ClientSvc[Client Service: Service A] -->|1. AcquireLease / Heartbeat RPC| Coord[Distributed Lease Coordinator]
Coord -->|2. Track state & write TTL| DB[(ACID Metadata Store: CockroachDB)]
subgraph Resource Provider Node (Service B)
Provider[Resource Provider Service] -->|3. Query Reference Status| Coord
Sweeper[Background Garbage Collector Sweeper] -->|4a. Fetch expired leases| Coord
Sweeper -->|4b. Delete physical files / connection| Disk[(Local Disk / DB Storage)]
end
style ClientSvc fill:#f8f9fa,stroke:#343a40
style Coord fill:#fff3cd,stroke:#ffc107
style DB fill:#f8d7da,stroke:#dc3545
The Distributed Reference Cycle Problem
Distributed cycles occur when resources refer to each other across service boundaries, preventing local reference counters from dropping to zero.
graph LR
subgraph Client Space
Client[External Active Worker]
end
subgraph Service A
ResA[Resource A: RefCount = 1]
end
subgraph Service B
ResB[Resource B: RefCount = 1]
end
subgraph Service C
ResC[Resource C: RefCount = 1]
end
Client -.->|Drops active reference| ResA
ResA -->|References| ResB
ResB -->|References| ResC
C_Link[ResC] -->|References back| ResA
style Client fill:#f8f9fa,stroke:#343a40
style ResA fill:#f8d7da,stroke:#dc3545
style ResB fill:#f8d7da,stroke:#dc3545
style ResC fill:#f8d7da,stroke:#dc3545
If the External Client drops its reference to Resource A, the references between A, B, and C keep the reference counts of all three nodes at 1. Without a global cycle detection graph traversal, these resources leak forever.
Low-Level Design and Schema Strategies
To track distributed references and dependencies, the lease coordinator manages two primary metadata schemas: lease allocations and dependency graph edges.
SQL Database Schema: Lease Registry
CREATE TABLE resource_leases (
lease_id UUID PRIMARY KEY,
resource_id VARCHAR(255) NOT NULL,
client_id VARCHAR(255) NOT NULL,
lease_status VARCHAR(16) NOT NULL CHECK (lease_status IN ('ACTIVE', 'EXPIRED')),
expires_at TIMESTAMP WITH TIME ZONE NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_lease_expiry ON resource_leases(expires_at, lease_status)
WHERE lease_status = 'ACTIVE';
SQL Database Schema: Distributed Reference Graph (Edge Table)
To detect cycles, the DGC logs references between resources across network partitions:
CREATE TABLE resource_dependency_edges (
source_resource_id VARCHAR(255) NOT NULL,
target_resource_id VARCHAR(255) NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (source_resource_id, target_resource_id)
);
CREATE INDEX idx_dependency_target ON resource_dependency_edges(target_resource_id);
Background Expiration Worker (SQL Sweep Logic)
A cron-based database sweep executes periodically to reclaim expired leases:
BEGIN;
-- 1. Identify expired leases
CREATE TEMP TABLE expired_leases_temp AS
SELECT lease_id, resource_id
FROM resource_leases
WHERE expires_at <= NOW()
AND lease_status = 'ACTIVE'
FOR UPDATE;
-- 2. Mark leases as EXPIRED
UPDATE resource_leases
SET lease_status = 'EXPIRED',
updated_at = NOW()
WHERE lease_id IN (SELECT lease_id FROM expired_leases_temp);
-- 3. Emit deletion events or execute cleanup callbacks
-- (Handled by the application outbox processor)
COMMIT;
Scaling and Operational Challenges: Calculations & Formulations
The primary performance bottleneck of lease-based DGC is network traffic from client heartbeats. Let us calculate this egress volume.
Back-of-the-Envelope Heartbeat Network Bandwidth Calculation
Let us define:
- $N_{\text{active_leases}}$: Number of concurrently active resource leases (e.g., 5,000,000 active leases).
- $T_{\text{heartbeat}}$: Heartbeat renewal interval (e.g., 10 seconds).
- $S_{\text{payload}}$: Average gRPC heartbeat packet size (IP header + TCP header + gRPC metadata +
RenewLeaseRequestpayload $\approx$ 150 bytes).
Step 1: Calculate Request Volume per Second (QPS)
If 5,000,000 active leases send heartbeats every 10 seconds, the rate of incoming heartbeat requests to the coordinator is:
$$QPS_{\text{heartbeat}} = \frac{N_{\text{active_leases}}}{T_{\text{heartbeat}}} = \frac{5,000,000}{10 \text{ seconds}} = 500,000 \text{ requests/sec}$$
Step 2: Calculate Ingress Network Bandwidth
$$\text{Ingress Traffic} = QPS_{\text{heartbeat}} \times S_{\text{payload}}$$
$$\text{Ingress Traffic} = 500,000 \text{ req/sec} \times 150 \text{ bytes} = 75,000,000 \text{ bytes/sec} = 75.0 \text{ MB/sec}$$
Step 3: Database Writes Rate (If updates hit disk directly)
If every heartbeat renewal executes a synchronous UPDATE query on the database:
$$\text{DB Write IOPS} = 500,000 \text{ writes/sec}$$
This write rate will crash standard relational database disks due to write amplification in index updates.
Mitigation: Memory-First Lease Renewal with Jittered DB Flushing
To scale the coordinator:
- In-Memory Tracking: Keep active lease expiry timers in a Redis sorted set (
ZSET), where the score is the lease expiration timestamp. - Fast Path Updates: Client heartbeats update Redis memory (
ZADD) instead of executing SQL updates:
$$\text{Redis Egress/Ingress Latency} \approx 1.5 \text{ ms}$$
- Background Batching: A background task periodically fetches expired leases from Redis and updates the persistent database in batches of 1,000 rows. This reduces SQL write IOPS from 500,000 to:
$$\text{SQL Write IOPS} = \frac{500,000}{1,000} = 500 \text{ writes/sec}$$
This optimization preserves database health, keeps lookup latencies sub-millisecond, and reduces network congestion.
Trade-offs and Architectural Alternatives
No single memory management approach fits all microservice dependencies. We compare three primary alternatives.
Distributed GC Strategy Comparison
| Strategy / Aspect | Reference Counting (RPC-based) | Distributed Lease Management (TTL-based) | Global Mark-and-Sweep (Tracing GC) |
|---|---|---|---|
| Philosophy | Immediate cleanup on counter release | Time-bounded lease expiry | Periodic global reachability scan |
| Write Path Latency | Low (Updates on client registration) | Low (Periodic heartbeat updates) | Zero (Runs entirely out-of-band) |
| Leak Handling (Crashes) | Poor (Missed RPC releases leak resource forever) | Excellent (Leases naturally expire and reclaim space) | Excellent (Cleans all unreachable nodes) |
| Network Egress Cost | Low (Occurs only on allocation/deallocation) | High (Requires continuous client heartbeats) | Very High (Requires scanning all service graphs) |
| Cycle Cleaning | Impossible (Cycles remain locked) | Good (Lease expiry breaks cycles automatically) | Excellent (Traverses and drops cycle groups) |
| Complexity | Low | Medium | Extremely High (Requires global distributed tracer) |
Key Trade-offs
- Reference Counting vs. Leases:
- Reference Counting: Best for low-frequency resource sharing where clients are highly reliable and network links are stable. It consumes minimal network bandwidth because no heartbeats are required.
- Leases: Necessary when clients can crash or be partitioned. It Trades network bandwidth (heartbeat cost) for a guarantee of bounded leak lifetimes.
Failure Modes and Fault Tolerance Strategies
Operating distributed cleanup loops exposes the system to network partition hazards.
1. The "Ghost Reclaim" under Client GC Pause
If Client Service A holds a lease but enters a long Stop-The-World (STW) GC pause, it stops sending heartbeats to the coordinator. The coordinator assumes Client A is dead, expires the lease, and reclaims the resource (e.g., deletes a temporary working directory). When Client A wakes up, it attempts to read the directory, resulting in application failures.
- Mitigation: Implement Lease Safety Buffers. The DGC coordinator marks the lease as
EXPIREDin database metadata but defers the physical deletion of the resource for a safety buffer period (e.g., an additional 30 seconds). If Client A wakes up during this buffer period and attempts a write, the system detects the expired lease and raises a lease renewal error, forcing Client A to abort safely.
2. Network Partitions and Split-Brain Leases
If a network partition isolates the Lease Coordinator from the Resource Provider, the coordinator cannot confirm if the leases are active.
- Mitigation: The Resource Provider must maintain Local Lease Verification. It caches lease states locally. If the partition lasts longer than the lease TTL, the Resource Provider locally disables writes to the resource, protecting data integrity.
3. Database Outage during Sweep
If the DGC database goes offline, expired leases cannot be flagged, causing memory pressure.
- Mitigation: Implement local memory safety valves. If database connections fail, the local resource hosts trigger emergency deletions of oldest non-financial cache files.
Verbal Script
Interviewer: "How would you design a distributed garbage collection system to manage temporary storage resources allocated across multiple microservices?"
Candidate: "To design a distributed garbage collection system, I would use a Lease-based Lifecycle Coordinator rather than a pure reference counting model.
Reference counting is too fragile in distributed environments; if a client crashes or a network partition drops a deallocation RPC, the resource leaks permanently.
In my design, when Client Service A requests a resource in Service B, it must acquire a lease from a centralized Lease Coordinator.
The lease has a configured time-to-live of 60 seconds.
Client A must run a background heartbeat thread that sends a renewal RPC to the coordinator every 20 seconds.
If Client A crashes or is partitioned, the heartbeat stops, the lease naturally expires after 60 seconds, and a background database sweeper node reclaims the resource.
To scale the write path of heartbeats, I would use an in-memory cache layer like Redis to store active lease timers in a sorted set before flushing updates to our persistent CockroachDB metadata store in batches.
This prevents 500,000 write operations per second from overwhelming the database disk.
Furthermore, to protect against client-side GC pauses causing ghost reclamations, we implement a safety grace period.
If a client’s lease expires while it is paused, the physical resource is not deleted immediately.
When the client wakes up and attempts to access the resource, the system rejects the operation, forcing the client to abort before data corruption can occur."
Interviewer: "How do you handle distributed reference cycles where Service A, B, and C refer to each other's resources, and no external client is active?"
Candidate: "Distributed reference cycles are notoriously difficult to clean because local reference counters never drop to zero.
To solve this without building a heavy, complex global graph tracer, I would enforce two strict patterns:
First, Unidirectional Resource Dependency Paths. We design our service contracts to ensure that dependencies are directed and acyclic. A service is forbidden from creating circular back-references.
Second, Absolute TTL Timeouts. Every temporary resource must have an absolute, non-renewable maximum lifetime, regardless of active leases.
For example, a temporary workflow folder can have a maximum TTL of 24 hours.
Even if a cycle keeps the lease active, the absolute TTL prunes the resource once the limit is reached.
This simple, pragmatic policy avoids the massive overhead of distributed graph traversals while guaranteeing that resources are bound and eventually reclaimed."