Lesson 73 of 105 17 minFlagship

System Design: Building a Service Discovery Platform

Design a production service discovery platform with registration, health checks, heartbeats, client-side and server-side discovery, zone awareness, rollout safety, and failure isolation.

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • `checkout-api` needs a healthy `inventory-api`
  • `payment-worker` needs one `fraud-api` instance in the same region if possible
  • API gateway needs all healthy instances of `orders-api`
Recommended Prerequisites
System Design Interview Framework

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

In a distributed system, naming a service is easy. Finding a healthy instance of that service, right now, in the right zone, during a deployment, while network partitions are actively occurring, is the real challenge.

As microservice architectures grow, static configuration files fall apart. Autoscaling, rolling deployments, container rescheduling, zone-aware routing, and dynamic failovers turn the question "where should this request go?" into a complex control-plane synchronization problem. If a service needs to route thousands of requests per second, it cannot rely on human-updated static files. It requires a dynamic, highly automated discovery mechanism.

This system design guide details the architecture of a highly available, low-latency, and zone-aware service discovery platform capable of managing over 100,000 active service instances. We will explore how instances announce their availability, how the registry maintains health state, how clients route around failures, and the critical trade-offs involved in consistency models.


System Requirements

To build a production-grade service discovery platform, we must divide our requirements into functional, non-functional, and scale specifications.

Functional Requirements

  • Service Registration & Deregistration: Allow new service instances to register their metadata (IP address, port, zone, version) upon startup and clean up records gracefully when shutting down.
  • Lease-Based Liveness (Heartbeats): Instances must periodically renew their leases. If an instance fails to heartbeat within a specified window, it must be marked unhealthy and eventually pruned from the active pool.
  • Client Query & Subscription (Watch API): Enable client microservices to discover healthy instances of a target service. Support real-time streaming updates so that clients are notified immediately when instances join or leave.
  • Zone & Metadata Routing: Propagate rich metadata such as availability zone, region, container version, and routing weight, allowing clients to make intelligent, cost-effective routing decisions.
  • Graceful Decommissioning: Support a draining state to gracefully stop sending new requests to instances that are about to shut down, allowing in-flight requests to complete without generating errors.

Non-Functional Requirements

  • Low-Latency Reads: Resolving endpoints for a service must take less than 1 millisecond. This implies reading from local memory caches rather than blocking on remote registry network calls.
  • Control Plane Partition Resilience (Data Plane Survivability): Even if the discovery registry experiences a total outage, microservices must continue to communicate using their last-known-good local caches.
  • High Ingress Write Scalability: The control plane must comfortably ingest tens of thousands of heartbeat signals per second without experiencing database lock contention or CPU exhaustion.
  • Bounded Eventual Consistency: The propagation delay of a state change (e.g., an instance dying) to all subscribing clients must be less than 5 seconds across the entire fleet.
  • Security & Workload Identity: Only authenticated service instances may register themselves under a service name to prevent malicious routing and man-in-the-middle attacks.

API Design and Service Contracts

To achieve language-agnostic integration, the discovery platform exposes both a REST HTTP JSON API and a high-performance gRPC service.

1. Service Registration (POST /api/v1/instances)

When a microservice container boots, it issues a registration request.

Request Payload:

{
  "serviceName": "inventory-api",
  "instanceId": "inventory-api-7f9c8d-h2kgw",
  "ipAddress": "10.42.6.18",
  "port": 8080,
  "zone": "us-east-1a",
  "region": "us-east-1",
  "version": "v2.1.0",
  "weight": 100,
  "healthCheckUrl": "http://10.42.6.18:8080/health/ready",
  "ttlSeconds": 30
}

Response Payload (201 Created):

{
  "status": "REGISTERED",
  "leaseId": "lease-abc-123-xyz",
  "expiresAt": "2026-06-07T10:25:30Z"
}

2. Heartbeat Renewal (PUT /api/v1/instances/{instanceId}/heartbeat)

Registered instances must ping the registry before their lease expires.

Request Payload:

{
  "leaseId": "lease-abc-123-xyz",
  "serviceName": "inventory-api"
}

Response Payload (200 OK):

{
  "status": "RENEWED",
  "newExpiresAt": "2026-06-07T10:26:00Z"
}

3. Service Query (GET /api/v1/services/{serviceName}/endpoints)

Used by clients on startup to populate their initial cache.

Response Payload (200 OK):

{
  "serviceName": "inventory-api",
  "version": 45892,
  "endpoints": [
    {
      "instanceId": "inventory-api-7f9c8d-h2kgw",
      "ipAddress": "10.42.6.18",
      "port": 8080,
      "zone": "us-east-1a",
      "status": "HEALTHY",
      "weight": 100
    },
    {
      "instanceId": "inventory-api-7f9c8d-k9lpo",
      "ipAddress": "10.42.7.22",
      "port": 8080,
      "zone": "us-east-1b",
      "status": "DRAINING",
      "weight": 50
    }
  ]
}

4. Watch / Streaming Endpoint Updates (gRPC Protocol)

To avoid continuous polling, clients open a long-lived bidirectional gRPC stream to receive delta updates.

syntax = "proto3";

package discovery.v1;

service DiscoveryService {
  rpc WatchEndpoints(WatchRequest) returns (stream WatchResponse);
}

message WatchRequest {
  string service_name = 1;
  string client_instance_id = 2;
  int64 last_known_version = 3;
}

message EndpointEvent {
  enum EventType {
    ADDED = 0;
    MODIFIED = 1;
    REMOVED = 2;
  }
  EventType type = 1;
  string instance_id = 2;
  string ip_address = 3;
  int32 port = 4;
  string zone = 5;
  string status = 6;
  int32 weight = 7;
}

message WatchResponse {
  string service_name = 1;
  int64 current_version = 2;
  repeated EndpointEvent events = 3;
}

High-Level Architecture

The system splits responsibilities between the Service Instance, the Registry Control Plane, and the Calling Client/Sidecar.

The Service Instance is responsible for self-registration and periodic heartbeats. The Registry Control Plane is a cluster of metadata servers that store instance leases and distribute updates. The Calling Client/Sidecar maintains a local cache of endpoint metadata, handles client-side load balancing, and monitors network changes.

Discovery Propagation Pipeline

This sequence diagram tracks the lifecycle of an instance registration, writing to the consensus store, and its real-time propagation to consumer sidecars.

sequenceDiagram
    autonumber
    participant Instance as Service Instance
    participant Registry as Registry Cluster (Control Plane)
    participant Store as Distributed Store (Raft consensus)
    participant Client as Consumer Client / Sidecar
    
    Instance->>Registry: POST /api/v1/instances (Register + Metadata)
    Registry->>Store: Write Instance Record & Lease Time
    Store-->>Registry: Commit Success
    Registry-->>Instance: 201 Created (Lease Assigned)
    
    Note over Registry, Client: Active Watch Stream (gRPC / SSE)
    Registry->>Client: Push Delta Event: INSTANCE_ADDED (10.42.6.18:8080)
    Client->>Client: Update Local Cache & Refresh Selector Pool

Client-Side Endpoint Selection Flow

This flowchart details how the client SDK or sidecar chooses a target node, optimizing for zone affinity while handling fallbacks when local instances fail.

graph TD
    A[Start Endpoint Selection] --> B{Local Cache Populated?}
    B -- No --> C[Sync Query Control Plane]
    B -- Yes --> D[Filter Endpoints by Target Service]
    D --> E[Filter out non-HEALTHY states]
    E --> F{Any same-zone endpoints?}
    F -- Yes --> G[Select from Same-Zone Pool using WRR]
    F -- No --> H{Any same-region endpoints?}
    H -- Yes --> I[Select from Same-Region Pool using WRR]
    H -- No --> J[Raise NoHealthyEndpoints Exception]
    G --> K[Route Request to Target Instance]
    I --> K
    C --> K

Low-Level Design and Schema

At the database layer, the discovery registry tracks services, instances, and lease expirations. While the production control plane keeps this data in-memory or replicates it using consensus algorithms (like Raft), we model the system relational schema below to define the constraints and composite indexes.

-- Represents logical services in the fleet
CREATE TABLE services (
    service_name VARCHAR(128) PRIMARY KEY,
    owner_team VARCHAR(64) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Tracks active service instances and metadata
CREATE TABLE service_instances (
    instance_id VARCHAR(128) PRIMARY KEY,
    service_name VARCHAR(128) NOT NULL REFERENCES services(service_name) ON DELETE CASCADE,
    ip_address VARCHAR(45) NOT NULL, -- Supports IPv4 and IPv6
    port INT NOT NULL,
    zone VARCHAR(32) NOT NULL,
    region VARCHAR(32) NOT NULL,
    version VARCHAR(32) NOT NULL,
    weight INT NOT NULL DEFAULT 100 CHECK (weight >= 0),
    status VARCHAR(20) NOT NULL DEFAULT 'STARTING', -- STARTING, HEALTHY, DRAINING, UNHEALTHY
    metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
    lease_expires_at TIMESTAMPTZ NOT NULL,
    registered_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Index for service health lookups (hot read path)
CREATE INDEX idx_service_instances_lookup 
ON service_instances (service_name, status, zone);

-- Index for cleanup worker finding expired leases
CREATE INDEX idx_service_instances_lease 
ON service_instances (lease_expires_at) 
WHERE status != 'UNHEALTHY';

Schema Rationale & Index Optimization

  1. idx_service_instances_lookup: This is a composite index built specifically for the client read path. Clients filter by service_name and status = 'HEALTHY'. Including zone in the key allows the index to serve zone-aware queries immediately without performing an additional filter scan.
  2. JSONB Metadata: Service teams require custom tags (e.g., "canary=true", "workload-id=spiffe://prod/app"). Storing these inside a JSONB column provides structural flexibility while retaining the ability to index specific properties using PostgreSQL GIN indexes if needed.
  3. idx_service_instances_lease: Used by the control plane's cleanup thread. The partial index filters out instances already marked UNHEALTHY, making the periodic lease expiration scan extremely fast.

Scaling Challenges and Capacity Estimation

To verify the viability of our platform, we perform back-of-the-envelope calculations for a fleet of $100,000$ active instances.

1. Heartbeat Ingress Network Throughput

We must evaluate whether the registry can handle the incoming heartbeat bandwidth.

  • Assumptions:

    • Total instances ($N$) = $100,000$
    • Heartbeat Interval ($T$) = $10$ seconds
    • Heartbeat Payload Size ($P$) = $500$ bytes (HTTP headers + JSON payload)
  • Ingress Calculations: $$\text{Requests Per Second (RPS)} = \frac{N}{T} = \frac{100,000}{10} = 10,000\text{ req/sec}$$ $$\text{Ingress Bandwidth} = \text{RPS} \times P = 10,000 \times 500\text{ bytes} = 5,000,000\text{ B/sec} = 5\text{ MB/sec}$$

This ingestion rate is modest for network bandwidth but represents significant write load for traditional database storage engines. If every heartbeat triggers a synchronous write to a spinning disk or a distributed SQL store, the system will crash under transaction logs and lock contention. To scale this, the registry must write heartbeats to an in-memory storage engine (e.g., Redis or an in-memory Raft state machine) rather than performing blocking disk writes.

2. Update Propagation Network Bandwidth

We must calculate the network egress required to stream updates to clients.

  • Assumptions:

    • Total watching clients ($C$) = $100,000$
    • Churn Rate = 1% of the fleet restarts or autoscales per minute (1,000 instances/minute $\approx 16.7$ instances/sec)
    • Fanout Factor ($F$) = Average number of clients watching each service. We assume $F = 50$ (some core services like inventory-api are watched by hundreds, while minor workers are watched by few).
    • Event Payload Size ($E$) = $1,000$ bytes (contains full endpoint object representation)
  • Egress Calculations: $$\text{Delta Updates Per Second} = \text{Churn Rate} \times F = 16.7\text{ events/sec} \times 50 = 835\text{ updates/sec}$$ $$\text{Egress Bandwidth} = 835 \times E = 835 \times 1,000\text{ bytes} = 835,000\text{ B/sec} \approx 0.835\text{ MB/sec}$$

Even with spikes during major rollouts (where churn rate might multiply by 10), the egress bandwidth remains under 10 MB/sec, demonstrating that streaming delta updates is highly efficient and scalable compared to full list polling.


Failure Scenarios and Resilience

Distributed registries operate as the absolute root of trust for traffic routing. When they fail, the system must degrade gracefully.

1. Network Split-Brain (CAP Theorem)

During a network partition dividing the registry cluster into two halves:

  • Consistent registries (CP - e.g., Consul): The minority partition fails to achieve a Raft quorum and rejects registration writes and updates. This prevents clients from routing to new instances but guarantees that no stale endpoint state is write-validated.
  • Available registries (AP - e.g., Eureka): Both partitions accept updates. Eureka nodes replicate data peer-to-peer on a best-effort basis. If a partition occurs, clients on one side will see different endpoint lists than the other.
  • Resilience Design: We choose an AP-leaning configuration with peer-to-peer gossip for lookup distribution, paired with active outlier detection on the client side. If a client receives a stale endpoint that repeatedly returns 503 Service Unavailable, the client temporarily evicts it locally, bypassing registry staleness.

2. Watch Stream Failures and Thundering Herds

If the registry cluster experiences a brief network flap, all $100,000$ active watch streams will drop simultaneously.

  • The Threat: When connectivity recovers, all $100,000$ sidecars will attempt to reconnect and fetch full snapshots of their target services, creating a massive spike in registry CPU and memory.
  • Resilience Design:
    • Reconnection Jitter: Clients must implement exponential backoff with randomized jitter (e.g., backoff time = $min(30s, \text{initial_backoff} \times 2^{\text{retry}} \pm \text{random_jitter})$) before attempting to reconnect.
    • Delta-First Recovery: When reconnecting, clients send their last_known_version in the watch request. If the registry can serve the deltas from that version forward, it avoids sending a full snapshot.
    • Control Plane Rate Limiting: The registry exposes token-bucket rate limiters on full snapshot requests.

3. Rapid Instance Flapping (Noisy Neighbor Nodes)

An instance with a failing network card or a thrashing garbage collector might cycle between healthy and unhealthy every few seconds.

  • The Threat: Flapping sends a continuous stream of registration and deregistration events, flooding the watch pipeline and causing client routing tables to churn constantly.
  • Resilience Design:
    • Hysteresis: Implement threshold counts for state changes. An instance must succeed in serving 3 consecutive heartbeats to move from UNHEALTHY to HEALTHY. It must fail 2 consecutive heartbeats to be marked UNHEALTHY.
    • Cool-down Penalty: If an instance changes state more than 3 times in a 60-second window, the registry freezes its status as UNHEALTHY for a 5-minute penalty phase.

4. Registry Control Plane Outage

What happens if the entire discovery registry goes dark?

  • The Threat: Microservices cannot query the registry, potentially stopping all service-to-service communication if clients query the registry on every call.
  • Resilience Design:
    • Data Plane Survivability: The calling microservice client/sidecar caches the endpoint list locally. If the registry query fails, the client uses the cached endpoints indefinitely.
    • No Age Eviction During Outages: If the client loses connection to the registry, it freezes lease expiration checks on its local cache. It is better to route to an instance that might be dead than to fail 100% of calls by emptying the local routing table.

Architectural Trade-offs

Choosing the runtime topology of a service discovery system requires balancing architectural complexity against operational capabilities.

Trade-off 1: Client-Side Routing vs. Proxy-Based Routing

In client-side routing, the application client directly contacts the registry and performs load balancing. In proxy-based routing, clients send requests to a dedicated proxy (e.g., Envoy or an API Gateway) which handles the lookup.

Aspect Client-Side Routing (SDK / Sidecar) Proxy-Based Routing (Load Balancer)
Network Hops 1 hop (Direct client-to-service communication) 2 hops (Client-to-proxy, proxy-to-service)
Operational Complexity High (Every service requires an SDK or sidecar daemon) Low (Client configuration is a static proxy IP)
Language Portability Hard (Must implement routing logic in every language) Easy (Language independent)
Failure Isolation High (A single client failing affects only that client) Low (If the proxy pool fails, all traffic halts)
Real-time Latency Lowest (No intermediate proxy processing overhead) Higher (Adds network latency of the proxy hop)

Trade-off 2: CP (Consistent) vs. AP (Available) Registry Architecture

We compare a Raft-backed consistent registry (e.g., HashiCorp Consul) with a peer-to-peer eventually consistent registry (e.g., Netflix Eureka).

Aspect CP Registry (Raft/Paxos) AP Registry (Gossip/Peer-to-Peer)
Write Availability Degrades during network splits (requires majority quorum) High (Every node accepts writes; replicates later)
Read Consistency Strong (Guaranteed up-to-date view of endpoints) Eventual (Reads may return stale/dead endpoints)
Scalability Harder (Writes must go through the Raft leader) Easier (Writes can be distributed across all nodes)
Network Partition Behavior Rejects registrations on partitioned nodes Accepts registrations; tolerates transient state skew

Staff Engineer Perspective

Applying service discovery at hyper-scale reveals operational subtleties that simple tutorials ignore.


Verbal Script

Interviewer: "If the service discovery control plane suffers a complete network partition separating the registry from the microservices, how does the system maintain connectivity, and how do we prevent routing to dead instances?"

Candidate: "We handle this by separating the control plane from the data plane. Our calling services utilize a client sidecar or SDK that maintains a local cache of healthy endpoints. If the registry becomes completely unreachable, the client sidecars freeze their local caches and continue routing requests using the last-known-good endpoint set. We explicitly disable lease expiration locally during registry outages to prevent emptying the routing tables.

However, because the control plane is offline, we cannot receive updates if an endpoint dies during the partition. To handle this, we implement client-side outlier detection and circuit breaking. If a client sends a request to a cached endpoint and encounters three consecutive connection timeouts or 5xx gateway errors, the client-side load balancer temporarily evicts that endpoint from its local active pool for a cool-down period. This allows the system to route around failures dynamically without needing a functioning control plane."

Interviewer: "What load balancing strategy would you implement on the client sidecar to ensure optimal utilization across zones?"

Candidate: "I would implement a Zone-Aware Weighted Round Robin strategy. The client SDK queries its local metadata to identify its own availability zone. It then filters the cached endpoints, prioritising instances located in the same zone. If same-zone instances are available and healthy, the load balancer distributes traffic among them based on their registered weights.

If all same-zone endpoints become unhealthy or if their latency spikes beyond a defined threshold, the load balancer falls back to other zones within the same region. This minimizes cross-zone data transfer costs and network latency while maintaining high availability during localized zone outages."

Interviewer: "How would you handle service discovery in a hybrid environment where some services run on Kubernetes and others run on legacy bare-metal VMs?"

Candidate: "I would use a Unified Service Registry that supports dual-registration mechanisms. For Kubernetes workloads, we can deploy a synchronization controller (similar to Consul Catalog Sync) that listens to the Kubernetes API server for pod updates and automatically mirrors them into our global registry. For bare-metal VMs, we configure a lightweight local agent that executes local health-check scripts and registers the VMs via the REST API. This ensures a single source of truth for all workloads regardless of their execution environment, and allows our client sidecars to route seamlessly across platform boundaries."


Key Takeaways

  • checkout-api needs a healthy inventory-api
  • payment-worker needs one fraud-api instance in the same region if possible
  • API gateway needs all healthy instances of orders-api

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.