System Design: Building a Service Discovery Platform

In a distributed system, naming a service is easy.

Finding a healthy instance of that service, right now, in the right zone, during a deployment, while failures are happening, is the real problem.

That problem is service discovery.

At small scale, teams hardcode hostnames or keep a static list of instances. That falls apart as soon as autoscaling, rolling deploys, container rescheduling, zone-aware routing, or dynamic failover appear. Suddenly the question "where should this request go?" becomes a control-plane problem with real operational consequences.

This guide designs a production service discovery platform.

Problem Statement

Mental Model

Connecting isolated components into a resilient, scalable, and observable distributed web.

graph TD
    App[Application Server] -->|Read Request| Cache[(Redis Cache)]
    Cache -- Cache Miss --> DB[(Primary Database)]
    DB -- Return Data --> App
    App -- Write Data --> Cache

Build a platform that lets services discover other healthy service instances dynamically.

Examples:

checkout-api needs a healthy inventory-api
payment-worker needs one fraud-api instance in the same region if possible
API gateway needs all healthy instances of orders-api
internal tooling wants to know which version of search-worker is live

The platform should:

register service instances
detect unhealthy instances quickly
distribute instance lists safely
support client-side or server-side load balancing
support zones, regions, and rollout metadata
remain usable during partial control-plane outages

This is not just a registry. It is a liveness, routing, and consistency system.

Requirements

Functional requirements:

register and deregister instances
support leases or heartbeats
perform or consume health checks
list healthy instances for a service
expose metadata such as zone, version, and weight
support canary or subset discovery
support service decommissioning during rollout

Non-functional requirements:

low-latency lookup
high availability
fast removal of unhealthy instances
bounded staleness
no single point of failure in the request path
operational visibility into instance state

The most important design constraint:

traffic routing should continue safely even if the discovery control plane is temporarily unavailable.

Why Static Config Fails

A static list of hosts looks fine at first:

inventory-api:
  - 10.0.1.10:8080
  - 10.0.1.11:8080

But real systems need:

autoscaling
rolling deploys
instance replacement
multi-zone failover
canary versions
node eviction or container restarts

With static config, every topology change becomes a config rollout problem. That does not scale operationally.

High-Level Architecture

Service Instance
   |
   +--> register / heartbeat
   |
   v
Discovery Control Plane
   |
   +--> instance registry
   +--> health state
   +--> watch / snapshot API
   |
   v
Discovery Client / Sidecar
   |
   +--> local cache of healthy endpoints
   |
   v
Calling Service

Optional server-side discovery path:

Client -> discovery-aware proxy / load balancer -> backend instance

There are two main runtime models:

client-side discovery
server-side discovery

Client-Side vs Server-Side Discovery

Client-side discovery

The caller asks discovery for endpoints and balances traffic itself.

Pros:

fewer network hops
flexible routing
good for internal service-to-service traffic

Cons:

every client needs a good library or sidecar
rollout bugs can spread across many services

Server-side discovery

The client sends traffic to a proxy or load balancer, which looks up instances.

Pros:

simpler clients
central routing policy

Cons:

extra hop
proxy becomes a critical shared component

Most large systems use a mix:

client-side inside service mesh or RPC stacks
server-side at gateways and edge layers

Registration Model

Each service instance needs an identity and metadata.

Example registration payload:

{
  "service": "inventory-api",
  "instanceId": "inventory-api-7f9c8d4f7d-h2kgw",
  "host": "10.42.6.18",
  "port": 8080,
  "zone": "ap-south-1a",
  "region": "ap-south-1",
  "version": "2026.04.18.2",
  "weight": 100,
  "healthEndpoint": "/health/ready"
}

Useful metadata:

service name
instance id
host and port
zone / region
version
deployment group / canary marker
protocol
tags

Registry Data Model

CREATE TABLE service_instances (
  service_name TEXT NOT NULL,
  instance_id TEXT PRIMARY KEY,
  host TEXT NOT NULL,
  port INT NOT NULL,
  zone TEXT NOT NULL,
  region TEXT NOT NULL,
  version TEXT NOT NULL,
  weight INT NOT NULL DEFAULT 100,
  status TEXT NOT NULL,          -- starting, healthy, draining, unhealthy, gone
  metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
  registered_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  lease_expires_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_service_instances_lookup
  ON service_instances (service_name, status, zone);

This table is conceptual. At scale, many discovery systems use strongly-consistent key-value stores or in-memory state machines rather than a generic SQL path in the hot control loop.

Heartbeats and Leases

The platform needs a way to decide whether an instance is still alive.

Common model:

instance registers with a lease
instance renews lease every N seconds
if lease expires, instance becomes unavailable

Example:

heartbeat interval: 10 seconds
lease expiry: 30 seconds

This prevents dead instances from staying discoverable forever.

Simple flow:

instance registers
registry stores lease_expires_at = now + 30s
instance heartbeats every 10s
registry extends lease
if heartbeat stops, instance expires after 30s

Health Checks

Heartbeat is not enough.

A process can still heartbeat while being unable to serve traffic correctly.

Typical health types:

process alive
app ready
dependency degraded
draining for shutdown

Useful distinction:

liveness: should the process be restarted?
readiness: should traffic be sent to it?

The discovery platform should route based on readiness, not just process liveness.

Status Lifecycle

Instances should move through clear states:

STARTING
HEALTHY
DRAINING
UNHEALTHY
EXPIRED
REMOVED

Why DRAINING matters:

rollout starts
instance should stop receiving new traffic
in-flight requests can finish

This reduces abrupt failures during deploys and autoscaling down.

Watch API vs Polling

Clients need updates when topology changes.

Polling

Client asks every few seconds for latest instance list.

Pros:

simple

Cons:

more stale
higher control-plane load

Watch / streaming updates

Registry pushes changes or lets clients maintain a long-lived watch.

Pros:

faster updates
lower steady-state control-plane load

Cons:

more implementation complexity

A common hybrid:

full snapshot on startup
watch for updates afterward
fallback to periodic full refresh

Client Cache

Never make every service call block on a control-plane round trip.

Client library or sidecar should maintain:

current healthy instance list
metadata
last update version
last-known-good snapshot

That gives safe behavior if discovery briefly fails.

Example local state:

{
  "service": "inventory-api",
  "version": 2881,
  "instances": [
    { "host": "10.42.6.18", "port": 8080, "zone": "ap-south-1a", "weight": 100 },
    { "host": "10.42.7.22", "port": 8080, "zone": "ap-south-1b", "weight": 100 }
  ]
}

Selection Strategy

Once the client has a healthy list, it still needs to choose an instance.

Common strategies:

round robin
weighted round robin
least requests
random with power of two choices
consistent hashing for sticky traffic

For internal RPC, a great default is:

prefer same-zone endpoints
use weighted random or least requests
fallback cross-zone only if needed

Zone Awareness

Cross-zone traffic adds cost and latency.

So if a service has instances in:

ap-south-1a
ap-south-1b
ap-south-1c

and the caller is in ap-south-1a, the discovery client should prefer that zone first.

Fallback order:

same zone healthy endpoints
same region other zones
remote region only if policy allows

That policy should be configurable, not hidden inside a client library with no visibility.

Rollouts and Canary Support

Discovery metadata is useful during deployments.

For example:

stable version has weight 100
canary version has weight 5

Or:

only clients with canary tag should see canary instances

Discovery record can include:

{
  "version": "2026.04.18.2",
  "deploymentGroup": "canary"
}

This lets routing systems make rollout decisions without every client learning deployment internals.

Failure Modes

1. Dead instance still receives traffic

Cause:

lease expiry too long
readiness not checked
stale client cache

Fix:

shorter leases
readiness-aware status
faster watch propagation

2. Flapping instance churns in and out

Cause:

noisy health checks

Fix:

hysteresis
consecutive-failure thresholds
cool-down period before reentry

3. Discovery control plane outage

Cause:

registry cluster issue

Fix:

clients continue with last-known-good cache
do not clear endpoint cache immediately

4. Full stale endpoint set after partition

Cause:

watch stream broken silently

Fix:

periodic full refresh
endpoint version monotonicity
freshness alarms

5. Rolling deployment sends traffic too early

Cause:

instance marked discoverable before ready

Fix:

explicit startup state
only publish once readiness is true

Consistency Trade-Offs

Discovery is usually eventually consistent at the fleet level.

That is acceptable if:

bad endpoints are removed fast enough
stale topology windows are short
clients retry safely

Trying to make every discovery read globally strongly consistent often adds more pain than value in the request path.

The better model is:

strong enough control-plane state
local cached runtime view
retries and circuit breaking in callers

Integration With Load Balancing

Discovery alone is not enough.

Calling clients still need:

connection pooling
timeouts
retries
circuit breakers
outlier detection

A discovery system that returns endpoints but ignores real-time failure behavior leaves too much burden on every service team.

Example Client-Side Selector

public class DiscoveryClient {

    private final EndpointCache endpointCache;

    public Endpoint select(String service, String callerZone) {
        List<Endpoint> endpoints = endpointCache.healthyEndpoints(service);

        List<Endpoint> sameZone = endpoints.stream()
            .filter(e -> e.zone().equals(callerZone))
            .toList();

        List<Endpoint> pool = sameZone.isEmpty() ? endpoints : sameZone;
        if (pool.isEmpty()) {
            throw new NoHealthyEndpointsException(service);
        }

        return weightedRandom(pool);
    }
}

That’s intentionally boring. Discovery logic should be predictable, not clever.

Server-Side Alternative

In some systems, a local sidecar or proxy handles discovery:

app -> local proxy -> healthy upstream

This centralizes:

retries
timeouts
endpoint selection
circuit breaking

This is attractive when you do not want every application language runtime to implement discovery logic differently.

Security and Trust

Who is allowed to register instances?

Important protections:

only trusted workloads can register
instance identity tied to workload identity, not arbitrary hostname claims
metadata changes audited
discovery records cannot be spoofed casually

Otherwise, your routing layer becomes a place where malicious or broken workloads can poison service resolution.

Observability

Track:

total registered instances by service
healthy vs unhealthy count
lease expiry count
registration errors
watch propagation latency
stale cache age
zone skew
endpoint selection failures

Useful dashboards:

service endpoint count over time
discovery lag during deployments
flapping instances
most common no-healthy-endpoints errors

If discovery is failing, service owners should know whether the issue is:

no healthy instances
stale client cache
control-plane lag
wrong metadata or policy

What I Would Build First

Phase 1:

service registry
lease-based registration
healthy instance lookup
simple client cache with periodic polling

Phase 2:

watch API
zone-aware selection
draining state and rollout metadata
observability dashboards

Phase 3:

sidecar/proxy integration
canary-aware discovery
stronger workload identity integration
richer outlier and traffic policy hooks

This order matters. Teams often leap into elaborate mesh features before they have solid registration and liveness semantics.

Production Checklist

service instances have stable identity
readiness affects discoverability
leases expire automatically
clients cache last-known-good endpoint set
periodic refresh exists even with watches
zone-aware routing supported
draining state used in deploys
no request path depends on control-plane round trip
discovery metadata audited
stale endpoint age monitored

Final Takeaway

Service discovery is how a distributed system learns where it can safely send traffic right now.

If you design it well, deployments, autoscaling, and failures become routine infrastructure behavior.

If you design it poorly, every topology change turns into random connection errors and ghost outages.

Technical Trade-offs: Database Choice

Model	Consistency	Latency	Complexity	Best Use Case
Relational (ACID)	Strong	High	Medium	Financial Ledgers, Transactions
NoSQL (Wide-Column)	Eventual	Low	High	Large-Scale Analytics, High Write Load
In-Memory	Variable	Ultra-Low	Low	Caching, Real-time Sessions

Key Takeaways

checkout-api needs a healthy inventory-api
payment-worker needs one fraud-api instance in the same region if possible
API gateway needs all healthy instances of orders-api

Verbal Interview Script

Interviewer: "How would you ensure high availability and fault tolerance for this specific architecture?"

Candidate: "To achieve 'Five Nines' (99.999%) availability, we must eliminate all Single Points of Failure (SPOF). I would deploy the API Gateway and stateless microservices across multiple Availability Zones (AZs) behind an active-active load balancer. For the data layer, I would use asynchronous replication to a read-replica in a different region for disaster recovery. Furthermore, it's not enough to just deploy redundantly; we must protect the system from cascading failures. I would implement strict timeouts, retry mechanisms with exponential backoff and jitter, and Circuit Breakers (using a library like Resilience4j) on all synchronous network calls between microservices."

System Design: Building a Service Discovery Platform

Problem Statement

Mental Model

Requirements

Why Static Config Fails

High-Level Architecture

Client-Side vs Server-Side Discovery

Client-side discovery

Server-side discovery

Registration Model

Registry Data Model

Heartbeats and Leases

Health Checks

Status Lifecycle

Watch API vs Polling

Polling

Watch / streaming updates

Client Cache

Selection Strategy

Zone Awareness

Rollouts and Canary Support

Failure Modes

1. Dead instance still receives traffic

2. Flapping instance churns in and out

3. Discovery control plane outage

4. Full stale endpoint set after partition

5. Rolling deployment sends traffic too early

Consistency Trade-Offs

Integration With Load Balancing

Example Client-Side Selector

Server-Side Alternative

Security and Trust

Observability

What I Would Build First

Production Checklist

Final Takeaway

Technical Trade-offs: Database Choice

Key Takeaways

Read Next

Verbal Interview Script

Want to track your progress?