Spring Boot makes it easy to start a service. Production makes it clear whether the service is actually ready.
A production-ready service is not just one that passes unit tests. It has bounded timeouts, sane thread pools, useful health checks, structured logs, metrics, graceful shutdown, safe configuration, and predictable behavior when dependencies fail.
This checklist focuses on the things that prevent real outages.
1. Set Timeouts Everywhere
The default timeout is often too high, missing, or hidden in a library. Every outbound call should have a connect timeout and a read/response timeout.
For WebClient:
HttpClient httpClient = HttpClient.create()
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1000)
        .responseTimeout(Duration.ofSeconds(2));

WebClient client = WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(httpClient))
        .baseUrl("https://payment-service")
        .build();
For RestTemplate:
SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
factory.setConnectTimeout(1000);
factory.setReadTimeout(2000);
RestTemplate restTemplate = new RestTemplate(factory);
Timeouts should be lower than the upstream caller's timeout. If your load balancer times out at 30 seconds, your service should fail dependency calls much earlier and return a controlled error.
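Client-level timeouts can be backed by an application-level deadline at the call site, so no single dependency call can consume the whole request budget. A minimal sketch (plain Java, no Spring; slowCall is a hypothetical stand-in for a dependency call) using CompletableFuture.orTimeout:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineExample {
    // Hypothetical slow dependency call (stands in for an HTTP client call).
    static String slowCall() {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "ok";
    }

    // Enforce an application-level deadline on top of the client's own timeouts.
    static String callWithDeadline(long millis) {
        return CompletableFuture.supplyAsync(DeadlineExample::slowCall)
                .orTimeout(millis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> (ex instanceof TimeoutException
                        || ex.getCause() instanceof TimeoutException)
                        ? "fallback" : "error")
                .join();
    }

    public static void main(String[] args) {
        System.out.println(callWithDeadline(200)); // deadline fires -> "fallback"
    }
}
```

The deadline does not cancel the underlying work; it only bounds how long the caller waits, which is why the client-level timeouts above are still required.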
2. Tune Database Pooling
HikariCP is fast, but it cannot guess your production topology. Set pool size based on database capacity and pod count:
spring:
  datasource:
    hikari:
      maximum-pool-size: 15
      minimum-idle: 5
      connection-timeout: 1000
      max-lifetime: 1800000
      leak-detection-threshold: 30000
Alert on:
- hikaricp.connections.pending
- hikaricp.connections.timeout
- hikaricp.connections.acquire
- hikaricp.connections.usage
If pending rises, do not blindly increase pool size. Check slow queries and transaction scope first.
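A quick sanity check before tuning: total connections across all pods must fit within the database's limit, with headroom for admin sessions and migrations. A hypothetical back-of-envelope calculation (all numbers are illustrative):

```java
public class PoolSizing {
    // Hypothetical sizing rule: share the database's connection budget across pods.
    static int perPodPoolSize(int dbMaxConnections, int reservedForAdmin, int podCount) {
        return (dbMaxConnections - reservedForAdmin) / podCount;
    }

    public static void main(String[] args) {
        // e.g. max_connections=200, 20 reserved for admin/migrations, 12 pods
        System.out.println(perPodPoolSize(200, 20, 12)); // prints 15
    }
}
```

Remember that autoscaling changes podCount: a pool size that is safe at 12 replicas can exhaust the database at 30.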
3. Keep Transactions Short
Do not wrap HTTP calls inside database transactions:
@Transactional
public void badCheckout(Order order) {
    orderRepository.save(order);
    paymentClient.charge(order); // holds DB transaction while waiting
}
Prefer:
public void checkout(Order order) {
    PaymentResult payment = paymentClient.charge(order);
    persistOrder(order, payment);
}

@Transactional
public void persistOrder(Order order, PaymentResult payment) {
    orderRepository.save(order.withPayment(payment));
}
Note that persistOrder must be invoked through the Spring proxy for @Transactional to take effect; a direct self-invocation from the same class bypasses it, so put persistOrder in a separate bean.
Transactions should protect data consistency, not the whole workflow.
4. Expose Useful Health Checks
Enable Actuator:
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      probes:
        enabled: true
Use separate liveness and readiness probes:
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
Liveness means "restart me if I am dead." Readiness means "do not send me traffic right now." Do not make liveness depend on the database, or a database outage can cause every pod to restart repeatedly.
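If a readiness check does probe a dependency, give the probe its own tight budget so a slow dependency cannot stall the health endpoint itself. A minimal plain-Java sketch (no Spring; the 250 ms budget is an assumed value) of a TCP reachability check suitable for readiness, not liveness:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReadinessCheck {
    // Hypothetical dependency probe: TCP connect with a hard 250 ms budget.
    static boolean dependencyUp(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 250); // bounded wait
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or over budget -> not ready
        }
    }

    public static void main(String[] args) {
        System.out.println(dependencyUp("localhost", 1)); // nothing listening -> false
    }
}
```

In Spring this logic would live in a HealthIndicator contributing to the readiness group only, so the result never feeds the liveness probe.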
5. Graceful Shutdown
When Kubernetes terminates a pod, the service needs time to stop accepting traffic and finish in-flight requests.
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
Kubernetes:
terminationGracePeriodSeconds: 45
This prevents connection resets during deployments and node drains.
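Spring's graceful shutdown sits on top of the JVM's ordinary shutdown sequence. A bare sketch of the underlying mechanism, a shutdown hook that runs when the JVM receives SIGTERM or exits normally:

```java
public class GracefulShutdownSketch {
    // What the hook does is the interesting part in a real service:
    // stop accepting new work, then drain in-flight requests.
    static String drain() {
        return "draining in-flight requests before exit";
    }

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println(drain())));
        System.out.println("serving traffic");
    }
}
```

Note why terminationGracePeriodSeconds (45s) exceeds timeout-per-shutdown-phase (30s): Kubernetes must allow enough time for the whole shutdown sequence, or it sends SIGKILL before draining finishes.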
6. Structured Logging
Logs should answer operational questions quickly. Include request ID, user/tenant where safe, route, status, duration, and error type.
{
  "event": "http_request",
  "trace_id": "abc123",
  "route": "/orders",
  "status": 201,
  "duration_ms": 84,
  "tenant_id": "t_42"
}
Never log secrets, tokens, full card numbers, or raw PII. Add masking at the logging boundary.
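Masking is cheapest to enforce in one place rather than at every call site. A toy sketch of a hypothetical masker applied at the logging boundary (real services would wire this into the log encoder):

```java
public class LogMasking {
    // Hypothetical masker: keep only the last 4 digits of a card number.
    static String maskCard(String pan) {
        String digits = pan.replaceAll("\\D", ""); // strip separators
        return "**** **** **** " + digits.substring(digits.length() - 4);
    }

    public static void main(String[] args) {
        System.out.println(maskCard("4111-1111-1111-1111")); // **** **** **** 1111
    }
}
```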
7. Metrics That Matter
Expose Prometheus metrics with Micrometer:
management:
  metrics:
    tags:
      application: checkout-api
Alert on symptoms:
- request p95/p99 latency
- error rate by route
- dependency latency
- Hikari pending connections
- JVM GC pauses
- executor queue size
- Kafka consumer lag if applicable
Avoid alerting only on CPU. CPU can be high while the service is healthy, and low while every request is stuck waiting on a dependency.
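Percentile alerts matter because averages hide tail latency. A toy nearest-rank percentile calculation (the method choice is an assumption; in production Micrometer computes these for you) showing how one stuck request dominates p95 while barely moving the mean:

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile over a sample of latencies.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Nine healthy requests and one stuck on a dependency.
        long[] latenciesMs = {40, 42, 45, 48, 50, 52, 55, 60, 80, 900};
        System.out.println("p50=" + percentile(latenciesMs, 50)
                + " p95=" + percentile(latenciesMs, 95)); // p50=50 p95=900
    }
}
```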
8. Resilience Defaults
Use circuit breakers for slow dependencies:
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 50
        failureRateThreshold: 50
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 50
Retries should be limited and jittered:
resilience4j:
  retry:
    instances:
      paymentService:
        maxAttempts: 2
        waitDuration: 100ms
Do not retry non-idempotent operations unless the downstream API supports idempotency keys.
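Jitter spreads retries out so clients recovering from the same outage do not hammer the dependency in lockstep. A sketch of full-jitter exponential backoff (the exact formula is an assumption, one common variant among several):

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {
    // Full jitter: wait a uniform random duration in [0, min(cap, base * 2^attempt)].
    static long backoffMillis(int attempt, long baseMs, long capMs) {
        long exponential = Math.min(capMs, baseMs * (1L << attempt));
        return ThreadLocalRandom.current().nextLong(exponential + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 3; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + backoffMillis(attempt, 100, 2_000) + "ms");
        }
    }
}
```

The cap matters: without it, a long outage produces retry waits that outlive the caller's own timeout.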
9. Deployment Safety
Production deployments should have:
- readiness checks
- rolling updates
- rollback path
- feature flags for risky behavior
- database migrations compatible with old and new code
- canary metrics for error rate and latency
For database changes, follow expand-contract:
- Add nullable column
- Deploy code that writes both old and new
- Backfill
- Deploy code that reads new
- Remove old column later
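The dual-write step (write both old and new) can be sketched concretely; this toy uses a Map standing in for a database row, with hypothetical column names:

```java
import java.util.HashMap;
import java.util.Map;

public class DualWrite {
    // Expand phase: write both the legacy column and the new columns,
    // so pods running old code and new code can each read what they need.
    static Map<String, String> writeBoth(String fullName) {
        Map<String, String> row = new HashMap<>();
        row.put("customer_name", fullName);              // legacy column
        String[] parts = fullName.split(" ", 2);
        row.put("first_name", parts[0]);                 // new columns
        row.put("last_name", parts.length > 1 ? parts[1] : "");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(writeBoth("Ada Lovelace"));
    }
}
```

Only after every pod reads the new columns is it safe to stop writing the legacy one, and only later to drop it.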
Final Checklist
- Timeouts on every outbound call
- HikariCP sized by database capacity
- Short transactions
- Separate liveness and readiness probes
- Graceful shutdown enabled
- Structured logs with trace IDs
- Prometheus metrics exposed
- Circuit breakers and bounded retries
- Safe deployment and rollback strategy
- Alerts tied to user impact
Spring Boot gives you strong defaults for development. Production readiness comes from making every important failure mode explicit.
Engineering Standard: The "Staff" Perspective
In high-throughput distributed systems, the code we write is often the easiest part. The difficulty lies in how that code interacts with other components in the stack.
1. Data Integrity and The "P" in CAP
Whenever you are dealing with state (Databases, Caches, or In-memory stores), you must account for Network Partitions. In a standard Java microservice, we often choose Availability (AP) by using Eventual Consistency patterns. However, for financial ledgers, we must enforce Strong Consistency (CP), which usually involves distributed locks (Redis Redlock or Zookeeper) or a strictly linearizable sequence.
2. The Observability Pillar
Writing logic without observability is like flying a plane without a dashboard. Every production service must implement:
- Tracing (OpenTelemetry): Track a single request across 50 microservices.
- Metrics (Prometheus): Monitor Heap usage, Thread saturation, and P99 latencies.
- Structured Logging (ELK/Splunk): Never log raw strings; use JSON so you can query logs like a database.
3. Production Incident Prevention
To survive a 3:00 AM incident, we use:
- Circuit Breakers: Stop the bleeding if a downstream service is down.
- Bulkheads: Isolate thread pools so one failing endpoint doesn't crash the entire app.
- Retries with Exponential Backoff: Avoid the "Thundering Herd" problem when a service comes back online.
Critical Interview Nuance
When an interviewer asks you about this topic, don't just explain the code. Explain the Trade-offs. A Staff Engineer is someone who knows that every architectural decision is a choice between two "bad" outcomes. You are picking the one that aligns with the business goal.
Performance Checklist for High-Load Systems:
- Minimize Object Creation: Use primitive arrays and reusable buffers.
- Batching: Group 1,000 small writes into 1 large batch to save I/O cycles.
- Async Processing: If the user doesn't need the result immediately, move it to a Message Queue (Kafka/SQS).
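The batching point can be made concrete: N small writes at batch size B cost roughly N/B I/O round-trips. A toy sketch of the accounting:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Count how many flushes (I/O round-trips) N writes need at a given batch size.
    static int flushCount(int writes, int batchSize) {
        List<Integer> buffer = new ArrayList<>();
        int flushes = 0;
        for (int i = 0; i < writes; i++) {
            buffer.add(i);
            if (buffer.size() == batchSize) {
                flushes++;       // one I/O for the whole batch
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) flushes++; // flush the partial tail batch
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println(flushCount(1_000, 100)); // 10 round-trips, not 1000
    }
}
```

The trade-off, as always: batching adds latency to the first write in each batch in exchange for throughput.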
Read Next
- Spring Boot Performance Tuning: From 200 to 2000 RPS
- Thread Pool Exhaustion in Spring Boot: Diagnosis, Prevention, and Recovery
- Zero-Downtime Database Migrations: Patterns for Production
Verbal Interview Script
Interviewer: "How does the JVM handle memory allocation for this implementation, and what are the GC implications?"
Candidate: "In this implementation, the short-lived objects are allocated in the Eden space of the Young Generation. Because they have a very short lifecycle, they will be quickly collected during a Minor GC, which is highly efficient. However, if we were to maintain strong references to these objects, for instance in a static Map or a long-lived cache, they would survive multiple GC cycles and get promoted to the Old Generation. This would eventually trigger a Major GC (or Full GC), causing a 'Stop-the-World' pause that increases our P99 latency. To mitigate this in a high-throughput environment, I would consider using ZGC or Shenandoah for predictable sub-millisecond pause times, or optimize the data structures to reduce object churn."