Spring Boot makes it easy to start a service. Production makes it clear whether the service is actually ready.
A production-ready service is not just one that passes unit tests. It has bounded timeouts, sane thread pools, useful health checks, structured logs, metrics, graceful shutdown, safe configuration, and predictable behavior when dependencies fail.
This checklist focuses on the things that prevent real outages.
1. Set Timeouts Everywhere
The default timeout is often too high, missing, or hidden in a library. Every outbound call should have a connect timeout and a read/response timeout.
For WebClient:
HttpClient httpClient = HttpClient.create()
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1000)
        .responseTimeout(Duration.ofSeconds(2));

WebClient client = WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(httpClient))
        .baseUrl("https://payment-service")
        .build();
For RestTemplate:
SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
factory.setConnectTimeout(1000);
factory.setReadTimeout(2000);
RestTemplate restTemplate = new RestTemplate(factory);
Timeouts should be lower than the upstream caller's timeout. If your load balancer times out at 30 seconds, your service should fail dependency calls much earlier and return a controlled error.
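Client-level timeouts can be backed by an application-level deadline at the call site, so no single dependency call can consume the whole request budget. A minimal sketch (plain Java, no Spring; slowCall is a hypothetical stand-in for a dependency call) using CompletableFuture.orTimeout:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeadlineExample {
    // Hypothetical slow dependency call (stands in for an HTTP client call).
    static String slowCall() {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "ok";
    }

    // Enforce an application-level deadline on top of the client's own timeouts.
    static String callWithDeadline(long millis) {
        return CompletableFuture.supplyAsync(DeadlineExample::slowCall)
                .orTimeout(millis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> (ex instanceof TimeoutException
                        || ex.getCause() instanceof TimeoutException)
                        ? "fallback" : "error")
                .join();
    }

    public static void main(String[] args) {
        System.out.println(callWithDeadline(200)); // deadline fires -> "fallback"
    }
}
```

The deadline does not cancel the underlying work; it only bounds how long the caller waits, which is why the client-level timeouts above are still required.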
2. Tune Database Pooling
HikariCP is fast, but it cannot guess your production topology. Set pool size based on database capacity and pod count:
spring:
  datasource:
    hikari:
      maximum-pool-size: 15
      minimum-idle: 5
      connection-timeout: 1000
      max-lifetime: 1800000
      leak-detection-threshold: 30000
Alert on:
- hikaricp.connections.pending
- hikaricp.connections.timeout
- hikaricp.connections.acquire
- hikaricp.connections.usage
If pending rises, do not blindly increase pool size. Check slow queries and transaction scope first.
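A quick sanity check before tuning: total connections across all pods must fit within the database's limit, with headroom for admin sessions and migrations. A hypothetical back-of-envelope calculation (all numbers are illustrative):

```java
public class PoolSizing {
    // Hypothetical sizing rule: share the database's connection budget across pods.
    static int perPodPoolSize(int dbMaxConnections, int reservedForAdmin, int podCount) {
        return (dbMaxConnections - reservedForAdmin) / podCount;
    }

    public static void main(String[] args) {
        // e.g. max_connections=200, 20 reserved for admin/migrations, 12 pods
        System.out.println(perPodPoolSize(200, 20, 12)); // prints 15
    }
}
```

Remember that autoscaling changes podCount: a pool size that is safe at 12 replicas can exhaust the database at 30.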
3. Keep Transactions Short
Do not wrap HTTP calls inside database transactions:
@Transactional
public void badCheckout(Order order) {
    orderRepository.save(order);
    paymentClient.charge(order); // holds DB transaction while waiting
}
Prefer:
public void checkout(Order order) {
    PaymentResult payment = paymentClient.charge(order);
    persistOrder(order, payment);
}

@Transactional
public void persistOrder(Order order, PaymentResult payment) {
    orderRepository.save(order.withPayment(payment));
}
Note that persistOrder must be invoked through the Spring proxy for @Transactional to take effect; a direct self-invocation from the same class bypasses it, so put persistOrder in a separate bean.
Transactions should protect data consistency, not the whole workflow.
4. Expose Useful Health Checks
Enable Actuator:
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      probes:
        enabled: true
Use separate liveness and readiness probes:
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
Liveness means "restart me if I am dead." Readiness means "do not send me traffic right now." Do not make liveness depend on the database, or a database outage can cause every pod to restart repeatedly.
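If a readiness check does probe a dependency, give the probe its own tight budget so a slow dependency cannot stall the health endpoint itself. A minimal plain-Java sketch (no Spring; the 250 ms budget is an assumed value) of a TCP reachability check suitable for readiness, not liveness:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReadinessCheck {
    // Hypothetical dependency probe: TCP connect with a hard 250 ms budget.
    static boolean dependencyUp(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 250); // bounded wait
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or over budget -> not ready
        }
    }

    public static void main(String[] args) {
        System.out.println(dependencyUp("localhost", 1)); // nothing listening -> false
    }
}
```

In Spring this logic would live in a HealthIndicator contributing to the readiness group only, so the result never feeds the liveness probe.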
5. Graceful Shutdown
When Kubernetes terminates a pod, the service needs time to stop accepting traffic and finish in-flight requests.
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
Kubernetes:
terminationGracePeriodSeconds: 45
This prevents connection resets during deployments and node drains.
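Spring's graceful shutdown sits on top of the JVM's ordinary shutdown sequence. A bare sketch of the underlying mechanism, a shutdown hook that runs when the JVM receives SIGTERM or exits normally:

```java
public class GracefulShutdownSketch {
    // What the hook does is the interesting part in a real service:
    // stop accepting new work, then drain in-flight requests.
    static String drain() {
        return "draining in-flight requests before exit";
    }

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println(drain())));
        System.out.println("serving traffic");
    }
}
```

Note why terminationGracePeriodSeconds (45s) exceeds timeout-per-shutdown-phase (30s): Kubernetes must allow enough time for the whole shutdown sequence, or it sends SIGKILL before draining finishes.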
6. Structured Logging
Logs should answer operational questions quickly. Include request ID, user/tenant where safe, route, status, duration, and error type.
{
  "event": "http_request",
  "trace_id": "abc123",
  "route": "/orders",
  "status": 201,
  "duration_ms": 84,
  "tenant_id": "t_42"
}
Never log secrets, tokens, full card numbers, or raw PII. Add masking at the logging boundary.
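Masking is cheapest to enforce in one place rather than at every call site. A toy sketch of a hypothetical masker applied at the logging boundary (real services would wire this into the log encoder):

```java
public class LogMasking {
    // Hypothetical masker: keep only the last 4 digits of a card number.
    static String maskCard(String pan) {
        String digits = pan.replaceAll("\\D", ""); // strip separators
        return "**** **** **** " + digits.substring(digits.length() - 4);
    }

    public static void main(String[] args) {
        System.out.println(maskCard("4111-1111-1111-1111")); // **** **** **** 1111
    }
}
```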
7. Metrics That Matter
Expose Prometheus metrics with Micrometer:
management:
  metrics:
    tags:
      application: checkout-api
Alert on symptoms:
- request p95/p99 latency
- error rate by route
- dependency latency
- Hikari pending connections
- JVM GC pauses
- executor queue size
- Kafka consumer lag if applicable
Avoid alerting only on CPU. CPU can be high while the service is healthy, and low while every request is stuck waiting on a dependency.
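Percentile alerts matter because averages hide tail latency. A toy nearest-rank percentile calculation (the method choice is an assumption; in production Micrometer computes these for you) showing how one stuck request dominates p95 while barely moving the mean:

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile over a sample of latencies.
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Nine healthy requests and one stuck on a dependency.
        long[] latenciesMs = {40, 42, 45, 48, 50, 52, 55, 60, 80, 900};
        System.out.println("p50=" + percentile(latenciesMs, 50)
                + " p95=" + percentile(latenciesMs, 95)); // p50=50 p95=900
    }
}
```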
8. Resilience Defaults
Use circuit breakers for slow dependencies:
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 50
        failureRateThreshold: 50
        slowCallDurationThreshold: 2s
        slowCallRateThreshold: 50
Retries should be limited and jittered:
resilience4j:
  retry:
    instances:
      paymentService:
        maxAttempts: 2
        waitDuration: 100ms
Do not retry non-idempotent operations unless the downstream API supports idempotency keys.
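Jitter spreads retries out so clients recovering from the same outage do not hammer the dependency in lockstep. A sketch of full-jitter exponential backoff (the exact formula is an assumption, one common variant among several):

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {
    // Full jitter: wait a uniform random duration in [0, min(cap, base * 2^attempt)].
    static long backoffMillis(int attempt, long baseMs, long capMs) {
        long exponential = Math.min(capMs, baseMs * (1L << attempt));
        return ThreadLocalRandom.current().nextLong(exponential + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 3; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + backoffMillis(attempt, 100, 2_000) + "ms");
        }
    }
}
```

The cap matters: without it, a long outage produces retry waits that outlive the caller's own timeout.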
9. Deployment Safety
Production deployments should have:
- readiness checks
- rolling updates
- rollback path
- feature flags for risky behavior
- database migrations compatible with old and new code
- canary metrics for error rate and latency
For database changes, follow expand-contract:
- Add nullable column
- Deploy code that writes both old and new
- Backfill
- Deploy code that reads new
- Remove old column later
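The dual-write step (write both old and new) can be sketched concretely; this toy uses a Map standing in for a database row, with hypothetical column names:

```java
import java.util.HashMap;
import java.util.Map;

public class DualWrite {
    // Expand phase: write both the legacy column and the new columns,
    // so pods running old code and new code can each read what they need.
    static Map<String, String> writeBoth(String fullName) {
        Map<String, String> row = new HashMap<>();
        row.put("customer_name", fullName);              // legacy column
        String[] parts = fullName.split(" ", 2);
        row.put("first_name", parts[0]);                 // new columns
        row.put("last_name", parts.length > 1 ? parts[1] : "");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(writeBoth("Ada Lovelace"));
    }
}
```

Only after every pod reads the new columns is it safe to stop writing the legacy one, and only later to drop it.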
Final Checklist
- Timeouts on every outbound call
- HikariCP sized by database capacity
- Short transactions
- Separate liveness and readiness probes
- Graceful shutdown enabled
- Structured logs with trace IDs
- Prometheus metrics exposed
- Circuit breakers and bounded retries
- Safe deployment and rollback strategy
- Alerts tied to user impact
Spring Boot gives you strong defaults for development. Production readiness comes from making every important failure mode explicit.
Engineering Standard: The "Staff" Perspective
In high-throughput distributed systems, the code we write is often the easiest part. The difficulty lies in how that code interacts with other components in the stack.
1. Data Integrity and The "P" in CAP
Whenever you are dealing with state (Databases, Caches, or In-memory stores), you must account for Network Partitions. In a standard Java microservice, we often choose Availability (AP) by using Eventual Consistency patterns. However, for financial ledgers, we must enforce Strong Consistency (CP), which usually involves distributed locks (Redis Redlock or Zookeeper) or a strictly linearizable sequence.
2. The Observability Pillar
Writing logic without observability is like flying a plane without a dashboard. Every production service must implement:
- Tracing (OpenTelemetry): Track a single request across 50 microservices.
- Metrics (Prometheus): Monitor Heap usage, Thread saturation, and P99 latencies.
- Structured Logging (ELK/Splunk): Never log raw strings; use JSON so you can query logs like a database.
3. Production Incident Prevention
To survive a 3:00 AM incident, we use:
- Circuit Breakers: Stop the bleeding if a downstream service is down.
- Bulkheads: Isolate thread pools so one failing endpoint doesn't crash the entire app.
- Retries with Exponential Backoff: Avoid the "Thundering Herd" problem when a service comes back online.
Critical Interview Nuance
When an interviewer asks you about this topic, don't just explain the code. Explain the Trade-offs. A Staff Engineer is someone who knows that every architectural decision is a choice between two "bad" outcomes. You are picking the one that aligns with the business goal.
Performance Checklist for High-Load Systems:
- Minimize Object Creation: Use primitive arrays and reusable buffers.
- Batching: Group 1,000 small writes into 1 large batch to save I/O cycles.
- Async Processing: If the user doesn't need the result immediately, move it to a Message Queue (Kafka/SQS).
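The batching point can be made concrete: N small writes at batch size B cost roughly N/B I/O round-trips. A toy sketch of the accounting:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Count how many flushes (I/O round-trips) N writes need at a given batch size.
    static int flushCount(int writes, int batchSize) {
        List<Integer> buffer = new ArrayList<>();
        int flushes = 0;
        for (int i = 0; i < writes; i++) {
            buffer.add(i);
            if (buffer.size() == batchSize) {
                flushes++;       // one I/O for the whole batch
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) flushes++; // flush the partial tail batch
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println(flushCount(1_000, 100)); // 10 round-trips, not 1000
    }
}
```

The trade-off, as always: batching adds latency to the first write in each batch in exchange for throughput.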
Read Next
- Spring Boot Performance Tuning: From 200 to 2000 RPS
- Thread Pool Exhaustion in Spring Boot: Diagnosis, Prevention, and Recovery
- Zero-Downtime Database Migrations: Patterns for Production
Verbal Interview Script
Interviewer: "How does the JVM handle memory allocation for this implementation, and what are the GC implications?"
Candidate: "In this implementation, the short-lived objects are allocated in the Eden space of the Young Generation. Because they have a very short lifecycle, they will be quickly collected during a Minor GC, which is highly efficient. However, if we were to maintain strong references to these objects, for instance in a static Map or a long-lived cache, they would survive multiple GC cycles and get promoted to the Old Generation. This would eventually trigger a Major GC (or Full GC), causing a 'Stop-the-World' pause that increases our P99 latency. To mitigate this in a high-throughput environment, I would consider using ZGC or Shenandoah for predictable sub-millisecond pause times, or optimize the data structures to reduce object churn."