Kubernetes Tip: Readiness vs Liveness vs Startup Probes
Avoid crash loops and bad rollouts by using the right probe for the right job.
Probes control whether your Pod is considered ready, whether it gets restarted, and how smooth your rollouts feel.
The mental model
- Readiness: “Should this Pod receive traffic?”
- Liveness: “Is this container stuck and should be restarted?”
- Startup: “Give me extra time to boot before liveness kicks in.”
Typical mistake: using liveness to gate traffic
If you use liveness as a “traffic gate”, you’ll restart Pods that are simply not ready yet.
Use readiness to remove the Pod from Service endpoints instead.
A good default pattern
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 2
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
What to check in each endpoint
/readyz
Readiness should fail when the app cannot serve correct traffic right now, e.g.:
- dependency not connected (DB, cache, upstream)
- migrations still running
- warmup not finished
/livez
Liveness should fail only if the process is broken and a restart helps:
- deadlock / event loop stuck
- internal invariants violated
If your liveness depends on the database being reachable, you may turn a transient outage into a restart storm.
Debugging probe failures
kubectl describe pod <pod>
kubectl logs <pod> -c <container> --previous
kubectl get events -n <ns> --sort-by=.lastTimestamp
A rollout-friendly view (Mermaid)
flowchart LR
  A[Pod starts] --> B{startupProbe ok?}
  B -- no --> A
  B -- yes --> C{readiness ok?}
  C -- no --> D[No traffic]
  C -- yes --> E[Receive traffic]
  E --> F{liveness ok?}
  F -- no --> A
  F -- yes --> E
Checklist
- Use startupProbe for slow boots
- Readiness controls traffic, not restarts
- Liveness is conservative (restart only helps)
- Probes have sane timeouts and thresholds
Probe types: HTTP, TCP, and exec (choose intentionally)
Kubernetes supports multiple probe mechanisms:
- httpGet: best for web services (and easiest to observe)
- tcpSocket: useful for “port open” checks (less semantic)
- exec: powerful, but can be expensive and harder to reason about
When to prefer HTTP probes
HTTP probes are great when you can expose small endpoints:
- /livez: “process is healthy”
- /readyz: “safe to receive traffic”
Keep them fast. If a readiness check sometimes takes 5–10 seconds, it’s telling you something is wrong (or that you’re using readiness for a deep dependency check that should be done differently).
When TCP probes make sense
TCP probes only check that a port accepts a connection. This can be okay for:
- simple proxies
- legacy systems with no HTTP endpoint
But be careful: “port open” does not mean “request succeeds”.
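As a minimal sketch, a tcpSocket readiness probe for a hypothetical legacy service listening on port 5432 (the port and timings are illustrative, not from this article’s example app):
readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3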
When exec probes are appropriate
Exec probes can check internal state (files, sockets, process status), but:
- they run inside the container
- they can be slow
- they can add load at scale (every Pod, every few seconds)
If you use exec probes, keep the command lightweight and fast.
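As a sketch, a lightweight exec probe can simply test for a marker file that the app keeps up to date (the path /tmp/healthy is purely illustrative):
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy   # app touches/removes this file; keep the command cheap
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3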
StartupProbe vs initialDelaySeconds (why startupProbe is better)
Older manifests sometimes rely on initialDelaySeconds to give the app time to boot. The downsides are:
- the delay is static
- it applies even when the app is healthy and fast
startupProbe is more precise:
- it disables liveness (and readiness) checks until startup succeeds
- it adapts to slower boot times without making everything wait unnecessarily
Rule of thumb:
- Use startupProbe for slow boots (JVM warmup, large migrations, cold caches).
- Keep initialDelaySeconds small or zero once you have a good startupProbe.
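For contrast, a sketch of the older pattern (the delay value is illustrative): the whole delay is paid on every start, even when the app happens to boot quickly.
# older pattern: static delay before the first liveness check
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 45   # paid in full even when boot takes 5 seconds
  periodSeconds: 10
With the startupProbe from the default pattern above, liveness checks begin as soon as startup succeeds, so a fast boot is not penalized.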
Readiness and rollouts: what actually happens
During a Deployment rollout:
- new Pods start
- readiness controls when they become endpoints behind a Service
- old Pods are terminated based on rollout strategy
If readiness is too “optimistic”:
- traffic hits Pods that are still loading configs, warming caches, or mid-migration
- you get errors during deploys even though “rollout succeeded”
If readiness is too “strict” (e.g., fails when DB is down for 1 second):
- deploys can stall
- you can create cascading failures where nothing becomes ready
The sweet spot is:
- readiness reflects ability to serve requests correctly
- it tolerates short transient dependency jitter
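To make the rollout/readiness interaction concrete, here is a minimal Deployment sketch (name, image, and numbers are illustrative assumptions, not recommendations): with maxUnavailable: 0, old Pods are only removed once new Pods pass readiness, and minReadySeconds adds a small buffer before a new Pod counts as available.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical
spec:
  replicas: 3
  minReadySeconds: 5         # Pod must stay ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0      # never drop below the desired ready count during a rollout
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example.com/web:1.2.3   # hypothetical
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3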
Termination: pair readiness with graceful shutdown
Even perfect probes won’t save you if shutdown is abrupt.
Consider:
- terminationGracePeriodSeconds (enough time to drain)
- a preStop hook (stop accepting traffic, flush queues, etc.)
- readiness behavior during shutdown (return “not ready” quickly)
Example pattern:
- On SIGTERM, your app stops accepting new requests.
- Readiness flips to fail quickly.
- Kubernetes removes the Pod from endpoints.
- In-flight requests drain within grace period.
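A minimal sketch of the shutdown-related fields (the grace period, sleep duration, name, and image are assumptions to tune for your app; the preStop command assumes a shell in the image):
spec:
  terminationGracePeriodSeconds: 45    # enough time to drain in-flight requests
  containers:
    - name: web                        # hypothetical
      image: example.com/web:1.2.3     # hypothetical
      lifecycle:
        preStop:
          exec:
            # brief pause so endpoint removal propagates before SIGTERM reaches the app
            command: ["sh", "-c", "sleep 5"]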
Timeouts and thresholds: tuning that avoids flapping
Common pitfalls:
- timeoutSeconds too low for real p99 latency (causes false negatives)
- periodSeconds too aggressive (excess load and flapping)
- failureThreshold too small (one blip triggers restart or endpoint removal)
Practical defaults:
- readiness: periodSeconds: 5, timeoutSeconds: 1-2, failureThreshold: 3
- liveness: periodSeconds: 10, timeoutSeconds: 1-2, failureThreshold: 3
- startup: tune to the max expected boot time (failureThreshold * periodSeconds)
For example, if the app may take up to 60s to boot:
- startupProbe.periodSeconds: 2
- startupProbe.failureThreshold: 30
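Putting that arithmetic into a manifest (a sketch, reusing the path and port from the earlier examples):
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 2
  failureThreshold: 30   # 30 attempts * 2s = up to 60s allowed for boot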
Don’t couple liveness to external dependencies
This is worth repeating: if your liveness check fails when the DB is down, you can turn a DB incident into a full restart storm.
Better:
- readiness fails when dependencies are required to serve traffic
- liveness remains true unless the process is stuck/broken
Observability: make probe endpoints visible
Probes are internal checks, but you should still log and measure them:
- count readiness failures and reasons
- expose metrics for “dependency health” separately from “process health”
If you build /readyz as a composable check (cache, DB, queue), return a simple JSON body with reason codes (even if the probe itself only checks the status code).
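For instance, a hypothetical /readyz response body might look like this (the probe still only cares about the HTTP status; the field names and reason codes are illustrative):
{
  "status": "not_ready",
  "checks": {
    "db": "ok",
    "cache": "ok",
    "queue": "connect_timeout"
  }
}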
Debugging playbook (when pods are not ready)
- Check endpoints:
kubectl get endpoints -n <ns> <svc> -o wide
- Check readiness/liveness failures:
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns> --sort-by=.lastTimestamp
- Verify the endpoint inside cluster:
kubectl exec -n <ns> -it <pod> -- sh
curl -sS -i http://127.0.0.1:8080/readyz
If your image is distroless, use an ephemeral container (kubectl debug) to run curl inside the Pod network namespace.
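For example (the image, namespace, and port are assumptions; any small image with a shell and curl works):
kubectl debug -it -n <ns> <pod> --image=curlimages/curl --target=<container> -- sh
curl -sS -i http://127.0.0.1:8080/readyz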
Final advice
Treat probes as part of your deployment contract:
- probes define “healthy enough to receive traffic”
- they shape rollout speed and stability
- they can prevent outages when tuned with grace and realism
References
- Configure liveness, readiness, and startup probes
- Pod lifecycle and container probes
- Debugging a running Pod
FAQ
Q: Should readiness and liveness use the same endpoint?
A: Usually no. Readiness should check dependencies and request handling; liveness should be minimal and fast.
Q: What if startup is slow?
A: Prefer startupProbe or increase failure thresholds instead of a long initialDelaySeconds.
Q: Do readiness failures restart containers?
A: No. Only liveness failures trigger restarts.