Kubernetes Tip: Readiness vs Liveness vs Startup Probes

Avoid crash loops and bad rollouts by using the right probe for the right job.

Probes control whether your Pod is considered ready, whether it gets restarted, and how smooth your rollouts feel.

The mental model

  • Readiness: “Should this Pod receive traffic?”
  • Liveness: “Is this container stuck and should be restarted?”
  • Startup: “Give me extra time to boot before liveness kicks in.”

Typical mistake: using liveness to gate traffic

If you use liveness as a “traffic gate”, you’ll restart Pods that are simply not ready yet.

Use readiness to remove the Pod from Service endpoints instead.
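
A minimal sketch of the anti-pattern, assuming a /readyz endpoint that also checks downstream dependencies: wiring it into livenessProbe turns every dependency blip into a container restart instead of a temporary removal from Service endpoints.

livenessProbe:          # anti-pattern: dependency-aware check wired to liveness
  httpGet:
    path: /readyz       # this endpoint belongs under readinessProbe instead
    port: 8080
  periodSeconds: 10
  failureThreshold: 3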

A good default pattern

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 * 2s = up to 60s for the app to finish booting
  periodSeconds: 2

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3    # ~15s of failures before the Pod leaves Service endpoints

livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3    # ~30s of failures before the container is restarted

What to check in each endpoint

/readyz

Readiness should fail when the app cannot serve correct traffic right now, e.g.:

  • dependency not connected (DB, cache, upstream)
  • migrations still running
  • warmup not finished

/livez

Liveness should fail only if the process is broken and a restart helps:

  • deadlock / event loop stuck
  • internal invariants violated

If your liveness depends on the database being reachable, you may turn a transient outage into a restart storm.

Debugging probe failures

kubectl describe pod <pod>
kubectl logs <pod> -c <container> --previous
kubectl get events -n <ns> --sort-by=.lastTimestamp
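
If you suspect a liveness-driven restart loop, the restart count confirms it quickly (the jsonpath below reads standard Pod status fields):

kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].restartCount}'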

A rollout-friendly view (Mermaid)

flowchart LR
  A[Pod starts] --> B{startupProbe ok?}
  B -- no --> A
  B -- yes --> C{readiness ok?}
  C -- no --> D[No traffic]
  C -- yes --> E[Receive traffic]
  E --> F{liveness ok?}
  F -- no --> A
  F -- yes --> E

Checklist

  • Use startupProbe for slow boots
  • Readiness controls traffic, not restarts
  • Liveness is conservative (it fails only when a restart would actually help)
  • Probes have sane timeouts and thresholds

Probe types: HTTP, TCP, and exec (choose intentionally)

Kubernetes supports multiple probe mechanisms:

  • httpGet: best for web services (and easiest to observe)
  • tcpSocket: useful for “port open” checks (less semantic)
  • exec: powerful, but can be expensive and harder to reason about

When to prefer HTTP probes

HTTP probes are great when you can expose small endpoints:

  • /livez: “process is healthy”
  • /readyz: “safe to receive traffic”

Keep them fast. If a readiness check sometimes takes 5–10 seconds, it’s telling you something is wrong (or that you’re using readiness for a deep dependency check that should be done differently).

When TCP probes make sense

TCP probes only check that a port accepts a connection. This can be okay for:

  • simple proxies
  • legacy systems with no HTTP endpoint

But be careful: “port open” does not mean “request succeeds”.
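
A minimal sketch; the port is whatever your process actually listens on:

livenessProbe:
  tcpSocket:
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3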

When exec probes are appropriate

Exec probes can check internal state (files, sockets, process status), but:

  • they run inside the container
  • they can be slow
  • they can add load at scale (every Pod, every few seconds)

If you use exec probes, keep the command lightweight and fast.
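
A minimal sketch, assuming a hypothetical /tmp/healthy marker file that the app creates when it starts and deletes if it detects it is broken:

livenessProbe:
  exec:
    command: ["sh", "-c", "test -f /tmp/healthy"]   # hypothetical marker file
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3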

StartupProbe vs initialDelaySeconds (why startupProbe is better)

Older manifests sometimes rely on initialDelaySeconds to give the app time to boot. The downsides:

  • the delay is static
  • it applies even when the app is healthy and fast

startupProbe is more precise:

  • it disables liveness (and readiness) checks until the startup probe succeeds
  • it adapts to slower boot times without making everything wait unnecessarily

Rule of thumb:

  • Use startupProbe for slow boots (JVM warmup, large migrations, cold caches).
  • Keep initialDelaySeconds small or zero once you have a good startupProbe.
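
A minimal before/after sketch (endpoints and numbers mirror the defaults above; the 60s delay is just a worst-case boot time):

# before: a static delay that every container pays, even on a fast boot
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10

# after: liveness is held off only until the startup probe succeeds
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 2
  failureThreshold: 30   # up to ~60s, but stops as soon as boot finishes
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10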

Readiness and rollouts: what actually happens

During a Deployment rollout:

  • new Pods start
  • readiness controls when they become endpoints behind a Service
  • old Pods are terminated according to the rollout strategy (sketch below)
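
For context, these are the Deployment strategy fields that interact with readiness during a rolling update (the values here are illustrative, not a recommendation):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # allow one extra Pod during the rollout
    maxUnavailable: 0    # old Pods go away only as new ones become Ready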

If readiness is too “optimistic”:

  • traffic hits Pods that are still loading config, warming caches, or running migrations
  • you get errors during deploys even though “rollout succeeded”

If readiness is too “strict” (e.g., fails when DB is down for 1 second):

  • deploys can stall
  • you can create cascading failures where nothing becomes ready

The sweet spot is:

  • readiness reflects ability to serve requests correctly
  • it tolerates short transient dependency jitter

Termination: pair readiness with graceful shutdown

Even perfect probes won’t save you if shutdown is abrupt.

Consider:

  • terminationGracePeriodSeconds (enough time to drain)
  • a preStop hook (stop accepting traffic, flush queues, etc.)
  • readiness behavior during shutdown (return “not ready” quickly)

Example pattern:

  1. On SIGTERM, your app stops accepting new requests.
  2. Readiness flips to fail quickly.
  3. Kubernetes removes the Pod from endpoints.
  4. In-flight requests drain within grace period.
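
A minimal sketch of the Pod spec side, assuming a short preStop sleep is enough for endpoint removal to propagate before the process receives SIGTERM (the container name app is a placeholder):

spec:
  terminationGracePeriodSeconds: 30      # time budget for draining in-flight requests
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]   # small buffer before SIGTERM is sent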

Timeouts and thresholds: tuning that avoids flapping

Common pitfalls:

  • timeoutSeconds too low for real p99 latency (causes false negatives)
  • periodSeconds too aggressive (excess load and flapping)
  • failureThreshold too small (one blip triggers restart or endpoint removal)

Practical defaults:

  • readiness: periodSeconds: 5, timeoutSeconds: 1-2, failureThreshold: 3
  • liveness: periodSeconds: 10, timeoutSeconds: 1-2, failureThreshold: 3
  • startup: tune to max expected boot time (failureThreshold * periodSeconds)

For example, if the app may take up to 60s to boot:

  • startupProbe.periodSeconds: 2
  • startupProbe.failureThreshold: 30 (30 * 2s = a 60s startup budget)

Don’t couple liveness to external dependencies

This is worth repeating: if your liveness fails when DB is down, you can turn a DB incident into a full restart storm.

Better:

  • readiness fails when dependencies are required to serve traffic
  • liveness remains true unless the process is stuck/broken

Observability: make probe endpoints visible

Probes are internal checks, but you should still log and measure them:

  • count readiness failures and reasons
  • expose metrics for “dependency health” separately from “process health”

If you build /readyz as a composite check (cache, DB, queue), return a small JSON body with reason codes, even though the probe itself only looks at the status code.

Debugging playbook (when pods are not ready)

  1. Check endpoints:
kubectl get endpoints -n <ns> <svc> -o wide
  2. Check readiness/liveness failures:
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns> --sort-by=.lastTimestamp
  3. Verify the endpoint from inside the cluster:
kubectl exec -n <ns> -it <pod> -- sh
curl -sS -i http://127.0.0.1:8080/readyz

If your image is distroless, use an ephemeral container (kubectl debug) to run curl inside the Pod network namespace.
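
One way to do that (the image name is just an example of a small image that ships curl):

kubectl debug -n <ns> -it <pod> --image=curlimages/curl -- sh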

Final advice

Treat probes as part of your deployment contract:

  • probes define “healthy enough to receive traffic”
  • they shape rollout speed and stability
  • they can prevent outages when tuned with grace and realism

FAQ

Q: Should readiness and liveness use the same endpoint? A: Usually no. Readiness should check dependencies and request handling; liveness should be minimal and fast.

Q: What if startup is slow? A: Prefer startupProbe or increase failure thresholds instead of a long initialDelaySeconds.

Q: Do readiness failures restart containers? A: No. Readiness failures only remove the Pod from Service endpoints; restarts are triggered by liveness (and startup) probe failures.