Kubernetes Tip: Safer Rollouts with PDB + Surge/Unavailable
Combine Deployment rollingUpdate settings with PodDisruptionBudgets to keep availability during upgrades and node maintenance.
Most outages during “routine deploys” come from a mismatch between:
- how many Pods you allow to go down during rollout
- how many can be evicted during disruptions
- whether you actually have enough replicas and capacity
Deployment rollingUpdate basics
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0
For services that must stay up, maxUnavailable: 0 is a strong default.
PodDisruptionBudget (PDB)
PDB protects against voluntary disruptions (node drain, upgrades, etc.).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
Pick one: minAvailable or maxUnavailable
minAvailable is easier to reason about for fixed replica counts. maxUnavailable is useful for percentage-based budgets.
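For larger, percentage-scaled services, a maxUnavailable-style budget is a reasonable sketch (the api-pdb name and app: api selector are just the running example):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: api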
The gotcha: too few replicas
If you have only 2 replicas and set minAvailable: 2, then:
- draining any node that hosts one of them is blocked (zero voluntary disruptions are allowed)
That may be correct for critical systems, but expect ops friction. Align replica count, budget, and maintenance procedures.
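You can see this state directly: with the budget fully consumed, kubectl get pdb reports zero allowed disruptions (output shape is illustrative):
kubectl get pdb -n <ns> api-pdb
# NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-pdb   2               N/A               0                     12d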
Recommended pattern (common services)
If you run 3 replicas:
- Deployment: maxUnavailable: 0, maxSurge: 1 (or 25%)
- PDB: minAvailable: 2
This lets you:
- rollout without dropping below 2 ready Pods
- drain nodes while keeping 2 Pods up
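Put together, a sketch for that 3-replica service could look like this (the name, image, and probe path are placeholders; adapt them to your service):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz             # placeholder probe
              port: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api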
Verify during incidents
kubectl get pdb -A
kubectl describe pdb -n <ns> <pdb>
kubectl rollout status deploy/<deploy> -n <ns>
Footnote
Using maxUnavailable: 0 forces the rollout to surge, so total Pod count (and resource demand) temporarily rises. Make sure your node pools have headroom or that the cluster autoscaler can add capacity.
Rollout math: make the numbers explicit
When you set maxSurge and maxUnavailable, you’re defining how many Pods can exist and how many can be down during an update.
Example: 8 replicas, maxSurge: 25%, maxUnavailable: 0
- maxSurge allows up to 2 extra Pods (25% of 8 = 2)
- maxUnavailable: 0 means Kubernetes tries not to reduce available Pods below 8
This only works if:
- the cluster has capacity to schedule the surge Pods
- readiness gates actually represent “safe to receive traffic”
If the cluster can’t schedule the surge (common when requests are high), the rollout stalls.
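Percentages round in favor of availability: maxSurge rounds up, maxUnavailable rounds down. A quick worked example with different numbers:
replicas: 10, maxSurge: 25%, maxUnavailable: 25%
surge       = ceil(10 * 0.25)  = 3   (up to 13 Pods may exist during the update)
unavailable = floor(10 * 0.25) = 2   (at least 8 Pods stay available)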
PDB protects against voluntary disruptions only
This is one of the most misunderstood parts of PDB.
PDB helps with:
- kubectl drain
- platform node upgrades that cordon + drain nodes
- some automated maintenance workflows
PDB does not protect against:
- node crashes
- kernel OOMs
- container crashes due to bugs
- network partitions or zone outages
So you still need:
- enough replicas
- spreading across nodes/zones
- good health checks and graceful shutdown
Spread your replicas (or PDB won’t save you)
If all replicas land on the same node, a single node drain breaks availability even with a PDB.
Prefer topologySpreadConstraints (modern approach) or pod anti-affinity.
Example: spread across zones:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
This improves both:
- rollout stability (new Pods spread correctly)
- disruption resilience (maintenance doesn’t remove all replicas at once)
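If you go the anti-affinity route instead, a soft (preferred) host-level rule is a common sketch (again using the app: api example labels):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: api
          topologyKey: kubernetes.io/hostname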
Graceful shutdown is part of “availability”
Even with good rollout settings, you can see errors if old Pods terminate abruptly.
Make sure:
- readiness flips to “not ready” quickly on shutdown
- terminationGracePeriodSeconds is long enough
- an optional preStop hook helps drain connections (see the sketch below)
This matters for:
- long-lived HTTP keep-alives
- gRPC streams
- background workers processing jobs
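A minimal sketch of those shutdown settings in the Deployment's Pod template (the 10-second sleep and 45-second grace period are illustrative; tune them to your traffic and shutdown path):
spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]  # gives endpoints/load balancers time to stop sending new traffic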
Deployment knobs that reduce rollout risk
minReadySeconds
Ensures a Pod stays ready for a minimum time before it’s considered “available”. This reduces flip-flopping readiness during warmup.
progressDeadlineSeconds
Controls when Kubernetes marks a rollout as failed. Helpful for alerting and automation.
revisionHistoryLimit
Keeps old ReplicaSets for rollback. Keep enough history to safely undo.
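All three sit at the top level of the Deployment spec; the values below are illustrative:
spec:
  minReadySeconds: 15          # a Pod must stay ready this long before it counts as available
  progressDeadlineSeconds: 600 # mark the rollout as failed if it makes no progress for 10 minutes
  revisionHistoryLimit: 5      # keep 5 old ReplicaSets for kubectl rollout undo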
Interaction with HPA and Cluster Autoscaler
Rollouts often create temporary extra Pods (surge). If the cluster lacks headroom, Cluster Autoscaler may add nodes—but:
- provisioning time can slow rollouts
- quotas and scale-up limits can block surge Pods
If HPA is active and traffic is high, you can also get competing behaviors:
- HPA scales up for load
- rollout creates surge Pods
Practical suggestions:
- roll out during lower-traffic windows when possible
- ensure node pools have buffer or autoscaler is configured well
- ensure requests are realistic (HPA uses requests in utilization calculations)
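To make the last point concrete: an autoscaling/v2 CPU target is computed against the container's requests, so unrealistic requests skew scaling (values are illustrative, and the two fragments live in different manifests):
# In the Deployment's container spec:
resources:
  requests:
    cpu: 500m

# In the HorizontalPodAutoscaler (autoscaling/v2):
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # targets ~350m of actual usage per Pod (70% of the 500m request)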
StatefulSets: similar goals, different mechanics
StatefulSets update Pods one at a time in ordinal order (by default from the highest ordinal down to pod-0). For stateful systems:
- PDB still helps for voluntary disruptions
- but you must understand whether the app can tolerate sequential restarts
For databases, follow operator guidance and validate replication/leader behavior.
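For example, a 3-replica quorum-based datastore usually wants to allow exactly one voluntary disruption at a time; a sketch (the db names are placeholders, and an operator's own guidance takes precedence):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: db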
Runbook: when a rollout stalls
- Check rollout status:
kubectl rollout status deploy/<name> -n <ns>
kubectl describe deploy/<name> -n <ns>
- Check scheduling failures (common with surge):
kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "FailedScheduling|Insufficient"
- Check readiness failures:
kubectl describe pod -n <ns> <pod>
kubectl logs -n <ns> <pod> -c <container> --previous
- Roll back if needed:
kubectl rollout undo deploy/<name> -n <ns>
Recommended patterns by service type
Standard stateless API (3+ replicas)
- Deployment: maxUnavailable: 0, maxSurge: 1
- PDB: minAvailable: replicas-1 (e.g., 2 of 3)
- Spread across zones
High scale service (10+ replicas)
- Deployment: maxUnavailable: 10%, maxSurge: 10%
- PDB: align the budget with the rollout (percentage-based)
- Strong observability and progressive delivery if available
Final takeaway
PDB + rollout settings work when:
- you have enough replicas
- you spread them across failure domains
- readiness/liveness are correct
- you have capacity for surge or autoscaling configured properly
How kubectl drain interacts with PDB (what you’ll see)
When you drain a node, Kubernetes will try to evict Pods. For Pods covered by a PDB:
- if evicting would violate the budget, the eviction is blocked
- the drain command may “hang” (it’s waiting for enough Pods to be available elsewhere)
This is not a bug—it’s the budget doing its job. But it means you should:
- test node drains in staging
- ensure your workloads can reschedule quickly (requests, node selectors, tolerations)
- avoid over-constraining placement (too strict affinity rules)
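What a blocked drain typically looks like in practice (the flags shown are the common ones; output is abbreviated and illustrative):
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# evicting pod <ns>/api-7c9d...
# error when evicting pods/"api-7c9d..." (will retry after 5s):
# Cannot evict pod as it would violate the pod's disruption budget.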
Canary and blue/green: when rollingUpdate isn’t enough
Deployments and PDBs are foundational, but some changes are high risk:
- schema migrations
- dependency upgrades
- major config changes
For these, consider progressive delivery:
- canary (shift 1%, then 10%, then 50%)
- blue/green (swap traffic between two stable environments)
Even if you don’t have a full progressive delivery platform, you can approximate canary by:
- creating a second Deployment with a small replica count
- routing a small portion of traffic (ingress rules, header-based routing)
The key idea is to limit blast radius while you validate the new version.
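A minimal sketch of the traffic-splitting part, assuming the ingress-nginx controller (the api-canary Service, host, and 10% weight are placeholders):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80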
Watch the right signals during rollout
Don’t rely only on “rollout status = success”. Monitor:
- error rate (5xx, gRPC errors)
- latency (p95/p99)
- saturation (CPU throttling, memory pressure)
- readiness failures and restart loops
If these regress, pause or roll back early.
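Both responses are one command away (the deployment name is a placeholder):
kubectl rollout pause deploy/<name> -n <ns>    # stop progressing while you investigate
kubectl rollout undo deploy/<name> -n <ns>     # revert to the previous ReplicaSet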
FAQ
Q: What does a PDB protect against? A: Voluntary disruptions (drains, upgrades), not node crashes.
Q: Why is my rollout stuck?
A: A strict PDB or readiness failures can block progress. Check events and adjust maxUnavailable or minAvailable.
Q: How should I set surge/unavailable? A: Keep enough surge to create new Pods while staying within PDB constraints. Avoid zero-surge if you need strict availability.