Kubernetes Tip: Safer Rollouts with PDB + Surge/Unavailable
Combine Deployment rollingUpdate settings with PodDisruptionBudgets to keep availability during upgrades and node maintenance.
Most outages during “routine deploys” come from a mismatch between:
- how many Pods you allow to go down during rollout
- how many can be evicted during disruptions
- whether you actually have enough replicas and capacity
Deployment rollingUpdate basics
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0
For services that must stay up, maxUnavailable: 0 is a strong default.
PodDisruptionBudget (PDB)
PDB protects against voluntary disruptions (node drain, upgrades, etc.).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
Pick one: minAvailable or maxUnavailable
minAvailable is easier to reason about for fixed replica counts. maxUnavailable is useful for percentage-based budgets.
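For larger, percentage-scaled services, a maxUnavailable-style budget is a reasonable sketch (the api-pdb name and app: api selector are just the running example):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      app: api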
The gotcha: too few replicas
If you have only 2 replicas and set minAvailable: 2, then:
- draining any node that hosts one of them is blocked (zero voluntary disruptions are allowed)
That may be correct for critical systems, but expect ops friction. Align replica count, budget, and maintenance procedures.
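You can see this state directly: with the budget fully consumed, kubectl get pdb reports zero allowed disruptions (output shape is illustrative):
kubectl get pdb -n <ns> api-pdb
# NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-pdb   2               N/A               0                     12d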
Recommended pattern (common services)
If you run 3 replicas:
- Deployment: maxUnavailable: 0, maxSurge: 1 (or 25%)
- PDB: minAvailable: 2
This lets you:
- rollout without dropping below 2 ready Pods
- drain nodes while keeping 2 Pods up
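Put together, a sketch for that 3-replica service could look like this (the name, image, and probe path are placeholders; adapt them to your service):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz             # placeholder probe
              port: 8080
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api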
Verify during incidents
kubectl get pdb -A
kubectl describe pdb -n <ns> <pdb>
kubectl rollout status deploy/<deploy> -n <ns>
Footnote
Using maxUnavailable: 0 forces the rollout to surge, so total Pod count (and resource demand) temporarily rises. Make sure your node pools have headroom or that the cluster autoscaler can add capacity.
Rollout math: make the numbers explicit
When you set maxSurge and maxUnavailable, you’re defining how many Pods can exist and how many can be down during an update.
Example: 8 replicas, maxSurge: 25%, maxUnavailable: 0
- maxSurge allows up to 2 extra Pods (25% of 8 = 2)
- maxUnavailable: 0 means Kubernetes tries not to reduce available Pods below 8
This only works if:
- the cluster has capacity to schedule the surge Pods
- readiness gates actually represent “safe to receive traffic”
If the cluster can’t schedule the surge (common when requests are high), the rollout stalls.
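Percentages round in favor of availability: maxSurge rounds up, maxUnavailable rounds down. A quick worked example with different numbers:
replicas: 10, maxSurge: 25%, maxUnavailable: 25%
surge       = ceil(10 * 0.25)  = 3   (up to 13 Pods may exist during the update)
unavailable = floor(10 * 0.25) = 2   (at least 8 Pods stay available)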
PDB protects against voluntary disruptions only
This is one of the most misunderstood parts of PDB.
PDB helps with:
- kubectl drain
- platform node upgrades that cordon + drain nodes
- some automated maintenance workflows
PDB does not protect against:
- node crashes
- kernel OOMs
- container crashes due to bugs
- network partitions or zone outages
So you still need:
- enough replicas
- spreading across nodes/zones
- good health checks and graceful shutdown
Spread your replicas (or PDB won’t save you)
If all replicas land on the same node, a single node drain breaks availability even with a PDB.
Prefer topologySpreadConstraints (modern approach) or pod anti-affinity.
Example: spread across zones:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: api
This improves both:
- rollout stability (new Pods spread correctly)
- disruption resilience (maintenance doesn’t remove all replicas at once)
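If you go the anti-affinity route instead, a soft (preferred) host-level rule is a common sketch (again using the app: api example labels):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: api
          topologyKey: kubernetes.io/hostname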
Graceful shutdown is part of “availability”
Even with good rollout settings, you can see errors if old Pods terminate abruptly.
Make sure:
- readiness flips to “not ready” quickly on shutdown
- terminationGracePeriodSeconds is long enough
- an optional preStop hook helps drain connections (see the sketch below)
This matters for:
- long-lived HTTP keep-alives
- gRPC streams
- background workers processing jobs
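A minimal sketch of those shutdown settings in the Deployment's Pod template (the 10-second sleep and 45-second grace period are illustrative; tune them to your traffic and shutdown path):
spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]  # gives endpoints/load balancers time to stop sending new traffic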
Deployment knobs that reduce rollout risk
minReadySeconds
Ensures a Pod stays ready for a minimum time before it’s considered “available”. This reduces flip-flopping readiness during warmup.
progressDeadlineSeconds
Controls when Kubernetes marks a rollout as failed. Helpful for alerting and automation.
revisionHistoryLimit
Keeps old ReplicaSets for rollback. Keep enough history to safely undo.
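All three sit at the top level of the Deployment spec; the values below are illustrative:
spec:
  minReadySeconds: 15          # a Pod must stay ready this long before it counts as available
  progressDeadlineSeconds: 600 # mark the rollout as failed if it makes no progress for 10 minutes
  revisionHistoryLimit: 5      # keep 5 old ReplicaSets for kubectl rollout undo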
Interaction with HPA and Cluster Autoscaler
Rollouts often create temporary extra Pods (surge). If the cluster lacks headroom, Cluster Autoscaler may add nodes—but:
- provisioning time can slow rollouts
- quotas and scale-up limits can block surge Pods
If HPA is active and traffic is high, you can also get competing behaviors:
- HPA scales up for load
- rollout creates surge Pods
Practical suggestions:
- roll out during lower-traffic windows when possible
- ensure node pools have buffer or autoscaler is configured well
- ensure requests are realistic (HPA uses requests in utilization calculations)
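To make the last point concrete: an autoscaling/v2 CPU target is computed against the container's requests, so unrealistic requests skew scaling (values are illustrative, and the two fragments live in different manifests):
# In the Deployment's container spec:
resources:
  requests:
    cpu: 500m

# In the HorizontalPodAutoscaler (autoscaling/v2):
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # targets ~350m of actual usage per Pod (70% of the 500m request)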
StatefulSets: similar goals, different mechanics
StatefulSets update Pods one at a time in ordinal order (by default from the highest ordinal down to pod-0). For stateful systems:
- PDB still helps for voluntary disruptions
- but you must understand whether the app can tolerate sequential restarts
For databases, follow operator guidance and validate replication/leader behavior.
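For example, a 3-replica quorum-based datastore usually wants to allow exactly one voluntary disruption at a time; a sketch (the db names are placeholders, and an operator's own guidance takes precedence):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: db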
Runbook: when a rollout stalls
- Check rollout status:
kubectl rollout status deploy/<name> -n <ns>
kubectl describe deploy/<name> -n <ns>
- Check scheduling failures (common with surge):
kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "FailedScheduling|Insufficient"
- Check readiness failures:
kubectl describe pod -n <ns> <pod>
kubectl logs -n <ns> <pod> -c <container> --previous
- Roll back if needed:
kubectl rollout undo deploy/<name> -n <ns>
Recommended patterns by service type
Standard stateless API (3+ replicas)
- Deployment: maxUnavailable: 0, maxSurge: 1
- PDB: minAvailable: replicas-1 (e.g., 2 of 3)
- Spread across zones
High scale service (10+ replicas)
- Deployment: maxUnavailable: 10%, maxSurge: 10%
- PDB: align the budget with the rollout (percentage-based)
- Strong observability and progressive delivery if available
Final takeaway
PDB + rollout settings work when:
- you have enough replicas
- you spread them across failure domains
- readiness/liveness are correct
- you have capacity for surge or autoscaling configured properly
How kubectl drain interacts with PDB (what you’ll see)
When you drain a node, Kubernetes will try to evict Pods. For Pods covered by a PDB:
- if evicting would violate the budget, the eviction is blocked
- the drain command may “hang” (it’s waiting for enough Pods to be available elsewhere)
This is not a bug—it’s the budget doing its job. But it means you should:
- test node drains in staging
- ensure your workloads can reschedule quickly (requests, node selectors, tolerations)
- avoid over-constraining placement (too strict affinity rules)
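What a blocked drain typically looks like in practice (the flags shown are the common ones; output is abbreviated and illustrative):
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# evicting pod <ns>/api-7c9d...
# error when evicting pods/"api-7c9d..." (will retry after 5s):
# Cannot evict pod as it would violate the pod's disruption budget.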
Canary and blue/green: when rollingUpdate isn’t enough
Deployments and PDBs are foundational, but some changes are high risk:
- schema migrations
- dependency upgrades
- major config changes
For these, consider progressive delivery:
- canary (shift 1%, then 10%, then 50%)
- blue/green (swap traffic between two stable environments)
Even if you don’t have a full progressive delivery platform, you can approximate canary by:
- creating a second Deployment with a small replica count
- routing a small portion of traffic (ingress rules, header-based routing)
The key idea is to limit blast radius while you validate the new version.
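A minimal sketch of the traffic-splitting part, assuming the ingress-nginx controller (the api-canary Service, host, and 10% weight are placeholders):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80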
Watch the right signals during rollout
Don’t rely only on “rollout status = success”. Monitor:
- error rate (5xx, gRPC errors)
- latency (p95/p99)
- saturation (CPU throttling, memory pressure)
- readiness failures and restart loops
If these regress, pause or roll back early.
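Both responses are one command away (the deployment name is a placeholder):
kubectl rollout pause deploy/<name> -n <ns>    # stop progressing while you investigate
kubectl rollout undo deploy/<name> -n <ns>     # revert to the previous ReplicaSet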
FAQ
Q: What does a PDB protect against? A: Voluntary disruptions (drains, upgrades), not node crashes.
Q: Why is my rollout stuck?
A: A strict PDB or readiness failures can block progress. Check events and adjust maxUnavailable or minAvailable.
Q: How should I set surge/unavailable? A: Keep enough surge to create new Pods while staying within PDB constraints. Avoid zero-surge if you need strict availability.