Kubernetes Tip: Autoscaling Without Thrash (HPA + VPA + Cluster Autoscaler)
How to make autoscaling predictable: right requests, sane HPA behavior, VPA recommendations, and capacity-aware cluster scaling.
Autoscaling is a system of systems. If it’s unstable, it’s rarely “HPA is bad”—it’s usually that the inputs are wrong or multiple controllers are fighting each other.
This guide focuses on practical autoscaling that behaves well in production.
The three layers
- HPA (Horizontal Pod Autoscaler): changes replica count based on metrics.
- VPA (Vertical Pod Autoscaler): changes (or recommends) requests/limits.
- Cluster Autoscaler: adds/removes nodes when the cluster lacks capacity.
If you want stability, you must align all three.
Step 0: Requests are the foundation
Many autoscaling behaviors depend on requests:
- HPA CPU utilization is often computed as: current CPU usage / CPU requests
- Scheduler placement uses requests
- Cluster autoscaler reacts to unschedulable Pods (often caused by requests)
If requests are wrong:
- HPA scales at the wrong time
- Pods pack poorly
- scaling becomes either too aggressive or too slow
Before tuning HPA, confirm every container has reasonable requests.
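As a baseline, every container should declare requests. A minimal sketch (the image name and resource values here are illustrative placeholders, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:1.0.0   # placeholder image
        resources:
          requests:
            cpu: 250m        # the denominator for HPA CPU utilization
            memory: 256Mi
          limits:
            memory: 512Mi
```

Derive the numbers from observed usage (e.g. historical metrics), not guesses.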
HPA basics (CPU utilization example)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
Interpretation:
- HPA tries to keep average CPU usage across Pods at ~60% of the CPU request
- above 60% it scales up, below it scales down, using desiredReplicas = ceil(currentReplicas × currentUtilization / target); for example, 5 replicas at 90% utilization with a 60% target gives ceil(5 × 90 / 60) = 8
Avoid oscillation: stabilization and scaling policies
HPA v2 supports behavior tuning:
- stabilization windows (avoid rapid scale down/up)
- scaling policies (cap how fast it changes)
Example:
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 20
      periodSeconds: 60
Why this works:
- scale up quickly (handle bursts)
- scale down slowly (avoid flapping when traffic dips briefly)
Pick the right metric (CPU is not always correct)
CPU utilization is a good default for:
- stateless web services
But it’s often wrong for:
- queue workers (better: queue depth / lag)
- memory-bound workloads (CPU stays low while memory grows)
- I/O bound services
Consider:
- custom metrics (queue length, request latency, RPS)
- external metrics (cloud load balancer metrics)
If you scale on latency, beware of feedback loops: latency is affected by many factors (downstream dependencies, GC pauses, cold caches), and adding replicas does not always reduce it.
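A custom-metric HPA might look like the sketch below. It assumes a custom metrics adapter (e.g. a Prometheus adapter) is installed and exposing a per-Pod metric; `queue_messages_ready` is a hypothetical metric name:

```yaml
# Assumes a custom metrics adapter exposes "queue_messages_ready" per Pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: "100"   # target ~100 queued messages per Pod
```

With `AverageValue`, HPA divides the metric total by the replica count, so backlog per worker stays near the target.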
Cold start, warmups, and readiness
Autoscaling adds Pods, but Pods need time to become useful:
- image pull time
- startup time (JVM warmup, migrations)
- readiness gating
If your service has cold starts, you may need:
- higher minReplicas
- faster scale up policies
- pre-pulled images (node cache)
- startupProbe/readiness tuned to avoid serving too early
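A probe sketch for a slow-starting service (the `/healthz` and `/ready` endpoints and port are hypothetical; adjust to your app):

```yaml
# Container fragment; probe paths and port are placeholders.
containers:
- name: api
  image: registry.example.com/api:1.0.0
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30    # tolerates up to 30 * 5s = 150s of startup
    periodSeconds: 5
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 10
```

The startupProbe suppresses the readiness/liveness checks until the app has booted, so new Pods are not marked ready (or killed) mid-warmup.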
Cluster Autoscaler: the capacity layer
HPA can request more replicas, but if the cluster has no room, Pods go Pending. That is where Cluster Autoscaler should add nodes.
Common reasons cluster autoscaler does not help:
- Pods have node selectors/affinity that no node group can satisfy
- requests are so high that no node type can fit the Pod
- node group max size is reached (quota)
- PodDisruptionBudgets or constraints block scale down
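On the last point, a PDB that allows zero voluntary disruptions blocks node drains entirely. A sketch of the problematic shape (names are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: app
spec:
  maxUnavailable: 0        # no voluntary evictions allowed; cluster
                           # autoscaler cannot drain nodes running these Pods
  selector:
    matchLabels:
      app: api
```

If scale down never happens, check whether a PDB like this covers the Pods on the under-utilized nodes; `maxUnavailable: 1` (or a percentage) usually restores drainability.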
During an incident, check:
kubectl get pod -n <ns> | rg -n "Pending"
kubectl describe pod -n <ns> <pod> | rg -n "FailedScheduling|Insufficient"
If Pods are Pending due to insufficient resources, autoscaler should react—if it doesn’t, inspect autoscaler logs and node group configuration.
VPA: recommendations are valuable even if you don’t auto-apply
VPA can operate in multiple modes:
- Off: only recommends
- Auto: updates requests automatically (can be disruptive)
- Initial: sets requests at Pod creation time
Many teams start with “Off” to learn:
- what requests should be
- which workloads have stable profiles vs highly variable usage
Then move to “Initial” for safer automation.
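A recommendation-only VPA is a low-risk starting point. This sketch assumes the VPA components (the `autoscaling.k8s.io` CRDs and controllers) are installed; names are illustrative:

```yaml
# Requires the VPA components to be installed in the cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"      # recommend only; never evicts or mutates Pods
```

Read the recommendations with `kubectl describe vpa api -n app` and compare them against your current requests.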
HPA + VPA together: avoid fighting
If you use both:
- use HPA for CPU-based scaling
- use VPA for memory request recommendations (or non-HPA-controlled resources)
Be careful if VPA changes CPU requests frequently while HPA scales on CPU utilization:
- changing CPU requests changes the denominator
- HPA may scale differently even if real load is unchanged
Practical pattern:
- enable VPA recommendations
- adopt them periodically (manual review) or via “Initial” mode
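One way to encode this separation, assuming VPA v1's `resourcePolicy` field: restrict VPA to memory so it never changes the CPU request that HPA uses as its denominator (names are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Initial"               # apply only at Pod creation
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # leave CPU requests alone; HPA scales on CPU
```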
A realistic rollout plan
- Ensure requests exist and are reasonable (baseline).
- Add HPA with conservative targets and slow scale down.
- Validate behavior under load tests or real traffic.
- Ensure cluster autoscaler can add capacity when Pods are Pending.
- Add VPA in recommendation mode to improve requests over time.
Troubleshooting autoscaling (what to check)
HPA status and events
kubectl get hpa -n <ns>
kubectl describe hpa -n <ns> <hpa>
Look for:
- “unable to get metrics”
- “missing request for cpu”
- rapid scale events
Metrics pipeline
If HPA can’t read metrics, it can’t work. Confirm:
- metrics-server (for resource metrics)
- or Prometheus adapter / custom metrics adapter (for custom/external metrics)
Requests and QoS
If requests are missing or wildly off, scaling behavior will be unpredictable. Audit requests in Deployments and sidecars.
Checklist
- Requests are set for all containers
- HPA scale down is stabilized (avoid flapping)
- Metrics pipeline is reliable (no “unable to get metrics”)
- Cluster autoscaler can satisfy placement constraints
- Use VPA recommendations to continuously improve requests
Memory-based scaling: be careful
HPA can scale on memory, but memory is often a lagging indicator:
- memory grows slowly with traffic and caches
- scaling based on memory can over-react if your app caches aggressively
If you scale on memory:
- keep scale-down conservative (large stabilization window)
- ensure memory requests are accurate (or the percentage target is meaningless)
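Putting those two precautions together, a memory-based HPA with a deliberately slow scale down might look like this sketch (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cache-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cache-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # long window: memory drains slowly
      policies:
      - type: Percent
        value: 10                       # remove at most 10% of replicas per minute
        periodSeconds: 60
```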
For many services, a better pattern is:
- HPA on CPU (or request rate)
- VPA to tune memory requests
Event-driven scaling (queues, streams): consider KEDA-style signals
If your workload is driven by a queue:
- CPU utilization may not represent backlog
- you may want scaling on queue depth, lag, or consumer group metrics
That’s where event-driven autoscaling approaches (e.g., KEDA patterns) shine:
- scale based on backlog
- scale to zero for idle workloads (if allowed)
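As a concrete illustration, a KEDA `ScaledObject` scaling a consumer Deployment on Kafka lag might look like this. It assumes KEDA is installed; the broker address, topic, and consumer group are hypothetical placeholders:

```yaml
# Requires KEDA; broker, topic, and consumer group are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
  namespace: app
spec:
  scaleTargetRef:
    name: worker           # Deployment to scale
  minReplicaCount: 0       # scale to zero when the topic is idle
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.example.com:9092
      consumerGroup: worker-group
      topic: jobs
      lagThreshold: "100"  # target lag per replica
```

KEDA manages an HPA under the hood, so the stabilization and policy ideas above still apply.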
Even without a full event-driven stack, the key lesson is:
Choose metrics that reflect the business load, not just resource consumption.
Practical guardrail: always test scaling behavior
Autoscaling “looks fine” until your first real burst.
Before relying on it in production, validate:
- how long it takes to add Pods (startup + readiness)
- how long it takes to add nodes (autoscaler + provisioning)
- whether the service can handle partial capacity during the ramp
If your node provisioning takes 3–5 minutes, you may need:
- higher minReplicas
- pre-warmed node pools
- faster scale-up policies
Stable autoscaling is less about one knob and more about aligning boot time, metrics, and capacity.
FAQ
Q: When should I use HPA vs VPA? A: Use HPA to scale replica count by metrics; use VPA to right-size requests and limits. Avoid letting both control the same workload at the same time; run VPA in recommendation mode if HPA is enabled.
Q: Why does HPA oscillate? A: Metrics lag and aggressive policies can cause thrash. Add stabilization windows, slow scale down, and tune targets.
Q: Do I need metrics-server? A: For CPU or memory metrics, yes. For custom metrics, use a metrics pipeline (for example Prometheus plus adapter).