Kubernetes Tip: Autoscaling Without Thrash (HPA + VPA + Cluster Autoscaler)
How to make autoscaling predictable: right requests, sane HPA behavior, VPA recommendations, and capacity-aware cluster scaling.
Autoscaling is a system of systems. If it’s unstable, it’s rarely “HPA is bad”—it’s usually that the inputs are wrong or multiple controllers are fighting each other.
This guide focuses on practical autoscaling that behaves well in production.
The three layers
- HPA (Horizontal Pod Autoscaler): changes replica count based on metrics.
- VPA (Vertical Pod Autoscaler): changes (or recommends) requests/limits.
- Cluster Autoscaler: adds/removes nodes when the cluster lacks capacity.
If you want stability, you must align all three.
Step 0: Requests are the foundation
Many autoscaling behaviors depend on requests:
- HPA CPU utilization is often computed as: current CPU usage / CPU requests
- Scheduler placement uses requests
- Cluster autoscaler reacts to unschedulable Pods (often caused by requests)
If requests are wrong:
- HPA scales at the wrong time
- Pods pack poorly
- scaling becomes either too aggressive or too slow
Before tuning HPA, confirm every container has reasonable requests.
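As a baseline, every container should declare requests. A minimal sketch (the image name and resource values here are illustrative placeholders, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.example.com/api:1.0.0   # placeholder image
        resources:
          requests:
            cpu: 250m        # the denominator for HPA CPU utilization
            memory: 256Mi
          limits:
            memory: 512Mi
```

Derive the numbers from observed usage (e.g. historical metrics), not guesses.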
HPA basics (CPU utilization example)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
Interpretation:
- HPA tries to keep average CPU usage across Pods at ~60% of the CPU request
- above 60% it scales up, below it scales down, using desiredReplicas = ceil(currentReplicas × currentUtilization / target); for example, 5 replicas at 90% utilization with a 60% target gives ceil(5 × 90 / 60) = 8
Avoid oscillation: stabilization and scaling policies
HPA v2 supports behavior tuning:
- stabilization windows (avoid rapid scale down/up)
- scaling policies (cap how fast it changes)
Example:
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 20
      periodSeconds: 60
Why this works:
- scale up quickly (handle bursts)
- scale down slowly (avoid flapping when traffic dips briefly)
Pick the right metric (CPU is not always correct)
CPU utilization is a good default for:
- stateless web services
But it’s often wrong for:
- queue workers (better: queue depth / lag)
- memory-bound workloads (CPU stays low while memory grows)
- I/O bound services
Consider:
- custom metrics (queue length, request latency, RPS)
- external metrics (cloud load balancer metrics)
If you scale on latency, beware of feedback loops: latency is affected by many factors (downstream dependencies, GC pauses, cold caches), and adding replicas does not always reduce it.
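A custom-metric HPA might look like the sketch below. It assumes a custom metrics adapter (e.g. a Prometheus adapter) is installed and exposing a per-Pod metric; `queue_messages_ready` is a hypothetical metric name:

```yaml
# Assumes a custom metrics adapter exposes "queue_messages_ready" per Pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: "100"   # target ~100 queued messages per Pod
```

With `AverageValue`, HPA divides the metric total by the replica count, so backlog per worker stays near the target.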
Cold start, warmups, and readiness
Autoscaling adds Pods, but Pods need time to become useful:
- image pull time
- startup time (JVM warmup, migrations)
- readiness gating
If your service has cold starts, you may need:
- higher minReplicas
- faster scale up policies
- pre-pulled images (node cache)
- startupProbe/readiness tuned to avoid serving too early
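A probe sketch for a slow-starting service (the `/healthz` and `/ready` endpoints and port are hypothetical; adjust to your app):

```yaml
# Container fragment; probe paths and port are placeholders.
containers:
- name: api
  image: registry.example.com/api:1.0.0
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30    # tolerates up to 30 * 5s = 150s of startup
    periodSeconds: 5
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 10
```

The startupProbe suppresses the readiness/liveness checks until the app has booted, so new Pods are not marked ready (or killed) mid-warmup.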
Cluster Autoscaler: the capacity layer
HPA can request more replicas, but if the cluster has no room, Pods go Pending. That is where Cluster Autoscaler should add nodes.
Common reasons cluster autoscaler does not help:
- Pods have node selectors/affinity that no node group can satisfy
- requests are so high that no node type can fit the Pod
- node group max size is reached (quota)
- PodDisruptionBudgets or constraints block scale down
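On the last point, a PDB that allows zero voluntary disruptions blocks node drains entirely. A sketch of the problematic shape (names are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: app
spec:
  maxUnavailable: 0        # no voluntary evictions allowed; cluster
                           # autoscaler cannot drain nodes running these Pods
  selector:
    matchLabels:
      app: api
```

If scale down never happens, check whether a PDB like this covers the Pods on the under-utilized nodes; `maxUnavailable: 1` (or a percentage) usually restores drainability.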
During an incident, check:
kubectl get pod -n <ns> | rg -n "Pending"
kubectl describe pod -n <ns> <pod> | rg -n "FailedScheduling|Insufficient"
If Pods are Pending due to insufficient resources, autoscaler should react—if it doesn’t, inspect autoscaler logs and node group configuration.
VPA: recommendations are valuable even if you don’t auto-apply
VPA can operate in multiple modes:
- Off: only recommends
- Auto: updates requests automatically (can be disruptive)
- Initial: sets requests at Pod creation time
Many teams start with “Off” to learn:
- what requests should be
- which workloads have stable profiles vs highly variable usage
Then move to “Initial” for safer automation.
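A recommendation-only VPA is a low-risk starting point. This sketch assumes the VPA components (the `autoscaling.k8s.io` CRDs and controllers) are installed; names are illustrative:

```yaml
# Requires the VPA components to be installed in the cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"      # recommend only; never evicts or mutates Pods
```

Read the recommendations with `kubectl describe vpa api -n app` and compare them against your current requests.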
HPA + VPA together: avoid fighting
If you use both:
- use HPA for CPU-based scaling
- use VPA for memory request recommendations (or non-HPA-controlled resources)
Be careful if VPA changes CPU requests frequently while HPA scales on CPU utilization:
- changing CPU requests changes the denominator
- HPA may scale differently even if real load is unchanged
Practical pattern:
- enable VPA recommendations
- adopt them periodically (manual review) or via “Initial” mode
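One way to encode this separation, assuming VPA v1's `resourcePolicy` field: restrict VPA to memory so it never changes the CPU request that HPA uses as its denominator (names are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Initial"               # apply only at Pod creation
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # leave CPU requests alone; HPA scales on CPU
```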
A realistic rollout plan
- Ensure requests exist and are reasonable (baseline).
- Add HPA with conservative targets and slow scale down.
- Validate behavior under load tests or real traffic.
- Ensure cluster autoscaler can add capacity when Pods are Pending.
- Add VPA in recommendation mode to improve requests over time.
Troubleshooting autoscaling (what to check)
HPA status and events
kubectl get hpa -n <ns>
kubectl describe hpa -n <ns> <hpa>
Look for:
- “unable to get metrics”
- “missing request for cpu”
- rapid scale events
Metrics pipeline
If HPA can’t read metrics, it can’t work. Confirm:
- metrics-server (for resource metrics)
- or Prometheus adapter / custom metrics adapter (for custom/external metrics)
Requests and QoS
If requests are missing or wildly off, scaling behavior will be unpredictable. Audit requests in Deployments and sidecars.
Checklist
- Requests are set for all containers
- HPA scale down is stabilized (avoid flapping)
- Metrics pipeline is reliable (no “unable to get metrics”)
- Cluster autoscaler can satisfy placement constraints
- Use VPA recommendations to continuously improve requests
Memory-based scaling: be careful
HPA can scale on memory, but memory is often a lagging indicator:
- memory grows slowly with traffic and caches
- scaling based on memory can over-react if your app caches aggressively
If you scale on memory:
- keep scale-down conservative (large stabilization window)
- ensure memory requests are accurate (or the percentage target is meaningless)
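Putting those two precautions together, a memory-based HPA with a deliberately slow scale down might look like this sketch (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cache-api
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cache-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # long window: memory drains slowly
      policies:
      - type: Percent
        value: 10                       # remove at most 10% of replicas per minute
        periodSeconds: 60
```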
For many services, a better pattern is:
- HPA on CPU (or request rate)
- VPA to tune memory requests
Event-driven scaling (queues, streams): consider KEDA-style signals
If your workload is driven by a queue:
- CPU utilization may not represent backlog
- you may want scaling on queue depth, lag, or consumer group metrics
That’s where event-driven autoscaling approaches (e.g., KEDA patterns) shine:
- scale based on backlog
- scale to zero for idle workloads (if allowed)
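As a concrete illustration, a KEDA `ScaledObject` scaling a consumer Deployment on Kafka lag might look like this. It assumes KEDA is installed; the broker address, topic, and consumer group are hypothetical placeholders:

```yaml
# Requires KEDA; broker, topic, and consumer group are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
  namespace: app
spec:
  scaleTargetRef:
    name: worker           # Deployment to scale
  minReplicaCount: 0       # scale to zero when the topic is idle
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.example.com:9092
      consumerGroup: worker-group
      topic: jobs
      lagThreshold: "100"  # target lag per replica
```

KEDA manages an HPA under the hood, so the stabilization and policy ideas above still apply.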
Even without a full event-driven stack, the key lesson is:
Choose metrics that reflect the business load, not just resource consumption.
Practical guardrail: always test scaling behavior
Autoscaling “looks fine” until your first real burst.
Before relying on it in production, validate:
- how long it takes to add Pods (startup + readiness)
- how long it takes to add nodes (autoscaler + provisioning)
- whether the service can handle partial capacity during the ramp
If your node provisioning takes 3–5 minutes, you may need:
- higher minReplicas
- pre-warmed node pools
- faster scale-up policies
Stable autoscaling is less about one knob and more about aligning boot time, metrics, and capacity.
FAQ
Q: When should I use HPA vs VPA? A: Use HPA to scale replica count by metrics; use VPA to right-size requests and limits. Avoid letting both control the same workload at the same time; run VPA in recommendation mode if HPA is enabled.
Q: Why does HPA oscillate? A: Metrics lag and aggressive policies can cause thrash. Add stabilization windows, slow scale down, and tune targets.
Q: Do I need metrics-server? A: For CPU or memory metrics, yes. For custom metrics, use a metrics pipeline (for example Prometheus plus adapter).