
Kubernetes Tip: Requests & Limits (Without Surprises)

How CPU/memory requests and limits actually affect scheduling, throttling, OOMKills, and autoscaling.

Resource settings are one of the fastest ways to make a cluster feel “stable” (or mysteriously broken).

TL;DR

  • Set requests for every container (CPU + memory). This is what the scheduler uses.
  • Be careful with CPU limits: they can cause throttling under load.
  • Memory limits are important: without a memory limit, a container can consume node memory and destabilize neighbors.
  • HPA looks at requests (by default). Wrong requests => wrong autoscaling decisions.

What the scheduler really uses

The scheduler places Pods based on requests, not limits.

If you omit requests:

  • With no limits set (and no LimitRange defaults), the CPU and memory requests are effectively 0 (“best effort” for scheduling purposes).
  • If you set a limit but omit the request, the request defaults to the limit.

That makes the Pod easy to schedule, but it can also lead to noisy-neighbor issues.

A sane baseline YAML

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    memory: "512Mi"

Why this baseline:

  • You get predictable scheduling (requests set).
  • You avoid node-wide memory pressure (memory limit set).
  • You avoid CPU throttling surprises by not setting a CPU limit initially.

CPU limits and throttling

CPU limits are enforced by CFS quota. If you set a CPU limit too low:

  • Requests can be met, but bursty workloads will get throttled.
  • Latency spikes show up even though “CPU usage looks fine”.

When you do set CPU limits, try to keep:

  • limits.cpu >= requests.cpu (always)
  • and, for bursty apps, a limit that isn’t too close to the request
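
If you do end up setting a CPU limit, one illustrative shape (the numbers are placeholders, not a recommendation) keeps a few multiples of headroom above the request:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"            # ~4x the request; headroom for bursts, GC, cold caches
    memory: "512Mi"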

Memory limits and OOMKills

Memory is different: if a container exceeds its memory limit, the container is typically OOMKilled.

If you see restarts with:

kubectl describe pod <pod> | rg -n "OOMKilled|Killed"

Consider:

  • increasing limits.memory
  • optimizing memory usage
  • enabling a VPA recommendation loop (even if you don’t auto-apply)
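
For the last point, a minimal sketch of a recommendation-only VPA, assuming the VPA components are installed in the cluster (the name and target Deployment are placeholders):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                # placeholder
  namespace: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # placeholder Deployment
  updatePolicy:
    updateMode: "Off"          # recommendations only, nothing is auto-applied

Recommendations then show up in the object’s status (kubectl describe vpa web-vpa).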

HPA and CPU requests

HPA (CPU utilization) typically uses:

current CPU usage / CPU requests

So if CPU requests are too high:

  • HPA thinks utilization is low → scales too late

If CPU requests are too low:

  • HPA scales too aggressively
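
For reference, a minimal autoscaling/v2 HPA that scales on CPU utilization relative to requests (names and numbers are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # placeholder
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # placeholder Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of requests.cpu, averaged across Pods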

Quick audit commands

List containers missing requests/limits:

kubectl get deploy -A -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns
  | .metadata.name as $name
  | .spec.template.spec.containers[]
  | select(.resources.requests == null or .resources.requests.cpu == null or .resources.requests.memory == null)
  | "\($ns)/\($name) container=\(.name) missing requests"
'
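
A companion query (same assumption: jq available locally) flags containers with no memory limit:

kubectl get deploy -A -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns
  | .metadata.name as $name
  | .spec.template.spec.containers[]
  | select(.resources.limits == null or .resources.limits.memory == null)
  | "\($ns)/\($name) container=\(.name) missing memory limit"
'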

Checklist

  • Every container has CPU+memory requests
  • Every container has a memory limit
  • CPU limits only where you truly need them
  • Requests match reality (HPA/VPA alignment)

Requests, limits, and QoS classes (why it matters)

Kubernetes assigns each Pod a QoS class based on the resources you set. QoS affects eviction priority when a node is under pressure (especially memory pressure).

QoS classes in practice

  • Guaranteed: Every container sets CPU+memory requests and limits, and requests == limits for both.
  • Burstable: You set at least some requests/limits, but the Pod is not Guaranteed.
  • BestEffort: No requests/limits set at all.
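
As an illustration (the numbers are arbitrary), a container shaped like this lands in the Guaranteed class because requests equal limits for both resources:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"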

What this means operationally:

  • When a node hits memory pressure, BestEffort Pods are typically evicted first, then Burstable, and Guaranteed last.
  • If you want a component to be “reliable”, don’t let it accidentally fall into BestEffort.

Quick check:

kubectl get pod -n <ns> <pod> -o jsonpath='{.status.qosClass}{"\n"}'

CPU: scheduling vs throttling (and why limits can hurt latency)

CPU is “compressible”: if there isn’t enough CPU, processes usually slow down rather than crash.

  • requests.cpu affects placement (bin packing): the scheduler uses requests.
  • limits.cpu affects runtime enforcement: the CFS quota can throttle the container.

This is why a common production baseline is:

  • set CPU requests for every container
  • avoid CPU limits unless you have a clear reason (hard fairness, strict multi-tenancy policies, or capacity control)

A real-world symptom of too-low CPU limits

Your application is “healthy” but p95 latency spikes during traffic bursts. Metrics show CPU usage “below the limit”, yet response time is bad. Often the missing piece is CPU throttling (CFS quota), which may not be obvious unless you monitor throttling counters.
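
A low-tech way to look at the throttling counters directly, assuming the image contains cat and that the cgroup files are visible at these paths (the exact path depends on whether the node runs cgroup v1 or v2):

kubectl exec -n <ns> <pod> -c <container> -- cat /sys/fs/cgroup/cpu.stat      # cgroup v2
kubectl exec -n <ns> <pod> -c <container> -- cat /sys/fs/cgroup/cpu/cpu.stat  # cgroup v1

A growing nr_throttled (plus throttled_usec / throttled_time) means the CFS quota is actively slowing the container down.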

If you must set CPU limits, consider:

  • limits.cpu >= requests.cpu (always)
  • enough headroom for bursty code paths, GC, TLS handshakes, and cold caches

Memory: protect the node, but design for OOMKills

Memory is not compressible: going above a memory limit can kill the container.

Treat memory limits as:

  • a safety boundary for the node and other workloads
  • a failure scenario your app must tolerate (restarts, warmups, cache rebuilds)

Language runtime alignment

If your runtime thinks it has more memory available than the container is actually allowed, you’ll get surprises:

  • JVM: align -Xmx with limits.memory (leave headroom for native memory)
  • Node: align --max-old-space-size
  • Go: memory is dynamic; still watch peak usage and tune caches/buffers
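
A hedged JVM example (the env var value and percentage are illustrative; JAVA_TOOL_OPTIONS is honored by most JVMs, and -XX:MaxRAMPercentage sizes the heap relative to the container memory limit on modern JDKs):

containers:
  - name: app                        # placeholder
    image: example/app:latest        # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=70.0"   # leave ~30% for metaspace, threads, native memory
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        memory: "1Gi"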

Node allocatable and “why did the node OOM?”

Nodes have:

  • Capacity: physical resources
  • Allocatable: what Kubernetes offers to Pods after reserving for OS/system

You can be “within requests” but still destabilize a node at runtime if:

  • requests are far below real usage (aggressive overcommit)
  • multiple workloads spike together
  • the node is running heavy DaemonSets (logging, monitoring, service mesh)

Inspect allocatable and current allocations:

kubectl describe node <node> | rg -n "Allocatable|Allocated resources"
kubectl get events -A --sort-by=.lastTimestamp | rg -n "Evicted|MemoryPressure|OOM"

Sidecars and initContainers: the hidden tax

Many production Pods include sidecars:

  • service mesh proxies (Envoy)
  • log forwarders
  • security agents

These containers need their own requests/limits. If you size only the “main” container but ignore sidecars, the Pod’s total resource footprint can be significantly larger than expected.

Also remember:

  • initContainers can be CPU/memory heavy (migrations, warmups, downloads)
  • they affect startup time, scheduling, and can cause node pressure during large rollouts
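
A sketch of a Pod spec where the sidecar carries its own budget (images and numbers are placeholders):

containers:
  - name: app                            # main container
    image: example/app:latest
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        memory: "512Mi"
  - name: log-forwarder                  # sidecar
    image: example/log-forwarder:latest
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        memory: "128Mi"

The scheduler sums requests across regular containers; initContainers are accounted for separately (the effective Pod request is at least the largest initContainer request).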

Guardrails: LimitRange and ResourceQuota

To prevent “someone forgot requests” from entering production, many teams enforce guardrails:

LimitRange (defaults + min/max)

LimitRange can provide defaults and also enforce min/max per container. This is useful when teams forget to set requests (which leads to BestEffort/Burstable chaos).

apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: app
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        memory: 512Mi
      min:
        cpu: 25m
        memory: 64Mi
      max:
        cpu: "2"
        memory: 4Gi

ResourceQuota (namespace budget)

ResourceQuota prevents a namespace from consuming unlimited cluster capacity and can enforce request/limit usage.
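
A minimal sketch (names and numbers are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-budget          # placeholder
  namespace: app
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.memory: 40Gi
    pods: "50"

Once a quota covers compute requests/limits, Pods in that namespace must set them (directly or via LimitRange defaults) or they are rejected at admission.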

How to pick numbers (a practical sizing loop)

Think in iterations:

  1. Start simple: set CPU+memory requests; set memory limits; skip CPU limits initially.
  2. Measure in real traffic conditions:
     • peak memory (p95/p99)
     • CPU usage and throttling (if limits exist)
     • restart/OOMKilled rate
     • HPA behavior (scale timing and stability)
  3. Adjust:
     • requests too high => slow scaling + wasted bin packing
     • requests too low => noisy neighbors + unstable autoscaling
     • memory limit too low => OOMKills; too high => reduced density
  4. Automate recommendations:
     • VPA is excellent for recommendations even if you don’t auto-apply
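
For step 2, if metrics-server is installed (an assumption about your cluster), a quick point-in-time view of per-container usage:

kubectl top pod -n <ns> --containers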

A quick decision table

Goal                   CPU request   CPU limit   Memory request   Memory limit
General web service    required      optional    required         required
Strict multi-tenant    required      required    required         required
Batch job              required      optional    required         required

Final rule of thumb

If you do only one thing:

Set requests for CPU+memory, set memory limits, and only set CPU limits when you can measure and accept the throttling trade-offs.

FAQ

Q: Should requests equal limits? A: Only if you need strict guarantees. For most services, Burstable (requests < limits) is fine.

Q: Why do Pods get OOMKilled? A: The memory limit is too low for peak usage. Raise limits or reduce memory spikes.

Q: Do limits affect scheduling? A: Scheduling uses requests. Limits affect runtime enforcement and eviction pressure.