
Kubernetes Tip: A Production Troubleshooting Playbook

A structured workflow for diagnosing Pending pods, CrashLoopBackOff, traffic failures, and node-level issues—without guessing.

When Kubernetes “breaks”, the fastest teams aren’t the ones with magical intuition—they’re the ones with a repeatable workflow.

This playbook is designed to be used during incidents. It’s intentionally systematic: start broad, narrow down, confirm hypotheses with evidence, and only then apply fixes.

First principles: classify the failure mode

Most production problems fall into a few buckets:

  1. Scheduling / Pending: Pods can’t be placed on nodes.
  2. Startup / CrashLoopBackOff: Pods start and crash repeatedly.
  3. Readiness / Not receiving traffic: Pods run, but Services don’t route to them.
  4. Networking: DNS, egress, ingress, or NetworkPolicy issues.
  5. Node / platform: disk pressure, CNI/CSI problems, kubelet/runtime issues.

Start by identifying which bucket you’re in.

A high-level flow (Mermaid)

flowchart TD
  A[Symptom observed] --> B{Pod Pending?}
  B -- yes --> C[Check events + scheduling constraints]
  B -- no --> D{CrashLoopBackOff?}
  D -- yes --> E[Check logs, exit code, resources, config]
  D -- no --> F{Ready?}
  F -- no --> G[Readiness probe, endpoints, dependencies]
  F -- yes --> H{Traffic failing?}
  H -- yes --> I[Ingress, Service, DNS, NetworkPolicy]
  H -- no --> J[Node / platform diagnostics]

The 5 commands you run first

These give you the “shape” of the incident:

kubectl get pod -n <ns> -o wide
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns> --sort-by=.lastTimestamp
kubectl logs -n <ns> <pod> -c <container> --previous
kubectl top pod -n <ns> --containers

If metrics-server isn’t installed, kubectl top won’t return data; lean on your monitoring system instead, but still run the other four commands.

1) Pods are Pending (scheduling issues)

When a Pod is Pending, it usually means no node satisfied the constraints.

Typical causes

  • requests too high (Insufficient cpu/memory)
  • node selectors / affinity too strict
  • taints not tolerated
  • topology spread constraints that can’t be satisfied
  • ResourceQuota or LimitRange restrictions in the namespace

What to check

Look for FailedScheduling in events:

kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "FailedScheduling|Insufficient|taint|affinity"

Then verify node capacity and allocatable:

kubectl describe node <node> | rg -n "Allocatable|Allocated resources|Taints"

Fix approach (order matters)

  1. Confirm requests are realistic (don’t blindly reduce).
  2. If constraints are too strict, relax them (carefully).
  3. If capacity is truly insufficient, scale node pools or reduce workload footprint.
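
To test the hypothesis before changing anything, compare what the Pod requests with what nodes actually offer (same placeholders as above; the jsonpath output format can vary slightly by kubectl version):

kubectl get pod -n <ns> <pod> -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'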

2) CrashLoopBackOff (startup failures)

CrashLoopBackOff means the container exits repeatedly and Kubernetes backs off restarts.

What to look for

  1. Container exit code and reason:
kubectl describe pod -n <ns> <pod> | rg -n "Reason|Exit Code|OOMKilled|Error"
  2. Previous logs (critical):
kubectl logs -n <ns> <pod> -c <container> --previous
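
If you want both in one shot, a jsonpath query pulls the last exit code and reason for every container (placeholders as above):

kubectl get pod -n <ns> <pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

Exit code 137 with reason OOMKilled points at memory limits; non-zero codes like 1 or 2 usually mean the application itself failed on startup.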

Common root causes

  • misconfigured env vars / missing secrets
  • wrong command/args
  • image mismatch (wrong tag, wrong arch)
  • dependency connection failures (DB URL, TLS)
  • OOMKills (memory limit too low or leak)

If the Pod can’t even start because the image won’t pull, you’ll see ImagePullBackOff or ErrImagePull.

Check:

  • image name/tag (typos are common)
  • registry auth (imagePullSecrets)
  • node connectivity to registry (network policy, firewall, DNS)
  • architecture mismatch (arm64 vs amd64)
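
A minimal sanity check for the image reference and pull credentials (<pull-secret> is a placeholder for whatever imagePullSecret the Pod references):

kubectl get pod -n <ns> <pod> -o jsonpath='{.spec.containers[*].image}{"\n"}{.spec.imagePullSecrets}{"\n"}'
kubectl get secret -n <ns> <pull-secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d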

Fast sanity checks

  • confirm the container image and pull status
  • confirm ConfigMap/Secret mounts exist
  • check if the app expects a writable filesystem but runs read-only
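
To see which ConfigMaps and Secrets the Pod mounts, and whether they actually exist in the namespace:

kubectl get pod -n <ns> <pod> -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.configMap.name}{.secret.secretName}{"\n"}{end}'
kubectl get configmap,secret -n <ns>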

2.5) InitContainers: “stuck before it starts”

If your Pod has initContainers, the app containers won’t start until initContainers succeed.

Common issues:

  • initContainer tries to reach a dependency blocked by NetworkPolicy
  • initContainer runs migrations and times out
  • initContainer expects permissions that aren’t available

Check initContainer logs:

kubectl logs -n <ns> <pod> -c <init-container>
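
To see which initContainer is stuck and in what state:

kubectl get pod -n <ns> <pod> -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.state}{"\n"}{end}'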

3) Pods run but don’t receive traffic (readiness / endpoints)

If Pods are running but traffic fails, check readiness and endpoints.

Check readiness status

kubectl get pod -n <ns> <pod> -o jsonpath='{.status.containerStatuses[*].ready}{"\n"}'
kubectl describe pod -n <ns> <pod> | rg -n "Readiness probe failed|Liveness probe failed"

Check Service endpoints

kubectl get svc -n <ns>
kubectl get endpoints -n <ns> <svc> -o wide

If endpoints are empty, your selectors may not match Pod labels, or Pods aren’t Ready.
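
A quick way to compare the Service selector against the labels on the Pods it is supposed to select:

kubectl get svc -n <ns> <svc> -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pod -n <ns> --show-labels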

Verify inside the cluster

Port-forward (simple, but it bypasses the Service and kube-proxy, so results aren’t always identical to in-cluster traffic):

kubectl port-forward -n <ns> pod/<pod> 18080:8080
curl -sS -i http://127.0.0.1:18080/readyz

Then verify in-cluster DNS and routing using a debug container:

kubectl debug -n <ns> -it pod/<pod> --image=curlimages/curl:8.5.0
curl -sS -i http://<svc>.<ns>.svc.cluster.local:8080/readyz

4) Networking failures (DNS, egress, ingress)

Networking issues are often misdiagnosed as “application bugs”.

DNS checks

From inside a Pod:

nslookup kubernetes.default.svc.cluster.local
nslookup <svc>.<ns>.svc.cluster.local
cat /etc/resolv.conf

If DNS is failing:

  • check CoreDNS Pods
  • check NetworkPolicy egress rules (DNS must be allowed)
  • check node-level DNS and CNI health
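
In most distributions the CoreDNS Pods carry the k8s-app=kube-dns label, so a quick health check looks like:

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50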

Egress failures

If outbound requests fail:

  • confirm NetworkPolicy egress rules
  • confirm NAT / firewall / cloud routes
  • confirm TLS and SNI settings for external endpoints
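
Listing the policies that actually apply to the namespace is usually faster than reasoning about them from memory (<policy> is a placeholder):

kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy -n <ns> <policy>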

Ingress failures

If traffic from the internet doesn’t arrive:

  • check Ingress/Gateway status and controller logs
  • check Service type and endpoints
  • confirm health checks and readiness
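
A minimal sketch, assuming an Ingress resource and a controller running as a Deployment (all names here are placeholders for your setup):

kubectl get ingress -n <ns> <ingress> -o wide
kubectl describe ingress -n <ns> <ingress>
kubectl logs -n <controller-ns> deploy/<ingress-controller> --tail=100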

5) Node and platform issues (disk, CNI, CSI, kubelet)

Sometimes workloads are healthy but nodes are not.

Disk pressure

Symptoms:

  • Pods evicted
  • image pulls failing
  • kubelet complaining about disk

Check:

kubectl describe node <node> | rg -n "DiskPressure|ImageFS|Eviction"
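
To list all conditions reported by the node, not just disk pressure:

kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'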

CNI problems

Symptoms:

  • Pods can’t reach Services
  • DNS breaks intermittently
  • new Pods stuck in ContainerCreating

Check:

  • CNI daemonset status in kube-system
  • node events for networking errors
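
The DaemonSet name depends on which CNI you run (Calico, Cilium, flannel, a cloud CNI), so list them and scan node events rather than guessing:

kubectl -n kube-system get daemonset -o wide
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp | tail -n 30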

CSI / storage problems

Symptoms:

  • Pods stuck in ContainerCreating
  • volumes failing to attach/mount

Check:

kubectl describe pod -n <ns> <pod> | rg -n "MountVolume|AttachVolume|failed"
kubectl get pvc -n <ns>
kubectl describe pvc -n <ns> <pvc>
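
If the PVC is Bound but mounts still fail, the attach path is worth a look too; CSI driver Pods usually run in kube-system or a vendor namespace:

kubectl get volumeattachment
kubectl get pods -A | rg -n "csi"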

“Don’t make it worse”: safe mitigation principles

During incidents, it’s tempting to apply random changes. A safer approach:

  • roll back the last deployment if the timing matches
  • scale out temporarily if you’re CPU-bound (after confirming)
  • avoid deleting random Pods unless you understand why
  • capture evidence (events/logs) before changing things
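
If the timing points at the last rollout, the standard rollback sequence looks like this (assuming the workload is a Deployment named <app>):

kubectl rollout history deploy/<app> -n <ns>
kubectl rollout undo deploy/<app> -n <ns>
kubectl rollout status deploy/<app> -n <ns>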

Post-incident: turn findings into guardrails

After the incident, convert lessons into automation:

  • missing requests/limits => enforce via policy (LimitRange, admission)
  • recurring OOMKills => tune memory and align runtime settings
  • probe misconfiguration => standardize probes per service type
  • NetworkPolicy surprises => maintain a platform allow-list and staged rollout
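
As one example of a guardrail, a namespace-level LimitRange can default requests and limits for containers that omit them. The values below are illustrative, not recommendations; --dry-run=server lets you validate the object before enforcing anything:

kubectl apply -n <ns> --dry-run=server -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF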

Checklist (printable)

  • Identify the bucket (Pending / CrashLoop / NotReady / Network / Node)
  • Run: get/describe/events/logs/top
  • Validate endpoints and selectors for traffic issues
  • Use in-Pod testing (debug container) for DNS/network questions
  • Prefer rollback over random edits
  • Convert root cause into a guardrail after recovery

FAQ

Q: What is the first place to look? A: kubectl describe and recent events usually point to the root cause fast.

Q: How do I separate app issues from cluster issues? A: Compare other namespaces, check node pressure, and validate service dependencies.

Q: When should I use ephemeral containers? A: When the image lacks tools and you need in-pod inspection without rebuilding.