Kubernetes Tip: A Production Troubleshooting Playbook
A structured workflow for diagnosing Pending pods, CrashLoopBackOff, traffic failures, and node-level issues—without guessing.
When Kubernetes “breaks”, the fastest teams aren’t the ones with magical intuition—they’re the ones with a repeatable workflow.
This playbook is designed to be used during incidents. It’s intentionally systematic: start broad, narrow down, confirm hypotheses with evidence, and only then apply fixes.
First principles: classify the failure mode
Most production problems fall into a few buckets:
- Scheduling / Pending: Pods can’t be placed on nodes.
- Startup / CrashLoopBackOff: Pods start and crash repeatedly.
- Readiness / Not receiving traffic: Pods run, but Services don’t route to them.
- Networking: DNS, egress, ingress, or NetworkPolicy issues.
- Node / platform: disk pressure, CNI/CSI problems, kubelet/runtime issues.
Start by identifying which bucket you’re in.
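A quick way to see which bucket you're likely in is to list everything that isn't healthy across the cluster. The grep pattern below is a rough heuristic, not an official status filter:
kubectl get pods -A --field-selector=status.phase=Pending
kubectl get pods -A | grep -vE 'Running|Completed'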
A high-level flow (Mermaid)
flowchart TD
  A[Symptom observed] --> B{Pod Pending?}
  B -- yes --> C[Check events + scheduling constraints]
  B -- no --> D{CrashLoopBackOff?}
  D -- yes --> E[Check logs, exit code, resources, config]
  D -- no --> F{Ready?}
  F -- no --> G[Readiness probe, endpoints, dependencies]
  F -- yes --> H{Traffic failing?}
  H -- yes --> I[Ingress, Service, DNS, NetworkPolicy]
  H -- no --> J[Node / platform diagnostics]
The 5 commands you run first
These give you the “shape” of the incident:
kubectl get pod -n <ns> -o wide
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns> --sort-by=.lastTimestamp
kubectl logs -n <ns> <pod> -c <container> --previous
kubectl top pod -n <ns> --containers
If you don’t have metrics-server for kubectl top, rely on your monitoring system, but still do the rest.
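If you want to capture these snapshots as evidence before changing anything (useful for the post-incident review), here is a minimal sketch, assuming NS and POD are placeholders you fill in:
NS=<ns>; POD=<pod>
OUT=incident-$(date +%s); mkdir -p "$OUT"
kubectl get pod -n "$NS" -o wide > "$OUT/pods.txt"
kubectl describe pod -n "$NS" "$POD" > "$OUT/describe.txt"
kubectl get events -n "$NS" --sort-by=.lastTimestamp > "$OUT/events.txt"
kubectl logs -n "$NS" "$POD" --all-containers --previous > "$OUT/logs-previous.txt" 2>&1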
1) Pods are Pending (scheduling issues)
When a Pod is Pending, it usually means no node currently satisfies its scheduling constraints.
Typical causes
- requests too high (Insufficient cpu/memory)
- node selectors / affinity too strict
- taints not tolerated
- topology spread constraints that can’t be satisfied
- quota/limit errors in the namespace
What to check
Look for FailedScheduling in events:
kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "FailedScheduling|Insufficient|taint|affinity"
Then verify node capacity and allocatable:
kubectl describe node <node> | rg -n "Allocatable|Allocated resources|Taints"
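If the events point at quota rather than node capacity, also check the namespace's ResourceQuota and LimitRange objects, and what the Pod is actually requesting:
kubectl get resourcequota -n <ns>
kubectl describe resourcequota -n <ns>
kubectl get limitrange -n <ns>
kubectl get pod -n <ns> <pod> -o jsonpath='{.spec.containers[*].resources.requests}{"\n"}'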
Fix approach (order matters)
- Confirm requests are realistic (don’t blindly reduce).
- If constraints are too strict, relax them (carefully).
- If capacity is truly insufficient, scale node pools or reduce workload footprint.
2) CrashLoopBackOff (startup failures)
CrashLoopBackOff means the container exits repeatedly and Kubernetes backs off restarts.
What to look for
- Container exit code and reason:
kubectl describe pod -n <ns> <pod> | rg -n "Reason|Exit Code|OOMKilled|Error"
- Previous logs (critical):
kubectl logs -n <ns> <pod> -c <container> --previous
Common root causes
- misconfigured env vars / missing secrets
- wrong command/args
- image mismatch (wrong tag, wrong arch)
- dependency connection failures (DB URL, TLS)
- OOMKills (memory limit too low or leak)
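To confirm an OOMKill without scrolling through describe output, you can read the last termination state directly with a jsonpath query. Exit code 137 usually means the container was killed (often OOM); 1 is a generic application error:
kubectl get pod -n <ns> <pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exitCode="}{.lastState.terminated.exitCode}{"\n"}{end}'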
Related bucket: ImagePullBackOff / ErrImagePull
If the Pod can’t even start because the image won’t pull, you’ll see ImagePullBackOff or ErrImagePull.
Check:
- image name/tag (typos are common)
- registry auth (imagePullSecrets)
- node connectivity to registry (network policy, firewall, DNS)
- architecture mismatch (arm64 vs amd64)
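To see exactly which image reference the Pod is trying to pull and which pull secrets are attached, a quick sketch (<pull-secret> is whatever secret name the previous command returns):
kubectl get pod -n <ns> <pod> -o jsonpath='{.spec.containers[*].image}{"\n"}'
kubectl get pod -n <ns> <pod> -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
kubectl get secret -n <ns> <pull-secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d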
Fast sanity checks
- confirm the container image and pull status
- confirm ConfigMap/Secret mounts exist
- check if the app expects a writable filesystem but runs read-only
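For the writable-filesystem question specifically, you can read the securityContext straight off the Pod spec; an empty value means the field isn't set (defaults to writable):
kubectl get pod -n <ns> <pod> -o jsonpath='{range .spec.containers[*]}{.name}{": readOnlyRootFilesystem="}{.securityContext.readOnlyRootFilesystem}{"\n"}{end}'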
2.5) InitContainers: “stuck before it starts”
If your Pod has initContainers, the app containers won’t start until initContainers succeed.
Common issues:
- initContainer tries to reach a dependency blocked by NetworkPolicy
- initContainer runs migrations and times out
- initContainer expects permissions that aren’t available
Check initContainer logs:
kubectl logs -n <ns> <pod> -c <init-container>
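To see which initContainer is the one that's stuck, list their statuses in one shot:
kubectl get pod -n <ns> <pod> -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": ready="}{.ready}{" restarts="}{.restartCount}{"\n"}{end}'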
3) Pods run but don’t receive traffic (readiness / endpoints)
If Pods are running but traffic fails, check readiness and endpoints.
Check readiness status
kubectl get pod -n <ns> <pod> -o jsonpath='{.status.containerStatuses[*].ready}{"\n"}'
kubectl describe pod -n <ns> <pod> | rg -n "Readiness probe failed|Liveness probe failed"
Check Service endpoints
kubectl get svc -n <ns>
kubectl get endpoints -n <ns> <svc> -o wide
If endpoints are empty, your selectors may not match Pod labels, or Pods aren’t Ready.
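To compare the Service selector with the actual Pod labels side by side (a common source of empty endpoints):
kubectl get svc -n <ns> <svc> -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pods -n <ns> --show-labels
On newer clusters you can also inspect EndpointSlices: kubectl get endpointslices -n <ns>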
Verify inside the cluster
Port-forward (simple, but not always identical to in-cluster):
kubectl port-forward -n <ns> pod/<pod> 18080:8080
curl -sS -i http://127.0.0.1:18080/readyz
Then verify in-cluster DNS and routing using a debug container:
kubectl debug -n <ns> -it pod/<pod> --image=curlimages/curl:8.5.0
curl -sS -i http://<svc>.<ns>.svc.cluster.local:8080/readyz
4) Networking failures (DNS, egress, ingress)
Networking issues are often misdiagnosed as “application bugs”.
DNS checks
From inside a Pod:
nslookup kubernetes.default.svc.cluster.local
nslookup <svc>.<ns>.svc.cluster.local
cat /etc/resolv.conf
If DNS is failing:
- check CoreDNS Pods
- check NetworkPolicy egress rules (DNS must be allowed)
- check node-level DNS and CNI health
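A minimal sketch for the CoreDNS side, assuming the common k8s-app=kube-dns label and the coredns ConfigMap name (some distributions label and name these differently):
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
kubectl get configmap coredns -n kube-system -o yaml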
Egress failures
If outbound requests fail:
- confirm NetworkPolicy egress rules
- confirm NAT / firewall / cloud routes
- confirm TLS and SNI settings for external endpoints
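To see which NetworkPolicies actually select your Pod, list them and compare their podSelector against the Pod's labels:
kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy -n <ns> <policy>
kubectl get pod -n <ns> <pod> --show-labels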
Ingress failures
If traffic from the internet doesn’t arrive:
- check Ingress/Gateway status and controller logs
- check Service type and endpoints
- confirm health checks and readiness
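A quick sketch for the Ingress side; the controller namespace and labels vary by installation (ingress-nginx shown here as an assumption):
kubectl get ingress -n <ns>
kubectl describe ingress -n <ns> <ingress>
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100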
5) Node and platform issues (disk, CNI, CSI, kubelet)
Sometimes workloads are healthy but nodes are not.
Disk pressure
Symptoms:
- Pods evicted
- image pulls failing
- kubelet complaining about disk
Check:
kubectl describe node <node> | rg -n "DiskPressure|ImageFS|Eviction"
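You can also read the node conditions directly, and, on clusters that allow privileged node debug pods, open a shell on the node to check disk usage (the node filesystem is mounted at /host inside the debug pod):
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
kubectl debug node/<node> -it --image=busybox
# inside the debug pod:
df -h /host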
CNI problems
Symptoms:
- Pods can’t reach Services
- DNS breaks intermittently
- new Pods stuck in ContainerCreating
Check:
- CNI daemonset status in kube-system
- node events for networking errors
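Concretely, something like the following (the daemonset name depends on your CNI, e.g. calico-node, cilium, aws-node):
kubectl get daemonset -n kube-system
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp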
CSI / storage problems
Symptoms:
- Pods stuck in ContainerCreating
- volumes failing to attach/mount
Check:
kubectl describe pod -n <ns> <pod> | rg -n "MountVolume|AttachVolume|failed"
kubectl get pvc -n <ns>
kubectl describe pvc -n <ns> <pvc>
“Don’t make it worse”: safe mitigation principles
During incidents, it’s tempting to apply random changes. A safer approach:
- roll back the last deployment if the timing matches
- scale out temporarily if you’re CPU-bound (after confirming)
- avoid deleting random Pods unless you understand why
- capture evidence (events/logs) before changing things
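Rolling back is usually the lowest-risk mitigation when the timing lines up with a deploy; a minimal sketch for a Deployment:
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout undo deployment/<name> -n <ns>
kubectl rollout status deployment/<name> -n <ns>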
Post-incident: turn findings into guardrails
After the incident, convert lessons into automation:
- missing requests/limits => enforce via policy (LimitRange, admission)
- recurring OOMKills => tune memory and align runtime settings
- probe misconfiguration => standardize probes per service type
- NetworkPolicy surprises => maintain a platform allow-list and staged rollout
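As one example of a guardrail, a namespace-level LimitRange that applies defaults to containers that forget to set requests/limits; the values below are illustrative, tune them to your workloads:
kubectl apply -n <ns> -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF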
Checklist (printable)
- Identify the bucket (Pending / CrashLoop / NotReady / Network / Node)
- Run: get/describe/events/logs/top
- Validate endpoints and selectors for traffic issues
- Use in-Pod testing (debug container) for DNS/network questions
- Prefer rollback over random edits
- Convert root cause into a guardrail after recovery
FAQ
Q: What is the first place to look?
A: kubectl describe and recent events usually point to the root cause fast.
Q: How do I separate app issues from cluster issues?
A: Compare other namespaces, check node pressure, and validate service dependencies.
Q: When should I use ephemeral containers?
A: When the image lacks tools and you need in-pod inspection without rebuilding the image.