
Kubernetes Tip: A Production Troubleshooting Playbook

A structured workflow for diagnosing Pending pods, CrashLoopBackOff, traffic failures, and node-level issues—without guessing.

When Kubernetes “breaks”, the fastest teams aren’t the ones with magical intuition—they’re the ones with a repeatable workflow.

This playbook is designed to be used during incidents. It’s intentionally systematic: start broad, narrow down, confirm hypotheses with evidence, and only then apply fixes.

First principles: classify the failure mode

Most production problems fall into a few buckets:

  1. Scheduling / Pending: Pods can’t be placed on nodes.
  2. Startup / CrashLoopBackOff: Pods start and crash repeatedly.
  3. Readiness / Not receiving traffic: Pods run, but Services don’t route to them.
  4. Networking: DNS, egress, ingress, or NetworkPolicy issues.
  5. Node / platform: disk pressure, CNI/CSI problems, kubelet/runtime issues.

Start by identifying which bucket you’re in.

A high-level flow (Mermaid)

flowchart TD
  A[Symptom observed] --> B{Pod Pending?}
  B -- yes --> C[Check events + scheduling constraints]
  B -- no --> D{CrashLoopBackOff?}
  D -- yes --> E[Check logs, exit code, resources, config]
  D -- no --> F{Ready?}
  F -- no --> G[Readiness probe, endpoints, dependencies]
  F -- yes --> H{Traffic failing?}
  H -- yes --> I[Ingress, Service, DNS, NetworkPolicy]
  H -- no --> J[Node / platform diagnostics]

The 5 commands you run first

These give you the “shape” of the incident:

kubectl get pod -n <ns> -o wide
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns> --sort-by=.lastTimestamp
kubectl logs -n <ns> <pod> -c <container> --previous
kubectl top pod -n <ns> --containers

If metrics-server isn’t installed, kubectl top won’t return data; lean on your monitoring system instead, but still run the other four commands.

1) Pods are Pending (scheduling issues)

When a Pod is Pending, it usually means no node satisfied the constraints.

Typical causes

  • requests too high (Insufficient cpu/memory)
  • node selectors / affinity too strict
  • taints not tolerated
  • topology spread constraints that can’t be satisfied
  • ResourceQuota or LimitRange restrictions in the namespace

What to check

Look for FailedScheduling in events:

kubectl get events -n <ns> --sort-by=.lastTimestamp | rg -n "FailedScheduling|Insufficient|taint|affinity"

Then verify node capacity and allocatable:

kubectl describe node <node> | rg -n "Allocatable|Allocated resources|Taints"

Fix approach (order matters)

  1. Confirm requests are realistic (don’t blindly reduce).
  2. If constraints are too strict, relax them (carefully).
  3. If capacity is truly insufficient, scale node pools or reduce workload footprint.
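
To test the hypothesis before changing anything, compare what the Pod requests with what nodes actually offer (same placeholders as above; the jsonpath output format can vary slightly by kubectl version):

kubectl get pod -n <ns> <pod> -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'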

2) CrashLoopBackOff (startup failures)

CrashLoopBackOff means the container exits repeatedly and Kubernetes backs off restarts.

What to look for

  1. Container exit code and reason:
kubectl describe pod -n <ns> <pod> | rg -n "Reason|Exit Code|OOMKilled|Error"
  2. Previous logs (critical):
kubectl logs -n <ns> <pod> -c <container> --previous
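
If you want both in one shot, a jsonpath query pulls the last exit code and reason for every container (placeholders as above):

kubectl get pod -n <ns> <pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

Exit code 137 with reason OOMKilled points at memory limits; non-zero codes like 1 or 2 usually mean the application itself failed on startup.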

Common root causes

  • misconfigured env vars / missing secrets
  • wrong command/args
  • image mismatch (wrong tag, wrong arch)
  • dependency connection failures (DB URL, TLS)
  • OOMKills (memory limit too low or leak)

If the Pod can’t even start because the image won’t pull, you’ll see ImagePullBackOff or ErrImagePull.

Check:

  • image name/tag (typos are common)
  • registry auth (imagePullSecrets)
  • node connectivity to registry (network policy, firewall, DNS)
  • architecture mismatch (arm64 vs amd64)
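
A minimal sanity check for the image reference and pull credentials (<pull-secret> is a placeholder for whatever imagePullSecret the Pod references):

kubectl get pod -n <ns> <pod> -o jsonpath='{.spec.containers[*].image}{"\n"}{.spec.imagePullSecrets}{"\n"}'
kubectl get secret -n <ns> <pull-secret> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d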

Fast sanity checks

  • confirm the container image and pull status
  • confirm ConfigMap/Secret mounts exist
  • check if the app expects a writable filesystem but runs read-only
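
To see which ConfigMaps and Secrets the Pod mounts, and whether they actually exist in the namespace:

kubectl get pod -n <ns> <pod> -o jsonpath='{range .spec.volumes[*]}{.name}{"\t"}{.configMap.name}{.secret.secretName}{"\n"}{end}'
kubectl get configmap,secret -n <ns>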

2.5) InitContainers: “stuck before it starts”

If your Pod has initContainers, the app containers won’t start until initContainers succeed.

Common issues:

  • initContainer tries to reach a dependency blocked by NetworkPolicy
  • initContainer runs migrations and times out
  • initContainer expects permissions that aren’t available

Check initContainer logs:

kubectl logs -n <ns> <pod> -c <init-container>
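
To see which initContainer is stuck and in what state:

kubectl get pod -n <ns> <pod> -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.state}{"\n"}{end}'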

3) Pods run but don’t receive traffic (readiness / endpoints)

If Pods are running but traffic fails, check readiness and endpoints.

Check readiness status

kubectl get pod -n <ns> <pod> -o jsonpath='{.status.containerStatuses[*].ready}{"\n"}'
kubectl describe pod -n <ns> <pod> | rg -n "Readiness probe failed|Liveness probe failed"

Check Service endpoints

kubectl get svc -n <ns>
kubectl get endpoints -n <ns> <svc> -o wide

If endpoints are empty, your selectors may not match Pod labels, or Pods aren’t Ready.
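
A quick way to compare the Service selector against the labels on the Pods it is supposed to select:

kubectl get svc -n <ns> <svc> -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pod -n <ns> --show-labels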

Verify inside the cluster

Port-forward (simple, but it bypasses the Service and kube-proxy, so results aren’t always identical to in-cluster traffic):

kubectl port-forward -n <ns> pod/<pod> 18080:8080
curl -sS -i http://127.0.0.1:18080/readyz

Then verify in-cluster DNS and routing using a debug container:

kubectl debug -n <ns> -it pod/<pod> --image=curlimages/curl:8.5.0
curl -sS -i http://<svc>.<ns>.svc.cluster.local:8080/readyz

4) Networking failures (DNS, egress, ingress)

Networking issues are often misdiagnosed as “application bugs”.

DNS checks

From inside a Pod:

nslookup kubernetes.default.svc.cluster.local
nslookup <svc>.<ns>.svc.cluster.local
cat /etc/resolv.conf

If DNS is failing:

  • check CoreDNS Pods
  • check NetworkPolicy egress rules (DNS must be allowed)
  • check node-level DNS and CNI health
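
In most distributions the CoreDNS Pods carry the k8s-app=kube-dns label, so a quick health check looks like:

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50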

Egress failures

If outbound requests fail:

  • confirm NetworkPolicy egress rules
  • confirm NAT / firewall / cloud routes
  • confirm TLS and SNI settings for external endpoints
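
Listing the policies that actually apply to the namespace is usually faster than reasoning about them from memory (<policy> is a placeholder):

kubectl get networkpolicy -n <ns>
kubectl describe networkpolicy -n <ns> <policy>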

Ingress failures

If traffic from the internet doesn’t arrive:

  • check Ingress/Gateway status and controller logs
  • check Service type and endpoints
  • confirm health checks and readiness
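
A minimal sketch, assuming an Ingress resource and a controller running as a Deployment (all names here are placeholders for your setup):

kubectl get ingress -n <ns> <ingress> -o wide
kubectl describe ingress -n <ns> <ingress>
kubectl logs -n <controller-ns> deploy/<ingress-controller> --tail=100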

5) Node and platform issues (disk, CNI, CSI, kubelet)

Sometimes workloads are healthy but nodes are not.

Disk pressure

Symptoms:

  • Pods evicted
  • image pulls failing
  • kubelet complaining about disk

Check:

kubectl describe node <node> | rg -n "DiskPressure|ImageFS|Eviction"
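
To list all conditions reported by the node, not just disk pressure:

kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'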

CNI problems

Symptoms:

  • Pods can’t reach Services
  • DNS breaks intermittently
  • new Pods stuck in ContainerCreating

Check:

  • CNI daemonset status in kube-system
  • node events for networking errors
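
The DaemonSet name depends on which CNI you run (Calico, Cilium, flannel, a cloud CNI), so list them and scan node events rather than guessing:

kubectl -n kube-system get daemonset -o wide
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp | tail -n 30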

CSI / storage problems

Symptoms:

  • Pods stuck in ContainerCreating
  • volumes failing to attach/mount

Check:

kubectl describe pod -n <ns> <pod> | rg -n "MountVolume|AttachVolume|failed"
kubectl get pvc -n <ns>
kubectl describe pvc -n <ns> <pvc>
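
If the PVC is Bound but mounts still fail, the attach path is worth a look too; CSI driver Pods usually run in kube-system or a vendor namespace:

kubectl get volumeattachment
kubectl get pods -A | rg -n "csi"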

“Don’t make it worse”: safe mitigation principles

During incidents, it’s tempting to apply random changes. A safer approach:

  • roll back the last deployment if the timing matches
  • scale out temporarily if you’re CPU-bound (after confirming)
  • avoid deleting random Pods unless you understand why
  • capture evidence (events/logs) before changing things
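
If the timing points at the last rollout, the standard rollback sequence looks like this (assuming the workload is a Deployment named <app>):

kubectl rollout history deploy/<app> -n <ns>
kubectl rollout undo deploy/<app> -n <ns>
kubectl rollout status deploy/<app> -n <ns>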

Post-incident: turn findings into guardrails

After the incident, convert lessons into automation:

  • missing requests/limits => enforce via policy (LimitRange, admission)
  • recurring OOMKills => tune memory and align runtime settings
  • probe misconfiguration => standardize probes per service type
  • NetworkPolicy surprises => maintain a platform allow-list and staged rollout
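
As one example of a guardrail, a namespace-level LimitRange can default requests and limits for containers that omit them. The values below are illustrative, not recommendations; --dry-run=server lets you validate the object before enforcing anything:

kubectl apply -n <ns> --dry-run=server -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF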

Checklist (printable)

  • Identify the bucket (Pending / CrashLoop / NotReady / Network / Node)
  • Run: get/describe/events/logs/top
  • Validate endpoints and selectors for traffic issues
  • Use in-Pod testing (debug container) for DNS/network questions
  • Prefer rollback over random edits
  • Convert root cause into a guardrail after recovery

FAQ

Q: What is the first place to look? A: kubectl describe and recent events usually point to the root cause fast.

Q: How do I separate app issues from cluster issues? A: Compare other namespaces, check node pressure, and validate service dependencies.

Q: When should I use ephemeral containers? A: When the image lacks tools and you need in-pod inspection without rebuilding.