Debugging CrashLoopBackOff, ImagePullBackOff, Pending Pods, and OOMKilled, plus events, logs, and a systematic debugging methodology.
When something breaks in Kubernetes, the instinct is to start guessing. Resist that. Troubleshooting in Kubernetes is systematic. The cluster tells you exactly what is wrong — you just need to ask the right questions in the right order.
Follow this order every time:
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEMATIC DEBUGGING FLOW │
│ │
│ 1. CHECK STATUS kubectl get <resource> │
│ │ What state is it in? │
│ ▼ │
│ 2. DESCRIBE kubectl describe <resource> │
│ │ What do the Events say? │
│ ▼ │
│ 3. CHECK LOGS kubectl logs <pod> │
│ │ What is the application saying? │
│ ▼ │
│ 4. CHECK NETWORKING Can Pods reach each other? │
│ │ Is DNS working? Endpoints populated? │
│ ▼ │
│ 5. CHECK NODE kubectl describe node │
│ Is the node healthy? │
└─────────────────────────────────────────────────────────────────┘
Work from the outside in. Start with the broadest view (get) and drill down. Most problems reveal themselves by step 2 or 3.
Tip: Before diving deep, always run
kubectl get events --sort-by=.lastTimestamp to see a timeline of everything that happened recently across the cluster. This single command often points you straight to the problem.
Pods are the most common thing that breaks. Here is every Pod failure state, what causes it, and how to diagnose it.
kubectl get pods
│
┌────────────┼──────────────┬──────────────────┐
│ │ │ │
Pending ImagePull CrashLoop Running
BackOff BackOff but not
│ │ │ Ready
│ │ │ │
┌──────┴──────┐ │ ┌──────┴──────┐ │
│ No node │ │ │ App crash │ Readiness
│ fits (CPU/ │ │ │ OOMKilled │ probe
│ memory) │ │ │ Bad cmd/args│ failing
│ PVC not │ │ │ Missing env │
│ bound │ │ │ or config │
│ Taint with │ │ └─────────────┘
│ no tolerate │ │
└─────────────┘ │
Wrong image
No pull secret
Tag missing
A Pod stays Pending when the scheduler cannot place it on any node.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# web-7d4b8c6f5-abc12 0/1 Pending 0 3m
kubectl describe pod web-7d4b8c6f5-abc12
# ...
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning FailedScheduling 3m default-scheduler 0/3 nodes are available:
# 3 Insufficient cpu.
Common causes and fixes:
| Cause | Event Message | Fix |
|---|---|---|
| Not enough CPU/memory | Insufficient cpu or Insufficient memory | Lower resource requests, add nodes, or free up resources |
| nodeSelector mismatch | didn't match Pod's node affinity/selector | Fix the nodeSelector or label the node |
| Taint with no toleration | had taint {key=value:NoSchedule} | Add a toleration to the Pod spec |
| PVC not bound | persistentvolumeclaim "data" not found or unbound | Create the PV/PVC or fix the storageClass |
kubectl describe node node-1 | grep -A 5 "Allocated resources"
# Allocated resources:
# Resource Requests Limits
# -------- -------- ------
# cpu 1900m (95%) 3000m (150%)
# memory 1600Mi (82%) 2400Mi (123%)
# Node is nearly full — that's why the Pod can't schedule
kubectl get pvc
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
# data Pending standard 5m
kubectl describe pvc data
# Events:
# Warning ProvisioningFailed no persistent volumes available for this claim
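For the nodeSelector and taint rows in the table above, the fix lands in the Pod spec. A minimal sketch, assuming a hypothetical taint dedicated=batch:NoSchedule and a node label disktype=ssd (substitute the keys and values your actual events report):

# Pod spec fragment; the label and taint names here are illustrative
spec:
  nodeSelector:
    disktype: ssd                # must match a real label on some node
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"         # matches the taint reported by the scheduler
  containers:
  - name: web
    image: nginx:1.25
    resources:
      requests:
        cpu: "250m"              # lower requests if events say Insufficient cpu
        memory: "256Mi"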
The container runtime cannot pull the image.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# api-5f8b9c7d6-xyz99 0/1 ImagePullBackOff 0 2m
kubectl describe pod api-5f8b9c7d6-xyz99
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Normal Pulling 2m kubelet Pulling image "myregistry.io/api:v2.1"
# Warning Failed 2m kubelet Failed to pull image "myregistry.io/api:v2.1":
# rpc error: code = Unknown desc = Error response from daemon:
# manifest for myregistry.io/api:v2.1 not found
# Warning Failed 2m kubelet Error: ImagePullBackOff
Common causes and fixes:
| Cause | What You'll See | Fix |
|---|---|---|
| Typo in image name | repository does not exist | Fix the image name in the spec |
| Tag doesn't exist | manifest not found | Use a valid tag, check your registry |
| Private registry, no auth | unauthorized: authentication required | Create an imagePullSecret |
| Registry unreachable | dial tcp: lookup ... no such host | Check network/DNS from the node |
# Create the pull secret
kubectl create secret docker-registry regcred \
--docker-server=myregistry.io \
--docker-username=myuser \
--docker-password=mypass
# Reference it in the Pod spec
# spec:
# imagePullSecrets:
# - name: regcred
# containers:
# - name: api
# image: myregistry.io/api:v2.1
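If editing every Pod spec is impractical, the same secret can be attached to the namespace's default ServiceAccount instead, so Pods created with it pull through regcred automatically (a common pattern; adjust the names to your setup):

# Attach the pull secret to the default ServiceAccount
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'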
The container starts, crashes, Kubernetes restarts it, it crashes again, repeat. The backoff delay increases each time (10s, 20s, 40s, ..., up to 5 minutes).
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# web-6c8f9d4e7-lmn01 0/1 CrashLoopBackOff 5 (45s ago) 4m
# Step 1: Check logs from the CURRENT crash attempt
kubectl logs web-6c8f9d4e7-lmn01
# Error: FATAL: database "mydb" does not exist
# Step 2: If the container has already restarted, check the PREVIOUS crash
kubectl logs web-6c8f9d4e7-lmn01 --previous
# Error: FATAL: database "mydb" does not exist
Common causes and fixes:
| Cause | Logs / Events | Fix |
|---|---|---|
| App crashes on start | Application error in logs | Fix the app code or configuration |
| OOMKilled | OOMKilled in describe output, exit code 137 | Increase memory limits or fix the memory leak |
| Wrong command/args | exec: "wrong-cmd": not found | Fix command or args in the spec |
| Missing env var | KeyError: 'DATABASE_URL' | Add the env var via ConfigMap, Secret, or env |
| Missing config file | FileNotFoundError: /etc/app/config.yaml | Mount the ConfigMap/Secret as a volume |
kubectl describe pod web-6c8f9d4e7-lmn01
# Containers:
# app:
# State: Waiting
# Reason: CrashLoopBackOff
# Last State: Terminated
# Reason: OOMKilled ← the container exceeded memory limits
# Exit Code: 137
# Limits:
# memory: 128Mi ← limit is too low
Gotcha: Exit code 137 means the process was killed by SIGKILL (OOM killer). Exit code 1 means the application exited with an error. Exit code 0 means normal exit — but Kubernetes restarts it because it's supposed to keep running (CrashLoopBackOff with exit 0 means the process completed but the Pod's restartPolicy is Always).
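For the OOMKilled case above, the immediate fix is a higher memory limit. A sketch of the container resources (the 256Mi value is an assumption; size it from kubectl top pods or the app's real footprint):

# Container spec fragment; values are illustrative
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"    # was 128Mi, which the container exceeded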
The container cannot start because a referenced ConfigMap or Secret is missing.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# api-5f8b9c7d6-abc12 0/1 CreateContainerConfigError 0 1m
kubectl describe pod api-5f8b9c7d6-abc12
# Events:
# Warning Failed 1m kubelet Error: configmap "app-config" not found
Fix: create the missing ConfigMap or Secret in the same namespace as the Pod.
kubectl get configmaps
# No resources found in default namespace.
# Create the missing ConfigMap
kubectl create configmap app-config --from-literal=DB_HOST=postgres
The Pod is running, but its readiness probe is failing, so it is not added to Service endpoints.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# web-7d4b8c6f5-abc12 0/1 Running 0 3m
# ^^^
# 0 of 1 containers ready
kubectl describe pod web-7d4b8c6f5-abc12
# Conditions:
# Type Status
# Initialized True
# Ready False ← not ready
# ContainersReady False
# PodScheduled True
# Events:
# Warning Unhealthy 30s (x10 over 3m) kubelet Readiness probe failed:
# Get "http://10.1.0.15:8080/healthz": dial tcp 10.1.0.15:8080:
# connect: connection refused
Fix: the application is not listening on the port the probe expects. Check the probe path, port, and the application's startup behavior. Consider adding a startupProbe if the app needs time to initialize.
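A sketch of what that probe configuration might look like, assuming the app really does serve /healthz on port 8080 but needs up to a minute to initialize:

# Container spec fragment; path, port, and timings are assumptions
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 12   # 12 x 5s = up to 60s allowed for startup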
When a node runs out of memory or disk, the kubelet evicts Pods to protect the node.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# web-7d4b8c6f5-abc12 0/1 Evicted 0 30m
kubectl describe pod web-7d4b8c6f5-abc12
# Status: Failed
# Reason: Evicted
# Message: The node was low on resource: memory.
# Check the node
kubectl describe node node-1 | grep -A 3 "Conditions"
# Conditions:
# Type Status Reason
# MemoryPressure True KubeletHasInsufficientMemory
# DiskPressure False KubeletHasNoDiskPressure
Fix: add resource limits to Pods so they can't consume unbounded memory, add more nodes, or clean up disk space.
# Clean up evicted Pods (they linger in Failed status)
kubectl delete pods --field-selector=status.phase=Failed
These are the commands you'll use in every troubleshooting session.
# Quick Pod status
kubectl get pods
kubectl get pods -o wide # includes Node and IP
kubectl get pods -A # all namespaces
# Watch for changes in real time
kubectl get pods -w
# Filter by status
kubectl get pods --field-selector=status.phase!=Running
# The Events section is the most valuable part
kubectl describe pod <name>
kubectl describe svc <name>
kubectl describe node <name>
kubectl describe ingress <name>
# Current container logs
kubectl logs <pod-name>
# Logs from a specific container in a multi-container Pod
kubectl logs <pod-name> -c <container-name>
# Logs from the previous (crashed) instance
kubectl logs <pod-name> --previous
# Follow logs in real time
kubectl logs <pod-name> -f
# Logs from all Pods matching a label
kubectl logs -l app=web --all-containers
# Last 50 lines only
kubectl logs <pod-name> --tail=50
# Logs from the last hour
kubectl logs <pod-name> --since=1h
Tip: When a container is in CrashLoopBackOff,
kubectl logs --previous is essential. The current container might not have any logs yet (it just restarted), but the previous instance's crash output is still available.
# Open a shell
kubectl exec -it <pod-name> -- /bin/sh
# Run a single command
kubectl exec <pod-name> -- cat /etc/config/app.yaml
# In a multi-container Pod, specify the container
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash
# Check if the app is listening on the expected port
kubectl exec <pod-name> -- netstat -tlnp
# Proto Local Address State
# tcp 0.0.0.0:8080 LISTEN
# Check DNS resolution from inside the Pod
kubectl exec <pod-name> -- nslookup kubernetes
# Server: 10.96.0.10
# Name: kubernetes.default.svc.cluster.local
# Address: 10.96.0.1
Some images (distroless, scratch-based) have no shell. Use kubectl debug to attach a debug container to the Pod:
# Attach a debug container to a running Pod
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
# Targeting container "app". If you don't see processes, try --target.
# / # ps aux
# PID USER COMMAND
# 1 root /app
# 12 root sh
# Create a copy of the Pod with a debug container
kubectl debug <pod-name> -it --image=busybox --copy-to=debug-pod
# Debug a node directly
kubectl debug node/node-1 -it --image=busybox
# (drops you into a shell on the node, with the host filesystem at /host)
Gotcha:
kubectl debug with --target shares the process namespace with the target container, so you can see its processes. Without --target, you get an isolated container. The --target flag requires the EphemeralContainers feature (enabled by default since Kubernetes 1.25).
# All recent events, sorted by time
kubectl get events --sort-by=.lastTimestamp
# Events for a specific namespace
kubectl get events -n production --sort-by=.lastTimestamp
# Only Warning events
kubectl get events --field-selector type=Warning
# Watch events in real time
kubectl get events -w
# Sample output
kubectl get events --sort-by=.lastTimestamp
# LAST SEEN TYPE REASON OBJECT MESSAGE
# 30s Warning FailedScheduling pod/web-abc12 0/3 nodes: insufficient cpu
# 1m Warning Unhealthy pod/api-xyz99 Readiness probe failed
# 2m Normal Pulled pod/db-lmn01 Successfully pulled image
# 3m Warning FailedMount pod/app-def34 MountVolume.SetUp failed
Requires the Metrics Server to be installed.
kubectl top pods
# NAME CPU(cores) MEMORY(bytes)
# web-7d4b8c6f5-abc12 15m 64Mi
# api-5f8b9c7d6-xyz99 250m 512Mi ← using a lot of CPU and memory
# db-8e7f6a5b4-lmn01 50m 256Mi
kubectl top pods --sort-by=memory # sort by memory usage
kubectl top pods --sort-by=cpu # sort by CPU usage
kubectl top nodes
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# node-1 950m 47% 2100Mi 68%
# node-2 400m 20% 1200Mi 39%
# node-3 1800m 90% 3500Mi 90% ← this node is under pressure
The most common Service problem: traffic is not reaching your Pods.
┌──────────────────────────────────────────────────────────────┐
│ SERVICE DEBUGGING FLOW │
│ │
│ 1. Check endpoints kubectl get endpoints <svc> │
│ │ │
│ ├─── Empty? → Selector doesn't match Pod labels │
│ │ or Pods aren't Ready │
│ │ │
│ └─── Has IPs → Service can reach Pods. │
│ Problem is elsewhere. │
│ │
│ 2. Check selectors kubectl get svc <svc> -o yaml │
│ │ kubectl get pods --show-labels │
│ │ │
│ 3. Check Pod status kubectl get pods │
│ │ Are they Running AND Ready? │
│ │ │
│ 4. Test from inside kubectl run debug --image=busybox │
│ --rm -it -- wget -qO- http://<svc> │
└──────────────────────────────────────────────────────────────┘
kubectl get endpoints web
# NAME ENDPOINTS AGE
# web <none> 5m
# ^^^^^^^^
# No endpoints — no Pods match the Service selector
# What selector does the Service use?
kubectl get svc web -o jsonpath='{.spec.selector}'
# {"app":"web"}
# What labels do the Pods have?
kubectl get pods --show-labels
# NAME READY STATUS LABELS
# web-7d4b8c6f5-abc12 1/1 Running app=frontend,pod-template-hash=7d4b8c6f5
# ^^^^^^^^^^^^
# Label is "app=frontend", not "app=web" — MISMATCH
Step 3: Fix the mismatch
Either fix the Service selector or the Pod labels so they match.
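For example, assuming app=frontend is the label you actually want to target, you can patch the Service selector in place:

# Point the Service at the real Pod label (app=frontend comes from the example above)
kubectl patch svc web -p '{"spec":{"selector":{"app":"frontend"}}}'
# Verify the endpoints appear
kubectl get endpoints web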
# Spin up a temporary Pod and test
kubectl run debug --image=busybox --rm -it --restart=Never -- wget -qO- http://web
# <html>Welcome to nginx!</html>
# Or test with curl
kubectl run debug --image=curlimages/curl --rm -it --restart=Never -- curl -s http://web
# <html>Welcome to nginx!</html>
If the Service has endpoints but Pods can't reach it by name, DNS is the problem.
# Test DNS resolution from inside a Pod
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup web
# Server: 10.96.0.10
# Address: 10.96.0.10:53
#
# Name: web.default.svc.cluster.local
# Address: 10.96.50.100
# If DNS fails:
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup web
# Server: 10.96.0.10
# Address: 10.96.0.10:53
# ** server can't find web.default.svc.cluster.local: NXDOMAIN
# Check if CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# NAME READY STATUS RESTARTS AGE
# coredns-5d78c9869d-abc12 1/1 Running 0 24h
# coredns-5d78c9869d-def34 1/1 Running 0 24h
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
The three-port confusion causes many Service issues.
# The Service definition
kubectl get svc web -o yaml
# spec:
# ports:
# - port: 80 ← clients connect to Service:80
# targetPort: 3000 ← traffic forwards to Pod:3000
# protocol: TCP
# But the container is listening on 8080, not 3000
kubectl exec deploy/web -- ss -tlnp
# State Local Address:Port
# LISTEN 0.0.0.0:8080 ← app listens on 8080, but targetPort says 3000
Fix: change targetPort to 8080, or reconfigure the app to listen on 3000.
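A one-liner for the first option, assuming the port is the first (index 0) entry in the Service's ports list:

# Repoint targetPort at the port the app actually listens on
kubectl patch svc web --type=json \
  -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":8080}]'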
Gotcha: The
containerPort field in the Pod spec is purely informational. It does NOT control what port the app listens on. If your app listens on 8080, the Service targetPort must be 8080 regardless of what containerPort says.
kubectl get ingress
# NAME CLASS HOSTS ADDRESS PORTS AGE
# app-ingress nginx myapp.local 80 5m
# ^^^^^^^
# Empty — no address assigned
This usually means the Ingress Controller is not running.
# Check for the Ingress Controller
kubectl get pods -n ingress-nginx
# No resources found in ingress-nginx namespace. ← not installed
# Or the controller is crashing
kubectl get pods -n ingress-nginx
# NAME READY STATUS RESTARTS
# ingress-nginx-controller-7d4b8c6f5-abc12 0/1 CrashLoopBackOff 5
# Check controller logs
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller
# 404 — the path doesn't match any rule
# Check the Ingress rules
kubectl describe ingress app-ingress
# Rules:
# Host Path Backends
# ---- ---- --------
# myapp.local
# /api api:80 (10.1.0.20:8080,10.1.0.21:8080)
# / web:80 (<none>)
# ^^^^^^
# No endpoints for "web" Service — will produce 503
# 502 — backend Service exists but the Pod isn't responding
# Check the backend Pods
kubectl get pods -l app=web
# NAME READY STATUS RESTARTS
# web-7d4b8c6f5-abc12 0/1 CrashLoopBackOff 8
# 503 — backend Service has no endpoints
kubectl get endpoints web
# NAME ENDPOINTS AGE
# web <none> 10m
# Check if the TLS Secret exists and is valid
kubectl get secret myapp-tls
# NAME TYPE DATA AGE
# myapp-tls kubernetes.io/tls 2 30d
kubectl describe secret myapp-tls
# Type: kubernetes.io/tls
# Data
# ====
# tls.crt: 1164 bytes
# tls.key: 1704 bytes
# Verify the cert details
kubectl get secret myapp-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
# Validity
# Not Before: Jan 1 00:00:00 2025 GMT
# Not After : Jan 1 00:00:00 2026 GMT ← check if expired
# Subject: CN = myapp.local ← check hostname matches
Tip: If you're using cert-manager, check the Certificate resource status:
kubectl get certificate and kubectl describe certificate <name>. Look at the Ready condition and events.
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# node-1 Ready control-plane 30d v1.29.0
# node-2 Ready <none> 30d v1.29.0
# node-3 NotReady <none> 30d v1.29.0
# ^^^^^^^^
# This node has a problem
kubectl describe node node-3
# Conditions:
# Type Status Reason
# ---- ------ ------
# MemoryPressure False KubeletHasSufficientMemory
# DiskPressure False KubeletHasNoDiskPressure
# PIDPressure False KubeletHasSufficientPID
# Ready False KubeletNotReady
# ^^^^^^^^^^^^^^^^
# kubelet is not responding
Common causes:
| Condition | Meaning | Fix |
|---|---|---|
| Ready = False | kubelet is down or cannot communicate with the API Server | SSH to the node, check systemctl status kubelet |
| MemoryPressure = True | Node is running low on memory | Evict Pods, add memory, set resource limits |
| DiskPressure = True | Node disk is nearly full | Clean up images/logs, expand disk |
| PIDPressure = True | Too many processes on the node | Investigate runaway processes |
# SSH to the node, then:
systemctl status kubelet
# ● kubelet.service - kubelet
# Loaded: loaded
# Active: inactive (dead) ← kubelet is not running
systemctl restart kubelet
journalctl -u kubelet --no-pager -n 50
# Look for errors in the kubelet logs
When a node has problems, safely move workloads off before investigating.
# Cordon — mark node as unschedulable (no new Pods, existing Pods stay)
kubectl cordon node-3
# node/node-3 cordoned
kubectl get nodes
# NAME STATUS ROLES AGE
# node-3 Ready,SchedulingDisabled <none> 30d
# Drain — evict all Pods from the node (respects PodDisruptionBudgets)
kubectl drain node-3 --ignore-daemonsets --delete-emptydir-data
# evicting pod default/web-7d4b8c6f5-abc12
# evicting pod default/api-5f8b9c7d6-xyz99
# node/node-3 drained
# After fixing the node, uncordon it
kubectl uncordon node-3
# node/node-3 uncordoned
Gotcha:
kubectl drain will fail if there are Pods not managed by a controller (standalone Pods). Use --force to evict them, but they won't be rescheduled. Always use Deployments/ReplicaSets so Pods get recreated on other nodes after a drain.
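Because drain respects PodDisruptionBudgets, critical workloads should have one so a drain can never take down every replica at once. A minimal sketch for the web Pods (the name and threshold are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # evictions proceed only while 2 replicas stay up
  selector:
    matchLabels:
      app: web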
kubectl rollout status deployment/web
# Waiting for deployment "web" rollout to finish: 1 out of 3 new replicas
# have been updated...
# (hangs here)
# Check what's happening
kubectl get rs
# NAME DESIRED CURRENT READY AGE
# web-7d4b8c6f5 3 3 3 2h ← old ReplicaSet (healthy)
# web-9a8b7c6d5 1 1 0 5m ← new ReplicaSet (stuck)
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# web-7d4b8c6f5-abc12 1/1 Running 0 2h
# web-7d4b8c6f5-def34 1/1 Running 0 2h
# web-7d4b8c6f5-ghi56 1/1 Running 0 2h
# web-9a8b7c6d5-jkl78 0/1 ImagePullBackOff 0 5m
# ^^^^^^^^^^^^^^^^
# New Pod can't pull image — rollout is stuck
# Check rollout history
kubectl rollout history deployment/web
# REVISION CHANGE-CAUSE
# 1 <none>
# 2 <none>
# Rollback to previous version
kubectl rollout undo deployment/web
# deployment.apps/web rolled back
# Rollback to a specific revision
kubectl rollout undo deployment/web --to-revision=1
# Verify
kubectl rollout status deployment/web
# deployment "web" successfully rolled out
By default, a Deployment waits 600 seconds (10 minutes) for progress. If no new Pods become ready in that time, the rollout is marked as failed.
kubectl describe deployment web
# Conditions:
# Type Status Reason
# Available True MinimumReplicasAvailable
# Progressing False ProgressDeadlineExceeded
# ^^^^^^^^^^^^^^^^^^^^^^^^^^
# Rollout timed out
The Deployment doesn't automatically roll back. You must manually undo it or fix the issue.
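The 600-second window is tunable per Deployment via progressDeadlineSeconds; the 120 here is just an example value:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  progressDeadlineSeconds: 120   # mark the rollout as failed after 2 minutes
  # ...rest of the spec unchanged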
┌─────────────────────────────────────────────────────────────┐
│ NETWORK TROUBLESHOOTING LAYERS │
│ │
│ Layer 1: Pod-to-Pod │
│ Can Pods reach each other by IP? │
│ → Check CNI plugin (Calico, Cilium, Flannel) │
│ │
│ Layer 2: Pod-to-Service │
│ Can Pods reach Services by DNS name? │
│ → Check CoreDNS, endpoints, kube-proxy │
│ │
│ Layer 3: External-to-Service │
│ Can outside traffic reach the cluster? │
│ → Check NodePort, LoadBalancer, Ingress │
└─────────────────────────────────────────────────────────────┘
The nicolaka/netshoot image has every networking tool you'll need:
# Launch a debug Pod with full networking tools
kubectl run tmp --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Inside the debug Pod:
# Test Pod-to-Pod connectivity
ping 10.1.0.15
# PING 10.1.0.15: 56 data bytes
# 64 bytes from 10.1.0.15: seq=0 ttl=62 time=0.5 ms
# Test Service DNS
nslookup web.default.svc.cluster.local
# Server: 10.96.0.10
# Address: 10.96.0.10#53
# Name: web.default.svc.cluster.local
# Address: 10.96.50.100
# Test HTTP to a Service
curl -s http://web.default.svc.cluster.local
# <html>Welcome to nginx!</html>
# Test specific port
nc -zv web 80
# web (10.96.50.100:80) open
# Trace the network path
traceroute 10.1.0.15
# Check DNS resolver config
cat /etc/resolv.conf
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
If Pods can't reach each other but the network layer is healthy, a NetworkPolicy might be blocking traffic.
# Check if any NetworkPolicies exist
kubectl get networkpolicies
# NAME POD-SELECTOR AGE
# deny-all <none> 1h
# allow-web app=web 1h
kubectl describe networkpolicy deny-all
# Spec:
# PodSelector: <none> (Coverage: all pods in the namespace)
# Allowing ingress traffic: <none> (Selected pods are isolated for ingress)
# Allowing egress traffic: <none> (Selected pods are isolated for egress)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# This policy blocks ALL traffic to ALL Pods in the namespace
Tip: NetworkPolicies are additive — if ANY policy selects a Pod, only explicitly allowed traffic gets through. A
deny-all policy plus an allow-web policy means only traffic matching allow-web reaches the selected Pods. If you can't figure out why traffic is blocked, list ALL policies and check which ones select your Pod.
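For reference, a minimal allow policy like the allow-web above might look like this (the frontend label and the port are assumptions for illustration):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web
spec:
  podSelector:
    matchLabels:
      app: web               # which Pods this policy applies to
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend      # only these Pods may connect
    ports:
    - protocol: TCP
      port: 80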
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# api-5f8b9c7d6-abc12 0/1 Pending 0 2m
kubectl describe pod api-5f8b9c7d6-abc12
# Events:
# Warning FailedCreate 2m replicaset-controller Error creating: pods "api-..."
# is forbidden: exceeded quota: compute-quota, requested: cpu=500m,
# used: cpu=1500m, limited: cpu=2000m
kubectl describe quota compute-quota
# Name: compute-quota
# Namespace: default
# Resource Used Hard
# -------- ---- ----
# cpu 1500m 2000m ← only 500m left, but Pod requests more
# memory 2Gi 4Gi
# pods 5 10
Fix: reduce the Pod's resource requests, increase the quota, or delete unused Pods.
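If raising the quota is the right call and you have permission to edit it, a patch works; the new value here is illustrative:

# Raise the CPU quota from 2000m to 3000m
kubectl patch resourcequota compute-quota \
  -p '{"spec":{"hard":{"cpu":"3"}}}'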
kubectl describe node node-1
# Capacity:
# cpu: 4
# memory: 8Gi
# Allocatable:
# cpu: 3800m ← usable by Pods (reserves some for system)
# memory: 7600Mi
# Allocated resources:
# Resource Requests Limits
# -------- -------- ------
# cpu 3200m (84%) 6000m (157%)
# memory 5120Mi (67%) 8192Mi (107%)
# ^^^^^^^^^^^
# 84% of CPU is already requested — not much room
Gotcha: Kubernetes schedules based on requests, not limits. If all node CPU requests add up to 3800m, no more Pods can schedule even if actual usage is only 20%. Set requests to what the app actually needs, not worst-case limits.
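In practice that means requests sized to typical usage and limits sized to acceptable bursts. The numbers below are illustrative only; base yours on kubectl top pods:

# Container spec fragment; values are examples, not recommendations
resources:
  requests:
    cpu: "100m"        # typical usage: this is what the scheduler reserves
    memory: "256Mi"
  limits:
    cpu: "500m"        # burst ceiling: does not affect scheduling
    memory: "512Mi"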
When everything looks right but something isn't working, check this list:
The number one cause of "Service not reaching Pods" and "Deployment not managing Pods."
# Service selector
kubectl get svc web -o jsonpath='{.spec.selector}' | jq .
# { "app": "web" }
# Deployment Pod template labels
kubectl get deploy web -o jsonpath='{.spec.template.metadata.labels}' | jq .
# { "app": "web", "version": "v2" }
# Actual Pod labels
kubectl get pods --show-labels
# NAME LABELS
# web-7d4b8c6f5-abc12 app=web,version=v2,pod-template-hash=7d4b8c6f5
All three must agree. The Service selector must be a subset of the Pod labels.
# Service
spec:
ports:
- port: 80 # ← what clients connect to
targetPort: 8080 # ← must match what the container listens on
---
# Pod
spec:
containers:
- ports:
- containerPort: 8080 # ← informational, but should match reality
If targetPort doesn't match the actual port the app listens on, traffic will be sent to a port nothing is listening on.
# This shows Pods in the "default" namespace
kubectl get pods
# No resources found in default namespace.
# Your Pods are in another namespace
kubectl get pods -n production
# NAME READY STATUS RESTARTS AGE
# web-7d4b8c6f5-abc12 1/1 Running 0 2h
Tip: Set your default namespace to avoid this:
kubectl config set-context --current --namespace=production. Or use kubens for quick namespace switching.
ConfigMaps and Secrets are namespaced. A Pod in production cannot reference a ConfigMap in default.
# Pod is in "production" namespace
kubectl get pods -n production
# api-5f8b9c7d6-abc12 0/1 CreateContainerConfigError
kubectl describe pod -n production api-5f8b9c7d6-abc12
# Warning Failed configmap "app-config" not found
# The ConfigMap exists, but in the wrong namespace
kubectl get configmap app-config -n default
# NAME DATA AGE
# app-config 3 1h
# Fix: copy it to the right namespace (strip server-set fields first,
# or the apply may be rejected on create)
kubectl get configmap app-config -n default -o json | \
  jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp) | .metadata.namespace = "production"' | \
  kubectl apply -f -
kubectl get pvc
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# data Bound pv-001 10Gi RWO standard
# RWO = ReadWriteOnce — only one node can mount it
# If you schedule a second Pod on a DIFFERENT node that uses this PVC:
kubectl describe pod db-replica-xyz
# Events:
# Warning FailedAttachVolume Multi-Attach error for volume "pv-001":
# volume is already exclusively attached to one node
Fix: use ReadWriteMany (RWX) if multiple Pods on different nodes need the volume, or ensure all Pods using RWO volumes schedule on the same node.
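A sketch of an RWX claim, assuming your StorageClass (the nfs-client name here is hypothetical) actually supports ReadWriteMany:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # illustrative name
spec:
  accessModes:
  - ReadWriteMany                # mountable by Pods on multiple nodes
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client   # must be an RWX-capable provisioner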
The best way to learn troubleshooting is to intentionally break things and diagnose them.
kubectl create deployment broken-image --image=nginx:doesnotexist
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# broken-image-6b8c9d7e5-abc12 0/1 ImagePullBackOff 0 1m
kubectl describe pod broken-image-6b8c9d7e5-abc12
# Events:
# Warning Failed manifest for nginx:doesnotexist not found
# Fix
kubectl set image deployment/broken-image nginx=nginx:1.25
# Verify
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# broken-image-7f9a8b6c4-def34 1/1 Running 0 30s
# Clean up
kubectl delete deployment broken-image
kubectl create deployment myapp --image=nginx:1.25
kubectl expose deployment myapp --port=80 --target-port=80
# Now break the selector
kubectl patch svc myapp -p '{"spec":{"selector":{"app":"wrong-name"}}}'
kubectl get endpoints myapp
# NAME ENDPOINTS AGE
# myapp <none> 2m ← no endpoints
kubectl get svc myapp -o jsonpath='{.spec.selector}'
# {"app":"wrong-name"}
kubectl get pods --show-labels
# NAME LABELS
# myapp-7d4b8c6f5-abc12 app=myapp,... ← label is "app=myapp"
# Fix — selector should be "app=myapp"
kubectl patch svc myapp -p '{"spec":{"selector":{"app":"myapp"}}}'
kubectl get endpoints myapp
# NAME ENDPOINTS AGE
# myapp 10.1.0.15:80 3m ← now has an endpoint
# Clean up
kubectl delete deployment myapp
kubectl delete svc myapp
# Note: --overrides requires a valid apiVersion field in the JSON
kubectl run greedy --image=nginx:1.25 --restart=Never \
  --overrides='{
    "apiVersion": "v1",
    "spec": {
      "containers": [{
        "name": "greedy",
        "image": "nginx:1.25",
        "resources": {
          "requests": {
            "cpu": "100",
            "memory": "500Gi"
          }
        }
      }]
    }
  }'
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# greedy 0/1 Pending 0 2m
kubectl describe pod greedy
# Events:
# Warning FailedScheduling 0/3 nodes are available:
# 3 Insufficient cpu, 3 Insufficient memory.
# Fix — delete and recreate with reasonable requests
kubectl delete pod greedy
kubectl run greedy --image=nginx:1.25 --restart=Never
# Verify
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# greedy 1/1 Running 0 15s
# Clean up
kubectl delete pod greedy
- CrashLoopBackOff: check logs with kubectl logs --previous, look for OOMKilled (exit code 137), verify command/args and environment
- Service not reaching Pods: compare svc.spec.selector with Pod labels
- DNS failures: nslookup from inside a Pod, check CoreDNS Pod status
- Node NotReady: systemctl status kubelet and journalctl -u kubelet on the node
- Stuck rollout: kubectl rollout undo to revert
- Pod forbidden by quota: kubectl describe quota
- No shell in the image: kubectl debug for distroless images that have no shell