kubeadm, managed K8s (EKS, GKE, AKS), version upgrades, etcd backup/restore, node maintenance, and cluster federation.
Running a Kubernetes cluster means more than deploying workloads. You need to manage nodes, control where Pods land, perform upgrades without downtime, and back up cluster state. This module covers the operational side of Kubernetes — the skills that separate someone who uses K8s from someone who runs it.
Cluster Management
┌─────────────────────────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node Mgmt │ │ Scheduling │ │ Maintenance │ │
│ │ │ │ Controls │ │ │ │
│ │ labels │ │ taints │ │ cordon │ │
│ │ conditions │ │ tolerations │ │ drain │ │
│ │ roles │ │ affinity │ │ uncordon │ │
│ │ capacity │ │ topology │ │ upgrades │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ etcd Backup │ │ Multi-Cluster│ │
│ │ & Restore │ │ Operations │ │
│ │ │ │ │ │
│ │ snapshot │ │ contexts │ │
│ │ restore │ │ federation │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Nodes are the machines (physical or virtual) that run your workloads. Managing them is a core cluster admin task.
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
# worker-01 Ready <none> 45d v1.30.2
# worker-02 Ready <none> 45d v1.30.2
# worker-03 Ready <none> 44d v1.30.2
# Wider output shows OS, kernel, container runtime
kubectl get nodes -o wide
# NAME STATUS ROLES AGE VERSION INTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
# control-01 Ready control-plane 45d v1.30.2 10.0.1.10 Ubuntu 22.04.3 LTS 5.15.0-91 containerd://1.7.11
# worker-01 Ready <none> 45d v1.30.2 10.0.1.11 Ubuntu 22.04.3 LTS 5.15.0-91 containerd://1.7.11
# worker-02 Ready <none> 45d v1.30.2 10.0.1.12 Ubuntu 22.04.3 LTS 5.15.0-91 containerd://1.7.11
kubectl describe node gives you everything — capacity, allocatable resources, conditions, taints, labels, and running Pods:
kubectl describe node worker-01
# Name: worker-01
# Roles: <none>
# Labels: kubernetes.io/arch=amd64
# kubernetes.io/hostname=worker-01
# kubernetes.io/os=linux
# node.kubernetes.io/instance-type=m5.xlarge
# Annotations: ...
# Taints: <none>
# Conditions:
# Type Status Reason Message
# ---- ------ ------ -------
# MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory
# DiskPressure False KubeletHasNoDiskPressure kubelet has no disk pressure
# PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID
# Ready True KubeletReady kubelet is posting ready status
# Capacity:
# cpu: 4
# memory: 16384Mi
# pods: 110
# Allocatable:
# cpu: 3800m
# memory: 15168Mi
# pods: 110
# Non-terminated Pods: (12 in total)
# Namespace Name CPU Requests Memory Requests
# --------- ---- ------------ ---------------
# default web-abc123 100m 128Mi
# default api-def456 250m 256Mi
# ...
Capacity is the total resources on the node. Allocatable is what's available for Pods after reserving resources for the OS and kubelet. The difference is the system reserved overhead.
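To see just those two numbers side by side instead of scanning the full describe output, a JSONPath query works (node name and values here match the example node above):

kubectl get node worker-01 -o jsonpath='{.status.capacity.cpu}{" vs "}{.status.allocatable.cpu}{"\n"}'
# 4 vs 3800m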
Every node reports five conditions. These drive scheduling decisions and alerting:
| Condition | Healthy Value | What It Means |
|---|---|---|
| `Ready` | `True` | Node is healthy and can accept Pods |
| `MemoryPressure` | `False` | Node is not running low on memory |
| `DiskPressure` | `False` | Node is not running low on disk space |
| `PIDPressure` | `False` | Node is not running too many processes |
| `NetworkUnavailable` | `False` | Node network is correctly configured |
If Ready is False or Unknown for the pod-eviction-timeout (default 5 minutes), the node controller starts evicting Pods to healthy nodes.
Gotcha: A node stuck in `NotReady` does not immediately kill Pods. Kubernetes waits for the eviction timeout. During that window, Pods on the failing node are still reported as `Running` but are unreachable. StatefulSet Pods are especially tricky — they won't be rescheduled until the old Pod is confirmed terminated (or you force-delete it).
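If you have confirmed the node is gone for good (for example, the VM was destroyed), one escape hatch is to force-delete the stuck Pod so the StatefulSet controller can recreate it elsewhere. The Pod name here is hypothetical, and the output is abbreviated:

# Only do this when the node is confirmed dead; it skips graceful termination
kubectl delete pod web-0 --grace-period=0 --force
# warning: Immediate deletion does not wait for confirmation that the running resource has been terminated.
# pod "web-0" force deleted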
Labels let you classify nodes by hardware, location, or role:
# Add a label
kubectl label node worker-01 disktype=ssd
# node/worker-01 labeled
# Add multiple labels
kubectl label node worker-01 gpu=nvidia-a100 zone=us-east-1a
# node/worker-01 labeled
# View labels on all nodes
kubectl get nodes --show-labels
# NAME STATUS ROLES AGE VERSION LABELS
# worker-01 Ready <none> 45d v1.30.2 disktype=ssd,gpu=nvidia-a100,zone=us-east-1a,...
# Remove a label
kubectl label node worker-01 gpu-
# node/worker-01 unlabeled
# Overwrite an existing label
kubectl label node worker-01 disktype=nvme --overwrite
# node/worker-01 labeled
Roles are just labels with a special prefix. The control plane node has:
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
# The role comes from this label:
kubectl get node control-01 --show-labels | tr ',' '\n' | grep role
# node-role.kubernetes.io/control-plane=
# You can assign custom roles to workers
kubectl label node worker-01 node-role.kubernetes.io/gpu-worker=
# node/worker-01 labeled
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
# worker-01 Ready gpu-worker 45d v1.30.2
# worker-02 Ready <none> 45d v1.30.2
Tip: The role label value is irrelevant — Kubernetes only checks if the label key exists. `node-role.kubernetes.io/worker=""` and `node-role.kubernetes.io/worker="true"` both display as `worker` in the ROLES column.
Taints and tolerations work together to control which Pods can run on which nodes. A taint on a node repels Pods. A toleration on a Pod lets it ignore a specific taint.
Taints = "Keep out" signs on nodes
Tolerations = "I'm allowed in" badges on Pods
  Node with taint                      Pod without toleration
  ┌──────────────┐                     ┌──────┐
  │ ⛔ taint:    │   ──── X ────       │ Pod  │   Rejected
  │  gpu=true:   │                     └──────┘
  │  NoSchedule  │
  └──────────────┘                     Pod with toleration
                                       ┌──────┐
                     ──── ✓ ────       │ Pod  │   Allowed
                                       │ tol: │
                                       │ gpu  │
                                       └──────┘
# Taint a node
kubectl taint node worker-01 dedicated=gpu:NoSchedule
# node/worker-01 tainted
# View taints
kubectl describe node worker-01 | grep Taints
# Taints: dedicated=gpu:NoSchedule
# Remove a taint (add minus sign at the end)
kubectl taint node worker-01 dedicated=gpu:NoSchedule-
# node/worker-01 untainted
The taint format is key=value:effect. The three effects are:
| Effect | Behavior |
|---|---|
| `NoSchedule` | New Pods without a matching toleration are not scheduled. Existing Pods stay. |
| `PreferNoSchedule` | Scheduler tries to avoid the node, but will use it if no other option exists. Soft rule. |
| `NoExecute` | New Pods are rejected and existing Pods without a matching toleration are evicted. |
To schedule a Pod on a tainted node, add a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  nodeSelector:
    gpu: nvidia-a100   # also select the right node
The toleration operators:
| Operator | Meaning |
|---|---|
| `Equal` | Key, value, and effect must all match |
| `Exists` | Key and effect must match (value is ignored — use when taint has no value) |
# Match any taint with key "dedicated" regardless of value
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"

# Match ALL taints (use with caution — this Pod goes anywhere)
tolerations:
- operator: "Exists"
With NoExecute, you can control how long a Pod stays before eviction:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # stay for 5 minutes, then evict
If tolerationSeconds is omitted, the Pod tolerates the taint forever and is never evicted.
Kubernetes automatically applies taints to nodes in certain conditions:
| Taint | When Applied |
|---|---|
| `node-role.kubernetes.io/control-plane:NoSchedule` | Control plane nodes — keeps user workloads off the control plane |
| `node.kubernetes.io/not-ready:NoExecute` | Node condition Ready is False |
| `node.kubernetes.io/unreachable:NoExecute` | Node condition Ready is Unknown (lost contact) |
| `node.kubernetes.io/disk-pressure:NoSchedule` | Node has disk pressure |
| `node.kubernetes.io/memory-pressure:NoSchedule` | Node has memory pressure |
| `node.kubernetes.io/pid-pressure:NoSchedule` | Node has PID pressure |
| `node.kubernetes.io/unschedulable:NoSchedule` | Node is cordoned |
Tip: Kubernetes automatically adds tolerations for `not-ready` and `unreachable` to every Pod with a default `tolerationSeconds` of 300 (5 minutes). That is why Pods are not immediately evicted when a node goes down — they wait 5 minutes first.
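You can see these auto-added tolerations on any running Pod. The Pod name below is taken from the earlier node output and is just a placeholder:

kubectl get pod web-abc123 -o yaml | grep -A 8 "tolerations:"
# tolerations:
# - effect: NoExecute
#   key: node.kubernetes.io/not-ready
#   operator: Exists
#   tolerationSeconds: 300
# - effect: NoExecute
#   key: node.kubernetes.io/unreachable
#   operator: Exists
#   tolerationSeconds: 300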
Common uses: dedicate nodes to a team with a taint like `team=data-science:NoSchedule`, and rely on `NoExecute` taints to evict Pods during node maintenance (drain handles this automatically).

Node affinity is a more expressive alternative to nodeSelector. It lets you specify rules about which nodes a Pod can or should be scheduled on.
apiVersion: v1
kind: Pod
metadata:
  name: ssd-app
spec:
  containers:
  - name: app
    image: myapp:latest
  nodeSelector:
    disktype: ssd   # must land on a node with this label
nodeSelector is a hard requirement — if no node matches, the Pod stays Pending. It only supports equality matching.
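When that happens, the scheduler records the reason in the Pod's events. The exact wording varies by Kubernetes version, but it looks roughly like this:

kubectl describe pod ssd-app | tail -n 3
# Events:
#   Warning  FailedScheduling  ...  0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector.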
Node affinity supports two rule types:
| Rule | Behavior |
|---|---|
| `requiredDuringSchedulingIgnoredDuringExecution` | Hard requirement — Pod is not scheduled unless a node matches. Like nodeSelector but more flexible. |
| `preferredDuringSchedulingIgnoredDuringExecution` | Soft preference — scheduler tries to match but schedules elsewhere if needed. |
The IgnoredDuringExecution part means: if the node labels change after the Pod is running, the Pod stays put. It is not evicted.
apiVersion: v1
kind: Pod
metadata:
  name: zone-aware-app
spec:
  containers:
  - name: app
    image: myapp:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - high-memory
This Pod:
- must be scheduled in zone `us-east-1a` or `us-east-1b` (hard requirement)
- prefers nodes with `disktype=ssd` (weight 80) or `node-type=high-memory` (weight 20)

The weight field (1-100) lets the scheduler rank preferred nodes. Higher weight = stronger preference.
Supported operators for matchExpressions:
| Operator | Meaning |
|---|---|
| `In` | Label value is in the provided list |
| `NotIn` | Label value is not in the list |
| `Exists` | Label key exists (value ignored) |
| `DoesNotExist` | Label key does not exist |
| `Gt` | Label value is greater than (numeric strings only) |
| `Lt` | Label value is less than (numeric strings only) |
| Feature | nodeSelector | Node Affinity |
|---|---|---|
| Matching | Exact label match only | In, NotIn, Exists, DoesNotExist, Gt, Lt |
| Hard/soft | Hard only | Both required and preferred |
| Multiple rules | AND only | OR within a term, AND across terms |
| Complexity | Simple | More verbose but more powerful |
When to use which: Use nodeSelector for simple cases (e.g., "run on SSD nodes"). Use node affinity when you need OR logic, soft preferences, or operators beyond equality.
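For reference, the simple `disktype: ssd` nodeSelector from earlier is equivalent to this hard affinity rule, which you can later extend with additional values or operators:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd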
While node affinity controls Pod-to-node relationships, pod affinity controls Pod-to-Pod relationships — whether Pods should be co-located or spread apart.
Use case: place a web frontend on the same node as its Redis cache for low-latency access.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: myapp/frontend:latest
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis-cache
            topologyKey: kubernetes.io/hostname   # same node
This says: "Schedule this Pod on a node that already runs a Pod with label app=redis-cache."
Use case: spread replicas across different nodes so a single node failure doesn't take down all instances.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myapp/api:latest
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - api-server
            topologyKey: kubernetes.io/hostname   # different nodes
This says: "Don't put two api-server Pods on the same node."
The topologyKey determines the scope of "co-located" or "spread":
| topologyKey | Scope | Use Case |
|---|---|---|
| `kubernetes.io/hostname` | Per node | Spread replicas across nodes |
| `topology.kubernetes.io/zone` | Per availability zone | Spread across AZs for HA |
| `topology.kubernetes.io/region` | Per region | Multi-region workloads |
A preferred (soft) anti-affinity rule spreads Pods across zones on a best-effort basis without blocking scheduling:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api-server
        topologyKey: topology.kubernetes.io/zone
Gotcha: Hard pod anti-affinity (`required`) with `topologyKey: kubernetes.io/hostname` limits your replicas to the number of nodes. If you have 3 nodes but want 5 replicas, 2 Pods will stay `Pending` forever. Use `preferred` anti-affinity if you want best-effort spreading without blocking.
Topology spread constraints provide more fine-grained control over how Pods are distributed than pod anti-affinity. They let you specify a maximum allowed imbalance (maxSkew) across topology domains.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
This distributes the 6 replicas evenly: the hard zone constraint keeps the Pod count of any two zones within 1 of each other, and the soft hostname constraint nudges the scheduler toward one Pod per node:
Zone A (2 nodes) Zone B (2 nodes) Zone C (2 nodes)
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ web │ │ web │ │ web │ │ web │ │ web │ │ web │
│ │ │ │ │ │ │ │ │ │ │ │
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘
node-1 node-2 node-3 node-4 node-5 node-6
Zone A: 2 Pods Zone B: 2 Pods Zone C: 2 Pods
maxSkew = 1 → satisfied (max difference between any two zones is 0)
Key fields:
| Field | Description |
|---|---|
| `maxSkew` | Maximum allowed difference in Pod count between any two topology domains. 1 means evenly distributed. |
| `topologyKey` | The node label that defines domains (zone, hostname, region) |
| `whenUnsatisfiable` | `DoNotSchedule` (hard) or `ScheduleAnyway` (soft, with skew minimization) |
| `labelSelector` | Which Pods to count when computing the spread |
Tip: Topology spread constraints are the recommended way to distribute Pods evenly. They are more predictable than pod anti-affinity for workloads where you want even distribution rather than strict "no two on the same node" rules.
When you need to perform maintenance on a node — OS updates, hardware replacement, Kubernetes upgrades — you must safely move workloads off it first.
kubectl cordon worker-02
# node/worker-02 cordoned
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
# worker-01 Ready <none> 45d v1.30.2
# worker-02 Ready,SchedulingDisabled <none> 45d v1.30.2
# worker-03 Ready <none> 44d v1.30.2
A cordoned node shows SchedulingDisabled. Existing Pods continue running — no evictions happen. New Pods will not be scheduled there.
kubectl uncordon worker-02
# node/worker-02 uncordoned
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# worker-02 Ready <none> 45d v1.30.2
Draining cordons the node and then evicts all Pods:
kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# node/worker-02 cordoned
# WARNING: ignoring DaemonSet-managed Pods: monitoring/prometheus-node-exporter-xyz
# evicting pod default/web-abc123
# evicting pod default/api-def456
# pod/web-abc123 evicted
# pod/api-def456 evicted
# node/worker-02 drained
Common drain flags:
| Flag | Purpose |
|---|---|
| `--ignore-daemonsets` | Skip DaemonSet Pods (they're supposed to run on every node) |
| `--delete-emptydir-data` | Delete Pods using emptyDir volumes (data will be lost) |
| `--force` | Force-delete Pods not managed by a controller (standalone Pods) |
| `--grace-period=30` | Override the Pod's termination grace period |
| `--timeout=300s` | Abort drain if it takes longer than this |
| `--pod-selector='app=web'` | Only evict Pods matching the selector |
Gotcha: `kubectl drain` without `--ignore-daemonsets` will fail if DaemonSet Pods exist on the node. Without `--delete-emptydir-data`, it fails if any Pods use emptyDir. Without `--force`, it fails if standalone Pods (not managed by a controller) exist. In practice, you almost always need `--ignore-daemonsets --delete-emptydir-data`.
Drain respects PodDisruptionBudgets (PDBs). If evicting a Pod would violate a PDB, the drain blocks:
kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# evicting pod default/web-abc123
# error when evicting pods/"web-abc123" -n "default" (will retry after 5s):
# Cannot evict pod as it would violate the pod's disruption budget.
# evicting pod default/web-abc123
# pod/web-abc123 evicted
# node/worker-02 drained
The drain retries until the PDB is satisfied (other replicas are ready) or the timeout expires.
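A PodDisruptionBudget itself is a small object. This is a minimal sketch, assuming the `app: web` label matches the Deployment you want to protect; it keeps at least two replicas available during voluntary disruptions such as drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # drain will not evict a Pod if it would drop below this count
  selector:
    matchLabels:
      app: web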
Step 1: Cordon Step 2: Drain Step 3: Maintain Step 4: Uncordon
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ worker-02 │ │ worker-02 │ │ worker-02 │ │ worker-02 │
│ Scheduling │ ──▶ │ No Pods │ ──▶ │ OS update │ ──▶ │ Ready │
│ Disabled │ │ (evicted) │ │ K8s upgrade │ │ Schedulable │
│ Pods running │ │ Unschedulable│ │ Reboot │ │ New Pods OK │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
# Full maintenance workflow
kubectl cordon worker-02
kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# ... perform maintenance (upgrade OS, kubelet, etc.) ...
kubectl uncordon worker-02
Kubernetes releases a new minor version roughly every 4 months. Staying current is important for security patches, bug fixes, and new features.
Version Skew Policy:
API Server: v1.30   ← must be the highest version component
    │
    ├── kubelet: v1.30  ✓ (same version)
    ├── kubelet: v1.29  ✓ (one behind)
    ├── kubelet: v1.28  ✓ (two behind)
    ├── kubelet: v1.27  ✓ (three behind — oldest supported; since Kubernetes 1.28 the kubelet may lag by up to three minor versions)
    └── kubelet: v1.26  ✗ (four behind — not supported)
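Before planning an upgrade, check the server version and the kubelet version on every node; custom columns pull this straight from the node status (values below are illustrative):

# Control plane (server) version
kubectl version | grep Server
# Server Version: v1.30.2

# kubelet version per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
# NAME         KUBELET
# control-01   v1.30.2
# worker-01    v1.29.5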
Managed services handle control plane upgrades for you:
# GKE — upgrade control plane
gcloud container clusters upgrade my-cluster \
--master --cluster-version=1.30.2
# EKS — upgrade control plane
aws eks update-cluster-version \
--name my-cluster --kubernetes-version 1.30
# AKS — upgrade control plane
az aks upgrade --resource-group mygroup \
--name my-cluster --kubernetes-version 1.30.2
For node pools, managed services can perform rolling upgrades automatically (new nodes come up, old nodes drain and terminate).
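Node pool upgrades are triggered with each provider's CLI; the cluster, pool, and resource group names below are placeholders:

# GKE — upgrade a node pool to match the control plane version
gcloud container clusters upgrade my-cluster --node-pool=default-pool

# EKS — roll a managed node group to the cluster's Kubernetes version
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name workers

# AKS — upgrade a node pool
az aks nodepool upgrade --resource-group mygroup --cluster-name my-cluster \
  --name nodepool1 --kubernetes-version 1.30.2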
For clusters built with kubeadm, the upgrade is manual. Here is the workflow:
# Check available versions
sudo kubeadm upgrade plan
# Components that must be upgraded manually after you have upgraded the control plane:
# COMPONENT CURRENT TARGET
# kubelet v1.29.5 v1.30.2
#
# Upgrade to the latest stable version:
# COMPONENT CURRENT TARGET
# kube-apiserver v1.29.5 v1.30.2
# kube-controller-manager v1.29.5 v1.30.2
# kube-scheduler v1.29.5 v1.30.2
# kube-proxy v1.29.5 v1.30.2
# CoreDNS v1.11.1 v1.11.3
# etcd 3.5.10 3.5.12
# Upgrade kubeadm first
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
# Apply the upgrade
sudo kubeadm upgrade apply v1.30.2
# [upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.30.2". Enjoy!
# Upgrade kubelet and kubectl on the control plane node
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# On the admin machine: drain the worker
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data
# On the worker node: upgrade kubeadm, kubelet
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# On the admin machine: uncordon the worker
kubectl uncordon worker-01
Repeat for each worker node. Verify after each node:
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
# worker-01 Ready <none> 45d v1.30.2 ← upgraded
# worker-02 Ready <none> 45d v1.29.5 ← still on old version
# worker-03 Ready <none> 44d v1.29.5
Tip: Always take an etcd snapshot before upgrading the control plane. If the upgrade fails, you can restore the cluster from the backup. Also, read the changelog for each version — some releases have breaking changes or deprecated APIs that require manifest updates.
etcd is the brain of the cluster. It stores every object — Pods, Services, Secrets, ConfigMaps, RBAC rules — everything. If etcd is lost without a backup, the entire cluster state is gone.
┌──────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ API Server ────▶ etcd │
│ ┌──────────────────┐ │
│ Controller ────▶ │ All cluster data │ │
│ Manager │ - Pods │ │
│ │ - Services │ │
│ Scheduler ────▶ │ - Secrets │ │
│ │ - ConfigMaps │ │
│ │ - RBAC │ │
│ │ - ...everything │ │
│ └──────────────────┘ │
└──────────────────────────────────────────┘
# Set environment variables for etcd connection
export ETCDCTL_API=3
# Take a snapshot
etcdctl snapshot save /opt/backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Snapshot saved at /opt/backup/etcd-snapshot-20260131.db
# Verify the snapshot
etcdctl snapshot status /opt/backup/etcd-snapshot-20260131.db --write-out=table
# +----------+----------+------------+------------+
# | HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | 3c9768e5 | 412850 | 1243 | 5.1 MB |
# +----------+----------+------------+------------+
The certificates are required because etcd uses mutual TLS. On kubeadm clusters, the certs are in /etc/kubernetes/pki/etcd/.
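On a kubeadm control plane node you can confirm the cert files are where the snapshot command expects them (this is the standard kubeadm layout):

ls /etc/kubernetes/pki/etcd/
# ca.crt  ca.key  healthcheck-client.crt  healthcheck-client.key
# peer.crt  peer.key  server.crt  server.key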
Restoring replaces all cluster state with the snapshot contents:
# Stop the API server (on kubeadm, move the static pod manifest)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Restore the snapshot to a new data directory
etcdctl snapshot restore /opt/backup/etcd-snapshot-20260131.db \
--data-dir=/var/lib/etcd-restored \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# 2026-01-31T10:00:00Z info restored snapshot
# 2026-01-31T10:00:00Z info added member
# Update the etcd static pod to use the restored data directory
# Edit /etc/kubernetes/manifests/etcd.yaml:
# - --data-dir=/var/lib/etcd-restored
# Also update the hostPath volume:
# path: /var/lib/etcd-restored
# Restore the API server manifest
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# Wait for the control plane to restart
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
| When | Why |
|---|---|
| Before cluster upgrades | Rollback point if upgrade fails |
| Before major changes | Rollback point for risky operations |
| Scheduled (daily/hourly) | Disaster recovery |
| Before etcd maintenance | Safety net |
Gotcha: An etcd backup captures cluster state at a point in time. Anything created after the snapshot is lost on restore. In production, automate regular snapshots with a CronJob or external tool like Velero. Also, store backups off-cluster — if the cluster is destroyed, you need the backup to be somewhere else.
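As one way to automate this, here is a minimal CronJob sketch for a kubeadm cluster where etcd runs on the control plane node. The schedule, image tag, backup path, and fixed snapshot filename are all assumptions; a real setup should timestamp snapshots and copy them off the node (or use a purpose-built tool):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"                         # every 6 hours (assumption)
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""   # run where etcd's certs and data live
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          hostNetwork: true                       # reach etcd on 127.0.0.1:2379
          restartPolicy: OnFailure
          containers:
          - name: snapshot
            image: registry.k8s.io/etcd:3.5.12-0  # ships etcdctl; tag is an assumption
            env:
            - name: ETCDCTL_API
              value: "3"
            command:
            - etcdctl
            - snapshot
            - save
            - /backup/etcd-snapshot.db            # fixed name for brevity; timestamp in production
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/kubernetes/pki/etcd/ca.crt
            - --cert=/etc/kubernetes/pki/etcd/server.crt
            - --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /opt/backup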
As organizations grow, a single cluster often is not enough. Multiple clusters provide stronger isolation, separate failure domains, and meet compliance requirements.
| Scenario | Namespaces | Multiple Clusters |
|---|---|---|
| Team isolation (same trust level) | Yes | Overkill |
| Dev/staging/prod environments | Maybe | Recommended |
| Regulatory compliance (PCI, HIPAA) | Usually insufficient | Required |
| Multi-region HA | No | Yes |
| Blast radius reduction | Partial | Full |
| Noisy neighbor protection | ResourceQuotas help | Complete isolation |
kubectl uses contexts to connect to different clusters. Each context combines a cluster, user, and namespace:
# View all contexts
kubectl config get-contexts
# CURRENT NAME CLUSTER AUTHINFO NAMESPACE
# * dev-cluster dev-cluster dev-admin default
# staging-cluster staging-cluster staging-admin default
# prod-cluster prod-cluster prod-admin default
# Switch to a different cluster
kubectl config use-context prod-cluster
# Switched to context "prod-cluster".
# Verify
kubectl config current-context
# prod-cluster
# Run a command against a specific context without switching
kubectl --context=staging-cluster get nodes
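Contexts are just entries in your kubeconfig, so you can define new ones yourself. The cluster and user entries must already exist in the file; the names here mirror the example output above:

# Create or update a context pointing at the staging cluster and a specific namespace
kubectl config set-context staging-apps \
  --cluster=staging-cluster --user=staging-admin --namespace=apps
# Context "staging-apps" created.

# Change only the default namespace of the current context
kubectl config set-context --current --namespace=kube-system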
# kubectx — fast context switching
kubectx prod-cluster
# Switched to context "prod-cluster".
kubectx - # switch back to previous context
# kubens — fast namespace switching
kubens kube-system
# Context "prod-cluster" modified.
# Active namespace is "kube-system".
Kubernetes federation allows you to manage multiple clusters from a single control plane. The original Federation v1 was deprecated, and its successor KubeFed has since been archived as well; newer tools like Admiralty and Liqo, or GitOps-based approaches, provide multi-cluster capabilities today.
Tip: For most teams, federation is overkill. Start with separate clusters and a GitOps tool (ArgoCD, Flux) that deploys to multiple clusters from a single Git repo. This gives you multi-cluster management without the complexity of federation.
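To illustrate the GitOps approach, here is a hedged sketch of an Argo CD Application that deploys one Git path to a remote cluster. The repo URL, path, and cluster API address are placeholders, and this assumes Argo CD is installed in the argocd namespace with the remote cluster already registered:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config   # placeholder repo
    targetRevision: main
    path: apps/web/overlays/prod
  destination:
    server: https://prod-cluster.example.com:6443          # remote cluster API endpoint
    namespace: web
  syncPolicy:
    automated:
      prune: true
      selfHeal: true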
Let's combine everything in a practical scenario. We will taint a node, deploy Pods with tolerations, drain a node for maintenance, and use node affinity to control scheduling.
Step 1: Label and taint the GPU node
# Label worker-01 as a GPU node
kubectl label node worker-01 hardware=gpu
# node/worker-01 labeled
# Taint it so only GPU workloads run there
kubectl taint node worker-01 hardware=gpu:NoSchedule
# node/worker-01 tainted
# Verify
kubectl describe node worker-01 | grep -E "Taints|Labels" | head -5
# Labels: hardware=gpu,kubernetes.io/hostname=worker-01,...
# Taints: hardware=gpu:NoSchedule
Step 2: Deploy a regular Pod without a toleration
kubectl run regular-app --image=nginx
# pod/regular-app created
kubectl get pod regular-app -o wide
# NAME READY STATUS NODE
# regular-app 1/1 Running worker-02 ← not on worker-01
Step 3: Deploy a GPU Pod with a toleration
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: ml-training
    image: nvidia/cuda:12.0-base
    command: ["sleep", "3600"]
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: hardware
            operator: In
            values:
            - gpu
kubectl apply -f gpu-workload.yaml
# pod/gpu-workload created
kubectl get pod gpu-workload -o wide
# NAME READY STATUS NODE
# gpu-workload 1/1 Running worker-01 ← landed on the GPU node
Step 4: Drain a node for maintenance
# Drain worker-02 (where regular-app is running)
kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# node/worker-02 cordoned
# evicting pod default/regular-app
# pod/regular-app evicted
# node/worker-02 drained
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# control-01 Ready control-plane 45d v1.30.2
# worker-01 Ready <none> 45d v1.30.2
# worker-02 Ready,SchedulingDisabled <none> 45d v1.30.2
# worker-03 Ready <none> 44d v1.30.2
# regular-app is gone — it was a standalone Pod, not managed by a controller
kubectl get pods
# NAME READY STATUS NODE
# gpu-workload 1/1 Running worker-01
Gotcha: Standalone Pods (not managed by a Deployment, StatefulSet, or DaemonSet) are permanently lost during a drain. They are evicted but nothing recreates them. Always use a controller for production workloads.
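The fix is to run the same workload under a controller, so a drain just moves it. For the nginx example above:

# A Deployment recreates its Pods on another node after eviction
kubectl create deployment regular-app --image=nginx --replicas=2
# deployment.apps/regular-app created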
Step 5: Uncordon and clean up
kubectl uncordon worker-02
# node/worker-02 uncordoned
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# worker-02 Ready <none> 45d v1.30.2 ← back to normal
kubectl delete pod gpu-workload
kubectl taint node worker-01 hardware=gpu:NoSchedule-
kubectl label node worker-01 hardware-
Key takeaways:
- Inspect nodes with kubectl get nodes and kubectl describe node. Key info: capacity vs allocatable resources, conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable), labels, and taints.
- Labels classify nodes by hardware (disktype=ssd), location (zone=us-east-1a), or role (node-role.kubernetes.io/gpu-worker). Use them for scheduling control.
- Taints repel Pods with three effects: NoSchedule (block new Pods), PreferNoSchedule (soft avoid), NoExecute (block new + evict existing). Applied with kubectl taint node.
- Tolerations let Pods onto tainted nodes; use operator: Exists to match any value.
- Kubernetes applies taints automatically: control-plane:NoSchedule, not-ready:NoExecute, unreachable:NoExecute. Default toleration for not-ready/unreachable is 300 seconds.
- Node affinity is more expressive than nodeSelector: required (hard) and preferred (soft, with weights). Supports operators: In, NotIn, Exists, DoesNotExist, Gt, Lt.
- Pod affinity co-locates Pods and pod anti-affinity spreads them; topologyKey controls scope: per-node, per-zone, or per-region.
- Topology spread constraints distribute Pods evenly using maxSkew, topologyKey, and whenUnsatisfiable. More predictable than pod anti-affinity for even distribution.
- Node maintenance: cordon, then drain with --ignore-daemonsets (skip DaemonSet Pods), --delete-emptydir-data (accept emptyDir data loss), --force (remove standalone Pods), then uncordon.
- Upgrades: kubeadm upgrade plan then kubeadm upgrade apply on the control plane; drain, upgrade kubelet, uncordon on each worker.
- etcdctl snapshot save captures all cluster state. Restore with etcdctl snapshot restore. Always back up before upgrades and on a regular schedule.
- Use contexts (kubectl config use-context) to switch clusters. Tools like kubectx and GitOps (ArgoCD, Flux) simplify multi-cluster operations.