// MODULE 21

Cluster Management

kubeadm, managed K8s (EKS, GKE, AKS), version upgrades, etcd backup/restore, node maintenance, and cluster federation.

Running a Kubernetes cluster means more than deploying workloads. You need to manage nodes, control where Pods land, perform upgrades without downtime, and back up cluster state. This module covers the operational side of Kubernetes — the skills that separate someone who uses K8s from someone who runs it.

                            Cluster Management
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│  │ Node Mgmt    │   │ Scheduling   │   │  Maintenance │        │
│  │              │   │  Controls    │   │              │        │
│  │ labels       │   │ taints       │   │ cordon       │        │
│  │ conditions   │   │ tolerations  │   │ drain        │        │
│  │ roles        │   │ affinity     │   │ uncordon     │        │
│  │ capacity     │   │ topology     │   │ upgrades     │        │
│  └──────────────┘   └──────────────┘   └──────────────┘        │
│                                                                 │
│  ┌──────────────┐   ┌──────────────┐                           │
│  │ etcd Backup  │   │ Multi-Cluster│                           │
│  │ & Restore    │   │ Operations   │                           │
│  │              │   │              │                           │
│  │ snapshot     │   │ contexts     │                           │
│  │ restore      │   │ federation   │                           │
│  └──────────────┘   └──────────────┘                           │
└─────────────────────────────────────────────────────────────────┘

Node Management

Nodes are the machines (physical or virtual) that run your workloads. Managing them is a core cluster admin task.

Listing Nodes

kubectl get nodes
# NAME           STATUS   ROLES           AGE   VERSION
# control-01     Ready    control-plane   45d   v1.30.2
# worker-01      Ready    <none>          45d   v1.30.2
# worker-02      Ready    <none>          45d   v1.30.2
# worker-03      Ready    <none>          44d   v1.30.2

# Wider output shows OS, kernel, container runtime
kubectl get nodes -o wide
# NAME         STATUS   ROLES           AGE   VERSION   INTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
# control-01   Ready    control-plane   45d   v1.30.2   10.0.1.10     Ubuntu 22.04.3 LTS   5.15.0-91        containerd://1.7.11
# worker-01    Ready    <none>          45d   v1.30.2   10.0.1.11     Ubuntu 22.04.3 LTS   5.15.0-91        containerd://1.7.11
# worker-02    Ready    <none>          45d   v1.30.2   10.0.1.12     Ubuntu 22.04.3 LTS   5.15.0-91        containerd://1.7.11

Describing a Node

kubectl describe node gives you everything — capacity, allocatable resources, conditions, taints, labels, and running Pods:

kubectl describe node worker-01
# Name:               worker-01
# Roles:              <none>
# Labels:             kubernetes.io/arch=amd64
#                     kubernetes.io/hostname=worker-01
#                     kubernetes.io/os=linux
#                     node.kubernetes.io/instance-type=m5.xlarge
# Annotations:        ...
# Taints:             <none>
# Conditions:
#   Type                 Status  Reason                       Message
#   ----                 ------  ------                       -------
#   MemoryPressure       False   KubeletHasSufficientMemory   kubelet has sufficient memory
#   DiskPressure         False   KubeletHasNoDiskPressure     kubelet has no disk pressure
#   PIDPressure          False   KubeletHasSufficientPID      kubelet has sufficient PID
#   Ready                True    KubeletReady                 kubelet is posting ready status
# Capacity:
#   cpu:                4
#   memory:             16384Mi
#   pods:               110
# Allocatable:
#   cpu:                3800m
#   memory:             15168Mi
#   pods:               110
# Non-terminated Pods:  (12 in total)
#   Namespace   Name                CPU Requests  Memory Requests
#   ---------   ----                ------------  ---------------
#   default     web-abc123          100m          128Mi
#   default     api-def456          250m          256Mi
#   ...

Capacity is the total resources on the node. Allocatable is what remains for Pods after resources are reserved for the OS and the kubelet (the system-reserved and kube-reserved overhead, plus eviction thresholds).
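
You can compare the two for every node with custom columns (a standard kubectl invocation; the column list is just an example, and the output below is abbreviated):

kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU-CAP:.status.capacity.cpu,CPU-ALLOC:.status.allocatable.cpu,MEM-ALLOC:.status.allocatable.memory'
# NAME         CPU-CAP   CPU-ALLOC   MEM-ALLOC
# worker-01    4         3800m       15168Mi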

Node Conditions

Every node reports five conditions. These drive scheduling decisions and alerting:

Condition            Healthy Value   What It Means
---------            -------------   -------------
Ready                True            Node is healthy and can accept Pods
MemoryPressure       False           Node is not running low on memory
DiskPressure         False           Node is not running low on disk space
PIDPressure          False           Node is not running too many processes
NetworkUnavailable   False           Node network is correctly configured

If Ready is False or Unknown for longer than the pod-eviction-timeout (default 5 minutes), the node controller evicts the node's Pods so their controllers can recreate them on healthy nodes.

Gotcha: A node stuck in NotReady does not immediately kill Pods. Kubernetes waits for the eviction timeout. During that window, Pods on the failing node are still reported as Running but are unreachable. StatefulSet Pods are especially tricky — they won't be rescheduled until the old Pod is confirmed terminated (or you force-delete it).
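
To spot unhealthy nodes quickly, you can pull just the Ready condition with a jsonpath query (a sketch that works on any cluster; output illustrative):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# control-01   True
# worker-01    True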

Node Labels

Labels let you classify nodes by hardware, location, or role:

# Add a label
kubectl label node worker-01 disktype=ssd
# node/worker-01 labeled

# Add multiple labels
kubectl label node worker-01 gpu=nvidia-a100 zone=us-east-1a
# node/worker-01 labeled

# View labels on all nodes
kubectl get nodes --show-labels
# NAME         STATUS   ROLES           AGE   VERSION   LABELS
# worker-01    Ready    <none>          45d   v1.30.2   disktype=ssd,gpu=nvidia-a100,zone=us-east-1a,...

# Remove a label
kubectl label node worker-01 gpu-
# node/worker-01 unlabeled

# Overwrite an existing label
kubectl label node worker-01 disktype=nvme --overwrite
# node/worker-01 labeled

Node Roles

Roles are just labels with a special prefix. The control plane node has:

kubectl get nodes
# NAME         STATUS   ROLES           AGE   VERSION
# control-01   Ready    control-plane   45d   v1.30.2

# The role comes from this label:
kubectl get node control-01 --show-labels | tr ',' '\n' | grep role
# node-role.kubernetes.io/control-plane=

# You can assign custom roles to workers
kubectl label node worker-01 node-role.kubernetes.io/gpu-worker=
# node/worker-01 labeled

kubectl get nodes
# NAME         STATUS   ROLES           AGE   VERSION
# control-01   Ready    control-plane   45d   v1.30.2
# worker-01    Ready    gpu-worker      45d   v1.30.2
# worker-02    Ready    <none>          45d   v1.30.2

Tip: The role label value is irrelevant — Kubernetes only checks if the label key exists. node-role.kubernetes.io/worker="" and node-role.kubernetes.io/worker="true" both display as worker in the ROLES column.

Taints and Tolerations

Taints and tolerations work together to control which Pods can run on which nodes. A taint on a node repels Pods. A toleration on a Pod lets it ignore a specific taint.

Taints = "Keep out" signs on nodes
Tolerations = "I'm allowed in" badges on Pods

  Node with taint                 Pod without toleration
  ┌──────────────┐                ┌──────┐
  │  ⛔ taint:   │  ──── X ────  │ Pod  │  Rejected
  │  gpu=true:   │                └──────┘
  │  NoSchedule  │
  └──────────────┘                Pod with toleration
                                  ┌──────┐
                   ──── ✓ ────    │ Pod  │  Allowed
                                  │ tol: │
                                  │ gpu  │
                                  └──────┘

Applying Taints

# Taint a node
kubectl taint node worker-01 dedicated=gpu:NoSchedule
# node/worker-01 tainted

# View taints
kubectl describe node worker-01 | grep Taints
# Taints:  dedicated=gpu:NoSchedule

# Remove a taint (add minus sign at the end)
kubectl taint node worker-01 dedicated=gpu:NoSchedule-
# node/worker-01 untainted

The taint format is key=value:effect. The three effects are:

Effect             Behavior
------             --------
NoSchedule         New Pods without a matching toleration are not scheduled. Existing Pods stay.
PreferNoSchedule   The scheduler tries to avoid the node but uses it if no other option exists. Soft rule.
NoExecute          New Pods are rejected, and existing Pods without a matching toleration are evicted.
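
NoExecute is the only effect that acts on Pods that are already running. A quick way to see it in action (node and taint names are illustrative; any Pod on the node without a matching toleration is evicted):

# Evict existing non-tolerating Pods immediately
kubectl taint node worker-03 maintenance=true:NoExecute
# node/worker-03 tainted

# Remove the taint once maintenance is done
kubectl taint node worker-03 maintenance=true:NoExecute-
# node/worker-03 untainted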

Adding Tolerations to Pods

To schedule a Pod on a tainted node, add a matching toleration:

gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  nodeSelector:
    gpu: nvidia-a100          # also select the right node

The toleration operators:

Operator   Meaning
--------   -------
Equal      Key, value, and effect must all match
Exists     Key and effect must match (value is ignored — use when the taint has no value)

# Match any taint with key "dedicated" regardless of value
tolerations:
- key: "dedicated"
  operator: "Exists"
  effect: "NoSchedule"

# Match ALL taints (use with caution — this Pod goes anywhere)
tolerations:
- operator: "Exists"

# Match ALL taints (use with caution — this Pod goes anywhere)
tolerations:
- operator: "Exists"

NoExecute and tolerationSeconds

With NoExecute, you can control how long a Pod stays before eviction:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300       # stay for 5 minutes, then evict

If tolerationSeconds is omitted, the Pod tolerates the taint forever and is never evicted.

Built-in Taints

Kubernetes automatically applies taints to nodes in certain conditions:

Taint                                              When Applied
-----                                              ------------
node-role.kubernetes.io/control-plane:NoSchedule   Control plane nodes — keeps regular workloads off the control plane
node.kubernetes.io/not-ready:NoExecute             Node condition Ready is False
node.kubernetes.io/unreachable:NoExecute           Node condition Ready is Unknown (lost contact)
node.kubernetes.io/disk-pressure:NoSchedule        Node has disk pressure
node.kubernetes.io/memory-pressure:NoSchedule      Node has memory pressure
node.kubernetes.io/pid-pressure:NoSchedule         Node has PID pressure
node.kubernetes.io/unschedulable:NoSchedule        Node is cordoned

Tip: Kubernetes automatically adds tolerations for not-ready and unreachable to every Pod with a default tolerationSeconds of 300 (5 minutes). That is why Pods are not immediately evicted when a node goes down — they wait 5 minutes first.
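
You can confirm this on any Pod you created without declaring tolerations (Pod name is illustrative):

kubectl get pod regular-app -o jsonpath='{.spec.tolerations[*].key}'
# node.kubernetes.io/not-ready node.kubernetes.io/unreachable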

Use Cases for Taints

  1. Dedicated nodes: taint a node pool and give only one team's Pods the matching toleration
  2. Special hardware: keep general workloads off GPU or other expensive nodes
  3. Taint-based eviction: Kubernetes itself applies NoExecute taints to drain unhealthy nodes (see the built-in taints above)

Node Affinity and Anti-Affinity

Node affinity is a more expressive alternative to nodeSelector. It lets you specify rules about which nodes a Pod can or should be scheduled on.

nodeSelector (Simple Approach)

apiVersion: v1
kind: Pod
metadata:
  name: ssd-app
spec:
  containers:
  - name: app
    image: myapp:latest
  nodeSelector:
    disktype: ssd              # must land on a node with this label

nodeSelector is a hard requirement — if no node matches, the Pod stays Pending. It only supports equality matching.
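
When no node matches, the scheduler records the reason as an event on the Pod (output abbreviated and illustrative):

kubectl describe pod ssd-app | grep -A3 Events
# Events:
#   Type     Reason            Message
#   Warning  FailedScheduling  0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector.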

Node Affinity (Expressive Approach)

Node affinity supports two rule types:

Rule                                              Behavior
----                                              --------
requiredDuringSchedulingIgnoredDuringExecution    Hard requirement — the Pod is not scheduled unless a node matches. Like nodeSelector but more flexible.
preferredDuringSchedulingIgnoredDuringExecution   Soft preference — the scheduler tries to match but schedules elsewhere if needed.

The IgnoredDuringExecution part means: if the node labels change after the Pod is running, the Pod stays put. It is not evicted.

node-affinity-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-aware-app
spec:
  containers:
  - name: app
    image: myapp:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - high-memory

This Pod:

  1. Must land in zone us-east-1a or us-east-1b (hard requirement)
  2. Prefers nodes with disktype=ssd (weight 80) or node-type=high-memory (weight 20)

The weight field (1-100) lets the scheduler rank preferred nodes. Higher weight = stronger preference.

Supported operators for matchExpressions:

Operator       Meaning
--------       -------
In             Label value is in the provided list
NotIn          Label value is not in the list
Exists         Label key exists (value ignored)
DoesNotExist   Label key does not exist
Gt             Label value is greater than (numeric strings only)
Lt             Label value is less than (numeric strings only)

nodeSelector vs Node Affinity

Feature          nodeSelector             Node Affinity
-------          ------------             -------------
Matching         Exact label match only   In, NotIn, Exists, DoesNotExist, Gt, Lt
Hard/soft        Hard only                Both required and preferred
Multiple rules   AND only                 AND within a term, OR across nodeSelectorTerms
Complexity       Simple                   More verbose but more powerful

When to use which: Use nodeSelector for simple cases (e.g., "run on SSD nodes"). Use node affinity when you need OR logic, soft preferences, or operators beyond equality.

Pod Affinity and Anti-Affinity

While node affinity controls Pod-to-node relationships, pod affinity controls Pod-to-Pod relationships — whether Pods should be co-located or spread apart.

Pod Affinity (Co-locate Pods)

Use case: place a web frontend on the same node as its Redis cache for low-latency access.

colocated-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: myapp/frontend:latest
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis-cache
            topologyKey: kubernetes.io/hostname    # same node

This says: "Schedule this Pod on a node that already runs a Pod with label app=redis-cache."

Pod Anti-Affinity (Spread Pods)

Use case: spread replicas across different nodes so a single node failure doesn't take down all instances.

spread-replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myapp/api:latest
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - api-server
            topologyKey: kubernetes.io/hostname    # different nodes

This says: "Don't put two api-server Pods on the same node."

The topologyKey

The topologyKey determines the scope of "co-located" or "spread":

topologyKey                     Scope                   Use Case
-----------                     -----                   --------
kubernetes.io/hostname          Per node                Spread replicas across nodes
topology.kubernetes.io/zone     Per availability zone   Spread across AZs for HA
topology.kubernetes.io/region   Per region              Multi-region workloads

Spread across zones example:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - api-server
        topologyKey: topology.kubernetes.io/zone

Gotcha: Hard pod anti-affinity (required) with topologyKey: kubernetes.io/hostname limits your replicas to the number of nodes. If you have 3 nodes but want 5 replicas, 2 Pods will stay Pending forever. Use preferred anti-affinity if you want best-effort spreading without blocking.

Topology Spread Constraints

Topology spread constraints provide more fine-grained control over how Pods are distributed than pod anti-affinity. They let you specify a maximum allowed imbalance (maxSkew) across topology domains.

even-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web

This distributes 6 replicas so:

  1. Across zones — no zone has more than 1 extra Pod compared to any other zone (hard rule)
  2. Across nodes — best-effort even spread within each zone (soft rule)

Zone A (2 nodes)          Zone B (2 nodes)          Zone C (2 nodes)
┌──────┐  ┌──────┐       ┌──────┐  ┌──────┐       ┌──────┐  ┌──────┐
│ web  │  │ web  │       │ web  │  │ web  │       │ web  │  │ web  │
│      │  │      │       │      │  │      │       │      │  │      │
└──────┘  └──────┘       └──────┘  └──────┘       └──────┘  └──────┘
 node-1    node-2         node-3    node-4         node-5    node-6

Zone A: 2 Pods    Zone B: 2 Pods    Zone C: 2 Pods
maxSkew = 1 → satisfied (max difference between any two zones is 0)

Key fields:

Field               Description
-----               -----------
maxSkew             Maximum allowed difference in Pod count between any two topology domains. 1 means evenly distributed.
topologyKey         The node label that defines the domains (zone, hostname, region)
whenUnsatisfiable   DoNotSchedule (hard) or ScheduleAnyway (soft, with skew minimization)
labelSelector       Which Pods to count when computing the spread

Tip: Topology spread constraints are the recommended way to distribute Pods evenly. They are more predictable than pod anti-affinity for workloads where you want even distribution rather than strict "no two on the same node" rules.
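
To check the resulting distribution, compare Pod placement against the zone labels (the zone label is set by the cloud provider, or by you on bare metal):

# Which node did each replica land on?
kubectl get pods -l app=web -o wide

# Which zone is each node in?
kubectl get nodes -L topology.kubernetes.io/zone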

Cordoning and Draining Nodes

When you need to perform maintenance on a node — OS updates, hardware replacement, Kubernetes upgrades — you must safely move workloads off it first.

Cordon: Mark Unschedulable

kubectl cordon worker-02
# node/worker-02 cordoned

kubectl get nodes
# NAME         STATUS                     ROLES           AGE   VERSION
# control-01   Ready                      control-plane   45d   v1.30.2
# worker-01    Ready                      <none>          45d   v1.30.2
# worker-02    Ready,SchedulingDisabled   <none>          45d   v1.30.2
# worker-03    Ready                      <none>          44d   v1.30.2

A cordoned node shows SchedulingDisabled. Existing Pods continue running — no evictions happen. New Pods will not be scheduled there.
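
Under the hood, cordon simply sets a field on the node spec; this patch is equivalent:

kubectl patch node worker-02 -p '{"spec":{"unschedulable":true}}'
# node/worker-02 patched

Setting it also makes Kubernetes add the node.kubernetes.io/unschedulable:NoSchedule taint from the built-in taints table above.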

Uncordon: Make Schedulable Again

kubectl uncordon worker-02
# node/worker-02 uncordoned

kubectl get nodes
# NAME         STATUS   ROLES           AGE   VERSION
# worker-02    Ready    <none>          45d   v1.30.2

Drain: Evict All Pods

Draining cordons the node and then evicts all Pods:

kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# node/worker-02 cordoned
# evicting pod default/web-abc123
# evicting pod default/api-def456
# evicting pod monitoring/prometheus-node-exporter-xyz
# pod/web-abc123 evicted
# pod/api-def456 evicted
# node/worker-02 drained

Common drain flags:

Flag                       Purpose
----                       -------
--ignore-daemonsets        Skip DaemonSet Pods (they're supposed to run on every node)
--delete-emptydir-data     Delete Pods using emptyDir volumes (their data is lost)
--force                    Force-delete Pods not managed by a controller (standalone Pods)
--grace-period=30          Override the Pod's termination grace period
--timeout=300s             Abort the drain if it takes longer than this
--pod-selector='app=web'   Only evict Pods matching the selector

Gotcha: kubectl drain without --ignore-daemonsets will fail if DaemonSet Pods exist on the node. Without --delete-emptydir-data, it fails if any Pods use emptyDir. Without --force, it fails if standalone Pods (not managed by a controller) exist. In practice, you almost always need --ignore-daemonsets --delete-emptydir-data.

Drain and PodDisruptionBudgets

Drain respects PodDisruptionBudgets (PDBs). If evicting a Pod would violate a PDB, the drain blocks:

kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# evicting pod default/web-abc123
# error when evicting pods/"web-abc123" -n "default" (will retry after 5s):
#   Cannot evict pod as it would violate the pod's disruption budget.
# evicting pod default/web-abc123
# pod/web-abc123 evicted
# node/worker-02 drained

The drain retries until the PDB is satisfied (other replicas are ready) or the timeout expires.
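
For reference, a minimal PDB that keeps at least two replicas of a web Deployment available during voluntary disruptions such as drains (names are illustrative):

web-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web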

The Maintenance Workflow

Step 1: Cordon          Step 2: Drain           Step 3: Maintain        Step 4: Uncordon
┌──────────────┐       ┌──────────────┐        ┌──────────────┐        ┌──────────────┐
│ worker-02    │       │ worker-02    │        │ worker-02    │        │ worker-02    │
│ Scheduling   │  ──▶  │ No Pods      │  ──▶   │ OS update    │  ──▶   │ Ready        │
│ Disabled     │       │ (evicted)    │        │ K8s upgrade  │        │ Schedulable  │
│ Pods running │       │ Unschedulable│        │ Reboot       │        │ New Pods OK  │
└──────────────┘       └──────────────┘        └──────────────┘        └──────────────┘

# Full maintenance workflow
kubectl cordon worker-02
kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data

# ... perform maintenance (upgrade OS, kubelet, etc.) ...

kubectl uncordon worker-02

Cluster Upgrades

Kubernetes releases a new minor version roughly every 4 months. Staying current is important for security patches, bug fixes, and new features.

Upgrade Rules

  1. Control plane first, then worker nodes — never upgrade workers ahead of the control plane
  2. One minor version at a time — upgrade from 1.29 to 1.30, not 1.29 to 1.31
  3. Version skew policy — the kubelet can be up to 3 minor versions behind the API server (2 if the kubelet is older than v1.25)

Version Skew Policy:

API Server: v1.30   ← must be the highest-version component
         │
         ├── kubelet: v1.30  ✓  (same version)
         ├── kubelet: v1.29  ✓  (one behind)
         ├── kubelet: v1.28  ✓  (two behind)
         ├── kubelet: v1.27  ✓  (three behind)
         └── kubelet: v1.26  ✗  (four behind — not supported)
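
Before planning an upgrade, check what each component is actually running (output illustrative):

# API server version
kubectl version | grep Server
# Server Version: v1.30.2

# kubelet version per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'
# NAME         KUBELET
# control-01   v1.30.2
# worker-01    v1.29.5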

Managed Kubernetes (EKS, GKE, AKS)

Managed services handle control plane upgrades for you:

# GKE — upgrade control plane
gcloud container clusters upgrade my-cluster \
  --master --cluster-version=1.30.2

# EKS — upgrade control plane
aws eks update-cluster-version \
  --name my-cluster --kubernetes-version 1.30

# AKS — upgrade control plane
az aks upgrade --resource-group mygroup \
  --name my-cluster --kubernetes-version 1.30.2

For node pools, managed services can perform rolling upgrades automatically (new nodes come up, old nodes drain and terminate).
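
The node pool commands follow the same pattern (cluster, node group, and pool names are illustrative):

# GKE — upgrade a node pool
gcloud container clusters upgrade my-cluster \
  --node-pool=default-pool --cluster-version=1.30.2

# EKS — upgrade a managed node group
aws eks update-nodegroup-version \
  --cluster-name my-cluster --nodegroup-name my-nodegroup

# AKS — upgrade a node pool
az aks nodepool upgrade --resource-group mygroup \
  --cluster-name my-cluster --name nodepool1 \
  --kubernetes-version 1.30.2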

Self-Managed Upgrades (kubeadm)

For clusters built with kubeadm, the upgrade is manual. Here is the workflow:

Step 1: Upgrade the control plane
# Check available versions
sudo kubeadm upgrade plan
# Components that must be upgraded manually after you have upgraded the control plane:
# COMPONENT   CURRENT   TARGET
# kubelet     v1.29.5   v1.30.2
#
# Upgrade to the latest stable version:
# COMPONENT                CURRENT   TARGET
# kube-apiserver            v1.29.5   v1.30.2
# kube-controller-manager   v1.29.5   v1.30.2
# kube-scheduler            v1.29.5   v1.30.2
# kube-proxy                v1.29.5   v1.30.2
# CoreDNS                   v1.11.1   v1.11.3
# etcd                      3.5.10    3.5.12

# Upgrade kubeadm first (with pkgs.k8s.io, the apt repository must point at the target minor version's stream)
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1

# Apply the upgrade
sudo kubeadm upgrade apply v1.30.2
# [upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.30.2". Enjoy!

# Upgrade kubelet and kubectl on the control plane node
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
Step 2: Upgrade each worker node (one at a time)
# On the admin machine: drain the worker
kubectl drain worker-01 --ignore-daemonsets --delete-emptydir-data

# On the worker node: upgrade kubeadm, kubelet
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade node

sudo apt-get install -y kubelet=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# On the admin machine: uncordon the worker
kubectl uncordon worker-01

Repeat for each worker node. Verify after each node:

kubectl get nodes
# NAME         STATUS   ROLES           AGE   VERSION
# control-01   Ready    control-plane   45d   v1.30.2
# worker-01    Ready    <none>          45d   v1.30.2   ← upgraded
# worker-02    Ready    <none>          45d   v1.29.5   ← still on old version
# worker-03    Ready    <none>          44d   v1.29.5

Tip: Always take an etcd snapshot before upgrading the control plane. If the upgrade fails, you can restore the cluster from the backup. Also, read the changelog for each version — some releases have breaking changes or deprecated APIs that require manifest updates.

etcd Backup and Restore

etcd is the brain of the cluster. It stores every object — Pods, Services, Secrets, ConfigMaps, RBAC rules — everything. If etcd is lost without a backup, the entire cluster state is gone.

┌──────────────────────────────────────────┐
│            Kubernetes Cluster            │
│                                          │
│  API Server ────▶  etcd                  │
│                    ┌──────────────────┐  │
│  Controller ────▶  │ All cluster data │  │
│  Manager           │ - Pods           │  │
│                    │ - Services       │  │
│  Scheduler  ────▶  │ - Secrets        │  │
│                    │ - ConfigMaps     │  │
│                    │ - RBAC           │  │
│                    │ - ...everything  │  │
│                    └──────────────────┘  │
└──────────────────────────────────────────┘

Taking an etcd Snapshot

# Use the etcd v3 API (already the default in etcdctl 3.4+)
export ETCDCTL_API=3

# Take a snapshot
etcdctl snapshot save /opt/backup/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Snapshot saved at /opt/backup/etcd-snapshot-20260131.db

# Verify the snapshot
etcdctl snapshot status /opt/backup/etcd-snapshot-20260131.db --write-out=table
# +----------+----------+------------+------------+
# |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | 3c9768e5 |   412850 |       1243 |     5.1 MB |
# +----------+----------+------------+------------+

The certificates are required because etcd uses mutual TLS. On kubeadm clusters, the certs are in /etc/kubernetes/pki/etcd/.

Restoring from a Snapshot

Restoring replaces all cluster state with the snapshot contents:

# Stop the API server (on kubeadm, move the static pod manifest)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot to a new data directory
# (restore is a local file operation; no endpoints or certs are needed,
#  and etcd 3.5+ also ships etcdutl as the preferred tool for it)
etcdctl snapshot restore /opt/backup/etcd-snapshot-20260131.db \
  --data-dir=/var/lib/etcd-restored
# 2026-01-31T10:00:00Z info restored snapshot
# 2026-01-31T10:00:00Z info added member

# Update the etcd static pod to use the restored data directory
# Edit /etc/kubernetes/manifests/etcd.yaml:
#   - --data-dir=/var/lib/etcd-restored
# Also update the hostPath volume:
#   path: /var/lib/etcd-restored

# Restore the API server manifest
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Wait for the control plane to restart
kubectl get nodes
# NAME         STATUS   ROLES           AGE   VERSION
# control-01   Ready    control-plane   45d   v1.30.2

Backup Strategy

When                       Why
----                       ---
Before cluster upgrades    Rollback point if the upgrade fails
Before major changes       Rollback point for risky operations
Scheduled (daily/hourly)   Disaster recovery
Before etcd maintenance    Safety net

Gotcha: An etcd backup captures cluster state at a point in time. Anything created after the snapshot is lost on restore. In production, automate regular snapshots with a CronJob or external tool like Velero. Also, store backups off-cluster — if the cluster is destroyed, you need the backup to be somewhere else.
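
A sketch of such a CronJob for a kubeadm cluster (the schedule, image tag, and host paths are assumptions to adapt, and the snapshot still needs to be shipped off the node):

etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"              # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.12-0   # match your cluster's etcd version
            command:
            - etcdctl
            - snapshot
            - save
            - /backup/etcd-snapshot.db
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/kubernetes/pki/etcd/ca.crt
            - --cert=/etc/kubernetes/pki/etcd/server.crt
            - --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /opt/backup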

Multi-Cluster Considerations

As organizations grow, a single cluster often is not enough. Multiple clusters provide stronger isolation, separate failure domains, and meet compliance requirements.

When to Use Multiple Clusters vs Namespaces

Scenario                             Namespaces             Multiple Clusters
--------                             ----------             -----------------
Team isolation (same trust level)    Yes                    Overkill
Dev/staging/prod environments        Maybe                  Recommended
Regulatory compliance (PCI, HIPAA)   Usually insufficient   Required
Multi-region HA                      No                     Yes
Blast radius reduction               Partial                Full
Noisy neighbor protection            ResourceQuotas help    Complete isolation

Context Switching

kubectl uses contexts to connect to different clusters. Each context combines a cluster, user, and namespace:

# View all contexts
kubectl config get-contexts
# CURRENT   NAME            CLUSTER         AUTHINFO        NAMESPACE
# *         dev-cluster     dev-cluster     dev-admin       default
#           staging-cluster staging-cluster staging-admin   default
#           prod-cluster    prod-cluster    prod-admin      default

# Switch to a different cluster
kubectl config use-context prod-cluster
# Switched to context "prod-cluster".

# Verify
kubectl config current-context
# prod-cluster

# Run a command against a specific context without switching
kubectl --context=staging-cluster get nodes
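
A small loop runs the same check against every cluster in your kubeconfig (a sketch):

for ctx in $(kubectl config get-contexts -o name); do
  echo "== ${ctx}"
  kubectl --context="${ctx}" get nodes
done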

Useful Multi-Cluster Tools

# kubectx — fast context switching
kubectx prod-cluster
# Switched to context "prod-cluster".

kubectx -                # switch back to previous context

# kubens — fast namespace switching
kubens kube-system
# Context "prod-cluster" modified.
# Active namespace is "kube-system".

Federation Concepts

Kubernetes federation lets you manage multiple clusters from a single control plane. The original Federation v1 was deprecated, and its successor KubeFed has since been archived; newer tools like Admiralty and Liqo provide multi-cluster capabilities.

Tip: For most teams, federation is overkill. Start with separate clusters and a GitOps tool (ArgoCD, Flux) that deploys to multiple clusters from a single Git repo. This gives you multi-cluster management without the complexity of federation.

Hands-On: Taints, Tolerations, Drain, and Node Affinity

Let's combine everything in a practical scenario. We will taint a node, deploy Pods with tolerations, drain a node for maintenance, and use node affinity to control scheduling.

Step 1: Label and taint a node
# Label worker-01 as a GPU node
kubectl label node worker-01 hardware=gpu
# node/worker-01 labeled

# Taint it so only GPU workloads run there
kubectl taint node worker-01 hardware=gpu:NoSchedule
# node/worker-01 tainted

# Verify
kubectl describe node worker-01 | grep -E "Taints|Labels" | head -5
# Labels:   hardware=gpu,kubernetes.io/hostname=worker-01,...
# Taints:   hardware=gpu:NoSchedule
Step 2: Deploy a regular Pod (it avoids the tainted node)
kubectl run regular-app --image=nginx
# pod/regular-app created

kubectl get pod regular-app -o wide
# NAME          READY   STATUS    NODE
# regular-app   1/1     Running   worker-02     ← not on worker-01

Step 3: Deploy a GPU Pod with a toleration

gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: ml-training
    image: nvidia/cuda:12.0-base
    command: ["sleep", "3600"]
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: hardware
            operator: In
            values:
            - gpu

kubectl apply -f gpu-workload.yaml
# pod/gpu-workload created

kubectl get pod gpu-workload -o wide
# NAME           READY   STATUS    NODE
# gpu-workload   1/1     Running   worker-01    ← landed on the GPU node
Step 4: Drain a node for maintenance
# Drain worker-02 (where regular-app is running)
kubectl drain worker-02 --ignore-daemonsets --delete-emptydir-data
# node/worker-02 cordoned
# evicting pod default/regular-app
# pod/regular-app evicted
# node/worker-02 drained

kubectl get nodes
# NAME         STATUS                     ROLES           AGE   VERSION
# control-01   Ready                      control-plane   45d   v1.30.2
# worker-01    Ready                      <none>          45d   v1.30.2
# worker-02    Ready,SchedulingDisabled   <none>          45d   v1.30.2
# worker-03    Ready                      <none>          44d   v1.30.2

# regular-app is gone — it was a standalone Pod, not managed by a controller
kubectl get pods
# NAME           READY   STATUS    NODE
# gpu-workload   1/1     Running   worker-01

Gotcha: Standalone Pods (not managed by a Deployment, StatefulSet, or DaemonSet) are permanently lost during a drain. They are evicted but nothing recreates them. Always use a controller for production workloads.

Step 5: Uncordon after maintenance
kubectl uncordon worker-02
# node/worker-02 uncordoned

kubectl get nodes
# NAME         STATUS   ROLES           AGE   VERSION
# worker-02    Ready    <none>          45d   v1.30.2    ← back to normal
Step 6: Clean up
kubectl delete pod gpu-workload
kubectl taint node worker-01 hardware=gpu:NoSchedule-
kubectl label node worker-01 hardware-

Module 21 Summary