Project: Mini Container Runtime | Infrastructure Development with Go

Build a minimal container runtime using Linux namespaces and cgroups. Run a process in an isolated environment with resource limits.

What You're Building

A contain command that:

Takes a command and rootfs path as arguments
Creates new Linux namespaces (PID, UTS, mount, net)
Sets up filesystem isolation with pivot_root
Mounts /proc inside the container
Applies cgroup resource limits (memory, CPU)
Sets a hostname for the isolated environment
Runs the specified command in this isolated context

Why This Project

This project connects Linux isolation primitives to Go code: namespaces, cgroups, pivot_root, process execution, and cleanup. It is Linux-specific and uses real kernel features, so it gives you a concrete way to inspect the mechanics behind container runtimes.

Milestones

All within Module 10, but build in this order — each stage is independently verifiable in your VM.

Milestone 1 — Re-exec skeleton

The run/child dispatch and the re-exec pattern (see ## Requirements), no isolation yet — child just prints and runs the command.

Done when: sudo ./contain run -- /bin/echo hello prints hello via the child process.

Milestone 2 — UTS + PID namespaces

Add CLONE_NEWUTS + CLONE_NEWPID, set the hostname.

Done when: the command inside reports your --hostname and sees itself as PID 1.

Milestone 3 — Filesystem isolation

pivot_root into the Alpine rootfs, mount /proc, unmount the old root.

Done when: ps aux inside shows only container processes and ls / shows the Alpine tree.

Milestone 4 — Cgroup limits + cleanup

Memory/CPU/PID limits via cgroup v2 files, child PID registered, deferred cleanup. Unit-test the pure parts (parseMemory, path/format helpers — see ## Testing).

Done when: the OOM and fork-bomb checks in ## Testing behave as documented, go test ./... passes, and no contain-* cgroup dirs remain after exit.

Usage

# Run a shell in an isolated container
sudo ./contain run --rootfs /path/to/alpine-rootfs -- /bin/sh

# Run with memory limit
sudo ./contain run --rootfs ./rootfs --memory 64M -- /bin/sh

# Run with CPU and PID limits
sudo ./contain run --rootfs ./rootfs --memory 128M --cpu 50000 --pids 64 -- /bin/sh

# Run a specific command
sudo ./contain run --rootfs ./rootfs --hostname mycontainer -- /bin/echo "hello from container"

Getting a Root Filesystem

You need a minimal Linux rootfs to use as the container's filesystem. Alpine is perfect:

# Download and extract Alpine mini root filesystem
mkdir rootfs
curl -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz
tar -xzf alpine.tar.gz -C rootfs

Expected Output

$ sudo ./contain run --rootfs ./rootfs --memory 64M --hostname sandbox -- /bin/sh

[contain] creating namespaces: pid uts mnt
[contain] setting hostname: sandbox
[contain] setting up rootfs: ./rootfs
[contain] pivoting root
[contain] mounting /proc
[contain] applying cgroup limits: memory=64M
[contain] running: /bin/sh

/ # hostname
sandbox
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    2 root      0:00 ps aux
/ # cat /proc/self/cgroup
0::/
/ # ls /
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var
/ # exit

[contain] container exited: exit status 0
[contain] cleaning up cgroups

Requirements

Core

CLI parsing: Use os.Args or the flag package. The program needs a run subcommand with flags for --rootfs, --memory, --cpu, --pids, and --hostname.

Re-exec pattern: The program must fork itself with a special subcommand (e.g., child) to perform setup inside the new namespaces. This is the standard pattern — you can't set up the new mount namespace from the parent.

// Parent: create namespaces, re-exec into them
cmd := exec.Command("/proc/self/exe", append([]string{"child"}, args...)...)
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS,
}

// Child: now inside new namespaces, set up the environment

Namespace creation: Create new PID, UTS, and mount namespaces using Cloneflags on exec.Cmd.SysProcAttr.
Hostname: Set the hostname inside the UTS namespace using syscall.Sethostname.
Filesystem isolation:
1. Mount the rootfs directory
2. Create an oldroot directory inside
3. Call syscall.PivotRoot(rootfs, oldroot) to swap roots
4. os.Chdir("/") into the new root
5. Unmount and remove the old root
Mount /proc: Mount a new proc filesystem so ps works inside the container:
```
syscall.Mount("proc", "/proc", "proc", 0, "")
```
Cgroup v2 resource limits: Write limits to the cgroup filesystem:
- Memory: write to memory.max in /sys/fs/cgroup/<name>/
- CPU: write to cpu.max (e.g., "50000 100000" for 50% CPU)
- PIDs: write to pids.max
- Add the child process to the cgroup by writing its PID to cgroup.procs
Cleanup: Remove the cgroup directory on exit. Use defer to ensure cleanup runs.

Namespace Flags

Flag	Constant	What It Isolates
PID	`CLONE_NEWPID`	Process IDs — container sees PID 1
UTS	`CLONE_NEWUTS`	Hostname — container gets its own
Mount	`CLONE_NEWNS`	Mounts — container has its own mount table
Net	`CLONE_NEWNET`	Network — container has its own network stack (stretch goal)

Cgroup v2 File Layout

/sys/fs/cgroup/contain-<id>/
├── cgroup.procs      ← Write PID here to add process to group
├── memory.max        ← Max memory in bytes (e.g., "67108864" for 64M)
├── cpu.max           ← "quota period" (e.g., "50000 100000" = 50%)
└── pids.max          ← Max number of processes (e.g., "64")

Helper: Parse Memory Strings

func parseMemory(s string) (int64, error) {
    s = strings.TrimSpace(s)
    multipliers := map[byte]int64{'K': 1024, 'M': 1024 * 1024, 'G': 1024 * 1024 * 1024}
    if len(s) == 0 {
        return 0, fmt.Errorf("empty memory string")
    }
    last := s[len(s)-1]
    if m, ok := multipliers[last]; ok {
        n, err := strconv.ParseInt(s[:len(s)-1], 10, 64)
        return n * m, err
    }
    return strconv.ParseInt(s, 10, 64)
}

Suggested Structure

contain/
├── main.go           ← CLI entry point, run vs child dispatch
├── container.go      ← Namespace setup, re-exec, run the command
├── filesystem.go     ← pivot_root, mount /proc, unmount old root
├── cgroup.go         ← Create cgroup, write limits, cleanup
├── cgroup_test.go    ← Test memory parsing, limit file generation
└── rootfs/           ← Alpine mini rootfs (not committed to git)

Hints

Suggested approach:

Start with the re-exec pattern: parent creates a child process, child prints "hello from child" — verify it works

Add CLONE_NEWUTS and set the hostname — verify with hostname command

Add CLONE_NEWPID — verify child sees itself as PID 1

Add filesystem isolation: pivot_root into the Alpine rootfs

Mount /proc — verify ps shows only the container's processes

Add cgroup memory limits — verify by allocating memory beyond the limit

Add CPU and PID limits

Add cleanup logic with defer

Testing Memory Limits

# Inside the container, try to allocate more than the limit
/ # dd if=/dev/zero of=/dev/null bs=1M count=200
# Should be killed by OOM if memory limit is set below 200M

Testing PID Limits

# Inside the container, try to fork-bomb (safely!)
/ # for i in $(seq 1 100); do sleep 100 & done
# Should fail after hitting the pids.max limit

Debugging Tips

# Check what namespaces a process is in
ls -la /proc/self/ns/

# Check cgroup membership
cat /proc/self/cgroup

# Verify cgroup limits are applied
cat /sys/fs/cgroup/contain-*/memory.max

Safety Notes

This project requires root. sudo is necessary for namespace creation and pivot_root. Always test in a VM or disposable environment — never on your daily machine.
Use a VM or cloud instance. A cheap VPS or a local VM (Multipass, Vagrant, VirtualBox) is ideal. If something goes wrong with mount namespaces, you could leave stale mounts.
Always clean up cgroups. If your program crashes, leftover cgroup directories in /sys/fs/cgroup/ need manual removal with rmdir.
Don't add CLONE_NEWNET until you're comfortable. Without network setup, the container has no network. Add it as a stretch goal with veth pairs.
The rootfs is read-write. Your container can modify files in the rootfs directory. Use a fresh copy for testing or make it read-only.

Testing

Some parts (namespace creation, pivot_root) require root and are hard to unit test. Focus tests on the pure logic:

func TestParseMemory(t *testing.T) {
    tests := []struct {
        input string
        want  int64
    }{
        {"64M", 67108864},
        {"1G", 1073741824},
        {"512K", 524288},
        {"1048576", 1048576},
    }
    for _, tt := range tests {
        t.Run(tt.input, func(t *testing.T) {
            got, err := parseMemory(tt.input)
            if err != nil {
                t.Fatal(err)
            }
            if got != tt.want {
                t.Errorf("parseMemory(%q) = %d, want %d", tt.input, got, tt.want)
            }
        })
    }
}

func TestCgroupPaths(t *testing.T) {
    cg := newCgroup("test-container")
    if cg.path != "/sys/fs/cgroup/contain-test-container" {
        t.Errorf("unexpected cgroup path: %s", cg.path)
    }
}

func TestCPUMaxFormat(t *testing.T) {
    // 50% CPU = "50000 100000"
    got := formatCPUMax(50000)
    want := "50000 100000"
    if got != want {
        t.Errorf("formatCPUMax(50000) = %q, want %q", got, want)
    }
}

For integration testing, write a script that runs the full binary in a VM and checks output:

#!/bin/bash
# integration_test.sh — run in a VM with root
OUTPUT=$(sudo ./contain run --rootfs ./rootfs --hostname testbox -- /bin/hostname)
if [ "$OUTPUT" != "testbox" ]; then
    echo "FAIL: expected 'testbox', got '$OUTPUT'"
    exit 1
fi
echo "PASS"

Stretch Goals

Network namespace with veth pairs: Create a CLONE_NEWNET namespace and set up a veth pair to give the container network access through the host
OverlayFS layers: Use overlayfs to create a copy-on-write layer over the rootfs, so the base image isn't modified
Resource usage reporting: Read memory.current, cpu.stat, and pids.current from the cgroup and display usage on exit
Environment variables: Pass --env KEY=VALUE flags into the container
Volume mounts: Support --volume /host/path:/container/path bind mounts
Container listing: Track running containers in a state file and support contain list and contain kill

Skills Used: Syscalls (clone, pivot_root, mount, sethostname), namespaces, cgroups v2, process management, file I/O, exec.Command, SysProcAttr, byte parsing, defer for cleanup, CLI argument handling.

Objective checkpoint

Verify the project

not checked yet

Run this from the course repository. The command executes a fixed go vet ./... and go test -count=1 ./... inside your project; the browser can read the result but cannot execute commands.

node vibe.js project-check p4 "/path/to/container-runtime"

Start node vibe.js watch, then run the checkpoint command.

Engineering rubric

Pure path, mount-plan, and cgroup configuration logic has unprivileged unit tests.
Privileged integration checks are isolated to a disposable Linux VM.
Cleanup runs on success, error, signal, and child-process failure.
Safety limitations are explicit in the README.

Delayed production tickets

Return to one of these after at least a day, without reopening the implementation lesson first.

Inject a child-process failure and prove mounts and cgroup directories are still removed.
Add a pids.max option without widening the privileged code path.