Build a minimal container runtime using Linux namespaces and cgroups. Run isolated processes with resource limits. Applies Module 10.
A contain command that:
You've been running Docker containers for years. Now you understand what Docker actually does. This project is Linux-specific (it won't work on macOS) and uses real kernel features. It's the deepest systems programming project in the course.
If you can explain namespaces, cgroups, and pivot_root in an interview — and point to code you wrote that uses them — you're operating at a level most DevOps engineers never reach.
# Run a shell in an isolated container
sudo ./contain run --rootfs /path/to/alpine-rootfs -- /bin/sh
# Run with memory limit
sudo ./contain run --rootfs ./rootfs --memory 64M -- /bin/sh
# Run with CPU and PID limits
sudo ./contain run --rootfs ./rootfs --memory 128M --cpu 50000 --pids 64 -- /bin/sh
# Run a specific command
sudo ./contain run --rootfs ./rootfs --hostname mycontainer -- /bin/echo "hello from container"
You need a minimal Linux rootfs to use as the container's filesystem. Alpine is perfect:
# Download and extract Alpine mini root filesystem
mkdir rootfs
curl -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz
tar -xzf alpine.tar.gz -C rootfs
$ sudo ./contain run --rootfs ./rootfs --memory 64M --hostname sandbox -- /bin/sh
[contain] creating namespaces: pid uts mnt
[contain] setting hostname: sandbox
[contain] setting up rootfs: ./rootfs
[contain] pivoting root
[contain] mounting /proc
[contain] applying cgroup limits: memory=64M
[contain] running: /bin/sh
/ # hostname
sandbox
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /bin/sh
2 root 0:00 ps aux
/ # cat /proc/self/cgroup
0::/
/ # ls /
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ # exit
[contain] container exited: exit status 0
[contain] cleaning up cgroups
- Parse the command line with os.Args or the flag package. The program needs a run subcommand with flags for --rootfs, --memory, --cpu, --pids, and --hostname.
- Re-execute the binary itself (with a marker argument such as child) to perform setup inside the new namespaces. This is the standard pattern: you can't set up the new mount namespace from the parent.

// Parent: create namespaces, re-exec into them
cmd := exec.Command("/proc/self/exe", append([]string{"child"}, args...)...)
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS,
}
// Child: now inside new namespaces, set up the environment
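The argument parsing described above could be sketched with the flag package as follows. The `runConfig` struct, the `parseRunArgs` helper, and the choice to treat 0 or an empty string as "no limit" are assumptions made for illustration, not part of the project specification.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// runConfig is a hypothetical struct holding the options of `contain run`.
type runConfig struct {
	Rootfs   string
	Memory   string   // raw value such as "64M"; converted to bytes later
	CPU      int      // cpu.max quota in microseconds; 0 means no limit (assumption)
	Pids     int      // pids.max value; 0 means no limit (assumption)
	Hostname string   // empty means "leave the hostname alone" (assumption)
	Command  []string // everything after the "--" separator
}

// parseRunArgs parses the arguments that follow the "run" subcommand.
func parseRunArgs(args []string) (runConfig, error) {
	var cfg runConfig
	fs := flag.NewFlagSet("run", flag.ContinueOnError)
	fs.StringVar(&cfg.Rootfs, "rootfs", "", "path to the container root filesystem")
	fs.StringVar(&cfg.Memory, "memory", "", "memory limit, e.g. 64M")
	fs.IntVar(&cfg.CPU, "cpu", 0, "CPU quota in microseconds per 100000us period")
	fs.IntVar(&cfg.Pids, "pids", 0, "maximum number of processes")
	fs.StringVar(&cfg.Hostname, "hostname", "", "hostname inside the container")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	// The flag package stops at "--"; the remaining arguments are the command.
	cfg.Command = fs.Args()
	if cfg.Rootfs == "" {
		return cfg, errors.New("--rootfs is required")
	}
	if len(cfg.Command) == 0 {
		return cfg, errors.New("no command given after --")
	}
	return cfg, nil
}

func main() {
	// Demo with a fixed argument list; a real main would pass os.Args[2:].
	cfg, err := parseRunArgs([]string{"--rootfs", "./rootfs", "--memory", "64M", "--", "/bin/sh"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("rootfs=%s memory=%s command=%v\n", cfg.Rootfs, cfg.Memory, cfg.Command)
}
```

The flag package accepts both `-rootfs` and `--rootfs`, so the double-dash style shown in the usage examples works without extra code.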
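- Set the namespace flags via Cloneflags on exec.Cmd.SysProcAttr.
- Set the hostname with syscall.Sethostname.
- Create an oldroot directory inside the new rootfs.
- Call syscall.PivotRoot(rootfs, oldroot) to swap roots.
- os.Chdir("/") into the new root.
- Mount /proc so that ps works inside the container:

syscall.Mount("proc", "/proc", "proc", 0, "")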
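The filesystem steps above can be sketched as a single function. Two calls here are not in the steps listed and are assumptions about what a typical host needs: pivot_root requires the new root to be a mount point, so the sketch bind-mounts the rootfs onto itself, and it marks / as private first because pivot_root refuses to operate under shared mount propagation. The names `setupRootfs` and `oldRootPath` are hypothetical. This code must only run in the re-exec'd child, inside a new mount namespace.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// oldRootName is the directory inside the new root that temporarily holds the old root.
const oldRootName = "oldroot"

// oldRootPath returns where the old root is parked during pivot_root.
func oldRootPath(rootfs string) string {
	return filepath.Join(rootfs, oldRootName)
}

// setupRootfs swaps the root filesystem and mounts /proc.
// It must run inside a new mount namespace (CLONE_NEWNS), never on the host.
func setupRootfs(rootfs string) error {
	rootfs, err := filepath.Abs(rootfs)
	if err != nil {
		return err
	}
	// Assumption: the host mounts / as shared; make it private in this namespace.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	// pivot_root needs the new root to be a mount point: bind-mount it onto itself.
	if err := syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount rootfs: %w", err)
	}
	oldroot := oldRootPath(rootfs)
	if err := os.MkdirAll(oldroot, 0o700); err != nil {
		return fmt.Errorf("create oldroot: %w", err)
	}
	if err := syscall.PivotRoot(rootfs, oldroot); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return fmt.Errorf("chdir /: %w", err)
	}
	if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
		return fmt.Errorf("mount /proc: %w", err)
	}
	// Detach the old root so the host filesystem is no longer reachable.
	if err := syscall.Unmount("/"+oldRootName, syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return os.Remove("/" + oldRootName)
}

func main() {
	// Guard: refuse to touch mounts unless explicitly invoked as the child.
	if len(os.Args) != 3 || os.Args[1] != "child" {
		fmt.Println("usage: <binary> child <rootfs>  (only inside a new mount namespace)")
		return
	}
	if err := setupRootfs(os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, "setup failed:", err)
		os.Exit(1)
	}
}
```

Wrapping each error with the step name makes failures much easier to diagnose, since most of these calls fail with a bare EINVAL or EPERM.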
- Write the memory limit to memory.max in /sys/fs/cgroup/<name>/.
- Write the CPU limit to cpu.max (e.g., "50000 100000" for 50% CPU).
- Write the process limit to pids.max.
- Add the container process by writing its PID to cgroup.procs.
- Use defer to ensure cleanup runs.

| Namespace | Constant | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs: the container sees its first process as PID 1 |
| UTS | CLONE_NEWUTS | Hostname: the container gets its own |
| Mount | CLONE_NEWNS | Mounts: the container has its own mount table |
| Net | CLONE_NEWNET | Network: the container has its own network stack (stretch goal) |
/sys/fs/cgroup/contain-<id>/
├── cgroup.procs ← Write PID here to add process to group
├── memory.max ← Max memory in bytes (e.g., "67108864" for 64M)
├── cpu.max ← "quota period" (e.g., "50000 100000" = 50%)
└── pids.max ← Max number of processes (e.g., "64")
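import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemory converts a size such as "64M" or "1048576" into bytes.
// Recognized suffixes are K, M, and G (powers of 1024); no suffix means bytes.
func parseMemory(s string) (int64, error) {
	s = strings.TrimSpace(s)
	multipliers := map[byte]int64{'K': 1024, 'M': 1024 * 1024, 'G': 1024 * 1024 * 1024}
	if len(s) == 0 {
		return 0, fmt.Errorf("empty memory string")
	}
	last := s[len(s)-1]
	if m, ok := multipliers[last]; ok {
		n, err := strconv.ParseInt(s[:len(s)-1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("invalid memory value %q: %w", s, err)
		}
		return n * m, nil
	}
	return strconv.ParseInt(s, 10, 64)
}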
contain/
├── main.go ← CLI entry point, run vs child dispatch
├── container.go ← Namespace setup, re-exec, run the command
├── filesystem.go ← pivot_root, mount /proc, unmount old root
├── cgroup.go ← Create cgroup, write limits, cleanup
├── cgroup_test.go ← Test memory parsing, limit file generation
└── rootfs/ ← Alpine mini rootfs (not committed to git)
Suggested approach:
- Start with the re-exec pattern: parent creates a child process, child prints "hello from child", then verify it works
- Add CLONE_NEWUTS and set the hostname, then verify with the hostname command
- Add CLONE_NEWPID, then verify the child sees itself as PID 1
- Add filesystem isolation: pivot_root into the Alpine rootfs
- Mount /proc, then verify ps shows only the container's processes
- Add cgroup memory limits, then verify by allocating memory beyond the limit
- Add CPU and PID limits
- Add cleanup logic with defer
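The first step of the approach can be tried without root, since it creates no namespaces yet. The sketch below is one possible starting point; `childArgs`, `runParent`, and `runChild` are hypothetical names, and /proc/self/exe assumes Linux.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childArgs builds the argument list for the re-exec'd process:
// the "child" marker followed by the user's command.
func childArgs(command []string) []string {
	return append([]string{"child"}, command...)
}

// runParent re-executes the current binary; on Linux, /proc/self/exe
// always refers to the running executable.
func runParent(command []string) error {
	cmd := exec.Command("/proc/self/exe", childArgs(command)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	// Later steps set cmd.SysProcAttr with Cloneflags here (requires root).
	return cmd.Run()
}

// runChild is where hostname, rootfs, and /proc setup will go in later steps.
func runChild(command []string) {
	fmt.Printf("hello from child (pid %d), would run: %v\n", os.Getpid(), command)
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		runChild(os.Args[2:])
		return
	}
	if err := runParent(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "re-exec failed:", err)
		os.Exit(1)
	}
}
```

Once this prints the greeting, each later step adds one clone flag or one setup call, which keeps failures easy to attribute to the most recent change.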
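# Inside the container, try to allocate more than the limit.
# (Copying /dev/zero to /dev/null with dd only ever holds one block in memory, so it will not hit the limit.)
/ # sh -c 'x=a; while true; do x="$x$x"; done'
# The inner shell should be OOM-killed once its memory use exceeds memory.max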
# Inside the container, try to fork-bomb (safely!)
/ # for i in $(seq 1 100); do sleep 100 & done
# Should fail after hitting the pids.max limit
# Check what namespaces a process is in
ls -la /proc/self/ns/
# Check cgroup membership
cat /proc/self/cgroup
# Verify cgroup limits are applied
cat /sys/fs/cgroup/contain-*/memory.max
- sudo is necessary for namespace creation and pivot_root. Always test in a VM or disposable environment, never on your daily machine.
- If the program exits without cleaning up, leftover cgroup directories under /sys/fs/cgroup/ need manual removal with rmdir.
- Leave out CLONE_NEWNET until you're comfortable. Without network setup, the container has no network. Add it as a stretch goal with veth pairs.

Some parts (namespace creation, pivot_root) require root and are hard to unit test. Focus tests on the pure logic:
func TestParseMemory(t *testing.T) {
	tests := []struct {
		input string
		want  int64
	}{
		{"64M", 67108864},
		{"1G", 1073741824},
		{"512K", 524288},
		{"1048576", 1048576},
	}
	for _, tt := range tests {
		t.Run(tt.input, func(t *testing.T) {
			got, err := parseMemory(tt.input)
			if err != nil {
				t.Fatal(err)
			}
			if got != tt.want {
				t.Errorf("parseMemory(%q) = %d, want %d", tt.input, got, tt.want)
			}
		})
	}
}

func TestCgroupPaths(t *testing.T) {
	cg := newCgroup("test-container")
	if cg.path != "/sys/fs/cgroup/contain-test-container" {
		t.Errorf("unexpected cgroup path: %s", cg.path)
	}
}

func TestCPUMaxFormat(t *testing.T) {
	// 50% CPU = "50000 100000"
	got := formatCPUMax(50000)
	want := "50000 100000"
	if got != want {
		t.Errorf("formatCPUMax(50000) = %q, want %q", got, want)
	}
}
For integration testing, write a script that runs the full binary in a VM and checks output:
#!/bin/bash
# integration_test.sh: run in a VM with root
# Assumes the [contain] log lines go to stderr, so stdout holds only the command's output
OUTPUT=$(sudo ./contain run --rootfs ./rootfs --hostname testbox -- /bin/hostname)
if [ "$OUTPUT" != "testbox" ]; then
echo "FAIL: expected 'testbox', got '$OUTPUT'"
exit 1
fi
echo "PASS"
- Add a CLONE_NEWNET namespace and set up a veth pair to give the container network access through the host
- Read memory.current, cpu.stat, and pids.current from the cgroup and display usage on exit
- Pass --env KEY=VALUE flags into the container
- Support --volume /host/path:/container/path bind mounts
- Add contain list and contain kill subcommands

Skills Used: Syscalls (clone, pivot_root, mount, sethostname), namespaces, cgroups v2, process management, file I/O, exec.Command, SysProcAttr, byte parsing, defer for cleanup, CLI argument handling.