Build a minimal container runtime using Linux namespaces and cgroups. Run a process in an isolated environment with resource limits.
What You're Building
A contain command that:
- Takes a command and rootfs path as arguments
- Creates new Linux namespaces (PID, UTS, mount, net)
- Sets up filesystem isolation with pivot_root
- Mounts /proc inside the container
- Applies cgroup resource limits (memory, CPU)
- Sets a hostname for the isolated environment
- Runs the specified command in this isolated context
Why This Project
This project connects Linux isolation primitives to Go code: namespaces, cgroups, pivot_root, process execution, and cleanup. It is Linux-specific and uses real kernel features, so it gives you a concrete way to inspect the mechanics behind container runtimes.
Usage
# Run a shell in an isolated container
sudo ./contain run --rootfs /path/to/alpine-rootfs -- /bin/sh
# Run with memory limit
sudo ./contain run --rootfs ./rootfs --memory 64M -- /bin/sh
# Run with CPU and PID limits
sudo ./contain run --rootfs ./rootfs --memory 128M --cpu 50000 --pids 64 -- /bin/sh
# Run a specific command
sudo ./contain run --rootfs ./rootfs --hostname mycontainer -- /bin/echo "hello from container"
Getting a Root Filesystem
You need a minimal Linux rootfs to use as the container's filesystem. Alpine is perfect:
# Download and extract Alpine mini root filesystem
mkdir rootfs
curl -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz
tar -xzf alpine.tar.gz -C rootfs
Expected Output
$ sudo ./contain run --rootfs ./rootfs --memory 64M --hostname sandbox -- /bin/sh
[contain] creating namespaces: pid uts mnt
[contain] setting hostname: sandbox
[contain] setting up rootfs: ./rootfs
[contain] pivoting root
[contain] mounting /proc
[contain] applying cgroup limits: memory=64M
[contain] running: /bin/sh
/ # hostname
sandbox
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /bin/sh
2 root 0:00 ps aux
/ # cat /proc/self/cgroup
0::/
/ # ls /
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ # exit
[contain] container exited: exit status 0
[contain] cleaning up cgroups
Requirements
Core
- CLI parsing: Use
os.Argsor theflagpackage. The program needs arunsubcommand with flags for--rootfs,--memory,--cpu,--pids, and--hostname. - Re-exec pattern: The program must fork itself with a special subcommand (e.g.,
child) to perform setup inside the new namespaces. This is the standard pattern — you can't set up the new mount namespace from the parent.// Parent: create namespaces, re-exec into them cmd := exec.Command("/proc/self/exe", append([]string{"child"}, args...)...) cmd.SysProcAttr = &syscall.SysProcAttr{ Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS, } // Child: now inside new namespaces, set up the environment - Namespace creation: Create new PID, UTS, and mount namespaces using
Cloneflagsonexec.Cmd.SysProcAttr. - Hostname: Set the hostname inside the UTS namespace using
syscall.Sethostname. - Filesystem isolation:
- Mount the rootfs directory
- Create an
oldrootdirectory inside - Call
syscall.PivotRoot(rootfs, oldroot)to swap roots os.Chdir("/")into the new root- Unmount and remove the old root
- Mount /proc: Mount a new proc filesystem so
psworks inside the container:syscall.Mount("proc", "/proc", "proc", 0, "") - Cgroup v2 resource limits: Write limits to the cgroup filesystem:
- Memory: write to
memory.maxin/sys/fs/cgroup/<name>/ - CPU: write to
cpu.max(e.g.,"50000 100000"for 50% CPU) - PIDs: write to
pids.max - Add the child process to the cgroup by writing its PID to
cgroup.procs
- Memory: write to
- Cleanup: Remove the cgroup directory on exit. Use
deferto ensure cleanup runs.
Namespace Flags
| Flag | Constant | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID |
Process IDs — container sees PID 1 |
| UTS | CLONE_NEWUTS |
Hostname — container gets its own |
| Mount | CLONE_NEWNS |
Mounts — container has its own mount table |
| Net | CLONE_NEWNET |
Network — container has its own network stack (stretch goal) |
Cgroup v2 File Layout
/sys/fs/cgroup/contain-<id>/
├── cgroup.procs ← Write PID here to add process to group
├── memory.max ← Max memory in bytes (e.g., "67108864" for 64M)
├── cpu.max ← "quota period" (e.g., "50000 100000" = 50%)
└── pids.max ← Max number of processes (e.g., "64")
Helper: Parse Memory Strings
func parseMemory(s string) (int64, error) {
s = strings.TrimSpace(s)
multipliers := map[byte]int64{'K': 1024, 'M': 1024 * 1024, 'G': 1024 * 1024 * 1024}
if len(s) == 0 {
return 0, fmt.Errorf("empty memory string")
}
last := s[len(s)-1]
if m, ok := multipliers[last]; ok {
n, err := strconv.ParseInt(s[:len(s)-1], 10, 64)
return n * m, err
}
return strconv.ParseInt(s, 10, 64)
}
Suggested Structure
contain/
├── main.go ← CLI entry point, run vs child dispatch
├── container.go ← Namespace setup, re-exec, run the command
├── filesystem.go ← pivot_root, mount /proc, unmount old root
├── cgroup.go ← Create cgroup, write limits, cleanup
├── cgroup_test.go ← Test memory parsing, limit file generation
└── rootfs/ ← Alpine mini rootfs (not committed to git)
Hints
Suggested approach:
- Start with the re-exec pattern: parent creates a child process, child prints "hello from child" — verify it works
- Add
CLONE_NEWUTSand set the hostname — verify withhostnamecommand- Add
CLONE_NEWPID— verify child sees itself as PID 1- Add filesystem isolation:
pivot_rootinto the Alpine rootfs- Mount
/proc— verifypsshows only the container's processes- Add cgroup memory limits — verify by allocating memory beyond the limit
- Add CPU and PID limits
- Add cleanup logic with
defer
Testing Memory Limits
# Inside the container, try to allocate more than the limit
/ # dd if=/dev/zero of=/dev/null bs=1M count=200
# Should be killed by OOM if memory limit is set below 200M
Testing PID Limits
# Inside the container, try to fork-bomb (safely!)
/ # for i in $(seq 1 100); do sleep 100 & done
# Should fail after hitting the pids.max limit
Debugging Tips
# Check what namespaces a process is in
ls -la /proc/self/ns/
# Check cgroup membership
cat /proc/self/cgroup
# Verify cgroup limits are applied
cat /sys/fs/cgroup/contain-*/memory.max
Safety Notes
- This project requires root.
sudois necessary for namespace creation and pivot_root. Always test in a VM or disposable environment — never on your daily machine. - Use a VM or cloud instance. A cheap VPS or a local VM (Multipass, Vagrant, VirtualBox) is ideal. If something goes wrong with mount namespaces, you could leave stale mounts.
- Always clean up cgroups. If your program crashes, leftover cgroup directories in
/sys/fs/cgroup/need manual removal withrmdir. - Don't add
CLONE_NEWNETuntil you're comfortable. Without network setup, the container has no network. Add it as a stretch goal with veth pairs. - The rootfs is read-write. Your container can modify files in the rootfs directory. Use a fresh copy for testing or make it read-only.
Testing
Some parts (namespace creation, pivot_root) require root and are hard to unit test. Focus tests on the pure logic:
func TestParseMemory(t *testing.T) {
tests := []struct {
input string
want int64
}{
{"64M", 67108864},
{"1G", 1073741824},
{"512K", 524288},
{"1048576", 1048576},
}
for _, tt := range tests {
t.Run(tt.input, func(t *testing.T) {
got, err := parseMemory(tt.input)
if err != nil {
t.Fatal(err)
}
if got != tt.want {
t.Errorf("parseMemory(%q) = %d, want %d", tt.input, got, tt.want)
}
})
}
}
func TestCgroupPaths(t *testing.T) {
cg := newCgroup("test-container")
if cg.path != "/sys/fs/cgroup/contain-test-container" {
t.Errorf("unexpected cgroup path: %s", cg.path)
}
}
func TestCPUMaxFormat(t *testing.T) {
// 50% CPU = "50000 100000"
got := formatCPUMax(50000)
want := "50000 100000"
if got != want {
t.Errorf("formatCPUMax(50000) = %q, want %q", got, want)
}
}
For integration testing, write a script that runs the full binary in a VM and checks output:
#!/bin/bash
# integration_test.sh — run in a VM with root
OUTPUT=$(sudo ./contain run --rootfs ./rootfs --hostname testbox -- /bin/hostname)
if [ "$OUTPUT" != "testbox" ]; then
echo "FAIL: expected 'testbox', got '$OUTPUT'"
exit 1
fi
echo "PASS"
Stretch Goals
- Network namespace with veth pairs: Create a
CLONE_NEWNETnamespace and set up a veth pair to give the container network access through the host - OverlayFS layers: Use overlayfs to create a copy-on-write layer over the rootfs, so the base image isn't modified
- Resource usage reporting: Read
memory.current,cpu.stat, andpids.currentfrom the cgroup and display usage on exit - Environment variables: Pass
--env KEY=VALUEflags into the container - Volume mounts: Support
--volume /host/path:/container/pathbind mounts - Container listing: Track running containers in a state file and support
contain listandcontain kill
Skills Used: Syscalls (clone, pivot_root, mount, sethostname), namespaces, cgroups v2, process management, file I/O, exec.Command, SysProcAttr, byte parsing, defer for cleanup, CLI argument handling.