Securing Container Filesystems with pivot_root
Discover how Docker uses Mount namespaces and the pivot_root system call to jail processes within a private, layered root filesystem.
The Architecture of Isolation
Traditional operating systems present a single global file tree in which every process shares a common view of the root directory. This architecture becomes a liability when multiple applications require conflicting versions of system libraries or when a compromised process gains the ability to traverse the entire host filesystem. To solve this, Docker creates a localized illusion where a process believes it exists on an entirely different machine with its own unique file structure.
The fundamental goal is to decouple the process from the host environment without the heavy overhead of hardware virtualization. This is achieved through a combination of kernel features that redirect file access requests at the virtual filesystem layer. By the time a process issues an open system call, the kernel has already mapped its relative path to a private, isolated directory tree.
Containerization is not about running a separate kernel but about providing a process with a restricted and private view of the existing kernel resources.
Modern container runtimes rely on a specific sequence of system calls to establish this jail. Before the application code ever executes, the runtime must prepare a specialized filesystem, isolate the mount table, and securely swap the root of the environment. This multi-stage process ensures that even if an attacker gains root privileges inside the container, they cannot see or touch the host files.
The Global Namespace Problem
In a standard Linux environment, the mount table is a global data structure managed by the kernel. When a new disk is attached or a filesystem is mounted, every process on the system immediately sees the changes. This lack of boundaries means that any process can theoretically access sensitive paths like /etc/shadow or /proc/kcore if its permissions allow it.
Containers break this shared state by giving each process group its own private copy of the mount table. This allows a web server to see a root directory containing only the necessary binaries and configuration files while the host remains completely invisible. The separation is enforced at the kernel level, maintaining strict security boundaries while adding no measurable overhead to file access, since path resolution simply starts from a different set of mount entries.
The Virtual Filesystem Layer
The Linux kernel uses the Virtual Filesystem (VFS) as an abstraction layer to manage different storage backends. Whether a file sits on an ext4 disk, a network share, or a temporary memory buffer, the VFS provides a unified interface for system calls. Container isolation hooks into this layer by manipulating how the VFS resolves pathnames for specific process namespaces.
By modifying the VFS mount entries for a specific process, the kernel can redirect a request for /bin/sh to a different physical location on the disk. This redirection is transparent to the application, which continues to use standard paths while remaining confined to a subdirectory of the host. This mechanism is the bedrock of what we perceive as a container image.
Mount Namespaces and Propagation
The first step in jailing a process is the creation of a new mount namespace using the clone or unshare system calls. When a process enters a new mount namespace, it receives a private copy of the host mount table. Changes made to mounts within this namespace, such as adding a new volume or unmounting a disk, do not affect the host or other containers.
However, simply copying the mount table is insufficient because of a feature called mount propagation. In many modern Linux distributions, mounts are marked as shared by default. This means that a mount event in one namespace could automatically propagate to another, potentially leaking host information into the container or allowing a container to unmount host filesystems.
- MS_SHARED: Events propagate both into and out of the mount namespace.
- MS_PRIVATE: The mount is completely isolated and does not share events with any other namespace.
- MS_SLAVE: The mount receives updates from the host but cannot send updates back.
- MS_UNBINDABLE: A private mount that cannot be used as a source for future bind mounts.
To ensure true isolation, container runtimes must explicitly change the propagation type of the entire namespace. This is typically done by recursively remounting the root directory with the MS_PRIVATE or MS_SLAVE flag. This step severs the connection between the container and host mount events, creating a hermetically sealed environment for filesystem operations.
Configuring Namespace Propagation
Setting up a private mount table requires precise execution to avoid accidental leaks. The runtime usually begins by calling unshare with the CLONE_NEWNS flag to detach from the parent namespace. Once inside the new namespace, it executes a recursive mount call on the root directory to reset the behavior of all existing and future mount points.
```c
#define _GNU_SOURCE
#include <sched.h>      // unshare(), CLONE_NEWNS
#include <sys/mount.h>  // mount(), MS_REC, MS_PRIVATE

// Create a new mount namespace
unshare(CLONE_NEWNS);

// Mark the entire root tree as private to prevent propagation leaks.
// This ensures that mounts inside the container stay inside the container.
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
```

This code snippet illustrates the critical hardening step performed by runtimes like runc. By marking the root as private, we ensure that if the container later mounts a temporary filesystem for its own use, that mount point will not appear on the host system. It also prevents the host from accidentally mounting a sensitive volume into an already running container.
Securing the Root with Pivot Root
For many years, the chroot system call was the primary method for changing a process's root directory. While chroot effectively changes the starting point for path resolution, it is fundamentally insecure for containerization. A process with root privileges can easily escape a chroot jail by creating a new directory, chrooting into it, and then using relative paths to climb back up to the real host root.
The pivot_root system call provides a more robust solution by swapping the actual root mount of the entire namespace. Unlike chroot, which only affects the process view, pivot_root moves the host root to a temporary location and promotes a new directory to be the permanent root of the mount namespace. This operation is much harder to subvert because it alters the underlying mount structure rather than just the process state.
Warning: The chroot system call is a filesystem convenience, not a security boundary. Always use pivot_root for container isolation to prevent directory traversal escapes.
Using pivot_root comes with strict technical requirements to maintain system stability. The new root must be a mount point itself, which is why container runtimes often bind-mount a directory onto itself before pivoting. Additionally, the old root must be moved to a subdirectory within the new root so it can be safely unmounted once the transition is complete.
The Mechanics of the Pivot
The transition from host root to container root involves a surgical sequence of mounts and unmounts. First, the runtime identifies the directory containing the container filesystem. It then creates a placeholder directory inside that filesystem to hold the old host root temporarily during the swap.
After the pivot_root call executes, the process sees the new directory as the root and the old host root as a subdirectory. To finalize the isolation, the runtime must unmount the old host root and remove the placeholder directory. This leaves the process with absolutely no path, relative or absolute, back to the host filesystem.
```go
// 1. Bind mount the new root to itself to satisfy pivot_root requirements
syscall.Mount(rootfs, rootfs, "", syscall.MS_BIND|syscall.MS_REC, "")

// 2. Create the temporary directory for the old root
oldRoot := filepath.Join(rootfs, ".old_host_root")
os.MkdirAll(oldRoot, 0700)

// 3. Swap the roots
syscall.PivotRoot(rootfs, oldRoot)

// 4. Change current directory to the new root
os.Chdir("/")

// 5. Unmount and remove the old host root
syscall.Unmount("/.old_host_root", syscall.MNT_DETACH)
os.RemoveAll("/.old_host_root")
```

This sequence ensures that the process is completely disconnected from the host. By the time the application starts, its entire universe is confined to the provided rootfs. Any attempt to use dot-dot to move above the root will simply land the process back at the same root directory, just as it does on a standard Linux host.
Layered Filesystems and the Rootfs
The root filesystem that we pivot into is rarely a simple directory. In Docker, it is a dynamic composition of multiple read-only image layers and a single writable container layer. This is achieved using OverlayFS, a union filesystem that merges multiple directories into a single unified view. This layering is what allows Docker to share common base images across hundreds of containers while using minimal disk space.
OverlayFS works by defining a lower directory for the base image, an upper directory for the changes made by the container, and a merged directory that acts as the final view. When a process reads a file, the kernel looks for it in the upper layer first; if it is not found, it falls back to the lower layers. This enables the efficient copy-on-write behavior that defines container performance.
When a container is started, the runtime first assembles these layers using the mount system call with the overlay type. Only after this merged view is created does the mount namespace and pivot_root logic take over. The result is a highly efficient, isolated, and writable environment that can be spun up in milliseconds.
- Lowerdir: One or more read-only directories containing the base image contents.
- Upperdir: A writable directory where all container-specific changes are stored.
- Merged: The mount point that combines all layers into a single coherent file tree.
- Workdir: A hidden directory used by the kernel to manage atomic file operations during copy-ups.
Copy-on-Write Performance
One of the biggest advantages of this layered approach is the copy-on-write mechanism. When a process attempts to modify a file that exists in the read-only lower layer, the kernel automatically copies that file to the writable upper layer before allowing the write to proceed. This ensures that the original image remains untouched and can be reused by other containers.
Developers should be aware that while this process is fast, modifying large files for the first time can incur a small latency penalty. For write-heavy workloads, it is often better to use Docker volumes, which bypass the OverlayFS layers entirely. Understanding this trade-off is key to optimizing containerized applications for high-throughput environments.
