This document describes security aspects of Sysbox system containers.
- Root Filesystem Jail
- Linux Namespaces
- User Namespace ID Mapping
- Procfs Virtualization
- Sysfs Virtualization
- Process Capabilities
- System Calls
- System Call Interception
- Devices
- Resource Limiting & Cgroups
- No-New-Privileges Flag
- Support for Linux Security Modules (LSMs)
- Out-of-Memory Score Adjustment
System container processes are confined to the directory hierarchy associated with the container's root filesystem, plus any configured mounts (e.g., Docker volumes or bind-mounts).
System containers deployed with Sysbox always use all Linux namespaces.
That is, when you deploy a system container with Sysbox, (e.g., docker run --runtime=sysbox-runc -it alpine:latest
), Sysbox will always use
all Linux namespaces to create the container.
This is done for enhanced container isolation.
It's one area where Sysbox deviates from the OCI specification, which leaves it to the higher layer (e.g., Docker + containerd) to choose the namespaces that should be enabled for the container.
The table below shows a comparison on namespace usage between system containers and regular Docker containers.
Namespace | Docker + Sysbox | Docker + OCI runc |
---|---|---|
mount | Yes | Yes |
pid | Yes | Yes |
uts | Yes | Yes |
net | Yes | Yes |
ipc | Yes | Yes |
cgroup | Yes | No |
user | Yes | No by default; Yes when Docker engine is configured with userns-remap mode |
By virtue of using the Linux user namespace, system containers get:
-
Stronger container isolation (e.g., root in the container maps to an unprivileged user on the host).
-
The root user inside the container has full privileges (i.e., all capabilities) within the container.
Refer to the kernel's user_namespaces manual page for more info.
The Linux cgroup namespace helps isolation by hiding host paths in cgroup
information exposed inside the system container via /proc
.
Refer to the kernel's cgroup_namespaces manual page for more info.
The Linux user namespace works by mapping user-IDs and group-IDs between the container and the host.
Sysbox performs the mapping as follows:
-
If the container manager (e.g., Docker) tells Sysbox to run the container with the user-namespace enabled, Sysbox honors the user-ID mappings provided by the container manager.
-
Otherwise, Sysbox automatically enables the user-namespace in the container and allocates user-ID mappings for it.
We call these "Directed userns ID mapping" and "Auto userns ID mapping" respectively.
The following sub-sections describe these in further detail.
Recommendation:
If your kernel has the shiftfs
module (you can check by running lsmod | grep shiftfs
), then
Auto Userns ID Mapping is preferred. Otherwise, you must use Directed Userns ID
Mapping (e.g., by configuring Docker with userns-remap).
When the container manager (e.g., Docker) tells Sysbox to enable the user-namespace in containers, Sysbox honors the user-ID mappings provided by the higher layer.
For Docker specifically, this occurs when the Docker daemon is configured with "userns-remap", as described this Docker document.
There is one advantage of Directed userns ID mapping:
- Sysbox does not need the Linux kernel's shiftfs module. This means Sysbox can run in kernels that don't carry that module (e.g., Ubuntu cloud images).
But there are a couple of drawbacks:
-
Configuring Docker with userns-remap places a few functional limitations on regular Docker containers (those launched with Docker's default runc).
-
Docker assigns the same user-ID mapping to all containers. This is not ideal: it would be better to assign each container exclusive user-ID mappings for improved cross-container isolation.
When the container manager does not specify the user-namespace for a container, Sysbox automatically enables it and allocates user-ID mappings for the container.
For Docker specifically, this occurs when Docker is not configured with userns-remap (by default, Docker is not configured with userns-remap).
This has a couple of advantages over Directed userns ID mapping:
-
No change in the configuration of the container manager (e.g., Docker) is required.
-
Sysbox allocates exclusive user-IDs to each container, which results in enhanced cross-container isolation.
The rest of this section describes how Sysbox allocates user-ID mappings for the containers.
By way of example: if we launch two containers with Sysbox, notice the ID mappings assigned to each:
$ docker run --runtime=sysbox-runc --name=syscont1 --rm -d alpine tail -f /dev/null
16c1abcc48259a47ef749e2d292ceef6a9f7d6ab815a6a5d12f06efc3c09d0ce
$ docker run --runtime=sysbox-runc --name=syscont2 --rm -d alpine tail -f /dev/null
573843fceac623a93278aafd4d8142bf631bc1b214b1bcfcd183b1be77a00b69
$ docker exec syscont1 cat /proc/self/uid_map
0 165536 65536
$ docker exec syscont2 cat /proc/self/uid_map
0 231072 65536
Each system container gets an exclusive range of 64K user IDs. For syscont1, user IDs [0, 65536] are mapped to host user IDs [165536, 231071]. And for syscont2 user IDs [0, 65536] are mapped to host user IDs [231072, 65536].
The same applies to the group IDs.
The reason 64K user-IDs are given to each system container is to allow the
container to have IDs ranging from the root
(ID 0) all the way up to user
nobody
(ID 65534).
Exclusive ID mappings ensure that if a container process somehow escapes the container's root filesystem jail, it will find itself without any permissions to access any other files in the host or in other containers.
The exclusive host user IDs chosen by Sysbox are obtained from the /etc/subuid
and /etc/subgid
files:
$ more /etc/subuid
cesar:100000:65536
sysbox:165536:268435456
$ more /etc/subgid
cesar:100000:65536
sysbox:165536:268435456
These files are automatically configured by Sysbox during installation (or more
specifically when the sysbox-mgr
component is started during installation)
By default, Sysbox reserves a range of 268435456 user IDs (enough to accommodate 4K system containers, each with 64K user IDs).
If more than 4K containers are running at the same time, Sysbox will by default
re-use user-ID mappings from the range specified in /etc/subuid
. The same
applies to group-ID mappings. In this scenario multiple system containers may
share the same user-ID mapping, reducing container-to-container isolation a bit.
For extra security, it's possible to configure Sysbox to not re-use mappings and instead fail to launch new system containers until host user IDs become available (i.e., when other system containers are stopped).
The size of the reserved ID range, as well as the policy in case the range is
exhausted, is configurable via the sysbox-mgr command line. If you wish to
change this, See sudo sysbox-mgr --help
and use the Sysbox reconfiguration procedure.
Auto userns ID mapping requires the presence of the shiftfs module in the Linux kernel.
Sysbox will check for this. If the module is required but not present in the Linux kernel, Sysbox will fail to launch containers and issue an error such as this one.
Note that shiftfs is present in Ubuntu Desktop and Server editions, but likely not present in Ubuntu cloud editions.
Sysbox performs partial virtualization of procfs inside the system container. This is a key differentiating feature of Sysbox.
The goal for this virtualization is to expose procfs as read-write yet ensure the system container processes cannot modify system-wide settings on the host via procfs.
The virtualization is "partial" because many resources under procfs are already "namespaced" (i.e., isolated) by the Linux kernel. However, many others are not, and it is for those that Sysbox performs the virtualization.
The virtualization of procfs inside the container not only occurs on "/proc", but also in any other procfs mountpoints inside the system container.
By extension, this means that inner containers also see a virtualized procfs, ensuring they too are well isolated.
Procfs virtualization is independent among system containers, meaning that each system container gets its own virtualized procfs: any changes it does to it are not seen in other system containers.
Sysbox takes care of exposing and tracking the procfs contexts within each system container.
Sysbox performs partial virtualization of sysfs inside the system container, for the same reasons as with procfs (see prior section).
In addition, the /sys/fs/cgroup
sub-directory is mounted read-write to allow
system container processes to assign cgroup resources within the system
container. System container processes can only use cgroups to assign a subset of
the resources assigned to the system container itself. Processes inside the
system container can't modify cgroup resources assigned to the system container
itself.
A system container's init process configured with user-ID 0 (root) always starts with all capabilities enabled.
$ docker run --runtime=sysbox-runc -it alpine:latest
/ # grep -i cap /proc/self/status
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000003fffffffff
Note that the system container's Linux user-namespace ensures that these capabilities are only applicable to resources assigned to the system container itself. The process has no capabilities on resources not assigned to the system container.
A system container's init process configured with a non-root user-ID starts with no capabilities. For example, when deploying system containers with Docker:
$ docker run --runtime=sysbox-runc --user 1000 -it alpine:latest
/ $ grep -i cap /proc/self/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
This mimics the way capabilities are assigned to users on a physical host or VM.
System containers deployed with Sysbox allow a minimum set of 300+ syscalls, using Linux seccomp.
Significant syscalls blocked within system containers are the same as those listed in this Docker article, except that system container allow these system calls too:
mount
umount2
add_key
request_key
keyctl
pivot_root
gethostname
sethostname
setns
unshare
It's currently not possible to reduce the set of syscalls allowed within
a system container (i.e., the Docker --security-opt seccomp=<profile>
option
is not supported).
Sysbox performs selective system call interception on a few "control-path"
system calls, such as mount
and umount2
.
This is another key feature of Sysbox, and it's done in order to perform proper procfs and sysfs virtualization.
Sysbox does this very selectively in order to ensure performance of processes inside the system container is not impacted.
The following devices are always present in the system container:
/dev/null
/dev/zero
/dev/full
/dev/random
/dev/urandom
/dev/tty
Additional devices may be added by the container engine. For example, when deploying system containers with Docker, you typically see the following devices in addition to the ones listed above:
/dev/console
/dev/pts
/dev/mqueue
/dev/shm
Sysbox does not currently support exposing host devices inside system
containers (e.g., via the docker run --device
option). We are
working on adding support for this.
System container resource consumption can be limited via cgroups.
This can be used to balance resource consumption as well as to prevent denial-of-service attacks in which a buggy or compromised system container consumes all available resources in the system.
For example, when using Docker to deploy system containers, the
docker run --cpu*
, --memory*
, --blkio*
, etc., settings can be
used for this purpose.
Modern versions of the Linux kernel (>= 3.5) support a per-process
attribute called no_new_privs
.
It's a security feature meant to ensure a child process can't gain more privileges than its parent process.
Once this attribute is set on a process it can't be unset, and it's inherited by all child and further descendant processes.
The details are explained here and here.
The no_new_privs
attribute may be set on the init process of a
container, for example, using the docker run --security-opt no-new-privileges
flag (see the docker run
doc for details).
In a system container, the container's init process is normally owned by the system container's root user and granted full privileges within the system container's user namespace (as described above).
In this case setting the no_new_privs
attribute on the system container's init
process has no effect as far as limiting the privileges it may get (since it
already has all privileges within the system container).
However, it does have effect on child processes with lesser privileges (those won't be allowed to elevate their privileges, preventing them from executing setuid programs for example).
Sysbox does not yet support AppArmor profiles to apply mandatory access control (MAC) to containers.
If the container manager (e.g., Docker) instructs Sysbox to apply an AppArmor profile on a container, Sysbox currently ignores this.
The rationale behind this is that typical AppArmor profiles from container managers such as Docker are too restrictive for system containers, and don't give you much benefit given that Sysbox gets equivalent protection by enabling the Linux user namespaces in its containers.
For example, Docker's default AppArmor profile is too restrictive for system containers.
Having said this, in the near future we plan to add some support for AppArmor in order to follow the defense-in-depth security principle.
Sysbox does not yet support running on systems with SELinux enabled.
Sysbox does not have support for other Linux LSMs at this time.
The Linux kernel has a mechanism to kill processes when the system is running low on memory.
The decision on which process to kill is done based on an out-of-memory (OOM) score assigned to all processes.
The score is in the range [-1000:1000], where higher values means higher probability of the process being killed when the host reaches an out-of-memory scenario.
It's possible for users with sufficient privileges to adjust the OOM
score of a given process, via the /proc/[pid]/oom_score_adj
file.
A system container's init process OOM score adjustment can be
configured to start at a given value. For example, using Docker's --oom-score-adj
option:
$ docker run --runtime=sysbox-runc --oom-score-adj=-100 -it alpine:latest
In addition, Sysbox ensures that system container processes are
allowed to modify their out-of-memory (OOM) score adjustment to any
value in the range [-999:1000], via /proc/[pid]/oom_score_adj
.
This is necessary in order to allow system software that requires such adjustment range (e.g., Kubernetes) to operate correctly within the system container.
From a host security perspective however, allowing system container processes to adjust their OOM score downwards is risky, since it means that such processes are unlikely to be killed when the host is running low on memory.
To mitigate this risk, a user can always put an upper bound on the memory allocated to each system container. Though this does not prevent a system container process from reducing its OOM score, placing such an upper bound reduces the chances of the system running out of memory and prevents memory exhaustion attacks by malicious processes inside the system container.
Placing an upper bound on the memory allocated to the system container
can be done by using Docker's --memory
option:
$ docker run --runtime=sysbox-runc --memory=100m -it alpine:latest