Kernel+Userland: Introduce containers #22968

supercomputer7 · 2024-01-27T17:42:30Z

This is what I have been working on for the last 2 weeks: containers, which combine 4 main features:

Jail restrictions
PID isolation in Process lists
VFS root contexts
Hostname contexts

The concept of a VFS root context is very similar to the Linux mount namespace idea. I still have to write up all the reasoning behind what I do, so this is a draft for now.
In the next couple of days I will try to land the last commits that bring it all together with a shiny new userspace utility called runc (for "run container").

Please note that although this is not finished yet, there are many good bits in this PR that will benefit the system in many ways, and because I use them as foundations for the end goal, I would rather keep them in this PR.

supercomputer7 · 2024-02-03T13:57:50Z

This is now mostly done in terms of finishing the design. I still need to fix CI and ensure proper commit messages are in place.

As for future development, we should aim to add sysfs nodes for unshared resources. This will probably require us to also keep a list of the processes attached to a resource (like with the ScopedProcessList class) if we want to do something meaningful.
In addition to that, I searched for how we could access a VFS root context's filesystem view, and this serverfault question has some interesting answers.

Adding more features is of course possible. I thought about adding a way to limit how many processes can be created within a ScopedProcessList, to help ensure fork bombs can't spawn an unlimited number of processes. A special prctl option can be defined to set this limit (and if the process isn't attached to a scoped process list, we could return ESRCNOTFOUND).
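To make the process-limit idea a bit more concrete, here is a minimal userspace sketch of the bookkeeping it would need. Everything in it is an assumption for illustration: ScopedProcessListSketch, set_process_limit, try_attach and detach are hypothetical names, not the kernel's ScopedProcessList API, and the real thing would need locking.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for a scoped process list; not the kernel's ScopedProcessList.
class ScopedProcessListSketch {
public:
    // What the proposed prctl option might end up calling (assumption).
    void set_process_limit(std::size_t limit) { m_limit = limit; }

    // Called when a new process wants to join the list, e.g. during fork.
    // Returns false once the limit is reached, so the fork can fail early.
    bool try_attach()
    {
        if (m_limit.has_value() && m_attached_count >= *m_limit)
            return false;
        ++m_attached_count;
        return true;
    }

    // Called when an attached process exits.
    void detach()
    {
        assert(m_attached_count > 0);
        --m_attached_count;
    }

    std::size_t attached_count() const { return m_attached_count; }

private:
    std::optional<std::size_t> m_limit; // No limit until one is set.
    std::size_t m_attached_count { 0 };
};
```

With a limit of 2, the third try_attach() is refused until one of the attached processes detaches, which is the behavior that would stop a fork bomb from growing past the configured size.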

supercomputer7 · 2024-02-09T09:01:19Z

More ideas that came to my mind:

Adding separation to SlavePTY devices so we can mount devpts filesystems within a container. Based on the VFS root context (which could hold its own list instead of using the global list), processes can share a new set of devices contained in their own filesystem instance in that VFS root context.
Adding separation for other devices, perhaps by attaching each device to a certain VFS root context so it can be opened only within that same VFS root context. If you open a device node before moving to another VFS root context, you keep that device node open in a file descriptor. This will allow creating device nodes without the fear of processes opening them to bypass the container isolation (so you can't open the main disk device node and read from it).
This also means that we could basically use access tokens for devices across multiple contexts. Each representable device has its own major and minor numbers, which are shared across the contexts (they are unique for the whole OS, not per context), and you can open the device only if your context has been given access to it (which should be done by a special syscall). Of course, once a device has been opened in the context, there's no point in revoking access, so there'll be no syscall to do so.
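A rough sketch of the access-token idea, assuming a per-context table keyed by major and minor numbers. DeviceAccessSketch, grant and can_open are hypothetical names invented for this example, and grant stands in for the special syscall mentioned above.

```cpp
#include <cassert>
#include <set>
#include <utility>

// (major, minor) pair; unique for the whole OS, not per context.
using DeviceNumber = std::pair<unsigned, unsigned>;

// Hypothetical per-context access table; none of these names come from the PR.
class DeviceAccessSketch {
public:
    // Stand-in for the special syscall that gives a context access to a device.
    void grant(DeviceNumber device) { m_granted.insert(device); }

    // An open attempt succeeds only if the device was granted to this context.
    bool can_open(DeviceNumber device) const { return m_granted.count(device) != 0; }

    // Deliberately no revoke(): once a device is open in a context,
    // taking the token away would not close the existing file descriptor.

private:
    std::set<DeviceNumber> m_granted;
};
```

Two contexts would each hold their own table, so granting a device in one context says nothing about whether another context can open the same major and minor numbers.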

supercomputer7 · 2024-02-09T14:19:00Z

Another clever idea: we could leverage the VFS root context idea to implement a lazy unmount feature. Essentially, we can allow the user to ask the kernel to move the filesystem out of its context, attaching it to the kernel VFS root context, which is not visible to userspace, and then add a kernel worker process that cleans it up once the last references to the filesystem's inodes are dropped.

supercomputer7 · 2024-02-09T16:25:53Z

This is blocked until #23138 is merged, since I plan on expanding on loop devices for containers as well.

stale · 2024-03-03T09:59:04Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

stale · 2024-03-31T06:52:49Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

OfficialPixelBrush · 2024-04-09T10:39:03Z

Any updates on this front?

supercomputer7 · 2024-04-09T17:50:20Z

> Any updates on this front?

Not yet. PR #23814 should be merged first so containers can be used efficiently. Currently it takes a good 2-3 seconds to fire up a fully functional BuggieBox-based container, so we should aim to build container images and use them with loop devices.

supercomputer7 · 2024-04-12T14:32:19Z

Let's make this a non-draft PR, as loop devices will take some time (I found that I need to do more fixes there to make them usable, so that will wait).
This PR is not perfect, but it gives a reasonable structure to what I want containers to look like, so let's try to merge this as-is.
It takes a good 2-3 seconds to fire up a full-fledged BuggieBox container, but it works correctly: complete filesystem isolation, together with process isolation, gives the impression that the general idea of this PR is heading in the right direction. We should aim to keep everything as simple as possible, including the configuration of the actual containers.
Putting more complicated applications, like the Browser, in a containerized environment is of course another goal, but first we need to ensure that creating containers can be done almost instantly.

The VFSRootContext class, as its name suggests, holds a context for a root directory, with its mount table and the root custody/inode in the same class. The idea is derived from the Linux mount namespace mechanism. It mimics the concept of the ProcessList object, but it is adjusted for a root directory tree context. In contrast to the ProcessList concept, processes that share the default VFSRootContext can't see properties of other VFSRootContext objects, such as the mount table and root custody/inode. To accommodate this change progressively, we internally create 2 main VFS root contexts for now: one for kernel processes (as they don't need to care about VFS root contexts for the most part), and another for all userspace programs. This separation allows us to keep pretending to userspace that everything is "normal", as it used to be, until we introduce proper interfaces in the mount-related syscalls as well as in the SysFS. We also keep VFSRootContext objects in a list, as another preparation before we can expose interfaces to userspace. As a result, the PowerStateSwitchTask now iterates over all contexts and tears them down one by one.
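As a rough illustration of that structure (not the actual kernel class), the sketch below models a context that owns its own mount table, plus a shutdown-style pass that walks every listed context. VFSRootContextSketch, MountSketch and tear_down_all are hypothetical names, and real custodies/inodes are replaced by plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// One mount table entry; a string stands in for the real filesystem object.
struct MountSketch {
    std::string mount_point;
    std::string filesystem_name;
};

// Hypothetical stand-in for a root context that owns its own mount table.
class VFSRootContextSketch {
public:
    explicit VFSRootContextSketch(std::string root_filesystem_name)
    {
        // The root mount is the first entry in this context's mount table.
        m_mounts.push_back({ "/", std::move(root_filesystem_name) });
    }

    void add_mount(std::string mount_point, std::string filesystem_name)
    {
        m_mounts.push_back({ std::move(mount_point), std::move(filesystem_name) });
    }

    // Unmount in reverse order so later (possibly nested) mounts go first.
    std::size_t unmount_all()
    {
        std::size_t count = m_mounts.size();
        while (!m_mounts.empty())
            m_mounts.pop_back();
        return count;
    }

    std::size_t mount_count() const { return m_mounts.size(); }

private:
    std::vector<MountSketch> m_mounts;
};

// Shutdown-style teardown: walk every listed context and tear it down.
inline std::size_t tear_down_all(std::vector<std::unique_ptr<VFSRootContextSketch>>& contexts)
{
    std::size_t unmounted = 0;
    for (auto& context : contexts)
        unmounted += context->unmount_all();
    return unmounted;
}
```

Because each context carries its own table, a mount added to the userspace context never shows up in the kernel one, and the teardown pass only has to know about the list of contexts.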

Expose some initial interfaces in the mount-related syscalls to select the desired VFSRootContext by specifying its ID number. For now there's still no way to create a different VFSRootContext, so the only valid IDs are -1 (for the currently attached VFSRootContext) or 1 (for the first userspace VFSRootContext).
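The ID convention can be sketched as a small resolver: -1 maps to whatever context the caller is attached to, and any other value must name a known context. The names here (resolve_context_id, CallerSketch and the two constants) are assumptions for illustration, not the syscall interface itself.

```cpp
#include <cassert>
#include <optional>
#include <set>

using ContextID = int;

// Values taken from the description above; the constant names are hypothetical.
constexpr ContextID current_context_id = -1;
constexpr ContextID first_userspace_context_id = 1;

// Minimal view of the calling process: just the context it is attached to.
struct CallerSketch {
    ContextID attached_context;
};

// Resolve the ID passed to a mount-related syscall to a concrete context ID.
// Returns an empty optional for IDs that do not name a known context.
inline std::optional<ContextID> resolve_context_id(CallerSketch const& caller, ContextID requested, std::set<ContextID> const& known_contexts)
{
    if (requested == current_context_id)
        return caller.attached_context;
    if (known_contexts.count(requested) != 0)
        return requested;
    return std::nullopt;
}
```

While only the first userspace context exists, both -1 and 1 resolve to the same context for a userspace caller, and every other value is rejected.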

The whole concept of Jails was far more complicated than I actually want it to be, so let's reduce the complexity of how it works from now on. Please note that we always leaked the attach count of a Jail object in the fork syscall if it failed midway. Instead, we should attach to the jail just before registering the new Process, so we don't need to worry about unsuccessful Process creation. The reduction of complexity in regard to jails means that instead of relying on jails to provide PID isolation, we can reduce the whole idea to a simple SetOnce flag, and let the ProcessList (now called ScopedProcessList) be responsible for this type of isolation. Therefore, we apply the following changes:
- We make the Jail concept no longer a class of its own. Instead, we simplify the idea of being jailed to a simple ProtectedValues boolean flag. This means that we no longer check for matching jail pointers anywhere in the Kernel code. To set a process as jailed, a new prctl option was added to set a Kernel SetOnce boolean flag (so it can never change again).
- We provide Process & Thread methods to iterate over process lists. A process can either iterate over the global process list or, if it's attached to a scoped process list, only over that list. This essentially replaces the need to check the Jail pointer of a process when iterating over process lists.
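A minimal sketch of the two pieces described in this commit message, assuming simplified stand-ins: a one-way flag for the jailed state and an iteration helper that picks the scoped list when one is attached. SetOnceSketch, ProcessSketch and for_each_visible_process are hypothetical names, not the kernel's types.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// One-way boolean: it can be set, but never cleared again.
class SetOnceSketch {
public:
    void set() { m_value = true; }
    bool was_set() const { return m_value; }

private:
    bool m_value { false };
};

struct ProcessSketch;
using ProcessListSketch = std::vector<ProcessSketch const*>;

// Hypothetical process: being jailed is just a flag, not a pointer to a Jail object.
struct ProcessSketch {
    int pid { 0 };
    SetOnceSketch jailed;
    ProcessListSketch const* scoped_list { nullptr }; // nullptr means "uses the global list"
};

// Iterate over the processes the viewer is allowed to see: its scoped list if
// it is attached to one, otherwise the global list. No jail pointer comparison.
inline void for_each_visible_process(ProcessSketch const& viewer, ProcessListSketch const& global_list, std::function<void(ProcessSketch const&)> const& callback)
{
    auto const& list = viewer.scoped_list ? *viewer.scoped_list : global_list;
    for (auto const* process : list)
        callback(*process);
}
```

The point of the split is that PID visibility depends only on which list a process is attached to, while the jailed flag is left to carry the restriction itself.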

These programs are capable of running other programs, so we should restrict them from potentially running SUID programs, which was never a functionality we supported for those programs anyway.

This new syscall will be used by the upcoming runc (run-container) utility. In addition to that, this syscall allows userspace to neatly copy RAMFS instances to other places, which was not possible in the past.

There's no point in constructing an object just for the sake of keeping state that can be touched by anything in the kernel code. Let's reduce everything to a C++ namespace that keeps the previous name, "VirtualFileSystem", and keep a struct with a smaller textual footprint called "VirtualFileSystemDetails". This change also cleans up old "friend class" statements that were no longer needed, and moves methods from the VirtualFileSystem code to more appropriate places as well. Please note that the method for locking all filesystems during shutdown is removed, as there is no point in locking all filesystems at that stage, because we are running entirely in kernel mode.

Similarly to VFSRootContext and ScopedProcessList, this class is intended to provide resource isolation as well. We add this class as infrastructure preparation for hostname contexts, which should allow processes to obtain different hostnames on the same machine.

These 2 syscalls are responsible for unsharing resources in the system, such as hostname, VFS root contexts and process lists. Together with an appropriate userspace implementation, these syscalls could be used for creating a sandbox environment (containers) for user programs.

Similarly to KLexicalPath, we might need to check if a path is canonical or not.

These 2 methods will be used later by the userspace implementation that will handle creation of containers.

Together with a first JSON file for bringing up a fully functional BuggieBox container, we allow users to take advantage of the kernel unsharing features that were introduced in earlier commits.

supercomputer7 · 2024-07-12T17:19:39Z

This should be ready for review again :)

timschumi

Let's get this conflict magnet out of the way.

supercomputer7 force-pushed the containers branch 4 times, most recently from 5bce19c to 042e9f0 Compare February 2, 2024 10:06

supercomputer7 mentioned this pull request Feb 2, 2024

Kernel: Remove hostname-related syscalls #20111

Closed

supercomputer7 force-pushed the containers branch 2 times, most recently from 2a94e05 to 71f5a78 Compare February 3, 2024 12:32

supercomputer7 mentioned this pull request Feb 9, 2024

Kernel: Make the shutdown procedure infallible #20786

Closed

supercomputer7 mentioned this pull request Feb 9, 2024

Kernel+Userland: Introduce loop devices #23138

Merged

supercomputer7 added the ⛔️ pr-is-blocked PR is blocked by something outside of the author's control, protected from stalebot label Feb 9, 2024

supercomputer7 mentioned this pull request Feb 29, 2024

Kernel: Implement "negative" pledges #23389

Closed

stale bot added the stale label Mar 3, 2024

supercomputer7 removed ⛔️ pr-is-blocked PR is blocked by something outside of the author's control, protected from stalebot stale labels Mar 3, 2024

stale bot added the stale label Mar 31, 2024

stale bot removed the stale label Apr 9, 2024

supercomputer7 force-pushed the containers branch from 71f5a78 to 54943a8 Compare April 12, 2024 14:30

supercomputer7 marked this pull request as ready for review April 12, 2024 14:30

supercomputer7 requested review from BertalanD and timschumi as code owners April 12, 2024 14:31

github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Apr 12, 2024

stale bot removed the stale label Jun 24, 2024

supercomputer7 mentioned this pull request Jul 5, 2024

Kernel panic after excessive fork()s #24627

Open

supercomputer7 force-pushed the containers branch 2 times, most recently from 0fe8451 to 82a6d25 Compare July 12, 2024 10:36

supercomputer7 marked this pull request as draft July 12, 2024 10:50

github-actions bot removed the 👀 pr-needs-review PR needs review from a maintainer or community member label Jul 12, 2024

supercomputer7 added 12 commits July 12, 2024 14:37

Userland: Always enter jail mode in Browser and Assistant

8ce28f9

These programs are capable of running other programs, so we should restrict them from potentially running SUID programs, which was never a functionality we supported for those programs anyway.

Kernel+Userland: Introduce the copy_mount syscall

601cd35

This new syscall will be used by the upcoming runc (run-container) utility. In addition to that, this syscall allows userspace to neatly copy RAMFS instances to other places, which was not possible in the past.

AK: Add is_canonical method for LexicalPath

2b49ad9

Similarly to KLexicalPath, we might need to check if a path is canonical or not.

LibCore: Add System methods to handle the unshare syscall family

7e7ebb7

These 2 methods will be used later by the userspace implementation that will handle creation of containers.

Userland+Base: Introduce userspace implementation for running containers

8a53c5d

Together with a first JSON file for bringing up a fully functional BuggieBox container, we allow users to take advantage of the kernel unsharing features that were introduced in earlier commits.

Documentation: Add document about containers

851f72e

supercomputer7 force-pushed the containers branch from 82a6d25 to 851f72e Compare July 12, 2024 17:18

supercomputer7 marked this pull request as ready for review July 12, 2024 17:19

supercomputer7 requested a review from timschumi July 12, 2024 17:19

github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Jul 12, 2024

supercomputer7 mentioned this pull request Jul 13, 2024

Kernel: New PCI driver subsystem design #23448

Draft

supercomputer7 mentioned this pull request Jul 20, 2024

Kernel+Userland: Add auto-jailing symlink of dynamic loader, introduce the new set-elf-jailed utility #24764

Draft

timschumi approved these changes Jul 21, 2024

timschumi merged commit 60cda20 into SerenityOS:master Jul 21, 2024
14 checks passed

github-actions bot removed the 👀 pr-needs-review PR needs review from a maintainer or community member label Jul 21, 2024

brody-qq mentioned this pull request Jul 27, 2024

Kernel/FileSystem: Small code cleanups #24760

Merged

alec3660 mentioned this pull request Aug 8, 2024

Assistant: Terminal is launched in jail mode #24913

Open
