Kernel+Userland: Introduce containers #22968

Merged: 12 commits merged into SerenityOS:master on Jul 21, 2024


supercomputer7
Member

@supercomputer7 supercomputer7 commented Jan 27, 2024

This is what I have been working on for the last 2 weeks: containers, which combine 4 main features:

  • Jail restrictions
  • PID isolation in Process lists
  • VFS root contexts
  • Hostname contexts

The concept of a VFS root context is very similar to the Linux mount namespace idea. I still have to write up all the reasoning behind what I do, so this is a draft for now.
In the next couple of days I will try to land the last commits, which bring it all together with a shiny new userspace utility called runc (for "run container").

Please note that although this is not finished yet, there are many good bits in the PR that will benefit the system in many ways, and because I use them as foundations for the end goal, I'd rather keep them in this PR.

@supercomputer7
Member Author

supercomputer7 commented Feb 3, 2024

The design is now mostly finished. I still need to fix CI and ensure proper commit messages are in place.

As for future development, we should aim to add sysfs nodes for unshared resources. This will probably require us to also keep a list of the processes attached to a resource (like with the ScopedProcessList class) if we want to do something meaningful.
In addition to that, I looked into how we could access a VFS root context's filesystem view, and this serverfault question has some interesting answers.

Adding more features is of course possible. I thought about adding a way to limit how many processes can be created within a ScopedProcessList, to help ensure that fork bombs can't spawn an unlimited number of processes. A special prctl option could be defined to set this limit (and if the process isn't attached to a scoped process list, we could return ESRCNOTFOUND).
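The process limit idea could look roughly like the following. This is a minimal userspace sketch under stated assumptions, not kernel code: `ScopedProcessListSketch`, `SketchError`, `try_attach` and `set_process_limit` are hypothetical names invented for illustration. Only the ScopedProcessList concept and the ESRCNOTFOUND error come from the discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical error codes standing in for kernel errno values.
enum class SketchError {
    None,
    LimitReached,
    SourceNotFound, // stands in for ESRCNOTFOUND
};

// Hypothetical stand-in for a scoped process list with an optional cap.
class ScopedProcessListSketch {
public:
    void set_limit(std::size_t limit) { m_limit = limit; }

    // Would be called on fork: refuse to attach once the cap is reached.
    SketchError try_attach()
    {
        if (m_attached >= m_limit)
            return SketchError::LimitReached;
        ++m_attached;
        return SketchError::None;
    }

    // Would be called when a process in the list exits.
    void detach()
    {
        assert(m_attached > 0);
        --m_attached;
    }

    std::size_t attached_count() const { return m_attached; }

private:
    std::size_t m_attached { 0 };
    // The maximum value means "no limit was set".
    std::size_t m_limit { std::numeric_limits<std::size_t>::max() };
};

// Hypothetical prctl-style handler. A null list models a process that is
// not attached to any scoped process list.
inline SketchError set_process_limit(ScopedProcessListSketch* list, std::size_t limit)
{
    if (list == nullptr)
        return SketchError::SourceNotFound;
    list->set_limit(limit);
    return SketchError::None;
}
```

With a limit of 2, the third `try_attach()` call is refused until some process detaches, which is the property that would stop a fork bomb from growing without bound.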

@supercomputer7
Member Author

supercomputer7 commented Feb 9, 2024

More ideas that came to my mind:

  • Adding separation to SlavePTY devices so we can mount devpts filesystems within a container. The VFS root context could hold its own list of devices instead of using the global list, so that processes share a new set of devices contained in their own filesystem instance within that VFS root context.
  • Adding separation for other devices, perhaps by attaching each device to a certain VFS root context so that it can only be opened from within that context. If you open a device node before moving to another VFS root context, you keep that device node open in a file descriptor. This would allow creating device nodes without the fear of processes opening them to bypass the container isolation (so you can't open the main disk device node and read from it).
    This also means that we could basically use access tokens for devices across multiple contexts. Each representable device has its own major and minor numbers, which are shared across the contexts (they are unique for the whole OS, not per context), and you can open the device only if your context has been given access to it (which should be done through a special syscall). Of course, once a device has been opened in a context, there's no point in allowing access to be revoked, so there will be no syscall to do so.

@supercomputer7
Member Author

Another clever idea: we could leverage VFS root contexts to implement a lazy unmount feature. Essentially, we can allow the user to ask the kernel to move a filesystem out of its context and attach it to the kernel VFS root context, which is not visible to userspace, and then have a kernel worker process clean it up once the last references to the filesystem's inodes are released.
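A rough model of the lazy unmount idea, as a single-threaded userspace sketch. Everything here is an assumption made for illustration: `FileSystemSketch`, `RootContextSketch`, `lazy_unmount` and `reap_unused` are hypothetical names, and `std::shared_ptr` reference counts stand in for whatever inode reference tracking the kernel would really use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mounted filesystem. Anything that still uses
// it (for example an open inode) would hold a shared_ptr to it.
struct FileSystemSketch {
    std::string mount_point;
};

// Hypothetical stand-in for a VFS root context: just a list of mounts.
struct RootContextSketch {
    std::vector<std::shared_ptr<FileSystemSketch>> mounts;
};

// Move a filesystem out of the visible context and park it in the hidden
// (kernel-only) context. Returns false if it is not mounted in `visible`.
inline bool lazy_unmount(RootContextSketch& visible, RootContextSketch& hidden, std::string const& mount_point)
{
    assert(&visible != &hidden);
    auto it = std::find_if(visible.mounts.begin(), visible.mounts.end(),
        [&](std::shared_ptr<FileSystemSketch> const& fs) { return fs->mount_point == mount_point; });
    if (it == visible.mounts.end())
        return false;
    hidden.mounts.push_back(std::move(*it));
    visible.mounts.erase(it);
    return true;
}

// What a cleanup worker might do periodically: release parked filesystems
// that only the hidden context still references. Returns how many were freed.
inline std::size_t reap_unused(RootContextSketch& hidden)
{
    auto const before = hidden.mounts.size();
    auto new_end = std::remove_if(hidden.mounts.begin(), hidden.mounts.end(),
        [](std::shared_ptr<FileSystemSketch> const& fs) { return fs.use_count() == 1; });
    hidden.mounts.erase(new_end, hidden.mounts.end());
    return before - hidden.mounts.size();
}
```

The filesystem disappears from the visible context immediately, but it is only destroyed once the last outside reference is dropped and the worker runs again.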

@supercomputer7
Member Author

This is blocked until #23138 is merged, since I plan on expanding on loop devices for containers as well.

supercomputer7 added the ⛔️ pr-is-blocked label (PR is blocked by something outside of the author's control, protected from stalebot) Feb 9, 2024

stale bot commented Mar 3, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

@stale stale bot added the stale label Mar 3, 2024
supercomputer7 removed the ⛔️ pr-is-blocked and stale labels Mar 3, 2024

stale bot commented Mar 31, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

@stale stale bot added the stale label Mar 31, 2024
@OfficialPixelBrush
Contributor

Any updates on this front?

@stale stale bot removed the stale label Apr 9, 2024
@supercomputer7
Member Author

Any updates on this front?

Not yet. PR #23814 should be merged first so containers can be used efficiently. Currently it takes a good 2-3 seconds to fire up a fully functional BuggieBox-based container, so we should aim to build container images and use them with loop devices.

@supercomputer7 supercomputer7 marked this pull request as ready for review April 12, 2024 14:30
github-actions bot added the 👀 pr-needs-review label (PR needs review from a maintainer or community member) Apr 12, 2024
@supercomputer7
Member Author

supercomputer7 commented Apr 12, 2024

Let's make this a non-draft PR, as loop devices will take some time (I found that I need to do more fixes there to make them usable, so that will wait).
This PR is not perfect, but it gives a reasonable structure to what I want containers to look like, so let's try to merge this as-is.
It takes a good 2-3 seconds to fire up a full-fledged BuggieBox container, but it works correctly: complete filesystem isolation, together with process isolation, gives the impression that the general idea of this PR is a step in the right direction. We should aim to keep everything as simple as possible, including the configuration of the actual containers.
Putting more complicated applications, like the Browser, in a containerized environment is of course another goal, but first we need to ensure that containers can be created almost instantly.

@stale stale bot removed the stale label Jun 24, 2024
@supercomputer7 supercomputer7 force-pushed the containers branch 2 times, most recently from 0fe8451 to 82a6d25 Compare July 12, 2024 10:36
@supercomputer7 supercomputer7 marked this pull request as draft July 12, 2024 10:50
github-actions bot removed the 👀 pr-needs-review label (PR needs review from a maintainer or community member) Jul 12, 2024
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.

The idea is derived from the Linux mount namespace mechanism.
It mimics the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see the properties of other VFSRootContexts, such
as their mount table and root custody/inode.

To adapt to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending to userspace that
everything is "normal" as it used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.

We keep VFSRootContext objects in a list, as another preparation
before we can expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates over all contexts
and tears them down one by one.

Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.

For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
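
The ID rule described in this commit message can be summed up in a few lines. This is only a sketch: the function name, the constants and the boolean return convention are hypothetical, and the real syscalls would report an error code rather than `false`.

```cpp
#include <cassert>

// Values taken from the commit message above; the constant names are made up.
constexpr int current_context_id = -1;
constexpr int first_userspace_context_id = 1;

// Hypothetical helper: resolve the ID passed to a mount-related syscall.
// `attached_id` is the ID of the context the caller is attached to.
// Returns false for any ID that is not valid yet.
inline bool resolve_context_id(int requested_id, int attached_id, int& resolved_id)
{
    assert(attached_id >= 0);
    if (requested_id == current_context_id) {
        resolved_id = attached_id;
        return true;
    }
    if (requested_id == first_userspace_context_id) {
        resolved_id = first_userspace_context_id;
        return true;
    }
    return false;
}
```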
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.
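
The ordering argument above can be illustrated with a tiny model. The names `AttachCountedSketch` and `fork_sketch` are hypothetical, and `setup_succeeds` is an assumption that stands in for every fallible step of a real fork.

```cpp
#include <cassert>

// Hypothetical stand-in for any resource that counts attached processes.
struct AttachCountedSketch {
    int attach_count { 0 };

    void attach() { ++attach_count; }
    void detach()
    {
        assert(attach_count > 0);
        --attach_count;
    }
};

// Hypothetical fork path. `setup_succeeds` models all the fallible work
// (allocations, copying state, and so on).
inline bool fork_sketch(AttachCountedSketch& resource, bool setup_succeeds)
{
    // All fallible work happens first, before any count is touched.
    if (!setup_succeeds)
        return false; // nothing to undo, so nothing can leak

    // Attach only once failure is no longer possible, right before the
    // new process would be registered.
    resource.attach();
    return true;
}
```

Because the count is only incremented after the last point of failure, no error path has to remember to decrement it.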

The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we can simplify the whole
idea of them to a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) be responsible for this type of isolation.

Therefore, we apply the following changes:
- We make the Jail concept no longer a class of its own. Instead, we
  simplify the idea of being jailed to a simple ProtectedValues boolean
  flag. This means that we no longer check for matching jail pointers
  anywhere in the Kernel code.
  To set a process as jailed, a new prctl option was added to set a
  Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
  A process can either iterate over the global process list, or if it's
  attached to a scoped process list, then only over that list.
  This essentially replaces the need to check the Jail pointer of a
  process when iterating over process lists.
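
The two changes listed above can be modelled as follows. This is a userspace sketch, not the kernel's implementation: `SetOnceSketch`, `ProcessSketch` and `visible_pids` are hypothetical names, and plain vectors of pointers stand in for the kernel's process lists.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a set-once flag: it can go from false to true
// and there is deliberately no way to clear it again.
class SetOnceSketch {
public:
    void set() { m_value = true; }
    bool was_set() const { return m_value; }

private:
    bool m_value { false };
};

// Hypothetical process model: a jailed flag plus an optional scoped list.
struct ProcessSketch {
    int pid { 0 };
    SetOnceSketch jailed;
    // Null means "not attached to a scoped process list".
    std::vector<ProcessSketch*> const* scoped_list { nullptr };
};

// Iterate over the scoped list if the viewer is attached to one, otherwise
// over the global list. No jail pointer comparison is needed.
inline std::vector<int> visible_pids(ProcessSketch const& viewer, std::vector<ProcessSketch*> const& global_list)
{
    std::vector<ProcessSketch*> const& list = viewer.scoped_list ? *viewer.scoped_list : global_list;
    std::vector<int> pids;
    for (ProcessSketch const* process : list) {
        assert(process != nullptr);
        pids.push_back(process->pid);
    }
    return pids;
}
```

A process attached to a scoped list only ever sees the members of that list, while an unattached process keeps seeing the global list.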
These programs are capable of running other programs, so we should
restrict them from potentially running SUID programs, which was never a
functionality we supported for those programs anyway.

This new syscall will be used by the upcoming runc (run-container)
utility.

In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.

There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.

Let's reduce everything to a C++ namespace that keeps the previous
name "VirtualFileSystem", plus a struct with a smaller textual footprint
called "VirtualFileSystemDetails".

This change also cleans up old "friend class" statements that were no
longer needed, and moves methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method for locking all filesystems during shutdown
is removed, as there is no point in locking all filesystems at that
stage, when we are running entirely in kernel mode.

Similarly to VFSRootContext and ScopedProcessList, this class is
intended to provide resource isolation as well.
We add this class as infrastructure preparation for hostname contexts,
which should allow processes to obtain different hostnames on the same
machine.

These 2 syscalls are responsible for unsharing resources in the system,
such as hostname, VFS root contexts and process lists.

Together with an appropriate userspace implementation, these syscalls
could be used for creating a sandbox environment (containers) for user
programs.
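
One way to picture a pair of unsharing calls is a "create" step that makes a new resource and returns its ID, followed by an "attach" step that points a process at it. The sketch below models that for hostnames only. The create/attach split, the integer IDs and every name in the code are assumptions made for illustration, not the actual syscall interface.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a hostname context.
struct HostnameContextSketch {
    std::string name;
};

// Hypothetical process model: it only knows which context it is attached to.
struct TaskSketch {
    std::shared_ptr<HostnameContextSketch> hostname_context;
};

class HostnameContextRegistrySketch {
public:
    // Step 1 (hypothetical "create" call): make a new, unshared context
    // and hand back an ID for it.
    int create(std::string name)
    {
        int id = m_next_id++;
        m_contexts[id] = std::make_shared<HostnameContextSketch>(HostnameContextSketch { std::move(name) });
        return id;
    }

    // Step 2 (hypothetical "attach" call): point the task at that context.
    // Returns false for an unknown ID.
    bool attach(TaskSketch& task, int id)
    {
        auto it = m_contexts.find(id);
        if (it == m_contexts.end())
            return false;
        assert(it->second != nullptr);
        task.hostname_context = it->second;
        return true;
    }

private:
    int m_next_id { 1 };
    std::map<int, std::shared_ptr<HostnameContextSketch>> m_contexts;
};
```

Two tasks attached to different contexts then observe different hostnames on the same machine, which is the isolation property a container needs.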
Similarly to KLexicalPath, we might need to check if a path is canonical
or not.
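
As an illustration of what such a check might involve, here is a standalone sketch. The definition of "canonical" used here (absolute, no empty, "." or ".." components, and no trailing slash except for the root) is an assumption, and `is_canonical_sketch` is a hypothetical name rather than the method added by this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical canonical-path check under the assumptions stated above.
inline bool is_canonical_sketch(std::string const& path)
{
    if (path.empty() || path.front() != '/')
        return false;
    if (path == "/")
        return true;
    if (path.back() == '/')
        return false;

    // Walk the components between slashes, starting after the leading '/'.
    std::size_t start = 1;
    while (start < path.size()) {
        std::size_t end = path.find('/', start);
        if (end == std::string::npos)
            end = path.size();
        assert(end >= start);
        std::string part = path.substr(start, end - start);
        if (part.empty() || part == "." || part == "..")
            return false;
        start = end + 1;
    }
    return true;
}
```

Under these assumptions `/usr/bin` is accepted, while `/usr//bin`, `/usr/../bin`, `/usr/` and `usr/bin` are all rejected.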
These 2 methods will be used later by the userspace implementation that
will handle creation of containers.

Together with a first JSON file for bringing up a fully functional
BuggieBox container, we allow users to take advantage of the kernel
unsharing features that were introduced in earlier commits.
@supercomputer7
Member Author

This should be ready for review again :)

Member

@timschumi timschumi left a comment


Let's get this conflict magnet out of the way.

@timschumi timschumi merged commit 60cda20 into SerenityOS:master Jul 21, 2024
14 checks passed
github-actions bot removed the 👀 pr-needs-review label (PR needs review from a maintainer or community member) Jul 21, 2024