Kernel+Userland: Introduce containers #22968
5bce19c to 042e9f0
2a94e05 to 71f5a78
Now that the design is mostly finished, I still need to fix CI and make sure proper commit messages are in place. As for future development, we should aim to add sysfs nodes for unshared resources. This will probably require us to also keep a list of the processes attached to a resource (like with the Adding more features is of course possible. I thought about adding a way to set a limit on how many processes can be created within a
More ideas that came to my mind:
Another clever idea: we could leverage the VFS root context idea to implement a lazy-unmount feature. Essentially, we can allow the user to ask the kernel to move the filesystem out of its context and attach it to the kernel VFS root context, which is not visible to userspace, and then add a kernel worker process that cleans it up once the last references to the filesystem's inodes are dropped.
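A minimal sketch of that lazy-unmount idea, in plain C++ rather than kernel code. Everything here is hypothetical: `FileSystem`, `MountContext`, `lazy_unmount` and `reap_unreferenced` are illustrative names, and `std::shared_ptr` reference counts stand in for references to the filesystem's inodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mounted filesystem.
struct FileSystem {
    std::string name;
};

// Hypothetical stand-in for a VFS root context: just a list of mounted filesystems.
struct MountContext {
    std::vector<std::shared_ptr<FileSystem>> mounted;
};

// Moves the named filesystem out of the visible context into the hidden one.
// Returns false if no such filesystem is mounted in the visible context.
bool lazy_unmount(MountContext& visible, MountContext& hidden, std::string const& name)
{
    auto it = std::find_if(visible.mounted.begin(), visible.mounted.end(),
        [&](auto const& fs) { return fs->name == name; });
    if (it == visible.mounted.end())
        return false;
    hidden.mounted.push_back(std::move(*it));
    visible.mounted.erase(it);
    return true;
}

// One pass of the cleanup worker: drops every hidden filesystem that is
// referenced only by the hidden context itself. Returns how many were dropped.
std::size_t reap_unreferenced(MountContext& hidden)
{
    auto const before = hidden.mounted.size();
    hidden.mounted.erase(
        std::remove_if(hidden.mounted.begin(), hidden.mounted.end(),
            [](auto const& fs) { return fs.use_count() == 1; }),
        hidden.mounted.end());
    return before - hidden.mounted.size();
}
```

In this model a filesystem with an outstanding reference survives `reap_unreferenced` and is only dropped on a later pass, after the last holder releases it.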
This is blocked until #23138 is merged, since I plan on expanding on loop devices for containers as well.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!
Any updates on this front?
Not yet. PR #23814 should be merged first so that containers can be used efficiently. Currently it takes a full 2-3 seconds to fire up a fully functional BuggieBox-based container, so we should aim to build container images and use them with loop devices.
71f5a78 to 54943a8
Let's make this a non-draft PR, as loop devices will take some time (I found that I need to do more fixes there to make them usable, so that will wait).
0fe8451 to 82a6d25
The VFSRootContext class, as its name suggests, holds the context for a root directory: its mount table and the root custody/inode live in the same class. The idea is derived from the Linux mount namespace mechanism. It mimics the concept of the ProcessList object, but is adjusted for a root directory tree context. In contrast to the ProcessList concept, processes that share the default VFSRootContext can't see properties of other VFSRootContexts, such as their mount table and root custody/inode. To adopt this change progressively, we internally create 2 main VFS root contexts for now: one for kernel processes (as they don't need to care about VFS root contexts for the most part), and another for all userspace programs. This separation allows us to keep pretending for userspace that everything is "normal", as it used to be, until we introduce proper interfaces in the mount-related syscalls as well as in the SysFS. We also keep VFSRootContext objects in a list, as another preparation before we can expose interfaces to userspace. As a result, the PowerStateSwitchTask now iterates over all contexts and tears them down one by one.
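To make the shape of this concrete, here is a rough model in standalone C++, not the kernel's actual class. `Mount`, `RootContext` and `RootContextList` are hypothetical names, a string stands in for the root custody/inode, and the assumption that the kernel context gets ID 0 and the userspace context ID 1 is mine.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one mount table entry.
struct Mount {
    std::string source;
    std::string target;
};

// One root-directory context: a root plus its own private mount table.
struct RootContext {
    unsigned id;
    std::string root; // stands in for the root custody/inode
    std::vector<Mount> mounts;
};

// Keeps every context in one list so a shutdown task can walk all of them.
class RootContextList {
public:
    // Creates a new context with the next free ID (assumed to start at 0).
    RootContext& create(std::string root)
    {
        m_contexts.push_back(std::make_unique<RootContext>(
            RootContext { m_next_id++, std::move(root), {} }));
        return *m_contexts.back();
    }

    // Tears down every context, one by one. Returns how many mounts were dropped.
    std::size_t tear_down_all()
    {
        std::size_t dropped = 0;
        for (auto& context : m_contexts) {
            dropped += context->mounts.size();
            context->mounts.clear();
        }
        return dropped;
    }

private:
    unsigned m_next_id { 0 };
    std::vector<std::unique_ptr<RootContext>> m_contexts;
};
```

Creating two contexts from this list mirrors the kernel/userspace split described above: a mount added to one context's table never shows up in the other's.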
Expose some initial interfaces in the mount-related syscalls to select the desired VFSRootContext by specifying its index number. For now there's still no way to create a different VFSRootContext, so the only valid IDs are -1 (for the currently attached VFSRootContext) or 1 (for the first userspace VFSRootContext).
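The selection rule could look roughly like the following. This is only a sketch: `resolve_context_index` and both constants are hypothetical names, and the real syscalls presumably report an error code rather than an empty optional.

```cpp
#include <cassert>
#include <optional>

// Hypothetical names for the two IDs the commit message describes.
constexpr int current_context_index = -1;
constexpr int first_userspace_context_index = 1;

// Maps the index passed to a mount-related syscall to the ID of the context
// to operate on, or returns nothing if the index is not valid yet.
std::optional<int> resolve_context_index(int requested, int callers_context_id)
{
    if (requested == current_context_index)
        return callers_context_id;
    if (requested == first_userspace_context_index)
        return first_userspace_context_index;
    return std::nullopt;
}
```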
The whole concept of Jails was far more complicated than I actually want it to be, so let's reduce the complexity of how it works from now on.

Please note that we always leaked the attach count of a Jail object in the fork syscall if it failed midway. Instead, we should have attached to the jail just before registering the new Process, so we don't need to worry about unsuccessful Process creation.

The reduction of complexity in regard to jails means that instead of relying on jails to provide PID isolation, we can simplify the whole idea of them to a simple SetOnce, and let the ProcessList (now called ScopedProcessList) be responsible for this type of isolation. Therefore, we apply the following changes:
- We make the Jail concept no longer a class of its own. Instead, we simplify the idea of being jailed to a simple ProtectedValues boolean flag. This means that we no longer check for matching jail pointers anywhere in the Kernel code. To set a process as jailed, a new prctl option was added to set a Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists. A process can either iterate over the global process list, or, if it's attached to a scoped process list, only over that list. This essentially replaces the need to check the Jail pointer of a process when iterating over process lists.
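The two changes above can be modeled in a few lines of standalone C++. This is an illustration under assumptions, not the kernel code: `SetOnceFlag`, `Proc`, `ProcList` and `for_each_visible` are hypothetical names, and the sketch ignores locking entirely.

```cpp
#include <cassert>
#include <vector>

// A flag that can be set but never cleared again (a stand-in for the SetOnce idea).
class SetOnceFlag {
public:
    void set() { m_value = true; }
    bool was_set() const { return m_value; }

private:
    bool m_value { false };
};

struct Proc;
using ProcList = std::vector<Proc*>;

// Hypothetical process record: being jailed is just a flag, and PID isolation
// comes from the optional scoped process list.
struct Proc {
    int pid;
    SetOnceFlag jailed;
    ProcList* scoped_list { nullptr }; // null means "uses the global list"
};

// Visits every process the caller is allowed to see: its scoped list if it is
// attached to one, otherwise the global list.
template<typename Callback>
void for_each_visible(Proc const& caller, ProcList const& global_list, Callback callback)
{
    ProcList const& list = caller.scoped_list ? *caller.scoped_list : global_list;
    for (Proc* process : list)
        callback(*process);
}
```

No jail pointer is compared anywhere: which processes are visible depends only on which list the caller is attached to.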
These programs are capable of running other programs, so we should restrict them from potentially running SUID programs, which was never a functionality we supported for those programs anyway.
This new syscall will be used by the upcoming runc (run-container) utility. In addition to that, this syscall allows userspace to neatly copy RAMFS instances to other places, which was not possible in the past.
There's no point in constructing an object just for the sake of keeping state that can be touched by anything in the kernel code. Let's reduce everything to a C++ namespace that keeps the previous name, "VirtualFileSystem", plus a struct with a smaller textual footprint called "VirtualFileSystemDetails". This change also cleans up old "friend class" statements that were no longer needed, and moves methods from the VirtualFileSystem code to more appropriate places as well. Please note that the method for locking all filesystems during shutdown is removed, as there is no point in actually locking all filesystems there: at that stage we run entirely in kernel mode.
Similarly to VFSRootContext and ScopedProcessList, this class is intended to provide resource isolation as well. We add this class as infrastructure in preparation for hostname contexts, which should allow processes to obtain different hostnames on the same machine.
These 2 syscalls are responsible for unsharing resources in the system, such as hostname, VFS root contexts and process lists. Together with an appropriate userspace implementation, these syscalls could be used for creating a sandbox environment (containers) for user programs.
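As a rough illustration of what unsharing a resource means, here is a sketch for the hostname case. All names (`HostnameCtx`, `Task`, `unshare_hostname`) are hypothetical, and the split into two separate syscalls is not modeled; the single function below both creates the private copy and attaches the task to it.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical hostname context: the shared state is just the name.
struct HostnameCtx {
    std::string name;
};

// Hypothetical process record that points at a (possibly shared) context.
struct Task {
    std::shared_ptr<HostnameCtx> hostname;
};

// Gives the task its own private copy of the hostname context, so later
// changes no longer affect tasks that still share the original one.
void unshare_hostname(Task& task)
{
    task.hostname = std::make_shared<HostnameCtx>(*task.hostname);
}
```

After unsharing, renaming the child's host leaves the parent's hostname untouched, which is the isolation a container needs.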
Similarly to KLexicalPath, we might need to check if a path is canonical or not.
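A canonicality check of this kind might look like the sketch below. The definition of "canonical" used here is my assumption, not necessarily the one the commit implements: the path is non-empty, has no trailing slash (except for the root itself), and contains no empty, "." or ".." components. `is_canonical_path` is a hypothetical free function written against `std::string_view` (C++17).

```cpp
#include <cassert>
#include <string_view>

// Returns true if the path is canonical under the assumed definition above.
bool is_canonical_path(std::string_view path)
{
    if (path.empty())
        return false;
    if (path == "/")
        return true;
    if (path.back() == '/')
        return false;
    // Skip the leading slash of an absolute path before splitting.
    if (path.front() == '/')
        path.remove_prefix(1);
    while (true) {
        auto const slash = path.find('/');
        auto const component = path.substr(0, slash);
        if (component.empty() || component == "." || component == "..")
            return false;
        if (slash == std::string_view::npos)
            return true;
        path.remove_prefix(slash + 1);
    }
}
```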
These 2 methods will be used later by the userspace implementation that will handle creation of containers.
Together with a first JSON file for bringing up a fully functional BuggieBox container, we allow users to take advantage of the kernel unsharing features that were introduced in earlier commits.
82a6d25 to 851f72e
This should be ready for review again :)
Let's get this conflict magnet out of the way.
This is the thing I worked on for the last 2 weeks - containers, which consist of combining 4 main features:
The concept of VFS root context is very similar to the Linux mount namespace idea. I still have to write all the reasoning behind what I do, so this is a draft, for now.
In the next couple of days I will try to land the last commits that bring it all up with a shiny new userspace utility called runc (for "run container"). Please note that although this is not finished yet, there are many good bits in the PR which will benefit the system in many ways, and because I use them as foundations for the end goal, I'd rather keep them in this PR.