-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsenter: implement a two-stage join for setns #4492
base: main
Are you sure you want to change the base?
Conversation
Huh, these diffs from diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 4da0665c..b0787c61 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -432,21 +432,21 @@ static struct nstype_t {
int type;
char *name;
} all_ns_types[] = {
- { .type = CLONE_NEWCGROUP, .name = "cgroup" },
- { .type = CLONE_NEWIPC, .name = "ipc" },
- { .type = CLONE_NEWNS, .name = "mnt" },
- { .type = CLONE_NEWNET, .name = "net" },
- { .type = CLONE_NEWPID, .name = "pid" },
- { .type = CLONE_NEWTIME, .name = "time" },
- { .type = CLONE_NEWUSER, .name = "user" },
- { .type = CLONE_NEWUTS, .name = "uts" },
- {}, /* null terminator */
+ {.type = CLONE_NEWCGROUP,.name = "cgroup"},
+ {.type = CLONE_NEWIPC,.name = "ipc"},
+ {.type = CLONE_NEWNS,.name = "mnt"},
+ {.type = CLONE_NEWNET,.name = "net"},
+ {.type = CLONE_NEWPID,.name = "pid"},
+ {.type = CLONE_NEWTIME,.name = "time"},
+ {.type = CLONE_NEWUSER,.name = "user"},
+ {.type = CLONE_NEWUTS,.name = "uts"},
+ {}, /* null terminator */
};
/* Returns the clone(2) flag for a namespace, given the name of a namespace. */
static int nstype(char *name)
{
- for (struct nstype_t *ns = all_ns_types; ns->name != NULL; ns++)
+ for (struct nstype_t * ns = all_ns_types; ns->name != NULL; ns++)
if (!strcmp(name, ns->name))
return ns->type;
/*
@@ -567,18 +565,17 @@ static char *strappend(char *dst, char *src)
static char *nsset_to_str(nsset_t nsset)
{
char *str = NULL;
- for (struct nstype_t *ns = all_ns_types; ns->name != NULL; ns++) {
+ for (struct nstype_t * ns = all_ns_types; ns->name != NULL; ns++) {
if (ns->type & nsset) {
if (str)
str = strappend(str, ", ");
str = strappend(str, ns->name);
}
}
- return str ?: strdup("");
+ return str ? : strdup("");
} The first one seems like a bug in |
33eca5e
to
9f5ee7e
Compare
Seems like AlmaLinux's |
*/ | ||
joined |= __join_namespaces(to_join & ~(joined | CLONE_NEWUSER), ns_list, ns_len); | ||
joined |= __join_namespaces(CLONE_NEWUSER, ns_list, ns_len); | ||
joined |= __join_namespaces(to_join & ~(joined | CLONE_NEWUSER), ns_list, ns_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, is there any namespaces will be joined failed before we join user namespace, but will join successfully after user ns joined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are rootless and there are userns-owned namespaces, we need to be in the userns in order to have the necessary permissions to do setns
. This is what 2cd9c31 fixed, and so we need to keep this part of that behaviour. The way we fixed it before was changing the order (so the userns would be first), but by doing it this way the order is irrelevant.
If you remove this line, the rootless docker exec
tests all fail.
(This is also what nsenter(1)
does afaics.)
b6876e9
to
754e51b
Compare
libcontainer/nsenter/nsexec.c
Outdated
write_log(DEBUG, "setns(%#x) into %s namespace (with path %s)", flag, ns->type, ns->path); | ||
if (setns(ns->fd, flag) < 0) | ||
if (!(type & allow)) { | ||
write_log(DEBUG, "skipping setns(%#x) into %s namespace (with path %s)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should check whether is this type ns has been joined before, or else it will make user confused for this log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many logs about skipping setns.
@lifubang ➜ ~/bb $ sudo /workspaces/runc/runc --debug exec -d test sleep inf
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:202 libcontainer/dmz.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:227 libcontainer/dmz.CloneSelfExe() runc-dmz: using overlayfs for sealed /proc/self/exe
DEBU[0000]libcontainer/container_linux.go:506 libcontainer.(*Container).newParentProcess() runc-dmz: using /proc/self/exe clone
DEBU[0000] nsexec[8341]: => nsexec container setup
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec[8341]: set process as non-dumpable
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: ~> nsexec stage-0
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: spawn stage-1
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: -> stage-1 synchronisation loop
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: ~> nsexec stage-1
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x8000000) into ipc namespace (with path /proc/8162/ns/ipc): Success
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x4000000) into uts namespace (with path /proc/8162/ns/uts): Success
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x40000000) into net namespace (with path /proc/8162/ns/net): Success
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x20000000) into pid namespace (with path /proc/8162/ns/pid): Success
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x20000) into mnt namespace (with path /proc/8162/ns/mnt): Success
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x2000000) into cgroup namespace (with path /proc/8162/ns/cgroup): Success
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x8000000) into ipc namespace (with path /proc/8162/ns/ipc)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x4000000) into uts namespace (with path /proc/8162/ns/uts)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x40000000) into net namespace (with path /proc/8162/ns/net)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000000) into pid namespace (with path /proc/8162/ns/pid)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000) into mnt namespace (with path /proc/8162/ns/mnt)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x2000000) into cgroup namespace (with path /proc/8162/ns/cgroup)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x8000000) into ipc namespace (with path /proc/8162/ns/ipc)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x4000000) into uts namespace (with path /proc/8162/ns/uts)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x40000000) into net namespace (with path /proc/8162/ns/net)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000000) into pid namespace (with path /proc/8162/ns/pid)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000) into mnt namespace (with path /proc/8162/ns/mnt)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x2000000) into cgroup namespace (with path /proc/8162/ns/cgroup)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: unshare remaining namespaces
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: spawn stage-2
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: request stage-0 to forward stage-2 pid (8343)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: stage-1 requested pid to be forwarded
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: forward stage-1 (8342) and stage-2 (8343) pids to runc
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: signal completion to stage-0
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: stage-1 complete
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: <- stage-1 synchronisation loop
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: -> stage-2 synchronisation loop
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: signalling stage-2 to run
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: <~ nsexec stage-1
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-2[6]: ~> nsexec stage-2
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-2[6]: signal completion to stage-0
DEBU[0000] nsexec-0[8341]: stage-2 complete
DEBU[0000] nsexec-0[8341]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[8341]: <~ nsexec stage-0
DEBU[0000] nsexec-2[6]: <= nsexec container setup
DEBU[0000] nsexec-2[6]: booting up go runtime ...
DEBU[0000] child process in init()
DEBU[0000] reading sync
DEBU[0000] read sync type:procReady
DEBU[0000] writing sync type:procRun
DEBU[0000] reading sync
DEBU[0000] sync pipe closed
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() writing sync type:procReady
DEBU[0000] reading sync
DEBU[0000] read sync type:procRun
DEBU[0000] setns_init: about to exec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably skip the "skipping" messages, but we should keep the error and success lines. These are all debug logs, so there is basically no overhead to having them without debug logging enabled (we don't forward them in log.c
if the log level isn't debug).
* | ||
* This is similar to what nsenter(1) seems to do in practice. | ||
*/ | ||
joined |= __join_namespaces(to_join & ~(joined | CLONE_NEWUSER), ns_list, ns_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for rootless container, all to_join
ns paths will be joined fail in the first step? So it will introduce some unnecessary syscalls for rootless container? From this aspect, please look #4491 to check whether it will be more simple to fix the related issue in go code or not?
free(namespaces); | ||
/* Make sure we joined the namespaces we planned to. */ | ||
if (failed_to_join) | ||
bail("failed to join {%s} namespaces: %s", nsset_to_str(failed_to_join), strerror(EPERM)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: bail
appends : %m
to the message printed so maybe instead of : %s", ... strerror(EPERM)
you can just assign errno = EPERM
before calling it.
Otherwise we'll end up with something like ... namespaces: Operation not permitted: Success
which is somewhat confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like:
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[11318]: skipping setns(0x10000000) into user namespace (with path /proc/1/ns/user)
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[11318]: setns(0x10000000) into user namespace (with path /proc/1/ns/user): Invalid argument
FATA[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[11318]: failed to setns into user namespace: Invalid argument
FATA[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[11317]: failed to sync with stage-1: next state: Success
ERRO[0000]utils.go:62 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single nit; LGTM otherwise.
Signed-off-by: Aleksa Sarai <[email protected]>
This is basically a no-op change because runc already disallows this, but it will be needed in future patches when we have to track what namespaces have already been joined. Signed-off-by: Aleksa Sarai <[email protected]>
If we are running with privileges and are asked to join an externally created user namespaces as well as some other namespace that was *not* created underneath said user namespace, the approach we added in commit 2cd9c31 ("nsenter: guarantee correct user namespace ordering") doesn't work. While in theory you would want all externally created namespaces to be sane, it seems that some tools really do create unrelated namespaces and ask us to join them. Luckily we can just loosely copy what nsenter(1) appears to do -- we first try to join any namespaces we can (with host root privileges), then we join any user namespaces, and then we join any remaining namespaces (now with the user namespace's privileges). Note that we *do not* have to try to join namespaces after we create our own user namespace. Namespace permissions are based purely on the owning user namespace (not the rootuid) so we will not have access to any extra namespaces once we unshare(CLONE_NEWUSER) (in fact we will not be able to setns(2) to anything!). Fixes: 2cd9c31 ("nsenter: guarantee correct user namespace ordering") Signed-off-by: Aleksa Sarai <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
754e51b
to
fffc165
Compare
Do you want this PR in 1.2.2? Because it's really a bug. |
This is a bug but not a regression, and the change is somewhat radical. For release-1.2 branch we should focus on regressions and security fixes, while keeping it stable. So, to my mind, let's keep it for 1.3 and not backport to 1.2.x. We also have a goal to release 1.3.0 in a timely manner (say in about 5 months from now, not a few years like it happened with runc 1.2.0). |
If we are running with privileges and are asked to join an externally
created user namespaces as well as some other namespace that was not
created underneath said user namespace, the approach we added in commit
2cd9c31 ("nsenter: guarantee correct user namespace ordering")
doesn't work.
While in theory you would want all externally created namespaces to be
sane, it seems that some tools really do create unrelated namespaces and
ask us to join them. Luckily we can just loosely copy what nsenter(1)
appears to do -- we first try to join any namespaces we can (with host
root privileges), then we join any user namespaces, and then we join any
remaining namespaces (now with the user namespace's privileges).
Note that we do not have to try to join namespaces after we create our
own user namespace. Namespace permissions are based purely on the owning
user namespace (not the rootuid) so we will not have access to any extra
namespaces once we unshare(CLONE_NEWUSER) (in fact we will not be able
to setns(2) to anything!).
Fixes #4390
Closes #4491
Fixes: 2cd9c31 ("nsenter: guarantee correct user namespace ordering")
Signed-off-by: Aleksa Sarai [email protected]