Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsenter: implement a two-stage join for setns #4492

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Oct 30, 2024

If we are running with privileges and are asked to join an externally
created user namespaces as well as some other namespace that was not
created underneath said user namespace, the approach we added in commit
2cd9c31 ("nsenter: guarantee correct user namespace ordering")
doesn't work.

While in theory you would want all externally created namespaces to be
sane, it seems that some tools really do create unrelated namespaces and
ask us to join them. Luckily we can just loosely copy what nsenter(1)
appears to do -- we first try to join any namespaces we can (with host
root privileges), then we join any user namespaces, and then we join any
remaining namespaces (now with the user namespace's privileges).

Note that we do not have to try to join namespaces after we create our
own user namespace. Namespace permissions are based purely on the owning
user namespace (not the rootuid) so we will not have access to any extra
namespaces once we unshare(CLONE_NEWUSER) (in fact we will not be able
to setns(2) to anything!).

Fixes #4390
Closes #4491

Fixes: 2cd9c31 ("nsenter: guarantee correct user namespace ordering")
Signed-off-by: Aleksa Sarai [email protected]

@cyphar
Copy link
Member Author

cyphar commented Oct 30, 2024

Huh, these diffs from cfmt are really weird:

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index 4da0665c..b0787c61 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -432,21 +432,21 @@ static struct nstype_t {
 	int type;
 	char *name;
 } all_ns_types[] = {
-	{ .type = CLONE_NEWCGROUP, .name = "cgroup" },
-	{ .type = CLONE_NEWIPC, .name = "ipc" },
-	{ .type = CLONE_NEWNS, .name = "mnt" },
-	{ .type = CLONE_NEWNET, .name = "net" },
-	{ .type = CLONE_NEWPID, .name = "pid" },
-	{ .type = CLONE_NEWTIME, .name = "time" },
-	{ .type = CLONE_NEWUSER, .name = "user" },
-	{ .type = CLONE_NEWUTS, .name = "uts" },
-	{}, /* null terminator */
+	{.type = CLONE_NEWCGROUP,.name = "cgroup"},
+	{.type = CLONE_NEWIPC,.name = "ipc"},
+	{.type = CLONE_NEWNS,.name = "mnt"},
+	{.type = CLONE_NEWNET,.name = "net"},
+	{.type = CLONE_NEWPID,.name = "pid"},
+	{.type = CLONE_NEWTIME,.name = "time"},
+	{.type = CLONE_NEWUSER,.name = "user"},
+	{.type = CLONE_NEWUTS,.name = "uts"},
+	{},			/* null terminator */
 };
 
 /* Returns the clone(2) flag for a namespace, given the name of a namespace. */
 static int nstype(char *name)
 {
-	for (struct nstype_t *ns = all_ns_types; ns->name != NULL; ns++)
+	for (struct nstype_t * ns = all_ns_types; ns->name != NULL; ns++)
 		if (!strcmp(name, ns->name))
 			return ns->type;
 	/*
@@ -567,18 +565,17 @@ static char *strappend(char *dst, char *src)
 static char *nsset_to_str(nsset_t nsset)
 {
 	char *str = NULL;
-	for (struct nstype_t *ns = all_ns_types; ns->name != NULL; ns++) {
+	for (struct nstype_t * ns = all_ns_types; ns->name != NULL; ns++) {
 		if (ns->type & nsset) {
 			if (str)
 				str = strappend(str, ", ");
 			str = strappend(str, ns->name);
 		}
 	}
-	return str ?: strdup("");
+	return str ? : strdup("");
 }

The first one seems like a bug in indent because even -sar doesn't format it properly.

@cyphar
Copy link
Member Author

cyphar commented Oct 30, 2024

Seems like AlmaLinux's nsenter auto-sets the mappings if you don't define them. Very weird. We could just create a container and use its userns, but let me see if I can work around this first (it would be nice to properly make the namespaces completely independently of runc).

*/
joined |= __join_namespaces(to_join & ~(joined | CLONE_NEWUSER), ns_list, ns_len);
joined |= __join_namespaces(CLONE_NEWUSER, ns_list, ns_len);
joined |= __join_namespaces(to_join & ~(joined | CLONE_NEWUSER), ns_list, ns_len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, is there any namespaces will be joined failed before we join user namespace, but will join successfully after user ns joined?

Copy link
Member Author

@cyphar cyphar Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are rootless and there are userns-owned namespaces, we need to be in the userns in order to have the necessary permissions to do setns. This is what 2cd9c31 fixed, and so we need to keep this part of that behaviour. The way we fixed it before was changing the order (so the userns would be first), but by doing it this way the order is irrelevant.

If you remove this line, the rootless docker exec tests all fail.

(This is also what nsenter(1) does afaics.)

@cyphar cyphar force-pushed the nsenter-flexible-joining branch 4 times, most recently from b6876e9 to 754e51b Compare October 31, 2024 03:36
libcontainer/nsenter/nsexec.c Show resolved Hide resolved
write_log(DEBUG, "setns(%#x) into %s namespace (with path %s)", flag, ns->type, ns->path);
if (setns(ns->fd, flag) < 0)
if (!(type & allow)) {
write_log(DEBUG, "skipping setns(%#x) into %s namespace (with path %s)",
Copy link
Member

@lifubang lifubang Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should check whether is this type ns has been joined before, or else it will make user confused for this log.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many logs about skipping setns.

@lifubang ➜ ~/bb $ sudo /workspaces/runc/runc --debug exec -d test sleep inf
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:202 libcontainer/dmz.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument 
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:227 libcontainer/dmz.CloneSelfExe() runc-dmz: using overlayfs for sealed /proc/self/exe 
DEBU[0000]libcontainer/container_linux.go:506 libcontainer.(*Container).newParentProcess() runc-dmz: using /proc/self/exe clone         
DEBU[0000] nsexec[8341]: => nsexec container setup      
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec[8341]: set process as non-dumpable    
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: ~> nsexec stage-0            
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: spawn stage-1                
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: -> stage-1 synchronisation loop 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: ~> nsexec stage-1            
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x8000000) into ipc namespace (with path /proc/8162/ns/ipc): Success 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x4000000) into uts namespace (with path /proc/8162/ns/uts): Success 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x40000000) into net namespace (with path /proc/8162/ns/net): Success 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x20000000) into pid namespace (with path /proc/8162/ns/pid): Success 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x20000) into mnt namespace (with path /proc/8162/ns/mnt): Success 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: setns(0x2000000) into cgroup namespace (with path /proc/8162/ns/cgroup): Success 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x8000000) into ipc namespace (with path /proc/8162/ns/ipc) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x4000000) into uts namespace (with path /proc/8162/ns/uts) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x40000000) into net namespace (with path /proc/8162/ns/net) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000000) into pid namespace (with path /proc/8162/ns/pid) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000) into mnt namespace (with path /proc/8162/ns/mnt) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x2000000) into cgroup namespace (with path /proc/8162/ns/cgroup) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x8000000) into ipc namespace (with path /proc/8162/ns/ipc) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x4000000) into uts namespace (with path /proc/8162/ns/uts) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x40000000) into net namespace (with path /proc/8162/ns/net) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000000) into pid namespace (with path /proc/8162/ns/pid) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x20000) into mnt namespace (with path /proc/8162/ns/mnt) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: skipping setns(0x2000000) into cgroup namespace (with path /proc/8162/ns/cgroup) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: unshare remaining namespaces 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: spawn stage-2                
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: request stage-0 to forward stage-2 pid (8343) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: stage-1 requested pid to be forwarded 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: forward stage-1 (8342) and stage-2 (8343) pids to runc 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: signal completion to stage-0 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: stage-1 complete             
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: <- stage-1 synchronisation loop 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: -> stage-2 synchronisation loop 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[8341]: signalling stage-2 to run    
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[8342]: <~ nsexec stage-1            
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-2[6]: ~> nsexec stage-2               
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-2[6]: signal completion to stage-0    
DEBU[0000] nsexec-0[8341]: stage-2 complete             
DEBU[0000] nsexec-0[8341]: <- stage-2 synchronisation loop 
DEBU[0000] nsexec-0[8341]: <~ nsexec stage-0            
DEBU[0000] nsexec-2[6]: <= nsexec container setup       
DEBU[0000] nsexec-2[6]: booting up go runtime ...       
DEBU[0000] child process in init()                      
DEBU[0000] reading sync                                 
DEBU[0000] read sync type:procReady                     
DEBU[0000] writing sync type:procRun                    
DEBU[0000] reading sync                                 
DEBU[0000] sync pipe closed                             
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() writing sync type:procReady                  
DEBU[0000] reading sync                                 
DEBU[0000] read sync type:procRun                       
DEBU[0000] setns_init: about to exec

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably skip the "skipping" messages, but we should keep the error and success lines. These are all debug logs, so there is basically no overhead to having them without debug logging enabled (we don't forward them in log.c if the log level isn't debug).

*
* This is similar to what nsenter(1) seems to do in practice.
*/
joined |= __join_namespaces(to_join & ~(joined | CLONE_NEWUSER), ns_list, ns_len);
Copy link
Member

@lifubang lifubang Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for rootless container, all to_join ns paths will be joined fail in the first step? So it will introduce some unnecessary syscalls for rootless container? From this aspect, please look #4491 to check whether it will be more simple to fix the related issue in go code or not?

free(namespaces);
/* Make sure we joined the namespaces we planned to. */
if (failed_to_join)
bail("failed to join {%s} namespaces: %s", nsset_to_str(failed_to_join), strerror(EPERM));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: bail appends : %m to the message printed so maybe instead of : %s", ... strerror(EPERM) you can just assign errno = EPERM before calling it.

Otherwise we'll end up with something like ... namespaces: Operation not permitted: Success which is somewhat confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like:

DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[11318]: skipping setns(0x10000000) into user namespace (with path /proc/1/ns/user) 
DEBU[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[11318]: setns(0x10000000) into user namespace (with path /proc/1/ns/user): Invalid argument 
FATA[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-1[11318]: failed to setns into user namespace: Invalid argument 
FATA[0000]libcontainer/logs/logs.go:55 libcontainer/logs.processEntry() nsexec-0[11317]: failed to sync with stage-1: next state: Success 
ERRO[0000]utils.go:62 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single nit; LGTM otherwise.

This is basically a no-op change because runc already disallows this,
but it will be needed in future patches when we have to track what
namespaces have already been joined.

Signed-off-by: Aleksa Sarai <[email protected]>
If we are running with privileges and are asked to join an externally
created user namespaces as well as some other namespace that was *not*
created underneath said user namespace, the approach we added in commit
2cd9c31 ("nsenter: guarantee correct user namespace ordering")
doesn't work.

While in theory you would want all externally created namespaces to be
sane, it seems that some tools really do create unrelated namespaces and
ask us to join them. Luckily we can just loosely copy what nsenter(1)
appears to do -- we first try to join any namespaces we can (with host
root privileges), then we join any user namespaces, and then we join any
remaining namespaces (now with the user namespace's privileges).

Note that we *do not* have to try to join namespaces after we create our
own user namespace. Namespace permissions are based purely on the owning
user namespace (not the rootuid) so we will not have access to any extra
namespaces once we unshare(CLONE_NEWUSER) (in fact we will not be able
to setns(2) to anything!).

Fixes: 2cd9c31 ("nsenter: guarantee correct user namespace ordering")
Signed-off-by: Aleksa Sarai <[email protected]>
@lifubang
Copy link
Member

Do you want this PR in 1.2.2? Because it's really a bug.

@kolyshkin kolyshkin added this to the 1.3.0 milestone Nov 13, 2024
@kolyshkin
Copy link
Contributor

Do you want this PR in 1.2.2? Because it's really a bug.

This is a bug but not a regression, and the change is somewhat radical. For release-1.2 branch we should focus on regressions and security fixes, while keeping it stable. So, to my mind, let's keep it for 1.3 and not backport to 1.2.x.

We also have a goal to release 1.3.0 in a timely manner (say in about 5 months from now, not a few years like it happened with runc 1.2.0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'runc exec' errors with 'failed to setns into net namespace: Operation not permitted'
3 participants