Issue pulling image with insufficient sub/guids #311

Open
vsoch opened this issue Nov 23, 2023 · 6 comments
Labels
question Further information is requested

Comments

@vsoch
Contributor

vsoch commented Nov 23, 2023

I'm (fairly successfully) running different kinds of pods in usernetes, but I just hit this error:

  Warning  Failed     28s                 kubelet            Failed to pull image "vanessa/pytorch-dist-test": failed to pull and unpack image "docker.io/vanessa/pytorch-dist-test:latest": failed to extract layer sha256:b51c2b01ae19fd8eccd86ab9a8667a71f5ae4f739790cd8859935405bcceca93: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount180165242: failed to Lchown "/var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin" for UID 1185200044, GID 1185200044: lchown /var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown

I'm not sure the hint in the error matches the actual cause; I was wondering if there are too many containers running? I figured out I could run make shell to get into one of the nodes, and then I found a way to list the containerd images and running containers:

root@u7s-lima-flux-0:/usernetes# crictl  images
IMAGE                                      TAG                  IMAGE ID            SIZE
docker.io/flannel/flannel-cni-plugin       v1.2.0               a55d1bad692b7       3.88MB
docker.io/flannel/flannel                  v0.22.2              49937eb983daf       27MB
docker.io/kindest/kindnetd                 v20230511-dc714da8   b0b1fa0f58c6e       27.7MB
docker.io/kindest/local-path-helper        v20230510-486859a6   be300acfc8622       3.05MB
docker.io/kindest/local-path-provisioner   v20230511-dc714da8   ce18e076e9d4b       19.4MB
registry.k8s.io/coredns/coredns            v1.10.1              ead0a4a53df89       16.2MB
registry.k8s.io/etcd                       3.5.9-0              73deb9a3f7025       103MB
registry.k8s.io/kube-apiserver             v1.28.0              a432ea809db3e       85.8MB
registry.k8s.io/kube-apiserver             v1.28.4              7fe0e6f37db33       34.7MB
registry.k8s.io/kube-controller-manager    v1.28.0              df537910e4a99       81.5MB
registry.k8s.io/kube-controller-manager    v1.28.4              d058aa5ab969c       33.4MB
registry.k8s.io/kube-proxy                 v1.28.0              b16199d508b6d       74.7MB
registry.k8s.io/kube-proxy                 v1.28.4              83f6cc407eed8       24.6MB
registry.k8s.io/kube-scheduler             v1.28.0              553617289d9f1       61.5MB
registry.k8s.io/kube-scheduler             v1.28.4              e3db313c6dbc0       18.8MB
registry.k8s.io/pause                      3.7                  221177c6082a8       311kB
registry.k8s.io/pause                      3.9                  e6f1816883972       322kB
root@u7s-lima-flux-0:/usernetes# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
7c707a589ca66       ead0a4a53df89       4 hours ago         Running             coredns                   0                   aed90e90c6242       coredns-5dd5756b68-9kl5h
9776db8e8b544       ead0a4a53df89       4 hours ago         Running             coredns                   0                   4a0eeeb48b889       coredns-5dd5756b68-qnbdl
d5fba5d8745e9       49937eb983daf       4 hours ago         Running             kube-flannel              0                   2a4085ee6ee10       kube-flannel-ds-77kwr
2eb73b78f1728       83f6cc407eed8       4 hours ago         Running             kube-proxy                0                   bfc90a6ccd0cc       kube-proxy-czg44
f5f8eb5441fdd       73deb9a3f7025       4 hours ago         Running             etcd                      0                   4c14985778844       etcd-u7s-lima-flux-0
eeb8ff772a280       7fe0e6f37db33       4 hours ago         Running             kube-apiserver            0                   3ec15fdc4a314       kube-apiserver-u7s-lima-flux-0
9521080e367b7       d058aa5ab969c       4 hours ago         Running             kube-controller-manager   0                   c76cf2b2f0b3a       kube-controller-manager-u7s-lima-flux-0
9515db7b8fb1c       e3db313c6dbc0       4 hours ago         Running             kube-scheduler            0                   54bedc845ed7a       kube-scheduler-u7s-lima-flux-0

These just look like images for the kubelet or control plane (not any applications) and, interestingly, there aren't any subuid/subgid entries in the files here:

# cat /etc/subuid 
# cat /etc/subgid
# both empty files

Is there a bug here / something we can do to get it to work?
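
For anyone debugging the same error: the ID range a node can actually use is visible in /proc/self/uid_map and /proc/self/gid_map, whatever /etc/subuid says inside the node. Below is a minimal sketch, assuming the standard three-column map format (first ID inside the namespace, first ID outside, length). The helper name max_mapped_id is made up for illustration and is not part of usernetes.

```shell
# max_mapped_id FILE: print the highest in-namespace ID covered by a
# uid_map/gid_map style file.
# Each line has three fields: <first ID inside> <first ID outside> <length>.
max_mapped_id() {
  awk '{ last = $1 + $3 - 1; if (last > max) max = last }
       END { printf "%.0f\n", max }' "$1"
}

# Inside the node (for example after `make shell`), inspect the live mapping.
if [ -r /proc/self/uid_map ]; then
  cat /proc/self/uid_map
  max_mapped_id /proc/self/uid_map
fi
```

If the printed maximum is smaller than the UID in the error (1185200044 here), the lchown call has no host ID to map to, which would be consistent with the "invalid argument" failure.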

@vsoch
Contributor Author

vsoch commented Nov 23, 2023

Also, this might (maybe?) be an issue with creating more pods than my resources (CPU, etc.) can support. I deployed a smaller PyTorch workflow (ref) and it worked! @AkihiroSuda this is SO cool it's rocking my socks!! 🧦 This is what we wanted to get working many months ago and I'm over the moon it's starting to! 🌔

@AkihiroSuda
Member

> These just look like images for the kubelet or control plane (not any applications) and, interestingly, there aren't any subuid/subgid entries in the files here:

Please check the files on the host.
You probably have 65536 ids there.
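
For reference, each line of /etc/subuid and /etc/subgid has the form name:first:count. Here is a minimal sketch for totalling the count granted to the current user on the host. The helper name subid_count is made up for illustration, and the sketch assumes the entries are keyed by user name rather than numeric UID.

```shell
# subid_count USER FILE: total subordinate IDs granted to USER in a
# subuid/subgid style file.
# Each line has the form <name or numeric ID>:<first ID>:<count>.
subid_count() {
  awk -F: -v u="$1" '$1 == u { total += $3 }
                     END { printf "%.0f\n", total }' "$2"
}

# Run on the host, not inside the node container.
for f in /etc/subuid /etc/subgid; do
  if [ -r "$f" ]; then
    echo "$f: $(subid_count "$(id -un)" "$f") IDs for $(id -un)"
  fi
done
```

A total of 65536 covers most images, but not one that contains files owned by a very large UID or GID.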

@vsoch
Contributor Author

vsoch commented Nov 24, 2023

Yes, the subuid/subgid files on the host virtual machine (not inside of docker compose) look OK.

@AkihiroSuda
Member

Please try increasing 65536 there to a larger number

@vsoch
Contributor Author

vsoch commented Nov 24, 2023

> Please try increasing 65536 there to a larger number

Sure! I've never done that on my host. How large should it be?

@AkihiroSuda
Member

Depends, but at least 1185200044 for your image

> for UID 1185200044, GID 1185200044: lchown /var/lib/containerd/tmpmounts/containerd-mount180165242/opt/conda/pkgs/pytorch-1.0.0-py3.6_cuda10.0.130_cudnn7.4.1_1/bin: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid)
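
As a rough sketch of what a large enough range looks like: with a count of 1185200044, the last host ID in the range is first + count - 1. The start value 100000 and the helper name subid_range are assumptions for illustration. The commented usermod lines use the real --add-subuids / --add-subgids options, but they add a new range rather than resizing an existing one, so editing the count in the existing /etc/subuid and /etc/subgid lines by hand, as suggested above, may be the simpler route.

```shell
# subid_range FIRST COUNT: print "FIRST-LAST", the form that usermod's
# --add-subuids / --add-subgids options accept.
subid_range() {
  echo "$1-$(( $1 + $2 - 1 ))"
}

required=1185200044   # UID/GID reported in the lchown error
first=100000          # assumed start; must not overlap another user's range

subid_range "$first" "$required"   # prints 100000-1185300043

# To apply on the host (commented out because it modifies system files):
# sudo usermod --add-subuids "$(subid_range "$first" "$required")" "$(id -un)"
# sudo usermod --add-subgids "$(subid_range "$first" "$required")" "$(id -un)"
```

The rootless container engine on the host probably needs a restart, and the nodes need recreating, before a new range takes effect.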
