
Nodeadm-run service fails if containerd is not fully initialized and running #1917

Closed
ajvn opened this issue Aug 10, 2024 · 9 comments

ajvn commented Aug 10, 2024

What happened:
We are building our own AMIs with critical container images preloaded. This helps nodes reach a ready state faster, and the images are available on the node itself in case registries are down.

However, this slows down containerd startup a little, and if containerd is not fully up and running before nodeadm tries to pull the pause image, the nodeadm-run service fails and the node won't join the cluster; the instance stays up and has to be removed manually. The other option is getting onto the node itself and restarting the nodeadm-run service, after which it joins the cluster, but that is obviously not a scale-friendly solution.

This happens randomly; so far, on average, it affects every 4th node joining the cluster.

What you expected to happen:
It looks like nodeadm tries to pull the pause image 3 times and then exits with an error (see the journal output below).

It would be good if we could adjust how many times and/or how often it retries via a configuration option or a flag (preferably a configuration option, so we don't have to adjust the systemd unit file).

Potentially we could try adding containerd.service to the After= directive of the systemd unit file, but I don't know whether that helps, as we need containerd to be fully up and running, not just active according to systemd.

[Unit]
Description=EKS Nodeadm Run
Documentation=https://github.com/awslabs/amazon-eks-ami
# start after cloud-init, in order to pickup changes the
# user may have applied via cloud-init scripts
After=nodeadm-config.service cloud-final.service <= adding containerd.service here
Requires=nodeadm-config.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip config

[Install]
WantedBy=multi-user.target

How to reproduce it (as minimally and precisely as possible):
Have containerd not be fully ready before the nodeadm-run service executes.

Anything else we need to know?:
Here's some additional information which helped me during the investigation:

$ sudo journalctl --no-pager -b -u nodeadm-run
...
Aug 10 04:22:54 nodeadm[3026]: {"level":"info","ts":1723263774.9953516,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:22:54 nodeadm[3026]: E0810 04:22:54.996931    3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Aug 10 04:22:56 nodeadm[3026]: {"level":"info","ts":1723263776.997649,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:22:56 nodeadm[3026]: E0810 04:22:56.997752    3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory"
Aug 10 04:23:00 nodeadm[3026]: {"level":"info","ts":1723263780.998839,"caller":"containerd/sandbox.go:48","msg":"Pulling sandbox image..","image":"602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5"}
Aug 10 04:23:01 nodeadm[3026]: E0810 04:23:01.003106    3026 remote_image.go:135] PullImage "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5" from image service failed: rpc error: code = Unknown desc = server is not initialized yet
Aug 10 04:23:09 nodeadm[3026]: {"level":"fatal","ts":1723263789.010438,"caller":"nodeadm/main.go:36","msg":"Command failed","error":"rpc error: code = Unknown desc = server is not initialized yet","stacktrace":"main.main\n\t/workdir/cmd/nodeadm/main.go:36\nruntime.main\n\t/root/sdk/go1.21.9/src/runtime/proc.go:267"}
Aug 10 04:23:09 systemd[1]: nodeadm-run.service: Main process exited, code=exited, status=1/FAILURE
Aug 10 04:23:09 systemd[1]: nodeadm-run.service: Failed with result 'exit-code'.
Aug 10 04:23:09 systemd[1]: Failed to start nodeadm-run.service - EKS Nodeadm Run.

Mainly relevant are the times at which the containerd and nodeadm-run services started:

● containerd.service - containerd container runtime
     Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; preset: disabled)
    Drop-In: /etc/systemd/system/containerd.service.d
             └─00-runtime-slice.conf
     Active: active (running) since Sat 2024-08-10 04:23:01 UTC; 2h 14min ago
       Docs: https://containerd.io
    Process: 3050 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3058 (containerd)
      Tasks: 89
     Memory: 29.8M
        CPU: 4.262s
     CGroup: /runtime.slice/containerd.service
             └─3058 /usr/bin/containerd

Warning: some journal files were not opened due to insufficient permissions.
---
× nodeadm-run.service - EKS Nodeadm Run
     Loaded: loaded (/etc/systemd/system/nodeadm-run.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Sat 2024-08-10 04:23:09 UTC; 2h 14min ago
       Docs: https://github.com/awslabs/amazon-eks-ami
    Process: 3026 ExecStart=/usr/bin/nodeadm init --skip config (code=exited, status=1/FAILURE)
   Main PID: 3026 (code=exited, status=1/FAILURE)
        CPU: 107ms

Warning: some journal files were not opened due to insufficient permissions.

Environment:
I don't believe it's relevant in this case, but I'm happy to provide it if requested.


ajvn commented Aug 12, 2024

The workaround for now is replacing the default systemd unit file with the following, added to the node launch template:

...
--BOUNDARY
Content-Type: text/x-shellscript;

#!/usr/bin/env bash
cat > /etc/systemd/system/nodeadm-run.service << EOF
[Unit]
Description=EKS Nodeadm Run
Documentation=https://github.com/awslabs/amazon-eks-ami
# start after cloud-init, in order to pickup changes the
# user may have applied via cloud-init scripts
After=nodeadm-config.service cloud-final.service containerd.service
Requires=nodeadm-config.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nodeadm init --skip config
RestartSec=5s
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload

--BOUNDARY--


cartermckinnon commented Aug 12, 2024

I haven't seen this occur, but the bug sounds legit to me. I think we'll want to handle this in nodeadm instead of with systemd unit dependencies; we can just wait until our CRI client can connect to the socket. I'll get a PR together this week 👍
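
For illustration only, the idea is roughly a bounded wait before the pull, along the lines of the Go sketch below. This is just a sketch, not the actual nodeadm change: the function name, socket path, and timeouts are placeholders, and the real fix may also need to retry past the "server is not initialized yet" phase visible in the logs above, not merely wait for the socket to accept connections.

package main

import (
	"fmt"
	"net"
	"time"
)

// waitForContainerdSocket blocks until the containerd unix socket accepts a
// connection, or returns an error once the overall timeout is exceeded.
func waitForContainerdSocket(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("unix", path, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("containerd socket %q not ready after %s: %w", path, timeout, err)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	// Placeholder values; the real socket path comes from the node's containerd config.
	if err := waitForContainerdSocket("/run/containerd/containerd.sock", 60*time.Second); err != nil {
		panic(err)
	}
	fmt.Println("containerd socket is accepting connections")
}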


ajvn commented Aug 13, 2024

Thank you, let me know when there's an AMI available to test and I'll give it a go.

@ndbaker1

#1965 should help with containerd not being ready before initiating the sandbox image pulls 👍


ajvn commented Sep 17, 2024

Just so I understand properly: is this not AMI-version dependent, i.e. will a new node automatically pull the new nodeadm version when it joins the cluster?


ndbaker1 commented Sep 17, 2024

@ajvn nodeadm gets built with the AMI, so you'll get updates when the next AMI release happens 👍


ajvn commented Sep 17, 2024

Gotcha, thanks 👍
I'll let you know how it goes once there's a new AMI released and I take it into use.


ajvn commented Oct 4, 2024

I'm getting around to testing this now, building a custom AMI based on ami-079b7c883fe056119, created on 2024-09-28T23:51:52.000Z.

I'll update the issue with findings some time next week.


ajvn commented Oct 9, 2024

Tested on 5 different clusters with various node group sizes; all of the nodes joined without any issues.
I believe we can mark this issue as resolved. Thank you, folks.

ajvn closed this as completed Oct 9, 2024