Nodeadm-run service fails if containerd is not fully initialized and running #1917
Comments
The workaround for now is to replace the default systemd unit file with this, added to the node launch template:
I haven't seen this occur, but the bug sounds legit to me. I think we'll want to handle this in
Thank you, let me know when there's an AMI available to test and I'll give it a go.
#1965 should help with containerd not being ready before initiating the sandbox image pulls 👍
Just so I understand properly, this is not AMI version dependent, it will automatically pull new
@ajvn
Gotcha, thanks 👍
I'm getting around to testing this now, building custom AMI based on AMI I'll update issue with findings some time next week.
Tested on 5 different clusters with various node group sizes, all of the nodes joined without any issues.
What happened:
We are building our own AMIs with preloaded critical container images. This helps nodes reach a ready state faster, and it keeps those images available on the node itself in case registries are down.
However, this slows down the startup of `containerd` a little, and if it's not fully up and running before `nodeadm` tries to pull the `pause` image, the `nodeadm-run` service fails and the node doesn't join the cluster. The node itself remains up and running and has to be removed manually.
The other option is to get onto the node itself and restart the `nodeadm-run` service. After that it joins the cluster, but this is obviously not a scale-friendly solution.
This happens randomly; so far it affects on average every 4th node joining the cluster.
What you expected to happen:
It looks like `nodeadm` tries to pull the `pause` image 3 times, and after that it exits with an error:
It would be good if we could adjust how many times it should retry and/or how often it should retry via a configuration option or a flag (preferably a configuration option, so we don't have to adjust the `systemd` unit file).
Potentially we could try adding `containerd.service` to the `After=` section of the `systemd` unit file, but I don't know if this helps, as we need `containerd` to be fully up and running, not only active according to `systemd`.
How to reproduce it (as minimally and precisely as possible):
Have `containerd` not be fully ready before the `nodeadm-run` service executes.
Anything else we need to know?:
Here's some additional information which helped me during the investigation:
Mainly related to the times when the `containerd` and `nodeadm-run` services started:
Environment:
I don't believe it's relevant in this case, but if requested I am happy to oblige.
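For illustration, the configurable retry requested under "What you expected to happen" could look roughly like the sketch below. It is written in Python for brevity and is not how `nodeadm` is implemented. The names `wait_until_ready`, `socket_accepts_connections`, `max_attempts`, and `delay_seconds` are hypothetical, and the socket path is assumed to be containerd's default. An accepting socket is only a rough signal and does not guarantee that containerd has finished initializing.

```python
import socket
import time
from typing import Callable

# Assumed default containerd socket path; adjust if the AMI uses another one.
CONTAINERD_SOCKET = "/run/containerd/containerd.sock"


def socket_accepts_connections(path: str = CONTAINERD_SOCKET,
                               timeout: float = 1.0) -> bool:
    """Return True if something accepts connections on the Unix socket at `path`.

    This is a rough readiness signal only: it does not prove that every
    containerd plugin has finished loading.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection, and a timeout.
        return False
    finally:
        sock.close()


def wait_until_ready(probe: Callable[[], bool],
                     max_attempts: int = 3,
                     delay_seconds: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `max_attempts` times, sleeping between failed attempts.

    Returns True as soon as the probe succeeds, False if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return True
        if attempt < max_attempts:
            # No sleep after the final attempt; the caller handles the failure.
            sleep(delay_seconds)
    return False
```

With both knobs exposed, a slow-starting node could use something like `wait_until_ready(socket_accepts_connections, max_attempts=30, delay_seconds=2.0)` before the first image pull, while the defaults keep the current three-attempt behaviour.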