Add custom delay for instance refresh actions #465
Comments
Sorry for the delay in responding to this! I see the PDB field in the Helm chart has been released, so hopefully that helps a bit. Could a pod grace period / node grace period be used to make the pod wait until the new node is up (catching SIGTERM in the container and waiting a bit)?
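For illustration, a minimal sketch of that SIGTERM-delay idea, assuming a generic Deployment (the name, image, and durations are placeholders, not anything from this project):

```yaml
# Illustrative only: delay pod shutdown so a replacement node has time to join.
# The preStop hook runs to completion before SIGTERM is sent to the container;
# terminationGracePeriodSeconds must cover the sleep plus normal shutdown.
# Assumes the container image provides a `sleep` binary.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: app
          image: example/app:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "90"]
```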
@bwagner5 AFAIK, with instance refresh there is no need to complete the whole process within 120s as there is for spot termination. Since this mode is the most likely to replace every node in an ASG, it would be really useful to block NTH from draining a node until the new nodes are ready, or, if that's not possible, until a predefined time has elapsed.
@bwagner5 was there any progress made on this?
There has not been progress, but I did have a conversation about something similar earlier today, about how to handle capacity-rebalance events for Spot on ASG. The TL;DR is that ASG will launch a replacement instance before sending the Terminate lifecycle hook, and therefore before triggering NTH to cordon and drain the node being replaced. It would be nice if instance refresh had that launch-before-terminate functionality.

In the case where you specify 100% healthy for an instance refresh, does that cause ASG to provision a new node above the desired capacity (if your max is set higher)? That seems like the best way to handle this in general, or maybe I'm missing some issue there too.

Delaying NTH from draining a node sounds very doable. The only hesitation I have is the complexity that has grown in NTH from adding a bunch of these small knobs. If we absolutely need it, then sure, we can add it; I just want to make sure all avenues have been explored.
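For context, the Terminate lifecycle hook being discussed looks roughly like this in CloudFormation (a sketch; the resource names and timeout are placeholders):

```yaml
# Illustrative CloudFormation sketch: a terminating lifecycle hook holds the
# instance in Terminating:Wait, giving NTH a window to cordon and drain
# before the instance is actually terminated.
Resources:
  TerminationHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref MyAutoScalingGroup   # placeholder ASG resource
      LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
      HeartbeatTimeout: 300          # seconds before the hook times out
      DefaultResult: CONTINUE        # proceed with termination on timeout
```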
@bwagner5 for capacity rebalance, does the ASG wait for the instance to be ready, or just pending/initialising? For instance refresh with 100% healthy, if the ASG starts a new instance it does not wait for it to be ready. The specific use case I have is that my clusters get updated via a single process, and when there are both service and node changes I get failures due to nodes being cordoned. We have a system node pool made up of 3 (actually 6, but let's ignore that for now) ASGs in separate regions, so on an upgrade NTH would get 3 refresh events. Only 1 of these events would be actionable due to PDBs, but I think all nodes would be cordoned. Once the
When using instance refresh to update ASGs, it looks like the events come through with a start date of now, which triggers the node-termination-handler to start cordoning and draining the node immediately. This works correctly if the ASG healthy percentage is set to 100% and all pods have replicas and PDBs (for NTH itself we need #463 to satisfy this); but single pods such as Prometheus will often be unschedulable for a short period while the new node boots up.
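As a concrete example of the PDB requirement above, a minimal PodDisruptionBudget (the names and label are placeholders) that keeps at least one replica running during a drain:

```yaml
# Illustrative only: with minAvailable: 1 and at least 2 replicas, an eviction
# triggered by an NTH drain cannot remove the last running pod of this app.
# policy/v1 requires Kubernetes 1.21+; older clusters use policy/v1beta1.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb       # placeholder name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-app        # placeholder label
```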
To make this whole process work without any downtime, a custom duration to wait on ASG termination events could be added, defaulting to something like 90 seconds. Assuming this wait is longer than the time it takes to start a node and join it to the cluster, there would be no unschedulable pods, and a non-100% ASG healthy percentage could be used. Combined with the ASG lifecycle hook timeout, this would support a high level of customisation without much extra complexity.
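If adopted, the proposed knob might surface as something like the following Helm values (entirely hypothetical; the delay key is invented here for illustration and is not part of the chart):

```yaml
# Hypothetical sketch of the proposed option; instanceRefreshDrainDelay is an
# invented name, not an existing chart value.
enableSqsTerminationDraining: true   # existing queue-processor mode flag
instanceRefreshDrainDelay: "90s"     # hypothetical: wait before cordon/drain
```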