Feature request: Graceful shutdown of ECS workers #1
Use cases
We have two use cases:
So far we've used lifecycled for Concourse workers and a Lambda function for the ECS clusters. In both cases, we need to do a graceful shutdown. For Concourse we need to avoid interrupting running jobs, and with ECS we need to ensure that containers have been replaced if we want zero-downtime rolling updates.

Limitations
I had a look at lifecycle hooks and found the following limitations:
Possible generic solutions

Lambda with SSM
We could build a Lambda function which uses SSM to send a

For ECS the use case is a bit simpler, we can tell ECS to drain the instance and then proceed when ECS reports that there are no running containers. This can be done from a Lambda function without calling SSM at all.

Binary with daemonsets
We can also achieve the same thing by installing a binary on regular instances, or running the binary as a daemonset in ECS. This does not require SSM, but it would require that the instance profile (or task role for ECS) had privileges to set instance state to draining in ECS. For regular instances, we'd probably need users to write a shutdown script which described how to gracefully shut down the instance and then have the binary trigger that script.

In both cases we could use SQS to hold the lifecycle hooks, and just set the visibility timeout to zero when we have multiple consumers (if we go for the binary solution).

Don't build something generic
If we skip the generic solution, we could just clean up the existing python Lambda and reuse that. However, we'd still need a solid solution for regular instances.

edit: Worth noting that the Lambda function in question does not have a license...
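To make the ECS drain flow and the SQS hand-back idea a bit more concrete, here is a minimal Python sketch. It is not the existing Lambda. The function names `drain_step` and `claim_or_release` and their parameters are made up for illustration, and the clients are passed in (they are assumed to be boto3 `ecs`, `autoscaling` and `sqs` clients). It also assumes the container instance ARN has already been resolved from the EC2 instance ID, and that something else re-invokes `drain_step` until it returns `True`.

```python
import json


def drain_step(ecs, autoscaling, *, cluster, container_instance_arn,
               hook_name, asg_name, instance_id):
    """Run one pass of drain-then-continue; return True once the hook is completed."""
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=[container_instance_arn]
    )
    instance = described["containerInstances"][0]

    # Ask ECS to move tasks elsewhere if we have not done so already.
    if instance["status"] != "DRAINING":
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=[container_instance_arn],
            status="DRAINING",
        )

    # Tasks are still running: leave the hook pending and check again later.
    if instance["runningTasksCount"] > 0:
        return False

    # Nothing left on the instance, so let the Auto Scaling group proceed.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return True


def claim_or_release(sqs, queue_url, message, own_instance_id):
    """Return the hook payload if it targets this instance, else hand it back."""
    payload = json.loads(message["Body"])
    if payload.get("EC2InstanceId") == own_instance_id:
        return payload
    # Not ours: a zero visibility timeout makes the message visible
    # to the other consumers again right away.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=0,
    )
    return None
```

`drain_step` covers the Lambda variant, where no SSM call is needed. `claim_or_release` is only relevant for the binary variant, where every instance reads from the same queue and has to give back hooks that belong to other instances.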
I guess the Lambda with SSM option sounds less time-consuming than the binary with daemonsets. What if we use the old approach for now and make it generic later with new PRs, or maybe do the whole rework then? I guess that would get this module deployed faster. We currently cannot use it because this critical functionality is missing.
I'm thinking the same thing @lukaspour - I'll just fix the current module first and then we can take a closer look at making a generic version later.
Added an example of using the community module for graceful shutdowns of ECS nodes. Will revisit a generic version in the future.
To pick this up again: for regular instances we have landed on buildkite/lifecycled. I've created a feature request to add ECS draining to
When moving this module I removed this part because:
name_prefix could be at most ~8 characters) and caused problems when deploying.

Obviously, this feature needs to make it back into the module one of two ways:
I'd like to take a look at the second option and make it generic so that we have a single way of handling graceful shutdowns of autoscaling instances (if possible), as this is a needed feature for e.g. Concourse workers as well, where we are currently using lifecycled on the instances themselves to handle the lifecycle hooks.