Feature request: Graceful shutdown of ECS workers #1

Open
itsdalmo opened this issue Jun 20, 2018 · 5 comments

@itsdalmo
Contributor

When moving this module I removed this part because:

Obviously, this feature needs to make it back into the module in one of two ways:

  • Make better fixes to the original module and use that.
  • Roll our own autoscaling lambda and try to make it "generic" for both ECS and regular instances.

I'd like to take a look at the second option and make it generic, so that we have a single way of handling graceful shutdowns of autoscaling instances (if possible). This is a needed feature for e.g. Concourse workers as well, where we currently run lifecycled on the instances themselves to handle the lifecycle hooks.

@itsdalmo self-assigned this Jun 20, 2018
@itsdalmo
Contributor Author

itsdalmo commented Jul 9, 2018

Use cases

We have two use cases:

  • Concourse workers: Should not shut down without draining the worker beforehand.
  • ECS Clusters: Should drain containers before proceeding with the shutdown.

So far we've used lifecycled for Concourse workers, and a Lambda function for the ECS clusters. In both cases, we need to do a graceful shutdown. For Concourse we need to avoid interrupting running jobs, and with ECS we need to ensure that containers have been replaced if we want zero-downtime rolling updates.

Limitations

I had a look at lifecycle hooks and found the following limitations:

  • The EC2 API only allows us to send terminate or stop to the instances.
  • The terminate and stop API calls will kill the instance after a few minutes if the graceful shutdown has not completed.

Possible generic solutions

Lambda with SSM

We could build a Lambda function which uses SSM to run shutdown -h now on the instance, and then waits for it to reach a stopped state before allowing the lifecycle hook to proceed and terminate it. The benefit of doing this is that we don't have to install or run anything new on the instances. It would, however, require that they are running the SSM agent and have the necessary privileges.
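A minimal sketch of that flow, assuming boto3-style clients are passed in and that `event` is the already-parsed lifecycle hook notification. The function name `stop_then_continue` and the polling parameters are hypothetical, and leaving the hook pending on timeout (so its default result applies) is an illustrative choice rather than a decided design:

```python
import time


def stop_then_continue(ssm, ec2, autoscaling, event, poll_seconds=15, max_polls=40):
    """Shut the instance down via SSM, wait for 'stopped', then release the hook.

    Returns True if the hook was completed, False if we gave up waiting
    (the hook is then left pending until its own timeout expires).
    """
    instance_id = event["EC2InstanceId"]

    # Ask the SSM agent on the instance to run a normal OS shutdown.
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["shutdown -h now"]},
    )

    for _ in range(max_polls):
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        state = reservations[0]["Instances"][0]["State"]["Name"]
        if state == "stopped":
            # The instance shut down cleanly: let the Auto Scaling group proceed.
            autoscaling.complete_lifecycle_action(
                LifecycleHookName=event["LifecycleHookName"],
                AutoScalingGroupName=event["AutoScalingGroupName"],
                LifecycleActionToken=event["LifecycleActionToken"],
                LifecycleActionResult="CONTINUE",
                InstanceId=instance_id,
            )
            return True
        time.sleep(poll_seconds)
    return False
```

In a Lambda handler the three clients would come from `boto3.client("ssm")`, `boto3.client("ec2")` and `boto3.client("autoscaling")`. The polling loop has to fit inside the function's configured timeout.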

For ECS the use case is a bit simpler: we can tell ECS to drain the instance and then proceed when ECS reports that there are no running containers. This can be done from a Lambda function without calling SSM at all.
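The ECS variant could be sketched roughly as follows, again assuming a boto3-style ECS client is passed in. The helper names are hypothetical, and the lookup ignores pagination of `list_container_instances`, which a real implementation for large clusters would need to handle:

```python
import time


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ARN of the container instance backed by the EC2 instance, or None."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for container_instance in described["containerInstances"]:
        if container_instance["ec2InstanceId"] == ec2_instance_id:
            return container_instance["containerInstanceArn"]
    return None


def drain_and_wait(ecs, cluster, ec2_instance_id, poll_seconds=15, max_polls=40):
    """Set the instance to DRAINING and poll until it runs no tasks.

    Returns True once drained (or if the instance is not registered in the
    cluster), False if tasks are still running after max_polls checks.
    """
    arn = find_container_instance(ecs, cluster, ec2_instance_id)
    if arn is None:
        return True  # Nothing registered, so nothing to drain.

    ecs.update_container_instances_state(
        cluster=cluster, containerInstances=[arn], status="DRAINING"
    )
    for _ in range(max_polls):
        described = ecs.describe_container_instances(
            cluster=cluster, containerInstances=[arn]
        )
        if described["containerInstances"][0]["runningTasksCount"] == 0:
            return True
        time.sleep(poll_seconds)
    return False
```

Once `drain_and_wait` returns True, the caller would complete the lifecycle hook so that the Auto Scaling group can terminate the instance.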

Binary with daemonsets

We can also achieve the same thing by installing a binary on regular instances, or running the binary as a daemonset in ECS. This does not require SSM, but it would require that the instance profile (or task role for ECS) has privileges to set the instance state to draining in ECS. For regular instances, we'd probably need users to write a shutdown script which describes how to gracefully shut down the instance, and then have the binary trigger that script.
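The "trigger a user-supplied shutdown script" part could look something like this sketch. It is written in Python only for illustration (the actual daemon would be a compiled binary), and the function name, timeout default and example path are assumptions:

```python
import subprocess


def run_shutdown_script(command, timeout_seconds=600):
    """Run the user's shutdown script; True only if it exits 0 within the timeout."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        # The script ran too long; subprocess.run has already killed it.
        return False
    except OSError:
        # The script is missing or not executable.
        return False
    return completed.returncode == 0
```

A daemon would call this as, for example, `run_shutdown_script(["/usr/local/bin/graceful-shutdown.sh"])` (a hypothetical path) when it sees a terminate notice for its own instance, and only complete the lifecycle hook if the result is True.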

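In both cases we could use SQS to hold the lifecycle hook messages. If we go for the binary solution and have multiple consumers, a consumer that receives a message meant for another instance can just set its visibility timeout to zero, so that the message becomes available to the other consumers again right away.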
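A sketch of one consumer iteration under those assumptions: the lifecycle hook delivers its notification straight to SQS, so the message body is JSON containing an `EC2InstanceId` field, and a boto3-style SQS client is passed in. `poll_once` and `handle` are hypothetical names:

```python
import json


def poll_once(sqs, queue_url, my_instance_id, handle):
    """Receive one batch; handle our own notice, release everyone else's.

    Returns True if a notice for this instance was handled.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        notice = json.loads(message["Body"])
        if notice.get("EC2InstanceId") != my_instance_id:
            # Not ours: make the message visible to other consumers immediately.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                VisibilityTimeout=0,
            )
            continue
        handle(notice)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        return True
    return False
```

Messages that belong to no running consumer would be released over and over in this sketch, so a real implementation would likely also want a dead-letter queue or a retention limit on the queue.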

Don't build something generic

If we skip the generic solution, we could just clean up the existing Python Lambda and reuse that. However, we'd still need a solid solution for regular instances.

edit: Worth noting that the Lambda function in question does not have a license...

@lukaspour
Contributor

I guess the Lambda with SSM option sounds less time-consuming than Binary with daemonsets. What if we use the old approach for now and make it generic later with new PRs, or maybe do the whole rework then? I guess that would let us deploy this module sooner. We currently cannot use it because it is missing this critical functionality.

@itsdalmo
Contributor Author

I'm thinking the same thing @lukaspour - I'll just fix the current module first and then we can take a closer look at making a generic version later.

@itsdalmo
Contributor Author

Added an example of using the community module for graceful shutdowns of ECS nodes. Will revisit a generic version in the future.

@itsdalmo
Contributor Author

To pick this up again: for regular instances, we have landed on buildkite/lifecycled running as a daemon on the instance as the best option for graceful shutdown.

I've created a feature request to add ECS draining to lifecycled so that we can have a single solution for graceful shutdown of instances (irrespective of whether they are an ECS cluster or not):
buildkite/lifecycled#54
