Feature request: Graceful shutdown of ECS workers #1

itsdalmo · 2018-06-20T08:12:19Z

When moving this module I removed this part because:

The module was hacked together in a fork.
It produced resource names that were too long (the name_prefix could be at most ~8 characters) and caused problems when deploying.
Original issue and more details here: ECS – zero downtime problem TeliaSoneraNorge/telia-terraform-modules#36

Obviously, this feature needs to make it back into the module, in one of two ways:

Fix the original module properly and use that.
Roll our own autoscaling lambda and try to make it "generic" for both ECS and regular instances.

I'd like to take a look at the second option and make it generic, so that (if possible) we have a single way of handling graceful shutdowns of autoscaling instances. The same feature is needed for e.g. Concourse workers, where we currently run lifecycled on the instances themselves to handle the lifecycle hooks.


itsdalmo · 2018-07-09T09:02:18Z

Use cases

We have two use cases:

Concourse workers: Should not shut down without draining the worker beforehand.
ECS Clusters: Should drain containers before proceeding with the shutdown.

So far we've used lifecycled for Concourse workers, and a lambda function for the ECS clusters. In both cases, we need to do a graceful shutdown: for Concourse we need to avoid interrupting running jobs, and with ECS we need to ensure that containers have been replaced if we want zero downtime rolling updates.

Limitations

I had a look at lifecycle hooks and found the following limitations:

The EC2 API only allows us to send terminate or stop to the instances.
The terminate and stop API calls will kill the instance after a few minutes if the graceful shutdown has not completed.

Possible generic solutions

Lambda with SSM

We could build a Lambda function that uses SSM to send a shutdown -h now command to the instance, and then waits for it to reach a stopped state before allowing the lifecycle hook to proceed to terminate it. The benefit is that we don't have to install or run anything new on the instances. It does, however, require that they run the SSM agent and have the necessary privileges.
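As a rough illustration of the SSM approach, here is a minimal sketch in Python. It assumes boto3-style clients for SSM, EC2 and Auto Scaling are passed in, and that the lifecycle hook event has already been parsed into a dict. The function name, the polling parameters and the event shape are assumptions made for this example, not part of any existing module.

```python
import time


def shutdown_and_continue(ssm, ec2, autoscaling, event, poll_seconds=10, max_polls=30):
    """Ask the instance to shut itself down via SSM, wait until it is
    stopped, then let the lifecycle hook continue to termination.

    ssm, ec2 and autoscaling are assumed to be boto3 clients (or objects
    with the same methods). event is assumed to be the parsed lifecycle
    hook notification.
    """
    instance_id = event["EC2InstanceId"]

    # Run the shutdown command on the instance through the SSM agent.
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["shutdown -h now"]},
    )

    # Poll the instance state until it reports "stopped" or we give up.
    for _ in range(max_polls):
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        state = reservations[0]["Instances"][0]["State"]["Name"]
        if state == "stopped":
            autoscaling.complete_lifecycle_action(
                LifecycleHookName=event["LifecycleHookName"],
                AutoScalingGroupName=event["AutoScalingGroupName"],
                LifecycleActionToken=event["LifecycleActionToken"],
                LifecycleActionResult="CONTINUE",
            )
            return True
        time.sleep(poll_seconds)

    # Not stopped in time: leave the hook open so the caller can retry.
    return False
```

Polling inside a single invocation is only viable while the shutdown fits within the Lambda timeout. A real function would probably need to re-invoke itself or be triggered on a schedule for longer drains.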

For ECS the use case is a bit simpler: we can tell ECS to drain the instance and then proceed when ECS reports that there are no running containers. This can be done from a Lambda function without calling SSM at all.
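The ECS variant could look roughly like the sketch below. It assumes a boto3-style ECS client is passed in; the two function names are made up for this example. The first call puts the container instance into DRAINING, and the second is meant to be polled until it returns True, at which point the lifecycle hook can be completed.

```python
def drain_ecs_instance(ecs, cluster, instance_id):
    """Set the container instance backing the given EC2 instance to
    DRAINING and return its ARN, or None if it is not in the cluster.

    ecs is assumed to be a boto3 ECS client (or an object with the
    same methods).
    """
    arns = ecs.list_container_instances(
        cluster=cluster,
        filter=f"ec2InstanceId in ['{instance_id}']",
    )["containerInstanceArns"]
    if not arns:
        return None
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=arns,
        status="DRAINING",
    )
    return arns[0]


def is_drained(ecs, cluster, container_instance_arn):
    """Return True once ECS reports no running or pending tasks."""
    described = ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=[container_instance_arn],
    )
    instance = described["containerInstances"][0]
    return instance["runningTasksCount"] == 0 and instance["pendingTasksCount"] == 0
```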

Binary with daemonsets

We can also achieve the same thing by installing a binary on regular instances, or running the binary as a daemonset in ECS. This does not require SSM, but it would require that the instance profile (or task role for ECS) has privileges to set the instance state to draining in ECS. For regular instances, we'd probably need users to write a shutdown script that describes how to gracefully shut down the instance, and then have the binary trigger that script.

In both cases we could use SQS to hold the lifecycle hooks, and just set the visibility timeout to zero when we have multiple consumers (if we go for the binary solution).
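To make the visibility timeout idea concrete, here is a minimal sketch of one consumer iteration. It assumes a boto3-style SQS client, and that the lifecycle hook notification is delivered to the queue directly, so the message body is JSON with an EC2InstanceId field. The function name, the on_terminate callback and the return values are hypothetical.

```python
import json


def handle_one_message(sqs, queue_url, my_instance_id, on_terminate):
    """Receive at most one lifecycle hook message and act on it only if
    it targets this instance; otherwise release it for other consumers.

    sqs is assumed to be a boto3 SQS client (or an object with the
    same methods). on_terminate is a hypothetical callback that
    performs the graceful shutdown.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=10,
    )
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        if body.get("EC2InstanceId") != my_instance_id:
            # Not ours: make the message visible again immediately so
            # another consumer can pick it up.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                VisibilityTimeout=0,
            )
            return "released"
        on_terminate(body)
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        return "handled"
    return "empty"
```

With many consumers on one queue, a message may be received and released several times before it reaches the instance it targets, so this trades some extra polling for not needing a queue per instance.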

Don't build something generic

If we skip the generic solution, we could just clean up the existing python Lambda and reuse that. However, we'd still need a solid solution for regular instances.

edit: Worth noting that the Lambda function in question does not have a license...

lukaspour · 2018-07-10T09:09:47Z

I guess the Lambda with SSM option sounds less time-consuming than Binary with daemonsets. What if we use the old approach for now and make it generic later in new PRs, or maybe do the whole rework then? I guess that would get this module deployed faster. We currently cannot use it because this critical functionality is missing.

itsdalmo · 2018-07-10T10:08:50Z

I'm thinking the same thing @lukaspour - I'll just fix the current module first and then we can take a closer look at making a generic version later.

itsdalmo · 2018-07-10T11:34:35Z

Added an example of using the community module for graceful shutdowns of ECS nodes. Will revisit a generic version in the future.

itsdalmo · 2018-09-27T17:08:13Z

To pick this up again: for regular instances, we have landed on buildkite/lifecycled running as a daemon on the instance as the best option for graceful shutdown.

I've created a feature request to add ECS draining to lifecycled so that we can have a single solution for graceful shutdown of instances (irrespective of whether they are an ECS cluster or not):
buildkite/lifecycled#54

itsdalmo self-assigned this Jun 20, 2018
