Skip to content

Commit

Permalink
Re-enable ASG scaling before final stabilisation
Browse files Browse the repository at this point in the history
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
  • Loading branch information
rtyley committed May 23, 2024
1 parent c743364 commit 7550f3b
Show file tree
Hide file tree
Showing 2 changed files with 9 additions and 9 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -197,12 +197,12 @@ object AutoScaling extends DeploymentType with BucketParameters {
target.region,
terminationGrace(pkg, target, reporter) * 1000
),
ResumeAlarmNotifications(autoScalingGroup, target.region),
WaitForStabilization(
autoScalingGroup,
secondsToWait(pkg, target, reporter) * 1000,
target.region
),
ResumeAlarmNotifications(autoScalingGroup, target.region)
)
)
}
val groupsToUpdate: List[AutoScalingGroupInfo] =
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -142,8 +142,8 @@ class AutoScalingTest
WaitForStabilization(testInfo, 15 * 60 * 1000, Region("eu-west-1")),
CullInstancesWithTerminationTag(testInfo, Region("eu-west-1")),
TerminationGrace(testInfo, Region("eu-west-1"), 10000),
WaitForStabilization(testInfo, 15 * 60 * 1000, Region("eu-west-1")),
ResumeAlarmNotifications(testInfo, Region("eu-west-1"))
ResumeAlarmNotifications(testInfo, Region("eu-west-1")),
WaitForStabilization(testInfo, 15 * 60 * 1000, Region("eu-west-1"))
)
actual shouldBe expected
}
Expand Down Expand Up @@ -209,12 +209,12 @@ class AutoScalingTest
),
CullInstancesWithTerminationTag(testOldCfnAsg, Region("eu-west-1")),
TerminationGrace(testOldCfnAsg, Region("eu-west-1"), 10000),
ResumeAlarmNotifications(testOldCfnAsg, Region("eu-west-1")),
WaitForStabilization(
testOldCfnAsg,
15 * 60 * 1000,
Region("eu-west-1")
),
ResumeAlarmNotifications(testOldCfnAsg, Region("eu-west-1")),
// All tasks for testNewCdkAsg
WaitForStabilization(testNewCdkAsg, 5 * 60 * 1000, Region("eu-west-1")),
CheckGroupSize(testNewCdkAsg, Region("eu-west-1")),
Expand All @@ -239,12 +239,12 @@ class AutoScalingTest
),
CullInstancesWithTerminationTag(testNewCdkAsg, Region("eu-west-1")),
TerminationGrace(testNewCdkAsg, Region("eu-west-1"), 10000),
ResumeAlarmNotifications(testNewCdkAsg, Region("eu-west-1")),
WaitForStabilization(
testNewCdkAsg,
15 * 60 * 1000,
Region("eu-west-1")
),
ResumeAlarmNotifications(testNewCdkAsg, Region("eu-west-1"))
)
)
actual shouldBe expected
}
Expand Down Expand Up @@ -334,8 +334,8 @@ class AutoScalingTest
WaitForStabilization(testInfo, 3 * 60 * 1000, Region("eu-west-1")),
CullInstancesWithTerminationTag(testInfo, Region("eu-west-1")),
TerminationGrace(testInfo, Region("eu-west-1"), 11000),
WaitForStabilization(testInfo, 3 * 60 * 1000, Region("eu-west-1")),
ResumeAlarmNotifications(testInfo, Region("eu-west-1"))
ResumeAlarmNotifications(testInfo, Region("eu-west-1")),
WaitForStabilization(testInfo, 3 * 60 * 1000, Region("eu-west-1"))
)
actual shouldBe expected
}
Expand Down

0 comments on commit 7550f3b

Please sign in to comment.