Re-enable ASG scaling before final stabilisation

This aims to address, to some extent, issue #1342 - the problem that *apps can not auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker. Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy, (`ResumeAlarmNotifications`) once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description indicates this was a necessary choice. We can see the argument for it - obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by alarms - but if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it needs to scale up to handle the load it's experiencing. By simply putting the final `WaitForStabilization` step _after_ `ResumeAlarmNotifications`, the Ophan outage would have been shortened from 1 hour to ~2 minutes. The `WaitForStabilization` step itself simply checks that the number of instances in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or twice, and the `WaitForStabilization` condition will still be easily satisfied, and the deploy will be reported as success.
guardian · May 23, 2024 · 7550f3b · 7550f3b
1 parent c743364
commit 7550f3b
Show file tree

Hide file tree

Showing 2 changed files with 9 additions and 9 deletions.
diff --git a/magenta-lib/src/main/scala/magenta/deployment_type/AutoScaling.scala b/magenta-lib/src/main/scala/magenta/deployment_type/AutoScaling.scala
@@ -197,12 +197,12 @@ object AutoScaling extends DeploymentType with BucketParameters {
           target.region,
           terminationGrace(pkg, target, reporter) * 1000
         ),
+        ResumeAlarmNotifications(autoScalingGroup, target.region),
         WaitForStabilization(
           autoScalingGroup,
           secondsToWait(pkg, target, reporter) * 1000,
           target.region
-        ),
-        ResumeAlarmNotifications(autoScalingGroup, target.region)
+        )
       )
     }
     val groupsToUpdate: List[AutoScalingGroupInfo] =

diff --git a/magenta-lib/src/test/scala/magenta/deployment_type/AutoScalingTest.scala b/magenta-lib/src/test/scala/magenta/deployment_type/AutoScalingTest.scala
@@ -142,8 +142,8 @@ class AutoScalingTest
         WaitForStabilization(testInfo, 15 * 60 * 1000, Region("eu-west-1")),
         CullInstancesWithTerminationTag(testInfo, Region("eu-west-1")),
         TerminationGrace(testInfo, Region("eu-west-1"), 10000),
-        WaitForStabilization(testInfo, 15 * 60 * 1000, Region("eu-west-1")),
-        ResumeAlarmNotifications(testInfo, Region("eu-west-1"))
+        ResumeAlarmNotifications(testInfo, Region("eu-west-1")),
+        WaitForStabilization(testInfo, 15 * 60 * 1000, Region("eu-west-1"))
       )
       actual shouldBe expected
     }
@@ -209,12 +209,12 @@ class AutoScalingTest
         ),
         CullInstancesWithTerminationTag(testOldCfnAsg, Region("eu-west-1")),
         TerminationGrace(testOldCfnAsg, Region("eu-west-1"), 10000),
+        ResumeAlarmNotifications(testOldCfnAsg, Region("eu-west-1")),
         WaitForStabilization(
           testOldCfnAsg,
           15 * 60 * 1000,
           Region("eu-west-1")
         ),
-        ResumeAlarmNotifications(testOldCfnAsg, Region("eu-west-1")),
         // All tasks for testNewCdkAsg
         WaitForStabilization(testNewCdkAsg, 5 * 60 * 1000, Region("eu-west-1")),
         CheckGroupSize(testNewCdkAsg, Region("eu-west-1")),
@@ -239,12 +239,12 @@ class AutoScalingTest
         ),
         CullInstancesWithTerminationTag(testNewCdkAsg, Region("eu-west-1")),
         TerminationGrace(testNewCdkAsg, Region("eu-west-1"), 10000),
+        ResumeAlarmNotifications(testNewCdkAsg, Region("eu-west-1")),
         WaitForStabilization(
           testNewCdkAsg,
           15 * 60 * 1000,
           Region("eu-west-1")
-        ),
-        ResumeAlarmNotifications(testNewCdkAsg, Region("eu-west-1"))
+        )
       )
       actual shouldBe expected
     }
@@ -334,8 +334,8 @@ class AutoScalingTest
         WaitForStabilization(testInfo, 3 * 60 * 1000, Region("eu-west-1")),
         CullInstancesWithTerminationTag(testInfo, Region("eu-west-1")),
         TerminationGrace(testInfo, Region("eu-west-1"), 11000),
-        WaitForStabilization(testInfo, 3 * 60 * 1000, Region("eu-west-1")),
-        ResumeAlarmNotifications(testInfo, Region("eu-west-1"))
+        ResumeAlarmNotifications(testInfo, Region("eu-west-1")),
+        WaitForStabilization(testInfo, 3 * 60 * 1000, Region("eu-west-1"))
       )
       actual shouldBe expected
     }