
keep scaling when nodes are draining #672

Open
janory opened this issue Jul 19, 2023 · 4 comments

janory commented Jul 19, 2023

Hi! 👋

We recently started to use the Nomad Autoscaler agent and we really like it. 🚀
We are using the Autoscaler with the Nomad APM, aws-asg target and target-value strategy plugins.

We have multiple long-running (1-45 minute) batch jobs on our nodes, and when a scale-in action happens the drain event won't finish until the last batch job completes on the node.

This leads to constant warning messages like this:

2023-07-18T13:17:01.646Z [TRACE] policy_manager.policy_handler: target is not ready: policy_id=4a1d5af4-323a-d939-d208-18672288565c
2023-07-18T13:17:01.646Z [WARN] internal_plugin.aws-asg: node pool status readiness check failed: error="node 872ae150-f1a2-12b1-2197-cd32a3b49546 is draining"
2023-07-18T13:17:01.642Z [TRACE] policy_manager.policy_handler: getting target status: policy_id=4a1d5af4-323a-d939-d208-18672288565c
2023-07-18T13:17:01.642Z [TRACE] policy_manager.policy_handler: tick: policy_id=4a1d5af4-323a-d939-d208-18672288565c

because the Autoscaler implicitly checks the ASG target's status for each tick (handleTick -> generateEvaluation -> Status -> IsPoolReady -> FilterNodes -> if node.Drain).
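
For illustration, here is a minimal Go sketch of how we read that code path (simplified, not the actual aws-asg plugin source): a single draining node is enough to make the whole pool report as not ready, so the tick is skipped.

// Illustrative sketch only, not the actual plugin code: the readiness check
// fails as soon as any node in the targeted pool is draining.
package main

import "fmt"

// node is a hypothetical, trimmed-down view of a Nomad client node.
type node struct {
    ID    string
    Drain bool
}

// isPoolReady mimics the IsPoolReady/FilterNodes behaviour described above:
// it returns an error for the first draining node it finds, which the policy
// handler logs as "target is not ready" and then skips the evaluation.
func isPoolReady(nodes []node) error {
    for _, n := range nodes {
        if n.Drain {
            return fmt.Errorf("node %s is draining", n.ID)
        }
    }
    return nil
}

func main() {
    pool := []node{{ID: "872ae150", Drain: true}, {ID: "a1b2c3d4", Drain: false}}
    if err := isPoolReady(pool); err != nil {
        fmt.Println("node pool status readiness check failed:", err)
    }
}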

Based on the comment here, and also based on what we are experiencing, the Autoscaler stops any further scaling actions until all draining activities have completed.

This is an issue for us, because in the worst-case scenario the long-running batch jobs will prevent us from scaling for up to 45 minutes.

Would it be possible to add a config option for the idFn function to filter out draining nodes and keep scaling?

We would also like to better understand what the risks are of scaling a cluster that has draining nodes, and why such a cluster is considered unstable.

janory commented Jul 24, 2023

I was thinking about something like this: #679
Although this alone probably won't be enough, because even if this part passes, the processLastActivity call would still set the Ready flag to false.

tgross commented Aug 2, 2023

Hi @janory!

> We would also like to better understand what the risks are of scaling a cluster that has draining nodes, and why such a cluster is considered unstable.

I think the major challenge here is that the nodes might be draining for reasons outside the control of the autoscaler. Maybe you've run nomad node drain -enable :node_id out of band so that software on the host can be upgraded, and the plan is to return the node to work immediately afterwards. Or maybe the host is having unrecoverable problems unrelated to scale in/out and you've drained it so that you can decommission it afterwards. Either way, the autoscaler would need to know whether or not to count that node in the total capacity.

If we do decide to ignore this check, then we need to adjust our expectations of what plugins return as the node count. For example, if there are 5 instances in an ASG but 2 are draining, maybe the policy calculation should only count 3 nodes to account for either of those two situations?
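
To make that concrete, here is a small sketch of the two interpretations (the names are illustrative, not the autoscaler's actual code): counting every instance versus counting only the ones that aren't draining.

// Illustrative only: two ways a target plugin could report the current node
// count for a pool of 5 instances where 2 are draining.
package main

import "fmt"

// instance is a hypothetical view of an ASG member as seen by a target plugin.
type instance struct {
    ID       string
    Draining bool
}

// countAll treats every instance as capacity: 5 in the example above.
func countAll(pool []instance) int { return len(pool) }

// countReady excludes draining instances: 3 in the example above, which is
// what the policy calculation would need if the readiness check were ignored.
func countReady(pool []instance) int {
    n := 0
    for _, i := range pool {
        if !i.Draining {
            n++
        }
    }
    return n
}

func main() {
    pool := []instance{{"i-1", false}, {"i-2", false}, {"i-3", false}, {"i-4", true}, {"i-5", true}}
    fmt.Println(countAll(pool), countReady(pool)) // prints: 5 3
}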

douglaje commented Sep 21, 2023

Hi @tgross, we've run into this issue/constraint as well. After moving to AWS spot instances, which can receive interruption notices at any moment (and nearly continuously if you've got a large enough mixed cluster), our autoscaler would stop scaling for up to half an hour at a time (because some node in the cluster was draining, initializing, or otherwise not ready) and we'd totally blow our SLA.

For us, the bigger sin than not scaling exactly is not scaling quickly. We don't mind underestimating capacity, so we've customized the aws_asg and nomad_apm plugins so that FilterNodes no longer errors on non-ready nodes (it excludes them instead).

It might be nice to be able to pass something like strictness=ignore_unstable to the autoscaler plugins to selectively override certain cautious behaviors built into the autoscaler, but part of the problem is that this check happens in nearly every plugin (both the apm and target plugins in our case), and my Golang experience is minimal at best.
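
Roughly, the change we made looks something like the sketch below (heavily simplified, with made-up names, not the actual patch): instead of returning an error when a node isn't ready, the filter just drops the node and carries on.

// Rough sketch only, not the actual aws_asg/nomad_apm patch: drop non-ready
// nodes from the pool instead of failing the whole readiness check.
package main

import "fmt"

type node struct {
    ID     string
    Drain  bool
    Status string
}

// filterReadyNodes keeps only nodes that are neither draining nor in a
// non-ready status, rather than returning an error and aborting the tick.
func filterReadyNodes(nodes []node) []node {
    var ready []node
    for _, n := range nodes {
        if n.Drain || n.Status != "ready" {
            continue // previously this case produced an error and stopped scaling
        }
        ready = append(ready, n)
    }
    return ready
}

func main() {
    pool := []node{
        {ID: "n1", Drain: false, Status: "ready"},
        {ID: "n2", Drain: true, Status: "ready"},
    }
    fmt.Println(len(filterReadyNodes(pool))) // prints: 1
}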

lgfa29 commented Dec 22, 2023

Thank you for the extra input @douglaje.

I've experimented with bypassing these checks, but I'm still unsure about their impact. The biggest blocker here is that a policy is not allowed to be evaluated in parallel, meaning that only a single scaling action is allowed to happen at a time. But if you have multiple policies targeting the same set of nodes, or if the scaling action takes so long that the evaluation times out, then this can be bypassed as well.

I've opened #811 to start some discussion around this. As I mentioned, I'm still unsure about it, so I'm at least marking these new configuration options as experimental, and we will probably not document them for now. If you would be willing to try them, we could perhaps consider merging it.

For reference, this is the policy file I used for testing. I split scaling up and down into two different policies so the actions could, in theory, happen at the same time. Another important thing about the AWS ASG target plugin is that ASG events also affect its cooldown, so you also need different values there.

scaling "cluster_up" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "3s"
    evaluation_interval = "10s"

    check "up" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        lower_bound = 3.9
        delta       = 1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}

scaling "cluster_down" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "10s"
    evaluation_interval = "10s"

    check "down" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        upper_bound = 3.1
        delta       = -1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}

tgross removed their assignment Aug 6, 2024