Update len() function to account for pending #162

Closed · tempusfrangit wants to merge 1 commit from the accurate-count-queue-length branch

Conversation

@tempusfrangit

To ensure that our len function returns data that more accurately reflects the number of instances we need, subtract the XPENDING count from the XLEN count. This reduces the cases where a new version or a scaling config change results in a complete duplication of the pods simply because pending elements in the queue are counted as being in-queue.

Use of XPENDING will allow us to subtract anything that has been claimed from the metric calculation.
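Roughly, the per-shard computation described above would look something like the following. This is a sketch only, not the PR's actual diff; the consumer group name 'workers' is a placeholder (the real group name isn't shown on this page), and base/streams are the variables from the script excerpt further down:

```lua
-- Sketch of the idea: count entries with XLEN, then subtract the consumer
-- group's pending (claimed but not yet acknowledged) entries from XPENDING.
-- 'workers' is a placeholder group name, not taken from the real script.
local result = 0
for idx = 0, streams - 1 do
  local stream = base .. ':s' .. idx
  local len = redis.call('XLEN', stream)
  local pending = 0
  -- XPENDING in summary form replies with { count, min-id, max-id, consumers };
  -- wrapped in pcall because it raises NOGROUP if the group doesn't exist yet.
  local ok, reply = pcall(function()
    return redis.call('XPENDING', stream, 'workers')
  end)
  if ok and type(reply) == 'table' then
    pending = tonumber(reply[1]) or 0
  end
  result = result + math.max(len - pending, 0)
end
return result
```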

@gandalfhz left a comment

LGTM, but let's have some people with more experience take a look too :-)

@tempusfrangit force-pushed the accurate-count-queue-length branch 6 times, most recently from 3469213 to a7ddeed (December 17, 2024 00:39)
queue/len.lua (review thread resolved)
@tempusfrangit force-pushed the accurate-count-queue-length branch from a7ddeed to 6bc345a (December 17, 2024 00:44)
@tempusfrangit force-pushed the accurate-count-queue-length branch from 6bc345a to 5ae7c72 (December 17, 2024 01:06)
@tempusfrangit (Author)

This changes the behavior of len pretty significantly, but should be generally safe and should improve our autoscaling in cases where we roll the deployment(s) and predictions are long running.

I'd like @evilstreak and @philandstuff to weigh in. This makes the distinction that xpending is not "in queue".

@@ -16,7 +21,20 @@ local streams = tonumber(redis.call('HGET', key_meta, 'streams') or 1)
local result = 0

for idx = 0, streams-1 do
  result = result + redis.call('XLEN', base .. ':s' .. idx)
  local success, groupInfo = pcall(function()
@tempusfrangit (Author)

This can probably be converted to redis.pcall()
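For comparison, a redis.pcall() version might look like the sketch below: redis.pcall returns an error table with an err field instead of raising, so the Lua-level pcall wrapper becomes unnecessary. The XINFO GROUPS call is an assumption, since the excerpt above is truncated before the wrapped call:

```lua
-- Sketch only: redis.pcall returns { err = "..." } on failure instead of
-- raising an error, so no Lua pcall wrapper is needed.
local groupInfo = redis.pcall('XINFO', 'GROUPS', base .. ':s' .. idx)
if type(groupInfo) == 'table' and groupInfo.err == nil then
  -- success path: use groupInfo as before
end
```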

@philandstuff (Contributor) left a comment

> This changes the behavior of len pretty significantly, but should be generally safe and should improve our autoscaling in cases where we roll the deployment(s) and predictions are long running.

This is not a trivial change to the autoscaling algorithm. If we change the meaning of Len(), the autoscaler will start deliberately aiming to cause queueing delays of approximately 1 prediction in our default scaling profile.

I think a change like this needs a design doc that says what problem we're trying to solve, analyses the expected impact, and suggests updates to the various scaling_configs. We probably also want a flag to roll it out, which isn't possible if we change the definition of Len().

One way we could do this instead is:

  • don't make any changes to this repo! we can already calculate queue backlog ("lag" in redis streams terminology) from Stats() (see the sketch after this list)
  • make a change to cluster allowing us to target a different metric, specified in scaling_config; by default we target the existing Len(), but we also allow targeting backlog
  • roll out scaling configs on a per-model basis to see what happens
  • if we're happy with how it behaves, update the default scaling configs to target the new metric.
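As an illustration of the first bullet above: on Redis 7+, XINFO GROUPS already reports a lag field per consumer group (entries added to the stream but not yet delivered to the group), which is the backlog number being referred to. A rough sketch of reading it, not tied to this repo's actual Stats() implementation, and reusing the base/streams variables from the script excerpt above:

```lua
-- Rough sketch (not this repo's Stats() code): sum the 'lag' reported by
-- XINFO GROUPS across shards. 'lag' excludes pending (claimed) entries,
-- so it measures strictly-in-queue work.
local backlog = 0
for idx = 0, streams - 1 do
  local groups = redis.pcall('XINFO', 'GROUPS', base .. ':s' .. idx)
  if type(groups) == 'table' and groups.err == nil then
    for _, g in ipairs(groups) do
      -- each group entry is a flat array of field/value pairs
      for i = 1, #g, 2 do
        if g[i] == 'lag' and g[i + 1] then
          backlog = backlog + tonumber(g[i + 1])
        end
      end
    end
  end
end
return backlog
```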

@philandstuff (Contributor)

Another thought: with the current XLEN metric, we can specify overprovisioning by using a metric target <1. With XLEN-pending, we cannot target <0, so we have no way to express overprovisioning.
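As a rough illustration, assuming a ratio-style target where desired replicas ≈ metric / target (an assumption about how the autoscaler consumes this metric, not something stated here): with XLEN = 10 and a target of 0.5, the autoscaler would ask for about 20 pods, i.e. 2x headroom. With XLEN minus pending, the metric drops to 0 once every entry has been claimed, so no positive target can produce that extra headroom.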

I think if the problem we're trying to solve is that the autoscaler is unaware of Terminating pods, we should make the autoscaler aware of terminating pods.

@tempusfrangit (Author) commented Dec 17, 2024

I'm fine with changing the autoscaler to be aware of terminating pods. It does feel like we're missing some relevant data about "in-flight" vs "strictly-in-queue" predictions when making scaling decisions. I do admit that such a change would likely make the autoscaler logic more complex (I do feel like the autoscaler is a bit naive in a number of ways along these lines, e.g. unable to take into account rate of queue increase/rate of service of predictions, unaware of terminating pods).

@philandstuff (Contributor)

> I do feel like the autoscaler is a bit naive in a number of ways along these lines, e.g. unable to take into account rate of queue increase/rate of service of predictions

Right now I feel like this is a feature, not a bug. Autoscaler v1 tried to take all of this into account, and its scaling decisions fluctuated wildly. There was an abortive attempt at rewriting the autoscaler as a PID controller, which would allow taking into account rate of queue increase and rate of service of predictions, but it was very hard to tune out its wild behaviour and it was not obviously better than autoscaler v1. Autoscaler v2 is deliberately simple and naive, and was an obvious improvement in both simplicity and performance over v1.

> unaware of terminating pods

this is a real problem, however.
