Update len() function to account for pending #162
Conversation
LGTM, but let's have some folks with more experience take a look too :-)
To make our len function return a more accurate basis for calculating the number of instances we need, subtract the XPENDING count from the XLEN count. This reduces the cases where a new version or a scaling config change results in a complete duplication of the pods simply because pending elements in the queue are still counted as being in-queue. Using XPENDING lets us subtract anything that has already been claimed from the metric calculation.
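Concretely, the idea is something like the following sketch (illustrative only, not the code in this PR: the consumer group name 'workers' is hypothetical, and `streams` and `base` are the variables already defined in the script):

```lua
-- Count only unclaimed entries: stream length minus the pending (claimed) count.
local result = 0
for idx = 0, streams-1 do
  local stream_key = base .. ':s' .. idx
  local len = redis.call('XLEN', stream_key)
  -- Summary form of XPENDING; the first element of the reply is the pending count.
  -- It raises NOGROUP if the group doesn't exist yet, so guard with redis.pcall.
  local pending_info = redis.pcall('XPENDING', stream_key, 'workers')
  local pending = 0
  if type(pending_info) == 'table' and not pending_info.err then
    pending = pending_info[1]
  end
  result = result + math.max(len - pending, 0)
end
return result
```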
This changes the behavior of len pretty significantly, but should be generally safe and should improve our autoscaling in cases where we roll the deployment(s) and predictions are long running. I'd like @evilstreak and @philandstuff to weigh in. This makes the distinction that xpending is not "in queue".
@@ -16,7 +21,20 @@ local streams = tonumber(redis.call('HGET', key_meta, 'streams') or 1)
local result = 0

for idx = 0, streams-1 do
  result = result + redis.call('XLEN', base .. ':s' .. idx)
  local success, groupInfo = pcall(function()
This can probably be converted to redis.pcall()
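For illustration, a possible shape of that conversion (a sketch, assuming the Lua-level pcall is only there to swallow the error XINFO GROUPS raises when the stream doesn't exist yet; `groupInfo` follows the variable name in the diff):

```lua
-- redis.pcall returns an error reply as a Lua table with an 'err' field instead
-- of raising, so the surrounding pcall(function() ... end) wrapper can go away.
local groupInfo = redis.pcall('XINFO', 'GROUPS', base .. ':s' .. idx)
if type(groupInfo) == 'table' and groupInfo.err then
  -- stream doesn't exist yet: treat it as having no consumer groups
  groupInfo = {}
end
```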
> This changes the behavior of len pretty significantly, but should be generally safe and should improve our autoscaling in cases where we roll the deployment(s) and predictions are long running.
This is not a trivial change to the autoscaling algorithm. If we change the meaning of Len(), the autoscaler will start deliberately aiming to cause queueing delays of approximately 1 prediction in our default scaling profile.
I think a change like this needs a design doc that says what problem we're trying to solve, analyses the expected impact, and suggests updates to the various scaling_configs. We probably also want a flag to roll it out, which isn't possible if we change the definition of Len().
One way we could do this instead is:
- don't make any changes to this repo! we can already calculate queue backlog ("lag" in redis streams terminology) from Stats() (see the sketch after this list)
- make a change to cluster allowing us to target a different metric, specified in scaling_config; by default we target the existing Len(), but we also allow targeting backlog
- roll out scaling configs on a per-model basis to see what happens
- if we're happy with how it behaves, update the default scaling configs to target the new metric.
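Stats() itself isn't shown in this thread, but for reference, Redis exposes the backlog directly; a rough sketch, assuming Redis 7.0+ where XINFO GROUPS reports a per-group 'lag' field (entries added to the stream but not yet delivered to the group):

```lua
-- Sum per-group 'lag' across the shard streams: the "strictly in queue" backlog,
-- which excludes pending (already claimed) entries. 'lag' can be nil in edge
-- cases (e.g. after XDEL), in which case it is skipped here.
local backlog = 0
for idx = 0, streams-1 do
  local groups = redis.pcall('XINFO', 'GROUPS', base .. ':s' .. idx)
  if type(groups) == 'table' and not groups.err then
    for _, group in ipairs(groups) do
      -- each group is a flat array of alternating field names and values
      for i = 1, #group, 2 do
        if group[i] == 'lag' and group[i + 1] then
          backlog = backlog + group[i + 1]
        end
      end
    end
  end
end
return backlog
```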
Another thought: with the current XLEN metric, we can specify overprovisioning by using a metric target <1. With XLEN-pending, we cannot target <0, so we have no way to express overprovisioning. I think if the problem we're trying to solve is that the autoscaler is unaware of Terminating pods, we should make the autoscaler aware of terminating pods.
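To make that concrete with a toy calculation (the actual autoscaler formula isn't shown in this thread; this assumes a simple proportional rule of replicas = ceil(metric / target)):

```lua
-- Toy illustration only; the scaling rule below is an assumption, not the real autoscaler.
local function desired_replicas(metric, target)
  return math.ceil(metric / target)
end

-- With XLEN as the metric, a target of 0.5 expresses 2x overprovisioning:
--   desired_replicas(10, 0.5) --> 20
-- With XLEN minus pending, the metric drops to 0 once all work is claimed, and
--   desired_replicas(0, 0.5) --> 0
-- so no positive target keeps spare capacity, and targets below 0 make no sense.
```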
I'm fine with changing the autoscaler to be aware of terminating pods. It does feel like we're missing some relevant data about "in-flight" vs "strictly-in-queue" predictions for scaling decisions. I do admit that such a change would likely make the autoscaler logic more complex (I do feel the autoscaler is a bit naive in a number of ways along these lines, e.g. unable to take into account the rate of queue growth or the rate of service of predictions, and unaware of terminating pods).
Right now I feel like this is a feature, not a bug. Autoscaler v1 tried to take all this stuff into account, and its scaling decisions fluctuated wildly. There was an abortive attempt at rewriting the autoscaler as a PID controller, which would allow taking into account the rate of queue growth and the rate of service of predictions, but it was so hard to tune away from wild behaviour, and it was not obviously better than autoscaler v1. Autoscaler v2 is deliberately simple and naive, and was an obvious improvement in both simplicity and performance over v1.
This is a real problem, however.