Update len() function to account for pending #162
Conversation
LGTM, but let's have some folks with more experience take a look too :-)
To make our len function return a more accurate basis for calculating the number of instances we need, subtract the XPENDING count from the XLEN count. This reduces the cases where a new version or a scaling config change results in a complete duplication of the pods simply because pending elements in the queue are still counted as being in-queue. Using XPENDING lets us subtract anything that has already been claimed from the metric calculation.
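Concretely, the idea is something like the following sketch (illustrative only, not the code in this PR: the consumer group name 'workers' is hypothetical, and `streams` and `base` are the variables already defined in the script):

```lua
-- Count only unclaimed entries: stream length minus the pending (claimed) count.
local result = 0
for idx = 0, streams-1 do
  local stream_key = base .. ':s' .. idx
  local len = redis.call('XLEN', stream_key)
  -- Summary form of XPENDING; the first element of the reply is the pending count.
  -- It raises NOGROUP if the group doesn't exist yet, so guard with redis.pcall.
  local pending_info = redis.pcall('XPENDING', stream_key, 'workers')
  local pending = 0
  if type(pending_info) == 'table' and not pending_info.err then
    pending = pending_info[1]
  end
  result = result + math.max(len - pending, 0)
end
return result
```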
This changes the behavior of len pretty significantly, but should be generally safe and should improve our autoscaling in cases where we roll the deployment(s) and predictions are long running. I'd like @evilstreak and @philandstuff to weigh in. This makes the distinction that xpending is not "in queue".
@@ -16,7 +21,20 @@ local streams = tonumber(redis.call('HGET', key_meta, 'streams') or 1)
local result = 0

for idx = 0, streams-1 do
  result = result + redis.call('XLEN', base .. ':s' .. idx)
  local success, groupInfo = pcall(function()
This can probably be converted to redis.pcall()
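For illustration, a possible shape of that conversion (a sketch, assuming the Lua-level pcall is only there to swallow the error XINFO GROUPS raises when the stream doesn't exist yet; `groupInfo` follows the variable name in the diff):

```lua
-- redis.pcall returns an error reply as a Lua table with an 'err' field instead
-- of raising, so the surrounding pcall(function() ... end) wrapper can go away.
local groupInfo = redis.pcall('XINFO', 'GROUPS', base .. ':s' .. idx)
if type(groupInfo) == 'table' and groupInfo.err then
  -- stream doesn't exist yet: treat it as having no consumer groups
  groupInfo = {}
end
```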
> This changes the behavior of len pretty significantly, but should be generally safe and should improve our autoscaling in cases where we roll the deployment(s) and predictions are long running.
This is not a trivial change to the autoscaling algorithm. If we change the meaning of Len(), the autoscaler will start deliberately aiming to cause queueing delays of approximately 1 prediction in our default scaling profile.
I think a change like this needs a design doc that says what problem we're trying to solve, analyses the expected impact, and suggests updates to the various scaling_configs. We probably also want a flag to roll it out, which isn't possible if we change the definition of Len().
One way we could do this instead is:
- don't make any changes to this repo! we can already calculate queue backlog ("lag" in redis streams terminology) from Stats() (see the sketch after this list)
- make a change to cluster allowing us to target a different metric, specified in scaling_config; by default we target the existing Len(), but we also allow targeting backlog
- roll out scaling configs on a per-model basis to see what happens
- if we're happy with how it behaves, update the default scaling configs to target the new metric.
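Stats() itself isn't shown in this thread, but for reference, Redis exposes the backlog directly; a rough sketch, assuming Redis 7.0+ where XINFO GROUPS reports a per-group 'lag' field (entries added to the stream but not yet delivered to the group):

```lua
-- Sum per-group 'lag' across the shard streams: the "strictly in queue" backlog,
-- which excludes pending (already claimed) entries. 'lag' can be nil in edge
-- cases (e.g. after XDEL), in which case it is skipped here.
local backlog = 0
for idx = 0, streams-1 do
  local groups = redis.pcall('XINFO', 'GROUPS', base .. ':s' .. idx)
  if type(groups) == 'table' and not groups.err then
    for _, group in ipairs(groups) do
      -- each group is a flat array of alternating field names and values
      for i = 1, #group, 2 do
        if group[i] == 'lag' and group[i + 1] then
          backlog = backlog + group[i + 1]
        end
      end
    end
  end
end
return backlog
```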
Another thought: with the current XLEN metric, we can specify overprovisioning by using a metric target <1. With XLEN-pending, we cannot target <0, so we have no way to express overprovisioning. I think if the problem we're trying to solve is that the autoscaler is unaware of Terminating pods, we should make the autoscaler aware of terminating pods.
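To make that concrete with a toy calculation (the actual autoscaler formula isn't shown in this thread; this assumes a simple proportional rule of replicas = ceil(metric / target)):

```lua
-- Toy illustration only; the scaling rule below is an assumption, not the real autoscaler.
local function desired_replicas(metric, target)
  return math.ceil(metric / target)
end

-- With XLEN as the metric, a target of 0.5 expresses 2x overprovisioning:
--   desired_replicas(10, 0.5) --> 20
-- With XLEN minus pending, the metric drops to 0 once all work is claimed, and
--   desired_replicas(0, 0.5) --> 0
-- so no positive target keeps spare capacity, and targets below 0 make no sense.
```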
I'm fine with changing the autoscaler to be aware of terminating pods. It does feel like we're missing some relevant data about "in-flight" vs "strictly-in-queue" predictions for scaling decisions. I do admit that such a change would likely make the autoscaler logic more complex (I do feel the autoscaler is a bit naive in a number of ways along these lines, e.g. unable to take into account the rate of queue growth or the rate of service of predictions, and unaware of terminating pods).
Right now I feel like this is a feature, not a bug. Autoscaler v1 tried to take all this stuff into account, and its scaling decisions fluctuated wildly. There was an abortive attempt at rewriting the autoscaler as a PID controller, which would allow taking into account the rate of queue growth and the rate of service of predictions, but it was so hard to tune away from wild behaviour, and it was not obviously better than autoscaler v1. Autoscaler v2 is deliberately simple and naive, and was an obvious improvement in both simplicity and performance over v1.
This is a real problem, however.