eager scaling strategy for ScaledJob does not work as documented (or intended?) #6416
Comments
@chinery Thanks for your feedback and your findings! It's tough to recall the tricky computation and follow up on your analysis. Do you mind checking these test cases? If they don't have 100% coverage, could you provide the missing cases to prove your point?
Hi @junekhan
I don't see any changes to the `default` strategy's behaviour since 2.14, other than what was introduced alongside `eager`.

The changes for `eager` include a unit test for `GetEffectiveMaxScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount, scaleTo int64) (int64, int64)`:

```go
maxScale, scaleTo := strategy.GetEffectiveMaxScale(4, 3, 0, 10, 1)
assert.Equal(t, int64(4), maxScale)
assert.Equal(t, int64(10), scaleTo)
```

The inputs are: `maxScale=4`, `runningJobCount=3`, `pendingJobCount=0`, `maxReplicaCount=10`, `scaleTo=1`. The first return value, `maxScale`, is the number of new jobs that will be created. There are 4 jobs on the queue, and 3 already running. So to maximise the number of jobs, the result should be to create 1 new job, i.e.:

```go
assert.Equal(t, int64(1), maxScale)
```

However, the existing test asserts 4. The second assert states that `scaleTo` should be 10, i.e. `maxReplicaCount`. (A standalone sketch evaluating both strategies with these inputs follows this comment.)

For the e2e test, the behaviour in the test seems to mimic the description in the documentation, in other words, the desired behaviour of `eager`:

```go
RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
assert.True(t, WaitForScaledJobCount(t, kc, scaledJobName, testNamespace, 4, iterationCount, 1),
"job count should be %d after %d iterations", 4, iterationCount)
RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
assert.True(t, WaitForScaledJobCount(t, kc, scaledJobName, testNamespace, 8, iterationCount, 1),
"job count should be %d after %d iterations", 8, iterationCount)
RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
assert.True(t, WaitForScaledJobCount(t, kc, scaledJobName, testNamespace, 10, iterationCount, 1),
"job count should be %d after %d iterations", 10, iterationCount)
```

i.e., push 4 messages, scale to 4, push 4 messages, scale to 8, push 4 messages, scale to 10.

I am not a KEDA (or Go) developer, but this seems to be a limitation of the existing e2e test. Please try either of the following examples, which I think illustrate my point.

Test 1: only push 4 messages

```go
func testEagerScaling1(t *testing.T, kc *kubernetes.Clientset) {
	iterationCount := 20
	RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
	assert.True(t, WaitForScaledJobCount(t, kc, scaledJobName, testNamespace, 4, iterationCount, 1),
		"job count should be %d after %d iterations", 4, iterationCount)
	assert.False(t, WaitForScaledJobCount(t, kc, scaledJobName, testNamespace, 8, iterationCount, 1),
		"job count should still be 4 after %d iterations", iterationCount)
	assert.False(t, WaitForScaledJobCount(t, kc, scaledJobName, testNamespace, 10, iterationCount, 1),
		"job count should still be 4 after %d iterations", iterationCount)
}
```

or Test 2: using `WaitForJobCountUntilIteration`:

```go
func testEagerScaling2(t *testing.T, kc *kubernetes.Clientset) {
	iterationCount := 20
	RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
	assert.True(t, WaitForJobCountUntilIteration(t, kc, testNamespace, 4, iterationCount, 1),
		"job count should be %d after %d iterations", 4, iterationCount)
	RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
	assert.True(t, WaitForJobCountUntilIteration(t, kc, testNamespace, 8, iterationCount, 1),
		"job count should be %d after %d iterations", 8, iterationCount)
	RMQPublishMessages(t, rmqNamespace, connectionString, queueName, 4)
	assert.True(t, WaitForJobCountUntilIteration(t, kc, testNamespace, 10, iterationCount, 1),
		"job count should be %d after %d iterations", 10, iterationCount)
}
```
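To make the numbers above concrete, here is a small standalone sketch (not the KEDA source or its test suite) that re-implements the two return statements quoted elsewhere in this issue and evaluates them with the unit-test inputs. The function names `defaultMaxScale` and `eagerMaxScale` are invented for illustration, and the built-in `min` requires Go 1.21+.

```go
package main

import "fmt"

// defaultMaxScale mirrors the default strategy's return statement quoted in
// this issue: maxScale - runningJobCount, scaleTo.
func defaultMaxScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount, scaleTo int64) (int64, int64) {
	return maxScale - runningJobCount, scaleTo
}

// eagerMaxScale mirrors the eager strategy's return statement:
// min(maxReplicaCount-runningJobCount-pendingJobCount, maxScale), maxReplicaCount.
func eagerMaxScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount, _ int64) (int64, int64) {
	return min(maxReplicaCount-runningJobCount-pendingJobCount, maxScale), maxReplicaCount
}

func main() {
	// The unit-test inputs discussed above: 4 on the queue, 3 running, max 10.
	d, dTo := defaultMaxScale(4, 3, 0, 10, 1)
	e, eTo := eagerMaxScale(4, 3, 0, 10, 1)
	fmt.Println("default:", d, dTo) // default: 1 1
	fmt.Println("eager:  ", e, eTo) // eager:   4 10
}
```

For these inputs, `default` would create 1 new job (the value argued for above), while `eager` returns 4 (the value the existing assert checks) and reports `maxReplicaCount` as the second value.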
Thank you @chinery for your input!
I think this is explained by the current `eager` implementation:

```go
func (s eagerScalingStrategy) GetEffectiveMaxScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount, _ int64) (int64, int64) {
	return min(maxReplicaCount-runningJobCount-pendingJobCount, maxScale), maxReplicaCount
}
```

With the input `(4, 3, 0, 10, 1)`, this returns `(min(10-3-0, 4), 10) = (4, 10)`, which is what the unit test asserts.
That's true. I intended to return `maxReplicaCount` as the second value.

This point is valuable. I will improve the test case with your proposal.

I literally want to have the scaling style I described below, which is NOT fulfilled by `default`.
You are describing the output of the code, so the test is guaranteed to pass, but I am asking why the value should be 4. Tests should be written separately from the implementation.
I've changed the e2e test to use the approach from Test 2 above. Perhaps when you tried this originally, the behaviour you saw was down to the trigger and how it reports the queue length.
Because 4 tasks are waiting, and I want to run them
This has nothing to do with the trigger. This guy experienced it in SQS as well #5881
The trigger is absolutely relevant – whether or not the queue length includes the messages that running jobs are already processing changes what `default` does. Please could you run the test file I linked in my last message, which I believe confirms that `eager` behaves as I described above. Maybe it would be good to hear from some other developers too, e.g. it seems @TsuyoshiUshio wrote the `accurate`/`custom` scaling strategies.
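To illustrate why the queue-length semantics matter to this disagreement, here is a small sketch that applies the two strategies' formulas (as quoted in the walkthrough below) to a single poll under two different assumptions about what the trigger reports. Both scenarios are assumptions for illustration only; `newJobs` is an invented helper and the built-in `min`/`max` require Go 1.21+.

```go
package main

import "fmt"

// newJobs applies the two strategies' formulas (as quoted later in this issue)
// to a single poll, with targetAverageValue = 1 and pendingJobCount = 0.
func newJobs(queueLen, running, maxReplicas int64) (dflt, eager int64) {
	maxScale := min(queueLen, maxReplicas)
	dflt = max(maxScale-running, 0)
	eager = min(maxReplicas-running, maxScale)
	return dflt, eager
}

func main() {
	// Assumption A: the trigger reports only messages not yet picked up by a job
	// (e.g. ack on receive): 4 waiting messages while 3 jobs work on others.
	d, e := newJobs(4, 3, 10)
	fmt.Printf("queue excludes in-flight: default creates %d, eager creates %d\n", d, e) // 1, 4

	// Assumption B: the trigger still counts the 3 in-flight messages, so 7 in total.
	d, e = newJobs(7, 3, 10)
	fmt.Printf("queue includes in-flight: default creates %d, eager creates %d\n", d, e) // 4, 7
}
```

Under assumption A, `default` starts only 1 of the 4 waiting tasks (the situation @junekhan describes wanting to fix); under assumption B, `default` already starts all 4 and `eager` creates 3 more than needed (the situation @chinery describes).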
Repeating from my comment on the PR junekhan just created: I have uploaded the output of the test runs and the two test files I used.
It's definitely not the case. If you had pondered it a step further, you probably wouldn't have such speculation. As I pointed out above, my case is that 4 tasks are waiting and I want to run them. That's why I am certain that this has nothing to do with the trigger.
I appreciate my messages are verbose, and I'm sorry if there's a language barrier that means I don't fully follow your meaning (I am still not sure whether you ack on receive or not; either can work for 3hr+ jobs, but that was a side point). Sometimes following lengthy passages is necessary to understand – a test alone cannot prove anything without rationale, because the asserts of the test must match the intended outcome (not just the implemented one). Regardless, I've provided a test which I believe shows that `eager` does not behave as documented.
Report
This form prompts me to be clear and concise, and I will try to be very clear but fear that will not be very concise (apologies)
I was trying to understand the difference between the `default` and `eager` scaling strategies of ScaledJob (see https://keda.sh/docs/2.16/reference/scaledjob-spec/#scalingstrategy).

In short:

- I do not believe the documentation for the `eager` strategy is correct, either about the behaviour of `default` or the behaviour of `eager`
- `eager` may be bugged, but it's hard to tell what the intention is, since I believe it is already given by `default`
The documented behaviour
- The docs describe "the number of the scale", i.e. how many new jobs are created on each poll (unless the `accurate` strategy is required)
- For `default`, this is calculated as `maxScale - runningJobCount`, where `maxScale = min(scaledJob.MaxReplicaCount(), divideWithCeil(queueLength, targetAverageValue))`
- The `eager` scaling strategy does not exactly explain how it differs, only that it makes up for an issue you might find with `default`. There is an example listed, where the maximum replicas is 10, the target average value is 1, and there is the following sequence: submit 3 jobs, poll, submit another 3 jobs, poll – and it gives a table of the running jobs each strategy should have after each poll
- Following the documented `default` calculation at the second poll: `maxScale = min(10, ceil(6 / 1)) = 6`, so "the number of the scale" is `6 - 3 = 3`, so 3 new jobs will be created, meaning the total of running jobs is now 6, which is working as intended
- The table instead shows `default` staying at 3 running jobs, compared to the `eager` strategy which has 6 running jobs after the poll – I'll come to what `eager` actually does in a later section, but I believe this is incorrect also
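As a sanity check of the arithmetic in the last two bullets, here is a tiny sketch of the documented `default` formula applied to the second poll of the example (the function name is invented; this is not KEDA code):

```go
package main

import (
	"fmt"
	"math"
)

// documentedDefaultNewJobs follows the formula quoted from the docs:
// maxScale = min(maxReplicaCount, ceil(queueLength / targetAverageValue))
// "the number of the scale" = maxScale - runningJobCount
func documentedDefaultNewJobs(queueLength, targetAverageValue float64, maxReplicaCount, runningJobCount int64) int64 {
	maxScale := min(maxReplicaCount, int64(math.Ceil(queueLength/targetAverageValue)))
	return maxScale - runningJobCount
}

func main() {
	// Second poll of the documented example: 6 messages, target 1, max 10, 3 running.
	fmt.Println(documentedDefaultNewJobs(6, 1, 10, 3)) // 3
}
```

This matches the bullet above: 3 new jobs, 6 running in total.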
The intended behaviour

The documentation also suggests reading the initial suggestion here: #5114
I don't want to offend or misconstrue anyone here, so please don't take any of this as criticism, just trying to untangle the web – please correct me if I've misunderstood anything.
It seems to me that @junekhan may have confused the behaviour of "the number to scale", and thought that it would scale like a Deployment (where, in the example above, a scale of 3 would mean only 3 running jobs after the poll, instead of 3 new jobs). My evidence is a comment of @junekhan's in that discussion describing the desired behaviour in exactly those terms.

But this is the behaviour of `default`. @zroubalik replies and says this behaviour should be added. The pull request is later made by @junekhan and the documentation added by @zroubalik.

It's possible that some miscommunication happened here, so I also wanted to work out what the `eager` strategy does, in case I misunderstood the intention, and it is simply the documentation that needs updating.

The actual behaviour
Here I will try to narrate a sequence of logic through the code that explains how the two strategies work. I hope you can follow it – I have tried to just include the relevant detail with function names, parameter names, return value names, code/pseudocode behaviour, and some commentary (in italics). The function names link to the code with line numbers. I will also include the example values from earlier.
- `checkScalers` calls `isScaledJobActive`:
  `isActive, isError, scaleTo, maxScale := h.isScaledJobActive(ctx, obj)`
- `isScaledJobActive` calls `IsScaledJobActive`:
  `isActive, queueLength, maxValue, maxFloatValue := scaledjob.IsScaledJobActive(scalersMetrics, scaledJob.Spec.ScalingStrategy.MultipleScalersCalculation, scaledJob.MinReplicaCount(), scaledJob.MaxReplicaCount())`
- in `IsScaledJobActive`:
  - in `CalculateQueueLengthAndMaxValue`, `getTargetAverageValue` gets the target value from the trigger, so for our example `targetAverageValue=1`, `queueLength=6`, `maxReplicaCount=10`, and so `maxValue=6` (worth noting that `queueLength` does not divide by `targetAverageValue`, it is the raw length)
  - (back in `IsScaledJobActive`) `maxValue = min(maxValue, maxReplicaCount)`
  - `return isActive, ceilToInt64(queueLength), ceilToInt64(maxValue), maxValue`
  - (so `IsScaledJobActive` returns `queueLength=6` and `maxValue=6`)
- (`isScaledJobActive` returns them in this order: `isActive, isError, queueLength, maxValue`)
- (`checkScalers` assigns these to `isActive, isError, scaleTo, maxScale`, so `scaleTo=queueLength=6`, `maxScale=maxValue=6`)
- `checkScalers` then calls `h.scaleExecutor.RequestJobScale(ctx, obj, isActive, isError, scaleTo, maxScale)`
- `RequestJobScale` calls `getScalingDecision`:
  `effectiveMaxScale, scaleTo := e.getScalingDecision(scaledJob, runningJobCount, scaleTo, maxScale, pendingJobCount, logger)`
- `getScalingDecision` (this is where it forks based on scaling strategy) calls:
  `effectiveMaxScale, scaleTo = NewScalingStrategy(logger, scaledJob).GetEffectiveMaxScale(maxScale, runningJobCount-minReplicaCount, pendingJobCount, scaledJob.MaxReplicaCount(), scaleTo)`
  and the definition of `GetEffectiveMaxScale` is `GetEffectiveMaxScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount, scaleTo int64) (int64, int64)` (example: `maxScale=6`, `runningJobCount=3`, `minReplicaCount=0`, `pendingJobCount=0`, `scaledJob.MaxReplicaCount()=10`, `scaleTo=6`)
  - default: `return maxScale - runningJobCount, scaleTo` (so this returns `(3, 6)`)
  - eager: `return min(maxReplicaCount-runningJobCount-pendingJobCount, maxScale), maxReplicaCount` (so this returns `(min(7, 6), 10) = (6, 10)`)
- `getScalingDecision` then does `return effectiveMaxScale, scaleTo`
- `RequestJobScale` calls `e.createJobs`:
  `e.createJobs(ctx, logger, scaledJob, scaleTo, effectiveMaxScale)` with signature `createJobs(..., scaleTo int64, maxScale int64)` (so `effectiveMaxScale` is now `maxScale`)
  - default: `maxScale = 3`, `scaleTo = 6`, so this generates 3 jobs
  - eager: `maxScale = 6`, `scaleTo = 10`, so this generates 6 jobs

After the second poll in our example, the eager strategy will have 9 jobs. On the third poll, assuming no new jobs, it will create 1 more job and hit the maximum, since that is `maxReplicaCount - runningJobCount`.
I am not sure what `scaleTo` is doing in this calculation. It is set to the queue length, unmodified by the `targetAverageValue`, `maxReplicas`, or `runningJobs`. I can't immediately see any scenario where `scaleTo < maxScale`, meaning that it will always just use the value of `maxScale` for the number of jobs to create.

Regardless, my conclusion for the behaviour of the `eager` strategy is that it does as @JorTurFer asked in the discussion, which is that it scales up until it hits the maximum whenever the queue is non-zero. But the rate of scaling depends on the number of items in the queue. I'm still not sure if this is the intended behaviour – I think this could be achieved more efficiently with a scale strategy like `if maxScale > 0 return maxReplicaCount else 0`, and there wouldn't be a slow ramp-up, but perhaps that is desirable.
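To double-check the walkthrough, here is a toy poll-by-poll simulation of the documented example (a sketch, not KEDA code). It assumes the target average value is 1, the jobs do not finish between polls, and the queue length reported to the scaler still includes the messages that running jobs are working on; `simulate` is an invented helper and the built-ins `min`/`max` require Go 1.21+.

```go
package main

import "fmt"

// simulate runs the documented example (submit 3 messages, poll, submit 3 more,
// poll, then keep polling) and returns the running-job count after each poll.
func simulate(eager bool, polls int) []int64 {
	const maxReplicas int64 = 10
	var running, queue int64
	counts := make([]int64, 0, polls)
	for i := 0; i < polls; i++ {
		if i < 2 {
			queue += 3 // 3 messages arrive before each of the first two polls
		}
		maxScale := min(queue, maxReplicas) // targetAverageValue = 1
		var newJobs int64
		if eager {
			newJobs = min(maxReplicas-running, maxScale)
		} else {
			newJobs = max(maxScale-running, 0)
		}
		running += newJobs
		counts = append(counts, running)
	}
	return counts
}

func main() {
	fmt.Println("default:", simulate(false, 4)) // [3 6 6 6]
	fmt.Println("eager:  ", simulate(true, 4))  // [3 9 10 10]
}
```

Under these assumptions `default` settles at 6 running jobs, while `eager` reaches 9 after the second poll and then ramps to the `maxReplicaCount` of 10, matching the Expected/Actual Behavior sections below.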
Expected Behavior
Expected `default` to have 3 running jobs, and `eager` to have 6 running jobs

Actual Behavior
`default` has 6 running jobs, `eager` has 9 running jobs

Steps to Reproduce the Problem
See above
Logs from KEDA operator
No response
KEDA Version
2.16.0
Kubernetes Version
None
Platform
None
Scaler Details
No response
Anything else?
No response