[KEP-4816] Simple scoring for DRA Prioritized List feature #5633
base: master
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mortent. It still needs approval from an approver in each of the affected files.
80b8090 to 66b1bc9
LGTM from the DRA perspective. No additional PRR approval needed, the existing one is sufficient. So, this just needs SIG Scheduling approval.
> We don't see a need to normalize the scores for now, but this might be needed when
> we implement more complicated scoring in the future.
If you don't normalize the score, it will result in low scores for the DRA plugin, and other plugins may have a much stronger influence on the placement decision.
Furthermore, what about the weight of the DRA plugin in the default scheduler config? Since this scoring is based on user preference, it might be worth setting the weight to >1.
Thanks for pointing this out. I've updated the design to describe how we should normalize the score and which weight we should use.
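To make the concern concrete, here is a minimal sketch of why an un-normalized sum carries little weight next to plugins that use the full score range. The scheduling framework expects node scores between 0 and 100 and multiplies each plugin's normalized score by its configured weight. The thread does not say which normalization the design uses, so the min-max scaling below is an assumption, and `normalize`, `applyWeight`, the node names, and the weight of 2 are hypothetical.

```go
package main

import "fmt"

// Score bounds used by the scheduling framework (0 and 100).
// They are redeclared locally so the sketch stays self-contained.
const (
	minNodeScore int64 = 0
	maxNodeScore int64 = 100
)

// normalize rescales raw per-node scores into [minNodeScore, maxNodeScore]
// with min-max scaling. ASSUMPTION: the thread does not specify the method.
// If every node has the same raw score, all nodes get maxNodeScore
// (an arbitrary choice made for this sketch).
func normalize(raw map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(raw))
	if len(raw) == 0 {
		return out
	}
	first := true
	var lo, hi int64
	for _, s := range raw {
		if first {
			lo, hi = s, s
			first = false
			continue
		}
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	for node, s := range raw {
		if hi == lo {
			out[node] = maxNodeScore
			continue
		}
		out[node] = minNodeScore + (s-lo)*(maxNodeScore-minNodeScore)/(hi-lo)
	}
	return out
}

// applyWeight multiplies each normalized score by the plugin weight,
// mirroring how the scheduler combines per-plugin scores.
func applyWeight(normalized map[string]int64, weight int64) map[string]int64 {
	out := make(map[string]int64, len(normalized))
	for node, s := range normalized {
		out[node] = s * weight
	}
	return out
}

func main() {
	// Hypothetical raw sums for a pod with two requests using FirstAvailable.
	raw := map[string]int64{"node-a": 16, "node-b": 14, "node-c": 9}
	n := normalize(raw)
	w := applyWeight(n, 2) // hypothetical weight > 1
	for _, node := range []string{"node-a", "node-b", "node-c"} {
		fmt.Printf("%s raw=%d normalized=%d weighted=%d\n", node, raw[node], n[node], w[node])
	}
}
```

Without normalization the best node here would contribute only 16 points, while another plugin could contribute up to 100 for the same node.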
> DRA does not yet implement scoring, which means that
> the selected devices might not be optimal. For example, if a prioritized
Should this phrase be changed since we're going to implement limited scoring? Should it refer to the Scoring chapter you're adding?
Thanks for pointing this out. I've rewritten this section.
> all claims referenced by the Pod. Since the number of subrequests for each request
> is capped at 8, we will compute a score between 1 and 8 for each request, with 8
> being the best (i.e. the first option was chosen) and 1 if the 8th subrequest was
> chosen. We save the score of 0 in case we want to implement optional requests. Since
Should optional requests be given a score > 0, so that nodes with devices get a higher score than nodes without them?
I'm not sure I follow. The idea is that any choice that ends up selecting devices should get a higher score than deciding not to select devices at all. The suggestion here only works if we allow for optional subrequests though. If we add support for making any request optional, we need to revisit the scoring anyway, since it requires that we provide a score for every request, regardless of whether it uses Prioritized List or not.
I was confused by reserving 0 (which for me means no scoring) for something that we potentially want to influence the total node score. If we reserve something > 0 but don't use it for now, then we can use it later, adding that value to the total node score. This would give a node a slightly higher score if it has an optional device. However, I'm still not sure I understand how we can use 0 in that case, as 0 doesn't add any value. Am I missing something?
What this logic essentially does is compute a score for a pod on every node, where a higher number is better. Every request that uses subrequests gets a score from 1 to 8, and the scores are summed across all requests and claims. The idea was that if we added support in the Prioritized List feature for optional requests (i.e. if none of the subrequests can be satisfied, we just don't allocate any), a score of zero would be worse than 1, which is the score for the last subrequest allowed. But this idea wasn't fully thought out, and we would need a different change to the scoring logic if we decided to implement it, since we would have to score requests that have been satisfied higher than those that have not, regardless of whether Prioritized List is in use. So I've removed the sentence about the score of zero from the design. If we ever decide to implement optional requests, addressing scoring will have to be part of that design.
> is capped at 8, we will compute a score between 1 and 8 for each request, with 8
> being the best (i.e. the first option was chosen) and 1 if the 8th subrequest was
> chosen. We save the score of 0 in case we want to implement optional requests. Since
> the score for every node is computed based on the same claims, we end up with a
Would it make sense to explain how the total node score is computed if a pod requests multiple claims/devices? Is it a sum of scores for each claim, a weighted sum, an average, etc.?
I've added a sentence about this. I think we should just do a sum, since the scores for a single pod are computed from the same claims on every node. And we will do normalization anyway to make sure the score falls within the allowed boundaries in the scheduling framework.
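As a rough sketch of the sum described here: each request that uses `FirstAvailable` scores 8 when its first subrequest was chosen and 1 when its 8th was chosen, and the node's raw score is the sum over all such requests in all claims. The function names and the input shape (zero-based indexes of the chosen subrequests, grouped by claim) are hypothetical and not taken from the KEP or the scheduler code.

```go
package main

import "fmt"

// maxSubrequests is the cap on subrequests per request stated in the design.
const maxSubrequests = 8

// requestScore maps the zero-based index of the chosen subrequest to a score
// between 1 and 8: index 0 (the first option) scores 8, index 7 scores 1.
func requestScore(chosenIndex int) (int, error) {
	if chosenIndex < 0 || chosenIndex >= maxSubrequests {
		return 0, fmt.Errorf("subrequest index %d out of range [0, %d)", chosenIndex, maxSubrequests)
	}
	return maxSubrequests - chosenIndex, nil
}

// nodeScore sums the per-request scores for one node. claims[i][j] is the
// index of the subrequest chosen for the j-th FirstAvailable request of the
// i-th claim (a hypothetical representation used only in this sketch).
func nodeScore(claims [][]int) (int, error) {
	total := 0
	for _, requests := range claims {
		for _, idx := range requests {
			s, err := requestScore(idx)
			if err != nil {
				return 0, err
			}
			total += s
		}
	}
	return total, nil
}

func main() {
	// One claim with two requests: the first subrequest was chosen for one
	// request (score 8) and the third for the other (score 6).
	score, err := nodeScore([][]int{{0, 2}})
	if err != nil {
		panic(err)
	}
	fmt.Println(score) // prints 14
}
```

Because every node is scored against the same set of claims and requests, the raw sums are directly comparable across nodes before any normalization is applied.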
> The allocation result for each node will be given a score based on the ranking of
> the chosen subrequests across all requests using the `FirstAvailable` field across
> all claims referenced by the Pod. Since the number of subrequests for each request
> is capped at 8, we will compute a score between 1 and 8 for each request, with 8
Linear ranking might not match user intent. Would it make sense to use exponential ranking here, giving more priority to the nodes with higher ranked devices?
Are you suggesting something like the lowest-ranked option getting a score of 1, then 2, 4, 8, 16, ...? I did think about other ways to do ranking, but none seemed clearly better than linear ranking. As an example, if I have a claim with two requests, each with three subrequests, would an allocation where the first subrequest gets allocated on the first request and the third on the second request be better than the second on both? I think linear has the benefit that it is pretty easy to understand and reason about.
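The example in this reply can be worked through numerically. The sketch below compares the linear scheme from the design (8 down to 1) with one possible exponential scheme in which the lowest-ranked option scores 1 and each better rank doubles the score. The exponential formula and all function names are assumptions made for illustration; the thread does not settle on a specific exponential variant.

```go
package main

import "fmt"

// maxSubrequests is the cap on subrequests per request stated in the design.
const maxSubrequests = 8

// linear is the ranking proposed in the design: index 0 scores 8, index 7 scores 1.
func linear(idx int) int { return maxSubrequests - idx }

// exponential is a hypothetical alternative: index 7 scores 1 and each better
// rank doubles the score, so index 0 scores 128.
func exponential(idx int) int { return 1 << (maxSubrequests - 1 - idx) }

// total sums the scores of the chosen subrequest indexes under a scheme.
func total(chosen []int, score func(int) int) int {
	sum := 0
	for _, idx := range chosen {
		sum += score(idx)
	}
	return sum
}

func main() {
	mixed := []int{0, 2} // first subrequest on one request, third on the other
	even := []int{1, 1}  // second subrequest on both requests

	fmt.Println(total(mixed, linear), total(even, linear))           // prints 14 14
	fmt.Println(total(mixed, exponential), total(even, exponential)) // prints 160 128
}
```

Under linear ranking the two allocations tie, while this exponential variant prefers the allocation that satisfies one request with its top choice. Which outcome better matches user intent is exactly the open question in this thread.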
One-line PR description: Add simple scoring for the DRA Prioritized List feature
Issue link: DRA: Prioritized Alternatives in Device Requests #4816