Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KEP] Add ResourceFlavor fallbacks KEP #2561

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

PBundyra
Copy link
Contributor

@PBundyra PBundyra commented Jul 9, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Introduces KEP for ResourceFlavor fallbacks. The details are described in-depth in KEP itself.

Which issue(s) this PR fixes:

Part of #2560

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/documentation Categorizes issue or PR as related to documentation. labels Jul 9, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2024
@PBundyra
Copy link
Contributor Author

PBundyra commented Jul 9, 2024

Copy link

netlify bot commented Jul 9, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit ee56d11
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/668f7e98e2f2850008bfd639

@alculquicondor
Copy link
Contributor

cc @tenzen-y @kerthcet

@tenzen-y
Copy link
Member

tenzen-y commented Jul 9, 2024

cc @tenzen-y @kerthcet

ACK

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this feature, and I'm looking forward to providing this feature.
But, I'm wondering if there are some rabbit hole and some hidden or implied specifications in this KEP.

keps/2560-flavor-fallback/README.md Outdated Show resolved Hide resolved
keps/2560-flavor-fallback/README.md Outdated Show resolved Hide resolved
#### Story 2

As a batch admin I configured my ClusterQueue to have 3 flavors: FlavorA, FlavorB, FlavorC. I want my users to use run their jobs on FlavorA first. On fallback I want them to try to run theirs jobs on FlavorB but if there is no capacity in the region, use FlavorC instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a story for the situations where fragmented resource consumption is happening?
Even if the cloud provider has enough computing, users generally configure the limitation of the autoscaling cluster. And then, once fragmentation happened, this feature is useful.

keps/2560-flavor-fallback/README.md Outdated Show resolved Hide resolved
We propose to introduce a new API to the ClusterQueue object, similarly to AdmissionCheckStrategy. The new field called FlavorFallbackStrategy falls under FlavorFungibility configuration. The order of rules does not affect the order in which Flavors are assigned. This field should be treated more as a mapping between a ResourceFlavor and the timeout, and a list of policies applied to all flavors.

```golang
type FlavorFungibility struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a JSON on every fields so that we can understand and reach consensus for the field name?

// Values: [TimeoutForPodsReadyExceeded, AdmissionCheckRejected]
Trigger string

TimeoutMinutes *int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we validate id this timeout is larger than the waitForPodsReady.timeout?
If no restrictions for this timeout, when we enable waitForPodsReady and fallbackStrategy is enabled, waitForPodsReady could accidentally lose the functionality, right?

// Values: [TimeoutForPodsReadyExceeded, AdmissionCheckRejected]
Trigger string

TimeoutMinutes *int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide the calculation formula to calculate the time consumed throughout the request processes?
Specifically, I would like to know if the consumed time contains 1. backoff time, 2. the time while queue after the workload is requeued.

For sure, I understand that you mentioned that the time while queue doesn't contain the consumed time.
But, I can not confirm whether or not the time during queue regardless of requeing.

// Values: [TimeoutForPodsReadyExceeded, AdmissionCheckRejected]
Trigger string

TimeoutMinutes *int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the Workload reaches this timeout, what happens? Will the Workload get the dedicated condition?
Or, will the Workload just be stopped the same as the PodsReadyTimeout mechanism?

Name ResourceFlavorReference

// trigger is an enum describing whether the fallback is AdmissionCheck based, or timeout based
// Values: [TimeoutForPodsReadyExceeded, AdmissionCheckRejected]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AdmissionCheckRejected is selectable here implies the fact that we can reuse this fallback mechanism in the MultiKueue and other admissionCheck mechanisms, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you mention the flavorFungibility KEP in the see-also section?

https://github.com/kubernetes-sigs/kueue/tree/main/keps/582-preempt-based-on-flavor-order

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2024
@PBundyra
Copy link
Contributor Author

PBundyra commented Oct 9, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2024
@PBundyra PBundyra closed this Nov 5, 2024
@PBundyra PBundyra deleted the KEP-fallback branch November 5, 2024 13:14
@PBundyra PBundyra restored the KEP-fallback branch December 6, 2024 13:47
@PBundyra PBundyra reopened this Dec 6, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PBundyra
Once this PR has been reviewed and has the lgtm label, please ask for approval from tenzen-y. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants