KEP: Introduce a new timeout to WaitForPodsReady config #2737
Conversation
cc @tenzen-y
/retest
/cc
It's missing details
keps/349-all-or-nothing/README.md (outdated)
First one tracks the time between job getting unsuspended (the time of unsuspending a job is marked by the Job's
`job.status.startTime` field) and reaching the `PodsReady=true` condition.

Second one tracks the time between changing `PodsReady` condition to `false` after the job is running and reaching the
how do you plan to track this time?
I plan to use the timestamp in the `PodsReady` condition to calculate how much time has passed.
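A minimal sketch of that calculation, assuming the recovery timeout is checked against the condition's `LastTransitionTime`; the helper name and its call site are hypothetical:

```go
package example

import (
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// recoveryTimeoutExceeded reports whether the workload has stayed in
// PodsReady=false longer than the configured recovery timeout, using the
// timestamp recorded when the condition last flipped.
func recoveryTimeoutExceeded(wl *kueue.Workload, recoveryTimeout time.Duration, now time.Time) bool {
	cond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadPodsReady)
	if cond == nil || cond.Status != metav1.ConditionFalse {
		return false
	}
	return now.Sub(cond.LastTransitionTime.Time) >= recoveryTimeout
}
```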
Please add it to the doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wanted to support additional fault recovery capabilities, it could be desirable to have a separate `PodsUnhealthy` or `WorkloadUnhealthy` condition on the Workload instead of overloading `PodsReady`.
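For illustration only, a sketch of what setting such a separate condition could look like; neither the `PodsUnhealthy` type nor the reason string exists in the Kueue API today, they only render this suggestion concrete:

```go
package example

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// markPodsUnhealthy sets a hypothetical PodsUnhealthy condition instead of
// flipping PodsReady back to False; all names below are illustrative only.
func markPodsUnhealthy(wl *kueue.Workload, message string) {
	apimeta.SetStatusCondition(&wl.Status.Conditions, metav1.Condition{
		Type:    "PodsUnhealthy",         // hypothetical condition type
		Status:  metav1.ConditionTrue,
		Reason:  "PodFailedAfterStartup", // hypothetical reason
		Message: message,
	})
}
```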
What semantic differences do you imagine between `PodsUnhealthy`/`WorkloadUnhealthy` and `PodsReady`?
Throughout this KEP, I feel that it lacks details.
So, could you explain
- the timing when this timeout is applied to the workloads.
- how many times we allow the workload to fail after the workload has been healthy once.
- how to implement this mechanism. I guess that it would be great to clarify the responsibilities of each component for this feature. For example, the JobFramework reconciler is responsible for ... once the ready Pod crashes ...
This is an interesting extension to the state machine for a Workload. It is getting pretty close to what one might need for a fairly general fault detection/recovery mechanism. Is there interest in pursuing that angle? We've explored this space fairly extensively for AppWrappers (https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/) and would be quite interested in bringing similar capabilities to Kueue's GenericJob framework.
I prefer a separate timeout, because it will typically be much smaller than the timeout for all pods (first start).
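To make the shape of the proposal concrete, here is a rough sketch of how the new knob could sit next to the existing startup timeout in the Configuration API; only `recoveryTimeout` as a separate, typically smaller timeout is taken from this discussion, the exact field layout is an assumption:

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// WaitForPodsReady sketches the configuration section with the proposed field.
type WaitForPodsReady struct {
	// Timeout bounds the time from unsuspending the job until PodsReady=true
	// is reached for the first time.
	Timeout *metav1.Duration `json:"timeout,omitempty"`

	// RecoveryTimeout bounds how long a previously running workload may stay
	// in PodsReady=false before it is evicted; typically much smaller than
	// Timeout. (Proposed by this KEP; other existing fields are omitted.)
	RecoveryTimeout *metav1.Duration `json:"recoveryTimeout,omitempty"`
}
```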
LGTM overall
Co-authored-by: Michał Woźniak <[email protected]>
Please update
/lgtm
LGTM label has been added. Git tree hash: 28e85bd3b3d6c1f7cb41ad315a401eceddf6d269
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mimowo, PBundyra
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
LGTM
Co-authored-by: Yaroslava Serdiuk <[email protected]>
New changes are detected. LGTM label has been removed.
lgtm
LGTM even better 👍
@@ -102,8 +104,7 @@ know that this has succeeded?

- guarantee that two jobs would not schedule pods concurrently. Example
scenarios in which two jobs may still concurrently schedule their pods:
- when succeeded pods are replaced with new because job's parallelism is less than its completions;
- when a failed pod gets replaced
- when succeeded pods are replaced with new because job's parallelism is less than its completions.
Suggested change:
  - when succeeded pods are replaced with new because job's parallelism is less than its completions.
This deeply depends on the above line, which means that this describes the above situation.
@@ -315,9 +331,11 @@ type RequeueState struct {

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater or equal
the condition when `job.status.ready + len(job.status.uncountedTerminatedPods.succeeded) + job.status.succeeded` is greater or equal
Does this indicate the following?
kueue/pkg/controller/jobs/job/job_controller.go, lines 313 to 316 in 6ac33de:
func (j *Job) PodsReady() bool {
	ready := ptr.Deref(j.Status.Ready, 0)
	return j.Status.Succeeded+ready >= j.podsCount()
}
In that case, what about jobs other than batch/v1 Jobs?
than `job.spec.parallelism`.

Note that we count `job.status.uncountedTerminatedPods` - this is meant to prevent flickering of the `PodsReady` condition when pods are transitioning to the `Succeeded` state.
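A minimal sketch of the adjusted check for batch/v1 Jobs, assuming it extends the `PodsReady()` helper quoted above; counting pods still listed in `uncountedTerminatedPods` avoids the flicker while they move to `Succeeded` (the accessor details are assumptions):

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// podsReady counts ready pods, already-counted succeeded pods, and succeeded
// pods that are still in uncountedTerminatedPods, so the condition does not
// flicker while pods transition to the Succeeded state.
func podsReady(job *batchv1.Job) bool {
	ready := ptr.Deref(job.Status.Ready, 0)
	uncountedSucceeded := int32(0)
	if job.Status.UncountedTerminatedPods != nil {
		uncountedSucceeded = int32(len(job.Status.UncountedTerminatedPods.Succeeded))
	}
	parallelism := ptr.Deref(job.Spec.Parallelism, 1)
	return ready+uncountedSucceeded+job.Status.Succeeded >= parallelism
}
```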
Does this improve the existing waitForPodsReady feature even if we do not introduce the recoveryTimeout?
Second one applies when the job has already started and some pod failed while the job is running. It tracks the time between changing `PodsReady` condition to `false` and reaching the
`PodsReady=true` condition once again.
What if the Job fails again after it becomes ready following the recovery?
I mean, what if the Job falls into the loop below?
```mermaid
flowchart TD;
    id3(PodsReady=true);
    id4("PodsReady=false (2nd)
    waits for
    .recoveryTimeout");
    id3 --"Pod failed"--> id4
    id4 --"Pod recovered"--> id3
    id4 --"timeout exceeded"--> id5
```
We introduce new `WorkloadWaitForPodsStart` and `WorkloadWaitForPodsRecovery` reasons to distinguish the reasons of setting the `PodsReady=false` condition. |
How can existing users migrate the previous "PodsReady" reason to the new reason?
Are there any migration plans?
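For context, a sketch of how the two reasons could drive which timeout is enforced; the constant names come from the KEP text, but their string values and the helper are assumptions:

```go
package example

import "time"

// Reason names taken from the KEP; the string values are assumptions.
const (
	WorkloadWaitForPodsStart    = "WaitForPodsStart"
	WorkloadWaitForPodsRecovery = "WaitForPodsRecovery"
)

// timeoutForReason picks the timeout to enforce based on why PodsReady=false.
func timeoutForReason(reason string, startupTimeout, recoveryTimeout time.Duration) time.Duration {
	if reason == WorkloadWaitForPodsRecovery {
		return recoveryTimeout
	}
	return startupTimeout
}
```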
What type of PR is this?
/kind feature
What this PR does / why we need it:
Described in #2732
Which issue(s) this PR fixes:
Part of #2732
Special notes for your reviewer:
As an alternative, instead of adding a new timeout, we could reuse the existing one to provide the needed functionality.
Pros:
Cons:
Does this PR introduce a user-facing change?