RFC: Degraded NodePool Status Condition #1885

Open
wants to merge 6 commits into base: main
Conversation

@rschalo (Contributor) commented Dec 17, 2024

Description
See the Design File for the full RFC. This change adds a `Degraded` status condition to NodePools in order to better communicate to users when a NodePool may be misconfigured.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2024
@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rschalo
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 17, 2024
@coveralls commented Dec 17, 2024

Pull Request Test Coverage Report for Build 12366878113

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.05%) to 81.25%

Files with Coverage Reduction | New Missed Lines | %
pkg/utils/termination/termination.go | 2 | 92.31%
pkg/controllers/provisioning/scheduling/nodeclaim.go | 3 | 85.64%

Totals Coverage Status
Change from base Build 12303032942: -0.05%
Covered Lines: 9022
Relevant Lines: 11104

💛 - Coveralls

@rschalo rschalo marked this pull request as ready for review December 17, 2024 05:40
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2024
@sftim left a comment

There's another option. Conditions can have three values, so we can have a condition Stable (or a different name).

If Karpenter self-detects flapping, or the logic for the proposed Degraded condition were to fire, Stable can become false. Crucially, if Karpenter is not sure, the Stable condition can become Unknown and we can provide a reason string for that.

For example, I'd set a brand-new NodePool to Stable: Unknown, and eagerly transition (from Stable: Unknown to Stable: False) as soon as just one NodeClaim fails to provision.
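For illustration only, here is a minimal sketch of what such an eagerly-set Stable: Unknown condition could look like, assuming the standard metav1.Condition type; the Reason and Message strings are hypothetical, not part of any existing API:

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stableUnknown builds the proposed (hypothetical) Stable condition in its
// Unknown state, set as soon as a single NodeClaim fails to provision rather
// than jumping straight to False.
func stableUnknown(failedNodeClaim string) metav1.Condition {
	return metav1.Condition{
		Type:               "Stable", // proposed condition name from this thread
		Status:             metav1.ConditionUnknown,
		Reason:             "NodeClaimProvisioningFailed", // hypothetical reason
		Message:            "NodeClaim " + failedNodeClaim + " failed to provision; watching for a pattern",
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
}
```

A controller could later flip this to False on repeated failures, or to True once there is stronger evidence of stability, which is the nuance discussed below.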

@rschalo (Contributor, Author) commented Dec 17, 2024

So in the case of utilizing Unknown, we'd still not make the new status condition a precondition for readiness, and we'd eagerly switch into Unknown... Yeah, I think that makes sense as a third option, and it still leaves the possibility of it affecting functionality in the future if we want.

Flapping in and out of Unknown seems fine if we don't block provisioning for that NodePool. Do you see any disadvantages to changing the status condition from Unknown to True after a single success?

@sftim commented Dec 17, 2024

Do you see any disadvantages to changing the status condition from Unknown to True after a single success

It's weak evidence. For example, let's say 4 out of 5 nodes fail to provision (not at random; let's say exactly one in 5 succeeds). If that continues long term I would prefer to see the status become - or remain as - Stable false.

If a new NodePool provisions one node and that one node is fine, sure, turn the Stable condition to true. If a previously-flapping NodePool provisions one more new node and that one new node is fine, I wouldn't want to infer that the NodePool is stable. Nuance is what this idea is all about.

@njtran (Contributor) left a comment

I like this idea. Let me know if you have questions.

```
ConditionTypeDegraded = "Degraded"
```

This option would set `Degraded: true` on a NodePool whenever Karpenter suspects something is wrong with the launch path but isn't sure. In this case, if 3 or more NodeClaims fail to launch with a NodePool, then the NodePool will be marked as degraded. The retries are included to account for transient errors. The number of failed launches is stored as a status on the NodePool and is reset to zero following an edit to the NodePool or after a sufficient amount of time has passed.
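As a rough illustration of the mechanism described above (threshold of 3, counter reset on edit or after a timeout), here is a hedged sketch using plain metav1 conditions; the type and field names are simplified stand-ins rather than Karpenter's actual status types, and incrementing the counter on each failed launch is assumed to happen elsewhere:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodePoolStatus is a simplified stand-in for the NodePool status discussed in this RFC.
type nodePoolStatus struct {
	FailedLaunches int
	Conditions     []metav1.Condition
}

const failedLaunchThreshold = 3 // retries allowed to absorb transient errors

// evaluateDegraded resets the failure counter after a NodePool edit or once
// enough time has passed, then sets Degraded: true when the counter reaches
// the threshold.
func evaluateDegraded(st *nodePoolStatus, editedSinceLastFailure bool, sinceLastFailure, resetAfter time.Duration) {
	if editedSinceLastFailure || sinceLastFailure > resetAfter {
		st.FailedLaunches = 0
	}
	cond := metav1.Condition{
		Type:   "Degraded",
		Status: metav1.ConditionFalse,
		Reason: "LaunchesSucceeding", // hypothetical reason string
	}
	if st.FailedLaunches >= failedLaunchThreshold {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "RepeatedLaunchFailures" // hypothetical reason string
		cond.Message = "3 or more NodeClaims failed to launch with this NodePool"
	}
	meta.SetStatusCondition(&st.Conditions, cond)
}
```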
Contributor

Have you thought of something we could do to make it scale to the size of the cluster? And which errors are included in "failing to launch"?

Contributor

The number of failed launches is stored as a status on the NodePool and is reset to zero following an edit to the NodePool or after a sufficient amount of time has passed.

I worry about false positives for "waiting to edit" the nodepool. I'm guessing this would be bubbled up from the nodeclass, but I think having the "amount of time" as a backstop makes sense. What amount of time are you thinking?

@rschalo (Contributor, Author) Dec 17, 2024

I was thinking that the edit to a NodePool would be a signal that a user updated the NodePool and is expecting the issue to be resolved.

What amount of time are you thinking?

To keep it simple, 3x the NodeClaim registration TTL.

Contributor (Author)

Have you thought of something we could do to make it scale to the size of the cluster? And which errors are included in "failing to launch"?

Not really, the number of NodePools/NodeClasses wouldn't affect the number of failures since we're looking at the state of a single NodePool. I think the cluster size would just affect how quickly a NodePool is considered degraded.

And which errors are included in "failing to launch"?

This currently includes node registration failures.

Resources v1.ResourceList `json:"resources,omitempty"`
// FailedLaunches tracks the number of times a nodepool failed before being marked degraded
// +optional
FailedLaunches int `json:"failedLaunches,omitempty"`
Contributor

How can we indicate that this is "failed launches in a row"? Does this mechanically reset if we get a successful launch?

Suggested change:
-	FailedLaunches int `json:"failedLaunches,omitempty"`
+	FailedSuccessiveLaunches int `json:"failedLaunches,omitempty"`

Contributor (Author)

This is probably actually FailedRegistrations

Member

So the number of times we hit the registration TTL?

The Azure provider will use the native controller retries for successive launches of a NodeClaim:

Attempt 1: Try with the cheapest SKU -> FAIL with a quota error (we need a better quota API); cache the SKU as lacking quota.
Attempt 2: Try again with the next cheapest SKU -> Succeed.

So sometimes with our provider we will have an expected launch failure that we know will succeed on the retry.

I like FailedRegistrations better, but also for some of these LRO operations, being able to report back from the cloudprovider side when a launch fails would be useful.

@sftim Dec 20, 2024

Maybe this should be nodeClaimRecentLaunchSuccessRatio. We really care about recent launch attempts and failures, not the entire lifecycle of the NodePool. But I also don't think that one successful launch should discard the info here.

Imagine if the NodePool spans multiple zones but is misconfigured, except for in a zone that the workload hardly uses. (eg: workload uses zones 1 and 2; cluster supports zones 1, 2 and 3). Launching in zone 3 might make things look better than they are.

```
}
```

Once a NodePool is `Degraded`, it recovers with `Degraded: false` after an update to the NodePool or when the NodeClaim registration expiration TTL (currently 15 minutes) passes since the `lastTransitionTime` for the status condition on the NodePool, whichever comes first. A `Degraded` NodePool is not passed over when provisioning and may continue to be chosen during scheduling. A successful provisioning could also remove the status condition but this may cause more apiserver and metric churn than is necessary.
Contributor

Once a NodePool is Degraded, it recovers with Degraded: false after an update to the NodePool or when the NodeClaim registration expiration TTL (currently 15 minutes) passes since the lastTransitionTime for the status condition on the NodePool, whichever comes first.

Doesn't this defeat the purpose? If it's 15m, then I imagine that if I'm having bootstrapping errors I'll almost never hit this feature unless I'm scaling up 3 nodes at a time.

Contributor

A Degraded NodePool is not passed over when provisioning and may continue to be chosen during scheduling.

Maybe eventually configurable if we get it right. I could imagine users wanting to orchestrate a signal (or maybe leverage this one) to tell Karpenter when they think something's wrong.

Contributor

A successful provisioning could also remove the status condition but this may cause more apiserver and metric churn than is necessary.

This seems like a natural time to unmark the status condition. It only causes churn if the issue is ephemeral or racy, and in the ephemeral case I'd imagine the NodePool isn't degraded. I'd like to understand what some of the racy cases would be, though.

Contributor (Author)

Doesn't this defeat the purpose? if it's 15m then I imagine if I'm having bootstrapping errors that I'll almost never hit this feature if I'm not scaling up 3 nodes at a time.

Yeah, I was initially thinking it was the same as the NodeClaim registration TTL, but it should probably be 3x that to address this issue.

Maybe eventually configurable if we get it right. i could imagine users wanting to orchestrate a signal (or maybe leverage this one) to signal to Karpenter when they think something's wrong.

Yeah, I think when introducing this we don't want to make functional changes yet, but that could be done in the future. One thing I want to sort out is how to appropriately add/filter metrics for NodeClaims being provisioned by degraded NodePools.
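To make the recovery behaviour discussed in this thread concrete (clear `Degraded` after a NodePool edit, or after roughly 3x the NodeClaim registration TTL since `lastTransitionTime`, whichever comes first), here is a hedged sketch; the function name and the 3x multiplier are illustrative only, following the adjustment suggested above:

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldClearDegraded is a hypothetical check for when to set Degraded back to
// false: either the NodePool was edited after the condition last transitioned,
// or roughly 3x the registration TTL has elapsed since lastTransitionTime.
func shouldClearDegraded(cond metav1.Condition, nodePoolEditedAt time.Time, registrationTTL time.Duration) bool {
	if cond.Type != "Degraded" || cond.Status != metav1.ConditionTrue {
		return false
	}
	if nodePoolEditedAt.After(cond.LastTransitionTime.Time) {
		return true // the user changed the NodePool; assume they expect a reset
	}
	return time.Since(cond.LastTransitionTime.Time) >= 3*registrationTTL
}
```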



As additional misconfigurations are handled, they can be added to the `Degraded` status condition and the `Degraded` controller expanded to handle automated recovery efforts. This is probably most simply achieved by changing the status condition metrics to use comma-delimited `Reason`s, with the most recent change present in the `Message`.
Contributor

Is there prior art for doing this? I'm wondering if we're actually going to be able to bubble up these failures correctly anyway, or if this is work we're carving out that wouldn't even be usable.


reason should be a single machine-readable value, and we don't delimit it.

If we wanted multiple reasons, the API would support a list here.


We can fire Events, though, and I think we can even store annotations onto those Events if we want to.

@rschalo (Contributor, Author) Dec 20, 2024

Agreed that the CSV-style reasons are a bad idea.

We can fire Events, though, and I think we can even store annotations onto those Events if we want to.

I think it's possible we fire an event for the NodeClaim if the backing compute fails to join the cluster. The only time these events would spam is if the NodePool/NodeClass is so misconfigured that every initialization attempt fails. It looks like we currently only emit an event for initial NodeClaim launch failures and not for failures in Registration and Initialization.
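For reference, a hedged sketch of firing such an event with the plain client-go recorder; Karpenter has its own events abstraction, so treat the recorder wiring, annotation key, and reason string here as assumptions for illustration:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// recordRegistrationFailure fires a Warning event on the NodeClaim when the
// backing compute fails to join the cluster before the registration TTL,
// attaching an annotation as suggested above. All names here are hypothetical.
func recordRegistrationFailure(recorder record.EventRecorder, nodeClaim runtime.Object, nodePoolName string) {
	recorder.AnnotatedEventf(nodeClaim,
		map[string]string{"karpenter.sh/nodepool": nodePoolName}, // hypothetical annotation
		corev1.EventTypeWarning,
		"RegistrationFailed",
		"node failed to register with the cluster before the registration TTL expired")
}
```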


This introduces challenges when determining when to evaluate contributors to the status condition, but since the `Degraded` status condition only has a single contributor, this decision can be punted. When the time comes to implement multiple contributors to this status condition, this probably looks like a `Degraded` controller which acts as a "heartbeat" and evaluates each of the contributors.

Finally, this status condition would not be a precondition for NodePool `Readiness` because the NodePool should still be considered for scheduling purposes.
Contributor

But this would need to be rolled up somehow from the NodeClass too, right? So we'd need a new NodePool "Degraded" condition which relies on the NodeClass "Degraded" condition.


Can we actually assume that the NodeClass has a .status field? (I don't think we should assume that).

Contributor (Author)

We should probably just track this on the NodePool, but we can do it in a way where the NodeClass that was used is stored as a hash on the NodePool (see approach 4).
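A hedged sketch of that idea, assuming the hash is simply a digest of the serialized NodeClass spec stored alongside the failure counter; the helper name and approach are illustrative, and Karpenter's real drift-hashing machinery may differ:

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// nodeClassHash returns a stable digest of the NodeClass spec so the NodePool
// can record which NodeClass configuration its failure counter refers to, and
// reset the counter when that hash changes.
func nodeClassHash(nodeClassSpec any) (string, error) {
	raw, err := json.Marshal(nodeClassSpec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}
```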


#### Considerations

1. 👎 Three retries can still be a long time to wait on compute that would never succeed
Contributor

Maybe I misunderstood, but resetting the timer on the registration timeout means we never hit this, right?

designs/degraded-nodepools.md (outdated comment, resolved)

#### Considerations

1. 👎👎 Validation implies the original state of the NodePool was correct and is something Karpenter can vet with certainty. A NodePool could have been correctly validated but then degraded.
Contributor

False positives. I think the severity of this con depends on what signals we're choosing to include.


2. 👎👎 Changes the meaning of `Validated` in terms of `Readiness`
3. 👎 Relies on statuses that were not part of the original validation of a NodePool.
Contributor

I'm not sure I get what this con is.

Contributor (Author)

I was thinking of Validated as representing a "valid state of the NodePool when it was first created", and that it should only be set when we know something is definitely true or false. The failures Degraded seeks to capture are grey areas where something might be wrong but we aren't sure.


I think a NodePool can start valid and become invalid (eg due to someone writing a new spec, or because they deleted the node class object, or potentially even by the cloud provider changing its offering).

  • Example of a NodePool becoming invalid through a spec update: set a requirement for karpenter.sh/capacity-type: ["nonexistent"]
  • Maybe if someone sets expireAfter to zero, that should also be invalid?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 20, 2024
Comment on lines +117 to +119
// LaunchedNode indicates if the current NodePool configuration has launched a node.
// +optional
Verified bool `json:"verified,omitempty"`
Member

Suggested change:
-	// LaunchedNode indicates if the current NodePool configuration has launched a node.
+	// Verified indicates if the current NodePool configuration has launched a node.
 	// +optional
 	Verified bool `json:"verified,omitempty"`

Using language like "launch" is interesting; the node has launched, but is it clear to a user what the differences between Launched, Registered, Initialized, etc. are? Are we going to use the Launched condition, or will we check that it has gone through the full NodeClaim lifecycle before verifying a NodePool?

Maybe they aren't configuring labels for daemonsets on their nodes correctly, so a daemonset dropping configuration for the CNI never initializes and pods can't schedule; the node's startup taints for that CNI are never removed, but from the controller's perspective we have moved past a launch, so it's considered verified.

Should we only consider a nodepool verified if a nodeclaim has gone through all the create lifecycle operations? (Launch + Initialization + Registration)

Contributor (Author)

the node has launched but to a user is it clear the differences between Launched, Registered, Initialized etc?

we have moved past a launch so its considered verified.

Are we going to use the Launched condition, will we check that its gone through the full nodeclaim lifecycle before verifying a nodepool?

My mistake in wording - I think we consider a NodePool Verified after all NodeClaim lifecycle operations succeed for a given NodePool/NodeClass.

As far as Karpenter could tell, everything should have succeeded: Karpenter didn't see a cloud provider create error, but some other misconfiguration prevented a node from joining the cluster. If the NodePool/NodeClass has been verified, then we can fire an event on the NodePool about the most recent failure with a probable cause. That could be the startup taints issue you mention, or hitting registration TTLs because there's no network path.
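A hedged sketch of that check, treating the NodeClaim lifecycle conditions as plain metav1.Conditions; the condition type strings and function name are assumptions for illustration:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeClaimCompletedLifecycle reports whether a NodeClaim has passed the full
// create lifecycle (Launched, Registered, Initialized); only then would the
// owning NodePool be considered verified under this proposal.
func nodeClaimCompletedLifecycle(conds []metav1.Condition) bool {
	for _, t := range []string{"Launched", "Registered", "Initialized"} {
		c := meta.FindStatusCondition(conds, t)
		if c == nil || c.Status != metav1.ConditionTrue {
			return false
		}
	}
	return true
}
```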

Contributor (Author)

Also, if a NodePool is updated, then these verifications should be cleared.
