[WIP/POC] Degraded NodePool Status Condition #1880

Draft
wants to merge 3 commits into main
Changes from 1 commit
3 changes: 3 additions & 0 deletions kwok/charts/crds/karpenter.sh_nodepools.yaml
@@ -498,6 +498,9 @@ spec:
- type
Member:
Should we propose this change as an RFC? There seems to be a bunch of detail around what this status condition means, how we are going to track failures, what it means with respect to scheduling, etc., and I think it would be good to let people see this and get cloud provider input as well.

Contributor Author:
I'm writing the RFC today and I should be able to post tomorrow.

type: object
type: array
failedlaunches:
description: FailedLaunches tracks the number of times a nodepool failed before being marked degraded
type: integer
resources:
additionalProperties:
anyOf:
3 changes: 3 additions & 0 deletions pkg/apis/crds/karpenter.sh_nodepools.yaml
@@ -496,6 +496,9 @@ spec:
- type
type: object
type: array
failedlaunches:
description: FailedLaunches tracks the number of times a nodepool failed before being marked degraded
type: integer
resources:
additionalProperties:
anyOf:
5 changes: 5 additions & 0 deletions pkg/apis/v1/nodepool_status.go
@@ -27,13 +27,18 @@ const (
ConditionTypeValidationSucceeded = "ValidationSucceeded"
// ConditionTypeNodeClassReady = "NodeClassReady" condition indicates that underlying nodeClass was resolved and is reporting as Ready
ConditionTypeNodeClassReady = "NodeClassReady"
// ConditionTypeDegraded = "Degraded" condition indicates that the nodePool has repeatedly failed to launch nodes that successfully register
ConditionTypeDegraded = "Degraded"
Member:
Do you have thoughts around how you are going to track the last failed launch and then extend the amount of time before we retry?

Contributor Author:
For this, I was thinking we'd use the lastTransitionTime for when Degraded == true.

)

// NodePoolStatus defines the observed state of NodePool
type NodePoolStatus struct {
// Resources is the list of resources that have been provisioned.
// +optional
Resources v1.ResourceList `json:"resources,omitempty"`
// FailedLaunches tracks the number of times a nodepool failed before being marked degraded
// +optional
FailedLaunches int `json:"failedlaunches,omitempty"`
// Conditions contains signals for health and readiness
// +optional
Conditions []status.Condition `json:"conditions,omitempty"`
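A minimal sketch of the idea discussed in the thread above, assuming the retry window is derived from the Degraded condition's lastTransitionTime rather than from a separately stored timestamp. The function name, the package placement, and the degradedBackoff parameter are hypothetical and not part of this PR.

```go
package lifecycle

import (
	"time"

	"k8s.io/utils/clock"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// remainingDegradedBackoff (hypothetical helper) returns how much longer launches for a
// Degraded NodePool should be deferred, measured from the moment the Degraded condition
// last transitioned to true. degradedBackoff is an assumed knob, not defined in this PR.
func remainingDegradedBackoff(clk clock.Clock, nodePool *v1.NodePool, degradedBackoff time.Duration) time.Duration {
	degraded := nodePool.StatusConditions().Get(v1.ConditionTypeDegraded)
	if degraded == nil || !degraded.IsTrue() {
		return 0 // not degraded, nothing extra to wait for
	}
	// Wait out the remainder of the backoff window that started when Degraded went true.
	if remaining := degradedBackoff - clk.Since(degraded.LastTransitionTime.Time); remaining > 0 {
		return remaining
	}
	return 0
}
```

The liveness path could then requeue for this remaining duration instead of deleting the NodeClaim immediately.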
8 changes: 3 additions & 5 deletions pkg/controllers/nodeclaim/lifecycle/controller.go
@@ -80,18 +80,16 @@ func NewController(clk clock.Clock, kubeClient client.Client, cloudProvider clou
launch: &Launch{kubeClient: kubeClient, cloudProvider: cloudProvider, cache: cache.New(time.Minute, time.Second*10), recorder: recorder},
registration: &Registration{kubeClient: kubeClient},
initialization: &Initialization{kubeClient: kubeClient},
liveness: &Liveness{clock: clk, kubeClient: kubeClient},
liveness: &Liveness{clock: clk, kubeClient: kubeClient, cloudProvider: cloudProvider},
}
}

func (c *Controller) Register(_ context.Context, m manager.Manager) error {
return controllerruntime.NewControllerManagedBy(m).
Named(c.Name()).
For(&v1.NodeClaim{}, builder.WithPredicates(nodeclaimutils.IsManagedPredicateFuncs(c.cloudProvider))).
Watches(
&corev1.Node{},
nodeclaimutils.NodeEventHandler(c.kubeClient, c.cloudProvider),
).
Watches(&corev1.Node{}, nodeclaimutils.NodeEventHandler(c.kubeClient, c.cloudProvider)).
Watches(&v1.NodePool{}, nodeclaimutils.NodePoolEventHandler(c.kubeClient, c.cloudProvider)).
WithOptions(controller.Options{
RateLimiter: workqueue.NewTypedMaxOfRateLimiter[reconcile.Request](
// back off until last attempt occurs ~90 seconds before nodeclaim expiration
49 changes: 42 additions & 7 deletions pkg/controllers/nodeclaim/lifecycle/liveness.go
@@ -20,38 +20,73 @@
"context"
"time"

"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/types"
"k8s.io/utils/clock"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/reconcile"

v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
"sigs.k8s.io/karpenter/pkg/cloudprovider"
"sigs.k8s.io/karpenter/pkg/metrics"
)

type Liveness struct {
clock clock.Clock
kubeClient client.Client
clock clock.Clock
kubeClient client.Client
cloudProvider cloudprovider.CloudProvider
}

// registrationTTL is a heuristic time that we expect the node to register within
// If we don't see the node within this time, then we should delete the NodeClaim and try again
const registrationTTL = time.Minute * 15
const registrationTTL = time.Millisecond * 15

func (l *Liveness) Reconcile(ctx context.Context, nodeClaim *v1.NodeClaim) (reconcile.Result, error) {

Check failure on line 45 in pkg/controllers/nodeclaim/lifecycle/liveness.go (GitHub Actions / presubmit, 1.25.x-1.31.x): cyclomatic complexity 12 of func `(*Liveness).Reconcile` is high (> 11) (gocyclo)
registered := nodeClaim.StatusConditions().Get(v1.ConditionTypeRegistered)
if registered.IsTrue() {
return reconcile.Result{}, nil
}
if registered == nil {
return reconcile.Result{Requeue: true}, nil
}
nodePoolName, ok := nodeClaim.Labels[v1.NodePoolLabelKey]
if !ok {
return reconcile.Result{}, nil
}
nodePool := &v1.NodePool{}
if err := l.kubeClient.Get(ctx, types.NamespacedName{Name: nodePoolName}, nodePool); err != nil {
return reconcile.Result{}, client.IgnoreNotFound(err)
}
// if we ever succeed registration, reset failures
if registered.IsTrue() {
nodePool.Status.FailedLaunches = 0
if err := l.kubeClient.Status().Update(ctx, nodePool); err != nil {
if errors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
return reconcile.Result{}, client.IgnoreNotFound(err)
}
return reconcile.Result{}, nil
}
// If the Registered statusCondition hasn't gone True during the TTL since we first updated it, we should terminate the NodeClaim
// NOTE: ttl has to be stored and checked in the same place since l.clock can advance after the check causing a race
if ttl := registrationTTL - l.clock.Since(registered.LastTransitionTime.Time); ttl > 0 {
// If the nodepool is degraded, requeue for the remaining TTL.
if ttl := registrationTTL - l.clock.Since(registered.LastTransitionTime.Time); ttl > 0 || nodePool.StatusConditions().Get(v1.ConditionTypeDegraded).IsTrue() {
Member:
I'm not following what this check is intended to do. Is this supposed to make it so that we dynamically scale the requeue time?

return reconcile.Result{RequeueAfter: ttl}, nil
}
// Delete the NodeClaim if we believe the NodeClaim won't register since we haven't seen the node
// Here we delete the nodeclaim if the node failed to register, we want to retry against the nodeClaim's nodeClass/nodePool 3x.
// store against a nodepool since nodeclass is not available? nodeclass ref on nodepool, nodepool is 1:1 with nodeclass anyway
log.FromContext(ctx).V(1).WithValues("failures", nodePool.Status.FailedLaunches).Info("failed launches so far")
Member:
This might be here for debugging, but I don't think that we should merge this since it's going to be incredibly noisy.

Contributor Author:
Definitely just in for debugging, and I agree it should be removed.

nodePool.Status.FailedLaunches += 1
log.FromContext(ctx).V(1).WithValues("failures", nodePool.Status.FailedLaunches).Info("failed launches so far")
if err := l.kubeClient.Status().Update(ctx, nodePool); err != nil {
Member:
Why do we choose to do an Update in some places and do a patch with optimistic locking in others?

Contributor Author:
I'll make these consistent; I was just working around an issue where Patch didn't always work and I wasn't yet sure why.

log.FromContext(ctx).V(1).WithValues("error for patching", err).Info("error in reg")
if errors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
return reconcile.Result{}, client.IgnoreNotFound(err)
}
log.FromContext(ctx).V(1).WithValues("failures", nodePool.Status.FailedLaunches).Info("somehow passing")

if err := l.kubeClient.Delete(ctx, nodeClaim); err != nil {
return reconcile.Result{}, client.IgnoreNotFound(err)
}
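Pulling the threads above together, here is one hedged sketch of how the failure bookkeeping in Liveness could look if the Degraded check is meant to stretch the requeue window and if all status writes settle on Patch with optimistic locking, as raised in the Update-vs-Patch thread. It folds the FailedLaunches reset into the registered branch, since the early return on registered.IsTrue() in the posted diff leaves the later reset block unreachable. degradedExtraWait is a hypothetical constant, and the helpers are assumed to live alongside liveness.go and share its imports; none of this is the PR's final design.

```go
// trackRegistration (sketch) handles the failure bookkeeping of Liveness.Reconcile.
func (l *Liveness) trackRegistration(ctx context.Context, nodeClaim *v1.NodeClaim, nodePool *v1.NodePool) (reconcile.Result, error) {
	registered := nodeClaim.StatusConditions().Get(v1.ConditionTypeRegistered)
	if registered == nil {
		return reconcile.Result{Requeue: true}, nil
	}
	stored := nodePool.DeepCopy()

	// Registration succeeded: clear the failure counter. This has to happen before any
	// early return, otherwise the reset is unreachable.
	if registered.IsTrue() {
		if nodePool.Status.FailedLaunches != 0 {
			nodePool.Status.FailedLaunches = 0
			if err := l.patchNodePoolStatus(ctx, nodePool, stored); err != nil {
				return reconcile.Result{}, err
			}
		}
		return reconcile.Result{}, nil
	}

	// Still waiting: requeue for the remaining registration TTL. A Degraded NodePool gets
	// an extra, hypothetical degradedExtraWait so retries are spaced further apart.
	ttl := registrationTTL - l.clock.Since(registered.LastTransitionTime.Time)
	if nodePool.StatusConditions().Get(v1.ConditionTypeDegraded).IsTrue() {
		ttl += degradedExtraWait
	}
	if ttl > 0 {
		return reconcile.Result{RequeueAfter: ttl}, nil
	}

	// Registration timed out: count the failure once, then delete the NodeClaim so it is retried.
	nodePool.Status.FailedLaunches++
	if err := l.patchNodePoolStatus(ctx, nodePool, stored); err != nil {
		return reconcile.Result{}, err
	}
	if err := l.kubeClient.Delete(ctx, nodeClaim); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}
	return reconcile.Result{}, nil
}

// patchNodePoolStatus (sketch) keeps status writes consistent: always Patch with optimistic
// locking. Conflicts bubble up as errors and are retried by the workqueue.
func (l *Liveness) patchNodePoolStatus(ctx context.Context, nodePool, stored *v1.NodePool) error {
	if err := l.kubeClient.Status().Patch(ctx, nodePool, client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{})); err != nil {
		return client.IgnoreNotFound(err)
	}
	return nil
}
```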
83 changes: 83 additions & 0 deletions pkg/controllers/nodepool/degraded/controller.go
@@ -0,0 +1,83 @@
/*
Copyright The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package degraded

import (
"context"
"time"

"k8s.io/apimachinery/pkg/api/errors"
controllerruntime "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/builder"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller"
"sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/reconcile"

v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
"sigs.k8s.io/karpenter/pkg/cloudprovider"
"sigs.k8s.io/karpenter/pkg/operator/injection"
nodepoolutils "sigs.k8s.io/karpenter/pkg/utils/nodepool"
)

type Controller struct {
kubeClient client.Client
cloudProvider cloudprovider.CloudProvider
}

func NewController(kubeClient client.Client, cloudProvider cloudprovider.CloudProvider) *Controller {
Member:
Have you thought about how this status condition affects the schedulability of the NodePool? Does it deprioritize the NodePool?

Contributor Author:
I think we don't want to skip over Degraded NodePools, but they should be deprioritized. The simplest way to do this is probably to treat the weight as 0 in scheduling if a NodePool is Degraded.

I think the downside of this is that if a general or fast-and-cheap NodePool is degraded and a fallback NodePool exists which satisfies the pending pod but is more constrained and expensive, then cluster costs could unexpectedly increase.

return &Controller{
kubeClient: kubeClient,
cloudProvider: cloudProvider,
}
}

func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
ctx = injection.WithControllerName(ctx, "nodepool.degraded")
stored := nodePool.DeepCopy()
if nodePool.Status.FailedLaunches >= 3 {
Member:
Consider combining these two statements so that we check for 0 and check for greater than or equal to 3 together, and then set the status condition and patch in the same call.

nodePool.StatusConditions().SetTrueWithReason(v1.ConditionTypeDegraded, "NodeRegistrationFailures",
"Node registration failing for nodepool, verify cluster networking is configured correctly")
if err := c.kubeClient.Status().Patch(ctx, nodePool, client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{})); err != nil {
if errors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
return reconcile.Result{}, client.IgnoreNotFound(err)
}
}
if nodePool.Status.FailedLaunches == 0 {
nodePool.StatusConditions().SetFalse(v1.ConditionTypeDegraded, "", "")
if err := c.kubeClient.Status().Patch(ctx, nodePool, client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{})); err != nil {
if errors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
return reconcile.Result{}, client.IgnoreNotFound(err)
}
}
return reconcile.Result{RequeueAfter: 15 * time.Second}, nil
}

func (c *Controller) Register(_ context.Context, m manager.Manager) error {
b := controllerruntime.NewControllerManagedBy(m).
Named("nodepool.degraded").
For(&v1.NodePool{}, builder.WithPredicates(nodepoolutils.IsManagedPredicateFuncs(c.cloudProvider))).
WithOptions(controller.Options{MaxConcurrentReconciles: 10})
for _, nodeClass := range c.cloudProvider.GetSupportedNodeClasses() {
b.Watches(nodeClass, nodepoolutils.NodeClassEventHandler(c.kubeClient))
}
return b.Complete(reconcile.AsReconciler(m.GetClient(), c))
}
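A minimal sketch of the reviewer's suggestion above: decide the target state of the Degraded condition from the failure counter in one place and issue a single status patch, rather than patching in each branch. The equality guard (k8s.io/apimachinery/pkg/api/equality) is an addition for illustration; the reason, message, and thresholds are taken from the PR as posted.

```go
func (c *Controller) Reconcile(ctx context.Context, nodePool *v1.NodePool) (reconcile.Result, error) {
	ctx = injection.WithControllerName(ctx, "nodepool.degraded")
	stored := nodePool.DeepCopy()

	// Evaluate both thresholds together, then write status once.
	switch {
	case nodePool.Status.FailedLaunches >= 3:
		nodePool.StatusConditions().SetTrueWithReason(v1.ConditionTypeDegraded, "NodeRegistrationFailures",
			"Node registration failing for nodepool, verify cluster networking is configured correctly")
	case nodePool.Status.FailedLaunches == 0:
		nodePool.StatusConditions().SetFalse(v1.ConditionTypeDegraded, "", "")
	default:
		// 1 or 2 failures: leave the condition untouched.
	}
	if !equality.Semantic.DeepEqual(stored.Status.Conditions, nodePool.Status.Conditions) {
		if err := c.kubeClient.Status().Patch(ctx, nodePool, client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{})); err != nil {
			if errors.IsConflict(err) {
				return reconcile.Result{Requeue: true}, nil
			}
			return reconcile.Result{}, client.IgnoreNotFound(err)
		}
	}
	return reconcile.Result{RequeueAfter: 15 * time.Second}, nil
}
```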
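And a sketch of the deprioritization idea from the schedulability thread: when ordering candidate NodePools, a Degraded one could be treated as weight 0 so healthy NodePools are tried first without the Degraded one being skipped. effectiveWeight is a hypothetical helper that assumes the v1 NodePool spec's optional Weight field; this PR does not touch the scheduler, and whether weights should be overridden this way is exactly the kind of question the RFC is meant to settle.

```go
// effectiveWeight (hypothetical) is the weight a scheduler-side sort could use so that
// Degraded NodePools fall to the back of the list without being excluded outright.
func effectiveWeight(nodePool *v1.NodePool) int32 {
	if nodePool.StatusConditions().Get(v1.ConditionTypeDegraded).IsTrue() {
		return 0
	}
	if nodePool.Spec.Weight != nil {
		return *nodePool.Spec.Weight
	}
	return 0
}
```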