Ensure ContainerHealthy condition is set back to True #15503

Open · wants to merge 1 commit into base: main
4 changes: 4 additions & 0 deletions pkg/reconciler/revision/reconcile_resources.go
@@ -117,6 +117,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
}
}

// If the deployment wants replicas and all of them are ready, any earlier
// container failure has cleared, so mark ContainerHealthy True again.
if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {
@skonto (Contributor) commented on Oct 6, 2024:
@SaschaSchwarze0 hi, should we relax the condition?

We set the revision as container-unhealthy when a "permanent"-like failure is detected:

	// If a container keeps crashing (no active pods in the deployment although we want some)
	if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {

Resetting this should happen when we have at least one pod up (deployment.Status.AvailableReplicas > 0), no?
I am thinking of a bursty-load scenario where not all pods become ready (e.g. some new pods stay in Pending while some old ones recover from the earlier issue and become ready). In that case we would keep the revision's Ready status at False if we never reach the desired number of pods set in the deployment's replicas field.
Could this happen under bursty load where the deployment is set to replicas > currentScale >= minScale >= 1 directly due to an autoscaling decision? 🤔
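
(For reference, the two reset conditions being weighed in this thread can be sketched side by side. This is an illustrative snippet, not code from this PR; it only uses the standard appsv1.Deployment spec/status fields quoted above, and the replica counts in main are made up for the example.)

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// strictReset mirrors the condition added in this PR: only flip
// ContainerHealthy back to True once every desired replica is ready.
func strictReset(d *appsv1.Deployment) bool {
	return d.Spec.Replicas != nil && *d.Spec.Replicas > 0 &&
		*d.Spec.Replicas == d.Status.ReadyReplicas
}

// relaxedReset is the variant floated in this comment: flip back to True as
// soon as at least one replica is available, mirroring the existing
// "mark False when AvailableReplicas == 0" check.
func relaxedReset(d *appsv1.Deployment) bool {
	return d.Spec.Replicas != nil && *d.Spec.Replicas > 0 &&
		d.Status.AvailableReplicas > 0
}

func main() {
	// Bursty-load example from the discussion: three desired replicas,
	// only one of them ready/available so far.
	replicas := int32(3)
	d := &appsv1.Deployment{}
	d.Spec.Replicas = &replicas
	d.Status.ReadyReplicas = 1
	d.Status.AvailableReplicas = 1

	fmt.Println("strict reset fires: ", strictReset(d))  // false
	fmt.Println("relaxed reset fires:", relaxedReset(d)) // true
}
```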

@SaschaSchwarze0 (Contributor, Author) replied:

Hi @skonto, sorry for the late response. I was out for some time.

Can you please help me and clarify what you're after? The code change I am making sets ContainerHealthy to True. You now bring up a discussion and a related condition about when ContainerHealthy should be set to False?

@skonto (Contributor) replied:

Hi @SaschaSchwarze0, I am saying that a revision may be serving traffic even when not all of its pods are ready, right? So resetting this to True only when *deployment.Spec.Replicas == deployment.Status.ReadyReplicas seems a bit too strict, no?

@SaschaSchwarze0 (Contributor, Author) replied:

I see.

Correct, a revision may be serving traffic even if it has unhealthy containers in some of its replicas.

Whether that means that it is okay to set ContainerHealthy to True if already one of the replicas is running fine, I do not know.

My proposed condition is basically the point at which I think ContainerHealthy should definitely be set back to True (because all replicas are fully ready).

How strict or relaxed we can be, I do not know. At some point I tried to find a spec on the exact meaning of the conditions of the different resources, but I did not find one.

What your condition can btw lead to is the following:

This would not happen with my proposed code, because if one pod is not ready, not all replicas of the deployment are ready. Anyway, checking only the first pod to decide whether ContainerHealthy is set to False is another questionable piece of code, because it leads to different results depending on the order in which the pods are returned when their statuses differ.
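
(To illustrate the order-dependence point, here is a hypothetical sketch rather than the reconciler's actual code: firstPodLooksUnhealthy is an invented helper that, like the logic described above, only inspects the first pod in the list it is given.)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// firstPodLooksUnhealthy stands in for logic that derives ContainerHealthy
// from the first listed pod only (hypothetical helper for this example).
func firstPodLooksUnhealthy(pods []corev1.Pod) bool {
	if len(pods) == 0 {
		return false
	}
	for _, cs := range pods[0].Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	healthy := corev1.Pod{Status: corev1.PodStatus{
		ContainerStatuses: []corev1.ContainerStatus{{
			State: corev1.ContainerState{Running: &corev1.ContainerStateRunning{}},
		}},
	}}
	crashing := corev1.Pod{Status: corev1.PodStatus{
		ContainerStatuses: []corev1.ContainerStatus{{
			State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "CrashLoopBackOff"}},
		}},
	}}

	// The same two pods, listed in a different order, give different verdicts.
	fmt.Println(firstPodLooksUnhealthy([]corev1.Pod{healthy, crashing})) // false
	fmt.Println(firstPodLooksUnhealthy([]corev1.Pod{crashing, healthy})) // true
}
```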

@skonto (Contributor) commented on Oct 24, 2024:

I agree in general. Btw, I am also trying to figure out the semantics. Maybe we should require at least one ready pod to reset this to True (assuming we list all pods and ContainerHealthy is False). 🤔

> Anyway, checking only the first pod to decide whether ContainerHealthy is set to False is another questionable piece of code, because it leads to different results depending on the order in which the pods are returned when their statuses differ.

This should only happen when all pods fail, probably for the same reason; that is my understanding of the intention, hence the check for AvailableReplicas == 0.

if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {

@dprotaso any background info to add for the limitations here?

rev.Status.MarkContainerHealthyTrue()
}

return nil
}

53 changes: 53 additions & 0 deletions pkg/reconciler/revision/table_test.go
@@ -41,6 +41,7 @@ import (
tracingconfig "knative.dev/pkg/tracing/config"
autoscalingv1alpha1 "knative.dev/serving/pkg/apis/autoscaling/v1alpha1"
defaultconfig "knative.dev/serving/pkg/apis/config"
"knative.dev/serving/pkg/apis/serving"
v1 "knative.dev/serving/pkg/apis/serving/v1"
"knative.dev/serving/pkg/autoscaler/config/autoscalerconfig"
servingclient "knative.dev/serving/pkg/client/injection/client"
@@ -742,6 +743,38 @@ func TestReconcile(t *testing.T) {
PodSpecPersistentVolumeClaim: defaultconfig.Enabled,
PodSpecPersistentVolumeWrite: defaultconfig.Enabled,
}}),
}, {
Name: "revision's ContainerHealthy turns back to True if the deployment is healthy",
Objects: []runtime.Object{
Revision("foo", "container-unhealthy",
WithLogURL,
MarkRevisionReady,
withDefaultContainerStatuses(),
WithRevisionLabel(serving.RoutingStateLabelKey, "active"),
MarkContainerHealthyFalse("ExitCode137"),
),
pa("foo", "container-unhealthy",
WithPASKSReady,
WithScaleTargetInitialized,
WithTraffic,
WithReachabilityReachable,
WithPAStatusService("something"),
),
readyDeploy(deploy(t, "foo", "container-unhealthy", withReplicas(1))),
image("foo", "container-unhealthy"),
},
Key: "foo/container-unhealthy",
WantStatusUpdates: []clientgotesting.UpdateActionImpl{{
Object: Revision("foo", "container-unhealthy",
WithLogURL,
MarkRevisionReady,
withDefaultContainerStatuses(),
WithRevisionLabel(serving.RoutingStateLabelKey, "active"),
),
}},
WantEvents: []string{
Eventf(corev1.EventTypeNormal, "RevisionReady", "Revision becomes ready upon all resources being ready"),
},
}}

table.Test(t, MakeFactory(func(ctx context.Context, listers *Listers, _ configmap.Watcher) controller.Reconciler {
@@ -851,6 +884,19 @@ func allUnknownConditions(r *v1.Revision) {

type configOption func(*config.Config)

// deploymentOption enables further configuration of the test Deployment.
type deploymentOption func(*appsv1.Deployment)

// withReplicas configures the number of replicas on the Deployment
func withReplicas(replicas int32) deploymentOption {
return func(d *appsv1.Deployment) {
d.Spec.Replicas = &replicas
d.Status.AvailableReplicas = replicas
d.Status.ReadyReplicas = replicas
d.Status.Replicas = replicas
d.Status.UpdatedReplicas = replicas
}
}

func deploy(t *testing.T, namespace, name string, opts ...interface{}) *appsv1.Deployment {
t.Helper()
cfg := reconcilerTestConfig()
@@ -876,6 +922,13 @@ func deploy(t *testing.T, namespace, name string, opts ...interface{}) *appsv1.Deployment {
if err != nil {
t.Fatal("failed to create deployment")
}

// Apply any deployment-level options (such as withReplicas) to the constructed Deployment.
for _, opt := range opts {
if deploymentOpt, ok := opt.(deploymentOption); ok {
deploymentOpt(deployment)
}
}

return deployment
}

6 changes: 6 additions & 0 deletions pkg/testing/v1/revision.go
@@ -158,6 +158,12 @@ func MarkContainerHealthyUnknown(reason string) RevisionOption {
}
}

// MarkContainerHealthyFalse calls the method of the same name on the Revision's
// status with the given reason and an empty message.
func MarkContainerHealthyFalse(reason string) RevisionOption {
return func(r *v1.Revision) {
r.Status.MarkContainerHealthyFalse(reason, "")
}
}

// MarkProgressDeadlineExceeded calls the method of the same name on the Revision
// with the message we expect the Revision Reconciler to pass.
func MarkProgressDeadlineExceeded(message string) RevisionOption {