Ensure ContainerHealthy condition is set back to True #15503

SaschaSchwarze0 · 2024-09-09T13:56:22Z

Proposed Changes

This adds a code path to the Revision reconciler that changes the ContainerHealthy condition from False back to True, because the old code path is no longer active (see linked issue). The criterion chosen is that the deployment has replicas and all of them are ready.
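
As a rough illustration of the criterion (not the actual patch), the check can be sketched with simplified stand-in types. The names `deployment`, `deploymentSpec`, `deploymentStatus`, `allReplicasReady`, and `int32Ptr` are hypothetical; the reconciler itself works on the Kubernetes Deployment type. The nil guard is an assumption added for this sketch only.

```go
package main

import "fmt"

// Simplified stand-ins for the Deployment fields the check reads.
// These are hypothetical types for illustration only.
type deploymentSpec struct {
	Replicas *int32 // desired replica count; a pointer, as in the Kubernetes API
}

type deploymentStatus struct {
	ReadyReplicas int32
}

type deployment struct {
	Spec   deploymentSpec
	Status deploymentStatus
}

// allReplicasReady reports whether the deployment wants at least one replica
// and every desired replica is ready.
func allReplicasReady(d deployment) bool {
	if d.Spec.Replicas == nil { // guard added for the sketch (assumption)
		return false
	}
	want := *d.Spec.Replicas
	return want > 0 && want == d.Status.ReadyReplicas
}

// int32Ptr is a small helper to build pointer values in the example.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	d := deployment{
		Spec:   deploymentSpec{Replicas: int32Ptr(3)},
		Status: deploymentStatus{ReadyReplicas: 3},
	}
	fmt.Println(allReplicasReady(d)) // all three replicas ready

	d.Status.ReadyReplicas = 2
	fmt.Println(allReplicasReady(d)) // one replica still not ready
}
```

With three desired replicas, the check passes only once all three are ready; a deployment scaled to zero never passes it.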

Release Note

A revision is now set to ContainerHealthy=True when all replicas of a deployment are ready

knative-prow · 2024-09-09T13:56:32Z

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow · 2024-09-09T13:56:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SaschaSchwarze0
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

skonto · 2024-09-09T13:58:01Z

/ok-to-test

skonto · 2024-09-09T13:58:27Z

cc @dprotaso for review.

codecov · 2024-09-09T14:01:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.47%. Comparing base (0824bd4) to head (1e89b8d).
Report is 36 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15503      +/-   ##
==========================================
- Coverage   84.51%   84.47%   -0.05%     
==========================================
  Files         219      219              
  Lines       13608    13613       +5     
==========================================
- Hits        11501    11499       -2     
- Misses       1737     1744       +7     
  Partials      370      370

Co-authored-by: Matthias Diester <[email protected]>

skonto · 2024-10-03T11:28:00Z

@dprotaso gentle ping, any objections on this one? It seems ok to me.

skonto · 2024-10-06T16:23:02Z

pkg/reconciler/revision/reconcile_resources.go

@@ -117,6 +117,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
 		}
 	}

+	if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {


@SaschaSchwarze0 hi, should we relax the condition?

We set the revision to ContainerHealthy=False when a "permanent"-like failure is detected:

// If a container keeps crashing (no active pods in the deployment although we want some)
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {

Shouldn't the reset happen as soon as we have at least one pod up (deployment.Status.AvailableReplicas > 0)?
I am thinking of a bursty load scenario where not all pods become ready (e.g. some new pods stay in the Pending state while some old ones recover from the earlier issue and become ready). In that case we would keep the revision in a not-ready (False) status for as long as we don't reach the desired number of pods set in the deployment's replicas field.
Could this happen under bursty load, where an autoscaling decision sets the deployment directly to replicas > currentScale >= minScale >= 1? 🤔
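
To make the difference between the two reset criteria concrete, here is a minimal sketch comparing them on the bursty scenario described above. The `status` struct and the function names `strictReset` and `relaxedReset` are hypothetical; the numbers (5 desired, 2 ready and available) are made up for illustration.

```go
package main

import "fmt"

// status is a hypothetical, flattened view of the deployment fields
// mentioned in this thread.
type status struct {
	Desired   int32 // *deployment.Spec.Replicas
	Ready     int32 // deployment.Status.ReadyReplicas
	Available int32 // deployment.Status.AvailableReplicas
}

// strictReset mirrors the condition proposed in the PR:
// reset only when every desired replica is ready.
func strictReset(s status) bool {
	return s.Desired > 0 && s.Desired == s.Ready
}

// relaxedReset mirrors the alternative raised in review:
// reset as soon as at least one replica is available.
func relaxedReset(s status) bool {
	return s.Available > 0
}

func main() {
	// Bursty load: the autoscaler asked for 5 pods, 2 recovered, 3 are still pending.
	burst := status{Desired: 5, Ready: 2, Available: 2}
	fmt.Printf("strict=%v relaxed=%v\n", strictReset(burst), relaxedReset(burst))
	// prints "strict=false relaxed=true"
}
```

Under the strict criterion the condition stays False until all five pods are ready, even though two pods can already serve traffic.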

SaschaSchwarze0 · 2024-10-17

Hi @skonto, sorry for the late response. I was out for some time.

Can you please help me and clarify what you're after? The code change I am making sets ContainerHealthy to True. Are you now raising a related question about when ContainerHealthy should be set to False?

skonto · 2024-10-24

Hi @SaschaSchwarze0, I am saying that a revision serves traffic even when not all pods are ready, right? So resetting this to True only when *deployment.Spec.Replicas == deployment.Status.ReadyReplicas seems a bit too strict, no?

SaschaSchwarze0 · 2024-10-24

I see.

Correct, a revision may be serving traffic even if some of its replicas have unhealthy containers.

Whether that means it is okay to set ContainerHealthy to True as soon as one replica is running fine, I do not know.

My proposed condition covers the case where I think ContainerHealthy should definitely be set to True (because all replicas are fully ready).

How strict or relaxed we can be, I do not know. At some point I tried to find a spec on the exact meaning of the conditions of the different resources, but I did not find one.

By the way, your condition can lead to the following:

You have a revision with two replicas; one is not healthy, the other one is.

The code in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L86 checks the first pod only. If that one is not healthy, it sets ContainerHealthy=False.

I think the code added here runs later. It will notice that there are two replicas and one is ready, and set ContainerHealthy to True.

That would not happen with my proposed code, because if one pod is not ready, not all replicas of the deployment are ready. Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code because it leads to different results when one has different pod statuses depending on the order in which the pods are returned.
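
The order dependence mentioned above can be shown with a small sketch. It only mimics a "check the first pod" rule as described in this comment; the `pod` type and both function names are hypothetical and do not reproduce the reconciler's actual pod inspection.

```go
package main

import "fmt"

// pod is a hypothetical stand-in carrying only the health flag we care about.
type pod struct {
	Name    string
	Healthy bool
}

// firstPodUnhealthy mimics a check that inspects only the first listed pod.
func firstPodUnhealthy(pods []pod) bool {
	return len(pods) > 0 && !pods[0].Healthy
}

// anyPodUnhealthy inspects every pod, so its result does not depend on order.
func anyPodUnhealthy(pods []pod) bool {
	for _, p := range pods {
		if !p.Healthy {
			return true
		}
	}
	return false
}

func main() {
	// The same two pods, returned in two different orders.
	orderA := []pod{{Name: "pod-a", Healthy: false}, {Name: "pod-b", Healthy: true}}
	orderB := []pod{{Name: "pod-b", Healthy: true}, {Name: "pod-a", Healthy: false}}

	fmt.Println(firstPodUnhealthy(orderA), firstPodUnhealthy(orderB)) // prints "true false"
	fmt.Println(anyPodUnhealthy(orderA), anyPodUnhealthy(orderB))     // prints "true true"
}
```

The first-pod check gives different answers for the same set of pods, whereas a check over all pods is stable.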

skonto · 2024-10-24

I agree in general. By the way, I am also trying to figure out the semantics. Maybe we should require at least one ready pod to reset to True (assuming we list all pods and we have ContainerHealthy=False). 🤔

> Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code because it leads to different results when one has different pod statuses depending on the order in which the pods are returned.

This should happen only when all pods fail, probably for the same reason. That is my understanding of the intention, hence the check for available replicas == 0:

if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {

@dprotaso any background info to add for the limitations here?
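
Putting the two conditions from this thread side by side, the resulting behaviour can be sketched as a small transition function. This is a simplification under stated assumptions: it ignores the pod-level inspection the real code performs before setting False, it assumes nothing else touches the condition, and `deploymentStatus` and `nextContainerHealthy` are hypothetical names.

```go
package main

import "fmt"

// deploymentStatus is a hypothetical, flattened view of the fields discussed
// in this thread.
type deploymentStatus struct {
	Desired   int32 // *deployment.Spec.Replicas
	Ready     int32 // deployment.Status.ReadyReplicas
	Available int32 // deployment.Status.AvailableReplicas
}

// nextContainerHealthy returns the condition value after one reconcile pass.
func nextContainerHealthy(current bool, s deploymentStatus) bool {
	switch {
	case s.Desired > 0 && s.Available == 0:
		// No pod is available although some are wanted: mark unhealthy.
		return false
	case s.Desired > 0 && s.Desired == s.Ready:
		// Every desired replica is ready: reset to healthy (the proposed criterion).
		return true
	default:
		// Partially ready: keep whatever was set before.
		return current
	}
}

func main() {
	healthy := true
	steps := []deploymentStatus{
		{Desired: 3, Ready: 0, Available: 0}, // everything is crashing
		{Desired: 3, Ready: 1, Available: 1}, // one pod recovered
		{Desired: 3, Ready: 3, Available: 3}, // all pods recovered
		{Desired: 3, Ready: 2, Available: 2}, // one pod fails again
	}
	for _, s := range steps {
		healthy = nextContainerHealthy(healthy, s)
		fmt.Printf("%+v -> ContainerHealthy=%v\n", s, healthy)
	}
}
```

In this sketch the partially ready states keep the previous value, which is the gap the discussion is about: with the strict criterion, the second step stays False although one pod is already serving.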

knative-prow bot requested review from izabelacg and ReToCode September 9, 2024 13:56

knative-prow bot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) labels Sep 9, 2024

knative-prow bot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label Sep 9, 2024

SaschaSchwarze0 mentioned this pull request Sep 9, 2024

Revisions stay in ContainerHealthy=False status forever #15487

Open

Ensure ContainerHealthy condition is set back to True

1e89b8d

Co-authored-by: Matthias Diester <[email protected]>

SaschaSchwarze0 force-pushed the sascha-set-container-healthy-true branch from 94af51c to 1e89b8d Compare September 9, 2024 14:19

ReToCode assigned dprotaso Sep 17, 2024

ReToCode requested review from dprotaso and removed request for izabelacg and ReToCode September 17, 2024 11:37

skonto reviewed Oct 6, 2024

skonto mentioned this pull request Oct 7, 2024

Revision stays in ContainerMissing condition forever after a temporary failure of digest resolution #15466

Open

skonto added this to the v1.16.0 milestone Oct 7, 2024

skonto mentioned this pull request Oct 7, 2024

Fix deployment status propagation when scaling from zero #15550

Open
