Ensure ContainerHealthy condition is set back to True #15503

Open
wants to merge 1 commit into
base: main

SaschaSchwarze0
Contributor

Fixes #15487

Proposed Changes

This adds a code path to the Revision reconciler that changes the ContainerHealthy condition from False back to True, because the old code path is no longer active (see linked issue). The criterion chosen is that the deployment has replicas and all of them are ready.
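
As a rough illustration of that criterion, here is a minimal, self-contained sketch. The `deploymentState` type and the `allReplicasReady` and `int32Ptr` helpers are hypothetical names used only for this example; they stand in for the `Spec.Replicas` and `Status.ReadyReplicas` fields read by the reconciler. The nil guard is an addition of this sketch and is not part of the change itself.

```go
package main

import "fmt"

// deploymentState is a hypothetical, trimmed-down stand-in for the two
// deployment fields the check reads. It is not the real Kubernetes type.
type deploymentState struct {
	SpecReplicas  *int32 // desired replicas (Spec.Replicas); nil means "not set"
	ReadyReplicas int32  // replicas reporting Ready (Status.ReadyReplicas)
}

// allReplicasReady reports whether the deployment wants at least one replica
// and every desired replica is ready.
func allReplicasReady(d deploymentState) bool {
	if d.SpecReplicas == nil {
		// Assumption of this sketch: treat an unset replica count as "not ready".
		return false
	}
	return *d.SpecReplicas > 0 && *d.SpecReplicas == d.ReadyReplicas
}

// int32Ptr is a small helper for building test values.
func int32Ptr(v int32) *int32 { return &v }

func main() {
	fmt.Println(allReplicasReady(deploymentState{SpecReplicas: int32Ptr(3), ReadyReplicas: 3})) // true
	fmt.Println(allReplicasReady(deploymentState{SpecReplicas: int32Ptr(3), ReadyReplicas: 2})) // false
	fmt.Println(allReplicasReady(deploymentState{SpecReplicas: int32Ptr(0), ReadyReplicas: 0})) // false
}
```

A deployment scaled to zero never satisfies the check, so this sketch would not flip the condition for a revision with no replicas.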

Release Note

A revision is now set to ContainerHealthy=True when all replicas of its deployment are ready.

knative-prow bot added the labels needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Sep 9, 2024

knative-prow bot commented Sep 9, 2024

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow bot commented Sep 9, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SaschaSchwarze0
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

skonto commented Sep 9, 2024

/ok-to-test

knative-prow bot added the label ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.) and removed the label needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) on Sep 9, 2024
Contributor

skonto commented Sep 9, 2024

cc @dprotaso for review.

codecov bot commented Sep 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.47%. Comparing base (0824bd4) to head (1e89b8d).
Report is 36 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15503      +/-   ##
==========================================
- Coverage   84.51%   84.47%   -0.05%     
==========================================
  Files         219      219              
  Lines       13608    13613       +5     
==========================================
- Hits        11501    11499       -2     
- Misses       1737     1744       +7     
  Partials      370      370              

@ReToCode ReToCode requested review from dprotaso and removed request for izabelacg and ReToCode September 17, 2024 11:37
Contributor

skonto commented Oct 3, 2024

@dprotaso gentle ping, any objections on this one? It seems ok to me.

@@ -117,6 +117,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
}
}

if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {
Contributor

@skonto skonto Oct 6, 2024

@SaschaSchwarze0 hi, should we relax the condition?

We set ContainerHealthy to False on the revision when a seemingly "permanent" failure is detected:

	// If a container keeps crashing (no active pods in the deployment although we want some)
	if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {

Shouldn't the reset happen as soon as we have at least one pod up (deployment.Status.AvailableReplicas > 0)?
I am thinking of a bursty load scenario where not all pods become ready (e.g. some new pods stay in Pending while some old ones recover from the earlier issue and become ready). With the proposed check, we would keep the revision in a not-ready (False) status for as long as we don't reach the desired number of pods set in the deployment's replicas field.
Could this happen under bursty load, where an autoscaling decision sets the deployment directly to replicas > currentScale >= minScale >= 1? 🤔
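
To make the difference between the two reset conditions concrete, here is a minimal sketch. The `status` type and the three helper functions are hypothetical names introduced only for this illustration; they mirror the deployment fields discussed in this thread rather than the actual reconciler code.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the deployment fields discussed in
// this thread. It is not the real Kubernetes Deployment type.
type status struct {
	Desired   int32 // corresponds to *deployment.Spec.Replicas
	Ready     int32 // corresponds to deployment.Status.ReadyReplicas
	Available int32 // corresponds to deployment.Status.AvailableReplicas
}

// markedUnhealthy mirrors the existing check that sets ContainerHealthy to False.
func markedUnhealthy(s status) bool { return s.Desired > 0 && s.Available == 0 }

// strictReset mirrors the reset condition proposed in this PR.
func strictReset(s status) bool { return s.Desired > 0 && s.Desired == s.Ready }

// relaxedReset mirrors the alternative floated in this comment.
func relaxedReset(s status) bool { return s.Available > 0 }

func main() {
	cases := []struct {
		name string
		s    status
	}{
		{"all pods crashing", status{Desired: 5, Ready: 0, Available: 0}},
		{"partial recovery", status{Desired: 5, Ready: 2, Available: 2}},
		{"fully ready", status{Desired: 5, Ready: 5, Available: 5}},
	}
	for _, c := range cases {
		fmt.Printf("%s: unhealthy=%t strict=%t relaxed=%t\n",
			c.name, markedUnhealthy(c.s), strictReset(c.s), relaxedReset(c.s))
	}
}
```

In the partial-recovery case, neither the unhealthy check nor the strict reset fires, so under these assumptions a revision previously marked False would stay False until every desired replica is ready.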

Contributor Author

Hi @skonto, sorry for the late response. I was out for some time.

Can you please clarify what you're after? The code change I am making sets ContainerHealthy to True. Are you now raising a discussion, and a related condition, about when ContainerHealthy should be set to False?

Contributor

Hi @SaschaSchwarze0, I am saying that a revision is serving traffic even when not all pods are ready, right? So resetting this to True only when *deployment.Spec.Replicas == deployment.Status.ReadyReplicas seems a bit too strict, no?

Contributor Author

I see.

Correct, a revision may be serving traffic even if it does have unhealthy containers in some of the replicas.

Whether that means it is okay to set ContainerHealthy to True as soon as one of the replicas is running fine, I do not know.

My proposed condition is basically the case where I think ContainerHealthy should definitely be set to True (because all replicas are fully ready).

How strict or relaxed we can be, I do not know. At some point I tried to find a spec on the exact meaning of the conditions of the different resources, but I did not find anything.

What your condition can btw lead to is the following:

That would not happen with my proposed code, because if one pod is not ready, then not all replicas of the deployment are ready. Anyway, checking only the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code: when pods have different statuses, the result depends on the order in which the pods are returned.
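
To illustrate the order dependence, here is a minimal sketch. The `pod` type and both functions are hypothetical and only model the idea of inspecting the first returned pod; they are not the reconciler's actual code, and `allPodsFailing` is just one possible order-independent alternative.

```go
package main

import "fmt"

// pod is a hypothetical, minimal view of a pod: only whether its user
// container is currently failing.
type pod struct {
	Name    string
	Failing bool
}

// firstPodFailing looks only at the first pod in the list, so its result
// depends on the order in which the pods were returned.
func firstPodFailing(pods []pod) bool {
	return len(pods) > 0 && pods[0].Failing
}

// allPodsFailing is order-independent: it is true only when the list is
// non-empty and every pod is failing.
func allPodsFailing(pods []pod) bool {
	if len(pods) == 0 {
		return false
	}
	for _, p := range pods {
		if !p.Failing {
			return false
		}
	}
	return true
}

func main() {
	a := []pod{{Name: "pod-a", Failing: true}, {Name: "pod-b", Failing: false}}
	b := []pod{a[1], a[0]} // same pods, returned in the opposite order

	fmt.Println(firstPodFailing(a), firstPodFailing(b)) // true false
	fmt.Println(allPodsFailing(a), allPodsFailing(b))   // false false
}
```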

Contributor

@skonto skonto Oct 24, 2024

I agree in general. Btw, I am also trying to figure out the semantics. Maybe we should require at least one ready pod to reset to True (assuming we list all pods and ContainerHealthy is currently False). 🤔

Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code because it leads to different results when one has different pod statuses depending on the order in which the pods are returned.

This should happen only when all pods fail, probably for the same reason. That is my understanding of the intention, hence the check for AvailableReplicas == 0.

if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {

@dprotaso any background info to add for the limitations here?

Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Successfully merging this pull request may close these issues.

Revisions stay in ContainerHealthy=False status forever
3 participants