Ensure ContainerHealthy condition is set back to True #15503
base: main
Conversation
Hi @SaschaSchwarze0. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: SaschaSchwarze0
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
cc @dprotaso for review.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #15503 +/- ##
==========================================
- Coverage 84.51% 84.47% -0.05%
==========================================
Files 219 219
Lines 13608 13613 +5
==========================================
- Hits 11501 11499 -2
- Misses 1737 1744 +7
Partials 370 370
☔ View full report in Codecov by Sentry.
Co-authored-by: Matthias Diester <[email protected]>
Force-pushed from 94af51c to 1e89b8d.
@dprotaso gentle ping, any objections on this one? It seems ok to me.
@@ -117,6 +117,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
		}
	}

	if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {
@SaschaSchwarze0 hi, should we relax the condition?
We set the revision's ContainerHealthy condition to False when a "permanent"-like failure is detected:
// If a container keeps crashing (no active pods in the deployment although we want some)
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
Resetting this should happen when we have at least one pod up (deployment.Status.AvailableReplicas > 0), no?
I am thinking of a bursty load scenario where not all pods become ready (e.g. some new pods stay in Pending while some old pods recover from the earlier issue and become ready). In that case we keep the revision in a false ready status if we never reach the desired number of pods as set in the deployment's replicas field.
Could this happen with a bursty load where the deployment is set to replicas > currentScale >= minScale >= 1 directly due to some autoscaling decision? 🤔
Hi @skonto, sorry for the late response. I was out for some time.
Can you please help me and clarify what you are after? The code change I am making sets ContainerHealthy to True. You are now raising a discussion, and a related condition, about when ContainerHealthy should be set to False?
Hi @SaschaSchwarze0, I am saying that a revision may be serving traffic even when not all pods are ready, right? So resetting this to True only when *deployment.Spec.Replicas == deployment.Status.ReadyReplicas seems a bit too strict, no?
I see.
Correct, a revision may be serving traffic even if it does have unhealthy containers in some of the replicas.
Whether that means it is okay to set ContainerHealthy to True as soon as one of the replicas is running fine, I do not know.
My proposed condition is basically the point at which I think ContainerHealthy should definitely be set back to True (because all replicas are fully ready).
How strict or relaxed we can be, I do not know. At some point I tried to find a spec on the exact meaning of the conditions of the different resources, but I did not find one.
What your condition can btw lead to is the following:
- You have a revision with two replicas, one is not healthy, the other one is.
- The code in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L86 checks the first pod only. If that one is not healthy, it will set ContainerHealthy=False.
- I think the code added here runs later. It will notice that there are two replicas and one of them is ready, and set ContainerHealthy to True.
This would not happen with my proposed code, because if one pod is not ready, not all replicas of the deployment are ready. Anyway, checking only the first pod to decide whether ContainerHealthy is set to False is another questionable piece of code, because with mixed pod statuses it leads to different results depending on the order in which the pods are returned.
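To illustrate the ordering concern, here is a minimal sketch (not the current implementation) of deciding crash-loop health from every pod instead of only the first; the function name is hypothetical and corev1 is assumed to be k8s.io/api/core/v1.

// Sketch only: scan all pods of the deployment instead of pods[0], so the
// verdict does not depend on the order in which pods are listed.
func anyContainerCrashLooping(pods []corev1.Pod) bool {
	for _, pod := range pods {
		for _, cs := range pod.Status.ContainerStatuses {
			// A crash-looping container surfaces as a waiting state with
			// reason CrashLoopBackOff.
			if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
				return true
			}
		}
	}
	return false
}

A caller could then set ContainerHealthy to False only when this returns true, which would give the same result regardless of pod ordering.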
I agree in general. Btw, I am also trying to figure out the semantics. Maybe we should require at least one ready pod to reset it to True (assuming we list all pods and we currently have ContainerHealthy=False). 🤔
Anyway, checking only the first pod to decide whether ContainerHealthy is set to False is another questionable piece of code, because with mixed pod statuses it leads to different results depending on the order in which the pods are returned.
This should only trigger when all pods fail, probably for the same reason; that is my understanding of the intention, hence the check for AvailableReplicas == 0.
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
@dprotaso any background info to add for the limitations here?
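For illustration only, the relaxed reset discussed above might look roughly like this; MarkContainerHealthyTrue is assumed here to be the existing Revision lifecycle helper, and this is a sketch rather than anything in the PR.

// Illustrative sketch of the relaxed variant (not part of the PR): reset
// ContainerHealthy as soon as at least one replica is available again,
// mirroring the AvailableReplicas == 0 check used to set it to False.
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas > 0 {
	rev.Status.MarkContainerHealthyTrue() // assumed lifecycle helper
}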
Fixes #15487
Proposed Changes
This changes the Revision reconciler to add a code path that sets the ContainerHealthy condition from False back to True, since the old code path is no longer active (see the linked issue). The criterion chosen is whether the deployment has replicas and all of them are ready.
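A rough sketch of what that code path looks like in reconcileDeployment, under the assumption that MarkContainerHealthyTrue is the existing Revision lifecycle helper (a sketch, not the verbatim diff):

// Sketch: once the deployment has replicas and every one of them is ready,
// flip ContainerHealthy back to True.
if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {
	rev.Status.MarkContainerHealthyTrue() // assumed helper name
}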
Release Note