Liveness and Readiness Probes Consistently Failing #824
Comments
@throwanexception I'd recommend using our latest version, where we've simplified the deployment to use a single webhook, to verify you aren't experiencing the crash.
Took about a week of constant usage with the 0.8.0 release on our clusters and we're seeing a similar issue to the one I reported for v0.7.0. The policy-controller begins to time out its readiness / liveness probes and the pod is restarted. We also see the same exceptions around the image digest when this occurs. From what I can observe, the memory usage is growing unbounded (possibly a leak?).
This error is expected whenever the image cannot be parsed to a digest. Regarding the growing memory usage, I'd watch the logs to identify what is going on in the controller. We're using the policy-controller in our cluster and we haven't experienced this memory growth.
This is weird. I'd try changing the values of the liveness / readiness probes to see whether the failures are related to the growing memory or CPU consumption. Did you see this memory growth with v0.7.0 too?
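For reference, the kind of change being suggested here is loosening the probe timings on the webhook Deployment so a briefly overloaded webhook isn't restarted immediately. A minimal sketch using standard Kubernetes probe fields; the container name, probe paths, and port below are assumptions and may differ in your install:

```yaml
# Hypothetical patch to the policy-controller webhook Deployment.
# Probe fields are standard Kubernetes; names/paths/ports are assumed.
spec:
  template:
    spec:
      containers:
        - name: policy-controller-webhook   # assumed container name
          readinessProbe:
            httpGet:
              path: /readyz                  # assumed probe path
              port: 8443                     # assumed webhook port
              scheme: HTTPS
            periodSeconds: 10
            timeoutSeconds: 5                # more headroom than the 1s default
            failureThreshold: 6
          livenessProbe:
            httpGet:
              path: /healthz                 # assumed probe path
              port: 8443
              scheme: HTTPS
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 6
```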
Fwiw, I'm also seeing a memory leak on v0.9.0 and v0.8.4.
Yesterday, I discovered someone had set the
We have also seen this issue on our busy clusters, with memory usage on the policy-controller pods slowly creeping up over the course of a week. We are specifying the images by both digest and tag in our deployment. When the pods reach their memory limit, we see the error:
We updated to Chart version 0.7.2 and App version 0.9.0 and still see the issue. @hectorj2f, is there any work we can do to try and locate the source of this issue?
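As a stopgap while the leak is investigated, one option is to set an explicit memory limit on the controller container so the kubelet restarts it at a predictable threshold instead of letting usage creep until probes fail. A minimal sketch; the numbers below are placeholders, not chart defaults:

```yaml
# Hypothetical resource settings for the policy-controller container.
# Pick requests/limits based on the usage you actually observe.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 512Mi   # container is OOM-killed and restarted once usage reaches this
```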
Description
We're testing out the policy-controller, and the Readiness and Liveness probes for the `cosign-policy-controller-webhook` begin to fail after an extended amount of time (~18-24 hours). Up until then the deployment appears to work correctly. After this, the Deployment continually crashes every few minutes.
We've also noticed that we'll get errors about the image digest:
`admission webhook "policy.sigstore.dev" denied the request: validation failed: invalid value: (pods) must be an image digest: spec.template.spec.containers[0].image`
Upon retry, it will (usually) resolve the image to a digest correctly.
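For context, the error above means the webhook could not resolve the container's tag reference to a digest before validating; when resolution succeeds, the pod spec ends up referencing the image by digest instead. A sketch of the two forms (the registry, repository, tag, and digest below are made-up placeholders):

```yaml
# Hypothetical ECR image reference. The webhook has to resolve the tag form:
containers:
  - name: my-app    # placeholder container
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3
# ...and on success the spec is validated against the digest form, e.g.:
#   image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app@sha256:4b825dc6...
```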
Our setup uses IRSA to attach the WebIdentityToken to the pod. This is natively supported by `go-containerregistry`, so it seems to work correctly here, but I'm unsure whether it might be related. The images we're pulling are from ECR, so the IRSA WebIdentityToken provides the permissions to access them. The image policy we're using is a single ECDSA P-256 public key to verify our images, so it seems unlikely to be related.
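For reference, a policy of that shape looks roughly like the sketch below; the policy name, glob pattern, and key material are placeholders, and the apiVersion may differ by release:

```yaml
# Hypothetical ClusterImagePolicy verifying ECR images with a single
# ECDSA public key. All names, globs, and the key are placeholders.
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: ecr-images-signed
spec:
  images:
    - glob: "123456789012.dkr.ecr.us-east-1.amazonaws.com/**"
  authorities:
    - key:
        # Placeholder ECDSA P-256 public key; substitute your own.
        data: |
          -----BEGIN PUBLIC KEY-----
          ...base64-encoded ECDSA P-256 public key elided...
          -----END PUBLIC KEY-----
```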
Our clusters are quite active, especially with the constant synthetic health checking we have running, so images are being pulled frequently for end-to-end testing. I enabled knative debug logging by changing the ConfigMaps for the services, but the debug output has not been helpful so far.
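For anyone trying to reproduce this, the knative-style debug logging referred to here is typically turned on via the logging ConfigMap. A sketch, assuming the usual knative `config-logging` ConfigMap in the controller's namespace; the ConfigMap name and namespace below are assumptions and may differ in your install:

```yaml
# Hypothetical edit raising the log level to debug via the knative-style
# logging ConfigMap. Name and namespace are assumed, not confirmed.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: cosign-system
data:
  zap-logger-config: |
    {
      "level": "debug",
      "encoding": "json",
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"],
      "encoderConfig": {
        "messageKey": "msg",
        "levelKey": "level",
        "levelEncoder": "lowercase"
      }
    }
  loglevel.controller: "debug"
  loglevel.webhook: "debug"
```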
Any guidance or help would be appreciated!
Version
v0.7.0 of the policy-controller